CEMLB-YOLO: Efficient Detection Model of Maize Leaf Blight in Complex Field Environments



Introduction
Corn is one of the world's major cereal crops, second only to wheat and rice in terms of cultivation area, and it serves a vital role as an essential feed and industrial raw material [1]. Northern maize leaf blight (NLB), caused by the phytopathogenic fungus Setosphaeria turcica, occurs frequently in northern China and greatly restricts photosynthesis and the transport of nutrients in the maize leaves, seriously affecting the yield and quality of the maize. As a result, the most critical task for maize producers is to detect whether maize is contaminated with NLB in a timely and accurate manner, thereby preventing the spread of the disease and the resulting decrease in maize production.
Currently, the primary method for detecting NLB is still visual identification. However, inexperienced growers find it difficult to distinguish similar diseases with the naked eye, which leads to inappropriate pesticide applications that affect maize yield and quality. Relying on plant pathologists to identify disease types on site is time-consuming, inefficient and prone to subjective errors, and it significantly increases labour costs, especially in large fields. Many researchers have therefore turned to machine vision and image-processing techniques to overcome the limitations of manual detection [2,3]. These studies are often based on analysing the colour, texture and spatial structure of the image, using edge operators, threshold segmentation, clustering and other methods [4][5][6][7]. However, such methods struggle with images that have complex natural backgrounds; their poor adaptability and weak resistance to interference seriously limit their practical application [8].
Compared to the detection of other crop diseases, the visual detection of maize leaf blight is particularly challenging: the lesions on the leaves are small in the early stages of the disease, and interfering factors such as the growth chain, lighting, climatic conditions and shading are present. This requires that the model be able to detect small targets accurately and to understand complex scenes. The main contributions of this paper are as follows:
1. We introduce a crucial information position attention mechanism (CIPAM) into our model to enhance the representation of critical information in the feature map, reducing information loss during the down-sampling process.
2. To aggregate global context information more effectively and at lower cost, MobileBit is added to the feature extraction network to improve the model's ability to understand complex scenarios.
3. To exploit the semantic information in the deep feature map, FRAFM is incorporated into the model to reorganize and up-sample the semantic information of the deep feature map while adaptively adjusting the proportion of cross-scale feature map information for efficient feature aggregation.

Object Detection Algorithm Based on CNN
In 2012, Krizhevsky et al. proposed AlexNet [9], an image classification system based on a deep convolutional neural network. AlexNet achieved remarkable results in the ImageNet image classification competition, drawing significant attention to CNNs. Object detection algorithms based on deep neural networks have advanced rapidly since then.
There are two types of CNN-based object detection algorithms: two-stage detection based on candidate regions and one-stage detection based on regression. The R-CNN series (R-CNN [10], Fast R-CNN [11], Faster R-CNN [12]) is a representative two-stage family that achieves better detection accuracy but is far from real-time in terms of speed. Single-stage algorithms, represented by the SSD series [13][14][15] and the YOLO series [16][17][18], have comparatively fewer parameters and superior real-time performance but inferior detection accuracy.

Plant Disease Detection Based on Convolutional Networks
As an effective feature extraction tool, convolutional networks have a broad range of applications in crop disease detection. Liao [19] used a graph attention mechanism to combine the preliminary feature information obtained from manually extracted texture and colour features with the high-level semantic information extracted by ResNeXt, achieving strawberry disease classification. Xie [20] introduced the Inception module and SE module to modify the backbone network of Faster R-CNN and designed a bidirectional region candidate structure to locate grape disease lesions. Liu [21] proposed a lightweight model for the real-time detection of tomato leaf diseases by combining YOLOv3 with MobileNetV2; the proposed model accurately detects various types of tomato leaf diseases while maintaining a fast processing speed. Zhao [22] introduced the CBAM attention mechanism and adopted a pyramid structure to construct a multi-scale Faster R-CNN for detecting common strawberry diseases in natural environments; the multi-scale structure enables the network to detect both small and large strawberry lesions effectively. Lv [23] developed the DMS-Robust AlexNet model by incorporating dilated convolution and multi-scale convolution into the AlexNet architecture, enhancing the model's feature extraction capabilities and showing strong robustness when detecting maize disease in the natural environment. Afzaal [24] constructed a Mask R-CNN architecture for detecting strawberry diseases using ResNet-101 as the model backbone, providing a foundation for future research in this field. Albattah [25] proposed an improved CenterNet algorithm to identify diseased and healthy tomato leaves, using the Plant Village Kaggle database as the main data source and DenseNet-77 as the base network for deep-level key point extraction.

CEMLB-YOLO Network Model
YOLO is an end-to-end object detection algorithm proposed by Joseph Redmon, which divides the image into an S × S grid and predicts bounding boxes and class probabilities for each grid cell. Compared to other object detection models, the YOLO series better meets the varied requirements of industrial applications, which has brought it widespread attention. YOLOv5 [26], the most widely used version of the YOLO series, offers fast detection and strong generalization ability. In recent years, scholars have also proposed YOLOv7 [27], YOLOv8 [28] and other strong detection models in the YOLO series. YOLOv5, YOLOv7 and YOLOv8 come in different variants, such as YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5x, YOLOv7, YOLOv7x, YOLOv8n, YOLOv8s, YOLOv8m and YOLOv8l, obtained by adjusting the width and depth multipliers of the network model.
However, as a single-stage algorithm, YOLO still has potential for improvement, and many scholars have proposed improvements based on the YOLO series of detection algorithms. Souza [29] and Stefenon [30] proposed a hybrid YOLO architecture, which first detects defective insulators in transmission lines with the YOLO detection algorithm, then crops the defective insulators out of the image and passes them to an additional convolutional network for secondary classification, achieving higher accuracy than YOLO alone. Yao [31] proposed an adaptive feature fusion pyramid that better achieves cross-scale feature fusion and added multi-branch dilated convolution, which improves the model's long-range perception ability. Xu [32] introduced the coordinate attention mechanism into YOLOv5 to increase the model's ability to detect small targets and replaced the model's loss function to improve its feature extraction ability.

Architecture of CEMLB-YOLO
In this study, we aim to build a lightweight model for identifying maize leaf blight on the basis of YOLOv5. To reduce model complexity and increase detection speed, we employ MobileNetV3 [33] as the backbone network for feature extraction. FRAFM is based on the idea of CARAFE [34]; it performs up-sampling by extracting the potential semantic information of a high-level feature map and reassembling the feature information. It also adaptively adjusts the proportion of information at different scales in the feature map, achieving more effective cross-scale feature fusion. MobileBit combines the inductive bias of CNNs with the long-range perception of Vision Transformers (ViT) [35], enabling the model to balance the processing of local detail with long-range information modelling. CIPAM first uses self-attention to fuse feature information from multiple channels, enhancing the representation of key features. It then uses one-dimensional pooling layers in two directions to encode the spatial position of key features, improving the model's ability to perceive their spatial locations. The overall architecture of the model is shown in Figure 1.

Mobile Bi-Level Vision Transformer
In complex and varied maize planting areas, it is essential for the model to capture global feature information to comprehend the whole scene. For example, ViT divides the image into patches and computes the affinity between each patch and all other patches, enabling the model to capture long-range dependencies effectively. However, this leads to higher model complexity and a heavy memory footprint, which is not conducive to deployment on edge devices. In addition, ViT requires more training data and a longer training time because it lacks the inductive bias of convolution [36].
To solve the above problems, we propose a lightweight hybrid architecture that combines convolution and the transformer, which can model local and global information simultaneously and is easier to deploy on edge devices. The overall architecture of MobileBit is shown in Figure 2.
MobileBit is divided into three sections: the Convolution section, the Transformer section and the Fusion section. In the Convolution section, we first use depth-wise separable convolution to encode the spatial information in the image and model the local features, and then adjust the channel dimension of the feature map with a 1 × 1 convolution to reduce the computation required by the transformer. In the Transformer section, we use a bi-level transformer [37] to segment the image into several non-overlapping regions. For each region, only the k most relevant regions are preserved for the execution of the self-attention mechanism. This selective approach not only enables the model to capture long-range correlations among non-overlapping regions but also significantly reduces the model's complexity. In the Fusion section, the local modelling information is concatenated with the global modelling information and then fused by a 1 × 1 convolution.

Bi-Level Transformer
The bi-level transformer first constructs a coarse-grained query-key affinity graph and performs pruning at the coarse-grained region level instead of directly at the fine-grained token level, retaining the most critical part for token-to-token attention, as shown in Figure 3.
Figure 3. The overall architecture of the bi-level transformer, which uses a coarse-grained relationship graph to filter the most relevant k candidate patches for each patch; fine-grained token-to-token attention is then applied to the candidate patches.
The bi-level transformer divides the input feature map $X \in \mathbb{R}^{H \times W \times C}$ into $S \times S$ non-overlapping regions and reshapes $X$ to $X^r \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}$, then derives the query, key and value tensors with linear projections:
\[ Q = X^r W^q, \quad K = X^r W^k, \quad V = X^r W^v \tag{1} \]
where $W^q, W^k, W^v \in \mathbb{R}^{C \times C}$ are the projection weights for the query, key and value, respectively. Then, the bi-level transformer calculates the mean of $Q$ and $K$ over each region to obtain the region-level $Q^r, K^r \in \mathbb{R}^{S^2 \times C}$ and multiplies $Q^r$ by the transpose of $K^r$ to derive the region-to-region affinity adjacency matrix $A^r \in \mathbb{R}^{S^2 \times S^2}$:
\[ A^r = Q^r (K^r)^{T} \]
$A^r$ indicates the degree of semantic association between two regions. Next, only the $k$ most strongly associated regions are retained for each region, pruning $A^r$ to obtain the region-of-interest index matrix $I^r \in \mathbb{R}^{S^2 \times k}$:
\[ I^r = \mathrm{topkIndex}(A^r) \]
The $i$-th row of $I^r$ indicates the $k$ regions that are most relevant to the $i$-th region. Finally, the index matrix $I^r$ is used to gather the key-value pairs of the $k$ most relevant regions associated with the $i$-th region, and self-attention is applied to the gathered keys and values:
\[ K^g = \mathrm{gather}(K, I^r), \quad V^g = \mathrm{gather}(V, I^r) \]
\[ \mathrm{Attention}(Q, K^g, V^g) = \mathrm{softmax}\left(\frac{Q (K^g)^{T}}{\sqrt{d_k}}\right) V^g \]
where $K^g, V^g$ are the gathered key-value tensors used for token-to-token attention in each region, and the scaling factor $\sqrt{d_k}$ is used to avoid concentrated weights and gradient vanishing.
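As an illustrative sketch of the routing idea described above (not the authors' implementation), the following pure-Python snippet computes region-level means, the region-to-region affinity, the top-k routing indices and scaled dot-product attention over the gathered tokens. The function names, the nested-list data layout and the use of the token dimension as the scaling factor are assumptions made for illustration; the linear projections are assumed to have been applied already.

```python
import math
from typing import List

Vector = List[float]
Region = List[Vector]  # one region = list of token vectors


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def region_mean(tokens: Region) -> Vector:
    """Average the token vectors of one region (region-level query or key)."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def route_top_k(q_regions: List[Region], k_regions: List[Region], top_k: int) -> List[List[int]]:
    """For every region, return the indices of its top_k most related regions."""
    q_r = [region_mean(r) for r in q_regions]
    k_r = [region_mean(r) for r in k_regions]
    routing = []
    for q_vec in q_r:
        affinity = [dot(q_vec, k_vec) for k_vec in k_r]  # one row of the affinity matrix
        order = sorted(range(len(affinity)), key=lambda j: affinity[j], reverse=True)
        routing.append(order[:top_k])
    return routing


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query token over the gathered tokens."""
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * v[c] for w, v in zip(weights, values)) for c in range(dim)]


def bi_level_attention(q_regions, k_regions, v_regions, top_k):
    """Token-to-token attention restricted to the routed regions."""
    routing = route_top_k(q_regions, k_regions, top_k)
    output = []
    for i, region in enumerate(q_regions):
        keys = [tok for j in routing[i] for tok in k_regions[j]]
        values = [tok for j in routing[i] for tok in v_regions[j]]
        output.append([attend(tok, keys, values) for tok in region])
    return output
```

Because each query token only attends to the tokens of its k routed regions, the number of key-value pairs per query scales with k times the region size rather than with the full token count.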

Feature Restructuring and Fusion Module
Multi-scale feature fusion can improve the detection ability of the model, but deep feature maps are often up-sampled by interpolation methods with a small receptive field, which does not make full use of the semantic information in the deep feature maps. In addition, different feature maps are usually fused at a fixed 1:1 ratio, so the proportion of information contributed by each feature map cannot be adjusted.
To achieve more effective cross-scale fusion, we propose FRAFM, as shown in Figure 4. FRAFM first employs CARAFE to up-sample the deep feature map to preserve the intricate details embedded in the deep features. We use a Spatial Attention Mechanism (SAM) for shallow feature maps and a Channel Attention Mechanism (CAM) for deep feature maps to better highlight important information in feature maps at different scales. In addition, we concatenate the shallow and deep feature maps and pass them through a 3 × 3 convolutional layer, a Batch Normalization (BN) layer and a Sigmoid activation function to generate learnable weights, which adjust the ratio of information contributed by feature maps of different scales during fusion. The following subsections describe CARAFE, CAM and SAM in detail.
Figure 4. The overall architecture of FRAFM. $F_{low}$ and $F'_{low}$ represent low-level feature maps, and $F_{high}$, $F'_{high}$ and $F''_{high}$ represent high-level feature maps, where $\alpha, 1 - \alpha \in \mathbb{R}^{1 \times H_1 \times W_1}$.
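To make the adaptive fusion ratio concrete, the following is a minimal sketch under stated assumptions rather than the FRAFM implementation. It blends one shallow and one deep single-channel map with a per-pixel weight α obtained from a Sigmoid. Here the gate logits are passed in directly, whereas in FRAFM they are produced by the 3 × 3 convolution and BN layer; assigning α to the shallow map and 1 − α to the deep map is also an assumption, as is the function name `gated_fusion`.

```python
import math
from typing import List

Grid = List[List[float]]  # one H x W single-channel feature map


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(f_low: Grid, f_high: Grid, gate_logits: Grid) -> Grid:
    """Blend two same-sized maps with a per-pixel weight alpha = sigmoid(logit)."""
    fused = []
    for row_low, row_high, row_gate in zip(f_low, f_high, gate_logits):
        fused_row = []
        for low, high, logit in zip(row_low, row_high, row_gate):
            alpha = sigmoid(logit)  # alpha lies in (0, 1)
            # alpha weights the shallow map, 1 - alpha the deep map (assumed convention).
            fused_row.append(alpha * low + (1.0 - alpha) * high)
        fused.append(fused_row)
    return fused
```

With a logit of zero the two maps contribute equally, which recovers the fixed 1:1 ratio as a special case; other logits shift the balance position by position.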

CARAFE
CARAFE consists of a kernel prediction module and a content-aware reassembly module, as shown in Figure 5. For an input feature map $\chi \in \mathbb{R}^{C \times H \times W}$ and an up-sampling ratio $\sigma$, the kernel prediction module $\psi$ uses a $k_{encoder} \times k_{encoder}$ convolution to generate, in a content-aware manner, a reassembly kernel $W_{l'}$ for each position $l' = (i', j')$ of the target feature map $\chi' \in \mathbb{R}^{C \times \sigma H \times \sigma W}$; the Softmax function then normalizes each kernel so that its weights sum to 1. Each target position $l' = (i', j')$ corresponds to a source position $l = (i, j)$ in $\chi$, and the content-aware reassembly module performs a dot product between $W_{l'}$ and the $k_{up} \times k_{up}$ square region $N(\chi_l, k_{up})$ centred at $l$. CARAFE is expressed in Equations (7) and (8), where $r = \lfloor k_{up}/2 \rfloor$:
\[ W_{l'} = \psi\big(N(\chi_l, k_{encoder})\big) \tag{7} \]
\[ \chi'_{l'} = \sum_{n=-r}^{r} \sum_{m=-r}^{r} W_{l'}(n, m) \cdot \chi_{(i+n,\, j+m)} \tag{8} \]
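The reassembly step can be sketched in a few lines of pure Python. This is an illustrative single-channel version under stated assumptions, not the CARAFE implementation: the kernel logits are supplied as an input instead of being predicted by a convolutional encoder, positions outside the map are treated as zeros, and the function name `carafe_reassemble` is hypothetical.

```python
import math
from typing import List

Grid = List[List[float]]


def softmax(values: List[float]) -> List[float]:
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def carafe_reassemble(x: Grid, kernel_logits: List[List[List[float]]],
                      sigma: int, k_up: int) -> Grid:
    """Up-sample an H x W map by sigma using one k_up x k_up kernel per output pixel.

    kernel_logits[i_out][j_out] holds k_up * k_up unnormalized kernel weights.
    """
    if k_up % 2 == 0:
        raise ValueError("k_up must be odd so the window has a centre")
    h, w = len(x), len(x[0])
    r = k_up // 2
    out = []
    for i_out in range(sigma * h):
        row = []
        for j_out in range(sigma * w):
            i, j = i_out // sigma, j_out // sigma  # source position l for target l'
            weights = softmax(kernel_logits[i_out][j_out])  # kernel sums to 1
            acc = 0.0
            idx = 0
            for n in range(-r, r + 1):
                for m in range(-r, r + 1):
                    ii, jj = i + n, j + m
                    if 0 <= ii < h and 0 <= jj < w:  # zero padding outside the map
                        acc += weights[idx] * x[ii][jj]
                    idx += 1
            row.append(acc)
        out.append(row)
    return out
```

When every kernel puts almost all of its weight on the centre position, the result approaches nearest-neighbour up-sampling, which shows how a content-dependent kernel generalizes fixed interpolation.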

CAM and SAM
The CAM and SAM follow the attention mechanisms in CBAM [38], as shown in Figure 6. For the deep feature map, we use CAM to obtain key information and ignore redundant information. Specifically, CAM uses pooling layers to compress the spatial dimension and obtain two features of dimension $C \times 1 \times 1$; a multi-layer perceptron (MLP) then determines the weight of each channel, and, finally, the weights are multiplied with the deep feature map. For a feature map $F$, the channel weights take the CBAM form:
\[ M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) \]
Spatial attention maintains the spatial dimension and compresses the channel dimension. For the shallow feature map $F_{low}$, we use SAM to locate the target; the spatial weights take the CBAM form:
\[ M_s(F) = \sigma\big(\mathrm{Conv}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big) \]
where $\sigma$ denotes the Sigmoid function, the pooling in $M_s$ is taken along the channel axis, and each set of weights is multiplied element-wise with the feature map it was computed from.

Crucial Information Position Attention Mechanism
Maize leaf blight lesions are small and densely distributed, so some critical information is lost or blurred during down-sampling in the feature extraction network, impairing the model's detection capability.
To address this issue, we propose a Crucial Information Position Attention Mechanism (CIPAM) that helps the model focus on regions containing important details and highlight the most informative regions. The model can retain and utilise the most essential parts of the image even during the down-sampling process, reducing the potential loss of critical information. The structure is shown in Figure 7.
For the input feature map $X \in \mathbb{R}^{C \times H \times W}$, CIPAM first constructs interdependencies between the channels of the feature map using a self-attention approach. Specifically, the feature map of the $i$-th channel is reweighted and fused with the feature maps of the other channels according to their correlation coefficients, where $Weight_{ij}$ denotes the correlation of the $i$-th channel with the $j$-th channel. This enables information interaction between different channels and enhances the representation of crucial information.
More importantly, in the complex and ever-changing natural environment, accurately pinpointing the location of plant diseases is crucial for enhancing the model's performance. Specifically, two one-dimensional pooling layers are applied to the feature map along the horizontal and vertical directions, respectively, to obtain $Z_h$ and $Z_w$, as expressed in Equations (13) and (14). $Z_h$ and $Z_w$ capture long-range dependencies while retaining the precise position of crucial information. The two directional feature maps are concatenated and passed through a 1 × 1 convolution layer to obtain the feature map $F$, representing the interaction between the height and width directions. After batch normalization and a non-linear activation function, the feature map is split into two directional feature maps $f_w$ and $f_h$. The Sigmoid function is then applied to obtain the weights $g_w$ and $g_h$ of the feature map in the width and height directions. Finally, the weights are multiplied with the feature map to produce $X_{out}$. The process is expressed in Equations (15)-(18).
The resulting feature map $X_{out}$ significantly enhances the representation of crucial information in the feature map and accurately captures the location of such information. Our experiments demonstrate that the proposed strategy focuses better than previous attention techniques on the location of disease occurrence in complex field environments.
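The directional pooling and re-weighting steps can be illustrated with a small pure-Python sketch. It is a sketch under stated assumptions, not the CIPAM implementation: average pooling is assumed for the two one-dimensional pooling layers, a single channel is processed, the weights are passed in directly instead of being produced by the convolution, BN and Sigmoid stages, and both function names are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]  # one H x W channel


def directional_pool(x: Grid) -> Tuple[List[float], List[float]]:
    """Pool one channel along each axis (average pooling assumed).

    Returns (z_h, z_w): z_h[i] averages row i over the width,
    z_w[j] averages column j over the height.
    """
    h, w = len(x), len(x[0])
    z_h = [sum(row) / w for row in x]
    z_w = [sum(x[i][j] for i in range(h)) / h for j in range(w)]
    return z_h, z_w


def apply_position_weights(x: Grid, g_h: List[float], g_w: List[float]) -> Grid:
    """Scale element (i, j) by the height weight g_h[i] and the width weight g_w[j]."""
    return [[x[i][j] * g_h[i] * g_w[j] for j in range(len(x[0]))]
            for i in range(len(x))]
```

Each pooled vector summarizes the whole map along one direction while keeping the index along the other, which is why the pair can encode where a response occurs as well as how strong it is.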

The Loss Function of CEMLB-YOLO
The regression loss in YOLOv5 adjusts the position of the predicted bounding box by calculating the intersection over union (IOU) between the ground truth box and the predicted box, as shown in Equation (19). With continued research on loss functions, GIOU [39], DIOU [40] and CIOU [41] have been proposed. CIOU regression loss converges faster than the alternative regression losses, so this paper adopts CIOU as the model's regression loss, as expressed in Equation (20):
\[ IOU = \frac{|A \cap B|}{|A \cup B|} \tag{19} \]
\[ L_{CIOU} = 1 - IOU + \frac{\rho^2(b, b^{gt})}{d^2} + \alpha v \tag{20} \]
where $A$ and $B$ denote the areas of the ground truth bounding box and the predicted box, respectively, $\rho(\cdot)$ indicates the Euclidean distance between the centroids $b$ and $b^{gt}$ of the predicted and true boxes and $d$ represents the diagonal length of the smallest enclosing region that covers both boxes. $\alpha$ is the trade-off parameter, and $v$ describes the similarity of the shapes of the ground truth and predicted boxes. $\alpha$ and $v$ can be expressed mathematically as follows:
\[ \alpha = \frac{v}{(1 - IOU) + v} \tag{21} \]
\[ v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \tag{22} \]
where $w, h$ and $w^{gt}, h^{gt}$ are the widths and heights of the predicted and ground truth boxes. Generally, the distance between the centre of the predicted box and that of the true box increases as the size of an object increases. The larger the object, the more significant its contribution to the loss function, which can reduce the model's ability to detect smaller lesions and result in false negatives. Therefore, to balance this difference, this paper takes the square root of the numerator of the distance term when calculating CIOU. The enhanced expression of the CIOU function is as follows:
\[ L'_{CIOU} = 1 - IOU + \frac{\rho(b, b^{gt})}{d^2} + \alpha v \tag{23} \]
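For readers who want to check the loss numerically, the following pure-Python sketch computes the standard CIOU loss and, optionally, a square-root variant. It is not the authors' code: the `(x1, y1, x2, y2)` box format and the function names are assumptions, and the `sqrt_distance` option reflects one reading of "takes the square root of the numerator", namely that the squared centre distance is replaced by the distance itself. Boxes are assumed to have strictly positive width and height.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ciou_loss(pred: Box, gt: Box, sqrt_distance: bool = False) -> float:
    """CIOU loss; sqrt_distance=True uses rho instead of rho^2 (assumed variant)."""
    overlap = iou(pred, gt)
    # Squared distance between the two box centres.
    rho2 = ((pred[0] + pred[2]) / 2 - (gt[0] + gt[2]) / 2) ** 2 \
        + ((pred[1] + pred[3]) / 2 - (gt[1] + gt[3]) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    d2 = (max(pred[2], gt[2]) - min(pred[0], gt[0])) ** 2 \
        + (max(pred[3], gt[3]) - min(pred[1], gt[1])) ** 2
    # Aspect-ratio consistency term v and its trade-off weight alpha.
    w_p, h_p = pred[2] - pred[0], pred[3] - pred[1]
    w_g, h_g = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / math.pi ** 2) * (math.atan(w_g / h_g) - math.atan(w_p / h_p)) ** 2
    alpha = v / ((1 - overlap) + v) if v > 0 else 0.0
    numerator = math.sqrt(rho2) if sqrt_distance else rho2
    return 1 - overlap + numerator / d2 + alpha * v
```

Note that with the square-root variant the distance term is no longer scale-invariant, so its magnitude depends on the units of the box coordinates.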

Data
The NLB dataset [42] was created for detecting maize leaf blight and is the largest dataset of its kind, with each image annotated by one of two human experts. The dataset is divided into three parts: the first was taken with a handheld camera, the second by mounting the camera on a 5 m long boom and the third with a camera on board a DJI Matrice 600 sUAS flying at an altitude of 6 m and a speed of 1 m/s, capturing an image every two seconds. The handheld subset, which includes 1019 images with different angles and backgrounds and 7669 annotations, is used in this study because it offers clear, training-friendly images (Figure 8).
We apply data augmentation techniques, such as overexposure, haze, rain, random rotation and random cropping, to the original dataset to mitigate the effects of a small dataset on model training. These techniques are randomly combined to generate a total of 9070 images. The dataset is split into training and validation sets at a 7:3 ratio.

Experimental Configuration
The experiments were conducted under Windows 10 with the PyTorch deep learning framework, CUDA 11.1, an NVIDIA GeForce RTX 3060 graphics card with 12 GB of video memory and a 12-core Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50 GHz. The optimizer was SGD [43], with an initial learning rate of 0.01, a momentum of 0.937 and a weight decay of 0.0005; the model was trained for 300 epochs with a batch size of 36. During training, the learning rate was adjusted according to the cosine annealing strategy.
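As a rough illustration of the learning-rate schedule, the sketch below evaluates the textbook cosine annealing formula with the initial learning rate and epoch count reported above. The minimum learning rate of zero and the function name are assumptions; the exact schedule used in the experiments (for example, any warm-up or final learning-rate factor) is not specified here.

```python
import math


def cosine_annealing_lr(epoch: int, total_epochs: int = 300,
                        lr_init: float = 0.01, lr_min: float = 0.0) -> float:
    """Learning rate at a given epoch under plain cosine annealing.

    lr_min = 0.0 is an assumed floor, not a value taken from the experiments.
    """
    if not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + cosine)
```

The rate starts at the initial value, falls slowly at first, fastest around the midpoint and slowly again near the end of training.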

Model Evaluation Indicators
This study used average precision (AP) to evaluate the detection model's performance. AP combines Precision (P) and Recall (R) to evaluate the model's performance in detecting a particular class. Mean average precision (mAP) is the average of the AP over multiple categories. mAP@0.5 is the average of the AP calculated for all categories when the IOU threshold between the predicted and true boxes is set to 0.5. In this paper, mAP@0.5 is used as the evaluation criterion for the model. The expressions that calculate P, R, AP and mAP are given from Equation (24) onward.
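The metrics can be sketched with the standard definitions, P = TP / (TP + FP), R = TP / (TP + FN), AP as the area under the precision-recall curve and mAP as the mean of the per-class AP values. The snippet below is an illustrative pure-Python version, not the evaluation code used in the experiments: detections are assumed to be already matched to ground truth at the chosen IOU threshold, AP is computed with an all-point precision envelope, and the function names are hypothetical.

```python
from typing import List, Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall


def average_precision(detections: List[Tuple[float, bool]], num_gt: int) -> float:
    """Area under the precision-recall curve for one class.

    detections: (confidence, is_true_positive) pairs, already matched to
    ground truth at the chosen IOU threshold (e.g. 0.5).
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    points = []  # (recall, precision) after each detection
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_gt, tp / (tp + fp)))
    ap = 0.0
    prev_recall = 0.0
    for idx, (recall, _) in enumerate(points):
        # Precision envelope: best precision achievable at this recall or higher.
        best_precision = max(p for _, p in points[idx:])
        ap += (recall - prev_recall) * best_precision
        prev_recall = recall
    return ap


def mean_average_precision(ap_per_class: List[float]) -> float:
    """mAP is the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For a single-class task such as lesion detection, mAP@0.5 coincides with the AP of that class at an IOU threshold of 0.5.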

Analysis and Comparison of Experimental Results
We evaluate the proposed CEMLB-YOLO algorithm for detecting maize leaf blight in complex field environments by comparing it with other object detection algorithms. Table 1 compares the detection effectiveness of CEMLB-YOLO with other models. The original (baseline) model in this paper is YOLOv5 with its backbone network replaced by MobileNetV3. The comparative experiments show that the proposed model performs well in detecting maize leaf blight in complex field environments. Compared to YOLOv3-tiny and YOLOv7-tiny, which are themselves lightweight models, our model has fewer parameters and lower complexity yet higher accuracy. Compared to YOLOv3-tiny, the accuracy of the model is improved by 11.3%, and the parameter count and complexity are reduced by 4 million and 2.5 GFLOPs, respectively.
The accuracy of our model is 6.2% higher than that of YOLOv7-tiny, and the parameter count and complexity are reduced by 1.5 million and 3.7 GFLOPs, respectively. In the comparison experiments, we chose ResNet50 as the backbone network of Faster R-CNN and RetinaNet, which results in higher complexity and a larger number of parameters for these models. The parameter count and complexity of our model are about one-tenth of those of Faster R-CNN and RetinaNet, while its accuracy is lower by 4.9% and 4.1%, respectively. Compared with the more recent YOLOv8s and YOLOX, our model's complexity is reduced by nearly two-thirds, and its accuracy is only 1.9% lower than YOLOv8s and 3.5% lower than YOLOX. Compared to YOLOv8n, which has a similar number of parameters and similar complexity, the accuracy of our model is 8.8% higher, while the parameter count and complexity increase by only 1.5 million and 1.2 GFLOPs. We also compared our approach with those of other researchers: our model is 5.4% more accurate than Song's approach and 4.3% less accurate than Sun's approach.
In this paper, FPS is used to evaluate the impact of different methods on detection speed. Table 1 shows that the detection speed of our model is lower than that of YOLOv3-tiny, YOLOv7-tiny, YOLOv8n and YOLOv8s, which we attribute to the introduction of MobileBit. Our model is faster than YOLOX, RetinaNet, YOLOv7 and Faster R-CNN; in particular, it is roughly twice as fast as Faster R-CNN and RetinaNet. Its detection speed also compares favourably with that of other researchers' methods.
Overall, our model achieves a better balance between detection speed and detection accuracy, and our model is more suitable for detecting maize leaf blight in complex environments.

Ablation Experiment
In this section, we have conducted a series of experiments to verify the validity of our proposed method. The results of the ablation experiments are shown in Table 2, and Figure 9 shows the mAP@0.5 curves of the ablation experiment. As can be seen from Figure 9, the number of convergence iterations of MobileBit is reduced by 50 rounds compared to the original model, which can effectively shorten the training time of the model; in addition, through the experimental ablation curve, it is observed that CIPAM can also accelerate the training of the model, proving that the model pays more attention to the location of disease occurrence and ignores irrelevant information; the combination of CIPAM + MobileBit can effectively accelerate the training of the model, as can be seen in the figure. It converges to around 120 fewer iterations than the original model; by adding the FRAFM module to alleviate the aliasing effects caused by feature fusion, the training time of the model can be accelerated even further.
From the table, it can be seen that all three methods proposed in this paper improve the model's detection capability. Specifically, the FRAFM module gives the greatest improvement among the individual methods, a 4% increase in model accuracy, and the combination of FRAFM and MobileBit achieves the highest accuracy improvement among the two-method combinations, 4.6%. The MobileBit + CIPAM combination improves the average precision by 4.3%, but it also introduces the largest increase in the number of model parameters and GFLOPs among all the methods. Adding MobileBit or CIPAM increases GFLOPs more because of the self-attention computation they incorporate, but this increase is within acceptable limits considering the significant improvement in model performance they provide. FPS is again used to evaluate the impact of the different methods on detection speed, and the table shows that the improved model still achieves real-time detection when compared with the original detection method.
Figure 9. The mAP@0.5 curves of the ablation experiment.
Figure 10 shows the impact of the different improvement methods on the detection capability of the model. The third column represents the detection results of the original model, and the middle represents the detection results of the different improvement methods.
Figure 10a shows that the original model detects small-area diseases inaccurately; adding CIPAM improves the model's ability to locate small-area diseases. In Figure 10c, the original model misses more detections in complex scenes; adding MobileBit improves the detection ability of the model in complex scenes and reduces missed detections. In Figure 10d, the original model suffers from inaccurate localization and missed detections when the background is complex and the disease occurs in a small area, whereas the MobileBit + CIPAM method effectively detects the location of maize leaf blight against a complex background.

Visualization of Results
To further validate that our proposed CIPAM focuses more effectively on the location of leaf blight disease in complex environments than other attention mechanisms, we used the Grad-CAM [47] method to visualize which parts of the image the model attends to most when making a judgement. Some example images are presented in Figure 11. The figure shows that CIPAM focuses more effectively on the location of disease occurrence, even when the diseased area is small and the scene is complex, which supports the effectiveness of CIPAM compared with the other methods.
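For readers unfamiliar with Grad-CAM, its core computation can be sketched in a few lines: each channel of a chosen feature map is weighted by the spatial mean of the gradient of the target score with respect to that channel, the weighted maps are summed, and a ReLU is applied. The sketch below uses plain Python lists and made-up toy numbers. Which layer and which detection score were used in the paper is not stated, so both are left outside the sketch.

```python
from typing import List

Map2D = List[List[float]]


def grad_cam(activations: List[Map2D], gradients: List[Map2D]) -> Map2D:
    """Grad-CAM heat map from per-channel activations and gradients of equal shape."""
    h, w = len(activations[0]), len(activations[0][0])
    # Channel weight = global average of the gradient over spatial positions.
    weights = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    # Weighted sum of the activation maps, followed by ReLU.
    cam = [[max(0.0, sum(wk * a[i][j] for wk, a in zip(weights, activations)))
            for j in range(w)] for i in range(h)]
    peak = max(max(row) for row in cam)
    # Scale to [0, 1] for display; an all-zero map is returned unchanged.
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy example with 2 channels of size 2x2 (made-up numbers).
acts = [[[1.0, 0.0], [0.0, 2.0]],
        [[0.0, 3.0], [1.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]],        # spatial mean +1.0
         [[-1.0, -1.0], [-1.0, -1.0]]]    # spatial mean -1.0
heat = grad_cam(acts, grads)
print(heat)
```

In practice the resulting map is upsampled to the input resolution and overlaid on the image, which is how visualizations of this kind are usually produced.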
Figure 11. Visualization results of different methods. The comparison group consists of SE [48] and CBAM. CIPAM locates the disease more accurately than the other attention mechanisms, while SE and CBAM are sensitive to the approximate extent of the disease location.
During model training, we trained the model on the handheld portion of the dataset because its images are clearer, have more distinct disease features and are more suitable for model training. In addition, we tested the model on the other two parts of the dataset to verify the generalisation ability of our proposed model. The accuracy of our proposed model on the three parts of the dataset is shown in Table 3. From the table, we can see that, although the Faster R-CNN model achieves the highest accuracy on the handheld set, its detection accuracy on the boom set and the drone set decreases significantly, by 13.3% and 16.7%, respectively. The detection accuracy of the original model on these two parts of the dataset decreases by 13.5% and 10.8%. We attribute this to the large differences in shooting angle, background and illumination among the three datasets, which prevent the models from extracting the disease features well.
Although the accuracy of the model proposed in this paper is not as good as that of Faster R-CNN on the handheld part of the dataset, its accuracy on the boom set and the drone set decreases by only 3.2% and 3.9%, respectively, which suggests that our proposed model can focus on the location of disease occurrence more efficiently and extract the disease features effectively. It also suggests that our proposed model handles the effect of environmental factors on model performance more effectively and has a stronger generalisation ability. We selected some sample images to show the detection effect on the three datasets in Figure 12. Because of the strong illumination of the images taken by the UAV, the original model has a poor detection capability on the drone part of the dataset and misses detections. The original model also generates false detections in complex scenes in the boom part of the dataset. In contrast, CEMLB-YOLO detects NLB in the drone and boom portions of the dataset significantly better than the original model.

Conclusions
This paper proposes CEMLB-YOLO, a maize leaf blight detection algorithm based on YOLOv5, to address the challenge of balancing accuracy and detection speed when detecting maize leaf blight in complex scenarios. CIPAM enhances the feature representation of key information more effectively than other attention mechanisms, enabling the model to focus more precisely on the disease's location and ignore irrelevant information in complex environments. MobileBit combines convolution and transformer architectures so that the model can efficiently capture long-range dependencies while retaining the inductive bias of convolution, which greatly reduces training time and model complexity compared with standard vision transformers. FRAFM makes full use of the important information in feature maps of different scales and introduces learnable parameters to control the proportion of information contributed by deep and shallow feature maps during fusion, achieving more effective cross-scale fusion. The experiments demonstrate that the proposed method has fewer parameters and lower complexity than other models, making it more suitable for deployment on edge devices, where it can replace human experts for field identification. However, one limitation of our current study is the lack of evaluation of the model's robustness under specific weather conditions, such as rain, snow and fog. Our future research will focus on enhancing the model's ability to detect maize leaf blight under more complex weather conditions, and we remain committed to improving the precision and robustness of our model in the face of environmental variables.