One-Stage Disease Detection Method for Maize Leaf Based on Multi-Scale Feature Fusion

Plant diseases such as drought stress and pest diseases significantly impact crops’ growth and yield levels. By detecting the surface characteristics of plant leaves, we can judge the growth state of plants and whether diseases occur. Traditional manual detection methods are limited by the professional knowledge and practical experience of operators. In recent years, detection methods based on deep learning have been applied to improve detection accuracy and reduce detection time. In this paper, we propose a multi-scale feature fusion convolutional neural network (MFF-CNN) for maize leaf disease detection. Based on the one-stage detection network YoLov5s, a coordinate attention (CA) module is added to increase the weight of key features and enhance the effective information of the feature map, the spatial pyramid pooling (SPP) module is modified to reduce the loss of feature information, and data augmentation is used to enrich the training data. Three experiments are conducted under complex conditions such as overlapping occlusion, sparse distribution of detection targets, and similar textures and backgrounds of disease areas. The experimental results show that the average accuracy of the MFF-CNN is higher than that of currently used methods such as YoLov5s, Faster RCNN, CenterNet, and DETR, and the detection time is also reduced. The proposed method provides a feasible solution not only for the diagnosis of maize leaf diseases, but also for the detection of other plant diseases.


Introduction
Plant diseases are one of the main factors that affect plant growth, and the detection and identification of plant diseases are the keys to early diagnosis and the precise control of pests and diseases. Maize is the world's top-producing food crop. When encountering diseases, maize plants infected with viruses and fungi produce physiological lesions. The infected parts of the leaves show characteristics such as deformation, discoloration, curling, and rotting. There is a need for quick, easy, and accurate detection of plant disease areas and identification of the disease species. Traditional leaf detection methods based on manual observation and judgment of maize leaves require extensive practical experience and professional knowledge; they are time-consuming and costly and have a high false detection rate.
In recent years, deep learning has been widely used in various fields such as face recognition, intelligent transportation, and automatic driving. Deep learning applied to plant disease detection and identification can overcome the drawbacks of traditional diagnostic methods and significantly improve the accuracy of disease detection and identification, and it has attracted widespread attention [1,2]. Girshick et al. proposed R-CNN [3], which uses a convolutional neural network (CNN) to extract image features for plant disease detection. Fast R-CNN [4] builds on R-CNN to solve the problem of large numbers of overlapping boxes in the candidate region selection process of R-CNN. Faster R-CNN is widely used in the disease detection of grapes [5] and rice [6] due to its outstanding detection accuracy. However, Faster R-CNN is a two-stage detector, and the computational effort of selecting candidate boxes is heavy, which leads to slow detection and makes real-time detection difficult. Redmon et al. proposed the YoLo method [7], which is based on one-stage detection and does not require the generation of proposal boxes; instead, it divides the image into a grid to determine target boundaries and classes, which improves the detection speed and alleviates the real-time detection problem compared with Faster R-CNN. YoLo-based methods are also widely used in plant disease detection, such as YoLov3-based tea disease detection [8] and YoLov4-based citrus disease detection [9], but YoLo is not accurate enough in the detection and localization of small targets.
Zhou et al. [10] proposed CenterNet, an anchor-free detection algorithm, which removes the operation of generating anchor boxes and saves some time-consuming operations by estimating the loss from the heat map, thus improving the detection speed. Albattah et al. [11] improved CenterNet by extracting deep key points based on DenseNet-77 and classifying and recognizing 26 kinds of plant diseases in 14 plants, such as tomatoes, apples, and grapes, but the detection of small plant disease targets was not ideal. Rashid et al. [12] performed potato disease detection based on YoLov5, using multi-scale pooling and feature pyramid up- and down-sampling to enhance contextual semantic features, accommodate multi-scale plant diseases, and directly improve the accuracy for small plant disease targets. Their method also combined the accuracy of the anchor-free algorithm with the detection speed of the one-stage algorithm, and the detection effect is remarkable.
Scholars have carried out extensive research on the detection, classification, and recognition of plant leaf diseases based on deep learning technology, as shown in Table 1. In practical application scenarios, plant disease detection and recognition still face many challenges. The main reasons are as follows: (1) Changes in illumination make it difficult to locate the target area accurately. Due to changes in light intensity and reflection, among other reasons, it is difficult to accurately locate the diseased area in some detection images. Even under the same light intensity, the shooting angle and height may cause the color depth of the diseased area to differ, making the disease characteristics less distinct and thus affecting the detection accuracy. (2) Complex backgrounds make it difficult to detect the target accurately. The image background of plant leaf disease is complex and may include leaves, trunks, weeds, fallen leaves, shadows, etc. The color and shape of the plant disease may be similar to other objects in the background, which increases the difficulty of target detection. (3) Occlusion leads to missing target features and overlapping noise. Occlusion problems include leaf occlusion caused by changes in leaf posture, branch occlusion, light occlusion caused by external illumination, and mixed occlusion caused by different occlusion types. The missing features and overlapping noise caused by occlusion lead to false detections or even missed detections. (4) Sparse target distribution affects the detection accuracy. Due to the limitation of the convolutional receptive field, the connection between sparsely distributed target pixels is weak and the context extraction is insufficient, which leads to modeling failure and thus affects the detection accuracy.
This paper proposes a multi-scale feature fusion convolutional neural network (MFF-CNN) for the disease detection of maize leaves. It builds on the anchor-free, one-stage plant disease detection method YoLov5s, adding the CA attention module and improving the SPP module.
The main contributions of this paper are as follows: (1) We add a coordinate attention (CA) module to the backbone network and increase the weight of key features to strengthen the effective information of the feature map.
(2) We improve the spatial pyramid pooling (SPP) module to reduce the loss of feature information caused by traditional pooling. (3) We solve the problem of the insufficient dataset through data enhancement, enrich the training data, improve the generalization performance and robustness of the model, and prevent overfitting.

Deep Learning-Based Plant Disease Detection Technology
Plant disease detection technology uses computer vision technology to detect plant disease-infested areas and their exact locations under complex natural conditions, which is a prerequisite for the accurate classification and identification of plant diseases and the assessment of disease damage levels.
Early plant disease detection algorithms used a sliding window strategy to select candidate regions, then extracted features from the candidate regions, and, finally, classified them with a classifier to obtain the target regions. This method traverses the image with windows of different scales and widths. Although it does not miss any disease region target, the resulting redundant candidate windows require a great deal of computation, and traversing the whole disease image is very time-consuming, resulting in poor real-time detection [21].
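To make the cost of the sliding window strategy concrete, the following minimal sketch counts the candidate windows generated for one image. The image size, window sizes, and stride are hypothetical values chosen only for illustration, and `count_windows` is not a function from the cited work.

```python
def count_windows(image_w, image_h, window_sizes, stride):
    """Count square sliding windows over all window sizes (no padding)."""
    total = 0
    for size in window_sizes:
        if size > image_w or size > image_h:
            continue  # this window does not fit inside the image
        per_row = (image_w - size) // stride + 1
        per_col = (image_h - size) // stride + 1
        total += per_row * per_col
    return total


# Hypothetical setting: a 640 x 480 image scanned with a stride of 16 pixels.
single_scale = count_windows(640, 480, window_sizes=[64], stride=16)
multi_scale = count_windows(640, 480, window_sizes=[64, 128, 256], stride=16)
print(single_scale, multi_scale)
```

Every counted window has to pass through feature extraction and classification, which is why the number of windows, growing with each additional scale and with smaller strides, dominates the running time.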
With the rapid development of artificial intelligence technology, a variety of techniques based on machine vision have been applied to digital image processing and the implementation of image classification models. Reference [22] proposed a methodology that consists of five stages, as shown in Figure 1, i.e., image acquisition, preprocessing, segmentation, feature extraction, and classification, to find the damage caused by the cogollero worm in corn fields. Moreover, many deep learning-based methods can be applied to plant disease diagnosis, as shown in Figure 2.
Appl. Sci. 2022, 12, x FOR PEER REVIEW 4 of 20

Anchor-Based Plant Disease Detection Algorithm
The anchor-based plant disease detection algorithm adopts the detection idea of an anchor plus a priori box, setting a priori boxes (anchor-box) with different aspect ratios at each feature point of the plant feature map for screening and adjustment to obtain the final prediction box. Due to the redundant computation of the anchor-box, anchor-based plant disease detection is slow.

Anchor-based plant disease detection frameworks can be divided into two categories: two-stage detectors and one-stage detectors [23], as shown in Figure 3.

Two-Stage Detector
The main plant disease algorithms based on a two-stage detector are RCNN [3], SPP-Net [24], Fast R-CNN [4], Faster R-CNN [25], etc. The two-stage detector utilizes two networks to implement classification and regression, respectively. A feature extractor (backbone) is used to generate a series of proposal boxes that may contain plant disease targets to be detected, and then some filtering rules are applied to filter the proposal boxes and identify the disease targets.
Taking Faster R-CNN as an example, firstly, the images are fed into the feature extractor for feature extraction, and the extracted features are fed into the region proposal network (RPN) to generate candidate boxes. Secondly, the final proposal boxes are filtered according to the results of the classification and regression. Next, features are extracted from the proposal boxes in the feature map, and each feature is input to the region of interest (ROI) pooling layer and unified into a 7 × 7 size. Then, it is transformed into a one-dimensional vector by a fully connected layer. Finally, a classification and regression task is carried out to further correct the proposal box and determine the specific class of targets. The two-stage detector is more advantageous in the detection accuracy and classification precision of plant diseases.
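The step that unifies each proposal into a 7 × 7 size can be illustrated with a simplified ROI max pooling sketch. It assumes a single-channel feature map and integer proposal coordinates already expressed in feature-map cells; the function name `roi_max_pool` and the bin-splitting rule are illustrative assumptions rather than the exact Faster R-CNN implementation.

```python
import numpy as np


def roi_max_pool(feature, roi, out_size=7):
    """Max-pool one proposal (x1, y1, x2, y2; x2 and y2 exclusive) to out_size x out_size."""
    x1, y1, x2, y2 = roi
    region = feature[y1:y2, x1:x2]
    h, w = region.shape
    # Bin edges that split the region into out_size roughly equal parts.
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.empty((out_size, out_size), dtype=feature.dtype)
    for i in range(out_size):
        for j in range(out_size):
            # Force every bin to cover at least one cell for small proposals.
            y_end = max(ys[i + 1], ys[i] + 1)
            x_end = max(xs[j + 1], xs[j] + 1)
            out[i, j] = region[ys[i]:y_end, xs[j]:x_end].max()
    return out


feature_map = np.arange(14 * 21).reshape(14, 21)
pooled = roi_max_pool(feature_map, (0, 0, 21, 14))
print(pooled.shape)
```

Whatever the size of the proposal, the output has the same fixed shape, which is what allows the subsequent fully connected layer to accept proposals of arbitrary size.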

One-Stage Detector
One-stage detector-based plant disease detection algorithms mainly comprise YoLo [26], SSD [27], and RetinaNet [28]. A one-stage detector accomplishes the classification and localization of plant disease targets in a single network and extracts features directly from the network for plant disease category and location prediction.
Anchor-based plant disease algorithms have dominated the field of plant disease detection since the initial R-CNN, and their rapid development has brought decreasing numbers of parameters together with increasing speed and accuracy of plant disease detection.

Anchor-Free Plant Disease Detection Algorithms
The anchor-free plant disease detection algorithms mainly include YoLo, CornerNet [29], FSAF [30], FoveaBox [31], and CenterNet [32]. They abandon the idea of the prior bounding box and adopt the detection idea of key point prediction to obtain the final plant disease prediction box. These algorithms have a small number of network parameters, a low computational cost, and fast plant disease detection, but their accuracy is not very high.

YoLo
The main versions of YoLo are YoLov3 and YoLov5. YoLov3 first resizes the image to 416 × 416 and extracts feature maps through the feature extraction network; it then divides the image into a 13 × 13 grid, and the grid cell in which the center coordinate of a ground truth target falls is responsible for predicting that target. Each grid cell corresponds to three anchors and predicts three bounding boxes, and the network outputs three feature maps at different scales. YoLov3 uses multiple independent logistic classifiers for object prediction to calculate the likelihood of belonging to a specific label, and it uses a binary cross-entropy loss for each label when calculating the classification loss, reducing the complexity of the computation.
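The grid assignment rule described above can be sketched in a few lines. The example assumes a 416 × 416 input divided into a 13 × 13 grid and a target center given in pixels; `responsible_cell` is a hypothetical helper name used only for this illustration.

```python
def responsible_cell(cx, cy, image_size=416, grid_size=13):
    """Return (row, col) of the grid cell that contains the target center (cx, cy)."""
    stride = image_size / grid_size  # 32 pixels per cell for 416 / 13
    col = min(int(cx // stride), grid_size - 1)  # clamp centers on the right border
    row = min(int(cy // stride), grid_size - 1)  # clamp centers on the bottom border
    return row, col


# A target centered at x = 200, y = 100 falls in row 3, column 6.
print(responsible_cell(200.0, 100.0))
```

Only this cell, through its three anchors, is trained to predict the target, which is why each ground truth box is matched to exactly one location per scale.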
YoLov5 adds a new focus module (Focus) to YoLov4 [33] to reduce the information loss during the down-sampling operation. In addition, the number of anchors assigned as positive samples is increased to improve the convergence speed.

CenterNet
CenterNet is a detection algorithm based on key point estimation, which detects disease targets by estimating the center point or corner points. CenterNet improves on CornerNet by detecting an additional key point in addition to a pair of corner points, enhancing the ability to capture information about the target as a whole. As a result, CenterNet's detection speed and accuracy are considerably improved compared to the frameworks with both one-stage and two-stage detectors. CenterNet was proven to be applicable to plant disease detection under natural conditions. Xia et al. [34] performed the detection of apples through CenterNet's detection network combined with MobileNet v3, and the detection speed and accuracy were superior to SSD. However, there is still the problem of inaccurate key point matching for dense targets, and the results are less satisfactory for small plant disease targets.

Transformer-Based Plant Disease Detection Algorithm
CNN-based target detection algorithms (such as Faster RCNN, YoLo, FCOS [35], etc.) usually rely on a lot of manual design, such as rule-based label matching mechanisms, heuristic post-processing, etc. An end-to-end concise target detection framework based on a transformer [36] was proposed, which has good detection performance.
The transformer is a new neural network structure that mainly uses an attention mechanism to capture global contextual information and achieves long-range information fusion to extract more effective features. The transformer has had great success in natural language processing.
Carion et al. first proposed the transformer-based detection transformer (DETR) [37]. DETR treats target detection as a simple set prediction problem, removes the NMS and anchor design, has a concise pipeline, and introduces an attention mechanism to enhance feature representation, enabling simple and complete end-to-end target detection. This algorithm has high feature fusion capability and a high accuracy of detection, but the cost of training is significant.
DETR extracts maize leaf image features using a CNN and flattens the spatial dimensions of the feature map into one dimension. Fixed positional encodings are added to the features before they are fed into an encoder-decoder transformer. The decoder uses a multi-head attention mechanism to decode N objects in parallel at each decoding layer to produce N outputs. Finally, the target class recognition and bounding box regression are performed by feedforward neural networks (FFNs) to achieve disease target detection.

The Method Proposed in This Paper
The design framework is divided into the following stages, as shown in the activity diagram in Figure 4. First, the image dataset of maize leaf disease is obtained from Kaggle and augmented. Second, the relevant model parameters are loaded for pre-training. Third, the MFF-CNN model is obtained through multiple rounds of training. Then, the maize leaf data are input to detect the disease areas, and, finally, the maize leaf disease detection results are output.

Network Structure
A one-stage plant disease detection network, the MFF-CNN, is proposed based on the one-stage detector YoLov5s, as shown in Figure 5. The MFF-CNN consists of three parts, i.e., the backbone, neck, and detection head.

Backbone
The backbone is based on CSP Darknet53 and mainly uses the Conv module, CBL module, and CSP1-X module to extract the disease features of maize leaves.
The Conv module consists of a convolutional layer, a batch normalization operation, and a SiLU activation function. Its kernel size is 3 × 3, and the step size is 2. The CBL module is similar to the Conv module in that it also uses convolutional layers with batch normalization operations, except that it uses Leaky ReLU as the activation function.
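The difference between the Conv and CBL modules lies only in the activation function, and the stride of 2 halves the spatial resolution. The sketch below illustrates both points with NumPy; the padding of 1, the Leaky ReLU slope of 0.1, and the 640-pixel input are assumptions made for illustration and are not stated in the text.

```python
import numpy as np


def conv_output_size(n, kernel=3, stride=2, padding=1):
    """Spatial size after a convolution (padding=1 is an assumed value)."""
    return (n + 2 * padding - kernel) // stride + 1


def silu(x):
    """SiLU activation used by the Conv module: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def leaky_relu(x, negative_slope=0.1):
    """Leaky ReLU used by the CBL module (the slope 0.1 is an assumed value)."""
    return np.where(x > 0, x, negative_slope * x)


x = np.array([-2.0, 0.0, 2.0])
print(conv_output_size(640))  # a hypothetical 640-pixel side is halved
print(silu(x))
print(leaky_relu(x))
```

SiLU is smooth and slightly negative for negative inputs, whereas Leaky ReLU is piecewise linear; batch normalization, which both modules apply before the activation, is omitted here for brevity.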
The MFF-CNN uses two different CSP structures. The CSP1-X with residual components (X bottlenecks) in the backbone is shown in Figure 6, while the neck uses convolutional layers (X CBLs) instead of residual components, shown as CSP2-X in Figure 7. The cross-layer design of the CSP reduces computation, improves inference speed, reduces memory cost, and guarantees accuracy.
The MFF-CNN model also adds the coordinate attention (CA) module (see Section 2.4.2) and the improved spatial pyramid pooling (SPP) module (see Section 2.4.3) to the backbone.

Neck
The neck adopts the feature pyramid network (FPN) [38] and path aggregation network (PAN) [39]. The MFF-CNN borrows from PAN and adds a bottom-up feature pyramid network after sampling on the FPN for feature fusion. The FPN extracts stronger semantic information from the top-down, while the PAN extracts stronger localization information from the bottom-up, thus fusing the feature maps of the different layers of the CNN and strengthening the feature information extraction ability.
Three feature maps of different sizes with rich semantic information are obtained after three different concatenation operations to meet the needs of plant disease target detection at different scales. Finally, the CSP2-1 operation is added to each of the three feature maps and then sent to the detection end.

Detection Head
The detection network uses three detection heads with GIOU-LOSS as the loss function and outputs three scales of feature maps with 20 × 20, 40 × 40, and 80 × 80 grids for detecting large, medium, and small maize disease targets, respectively. Each grid cell contains three prediction boxes, each containing information about the confidence of the object and the position of the prediction box. Maize disease detection is completed at the detection head by applying non-maximum suppression (NMS) [40] as post-processing to eliminate redundant prediction boxes and retain the prediction box with the highest confidence.
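The NMS post-processing step can be sketched as a greedy procedure: keep the box with the highest confidence, discard the remaining boxes that overlap it too strongly, and repeat. The following NumPy sketch assumes boxes in [x1, y1, x2, y2] format and an overlap threshold of 0.5; both the threshold and the function names are illustrative assumptions, not settings reported for the MFF-CNN.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS; returns the indices of the kept boxes, best score first."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending confidence
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]  # drop near-duplicates of `best`
    return keep


boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
print(kept)
```

In this toy input the second box overlaps the first with an IoU of about 0.82 and is suppressed, while the distant third box is retained.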
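Intersection over union (IOU) is a metric for evaluating the accuracy and can be expressed as:
$$ \mathrm{IOU} = \frac{|A \cap B|}{|A \cup B|} \tag{1} $$
The MFF-CNN for border regression uses complete-IOU (CIOU) instead of IOU [41] for model training. CIOU is expressed as follows:
$$ \mathrm{CIOU} = \mathrm{IOU} - \frac{\rho^{2}(b, b^{gt})}{c^{2}} - \alpha \nu \tag{2} $$
Among them, $A$ denotes the ground truth, $B$ denotes the predicted frame, $\rho^{2}(b, b^{gt})$ represents the squared Euclidean distance between the centroids $b$ and $b^{gt}$ of the prediction box and the ground truth, and $c$ is the diagonal length of the smallest box enclosing both boxes.
The LOSS when CIOU is regressed can be calculated as:
$$ \mathrm{LOSS}_{CIOU} = 1 - \mathrm{IOU} + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha \nu \tag{3} $$
where $\alpha$ and $\nu$ are denoted as follows:
$$ \alpha = \frac{\nu}{(1 - \mathrm{IOU}) + \nu} \tag{4} $$
$$ \nu = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2} \tag{5} $$
Here, $w$ and $h$ are the width and height of the prediction box, and $w^{gt}$ and $h^{gt}$ are those of the ground truth.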
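As a worked illustration of the CIOU regression loss, the sketch below computes it for a single pair of boxes, following the standard CIOU definition of [41]: an IOU term, a normalized center-distance term, and an aspect-ratio term weighted by α. The [x1, y1, x2, y2] box format, the small constant `eps`, and the function name are assumptions introduced for this example.

```python
import math


def ciou_loss(pred, gt, eps=1e-9):
    """CIOU loss for two boxes [x1, y1, x2, y2] with positive width and height."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    # Intersection over union.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / (union + eps)
    # Squared distance between the two box centers.
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 + (max(py2, gy2) - min(py1, gy1)) ** 2 + eps
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (
        math.atan((gx2 - gx1) / (gy2 - gy1)) - math.atan((px2 - px1) / (py2 - py1))
    ) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v


print(ciou_loss((0, 0, 40, 40), (0, 0, 40, 40)))    # identical boxes: loss close to 0
print(ciou_loss((0, 0, 10, 10), (20, 20, 30, 30)))  # disjoint boxes: loss above 1
```

Unlike a plain IOU loss, the value keeps changing for disjoint boxes because the center-distance term still varies with how far apart the boxes are, which gives the regression a useful signal even without overlap.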

Coordinate Attention
In this paper, the weight of key features is enhanced by adding coordinate attention (CA) [42] in the backbone to select the feature information extracted from the backbone, as shown in Figure 8 and Algorithm 1.
In order to enhance the effective features of the feature map, the coordinate attention module embeds the location information into the attention of the channel in the following two steps: coordinate information embedding and coordinate attention generation.

Coordinate Information Embedding
In Figure 8, the coordinate information is embedded into the CA module through the X average pool and the Y average pool. Global pooling is often used for global encoding because it compresses global spatial information into channel descriptors, but this makes it difficult to preserve location information.
To capture more precise location information, the CA module decomposes the global pooling of Equation (6) into a pair of one-dimensional feature encoding operations:
$$ z_c = \frac{1}{H \times W} \sum_{0 \le i < H} \sum_{0 \le j < W} x_c(i, j) \tag{6} $$
Specifically, given the input $x_c(i, j)$, each channel is first encoded along the horizontal and vertical coordinates using a pooling kernel of size (H, 1) or (1, W). The output of the c-th channel at height h can be written as:
$$ z_c^{h}(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) \tag{7} $$
Similarly, the output of the c-th channel at width w can be written as:
$$ z_c^{w}(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \tag{8} $$
The above two transformations, i.e., the X average pool and Y average pool of Figure 8, aggregate features along the two spatial directions x and y, thus obtaining a pair of direction-aware feature maps. The two transformations enable the attention module to capture long-range dependencies along one spatial direction and to preserve precise position information along the other spatial direction. This addresses the difficulty that global pooling has in preserving location information and helps the network to locate the target of interest more accurately.

Coordinate Attention Generation
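To take advantage of the embedded information in Section 3.2.1, the CA module concatenates the two aggregated feature maps and then transforms them using a shared 1 × 1 convolutional transform function $F_1$, as follows:
$$ f = \delta\left(F_1\left(\left[z^{h}, z^{w}\right]\right)\right) \tag{9} $$
where $[\cdot, \cdot]$ denotes a concatenation operation along the spatial dimension, $\delta$ is a nonlinear activation function, $f \in \mathbb{R}^{C/r \times (H+W)}$ is a feature map encoding spatial information in the horizontal and vertical directions, and $r$ is the reduction ratio. Subsequently, along the spatial dimension, $f$ is split into two independent tensors, $f^{h} \in \mathbb{R}^{C/r \times H}$ and $f^{w} \in \mathbb{R}^{C/r \times W}$. The feature maps $f^{h}$ and $f^{w}$ are then transformed to the same number of channels as the input $x_c(i, j)$ using the two convolutional transforms $F_h$ and $F_w$, respectively, with the following results:
$$ g^{h} = \sigma\left(F_h\left(f^{h}\right)\right) \tag{10} $$
$$ g^{w} = \sigma\left(F_w\left(f^{w}\right)\right) \tag{11} $$
where $\sigma$ is the sigmoid function. Then, expanding $g^{h}$ and $g^{w}$ as attention weights, the final output of the CA module is as follows:
$$ y_c(i, j) = x_c(i, j) \times g_c^{h}(i) \times g_c^{w}(j) \tag{12} $$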
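The two steps of the CA module, coordinate information embedding followed by coordinate attention generation, can be traced in a compact NumPy sketch for a single feature map of shape (C, H, W). The 1 × 1 convolutions are written as matrix products over the channel axis, ReLU stands in for the nonlinear activation δ, and batch normalization is omitted; these simplifications, the random weights, and the function names are assumptions for illustration and do not reproduce the trained module of the MFF-CNN.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def coordinate_attention(x, w1, wh, ww):
    """x: (C, H, W); w1: (C // r, C); wh and ww: (C, C // r)."""
    C, H, W = x.shape
    # Step 1: coordinate information embedding (two 1-D average pools).
    z_h = x.mean(axis=2)                     # (C, H): average over the width
    z_w = x.mean(axis=1)                     # (C, W): average over the height
    # Step 2: coordinate attention generation.
    z = np.concatenate([z_h, z_w], axis=1)   # (C, H + W)
    f = np.maximum(w1 @ z, 0.0)              # shared 1 x 1 transform, ReLU as delta
    f_h, f_w = f[:, :H], f[:, H:]            # split back along the spatial axis
    g_h = sigmoid(wh @ f_h)                  # (C, H) attention weights over rows
    g_w = sigmoid(ww @ f_w)                  # (C, W) attention weights over columns
    return x * g_h[:, :, None] * g_w[:, None, :]


rng = np.random.default_rng(0)
C, H, W, r = 8, 5, 6, 4
x = rng.normal(size=(C, H, W))
y = coordinate_attention(
    x,
    rng.normal(size=(C // r, C)),
    rng.normal(size=(C, C // r)),
    rng.normal(size=(C, C // r)),
)
print(y.shape)
```

The output keeps the shape of the input, and each value is rescaled by one row weight and one column weight, so the module can emphasize a location without discarding the position information that global pooling would lose.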

SSP Improvement
Both maximum pooling and average pooling will lose some of the feature information of the image during the process of pooling and reduce the performance of the whole network. To solve this problem, we improve the spatial pyramid pooling (SPP) module in the YoLov5s network, as shown in Figure 9. The SSP module has a spatial pyramidal pooling structure, including three SoftPools [43] and a Skip Connection. SoftPool maps all pixels within the receptive field to the next network layer in a Softmax-weighted summation. The SSP improvement module down-samples the feature map while retaining more fine-grained plant information in the feature map, thus reducing information loss.
Figure 9. SSP improvement module.
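After SoftPool, we can obtain the output value $\tilde{a}$ as the sum of all Softmax-weighted activations $a_i$ in the region $R$, as follows:
$$ \tilde{a} = \sum_{i \in R} w_i a_i, \qquad w_i = \frac{e^{a_i}}{\sum_{j \in R} e^{a_j}} \tag{13} $$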
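A minimal NumPy sketch of SoftPool on a single-channel map is given below to illustrate the Softmax-weighted summation. It pools non-overlapping k × k regions, which is a simplification: the kernel sizes, strides, and padding of the three SoftPools in the improved module are not specified here, and `softpool2d` is a hypothetical name.

```python
import numpy as np


def softpool2d(x, k=2):
    """SoftPool over non-overlapping k x k regions of a 2-D map of shape (H, W)."""
    H, W = x.shape
    assert H % k == 0 and W % k == 0, "H and W must be multiples of k"
    # Gather each k x k region into the last axis: shape (H // k, W // k, k * k).
    regions = x.reshape(H // k, k, W // k, k).transpose(0, 2, 1, 3).reshape(H // k, W // k, k * k)
    # Softmax weights inside each region (shifted by the maximum for stability).
    weights = np.exp(regions - regions.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the activations in each region.
    return (weights * regions).sum(axis=-1)


feature = np.arange(16, dtype=float).reshape(4, 4)
pooled_map = softpool2d(feature)
print(pooled_map)
```

Each output lies between the average and the maximum of its region: larger activations dominate, as in maximum pooling, yet every activation still contributes to the result, which is the property used to reduce information loss.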

Experimentation and Performance Evaluation
To examine the effectiveness of the MFF-CNN in maize leaf detection, we present a comparative analysis of the maize leaf detection results of five plant disease algorithms, YoLov5s, DETR, CenterNet, Faster RCNN, and MFF-CNN, based on the maize leaf dataset.

Dataset and Parameter Settings
There are 2265 maize leaf data in the dataset [44] with image resolutions of 2448 × 3264 and 3456 × 4608.To prevent over-fitting and, at the same time, to improve the robustness of the maize leaf detection network, the data are enhanced with the help of flip transform, random clipping, and scale transform to expand the maize leaf dataset to 6795, and they take an RGB image format. The dataset is in PASCAL VOC [45,46], and it is divided into a training set, a validation set, and a test set according to the ratio of 8:1:1.
To enhance the robustness of the leaf detection network, we use 5437 images of maize leaves with location annotations for training the maize leaf detection model. The detection result is considered correct when the intersection over union (IOU) between the prediction box and the truth box is greater than 0.5. To verify the effectiveness of the MFF-CNN, we selected four mainstream plant disease algorithms for comparison, including Faster R-CNN based on the two-stage detector, YoLov5s based on the one-stage detector, anchorfree CenterNet, and transformer-based DETR. The Faster R-CNN uses a Resnet50 model pre-trained on ImageNet [47] as the backbone with a learning rate of 0.00001, an input image size of 600 × 600, and a training batch of two. CenterNet also uses Resnet50 as the backbone with a learning rate of 0.00001, an input image size of 512 × 512, and a training batch of eight. DETR uses Resnet50 as the backbone with a learning rate of 0.00001, input image sizes of 2448 × 3264 and 3456 × 4608, and a training batch of two. CenterNet, Faster-RCNN, and DETR all have an epoch of 300, and YOLO v5 and the MFF-CNN have an epoch of 150.
YoLov5s uses Darknet53 as the backbone with a learning rate of 0.00001, an input image size of 640 × 640, and a training batch of 16. The MFF-CNN proposed in this paper uses New CSP-Darknet53 as the backbone with an initial learning rate of 0.00001, an input This method achieves the fusion of global and local features, thus enriching the expression capability of the feature map, and it is suitable for the feature extraction network of plant diseases.
Suppose the activation value of each pixel in the receptive field of SoftPool is $a_i$ and the activation values in the kernel region R are $a_j$. Then, the weight $w_i$ of each pixel in the receptive field is as follows:
$$ w_i = \frac{e^{a_i}}{\sum_{j \in R} e^{a_j}} \tag{13} $$
After SoftPool, we can obtain the standard summation output value $\tilde{a}$ of all weighted activations in R, as follows:
$$ \tilde{a} = \sum_{i \in R} w_i \, a_i \tag{14} $$
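As a small illustration of the Softmax-weighted summation, the sketch below pools one kernel region given as a flat list of activations. The function name `softpool` and the sample values are assumptions; subtracting the maximum before exponentiating is a numerical-stability choice that leaves the weights unchanged.

```python
import math


def softpool(region):
    """SoftPool over a flat list of activations from one kernel region R."""
    # Shifting by the maximum does not change the softmax weights
    # but avoids overflow for large activations.
    peak = max(region)
    exps = [math.exp(a - peak) for a in region]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the activations with their softmax weights.
    return sum(w * a for w, a in zip(weights, region))


# Hypothetical 2 x 2 region, flattened.
region = [1.0, 2.0, 3.0, 4.0]
pooled = softpool(region)
average = sum(region) / len(region)
print(average < pooled < max(region))  # True
```

The pooled value lies between the average and the maximum of the region, which matches the intent of keeping strong activations dominant while still letting weaker ones contribute.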

Experimentation and Performance Evaluation
To examine the effectiveness of the MFF-CNN in maize leaf detection, we present a comparative analysis of the maize leaf detection results of five plant disease algorithms, YoLov5s, DETR, CenterNet, Faster RCNN, and MFF-CNN, based on the maize leaf dataset.

Dataset and Parameter Settings
There are 2265 maize leaf images in the dataset [44], with image resolutions of 2448 × 3264 and 3456 × 4608. To prevent over-fitting and, at the same time, to improve the robustness of the maize leaf detection network, the data are augmented with flip transforms, random cropping, and scale transforms, which expands the maize leaf dataset to 6795 images in RGB format. The dataset follows the PASCAL VOC format [45,46] and is divided into a training set, a validation set, and a test set in the ratio 8:1:1.
To enhance the robustness of the leaf detection network, we use 5437 images of maize leaves with location annotations for training the maize leaf detection model. A detection result is considered correct when the intersection over union (IOU) between the prediction box and the truth box is greater than 0.5. To verify the effectiveness of the MFF-CNN, we selected four mainstream plant disease algorithms for comparison: the two-stage detector Faster R-CNN, the one-stage detector YoLov5s, the anchor-free CenterNet, and the transformer-based DETR. Faster R-CNN uses a ResNet50 model pre-trained on ImageNet [47] as the backbone, with a learning rate of 0.00001, an input image size of 600 × 600, and a batch size of two. CenterNet also uses ResNet50 as the backbone, with a learning rate of 0.00001, an input image size of 512 × 512, and a batch size of eight. DETR uses ResNet50 as the backbone, with a learning rate of 0.00001, input image sizes of 2448 × 3264 and 3456 × 4608, and a batch size of two. CenterNet, Faster R-CNN, and DETR are all trained for 300 epochs, and YoLov5s and the MFF-CNN are trained for 150 epochs.
YoLov5s uses Darknet53 as the backbone, with a learning rate of 0.00001, an input image size of 640 × 640, and a batch size of 16. The MFF-CNN proposed in this paper uses New CSP-Darknet53 as the backbone, with an initial learning rate of 0.00001, an input image size of 640 × 640, and a batch size of 16. The graphics card for this experiment is an RTX3080ti, with CUDA Toolkit version 11. and the GPU-accelerated deep neural network library cuDNN version 8.0.4, both developed by NVIDIA Corporation. We installed PyTorch 1.8.2 + cu111, developed by Facebook AI Research, and the open-source Python 3.8.12 on a Linux system.
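The IOU criterion used above to decide whether a detection is correct can be sketched as follows. Boxes are assumed to be given as (xmin, ymin, xmax, ymax) tuples, as in PASCAL VOC annotations; the function names and the sample boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred_box, truth_box, threshold=0.5):
    """A detection counts as correct when its IOU exceeds the threshold."""
    return iou(pred_box, truth_box) > threshold


truth = (0, 0, 10, 10)
print(is_correct((2, 0, 12, 10), truth))  # True  (IOU = 80 / 120)
print(is_correct((5, 0, 15, 10), truth))  # False (IOU = 50 / 150)
```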

Experimental Results and Analysis
The quantitative analysis of all methods in this paper is conducted based on the same indicators and the same dataset.
The mean average precision (mAP) is an essential performance evaluation metric for target detection models. The mAP here is the mean of the average precision (AP) values calculated for each category at IOU = 0.5. The experimental results are shown in Table 2; the MFF-CNN achieves the best mAP among the compared maize disease detection methods. It is 10.4% higher than Faster R-CNN and 5.3% higher than CenterNet, and it also outperforms YoLov5s and DETR. There are two main reasons that the MFF-CNN achieves the best detection performance on maize leaves. One is that the CA attention module enhances the feature information of the detected targets, in particular that of small targets, overlapping occluded targets, and fuzzy targets, which improves detection. The other is the application of SoftPool, which down-samples to reduce the amount of data while retaining as much feature information as possible, thus preventing the loss of detection information for maize leaves with blurred edges and corners. These two modules significantly improve the detection accuracy and efficiency of the MFF-CNN model in maize leaf detection.
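To make the metric concrete, the sketch below computes AP for one class from ranked detections and then averages over classes. The paper does not state which interpolation it uses, so the all-point interpolation here is an assumption, as are the function names, the class names, and the toy detections. Each detection is a (confidence, is_true_positive) pair, where a true positive is a detection that matches a truth box with IOU greater than 0.5.

```python
def average_precision(scored_hits, num_truths):
    """AP for one class from (confidence, is_true_positive) pairs."""
    ranked = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    true_positives = 0
    precisions, recalls = [], []
    for rank, (_, hit) in enumerate(ranked, start=1):
        if hit:
            true_positives += 1
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_truths)
    # Precision envelope: each point takes the best precision to its right.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Area under the precision-recall envelope.
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(per_class):
    """per_class maps a class name to (scored_hits, num_truths)."""
    aps = [average_precision(hits, n) for hits, n in per_class.values()]
    return sum(aps) / len(aps)


# Hypothetical detections for two illustrative classes.
per_class = {
    "class_a": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "class_b": ([(0.6, True)], 1),
}
map_value = mean_average_precision(per_class)
print(round(map_value, 4))  # 0.9167
```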
To verify the time performance of the MFF-CNN, we use detection time, i.e., the time to process one image, as the metric. To obtain the detection time, we run detection on 1000 maize leaf images and then calculate the average inference time between an image entering the network model and the model producing its output. The detection time of the MFF-CNN is 0.039 s, which is not as good as that of YoLov5s, but it is faster than DETR, CenterNet, and Faster R-CNN by at least 0.37 s. Because the Focus module reduces the image size and the computation, it greatly reduces the MFF-CNN detection time. However, because the MFF-CNN adds a CA module and an SSP module on top of the YoLov5s network, its detection time is slightly longer than that of YoLov5s.
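The averaging procedure described above might look like the following sketch. The `dummy_model` function is a hypothetical stand-in for a detector, and the 1000 synthetic inputs only mirror the number of images used in the experiment.

```python
import time


def mean_inference_time(model, images):
    """Average wall-clock time per image, in seconds."""
    total = 0.0
    for image in images:
        start = time.perf_counter()
        model(image)  # input goes in, output comes out
        total += time.perf_counter() - start
    return total / len(images)


def dummy_model(image):
    """Hypothetical stand-in for a detector; it only sums the input."""
    return sum(image)


# 1000 synthetic inputs, mirroring the 1000 images used in the experiment.
images = [[0.0] * 1000 for _ in range(1000)]
mean_seconds = mean_inference_time(dummy_model, images)
print(f"{mean_seconds:.6f} s per image")
```

When the model runs on a GPU, kernels are typically launched asynchronously, so a synchronization call (for example `torch.cuda.synchronize()` in PyTorch) before reading the clock is usually needed for the measurement to reflect the real inference time.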
We use floating point operations (FLOPs) to measure the complexity of each algorithm. The FLOPs value for the MFF-CNN is 4.2 G, which is much lower than that of the other algorithms, indicating low model complexity and a small computational cost.
In the detection of plant leaf diseases, missed detection is one of the important factors affecting the accuracy of disease detection. This section uses sensitivity to measure the missed detection rate of the model [48,49]. Sensitivity is the ratio of true positives (TP) to actual positives (TP plus FN), where FN denotes false negatives. The higher the sensitivity, the lower the missed detection rate. Table 3 compares the performance of the proposed MFF-CNN with that of the other algorithms in terms of sensitivity (%) at various numbers of false positives (FPs) per image. The numbers 0.5, 1, 2, and 4 are the numbers of FPs per image. As can be seen from the table, when IOU = 0.5 and there are four false positives per image, the sensitivity of our algorithm is 3.16% higher than that of YoLov5s and 9.83% higher than that of Faster R-CNN. The results show that the MFF-CNN has the highest sensitivity, i.e., the lowest missed detection rate, and it exhibits the best detection performance.
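The relation between sensitivity and the missed detection rate can be written out directly. The counts below are hypothetical and serve only to show the arithmetic; they are not taken from Table 3.

```python
def sensitivity(true_positives, false_negatives):
    """Sensitivity = TP / (TP + FN)."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        raise ValueError("sensitivity is undefined without actual positives")
    return true_positives / actual_positives


# Hypothetical counts: 90 lesions found, 10 lesions missed.
value = sensitivity(90, 10)
missed_rate = 1.0 - value
print(value)                  # 0.9
print(round(missed_rate, 2))  # 0.1
```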
We analyze several specific cases in the following sections.

Detection of Target Area Overlap Occlusion
As shown in Figure 10, where the disease areas of the maize leaves overlap and occlude one another, the three algorithms (with the exception of MFF-CNN) missed the overlapping leaves in the upper left corner of the image. Neither DETR nor CenterNet performed well in detecting the remaining disease areas in the images.
DETR is a transformer-based target detection framework that transforms target detection into a simple set prediction problem to achieve end-to-end target detection. However, it has a long training time, misses the small target diseases that appear densely in the middle of maize leaves in the experiment, and has insufficient detection accuracy for dense, small targets. CenterNet is an anchor-free method that directly predicts the centroid coordinates of objects. When the centers of multiple objects overlap, the objects are misidentified as one object. Further, if the centroids of two objects coincide after down-sampling during prediction, they are also mistaken for one object. When there are dense targets in the detection area, such as maize leaves that are close together with overlapping occlusions, CenterNet mismatches the key points and exhibits poor performance.
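The centroid collision described for CenterNet can be illustrated with a short sketch. The output stride of 4 and the box coordinates are assumptions chosen for illustration; `center_cell` is a hypothetical helper, not part of any CenterNet implementation.

```python
def center_cell(box, stride=4):
    """Map the center of an (xmin, ymin, xmax, ymax) box to a heatmap cell."""
    center_x = (box[0] + box[2]) / 2
    center_y = (box[1] + box[3]) / 2
    # Down-sampling by the stride quantizes nearby centers to the same cell.
    return int(center_x // stride), int(center_y // stride)


# Hypothetical lesion boxes in input-image pixels.
lesion_a = (100, 100, 110, 110)  # center (105, 105)
lesion_b = (102, 101, 112, 113)  # center (107, 107)
lesion_c = (120, 100, 130, 110)  # center (125, 105)

print(center_cell(lesion_a))  # (26, 26)
print(center_cell(lesion_b))  # (26, 26), collides with lesion_a
print(center_cell(lesion_c))  # (31, 26)
```

Two nearby lesions whose centers fall into the same heatmap cell leave only one peak to detect, which is consistent with the merged detections reported for dense, overlapping targets.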
In our proposed MFF-CNN, the X average pool and Y average pool aggregate features along the two spatial directions and thus obtain a pair of feature maps with perceptual capabilities in different directions, which helps the network to locate the target of interest more accurately. Thus, the MFF-CNN performed best in the complex case of overlapping occlusion of diseased areas of maize leaves, achieving a mAP of 0.486.

Detection of Sparsely Distributed Targets
As can be seen in Figure 11, the diseased areas of the maize leaves were sparsely distributed, mainly concentrated in the upper right part and the lower left part of the leaves, with a certain distance between the two diseased areas. DETR and CenterNet mainly detected disease in the upper right part of the leaf, and there were a large number of missed detections. Although the transformer used in DETR focuses more on local key information, the transformer model often requires a large amount of data and a long training period to converge. Although data augmentation is applied to the maize leaf dataset to mitigate over-fitting, the dataset for this experiment is much smaller than ImageNet and therefore does not fully meet the data volume requirements of the transformer model. This is the main reason why DETR has a large number of missed detections in the lower left part of the leaf, as well as at the leaf edges. Although YoLov5s detects most of the diseases, its detection confidence is lower than that of the MFF-CNN, and its detection effect is not satisfactory in the case of sparse disease distribution.
The MFF-CNN algorithm proposed in this paper uses SoftPool to reduce the loss of feature information in the pooling process, thus helping to ensure the comprehensive extraction of disease feature information in plant leaves. Thus, the MFF-CNN performs best even in the complex case of sparse disease distribution in maize leaves.

Detection of Target and Background Texture Similarity
In Figure 12, the texture and color of the maize leaves are very similar to the background, and it is hard to detect disease. YoLov5s does not have the same detection confidence as our proposed MFF-CNN on the maize leaf dataset. The main reason is the relatively large image resolution of the maize leaf dataset used in this experiment, which makes YoLov5s ineffective at detecting small targets with inadequate multi-scale information.
Our proposed MFF-CNN enhances multi-scale features with the CA module and prevents information loss as much as possible with the help of SoftPool, which provides a good foundation for the subsequent detection of multi-scale targets. It is experimentally demonstrated that with the joint efforts of the CA module and SoftPool pooling, the MFF-CNN obtains the best detection accuracy, with a 1.6% improvement compared to YoLov5s and a 1.9% improvement compared to DETR.

Discussion
The proposed method of multi-scale feature fusion realizes the extraction of context information and the detection of maize leaf diseases. However, in the detection of edge targets and dense small targets, the prefetching of small target detection cannot be divided, and missed detections can even occur. This is because the convolution operation cannot model the image globally, resulting in insufficient extraction of global context information. To solve the above problems, in the next stage of work, we will use the idea of a transformer to optimize the baseline and use the DropBlock [50] convolutional regularization method to improve the detection accuracy.
In our model, we treat the images in a simple way: the data are augmented with flip transforms, random cropping, and scale transforms. Some segmentation and classification techniques that use color [51] are capable of processing trivial features such as shadows, noise, pixel saturation, low light, different crop varieties, and intrinsic camera parameters to improve model quality. We will attempt to use these methods in our model to improve detection performance in the future.
The quality of the dataset plays an important role in the detection effect of the algorithm. In a future study, we will sample maize leaves of different varieties, fertility stages, and shooting angles under field conditions, and strictly label them to establish a large dataset of maize leaves.
Moreover, through learning and analysis of big data, we will develop and design plant disease models applicable to maize leaves in general, explore deep learning networks with better performance, improve the accuracy of maize leaf disease detection, and apply them to early plant diagnosis and intelligent monitoring.
Meanwhile, to improve the practical application capability, the MFF-CNN model was developed into a corresponding access point installed on mobile devices such as UAVs and smartphones to provide timely, accurate, and wide-range, real-time monitoring information for maize leaf disease identification. In addition, we will attempt to use the MFF-CNN for research on the detection of biotic and abiotic stresses in agriculture, such as saline stress and drought stress.

Conclusions
In order to realize accurate and real-time maize leaf disease detection, based on deep learning technology, this paper proposes an MFF-CNN based on multi-scale feature fusion for maize leaf disease detection.
We conducted experiments under the complex conditions of combined overlapping occlusion, sparse target distribution, and similar textures of the diseased areas and backgrounds. The results show that the proposed method obtains the best detection performance compared with the maize disease algorithms of Faster R-CNN, CenterNet, YoLov5s, and DETR.