A Method for Segmenting Disease Lesions of Maize Leaves in Real Time Using Attention YOLACT++

Abstract: Northern leaf blight (NLB) is a serious disease in maize which leads to significant yield losses. Automatic and accurate methods of quantifying disease are crucial for disease identification and quantitative assessment of severity. Leaf images collected against natural backgrounds pose a great challenge to the segmentation of disease lesions. To address these problems, we propose an image segmentation method based on YOLACT++ with an attention module for segmenting disease lesions of maize leaves under natural conditions, in order to improve the accuracy and real-time performance of lesion segmentation. The attention module is applied to the output of the ResNet-101 backbone and the output of the FPN. The experimental results demonstrate that the proposed method improves segmentation accuracy compared with state-of-the-art disease lesion-segmentation methods. The proposed method achieved 98.71% maize leaf lesion segmentation precision, a comprehensive evaluation index (F1 score) of 98.36%, and a mean Intersection over Union of 84.91%; the average processing time for a single image was about 31.5 ms. The results show that the proposed method allows for the automatic and accurate quantitative assessment of crop disease severity under natural conditions.


Introduction
Maize is an important economic crop, with the third-largest area sown and total production in the world, after rice and wheat. The northern leaf blight of maize, caused by the fungus S. turcica, is a major disease impacting maize in wet climates and typically shows symptoms of oblong, "cigar-shaped" tan or greyish lesions. Leaf lesions result in a reduction in the leaf area where photosynthesis takes place. The more lesions on maize leaves and the earlier in the season the lesions occur, the greater the loss of photosynthetic area and the reduction in maize yield [1]. The annual yield loss of maize grown in the United States and Canada due to northern leaf blight reached approximately 14 million tons between 2012 and 2015, accounting for a quarter of the total loss caused by the disease globally [2]. Therefore, a timely grasp of the severity of crop diseases is of great significance for effective disease prevention and the formulation of scientific prevention and control strategies.
The segmentation of disease lesions in maize leaf images directly affects the recognition of crop diseases and the accuracy of quantitative assessments of disease severity [3,4]. How to segment diseased crop leaves with high efficiency and high quality is a research hotspot. In the last two decades, traditional image processing techniques, such as edge detection, color space transformation, and feature space transformation, were used to extract and recognize lesions [5,6]. Using the grayscale intensity histograms of channel H (from the HSV color space) and channel a (from the L*a*b* color space), one can find the pixel value that best separates healthy and diseased tissues, and segment the lesions [7]. On the basis of image enhancement, a strong correlation-based approach was applied to segment apple leaf lesions.
In recent years, the attention model has been widely used in various deep learning tasks, for example, natural language processing, image recognition, and speech recognition. The combination of attention mechanisms and deep learning makes the task of plant disease lesion recognition and segmentation more interesting and in-depth. The most common attention modules are SE-Net (Squeeze-and-Excitation Networks) [41], CBAM (Convolutional Block Attention Module) [42], and VSG-Net (Visual-Spatial-Graph Network) [43]. Zhong et al. proposed a grouped attention module based on a grouped activation strategy, which used high-order features to guide the enhancement of low-order features. Meanwhile, the enhancement coefficients within groups were calculated by grouping to reduce the suppression between different groups and enhance the ability of feature expression. The pixel accuracy of segmentation was 93.9%, and the mIoU was 78.6% [44].
Images of crop leaves collected against complex backgrounds are affected by various factors, such as weeds, soil, and light intensity. More importantly, each disease spot has its own shape, color, and texture. These factors present researchers with a great challenge. To address these problems, in this study a novel model named Attention YOLACT++ is proposed for maize NLB lesion recognition and segmentation. It can better detect and segment the edges of disease spots, and it provides a technical tool for the subsequent accurate identification and quantitative assessment of the severity of diseased maize leaves. The main contributions of this study are as follows:
• We proposed a new instance segmentation architecture. We adopted the YOLACT++ model for the segmentation task and applied the convolutional block attention module to the ResNet-101 module and the FPN module to improve segmentation performance and model robustness.

• Our model achieved higher segmentation speed and accuracy on a maize northern leaf blight dataset collected against a complex background, with performance superior to that of current instance segmentation models.

Dataset Description
The maize images used for evaluating the performance of the proposed method corresponded to aerial images of northern leaf blight, which can be downloaded from the website https://osf.io/vfawp/ (accessed on 10 October 2021). This study took the northern leaf blight of maize as the research object. All trials were captured at Cornell University's Musgrave Research Farm in Aurora, New York, in the summer of 2017. The dataset was captured by mounting the camera on a UAV flying at an altitude of 6 m [45]. Examples of the maize images collected by the drone are shown in Figure 1.

Dataset Annotation
In this study, images of maize leaves taken by the drone were cropped using Photoshop. To maintain the aspect ratio of the lesions, the pixel size of the maize leaf images at the time of cropping was 550 × 550. Some example images are shown in Figure 2a. We used 1200 images of maize northern leaf blight, randomly divided in a 2:1:1 ratio into a training set (600), a validation set (300), and a test set (300), manually annotated as the reference (ground truth) for the diseased areas (Figure 2b). To increase the diversity of the dataset and to avoid overfitting, we used photometric distortion, random contrast, random cropping, flipping, and random rotation operations for data enhancement in both the training and validation sets. These augmentation operations expanded the training set to 2000 images and the validation set to 700. No augmentation was performed on the test set, which was used directly for model evaluation to ensure its authenticity.

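The 2:1:1 random split described above can be sketched as follows. The image count and ratio come from the paper; the random seed and index-based bookkeeping are illustrative assumptions, not details from the original pipeline.

```python
import random

def split_dataset(n_images=1200, seed=0):
    """Randomly split image indices 2:1:1 into train/val/test sets."""
    idx = list(range(n_images))
    random.Random(seed).shuffle(idx)
    # 600 training, 300 validation, 300 test, as in the paper
    return idx[:600], idx[600:900], idx[900:]
```

Augmentation (photometric distortion, flipping, rotation, etc.) would then be applied only to the train and validation subsets, leaving the test subset untouched.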

Model Architecture
To improve the accuracy and time of the lesion segmentation of maize leaf under natural conditions, we propose an image segmentation method based on YOLACT++ [46] with an attention module. First, we introduce a Convolutional Block attention module (CBAM) [42] between the multi-scale output of ResNet-101 [47] and the input of the Feature Pyramid Networks (FPN) [48]. Additionally, we add it to the output of the FPN.
The attention module can obtain the importance of each feature channel through automatic learning and assign different weights to different feature channels so that the network can focus on the most relevant features and improve the segmentation performance of the network. Focusing on the diseased areas of the maize leaves during feature extraction improves the accuracy of network recognition and detection.
The architecture of the proposed method is illustrated in Figure 3. The detailed structure consists of five parts: feature extraction, attention module, FPN architecture, segmentation network, and image post-processing. Attention module 1 and attention module 2 are both CBAM.
(1) Feature extraction.
In this study, a residual network with 101 layers (ResNet-101), which is less computationally intensive and performs better, is applied as the feature extraction network. Furthermore, deformable convolutional networks (DCNs) [49,50] are deployed on the last three ResNet-101 stages (C3 to C5) with an interval of three; i.e., the network replaces the 3 × 3 convolutional layers in the ResNet modules with 3 × 3 deformable convolutional layers at intervals of three convolutional layers. The reason for adding the DCN structure to this network is that YOLACT [51] is a one-shot method without a resampling process, and DCNs can enhance the network's ability to handle different scales, rotations, and aspect ratios. The sampling method of the ResNet network was changed to free-form sampling rather than the rigid grid sampling found in traditional CNNs.
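The interval-three replacement scheme can be made concrete with a small bookkeeping sketch. The stage/kernel descriptors below are a hypothetical representation of the backbone's layer list, not the actual ResNet-101 implementation:

```python
def plan_dcn_replacement(layers, stages=("C3", "C4", "C5"), interval=3):
    """Swap every `interval`-th 3x3 conv in the given stages for a
    deformable conv, mirroring the replacement scheme described above.
    `layers` is a hypothetical list of (stage, kernel) descriptors."""
    out, seen = [], 0
    for stage, kernel in layers:
        if stage in stages and kernel == "3x3":
            seen += 1
            if seen % interval == 0:
                out.append((stage, "3x3-deformable"))
                continue
        out.append((stage, kernel))
    return out
```

In a real implementation, the marked layers would be instantiated as deformable convolutions (e.g., torchvision's `ops.DeformConv2d`) instead of standard convolutions.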
(2) Attention module. The CBAM [42] used in this study is shown in Figure 4. CBAM consists of two sequential sub-modules: the channel attention module (CAM) and the spatial attention module (SAM).

The CAM structure is displayed in Figure 5. It takes the input feature map C_i or P_j (C_i, P_j ∈ R^(H×W×C), i = 3, 4, 5; j = 3, …, 7) through global maximum pooling and global average pooling, respectively, to obtain two feature vectors, where C_i is the input of attention module 1 and P_j is the input of attention module 2 (Figure 3). These are then sent to a shared two-layer neural network (MLP), whose activation function is ReLU. The two MLP outputs are summed element-wise and passed through a sigmoid activation to generate the channel attention map M_cc or M_cp, which is multiplied with the input feature map C_i or P_j to generate the feature map F_c or F_p. Here, M_cc and F_c are the results of attention module 1, and M_cp and F_p are the results of attention module 2, respectively.
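The CAM computation above can be sketched in NumPy. The MLP weights here are random stand-ins for the learned parameters; only the pooling/MLP/sigmoid/reweighting structure is the point:

```python
import numpy as np

def channel_attention(x, reduction=4, seed=0):
    """CAM sketch: global max/avg pooling -> shared 2-layer MLP (ReLU) ->
    element-wise sum -> sigmoid -> channel-wise reweighting of the input.
    x has shape (H, W, C); the MLP weights are random placeholders."""
    H, W, C = x.shape
    rng = np.random.default_rng(seed)
    hidden = max(C // reduction, 1)
    W1 = rng.standard_normal((C, hidden)) * 0.1
    W2 = rng.standard_normal((hidden, C)) * 0.1
    mlp = lambda v: np.maximum(v @ W1, 0.0) @ W2   # shared MLP, ReLU hidden layer
    avg_pool = x.mean(axis=(0, 1))                 # (C,) global average pooling
    max_pool = x.max(axis=(0, 1))                  # (C,) global maximum pooling
    m = 1.0 / (1.0 + np.exp(-(mlp(avg_pool) + mlp(max_pool))))  # M_cc / M_cp
    return x * m                                   # F_c / F_p
```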

The structure of the SAM is shown in Figure 6, taking F_c or F_p as the input feature map of this module. First, two H × W × 1 feature maps are obtained through channel-wise global maximum pooling and global average pooling. The two feature maps are then concatenated along the channel dimension and passed through a 7 × 7 convolutional layer, reducing the result to one channel, i.e., H × W × 1. Next, the spatial attention map M_sc or M_sp (M_sc, M_sp ∈ R^(H×W×1)) is generated through the sigmoid activation function, where M_sc and M_sp are the outputs of the SAM in attention module 1 and attention module 2, respectively.
Figure 5. Channel attention module.
Finally, M_sc or M_sp is multiplied with the input feature map F_c or F_p of the module to obtain the final refined feature, i.e., the C_i or P_j passed to the next stage.
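The SAM computation can likewise be sketched in NumPy. The 7 × 7 convolution weights are random stand-ins for the learned parameters, and the convolution is written as a naive loop for clarity:

```python
import numpy as np

def spatial_attention(f, kernel=7, seed=0):
    """SAM sketch: channel-wise max/avg pooling -> concat (2 maps) ->
    7x7 conv to 1 channel -> sigmoid -> spatial reweighting of the input.
    f has shape (H, W, C); the conv weights are random placeholders."""
    H, W, C = f.shape
    pooled = np.stack([f.mean(axis=2), f.max(axis=2)])        # (2, H, W)
    pad = kernel // 2
    w = np.random.default_rng(seed).standard_normal((2, kernel, kernel)) * 0.05
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    logits = np.empty((H, W))
    for i in range(H):                                        # naive 7x7 convolution
        for j in range(W):
            logits[i, j] = (padded[:, i:i + kernel, j:j + kernel] * w).sum()
    m = 1.0 / (1.0 + np.exp(-logits))                         # M_sc / M_sp
    return f * m[:, :, None]
```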
(3) FPN architecture. The feature maps P_3–P_7 in the FPN structure were obtained as follows. The convolutional layer C_5 was passed through one convolutional layer to obtain feature map P_5. Bilinear interpolation was then used to double the size of P_5, and the convolution of C_4 was added to obtain feature map P_4; the same method was used to obtain feature map P_3. Finally, feature maps P_5 and P_6 were convolved and down-sampled to obtain feature maps P_6 and P_7. The feature maps obtained from the FPN network were fed into the CBAM, and the resulting P_j (j = 3, …, 7) were used as the input of the segmentation network.
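The top-down pathway above can be sketched with array operations. As a simplification, nearest-neighbour upsampling stands in for bilinear interpolation, stride-2 subsampling stands in for the convolutions producing P_6 and P_7, and the 1 × 1 lateral convolutions are assumed to have been applied to C_3–C_5 already:

```python
import numpy as np

def upsample2x(x):
    # nearest-neighbour stand-in for the bilinear interpolation in the paper
    return x.repeat(2, axis=0).repeat(2, axis=1)

def build_pyramid(c3, c4, c5):
    """Top-down FPN sketch: C5 -> P5; upsample-and-add gives P4 and P3;
    stride-2 subsampling of P5 and P6 gives P6 and P7."""
    p5 = c5
    p4 = upsample2x(p5) + c4
    p3 = upsample2x(p4) + c3
    p6 = p5[::2, ::2]
    p7 = p6[::2, ::2]
    return p3, p4, p5, p6, p7
```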
(4) Segmentation network. The Protonet (Figure 3) was used to generate k prototype masks of the same size as the original image by means of a fully convolutional network (FCN) [52]. It takes the feature map P_3 as input, and the dimensions of the output are 138 × 138 × k; that is, k prototype masks are obtained, each of size 138 × 138.
The Prediction Head structure (Figure 3) uses a shared convolutional network to improve the segmentation speed. It takes the five feature maps P_3–P_7 from the feature extraction network as input, and completes the three tasks of target classification prediction, bounding box prediction, and mask coefficient prediction. The fast non-maximum suppression (NMS) algorithm then retains the mask coefficients with the highest confidence. The outputs of the Protonet branch and the Prediction Head branch are combined into the mask by matrix multiplication and the sigmoid function, as shown in Equation (1).
M = sigmoid(P C^T)    (1)
where P is the prototype mask matrix, and C is an n × k matrix of mask coefficients.
(5) Image post-processing. Image post-processing mainly includes cropping, fast mask re-scoring, and thresholding. First, the final masks are cropped with the predicted bounding box; i.e., the pixels outside the box region are zeroed out.
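Equation (1) amounts to one linear combination per instance followed by a sigmoid, which can be written directly in NumPy:

```python
import numpy as np

def assemble_masks(P, C):
    """Equation (1): M = sigmoid(P C^T).
    P: (138, 138, k) prototype masks; C: (n, k) mask coefficients.
    Returns n instance masks of shape (138, 138, n)."""
    logits = np.tensordot(P, C, axes=([2], [1]))  # contract over k -> (138, 138, n)
    return 1.0 / (1.0 + np.exp(-logits))          # sigmoid
```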
Second, the fast mask re-scoring branch is composed of six convolutional layers and one global average pooling layer, as shown in Figure 7. Its function is to re-score the mask based on the Intersection over Union (IoU) between the predicted mask and the original leaf mask. The specific steps are as follows: (I) the cropped mask of the leaf disease image, of size 138 × 138 × 1, is taken as input, and the IoU with the original leaf mask of the corresponding category is output; (II) the IoU of the mask corresponding to the category predicted by the classification branch is multiplied by the corresponding category confidence to give the final score of the mask.
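The re-scoring rule in step (II) is a simple product. In the sketch below, a direct mask IoU stands in for the IoU predicted by the re-scoring network, purely for illustration:

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def rescore(predicted_iou, class_confidence):
    """Step (II): final mask score = predicted IoU for the classified
    category times the category confidence."""
    return predicted_iou * class_confidence
```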
Finally, thresholding is applied to the re-scored mask to obtain the final segmented image.


Loss Function
The loss function used in this study includes three parts: the classification loss L_cls, the bounding box loss L_box, and the mask-generation loss L_mask. The total loss L of the network is shown in Equation (2):
L = α L_cls + β L_box + γ L_mask    (2)
where the mask-generation loss L_mask is defined as the pixel-wise binary cross-entropy between the assembled masks and the ground-truth masks. The classification loss L_cls and the bounding box loss L_box are the softmax cross-entropy loss and the smooth-L1 loss, respectively, where i is the index number of the anchor, p_i is the predicted probability of the target, and p*_i is the ground-truth probability; α, β, and γ are the weights of each loss.
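The three-term loss in Equation (2) can be sketched as follows. The cross-entropy, smooth-L1, and binary cross-entropy forms are the standard YOLACT choices; treat the exact reductions (means over pixels/coordinates) as simplifying assumptions:

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Pixel-wise binary cross-entropy for the mask loss L_mask."""
    p = np.clip(pred, eps, 1 - eps)
    return float(-(target * np.log(p) + (1 - target) * np.log(1 - p)).mean())

def smooth_l1(pred, target):
    """Smooth-L1 for the bounding box loss L_box."""
    d = np.abs(pred - target)
    return float(np.where(d < 1, 0.5 * d ** 2, d - 0.5).mean())

def softmax_ce(logits, label):
    """Softmax cross-entropy for the classification loss L_cls."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[label])

def total_loss(cls_logits, label, box_pred, box_gt, mask_pred, mask_gt,
               alpha=1.0, beta=1.5, gamma=6.125):
    """Equation (2): L = alpha*L_cls + beta*L_box + gamma*L_mask."""
    return (alpha * softmax_ce(cls_logits, label)
            + beta * smooth_l1(box_pred, box_gt)
            + gamma * bce(mask_pred, mask_gt))
```

The default weights 1, 1.5, and 6.125 are the values the paper reports for α, β, and γ.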

Experimental Setup
The network training and testing hardware environment was an Intel(R) Core i7-9700K 3.60 GHz processor with 16 GB of RAM. In this study, the feature extraction network adopted transfer learning to fine-tune parameters pre-trained on the ImageNet classification model [53]. The training parameters were set as follows: eight images per batch were trained using stochastic gradient descent with a momentum factor of 0.9 [54-56]. The initial learning rate was set to 10^-3, and the maximum number of iterations was 400,000. The learning rate decreased by a factor of ten at 180,000, 220,000, and 350,000 iterations, respectively. The weight decay parameter was 5 × 10^-4.
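The step-decay schedule above can be expressed as a small function; the base rate and milestones are the values reported in the text:

```python
def learning_rate(step, base_lr=1e-3, milestones=(180_000, 220_000, 350_000)):
    """Step decay: the learning rate drops by a factor of ten at each
    milestone iteration, starting from base_lr."""
    lr = base_lr
    for m in milestones:
        if step >= m:
            lr *= 0.1
    return lr
```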
The weights α, β, and γ of the loss function in Equation (2) were set to 1, 1.5, and 6.125, respectively. The classification loss L_cls, the bounding box loss L_box, the mask loss L_mask, and the overall loss L of the network for different numbers of iterations are shown in Figure 8. It can be seen from the figure that after 400,000 iterations, the losses of the network started to converge and gradually stabilized. In the experiment, in addition to a qualitative assessment consisting of visually comparing segmentation results, several indices were calculated to quantitatively evaluate the performance of disease lesion segmentation with different methods: the comprehensive evaluation index F1 score and the mean Intersection over Union (mIoU). The F1 score reflects the overall segmentation accuracy of the lesions; the larger the score, the more stable the model. The F1 score is given by:
F1 = 2PR / (P + R)
where P and R represent precision and recall, respectively.
The mIoU is the ratio of intersection and union of two sets of ground truth and prediction of leaf lesion area. The larger the mIoU value is, the better the segmentation effect will be. The mIoU is defined as:

mIoU = TP / (TP + FP + FN)    (8)
In addition, the segmentation time for each leaf image in the test set was calculated, and the average time was used as the performance index to evaluate the real-time performance of the model.
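The evaluation indices defined above follow directly from the pixel counts TP, FP, and FN:

```python
def segmentation_metrics(tp, fp, fn):
    """Precision, recall, F1 score, and mIoU from true-positive,
    false-positive, and false-negative pixel counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    miou = tp / (tp + fp + fn)
    return precision, recall, f1, miou
```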

Results and Discussion

Results
The IoU threshold represents the degree of overlap between the true and predicted values. In the experiment, only predictions with IoU > 0.5 were considered correct. The segmentation precision at different IoU thresholds is shown in Figure 9. When the IoU threshold was 0.5, the segmentation precision was 99.06%, indicating that the model performs well in the loose IoU threshold range. The mean precision (mP) of the maize leaf lesion images in the test set was 86.2% over the IoU threshold range of 0.5 to 0.95. In this paper, 0.7 was chosen as the IoU threshold for maize leaf segmentation. With 400,000 iterations, the training time of the proposed method was approximately four to five days. To improve the robustness of the model, 10-fold cross-validation was used in this study, and the results of the cross-validation experiments were averaged.
The Attention YOLACT++ network correctly segmented 296 out of 300 maize NLB images, with a precision of 98.71%, recall of 98.02%, and mIoU of 84.91%. Table 1 shows the precision, recall, F1 score, mIoU, and total segmentation time of the network for maize leaf segmentation. Figure 10 shows the segmentation results of maize leaf lesions under different influencing factors, such as weeds, light intensity, soil, and mutual covering of leaves. The proposed method provided satisfying segmentation results with clear edges. However, due to the influence of soil and other factors, over-segmentation occurred in some maize leaf lesion segmentations, as shown in Figure 10d.

Prediction Results Comparison
To further validate the performance of the proposed method, we compared it with the state-of-the-art instance segmentation models Mask R-CNN [57] and YOLACT++ [46]. As observed from Table 2, the proposed method achieved better segmentation performance than the Mask R-CNN and YOLACT++ methods on the quantitative indices. The segmentation precision of the proposed method was about 15.14% and 1.27% higher than that of the Mask R-CNN and YOLACT++ models, respectively. The mIoU of the proposed method reached 84.91%, which was 11.91% and 6.26% higher than the Mask R-CNN and YOLACT++ models, respectively. The main reason for this was that the proposed method added an attention module, which could accurately extract the features of a lesion. The segmentation time on the test set was obtained by averaging the prediction time over all images. The prediction time of our model was slightly longer than YOLACT++ and shorter than Mask R-CNN, but the segmentation mIoU achieved by our model was the highest. This suggests that our model is well-suited for real-time NLB lesion segmentation.

Figure 11 shows a comparison of the mask quality of Mask R-CNN, YOLACT++, and the proposed method for the segmentation of maize leaf disease. The method proposed in this paper provided better segmentation results when visually compared with the other methods. The Mask R-CNN segmentation model was affected by uneven illumination and the complex background; it could segment the approximate area of the lesions, but the segmentation was still inaccurate. In particular, Mask R-CNN's segmentation was inaccurate because its feature extraction of maize lesions was not precise enough, which led to poor detection and segmentation of lesion edge regions, as seen in Figure 11a.
The YOLACT++ segmentation model improved both segmentation accuracy and speed. However, when maize leaves carried many lesion targets with blurred boundaries in close proximity, missed detections and misjudgments easily occurred, and the model could not detect all targets on the maize leaves, as shown in Figure 11b.
The segmentation method proposed in this study introduces a convolutional attention module that accurately extracts maize lesion features, especially lesion edge features. It can quickly and accurately detect and segment the regions where lesions are located, with fast segmentation speed and high segmentation accuracy. Compared with YOLACT++, although the segmentation time per image increased slightly, the segmentation precision of maize lesions improved to a certain extent, as shown in Figure 11c.


Conclusions
This work proposed an image segmentation method based on YOLACT++ with an attention module for segmenting disease lesions on maize leaves. Because feature extraction in the baseline network is undirected and uncertain, we introduced CBAM into the YOLACT++ network to improve its segmentation performance. While improving segmentation accuracy, the attention-based feature network pays more attention to the diseased parts of maize leaves, making the detection and identification of lesion edges more accurate.
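For readers unfamiliar with CBAM, the following NumPy sketch illustrates its two sequential sub-modules, channel attention followed by spatial attention, applied to a feature map of shape (C, H, W). This is only an illustrative sketch: the randomly initialized weights stand in for learned parameters, and the naive 7×7 convolution loop is an assumption for clarity, not the implementation used in this work.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """x: (C, H, W). A shared MLP (w1, w2) scores avg- and max-pooled descriptors."""
    avg = x.mean(axis=(1, 2))                       # (C,) global average pooling
    mx = x.max(axis=(1, 2))                         # (C,) global max pooling
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)    # ReLU hidden layer
    scale = sigmoid(mlp(avg) + mlp(mx))             # (C,) per-channel weights
    return x * scale[:, None, None]

def spatial_attention(x, k):
    """x: (C, H, W); k: (2, 7, 7) kernel over stacked [avg; max] channel maps."""
    avg = x.mean(axis=0)                            # (H, W) channel-wise average
    mx = x.max(axis=0)                              # (H, W) channel-wise max
    stacked = np.stack([avg, mx])                   # (2, H, W)
    padded = np.pad(stacked, ((0, 0), (3, 3), (3, 3)))
    H, W = avg.shape
    out = np.zeros((H, W))
    for i in range(H):                              # naive 7x7 convolution
        for j in range(W):
            out[i, j] = (padded[:, i:i + 7, j:j + 7] * k).sum()
    return x * sigmoid(out)[None, :, :]             # per-pixel weights

def cbam(x, w1, w2, k):
    """Channel attention followed by spatial attention, as in CBAM."""
    return spatial_attention(channel_attention(x, w1, w2), k)
```

In the proposed method this kind of module is attached to the outputs of the ResNet-101 backbone and the FPN, so the network reweights features toward lesion regions before the segmentation heads.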
To address the problem that current maize leaf blight detection and segmentation models are susceptible to interference from shadows, occlusion, and light intensity, we applied the proposed model to maize NLB segmentation. The comparative experiments demonstrate that introducing an attention mechanism allows better detection and segmentation of lesion edges, improving the accuracy of disease segmentation and outperforming current instance segmentation models such as Mask R-CNN and YOLACT++. The proposed method adapts to complex natural environments and lays the foundation for subsequent quantitative assessment of disease severity.
However, northern leaf blight is only one of the important fungal diseases of maize. It would be interesting to apply the proposed method to other maize diseases and to more types of plants and diseases. Furthermore, introducing more accurate and lightweight modules into the proposed approach, according to the disease type, would help further improve the segmentation efficiency of agricultural mobile equipment used in fields.

Data Availability Statement: The data presented in this study are available within the article.