Article
Peer-Review Record

YOLOv7-GCA: A Lightweight and High-Performance Model for Pepper Disease Detection

Agronomy 2024, 14(3), 618; https://doi.org/10.3390/agronomy14030618
by Xuejun Yue 1,†, Haifeng Li 1,†, Qingkui Song 1, Fanguo Zeng 1, Jianyu Zheng 1, Ziyu Ding 1, Gaobi Kang 1, Yulin Cai 1, Yongda Lin 2, Xiaowan Xu 3,4 and Chaoran Yu 3,4,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 26 February 2024 / Revised: 8 March 2024 / Accepted: 12 March 2024 / Published: 19 March 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The submitted paper presents “YOLOv7-GCA: A Lightweight and High-Performance Model for Pepper Disease Detection.” Although the paper might have some novelties, some points need clarification:

- The authors should provide a more detailed explanation of the methodology, particularly regarding the implementation of the GhostNetV2, CFNet, and CBAM modules and how these modules are integrated into the YOLOv7 architecture.

- The authors should discuss the reason behind each module's inclusion and its impact on the model's performance in the manuscript.

- The manuscript mentions challenges in the dataset, such as inter-crop occlusion and complex backgrounds. A detailed discussion is required on how these challenges impact the model's performance and the strategies employed to mitigate them during training.

- The authors should explain why YOLOv7-GCA outperforms other models in specific scenarios or conditions in the manuscript.

- The manuscript mentions early training fluctuations in the GhostNetV2 module and the YOLOv7-GCA model. The authors should discuss the reasons behind these fluctuations and how they are addressed in subsequent training epochs, so that the model's convergence behavior can be understood.

- The manuscript should include a sensitivity analysis assessing how changes in key parameters (e.g., learning rate, batch size) affect the model's performance, to demonstrate the robustness of the proposed YOLOv7-GCA model under different training conditions.

Author Response

Dear reviewer:

Thank you for your review of and feedback on our paper. We are glad that you are interested in our work and recognize our contribution. We have done our best to answer your questions and provide more detail in the responses and revisions below. Your questions are marked in red, and our supplementary explanations are in blue.

 

Question 1: 

The authors should provide a more detailed explanation of the methodology, particularly regarding implementing the GhostNetV2, CFNet, and CBAM modules and how these modules are integrated into the YOLOv7 architecture.

Author's Notes to Reviewer:

Both GhostNetV2 and ELAN are lightweight feature-extraction networks that can be used in object-detection models such as YOLOv7. The main difference is that GhostNetV2 introduces a new attention mechanism, DFC attention, which effectively captures long-range pixel dependencies and thus improves feature expressiveness and detection performance. The ELAN module in the original YOLOv7 feature-extraction network is based on conventional convolution, which captures only local information and is easily affected by redundant features. Its working principle is described in Section 2.2.2 of the article.
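To make this concrete, here is a minimal PyTorch sketch of DFC-style attention, assuming the half-resolution gating and decoupled horizontal/vertical depthwise convolutions described in the GhostNetV2 paper [1]; the module and parameter names are ours, not those of the original implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DFCAttention(nn.Module):
    """Sketch of decoupled fully connected (DFC) attention: a cheap
    long-range gate built from horizontal and vertical depthwise
    convolutions computed at half resolution, then upsampled."""
    def __init__(self, channels, kernel=5):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AvgPool2d(kernel_size=2, stride=2),   # attend at half resolution
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            # horizontal FC: each row attends along the width
            nn.Conv2d(channels, channels, (1, kernel), padding=(0, kernel // 2),
                      groups=channels, bias=False),
            # vertical FC: each column attends along the height
            nn.Conv2d(channels, channels, (kernel, 1), padding=(kernel // 2, 0),
                      groups=channels, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        attn = torch.sigmoid(self.gate(x))
        attn = F.interpolate(attn, size=x.shape[-2:], mode="nearest")
        return x * attn
```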

According to [1], the classification accuracy of GhostNetV2 on ImageNet reaches 75.3%, compared with only 74.5% for ELAN. GhostNetV2 also outperforms ELAN on the COCO object-detection task; specific results can be found in [2]. GhostNetV2 is therefore a better feature-extraction network than ELAN and is more suitable for mobile application scenarios.

The traditional feature pyramid network (FPN) and its variants are widely used in multi-scale feature extraction and fusion models. These models typically pair a heavy classification backbone, used to extract multi-scale features, with lightweight fusion modules used to fuse these features. To fuse multi-scale features, the features from adjacent layers are first integrated by element-wise addition, and the summed features are then transformed by a single 3×3 convolution. We refer to these two steps as feature integration and feature transformation; together they constitute feature fusion. Given how few parameters this paradigm allocates to feature fusion compared with the classification backbone, we believe it may be insufficient to fuse multi-scale features, and that better performance is achieved only by assigning a larger proportion of parameters to feature fusion. To solve this problem, the paper adopts a novel cascade fusion network (CFNet) architecture for dense prediction and conducts a series of studies and validations of this network. As shown in Figure 6, we computed and compared different conversion blocks and focal blocks for the CFNet model; different block combinations yield different structural variants, and the combination used in the article is the best we found through layer-count calculation. As mentioned in Section 2.2.4, to keep the parameter count and computation low while improving the nonlinear fit, we chose the ResNet bottleneck as the block in CFNet. These improvements are reflected in the ablation experiments. CFNet is a feature fusion network that cascades and fuses features at different levels, enhancing their expressiveness and robustness; its working principle is described in Section 2.2.4. We therefore used CFNet as the feature fusion network of YOLOv7, replacing the original ELAN-W network.
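As an illustration of the paradigm described above, the following is a minimal PyTorch sketch of one cascade-style fusion step, assuming adjacent levels share a channel count: feature integration by element-wise addition, followed by feature transformation through a ResNet bottleneck rather than a single 3×3 convolution. It is a simplification, not the CFNet reference implementation [3].

```python
import torch.nn as nn
import torch.nn.functional as F

class BottleneckFusion(nn.Module):
    """Sketch of one fusion step: integrate adjacent-level features by
    element-wise addition (feature integration), then transform them with
    a residual bottleneck (feature transformation), so that more
    parameters serve feature fusion than a single 3x3 conv would."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.bottleneck = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, fine, coarse):
        # feature integration: upsample the coarse map and add it to the fine map
        fused = fine + F.interpolate(coarse, size=fine.shape[-2:], mode="nearest")
        # feature transformation: residual bottleneck block
        return F.relu(fused + self.bottleneck(fused))
```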

Because of the lightweight processing applied to the YOLOv7 model, it needs to capture key information and improve its feature-extraction ability to cope with complex environments during object detection. We therefore inserted the spatial-and-channel attention mechanism CBAM into the three effective feature layers output by the backbone of the improved YOLOv7 model. CBAM enhances feature expressiveness in the channel and spatial dimensions respectively, enabling the network to selectively focus on important features, and it improves the network's feature-extraction efficiency without adding much computational overhead: by applying attention to the channel and spatial dimensions separately, it highlights key features and suppresses irrelevant ones. Its working principle is described in Section 2.2.3. We inserted CBAM as the feature-enhancement module of YOLOv7 into the last convolution layer of each output layer, as shown in the original Figure 8, trained it on the pepper disease dataset, and optimized it together with GhostNetV2 and CFNet.
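For reference, below is a minimal sketch of CBAM as commonly implemented: channel attention (a shared MLP over average- and max-pooled descriptors) followed by spatial attention (a 7×7 convolution over pooled channel maps). The reduction ratio and kernel size follow the original CBAM paper and may differ from the configuration used in our manuscript.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # shared MLP applied to both pooled descriptors
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))   # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))    # global max pooling
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)   # channel-wise average map
        mx = x.amax(dim=1, keepdim=True)    # channel-wise max map
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)    # channel attention first
        return x * self.sa(x) # then spatial attention
```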

Accordingly, at the beginning of Sections 2.2.2, 2.2.3, and 2.2.4, we now explain why each of these three modules is integrated into the YOLOv7 architecture.

 

Question 2: 

Authors should discuss the reason behind each module's inclusion and its impact on the model's performance in the manuscript.

Author's Notes to Reviewer:

We chose GhostNetV2 as the feature-extraction network because we applied lightweight processing to the YOLOv7 model for deployment. GhostNetV2 is a lightweight convolutional neural network, an iteration of GhostNet, that uses the Ghost module to reduce computation and parameters while introducing a hardware-friendly attention mechanism, DFC attention, to enhance feature expressiveness. Compared with the ELAN network in the original YOLOv7 model, GhostNetV2 greatly improves detection speed and reduces model size without excessive loss of accuracy, meeting the deployment requirements of mobile devices. As shown in our ablation experiments, GhostNetV2 reduces the model's parameter count by 18.8%, increases detection speed by 35 frames/s, and reduces model size by 13.1 MB compared with ELAN.
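A minimal sketch of the Ghost idea follows, assuming a ratio of 2 so that half the output channels come from a small primary convolution and the other half from a cheap depthwise operation; it illustrates the mechanism rather than reproducing GhostNetV2's exact blocks.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Sketch of the Ghost module: a cheap depthwise branch doubles the
    intrinsic features instead of computing all channels with full convs."""
    def __init__(self, in_ch, out_ch, dw_size=3):
        super().__init__()
        init_ch = out_ch // 2  # intrinsic channels (ratio = 2, out_ch assumed even)
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, 1, bias=False),
            nn.BatchNorm2d(init_ch),
            nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(  # depthwise "cheap operation"
            nn.Conv2d(init_ch, init_ch, dw_size, padding=dw_size // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(init_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)  # intrinsic + ghost maps
```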

We chose CFNet as the feature fusion network because, having made the backbone lightweight, we needed a technique that preserves the lightweight design while improving the network's recognition accuracy. CFNet inserts feature-integration operations into the backbone so that more parameters can be used for feature fusion, increasing its richness and robustness. As summarized in reference [3], CFNet uses lightweight conversion blocks and focal blocks to integrate and enhance features at different scales, improving feature expressiveness and detection performance. The CFNet architecture is simple, easy to implement, and can readily benefit from large-scale pre-trained weights, reducing the need for training data. CFNet achieves excellent results on multiple datasets and tasks, surpassing state-of-the-art methods such as ConvNeXt and Swin Transformer. Compared with the feature fusion network in the original YOLOv7 model, CFNet realizes multi-scale feature extraction and fusion from small to large scales, improving the model's expressiveness against complex backgrounds and fitting the pepper disease detection task. In our experiments, CFNet improved the mAP of YOLOv7 by 0.8%, reduced the parameter count by 15.5%, and increased detection speed by 22 frames/s.

We chose CBAM as the attention module because it is a convolutional block attention module that uses two sub-modules to compute attention over the channel and spatial dimensions of the feature map, outputting an attention weight matrix used to weight the feature map. CBAM improves the network's feature-extraction efficiency without adding excessive computational overhead. Compared with the original YOLOv7 model, which uses no attention mechanism, CBAM lets the model focus on important features in the image and improves its accuracy and robustness. In our experiments, CBAM improved the model's mAP by 0.4% compared with using no attention mechanism, with no significant increase in the parameter count and an increase in detection speed of 20 frames/s.

In summary, the criterion for selecting an improvement module was to improve accuracy while keeping the model lightweight. Each module need not improve detection speed and reduce parameters on its own; as long as the parameter count does not increase and inference speed does not drop, the requirement is met. At the same time, in pre-experiments we found that although the original YOLOv7 model can identify pepper diseases, it is prone to missed detections under occlusion and with multi-scale targets. Our improvements therefore focused on these two aspects, and the bottleneck-block computation of the CFNet module was optimized to obtain the final experimental results.

 

Question 3: The manuscript mentions challenges in the dataset, such as inter-crop occlusion and complex backgrounds. A detailed discussion is required on how these challenges impact the model's performance and the strategies employed to mitigate them during training.

Author's Notes to Reviewer:

The challenges in the dataset mainly involve two aspects: (1) occlusion between crops: multiple pepper plants block one another, leaving the shape and color of some peppers incomplete or unclear and making recognition harder; (2) complex backgrounds: various interfering factors around the pepper plants, such as soil, water droplets, weeds, insects, and shadows, reduce the model's discriminative ability.

The impact of these challenges on model performance manifests mainly in the following respects: (1) reduced detection accuracy: the model may miss or falsely detect some peppers or pepper diseases, affecting its mAP; (2) reduced detection speed: the model may spend more time processing complex images, affecting its FPS; (3) reduced generalization ability: the model may perform poorly on new or different images, affecting its robustness.

We adopted the following mitigation strategies during training: (1) data augmentation: applying a series of transformations to the collected dataset, such as rotation, scaling, cropping, flipping, and color transformation, to increase data diversity and difficulty and thus improve generalization (an illustrative augmentation sketch is given after Figure 1); (2) model improvement: introducing the GhostNetV2, CFNet, and CBAM modules on top of YOLOv7 to strengthen feature extraction, feature fusion, and feature attention, and thereby improve detection accuracy and speed; (3) paired-sample training: training each occluded image together with a corresponding unoccluded image to optimize training and improve the model's stability and convergence. As shown in Figure 1, we take a separate, unobstructed picture of each occluded object.

 

Figure 1. Combined training with paired occluded and unoccluded pictures.
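For illustration, here is a minimal torchvision-style sketch of the augmentation pipeline described in strategy (1); the specific parameter values are placeholders, and a real detection pipeline must apply the same geometric transforms to the bounding boxes (e.g., via torchvision.transforms.v2 or Albumentations).

```python
from torchvision import transforms

# Classification-style illustration of the augmentations listed above;
# values are assumptions, not the exact settings used in the manuscript.
train_augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                  # rotation
    transforms.RandomResizedCrop(640, scale=(0.8, 1.0)),    # zoom / crop
    transforms.RandomHorizontalFlip(p=0.5),                 # flip
    transforms.ColorJitter(brightness=0.3, contrast=0.3,
                           saturation=0.3, hue=0.05),       # color transformation
    transforms.ToTensor(),
])
```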

We have also considered using image inpainting techniques in subsequent experiments, such as generative adversarial networks (GANs) or other generative models, to recover the content of occluded regions and keep it consistent with the surrounding image. Many inpainting methods exist, such as contextual attention, co-modulation, and local-and-global consistency.

 

Question 4: Authors should explain why YOLOv7-GCA outperforms other models in specific scenarios or conditions in the manuscript.

Author's Notes to Reviewer:

Thank you for your advice and reminders. We have added the following sentences to Section 3.3 (Comparison) of the paper:

In conclusion, compared with other models, the YOLOv7-GCA model offers better detection performance and discriminative ability, and it better handles occlusion between peppers and the identification of different disease spots against various backgrounds. When identifying pepper diseases in complex environments, it is lightweight and fast, meeting the real-time needs of the agricultural field.

 

Question 5: The manuscript mentions early training fluctuations in the GhostNetV2 module and the YOLOv7-GCA model. The authors should discuss the reasons behind these fluctuations and how they are addressed in subsequent training epochs from which we can understand the model's convergence behavior.

Author's Notes to Reviewer:

We apologize for not expressing this clearly. The larger fluctuation mentioned in the original text refers to the original YOLOv7 model; this has been corrected in Section 3.3.2. Our point is that even though the original YOLOv7 model fluctuated more during early training, it finally converged at about the 82nd epoch, compared with about the 80th epoch for the improved model. The convergence rate is essentially the same before and after the improvement, so the gain in mAP does not come at the expense of convergence speed.

We also consulted the data and analyzed the reasons for the fluctuations in the training of the original YOLOv7 model [4,5]:

  1. The original YOLOv7 model uses a variety of training techniques, such as label smoothing, adaptive learning rates, random cropping, and multi-scale training. These improve generalization and robustness but may also cause instability and oscillation during training.
  2. The original YOLOv7 model uses SPPCSPDarknet50 as its backbone, a deep network structure based on cross-stage partial (CSP) connections. It effectively reduces feature-map redundancy and improves feature-extraction efficiency, but it may also increase the model's nonlinearity and complexity, affecting training stability.

Therefore, in the subsequent lightweight improvements we also tried to reduce the model's complexity, so that the improved model achieves better performance and expressiveness.

 

Question 6: The manuscript should include a sensitivity analysis to assess how changes in key parameters (e.g., learning rate, batch size) might affect the model's performance proving the robustness of the proposed YOLOv7-GCA model under different training conditions.

Author's Notes to Reviewer:

We agree with your suggestion that the paper should include a sensitivity analysis to assess how changes in key parameters affect the model's performance, thereby demonstrating the robustness of our YOLOv7-GCA model under different training conditions. In response to your comments, we have made the following revisions:

Sensitivity analysis is a method for evaluating the effect of changes in a model's inputs on its outputs; it reveals the model's sensitivity to different input features and thereby assesses its stability and reliability. Sensitivity analysis plays an important role in deep learning: it helps developers and users understand the internal mechanism of a model, optimize its parameter settings, improve its performance and generalization, and discover potential defects and risks.

To this end, we performed this experiment and added Section 3.5 (Sensitivity Analysis) to Section 3 (Results and Discussion) of the paper, as follows:

Our YOLOv7-GCA model is based on improvements to YOLOv7 and introduces three key changes: GhostNetV2, CFNet, and CBAM. These changes bring advantages to the model but also make its performance dependent on certain parameters. To evaluate this, we selected three key parameters for sensitivity analysis: learning rate, batch size, and optimizer. The learning rate controls the model's learning speed, determining how much the weights are updated in each iteration, and thus affects convergence rate and accuracy. The batch size controls how much data the model processes at a time, determining the computation and memory footprint of each iteration, and thus affects training speed and stability. The optimizer setting controls the optimization algorithm, determining factors such as momentum, decay, and adaptivity, and thus affects convergence and robustness [6,7].
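The sweep itself can be organized as a one-factor-at-a-time loop, sketched below; `train_and_evaluate` is a hypothetical helper standing in for our actual training pipeline, and the grids match Table 1.

```python
def train_and_evaluate(lr, batch_size, optimizer):
    """Hypothetical stand-in for the real YOLOv7-GCA training run;
    here it only returns placeholder metrics."""
    return {"loss": 0.0, "map50": 0.0, "fps": 0.0}

base = {"lr": 0.01, "batch_size": 16, "optimizer": "SGD"}
grid = {
    "lr": [0.001, 0.01, 0.1],
    "batch_size": [16, 32, 64],
    "optimizer": ["Adam", "SGD", "RMSprop"],
}

rows = []
for name, values in grid.items():
    for value in values:
        cfg = {**base, name: value}          # vary one parameter, hold the rest
        metrics = train_and_evaluate(**cfg)
        rows.append((name, value,
                     metrics["loss"], metrics["map50"], metrics["fps"]))
```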

 

Table 1. Sensitivity analysis summary.

| Parameter     | Value   | Loss function | mAP@0.5 (%) | FPS (frames/s) |
|---------------|---------|---------------|-------------|----------------|
| Learning rate | 0.001   | 0.03705       | 95.5        | 301            |
| Learning rate | 0.01    | 0.03635       | 96.8        | 303            |
| Learning rate | 0.1     | 0.04175       | 93.4        | 305            |
| Batch size    | 16      | 0.03635       | 96.8        | 303            |
| Batch size    | 32      | 0.03685       | 96.4        | 313            |
| Batch size    | 64      | 0.03715       | 96.1        | 323            |
| Optimizer     | Adam    | 0.03675       | 96.2        | 303            |
| Optimizer     | SGD     | 0.03635       | 96.8        | 303            |
| Optimizer     | RMSprop | 0.03655       | 96.3        | 303            |

 

The performance of the YOLOv7-GCA model varies under different parameter values, but the magnitude and direction of the changes differ. The learning rate has the greatest influence: when it is too large or too small, the loss rises and the accuracy falls, while at a moderate learning rate the model reaches its best loss and accuracy, 0.03635 and 96.8%, respectively. Batch size has little influence: as it increases, the loss and accuracy degrade slightly while the speed rises slightly. The optimizer also has little influence, with no obvious differences in loss, accuracy, or speed across optimizers.

Through the sensitivity analysis we reached the following conclusions. First, YOLOv7-GCA achieves good performance across different parameter values, demonstrating its robustness and adaptability. Second, it performs best with a learning rate of 0.01, a batch size of 16, and the SGD optimizer, indicating that these are the optimal settings for the model. Finally, its advantage over other models is a balanced performance in speed, model size, and accuracy, illustrating its effectiveness and superiority.

Kind regards,

Authors

 

References

  1. Tang, Y.; Han, K.; Guo, J.; Xu, C.; Xu, C.; Wang, Y. GhostNetV2: Enhance Cheap Operation with Long-Range Attention. Adv. Neural Inf. Process. Syst. 2022, 35, 9969–9982.
  2. Dou, X.; Wang, T.; Shao, S. A Lightweight YOLOv5 Model Integrating GhostNet and Attention Mechanism. In Proceedings of the 2023 4th International Conference on Computer Vision, Image and Deep Learning (CVIDL), Zhuhai, China, 2023; IEEE.
  3. Zhang, G.; Li, Z.; Li, J.; Hu, X. CFNet: Cascade Fusion Network for Dense Prediction. arXiv 2023.
  4. Wang, C.; Bochkovskiy, A.; Liao, H. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. arXiv 2022.
  5. Xie, X.; Wu, D.; Xie, M.; et al. GhostFormer: Efficiently Amalgamated CNN-Transformer Architecture for Object Detection. Pattern Recognition 2024, 148.
  6. Ankenbrand, M.J.; Shainberg, L.; Hock, M.; et al. Sensitivity Analysis for Interpretation of Machine Learning Based Segmentation Models in Cardiac MRI. BMC Med. Imaging 2021, 21(1).
  7. Taylor, R.; Ojha, V.; Martino, I.; et al. Sensitivity Analysis for Deep Learning: Ranking Hyper-Parameter Influence. In Proceedings of the International Conference on Tools with Artificial Intelligence (ICTAI), 2021; pp. 512–516.

 

Reviewer 2 Report

Comments and Suggestions for Authors

The authors introduce YOLOv7-GCA, a high-performance model for detecting pepper diseases based on YOLOv7. It incorporates GhostNetV2 and CBAM attention modules in the backbone network and the CFNet feature fusion module in the head. The model comprises five components: input layer, backbone network, neck, head, and loss function.

The input layer employs mosaic data augmentation, adaptive anchor frame calculation, and adaptive image scaling. Mosaic data augmentation enhances small object detection by splicing and scaling samples, addressing the dataset's non-uniform distribution of small and large objects. The network training adapts the anchor frame to the dataset, improving model recall, and all input images are adaptively scaled for normalization, enhancing input image quality and diversity.

The backbone network, crucial for feature extraction, undergoes significant improvements. The original 50-layer YOLOv7 backbone network is enhanced with a GhostNetV2 module, replacing the original ELAN module, and a CBAM attention module. These modifications reduce parameters by 52%, capturing long-distance pixel dependencies and improving long-term self-attention without increasing parameters.

Further refinements target the head and neck of YOLOv7, aiming for more efficient feature fusion. Replacing the ELAN-W module with CFNet in the head structure optimizes feature fusion. The original head design, which combines the feature pyramid network (FPN) and path aggregation network (PAN) into the PA-FPN structure, is enhanced by the introduction of CFNet. This lightweight feature fusion structure effectively leverages the multi-scale features extracted by the backbone network through adaptive weight allocation and channel attention mechanisms. The substitution reduces parameters and computational requirements, ultimately improving the efficiency and quality of feature fusion. As a consequence, pepper disease detection becomes faster and more accurate.

The paper combines various existing modules, making it challenging to assess the novelty of the work. The authors should provide a clearer indication of the distinctive elements that set their approach apart from existing models or methodologies. Enhancing the clarity surrounding the original contributions would assist readers in better understanding the unique aspects and innovations introduced by the authors in their research.

 

 

Author Response

Dear reviewer:

Thank you for your review of and feedback on our paper. We are glad that you are interested in our work and recognize our contribution. We have done our best to answer your questions and provide more detail in the responses and revisions below. Your questions are marked in red, and our supplementary explanations are in blue.

 

Question: The paper combines various existing modules, making it challenging to assess the novelty of the work. The authors should provide a clearer indication of the distinctive elements that set their approach apart from existing models or methodologies. Enhancing the clarity surrounding the original contributions would assist readers in better understanding the unique aspects and innovations introduced by the authors in their research.

 

Author's Notes to Reviewer:

This paper targets the specific application scenario of pepper disease detection, considering the characteristics and challenges of pepper diseases, such as small-target detection, class imbalance, and data scarcity. It aims to address the inadequate performance of pepper disease detection in the field and across various environments. Based on an improved YOLOv7, a high-performance, lightweight pepper disease detection model is proposed that can identify pepper diseases accurately and quickly and helps bring intelligence to pepper agriculture. The main contributions of this study are as follows:

(1) Using GhostNetV2 as the backbone network reduces the parameters incurred by redundant feature computation, improves detection speed, and reduces computational cost while maintaining high performance.

(2) To address complex backgrounds, the cascade fusion network (CFNet) is used as the feature fusion network, so that more parameters can be devoted to feature fusion, improving model performance.

(3) The Convolutional Block Attention Module (CBAM) is introduced to improve the model by emphasizing only key features. In this way, the model can better distinguish features across channels and better capture key spatial information, improving its feature-extraction ability.

As introduced in Section 2.2.6, the placement of the CBAM attention mechanism, the number of convolution layers it controls, the inference speed, and the parameter count were found, after extensive experimental comparison and study of other papers, to be the best fit for the current dataset, fully preparing the model for deployment.

The innovative points of our method are as follows. First, the GhostNetV2 + CBAM combination effectively lightweights the network, ensuring that computation is reduced without losing key feature information, and it can directly replace the feature-extraction role of the original ELAN module. Second, the traditional feature pyramid network (FPN) and its variants are widely used for multi-scale feature extraction and fusion. These models typically pair a heavy classification backbone, used to extract multi-scale features, with lightweight fusion modules used to fuse them: features from adjacent layers are first integrated by element-wise addition, and the summed features are then transformed by a single 3×3 convolution. We refer to these two steps as feature integration and feature transformation; together they constitute feature fusion. Given how few parameters this paradigm allocates to feature fusion compared with the classification backbone, we believe it may be insufficient to fuse multi-scale features, and that better performance is achieved only by assigning a larger proportion of parameters to feature fusion. To solve this problem, this paper adopts the novel cascade fusion network (CFNet) architecture for dense prediction and conducts a series of studies and validations of this network. As shown in Figure 6, we computed and compared different conversion blocks and focal blocks for the CFNet model; different block combinations yield different structural variants, and the combination used in the article is the best we found through layer-count calculation. As mentioned in Section 2.2.4, to keep the parameter count and computation low while improving the nonlinear fit, we chose the ResNet bottleneck as the block in CFNet. These improvements are reflected in the ablation experiments.

In summary, our innovation builds on the YOLOv7 model: for a pepper disease dataset collected in natural environments, we replace parts of the YOLOv7 structure and modify its convolutions so that the improved model adapts well to pepper disease identification in the field and is trained as a lightweight model ready for mobile deployment.

 

Kind regards,

Authors

 

 

Reviewer 3 Report

Comments and Suggestions for Authors

This paper proposes pepper disease detection using YOLOv7, optimized for a high-performance, lightweight model. Additional contributions include incorporating GhostNetV2 as the backbone network to reduce the number of model parameters, tackling the problem of complex backgrounds with the Cascade Fusion Network (CFNet), and using the Convolutional Block Attention Module (CBAM) to improve the model by emphasizing only key features.

The introduction refers to several YOLO-based approaches. However, other popular machine learning approaches, like Smart Phone Image Processing for Plant Disease Diagnosis, are neither referenced nor compared with the proposed approach.

This approach uses a dataset of 1259 images displaying 4 pepper diseases, whose symptoms can appear on leaves or fruits. The number of images in the dataset is relatively small, so many augmentation methods were employed, as described in Section 2.1.2 and Figure 1.

Section 2.2 presents a detailed description of the deep learning models employed, their theoretical background, and the architectural improvements made by the authors. Be careful to define every acronym, because many acronyms (ELAN, CBS, etc.) are used without prior definition. Please provide definitions of these acronyms before using them.

Section 2.3 defines the platform used for training and testing and the training parameters used. The metrics used to determine diagnosis accuracy are also defined. Comparisons between YOLO models are given in Tables 3 and 4.

Fonts in Figures 9 and 11 should be increased.

Although this paper is well written and has merit, my major issues are the following:

a) A comparison with other referenced DL and non-referenced ML approaches is missing to assess the accuracy of the proposed method.

b) Although the platform used for training and testing is a full desktop computer, the authors do not describe how their approach could be used in the field: could the trained model be executed on a smartphone in the field, or should samples be carried to the lab? They should provide the architecture of the end-user application that could use this approach.

The most important and most time-consuming modification to make is:

c) How extensible is this approach to other plant diseases? How easy would it be to classify more diseases from more plants? It would be useful if the results were presented for both 4 and, e.g., 6 diseases, to see how accuracy is affected when more diseases from different plants are included in the study.

 

Author Response

Dear reviewer:

We thank the reviewer for the valuable comments. The questions raised are very important and very helpful for improving the quality of our article, and we have revised it as suggested. Below, your questions are marked in red and our supplementary explanations in blue.

 

Question 1: The introduction refers to several YOLO-based approaches. However, other popular machine learning approaches, like Smart Phone Image Processing for Plant Disease Diagnosis, are neither referenced nor compared with the proposed approach.

Author's Notes to Reviewer:

Thank you for your comments and suggestions. We have added some machine learning studies on crop diseases to the introduction and compared them with the proposed method. The modifications are as follows:

Machine learning plays an important role in the study of agricultural pest and disease identification. Zhang et al. [8] used HSI (hue, saturation, intensity), YUV, and grayscale models to extract 38 color, texture, and shape features, and used a support vector machine (SVM) classifier for identification; the correct identification rate for three apple leaf diseases exceeded 90%. Soarov et al. [9] used Otsu threshold segmentation and histogram equalization to process the images; after segmentation and SVM classification, the recognition rate of diseased apple leaf lobes reached 96%. Zhang et al. [10] used a K-means clustering algorithm to segment images and obtain shape and color features of pests and diseases, achieving good recognition of the 7 major cucumber diseases with an overall recognition rate of 85.7%. Although these methods achieve high identification accuracy, traditional machine learning usually requires manual feature engineering and a simple experimental background, and the current lack of effective interaction and feedback mechanisms when combined with agricultural practice leads to insufficient accuracy and robustness. Ashutosh Kumar et al. [11] used a convolutional neural network, a Bayesian-optimized SVM, and a random forest classifier for plant leaf disease detection based on hybrid features; the convolutional neural network achieved the highest accuracy of 96.1% in detecting leaf diseases of apple, corn, potato, tomato, and rice plants. Nurul Nabilah et al. [12] compared pepper pest and disease features extracted by traditional methods with deep learning-based methods, and the latter outperformed the traditional feature-based methods.

 

Question 2: In 2.2 a detailed description of the deep learning models employed, their theoretical background and the architectural improvements performed by the authors are presented. Be careful to define every acronym because many acronyms are used without prior definition ELAN, CBS, etc. Please provide definitions of these acronyms before using them.

Author's Notes to Reviewer:

CBS stands for Conv + BatchNorm + SiLU; the specific combination is shown in Figure 8. ELAN stands for Efficient Layer Aggregation Network; we have added a supplementary explanation in Section 2.2.1. As for MPConv, although MP denotes max-pooling, it does not simply represent a maxpool+conv combination but the specific structure shown in Figure 8; MP is therefore not a simple acronym but a name derived from the structure diagram.
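For clarity, the CBS block can be written in a few lines of PyTorch; this is a generic sketch of the Conv + BatchNorm + SiLU combination, with kernel and stride as illustrative defaults rather than the exact values used in every layer.

```python
import torch.nn as nn

class CBS(nn.Module):
    """CBS = Conv + BatchNorm + SiLU, the basic building block named above."""
    def __init__(self, in_ch, out_ch, kernel=1, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel, stride,
                              padding=kernel // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```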

 

Question 3: Fonts in Fig. 9 and 11 should be increased.

Author's Notes to Reviewer:

We have increased the font sizes.

 

Question 4: A comparison with other referenced DL and non-referenced ML approaches is missing to assess the accuracy of the proposed method.

Author's Notes to Reviewer:

Although machine learning methods can identify specific plant diseases, researchers must master the relevant disease knowledge, manually select appropriate features, and design complex schemes, and the methods are sensitive to test conditions (image background, lighting, leaf position, etc.). Their universality is therefore low, and their popularization and application face certain difficulties.

In terms of algorithm mechanism, machine learning algorithms must be designed and optimized for different types and characteristics of diseases and pests, but unified standards and methods are currently lacking, so the algorithms fall short in complexity, generalization, and interpretability. In addition, machine learning algorithms usually require substantial computational resources and time, whereas agricultural settings often lack high-performance hardware and network environments, resulting in poor real-time performance and availability.

In terms of application, machine learning for pest and disease detection needs to be combined with agricultural practice, but effective interaction and feedback mechanisms are currently lacking, so the accuracy and robustness of the algorithms are insufficient and their popularity and acceptance at the application level remain low [1,2].

At the same time, the evaluation indicators of machine learning include accuracy, precision, recall, F1, the ROC curve, and AUC. The evaluation criteria of deep learning differ, additionally covering inference speed and parameter counts, which machine learning evaluations lack. For these reasons, this paper does not include comparison experiments with machine learning methods.

The network models compared in this paper, Faster R-CNN, SSD, YOLOv3, YOLOv5s, YOLOv7, and YOLOv8n, are classic object-detection networks that have been widely used in academia and industry, and their principles and methods have been extensively verified and explored. The six models each have their own design characteristics: Faster R-CNN is a classic two-stage detector with high accuracy; SSD is a lightweight single-stage detector with fast inference; and, as mentioned in the introduction, the YOLO series has iterated around accuracy, lightweight design, and inference speed. In this paper, YOLOv3, YOLOv5s, YOLOv7, and YOLOv8n were compared on the improved pepper disease dataset to overcome small targets and occlusion; our results are more advantageous than the original models and even surpass YOLOv8 in some respects. Moreover, previous researchers have studied these six models extensively in the field of crop disease detection; the introduction mentions prior work that achieved good results on other datasets. Choosing these six networks as comparison objects therefore better reflects the performance and characteristics of the proposed model and provides more reference and basis for its improvement and optimization.

 

Question 5: Although the platform used for training and testing is a full desktop computer, the authors do not describe how their approach could be used in the field: could the trained model be executed on a smartphone in the field, or should samples be carried to the lab? They should provide the architecture of the end-user application that could use this approach.

Author's Notes to Reviewer:

In response, we deployed the trained model to an Android terminal and achieved good results. The modification is presented in Section 3.4, as follows:

3.4. Android Deployment Testing

Deep learning models usually save their parameters in specific formats that are not compatible with all hardware platforms. To deploy a model on Android devices, the parameters must be exported and converted to a suitable format. Figure 13 illustrates the deployment process of the pepper disease identification model on Android devices. NCNN is a high-performance neural network inference framework for mobile devices that supports multiple deep learning frameworks; it provides software development kits for Android and iOS that make it easy to run deep learning models on mobile devices. First, the PTH model files trained in PyTorch are converted to Open Neural Network Exchange (ONNX) model files; the universal ONNX format is then used to generate the BIN and PARAM model files that the NCNN library can load. The model is then verified and tested. Finally, according to the application's design requirements, an Android project was created to deploy the YOLOv7-GCA model on a phone for accuracy testing. The main functions of the pepper disease identification app include image acquisition, automatic image saving, CPU-based and GPU-based pepper disease detection, and disease-grade evaluation of the detection results. Users can capture pepper images through the acquisition module or use images from their own album.

 

Figure 1. Flowchart of the deployment process on the Android terminal.
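As a sketch of the export step, the following assumes a full-model PyTorch checkpoint and a 640×640 input; the file names and resolution are placeholders, and the ONNX-to-NCNN conversion uses the onnx2ncnn tool shipped with NCNN.

```python
import torch

# Hypothetical file names; assumes the checkpoint stores the full model
# (if it stores only a state_dict, the architecture must be built first).
model = torch.load("yolov7_gca.pth", map_location="cpu")
model.eval()

dummy = torch.zeros(1, 3, 640, 640)  # input resolution is an assumption
torch.onnx.export(model, dummy, "yolov7_gca.onnx",
                  opset_version=12,
                  input_names=["images"], output_names=["output"])

# The ONNX file is then converted to NCNN's param/bin pair:
#   onnx2ncnn yolov7_gca.onnx yolov7_gca.param yolov7_gca.bin
```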

The pepper disease detection module analyzes the type, number, and severity of the diseases affecting the target peppers and outputs the number of each pepper disease in the target image. GPU-based detection exploits the GPU's parallel computing power and high memory bandwidth to speed up model inference and improve detection accuracy and efficiency; however, the model's compatibility and stability may be affected by the diversity and performance of the mobile GPUs on the market. CPU-based detection can run on any phone regardless of GPU hardware, improving the model's versatility and portability. Figure 14 shows the results of CPU-based pepper disease detection on a Xiaomi 13 mobile phone.

 

 

 

 

Figure 2. Example detection results for pepper diseases: (a) anthracnose; (b) umbilical rot disease; (c) viral disease.

Question 6: How extensible is this approach to other plant diseases? How easy would it be to classify more diseases from more plants? It would be useful if the results were presented for both 4 and, e.g., 6 diseases, to see how accuracy is affected when more diseases from different plants are included in the study.

Author's Notes to Reviewer:

The method in this paper is designed for the specific application scenario of pepper disease detection, considering the characteristics and challenges of pepper diseases, such as small-target detection, class imbalance, and data scarcity, and it aims to address the insufficient performance of pepper disease detection in the field and across various environments. Based on an improved YOLOv7, a high-performance, lightweight pepper disease detection model is proposed that can identify pepper diseases accurately and quickly and helps bring intelligence to pepper agriculture. In view of the reviewer's suggestion, we also conducted the following experiment. Owing to time and dataset-acquisition constraints, aloe anthracnose (752 images) was added in this second experiment to expand our dataset and test scope. We collected images of aloe diseases from the laboratory test site and appropriately preprocessed and annotated them to fit our method.

In our results, we report the accuracy and other evaluation metrics of our method across seven plant disease categories, along with a comparative analysis against other methods. Our method maintained high accuracy and efficiency when handling more plant diseases and outperformed the alternatives: its average accuracy across the seven categories was 91.6%, versus 83.5% (YOLOv8n), 81.6% (YOLOv7), 81.2% (YOLOv5s), 76.3% (YOLOv3), 71.7% (SSD), and 78.4% (Faster R-CNN).

Because some aloe leaves look similar to pepper fruits, some misdetections occur and the accuracy decreases. To confirm the advantages of our method, we trained it on the aloe disease dataset and present the results for aloe anthracnose.

 

 

Figure 3. Precision–recall curves for aloe anthracnose under the YOLOv7 (a) and YOLOv7-GCA (b) models.

As can be seen from Figure 3, the average recognition accuracy for aloe anthracnose increased from 79% to 86.6%, showing that our method also improves the recognition accuracy of other plant diseases and that our experimental approach remains meaningful. As can be seen in Figure 4, the disease spots are essentially all identified.

 

Figure 4. Aloe anthracnose identification results.

 

In our discussion, we explain the advantages and limitations of our approach in handling more plant diseases, as well as possible directions for improvement. We believe our method maintains high performance when dealing with more plant diseases because it has several advantages:

  1. Our method combines the GhostNetV2 module with the CBAM attention module; on the premise of reducing computation without losing key feature information, it lightweights the network and reduces the parameter count while preserving feature-extraction ability, allowing it to capture the subtle differences and similarities between plant diseases.
  2. Our method uses CFNet to optimize the efficiency of feature fusion, making effective use of multi-scale features and richer parameter information, so that it adapts to disease spots of various scales on fruits and leaves against complex backgrounds while reducing the number of parameters and the computation.
  3. Our method uses techniques such as mosaic data augmentation, adaptive anchor-box calculation, and adaptive image scaling, allowing it to adapt to plant disease targets of different sizes and distributions while improving the quality and diversity of the input images.

We also acknowledge that there are still some limitations and challenges of our method when dealing with more plant diseases, such as:

  1. In mixed training and detection over the 7 disease categories, false detections occur, reducing accuracy; the next improvement should therefore focus on handling similarity between plants.
  2. Our method relies on sufficient data quantity and quality; disease classes with too many or too few samples, or low image quality in the dataset, may affect its performance and stability.
  3. Our method has not been validated on three or more plant disease detection datasets, so its generalization ability and extensibility remain to be proven and evaluated.
  4. Our method does not yet consider some factors in practical applications, such as night scenes, backlighting, and plant similarity, which may affect its detection performance.

To solve these problems, we plan to make the following improvements and attempts in our future work:

  1. We plan to collect and collate more plant disease image data, covering different plant species, disease types, and image qualities, to build a more comprehensive and representative plant disease detection dataset.
  2. We plan to test and validate the performance and robustness of our method on other plant disease detection datasets, such as the PlantVillage or Plant Doctor datasets, to demonstrate its adaptability and superiority in different scenarios and conditions.
  3. We plan to introduce more advanced and efficient modules and technologies, such as YOLOv8 or Transformers, to further enhance our method's feature extraction and feature fusion, as well as its ability to detect plant diseases in complex environments.

We hope these amendments meet the reviewers' requirements and make our article more complete and persuasive. We thank the reviewers once again for their comments and suggestions.

 

Kind regards,

Authors

 

References

  1. Tsaftaris, S.A.; Minervini, M.; Scharr, H. Machine Learning for Plant Phenotyping Needs Image Processing. Trends Plant Sci. 2016, 21(12), 989–991.
  2. Li, L.; Zhang, S.; Wang, B. Plant Disease Detection and Classification by Deep Learning—A Review. IEEE Access 2021, 9, 56683–56698.

 

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have made all the requested revisions. The work can be accepted as it is.

Reviewer 3 Report

Comments and Suggestions for Authors

The authors addressed all of my comments except for a comparison with ML methods, for which they explained why they did not address it. Other important ML methods for plant disease diagnosis are still missing, although they added some references.
