The Development of a Lightweight DE-YOLO Model for Detecting Impurities and Broken Rice Grains
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
General comments:
The authors presented an article entitled “Development of a Lightweight DE-YOLO Model for Detecting Impurities and Broken in Rice”. The authors proposed modifications to the YOLOX model to specifically address challenges in detecting small targets (broken grains and impurities) in rice samples. Although the paper is well-written and appears to contain some points of novelty, several questions arise after careful revision. Do not forget that a paper must guarantee that the study can be replicated by other readers.
Q1. More details about the data collection process are needed. How was the communication established between the computer and the device? Was it via Matlab, Arduino, or another platform? Please provide additional details on this.
Q2: It is unclear how many classes were used for training the model. Did the authors use bounding boxes? Which tool did they use for annotation purposes? How do they justify the selected data augmentation techniques? Please provide more details on this.
Q3: It is important to include software versions, such as the Python version and annotation software used. Please also provide appropriate references for these tools.
Q4: Try to use abbreviations instead of repeatedly writing out full terms such as "machine learning" (ML) or "deep learning" (DL).
Q5: There is no discussion in this paper, nor any comparison with other studies. The results are presented together with the discussion, but I cannot see an actual discussion. It might be better to separate this into two distinct sections—**Results** and **Discussion**—to clearly highlight and discuss the authors' findings.
Q6: Could the authors share the dataset and code with the general public? I think this is important to ensure the replication of the study and the availability of the code. There are repositories like Zenodo or GitHub that allow for this. The option of requesting the data involves a lot of uncertainty; in many cases, I have requested the data, but it was not provided.
Q7: Please add DOIs to the references; this is in agreement with MDPI policy.
Specific comments:
Line 24: please avoid using keywords that already appear in the title.
Lines 168–173: It is unclear whether the authors included the augmented data in the test and validation sets after applying the data augmentation techniques. Moreover, from my perspective, tripling (from ~3000 to ~9000) the dataset through augmentation techniques could be risky. The authors assume the model will perform better with augmented data, but they have not provided any comparative results without data augmentation.
Lines 253–264: These parameters related to GPU, PyTorch, etc., should be moved to the Materials and Methods section, as they are more appropriate there.
I missed seeing more results in this paper; only brief results are presented. This is always the same—it is like training an object detection model for a different application, but there is no clear novelty in this study.
Author Response
General comments:
The authors presented an article entitled “Development of a Lightweight DE-YOLO Model for Detecting Impurities and Broken in Rice”. The authors proposed modifications to the YOLOX model to specifically address challenges in detecting small targets (broken grains and impurities) in rice samples. Although the paper is well-written and appears to contain some points of novelty, several questions arise after careful revision. Do not forget that a paper must guarantee that the study can be replicated by other readers.
Answer: We have revised the manuscript carefully according to the comments, and the points raised by the reviewers have also been incorporated into the discussion section.
Q1. More details about the data collection process are needed. How was the communication established between the computer and the device? Was it via Matlab, Arduino, or another platform? Please provide additional details on this.
Answer: Thank you for your feedback. We have added more details about the data collection process in the revised manuscript. Specifically, we clarified that the industrial camera communicates with the computer via a USB connection and relies on the installed SDK driver for control. The data acquisition process is primarily conducted using the official image capture software, which supports parameter adjustments and automated photography to meet experimental requirements. Additionally, we have included a detailed explanation in lines 147-155 of the latest manuscript:" The selected industrial camera provides an SDK library compatible with the Windows system for camera integration and development. It includes official image acquisition software for image capture. The camera communicates with the computer via a USB connection and relies on the installed SDK driver for operation. Within the acquisition software, various camera parameters can be configured, including exposure, trigger mode, color adjustments, IO operations, video parameters, and resolution. Furthermore, the software’s image acquisition function allows for the setting of an automatic photography cycle to capture images of grains falling within the sampling device."
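For illustration only, the sketch below shows the kind of timed acquisition loop described above. It assumes a generic USB camera accessible through OpenCV rather than the vendor SDK and official acquisition software actually used; the function name, file names, and capture interval are hypothetical.

```python
# Hypothetical sketch of an automatic photography cycle; the study itself used
# the camera vendor's SDK and acquisition software, not OpenCV.
import time
import cv2

def capture_cycle(device_index=0, interval_s=1.0, n_frames=10):
    cap = cv2.VideoCapture(device_index)   # USB camera enumerated by the operating system
    if not cap.isOpened():
        raise RuntimeError("Camera not found; check the USB connection and driver")
    try:
        for i in range(n_frames):
            ok, frame = cap.read()          # grab one frame of grains in the sampling device
            if ok:
                cv2.imwrite(f"grain_{i:04d}.png", frame)
            time.sleep(interval_s)          # fixed capture interval
    finally:
        cap.release()

if __name__ == "__main__":
    capture_cycle()
```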
Q2: It is unclear how many classes were used for training the model. Did the authors use bounding boxes? Which tool did they use for annotation purposes? How do they justify the selected data augmentation techniques? Please provide more details on this.
Answer: We used two categories for model training: broken grains and impurity stems. The data annotation was performed using the LabelImg software, accurately labeling the bounding boxes of stems and broken grains. The object detection task was based on the YOLOX model, utilizing bounding boxes to locate and identify target categories. In the latest manuscript, we removed the explanation of "data augmentation" and replaced it with "data expansion." A total of 2,458 raw images were collected and annotated. However, this dataset size was insufficient to meet the data requirements of deep learning models. Therefore, common data expansion techniques, such as rotation, scaling, and flipping, were applied. These techniques helped expand the training dataset while working with a limited amount of original data.
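As a minimal sketch of the expansion operations named above (rotation, scaling, flipping), assuming OpenCV; this is not the authors' exact pipeline, and in practice the bounding-box annotations would have to be transformed consistently with the images. The angle and scale factor are illustrative values.

```python
# Illustrative data-expansion operations, assuming OpenCV.
import cv2

def expand(image):
    h, w = image.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), 15, 1.0)   # rotate by 15 degrees
    rotated = cv2.warpAffine(image, m, (w, h))
    scaled = cv2.resize(image, None, fx=0.8, fy=0.8)        # scale to 80%
    flipped = cv2.flip(image, 1)                            # horizontal flip
    return [rotated, scaled, flipped]
```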
Q3: It is important to include software versions, such as the Python version and annotation software used. Please also provide appropriate references for these tools.
Answer: Thank you for your suggestion. We indeed used Python and relevant software tools for the experiments, and LabelImg was employed for data annotation. The specific software environment was configured according to the training device's specifications to ensure stable model training and optimization.
Q4: Try to use abbreviations instead of repeatedly writing out full terms such as "machine learning" (ML) or "deep learning" (DL).
Answer: We have updated the latest manuscript by replacing technical terms such as "deep learning" and "Depthwise Separable Convolution" with "DL" and "DSConv".
Q5: There is no discussion in this paper, nor any comparison with other studies. The results are presented together with the discussion, but I cannot see an actual discussion. It might be better to separate this into two distinct sections—**Results** and **Discussion**—to clearly highlight and discuss the authors' findings.
Answer: Thank you for your valuable feedback. We have revised the manuscript, particularly the results and discussion sections, to more clearly present the research findings and provide an in-depth discussion and comparative analysis. Your suggestions have been instrumental in enhancing the logical flow and structural integrity of the paper. We are deeply grateful for your guidance.
Q6: Could the authors share the dataset and code with the general public? I think this is important to ensure the replication of the study and the availability of the code. There are repositories like Zenodo or GitHub that allow for this. The option of requesting the data involves a lot of uncertainty; in many cases, I have requested the data, but it was not provided.
Answer: Thank you for your interest in our work. Unfortunately, we are unable to publicly share the dataset and code due to data privacy and institutional restrictions. However, we appreciate your understanding, and if you have specific questions regarding our methodology or implementation, we would be happy to provide further clarifications.
Q7: Please add DOIs to the references; this is in agreement with MDPI policy.
Answer: We have added the DOIs to the references in the latest manuscript.
Specific comments:
Line 24: please avoid using keywords that already appear in the title.
Answer: The keywords have been updated to "Keywords: Deep learning, Object Detection, image processing, Attention Mechanism." and are presented in the latest manuscript.
Lines 168–173: It is unclear whether the authors included the augmented data in the test and validation sets after applying the data augmentation techniques. Moreover, from my perspective, tripling (from ~3000 to ~9000) the dataset through augmentation techniques could be risky. The authors assume the model will perform better with augmented data, but they have not provided any comparative results without data augmentation.
Answer: A total of 2,458 raw images were collected and annotated. However, this dataset was insufficient to meet the data requirements of deep learning models. Therefore, common data augmentation techniques, such as rotation, scaling, and flipping, were applied to expand the training dataset. By leveraging these techniques, we enriched the training dataset and enhanced its diversity, thereby reducing the risk of overfitting. Although we did not provide comparative experiments without data augmentation, existing research has shown that data augmentation is widely adopted in many similar applications (Krizhevsky et al., 2012).
Lines 253–264: These parameters related to GPU, PyTorch, etc., should be moved to the Materials and Methods section, as they are more appropriate there.
Answer: We have revised the relevant parameters, including "Operating System: Ubuntu 18.04, Graphics Card: NVIDIA GeForce GTX 1060, based on the deep learning framework Python 3.8, PyTorch 1.9.0+cu111, CUDA 11.1, cuDNN 8.0.50, and installed libraries required for the model such as Torchvision 0.10.0+cu111, OpenCV 4.7.0.72, Pillow 9.4.0, etc." These changes have been reflected in the latest manuscript.
I missed seeing more results in this paper; only brief results are presented. This is always the same—it is like training an object detection model for a different application, but there is no clear novelty in this study.
Answer: Thank you for your valuable feedback. We understand your concern regarding the presentation of results. Although the results section of this paper provides a brief overview, we have elaborated in detail on the innovations of the DE-YOLO model. Our improvements focus primarily on lightweight network design, addressing class imbalance issues, and optimizing small object detection. We believe these enhancements have significant application value for rice target detection tasks, especially on resource-constrained devices.
Reviewer 2 Report
Comments and Suggestions for Authors
1. A supplementary ablation experiment could further explore the interaction between the modules. For example, the three improvement methods (DWSConv, Focal Loss, and ECANet), as well as any combination of two of them, could be added to the original model to examine their detection effect.
2. The model has few comparative experiments; more comparison models could be added to increase the persuasiveness of the results.
3. Why is YOLOX-s taken as the baseline model?
Author Response
- A supplementary ablation experiment could further explore the interaction between the modules. For example, the three improvement methods (DSConv, Focal Loss, and ECANet), as well as any combination of two of them, could be added to the original model to examine their detection effect.
Answer: Thank you for your suggestion. Based on the original experiments, we have added ablation models for Model3 and Model4. Model3 is built upon Model1 with the addition of the attention mechanism ECANet, while Model4 investigates the effects of ECANet and Focal Loss on the model without the lightweight DSConv. The experimental results and discussion have been updated in the latest manuscript.
Table 1. Experimental results under different models
| Model | DSConv | Focal Loss | ECANet | mAP | Parameters | GFLOPS |
|---|---|---|---|---|---|---|
| YOLOX | | | | 94.65% | 9.0 M | 13.08 |
| Model1 | √ | | | 92.86% | 4.0 M | 6.53 |
| Model2 | √ | √ | | 95.14% | 4.2 M | 6.71 |
| Model3 | √ | | √ | 94.76% | 4.2 M | 6.88 |
| Model4 | | √ | √ | 97.42% | 9.0 M | 13.26 |
| DE-YOLO | √ | √ | √ | 97.55% | 4.6 M | 7.02 |
The experimental results are shown in Table 1. Model1 replaces the standard convolution (CBS) in YOLOX with depthwise separable convolution (DSConv), reducing the number of parameters by 5M and lowering GFLOPS by 6.55. However, its mAP decreases to 92.86% (a drop of 1.79%). Model2 builds upon Model1 by incorporating Focal Loss to address the class imbalance in rice grains, improving the mAP to 95.14%. Model3 introduces the ECANet attention mechanism into Model1, increasing the mAP to 94.76%, which is slightly lower than Model2 (95.14%). This indicates that ECANet enhances feature extraction but does not fully compensate for the information loss caused by DSConv. The proposed DE-YOLO model achieves a parameter size of only 4.6M and reduces GFLOPS to 7.02 while maintaining an mAP of 97.55%, demonstrating a balance between lightweight design and high accuracy.
The study improves the YOLOX model through lightweight modifications and analyzes the impact of depthwise separable convolution (DSConv), Focal Loss, and the ECANet attention mechanism on detection performance. After adopting DSConv in Model1, the mAP decreased by 1.79%. This may be due to DSConv decomposing traditional convolution into DWConv and PWConv, reducing cross-channel feature interaction and affecting the model’s representational capacity. Similar phenomena have been reported in studies on MobileNetV1/V2 and ShuffleNet [47]. Although DSConv significantly reduces computational complexity, its suitability for accuracy-sensitive tasks requires careful consideration.
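For readers unfamiliar with the decomposition discussed here, the following is a minimal PyTorch sketch of a depthwise separable convolution block (a DWConv followed by a 1×1 PWConv); it is an illustration under stated assumptions, not the exact module used in DE-YOLO.

```python
# Minimal DSConv sketch: per-channel spatial filtering (DWConv) followed by a
# 1x1 pointwise convolution (PWConv) for cross-channel mixing. Illustrative only.
import torch
import torch.nn as nn

class DSConv(nn.Module):
    def __init__(self, c_in, c_out, k=3, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, k, stride, k // 2, groups=c_in, bias=False)  # DWConv
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)                               # PWConv
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

x = torch.randn(1, 64, 80, 80)
print(DSConv(64, 128)(x).shape)  # torch.Size([1, 128, 80, 80])
```

For a 3×3 kernel with C input and C output channels, the weight count drops from 9C² (standard convolution) to roughly C² + 9C, which is the mechanism behind the parameter reduction reported in Table 1.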
To address the potential information loss caused by DSConv, Model2 introduces Focal Loss based on Model1 to enhance the detection of hard-to-classify targets, such as rice grains with similar colors and broken grains. Experimental results show that Focal Loss improves the mAP to 95.14%, exceeding the original YOLOX model (94.65%). This demonstrates that Focal Loss effectively mitigates class imbalance, allowing the model to focus more on easily confused categories, thereby enhancing overall detection performance. This suggests that optimizing the loss function in a lightweight YOLOX structure can improve detection accuracy without significantly increasing computational cost.
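As a sketch only, a binary focal loss of the kind referenced here can be written as follows in PyTorch; the α and γ values are the standard defaults from Lin et al. and may differ from the configuration actually used in the paper.

```python
# Illustrative binary focal loss: standard BCE down-weighted by (1 - p_t)^gamma
# so that easy examples contribute less; alpha balances the two classes.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)              # probability of the true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()
```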
Model3 incorporates the ECANet attention mechanism into Model1, increasing the mAP to 94.76%, which is slightly higher than Model1 (92.86%) but lower than Model2 (95.14%). This indicates that ECANet enhances feature extraction but has limited improvement under the DSConv structure. Since DSConv reduces the integrity of feature representation during convolution, ECANet’s channel-wise adaptive mechanism may not fully compensate for this loss. Therefore, while ECANet provides some enhancement, its effectiveness is constrained in lightweight convolution structures. In contrast, Model4, which integrates both Focal Loss and ECANet, achieves an mAP of 97.42%, suggesting that ECANet performs better with complete convolutional feature representation (CBS). This further confirms that simply introducing DSConv in the YOLOX structure may impair feature extraction effectiveness, whereas combining it with an appropriate loss function and attention mechanism can significantly improve detection performance.
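A minimal PyTorch sketch of an ECA block follows, assuming the standard formulation (global average pooling followed by a 1-D convolution across channels, with no dimensionality reduction); the kernel size used in DE-YOLO is not specified here and is assumed to be 3.

```python
# Efficient Channel Attention (ECA) sketch: per-channel weights from a 1-D
# convolution over the pooled channel descriptor. Illustrative only.
import torch
import torch.nn as nn

class ECA(nn.Module):
    def __init__(self, k_size=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                    # x: (N, C, H, W)
        y = self.pool(x)                                      # (N, C, 1, 1)
        y = self.conv(y.squeeze(-1).transpose(-1, -2))        # 1-D conv over the channel axis
        y = self.sigmoid(y.transpose(-1, -2).unsqueeze(-1))   # (N, C, 1, 1) channel weights
        return x * y.expand_as(x)

x = torch.randn(1, 64, 80, 80)
print(ECA()(x).shape)  # torch.Size([1, 64, 80, 80])
```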
Ultimately, the proposed DE-YOLO model achieves both lightweight design and high detection accuracy.
- The model has few comparative experiments; more comparison models could be added to increase the persuasiveness of the results.
Answer: Thank you for your valuable suggestion. Given the available time and computational resources, we added the two-stage object detection model Faster R-CNN as an additional comparison alongside the original experiments, and conducted further comparative experiments to enhance the credibility of our model. The following modification has been made in the latest manuscript:
"To further validate the performance of DE-YOLO for rice impurity and broken grain monitoring, we selected detection models from the YOLO series based on their recognition performance on the test samples. Using the same experimental platform and dataset, we trained YOLOv3, YOLOv5, YOLOX, DE-YOLO, YOLOv8, and Faster R-CNN in the same batch. The detection results are shown in Table 3."
| Model | Precision (%) | Recall (%) | mAP (%) | F1-score | Parameters |
|---|---|---|---|---|---|
| YOLOv3 | 93.68 | 92.39 | 94.63 | 0.94 | 62 M |
| YOLOv5 | 94.94 | 93.51 | 95.26 | 0.94 | 7.3 M |
| YOLOX | 94.08 | 92.46 | 94.65 | 0.94 | 9.0 M |
| DE-YOLO | 96.66 | 94.46 | 97.55 | 0.96 | 4.6 M |
| YOLOv8 | 96.89 | 95.41 | 98.26 | 0.96 | 11.2 M |
| Faster R-CNN | 88.52 | 86.93 | 89.71 | 0.88 | 41 M |
The experimental results demonstrate that the DE-YOLO model performs exceptionally well in the detection of impurity and breakage rates in rice, achieving high-accuracy detection results. Specifically, DE-YOLO achieves a Precision of 96.66%, a Recall of 94.46%, an mAP of 97.55%, and an F1-score of 0.96. These metrics indicate that DE-YOLO effectively reduces false positives and false negatives while accurately identifying rice stems and broken grains, delivering high-quality detection outcomes. Moreover, DE-YOLO has a parameter size of 4.6M, significantly reducing computational resource consumption compared to other YOLO series models, such as YOLOv3 (62M) and YOLOv8 (11.2M). This lightweight design enables DE-YOLO to operate efficiently on resource-constrained devices.
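As a quick consistency check (not part of the original experiments), the reported F1-score follows from the precision and recall in the table via F1 = 2PR/(P + R):

```python
# Recomputing DE-YOLO's F1-score from its reported Precision and Recall.
p, r = 0.9666, 0.9446
f1 = 2 * p * r / (p + r)
print(round(f1, 2))  # 0.96
```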
DE-YOLO demonstrates significant advantages over other mainstream object detection models. Firstly, compared to YOLOv3, YOLOv5, and YOLOX, DE-YOLO outperforms them in key metrics such as Precision, Recall, mAP, and F1-score. For example, DE-YOLO's Precision is 2.98 percentage points higher than YOLOv3 and 1.72 percentage points higher than YOLOv5, showcasing its superior accuracy in effectively reducing false positives. Similarly, DE-YOLO achieves a higher Recall than YOLOv3 and YOLOX, indicating its enhanced capability in detecting small and hard-to-identify objects, thereby minimizing missed detections.
Compared to YOLOv8, DE-YOLO achieves similar accuracy but reduces parameter size by 59% due to its lightweight optimizations, significantly lowering computational costs and making it more suitable for deployment on resource-constrained devices.
Faster R-CNN, as a two-stage detector, performs well in feature extraction. However, DE-YOLO exhibits a considerable advantage across all performance metrics. Faster R-CNN achieves a Precision of 88.52%, a Recall of 86.93%, and an mAP of 89.71%, while DE-YOLO surpasses these figures by approximately 8 percentage points, with its F1-score improving by 0.08. Additionally, DE-YOLO’s parameter size is only 4.6M, an 89% reduction compared to Faster R-CNN’s 41M. This reduction not only allows DE-YOLO to outperform Faster R-CNN in accuracy but also gives it a clear advantage in computational efficiency.
Overall, DE-YOLO’s combination of high accuracy and low computational requirements makes it highly applicable for rice quality inspection. Its ability to operate efficiently in resource-limited environments provides an optimal solution for real-time detection applications.
- Why is YOLOX-s taken as the baseline model?
Answer: In the YOLOX series, YOLOX-s (small) is optimized for mid-to-small devices or resource-constrained systems. Compared to larger versions like YOLOX-m and YOLOX-l, YOLOX-s achieves a better balance between model size, computational efficiency, and real-time performance (Ge et al., 2021). Therefore, this study selects YOLOX-s as the base model and further improves it by optimizing the model's lightweight structure, feature extraction module, and loss function design to enhance the accuracy and efficiency of rice impurity and broken grain detection, better aligning with practical application needs.
Reviewer 3 Report
Comments and Suggestions for Authors
The manuscript presents the development of a lightweight DE-YOLO (You Only Look Once) model designed for detecting impurities and broken grains in rice. By leveraging advanced Deep Learning (DL) techniques, this approach aims to improve the operational efficiency of intelligent combine harvesters.
The study introduces DE-YOLO, a novel rice impurity detection algorithm built upon an improved YOLOX-s model. To enhance small crop target recognition and impurity detection, the model incorporates several modifications, including replacing the CBS module with a DBS module, utilizing depth-wise separable convolution, and integrating the ECANet module to strengthen attention mechanisms. Additionally, the Focal Loss function is employed to mitigate class imbalance. Experimental results demonstrate the effectiveness of DE-YOLO, achieving a mAP of 97.55%, a recall rate of 94.46%, and a reduction in both parameter count and computational complexity.
Effective Performance Improvement: The experimental results demonstrate a notable accuracy enhancement, with a mAP increase of 2.9%, while simultaneously reducing computational cost.
Innovative Model Enhancements: The integration of the DBS module and depth-wise separable convolution effectively reduces model complexity, ensuring a lightweight design without compromising accuracy.
Comprehensive Evaluation: The study provides a thorough comparison with the original YOLOX algorithm, analysing key performance metrics such as precision, recall, F1-score, parameter count, and GFLOPS.
Integration of the ECANet Module: This enhancement strengthens the model’s ability to focus on rice impurities and broken grains, improving detection reliability.
There are several areas in the manuscript that require improvement:
Comparative Analysis: While the reported performance improvements are significant, the manuscript would benefit from a more in-depth comparison between DE-YOLO and other state-of-the-art lightweight models for rice impurity detection.
Clarity and Readability: Certain sections contain awkward phrasing and grammatical errors. A thorough language revision is recommended to improve clarity and readability.
Ablation Study: A detailed analysis of the individual contributions of each proposed modification (DBS module, ECANet, depth-wise separable convolution) is necessary to quantify their specific impact on the model’s performance.
Real-world Application: The manuscript should provide a discussion on practical deployment scenarios, including hardware requirements and the model’s adaptability to different rice varieties and environmental conditions.
Overall evaluation:
The manuscript makes a valuable contribution to rice impurity detection by presenting an improved lightweight deep learning model. The authors provide a well-structured methodology, optimizing the YOLOX-s model with depth-wise separable convolution, the ECANet attention mechanism, and Focal Loss. These innovations result in notable improvements in detection accuracy and computational efficiency, making the model well-suited for deployment on resource-constrained devices. The experimental validation highlights DE-YOLO’s superiority over traditional YOLO models, particularly in detecting small rice targets.
However, a more detailed discussion on comparative model performance, an ablation study to assess the impact of each modification, and considerations for real-world deployment would enhance the manuscript.
Comments on the Quality of English Language
A thorough language revision is recommended to improve clarity and readability.
Author Response
The manuscript presents the development of a lightweight DE-YOLO (You Only Look Once) model designed for detecting impurities and broken grains in rice. By leveraging advanced Deep Learning (DL) techniques, this approach aims to improve the operational efficiency of intelligent combine harvesters.
The study introduces DE-YOLO, a novel rice impurity detection algorithm built upon an improved YOLOX-s model. To enhance small crop target recognition and impurity detection, the model incorporates several modifications, including replacing the CBS module with a DBS module, utilizing depth-wise separable convolution, and integrating the ECANet module to strengthen attention mechanisms. Additionally, the Focal Loss function is employed to mitigate class imbalance. Experimental results demonstrate the effectiveness of DE-YOLO, achieving a mAP of 97.55%, a recall rate of 94.46%, and a reduction in both parameter count and computational complexity.
Effective Performance Improvement: The experimental results demonstrate a notable accuracy enhancement, with a mAP increase of 2.9%, while simultaneously reducing computational cost.
Innovative Model Enhancements: The integration of the DBS module and depth-wise separable convolution effectively reduces model complexity, ensuring a lightweight design without compromising accuracy.
Comprehensive Evaluation: The study provides a thorough comparison with the original YOLOX algorithm, analysing key performance metrics such as precision, recall, F1-score, parameter count, and GFLOPS.
Integration of the ECANet Module: This enhancement strengthens the model’s ability to focus on rice impurities and broken grains, improving detection reliability.
Answer: Thank you for your recognition of the work presented in this manuscript.
There are several areas in the manuscript that require improvement:
Question 1: Comparative Analysis: While the reported performance improvements are significant, the manuscript would benefit from a more in-depth comparison between DE-YOLO and other state-of-the-art lightweight models for rice impurity detection.
Answer: Thank you for your valuable suggestion. We have added a two-stage object detection model, Faster R-CNN, as a comparative model based on the original experiments and conducted additional comparative experiments to enhance the credibility of our model. The following modification has been made in the latest manuscript:
"To further validate the performance of DE-YOLO for rice impurity and broken grain monitoring, we selected detection models from the YOLO series based on their recognition performance on the test samples. Using the same experimental platform and dataset, we trained YOLOv3, YOLOv5, YOLOX, DE-YOLO, YOLOv8, and Faster R-CNN in the same batch. The detection results are shown in Table 3."
| Model | Precision (%) | Recall (%) | mAP (%) | F1-score | Parameters |
|---|---|---|---|---|---|
| YOLOv3 | 93.68 | 92.39 | 94.63 | 0.94 | 62 M |
| YOLOv5 | 94.94 | 93.51 | 95.26 | 0.94 | 7.3 M |
| YOLOX | 94.08 | 92.46 | 94.65 | 0.94 | 9.0 M |
| DE-YOLO | 96.66 | 94.46 | 97.55 | 0.96 | 4.6 M |
| YOLOv8 | 96.89 | 95.41 | 98.26 | 0.96 | 11.2 M |
| Faster R-CNN | 88.52 | 86.93 | 89.71 | 0.88 | 41 M |
The experimental results demonstrate that the DE-YOLO model performs exceptionally well in the detection of impurity and breakage rates in rice, achieving high-accuracy detection results. Specifically, DE-YOLO achieves a Precision of 96.66%, a Recall of 94.46%, an mAP of 97.55%, and an F1-score of 0.96. These metrics indicate that DE-YOLO effectively reduces false positives and false negatives while accurately identifying rice stems and broken grains, delivering high-quality detection outcomes. Moreover, DE-YOLO has a parameter size of 4.6M, significantly reducing computational resource consumption compared to other YOLO series models, such as YOLOv3 (62M) and YOLOv8 (11.2M). This lightweight design enables DE-YOLO to operate efficiently on resource-constrained devices.
DE-YOLO demonstrates significant advantages over other mainstream object detection models. Firstly, compared to YOLOv3, YOLOv5, and YOLOX, DE-YOLO outperforms them in key metrics such as Precision, Recall, mAP, and F1-score. For example, DE-YOLO's Precision is 2.98 percentage points higher than YOLOv3 and 1.72 percentage points higher than YOLOv5, showcasing its superior accuracy in effectively reducing false positives. Similarly, DE-YOLO achieves a higher Recall than YOLOv3 and YOLOX, indicating its enhanced capability in detecting small and hard-to-identify objects, thereby minimizing missed detections.
Compared to YOLOv8, DE-YOLO achieves similar accuracy but reduces parameter size by 59% due to its lightweight optimizations, significantly lowering computational costs and making it more suitable for deployment on resource-constrained devices.
Faster R-CNN, as a two-stage detector, performs well in feature extraction. However, DE-YOLO exhibits a considerable advantage across all performance metrics. Faster R-CNN achieves a Precision of 88.52%, a Recall of 86.93%, and an mAP of 89.71%, while DE-YOLO surpasses these figures by approximately 8 percentage points, with its F1-score improving by 0.08. Additionally, DE-YOLO’s parameter size is only 4.6M, an 89% reduction compared to Faster R-CNN’s 41M. This reduction not only allows DE-YOLO to outperform Faster R-CNN in accuracy but also gives it a clear advantage in computational efficiency.
Overall, DE-YOLO’s combination of high accuracy and low computational requirements makes it highly applicable for rice quality inspection. Its ability to operate efficiently in resource-limited environments provides an optimal solution for real-time detection applications.
Question 2: Clarity and Readability: Certain sections contain awkward phrasing and grammatical errors. A thorough language revision is recommended to improve clarity and readability.
Answer: Thank you for your valuable feedback. We have revised the manuscript to enhance its professionalism, clarity, and readability, ensuring that the expressions are more fluent and precise.
Question 3: Ablation Study: A detailed analysis of the individual contributions of each proposed modification (DBS module, ECANet, depth-wise separable convolution) is necessary to quantify their specific impact on the model’s performance.
Answer: Thank you for your suggestion. Based on the original experiments, we have added ablation models for Model3 and Model4. Model3 is built upon Model1 with the addition of the attention mechanism ECANet, while Model4 investigates the effects of ECANet and Focal Loss on the model without the lightweight DSConv. The experimental results and discussion have been updated in the latest manuscript.
Question 4: Real-world Application: The manuscript should provide a discussion on practical deployment scenarios, including hardware requirements and the model’s adaptability to different rice varieties and environmental conditions.
Answer: Thank you for your valuable feedback. Indeed, research on practical deployment is a key focus of our future work. We plan to apply this method to the impurity and broken grain rate detection task for unmanned combine harvesters. The main focus of this study is the optimization of object detection methods for small target crops, incorporating features of rice stalks and broken grains. The aim is to ensure high detection accuracy and efficiency while improving model performance through lightweight optimization, thus laying a solid foundation for future deployment on mobile platforms.
Overall evaluation:
Question 5: The manuscript makes a valuable contribution to rice impurity detection by presenting an improved lightweight deep learning model. The authors provide a well-structured methodology, optimizing the YOLOX-s model with depth-wise separable convolution, the ECANet attention mechanism, and Focal Loss. These innovations result in notable improvements in detection accuracy and computational efficiency, making the model well-suited for deployment on resource-constrained devices. The experimental validation highlights DE-YOLO’s superiority over traditional YOLO models, particularly in detecting small rice targets. However, a more detailed discussion on comparative model performance, an ablation study to assess the impact of each modification, and considerations for real-world deployment would enhance the manuscript.
Answer: Thank you for your recognition of the work presented in this manuscript. Based on the original experiments, we have added ablation models for Model3 and Model4. Model3 is built upon Model1 with the addition of the attention mechanism ECANet, while Model4 investigates the effects of ECANet and Focal Loss on the model without the lightweight DSConv. The experimental results and discussion have been updated in the latest manuscript. Indeed, research on practical deployment is a key focus of our future work. We plan to apply this method to the impurity and broken grain rate detection task for unmanned combine harvesters. The main focus of this study is the optimization of object detection methods for small target crops, incorporating features of rice stalks and broken grains. The aim is to ensure high detection accuracy and efficiency while improving model performance through lightweight optimization, thus laying a solid foundation for future deployment on mobile platforms. Thanks for your work on our manuscript.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Although the authors answered my questions, they did not carry out all the changes in the document. For example, they mentioned using LabelImg for annotation purposes, but they did not include a citation for it. Additionally, the Python version and libraries were not provided. The authors should address this carefully to reach the quality expected of a paper.
Author Response
Thank you for your valuable feedback. We sincerely appreciate your thorough review and suggestions. In the latest manuscript, we have supplemented the specific version information of LabelImg and added the relevant references (Ji et al., 2022 [36]; Zhang et al., 2023 [37]), citing them appropriately in the text. In this study, we used LabelImg 1.8.6 for data annotation. The specific versions of the experimental tools are detailed in Section 2.4, Experimental Environment, as follows:
"Operating System:Windows11, Graphics Card: NVIDIA GeForce GTX 1060, based on the deep learning framework Python 3.8, PyTorch 1.9.0+cu111, CUDA 11.1, cuDNN 8.0.50, and installed libraries required for the model such as Torchvision 0.10.0+cu111, OpenCV 4.7.0.72, Pillow 9.4.0, etc."
Reviewer 2 Report
Comments and Suggestions for Authors
The paper has been carefully revised, and there are no other problems.
Author Response
Thank you for your support.