Defective Wheat Kernel Recognition Using EfficientNet with Attention Mechanism and Multi-Binary Classification
Cristina Maria Ribeiro Caridade
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper presents an experimental study of a wheat grain classification system envisioned as a quality control tool for separating four types of damaged seeds from sound ones. The paper is of interest and provides a solid overview of the method applied to this specific use case. In technical terms, the introduced CBAM feature extraction module and the other adaptations are not new or specific to this type of problem. Nevertheless, the main contribution is their application, so there are different elements that are of interest to readers. What is somewhat missing is the rationale for the interpretations of each of the binary classifiers (provided on page 11 in the text), which should be addressed in the revised manuscript (or the existing text should be adapted to emphasize that the given interpretations are something that could be expected from a successfully working system, but are not proven in the paper). There are also some other elements that should be addressed (listed below), but in general the paper is well organized and written. Please carefully address the above-mentioned and other comments provided below:
Page 3, lines 103-106, please rephrase the second part of the sentence, in the current form it suggests that the network is identifying enhanced images, which was not the goal of the method in [20].
Please correct the formatting of references throughout the text. These usually appear next to previous word, without spacing, e.g. text[ref], or with some added dots after ”]”.
Page 3, line 128, and in several other places in the text, there is an error that the same adverb appears twice, e.g. “In 2025, In 2025”. Please correct and check why this systematically appears in the text.
Page 14 and 13, please check the fig. 9 and descriptions before the figure is introduced, since there is an inconsistency in naming of figures and it also seems that the subfigures in figure 9 are mismatched.
Since application is relatively specific, and in general falls in the type of problems with small number of categories, what should be better highlighted in the text is the design choice for binary classifiers as the outputs of neural network.
Namely, in the text it is mentioned that this approach is more flexible and that it could allow for class specific adaptations, but other than this there is no additional rationale for such choice. Thus, please try to comment on this aspect of the design and provide some more results in the text.
For example, although the existing confusion matrices in figure 9 are based on choosing the category with the highest probability of “category detection” (positive outcome of binary classification), and we see that the results are quite good (meaning that there are very few misclassifications, even in the case without CBAM module), we cannot see how well each of the binary classifiers is performing.
Thus, some class specific metrics, and empirical ROC curves for each of binary classifiers, would be welcomed to help to better understand how well the system is performing. Also, since there are 500 test samples in total, it would mean that these empirical ROC curves and class specific metrics would be computed for 1:4 proportion of positive and negative samples, which is also important to know if the user would be interested in system operation and application in case of only some of the categories.
Please comment on all of these and try to improve the manuscript content in the given direction.
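As a rough illustration of the per-classifier ROC analysis requested above, the following sketch computes an empirical ROC curve and its area for one binary classifier from its confidence scores. The scores and labels are made-up toy values chosen to mimic the 1:4 positive-to-negative ratio (2 positives, 8 negatives); they are not results from the manuscript, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Empirical ROC for one binary classifier.

    scores: confidence of the positive class, one per sample.
    labels: 1 for the classifier's own category, 0 for all other categories.
    Returns (fpr, tpr) lists, sweeping the threshold from high to low.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    prev = None
    for i in order:
        # Emit a point only when the score changes, so ties share one threshold.
        if prev is not None and scores[i] != prev:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        prev = scores[i]
    fpr.append(fp / neg)
    tpr.append(tp / pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
               for k in range(1, len(fpr)))


# Toy one-vs-rest example with a 1:4 ratio (illustrative values only).
toy_scores = [0.95, 0.80, 0.70, 0.40, 0.35, 0.30, 0.20, 0.15, 0.10, 0.05]
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
toy_fpr, toy_tpr = roc_points(toy_scores, toy_labels)
toy_auc = auc(toy_fpr, toy_tpr)
```

Repeating this for each of the five classifiers would give one curve per category, each evaluated against the remaining four categories as negatives.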
Author Response
Comments 1:What is somewhat missing is rationale for interpretations of each of binary classifiers (provided on page 11 in the text), which should be somehow addressed in the revised manuscript (or the existing text should be adapted in order to emphasize that given interpretations are something that could be expected from the successfully working system, but not proven in the paper).
Response 1:We thank the reviewer for this insightful comment regarding the interpretation of the individual binary classifiers. Following this suggestion, we have carefully revised the relevant content on Page 11 to clarify the rationale and the scope of these interpretations.
In the revised manuscript, the motivation for adopting the multiple binary classification strategy is now explicitly discussed in terms of the strong inter-class visual similarity and ambiguous boundaries among different wheat seed categories. We emphasize that the role of each Sigmoid-based classifier is to independently learn a class-specific decision boundary in a one-vs-rest manner, which represents the expected behavior of a successfully functioning system rather than a directly validated internal mechanism.
To avoid over-interpretation, the wording has been adjusted to focus on design motivation and empirical effectiveness, and we clarify that the final category prediction is determined by selecting the classifier with the maximum confidence score, ensuring suitability for mutually exclusive classification tasks. In addition, a comparative experiment with a conventional Softmax-based multiclass classification head has been conducted and is now reported in Section 3.2, providing empirical support for the effectiveness of the proposed formulation.
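The decision rule described in this response can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the class order, the logit values, and the function names are assumptions made for the example.

```python
import math

# Category names follow the dataset description; their order is an assumption.
CLASSES = ["sound", "insect-damaged", "Fusarium-damaged", "moldy", "black-embryo"]


def sigmoid(z):
    """Logistic function mapping a logit to a confidence in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(logits):
    """Score each one-vs-rest head independently, then pick the most confident."""
    scores = [sigmoid(z) for z in logits]
    best = max(range(len(scores)), key=scores.__getitem__)
    return CLASSES[best], scores


# Hypothetical logits where three heads exceed 0.5 at the same time.
example_label, example_scores = predict([-2.0, 0.3, 1.2, -0.5, 0.9])
```

Because the heads are independent, their scores need not sum to one and several may exceed 0.5 simultaneously; taking the maximum resolves such conflicts into a single mutually exclusive prediction.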
Comments 2:Page 3, lines 103–106, please rephrase the second part of the sentence, in the current form it suggests that the network is identifying enhanced images, which was not the goal of the method in [20].
Response 2:We thank the reviewer for identifying this issue and agree that the original wording could lead to a misleading interpretation of the objective of the method proposed in [20]. Specifically, the previous phrasing may have implied that the classification network was designed to identify enhanced images themselves, rather than to perform recognition based on enhanced image inputs.
To address this concern, we have rephrased the relevant passage on Page 3, Lines 103–106. The revised text now clearly states that the FFDNet-VGG algorithm in [20] is employed to enhance terahertz images of unsound wheat grains in order to improve image quality, and that a CNN-based classification network is subsequently used to evaluate recognition performance based on these enhanced images. This clarification accurately reflects the intent of the referenced work and avoids any ambiguity regarding the role of image enhancement in the classification process. We believe that this revision eliminates the potential misunderstanding and improves the clarity and accuracy of the literature review.
Comments 3:Please correct the formatting of references throughout the text. These usually appear next to previous word, without spacing, e.g. text[ref], or with some added dots after ”]”.
Response 3:We thank the reviewer for pointing out these formatting issues. We have carefully reviewed the entire manuscript and corrected the citation formatting throughout the text. All in-text references have been revised to ensure proper spacing between the preceding word and the citation bracket, and any unintended punctuation or extra symbols following the closing bracket have been removed.
Comments 4:Page 3, line 128, and in several other places in the text, there is an error that the same adverb appears twice, e.g. “In 2025, In 2025”. Please correct and check why this systematically appears in the text.
Response 4: We thank the reviewer for identifying this issue. We acknowledge that duplicated adverbs, such as the repeated phrase “In 2025, In 2025” on Page 3, Line 128, were present in the original manuscript. To address this problem, we have carefully reviewed the entire text and corrected all instances of duplicated adverbs and similar repetitions. These errors were caused by an unintended formatting issue during manuscript editing, and they have now been fully resolved. The manuscript has been thoroughly checked to ensure that no similar repetitions remain.
Comments 5:Page 14 and 13, please check Fig. 9 and the descriptions before the figure is introduced, since there is an inconsistency in naming of figures and it also seems that the subfigures in Figure 9 are mismatched.
Response 5:We thank the reviewer for pointing out this inconsistency regarding Fig. 9. We carefully reviewed the figure numbering, in-text references, and the corresponding descriptions on Pages 13 and 14. In response, the naming of Fig. 9 and all related in-text citations have been corrected to ensure consistency throughout the manuscript. In addition, the subfigures within Fig. 9 have been carefully checked and reordered to match their descriptions in the text. The figure caption has also been revised accordingly to clearly reflect the correct correspondence between each subfigure and its description.
Comments 6:Namely, in the text it is mentioned that this approach is more flexible and that it could allow for class specific adaptations, but other than this there is no additional rationale for such choice. Thus, please try to comment on this aspect of the design and provide some more results in the text.
For example, although the existing confusion matrices in figure 9 are based on choosing the category with the highest probability of “category detection” (positive outcome of binary classification), and we see that the results are quite good (meaning that there are very few misclassifications, even in the case without CBAM module), we cannot see how well each of the binary classifiers is performing.
Thus, some class specific metrics, and empirical ROC curves for each of binary classifiers, would be welcomed to help to better understand how well the system is performing. Also, since there are 500 test samples in total, it would mean that these empirical ROC curves and class specific metrics would be computed for 1:4 proportion of positive and negative samples, which is also important to know if the user would be interested in system operation and application in case of only some of the categories.
Response 6:We thank the reviewer for this comprehensive and insightful comment. In the revised manuscript, we have expanded the methodological discussion to clarify the rationale for the multi-binary classification design. Specifically, each Sigmoid-based classifier independently learns a one-vs-rest decision boundary, allowing class-specific adaptation and reducing inter-class competition, which is beneficial given the strong visual similarity and heterogeneous defect characteristics among wheat seed categories. To provide empirical evidence for the performance of individual binary classifiers, class-specific Precision, Recall, and F1-score values have been computed for each category in a one-vs-rest manner, as reported in Tables 3 and 4. These metrics reflect classifier behavior under the 1:4 positive-to-negative ratio inherent to the test set. The consistently high scores across all classes indicate that each binary classifier reliably distinguishes its corresponding category from the rest. While the final system prediction is determined by selecting the class with the highest confidence score, these class-specific metrics allow readers to assess the reliability of each individual classifier.
We believe that these results sufficiently address the reviewer’s concern regarding the performance of each binary classifier and demonstrate the robustness of the proposed framework.
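For readers who want to reproduce this kind of one-vs-rest evaluation, a minimal sketch is given below. It assumes integer class labels and final argmax predictions; the toy labels are illustrative only and are not taken from Tables 3 and 4.

```python
def one_vs_rest_metrics(y_true, y_pred, cls):
    """Precision, recall and F1 for `cls` treated as positive, all others as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy labels for three classes (illustrative values only).
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
metrics_cls0 = one_vs_rest_metrics(toy_true, toy_pred, 0)
metrics_cls1 = one_vs_rest_metrics(toy_true, toy_pred, 1)
```

With five balanced categories, each class is evaluated against four times as many negatives as positives, which is the 1:4 ratio mentioned above.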
Reviewer 2 Report
Comments and Suggestions for Authors
- Publish the Dataset – to increase citation rates and scientific credibility, the authors should provide open access to the dataset (e.g., via platforms like Kaggle or Zenodo).
- Investigate Latency – include an analysis of inference time on both CPU and GPU to evaluate the model’s suitability for real-time applications.
- Conduct Cross-Validation – test the model on public datasets (such as GrainSpace) to benchmark the results against other global state-of-the-art developments.
Author Response
Comments 1:Publish the Dataset – to increase citation rates and scientific credibility, the authors should provide open access to the dataset (e.g., via platforms like Kaggle or Zenodo).
Response 1: We thank the reviewer for this valuable suggestion. We fully agree that open access to datasets enhances transparency, reproducibility, and scientific credibility. In response, the dataset used in this study has been made publicly available at Zenodo and can be accessed via the following DOI: https://doi.org/10.5281/zenodo.18222641. The repository includes detailed descriptions of the data structure and class annotations to facilitate reuse by the research community. We have also added this information to the revised manuscript to ensure readers can easily access the dataset.
Comments 2:Investigate Latency – include an analysis of inference time on both CPU and GPU to evaluate the model’s suitability for real-time applications.
Response 2:We thank the reviewer for this constructive suggestion. In response, we have measured the inference latency of the proposed EfficientNet-B1 + CBAM model on a CPU-only platform using real validation images with a batch size of one. After a warm-up phase, the average inference time was calculated over 100 runs. The results indicate that the model achieves an average inference time of 40.29 ms per image, with a standard deviation of 1.02 ms, corresponding to approximately 24.8 frames per second. These results demonstrate that the proposed method can meet near real-time requirements even without GPU acceleration. The latency analysis has been added to the revised manuscript in Section 3.4 to provide a clear evaluation of the model’s suitability for real-time applications.
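The measurement protocol described above (warm-up, then timing single-image inference over 100 runs) can be sketched as follows. This is only an outline under stated assumptions: `dummy_infer` is a hypothetical stand-in for the model's forward pass, the warm-up length of 10 is assumed because the response does not state it, and the sample standard deviation is used because the response does not say which estimator was applied.

```python
import statistics
import time


def benchmark(infer, sample, warmup=10, runs=100):
    """Return (mean_ms, std_ms, fps) for single-sample inference latency."""
    for _ in range(warmup):          # warm-up runs are executed but not timed
        infer(sample)
    times_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(sample)
        times_ms.append((time.perf_counter() - start) * 1000.0)
    mean_ms = statistics.mean(times_ms)
    std_ms = statistics.stdev(times_ms)
    return mean_ms, std_ms, 1000.0 / mean_ms


def dummy_infer(x):
    """Hypothetical placeholder workload standing in for the real model."""
    return sum(v * v for v in x)


mean_ms, std_ms, fps = benchmark(dummy_infer, [0.5] * 1000)
```

The frames-per-second figure is simply 1000 divided by the mean latency in milliseconds; for the reported 40.29 ms this gives 1000 / 40.29 ≈ 24.8, consistent with the value stated above.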
Comments 3:Conduct Cross-Validation – test the model on public datasets (such as GrainSpace) to benchmark the results against other global state-of-the-art developments.
Response 3:We thank the reviewer for this valuable suggestion. In the revised manuscript, we have added five-fold cross-validation results on our dataset, summarized in Table 8. The proposed model maintains consistently high macro-precision scores across all folds, with only minor performance variations. The low standard deviation indicates that the model’s performance is stable and not sensitive to a particular data split, highlighting the robustness and generalization capability of the proposed method. While publicly available datasets such as GrainSpace have not yet been evaluated, we acknowledge that benchmarking on these datasets would provide additional insight into the model’s performance relative to other state-of-the-art methods. We plan to conduct such cross-dataset evaluations in future work to further validate the generalization potential of the proposed attention-based architecture.
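A five-fold split of the kind summarized above could be generated along the following lines. The sketch assumes a stratified split, so that every fold keeps the class balance; the response does not state whether stratification was used, and the function names, seed, and toy label list are illustrative.

```python
import random
import statistics


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds while preserving class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin across the folds.
        for j, idx in enumerate(indices):
            folds[j % k].append(idx)
    return folds


def summarize(fold_scores):
    """Mean and sample standard deviation of a per-fold metric."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)


# Toy label list: 5 classes with 20 samples each (illustrative only).
toy_labels = [c for c in range(5) for _ in range(20)]
toy_folds = stratified_folds(toy_labels)
```

Each fold then serves once as the held-out set while the remaining four are used for training, and `summarize` would be applied to the five resulting macro-precision values.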
Reviewer 3 Report
Comments and Suggestions for Authors
The manuscript presents a successful application of deep learning in the field of agricultural quality control. The authors successfully optimized the EfficientNet-B1 architecture by integrating a Convolutional Block Attention Module (CBAM), resulting in a high-performance model for identifying defective wheat grains.
However, the following revisions must be made by the authors before the manuscript is accepted:
- Please clarify if the dataset includes different wheat varieties or if it was collected from a single species. This information is crucial for understanding the model's generalization capabilities across different agricultural contexts.
- While the comparison with other architectures is excellent, a more explicit ablation study showing the individual contribution of the CBAM module versus the modified classification head (MLP) would provide deeper insight into the source of the performance gains.
- Authors mention the model is lightweight and suitable for edge devices. Including data on "inference time per image" (latency) would significantly strengthen the argument for its application in real-time industrial sorting systems.
- In Figure 2 or similar plates showing grain types, please ensure the visual differences between "scabby," "moldy," and "sprouted" grains are easily visible to the naked eye in the printed version.
Author Response
Comments 1: Please clarify if the dataset includes different wheat varieties or if it was collected from a single species. This information is crucial for understanding the model's generalization capabilities across different agricultural contexts.
Response 1:We thank the reviewer for raising this important point. The defective wheat kernel dataset used in this study was collected from local farmers’ households. Since wheat from different cultivars is commonly stored together after harvesting, the dataset comprises kernels from multiple varieties. Despite the presence of multiple wheat varieties, the defective kernels exhibit consistent visual characteristics across cultivars, enabling unified analysis and classification. The dataset includes five categories of wheat kernels—sound kernels, insect-damaged kernels, Fusarium-damaged kernels, moldy kernels, and kernels with black embryos—as illustrated with representative examples in Fig. 1.
Comments 2:While the comparison with other architectures is excellent, a more explicit ablation study showing the individual contribution of the CBAM module versus the modified classification head (MLP) would provide deeper insight into the source of the performance gains.
Response 2:We thank the reviewer for this valuable suggestion. We agree that a detailed ablation study could provide additional insight into the relative contributions of the CBAM module and the modified classification head (MLP) to the overall performance gains. In response, the revised manuscript now includes an ablation analysis comparing the baseline EfficientNet-B1 model, the baseline with only the MLP-based classification head, and the full EfficientNet-B1 + CBAM model. The results, summarized in Table 3, Table 4 and Table 7, indicate that both the CBAM module and the modified classification head contribute to performance improvements, with the attention mechanism providing notable gains in precision and F1-score across multiple kernel categories.
Comments 3:Authors mention the model is lightweight and suitable for edge devices. Including data on "inference time per image" (latency) would significantly strengthen the argument for its application in real-time industrial sorting systems.
Response 3:We thank the reviewer for this constructive suggestion. In the revised manuscript, we have measured the inference latency of the proposed EfficientNet-B1 + CBAM model on a CPU-only platform using real validation images with a batch size of one. After a warm-up phase, the average inference time was calculated over 100 runs. The results indicate that the model achieves an average inference time of 40.29 ms per image, with a standard deviation of 1.02 ms, corresponding to approximately 24.8 frames per second. These results demonstrate that the proposed model can satisfy near real-time requirements even without GPU acceleration, supporting its suitability for deployment on edge devices in industrial sorting systems.
Comments 4:In Figure 2 or similar plates showing grain types, please ensure the visual differences between "scabby," "moldy," and "sprouted" grains are easily visible to the naked eye in the printed version.
Response 4:We thank the reviewer for this important suggestion. To improve clarity, we have revised Figure 2 and the corresponding plates to enhance the visual distinction between scabby, moldy, and sprouted grains. Adjustments include increasing image contrast, enlarging individual grain images, and providing clearer labels for each category. These modifications ensure that the differences among the grain types are easily discernible to the naked eye, even in the printed version of the manuscript.
Reviewer 4 Report
Comments and Suggestions for Authors
General Assessment
The manuscript addresses a topical and practically significant problem — the automated recognition of defective wheat kernels using deep learning techniques. The authors propose a modified EfficientNet-B1 architecture incorporating the CBAM attention mechanism and an original cascaded binary classification scheme for identifying five types of wheat grains.
The topic fully corresponds to the scope of the journal Applied Sciences. The manuscript is well structured, logically presented, and includes ablation studies and comparative experiments. Exceptionally high classification performance has been reported (up to 99.8%), which makes the study potentially attractive for publication.
However, in its current form, the manuscript requires substantial revision, primarily with regard to experimental validation and methodological justification.
Major Comments
- Insufficient dataset volume and domain diversity
Page 4–5, Section 2.1 Data Collection and Dataset Development
The dataset consists of only 300 real images per class, all captured under identical laboratory conditions, and expanded only by Gaussian blur and Poisson noise augmentation. With the reported classification accuracy approaching 99.8%, there is a high risk of overfitting and domain memorization. The current experimental setup does not demonstrate real generalization capability.
The authors are required to add cross-validation, external dataset testing, or explicitly discuss the limited generalization ability of the proposed model.
- Unjustified multi-binary classification scheme
Page 11–12, Section 2.5 Design of Multiple Binary Classifiers
The manuscript employs five independent Sigmoid binary classifiers for mutually exclusive classes. This multi-label formulation is not theoretically justified and deviates from standard Softmax-based multiclass classification. The manuscript does not explain how conflicting outputs are handled, nor why this architecture improves discrimination.
A comparative experiment with a conventional Softmax multiclass head must be provided.
- Over-optimistic performance without domain-shift evaluation
Page 14–15, Section 3.1 Results
The reported near-perfect accuracy (99–100%) is presented without cross-domain evaluation. Since all data were collected under strictly controlled conditions, the reported metrics represent only intra-domain performance and cannot be interpreted as industrial robustness.
A limitation analysis or additional validation under different illumination and background conditions is required.
- Lack of industrial-scale applicability demonstration
Page 17, Conclusions
All experiments were conducted on single-kernel images. No evaluation on bulk-grain scenes, overlapping kernels, or conveyor belt imagery is provided, making industrial applicability unclear.
A «Limitations and Future Work» subsection must be added.
- Duplicate figure numbering
Page 14–16
Figure 10 is duplicated. The second occurrence must be renumbered and all in-text references corrected.
- Incomplete hardware description
Page 12, Section 2.7
RAM size and GPU VRAM are not reported. Hardware configuration must be fully specified.
- Learning rate selection justification
Page 13, Section 3.1
The learning rate search range is not theoretically or empirically justified. Please add justification for the selected interval.
Comments on the Quality of English Language
Pages 1–19
The manuscript contains multiple grammatical and stylistic inconsistencies that require professional English editing.
Author Response
Comments 1:The dataset consists of only 300 real images per class, all captured under identical laboratory conditions, and expanded only by Gaussian blur and Poisson noise augmentation. With the reported classification accuracy approaching 99.8%, there is a high risk of overfitting and domain memorization. The current experimental setup does not demonstrate real generalization capability. The authors are required to add cross-validation, external dataset testing, or explicitly discuss the limited generalization ability of the proposed model.
Response 1: We thank the reviewer for this valuable suggestion. In the revised manuscript, we have added cross-validation results on our dataset, summarized in Table 8. The results show that the proposed model maintains consistently high macro-precision scores across all five folds, with only minor performance variations. The low standard deviation indicates that the observed performance is not sensitive to a particular data partition, highlighting the stability and robustness of the proposed method. Despite the relatively limited dataset size, these cross-validation results suggest that the model is unlikely to suffer from sample memorization and is capable of learning discriminative features with good generalization ability. Regarding external dataset evaluation, publicly available datasets such as GrainSpace contain images captured under different acquisition conditions. While evaluation on these datasets would provide additional benchmarking, the resulting domain shift may require model adaptation, and such evaluations were therefore not conducted in the current study.
Comments 2:The manuscript employs five independent Sigmoid binary classifiers for mutually exclusive classes. This multi-label formulation is not theoretically justified and deviates from standard Softmax-based multiclass classification. The manuscript does not explain how conflicting outputs are handled, nor why this architecture improves discrimination. A comparative experiment with a conventional Softmax multiclass head must be provided.
Response 2:We thank the reviewer for this constructive comment. To justify the use of the proposed multi-binary classification scheme, we conducted a comparative experiment between the Sigmoid-based multi-binary head and a conventional Softmax-based multiclass classification head. Both models share the same EfficientNet-B1 backbone with CBAM attention modules and were trained and evaluated on the same dataset under identical training configurations to ensure a fair comparison.
As shown in Table 5, the proposed multi-binary classification scheme consistently outperforms the conventional Softmax-based classifier in terms of macro-averaged precision and F1-score. This demonstrates that independent binary decision heads are more effective in capturing subtle inter-class differences among visually similar wheat seed categories. The revised manuscript now includes these results in Section 3.2 to clarify both the theoretical rationale and empirical benefits of the multi-binary classifier design.
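To make the difference between the two heads concrete, the sketch below evaluates the training objectives typically paired with each formulation on the same logits: categorical cross-entropy for a Softmax head and binary cross-entropy averaged over the heads for a Sigmoid-based multi-binary head. The response does not state which loss functions were used, so this pairing, the averaging over heads, and the function names are assumptions.

```python
import math


def softmax_ce(logits, target):
    """Cross-entropy of a Softmax head: classes compete through a shared normalizer."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def multi_binary_bce(logits, target):
    """Mean binary cross-entropy over independent Sigmoid heads (one-vs-rest targets)."""
    total = 0.0
    for k, z in enumerate(logits):
        y = 1.0 if k == target else 0.0
        # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)], with s = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# With all-zero logits, Softmax gives log(5) and each Sigmoid head gives log(2).
ce_uniform = softmax_ce([0.0] * 5, 2)
bce_uniform = multi_binary_bce([0.0] * 5, 2)
```

In the Softmax loss, raising one logit necessarily lowers the probability of every other class, whereas each term of the multi-binary loss depends on a single logit only.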
Comments 3:The reported near-perfect accuracy (99–100%) is presented without cross-domain evaluation. Since all data were collected under strictly controlled conditions, the reported metrics represent only intra-domain performance and cannot be interpreted as industrial robustness. A limitation analysis or additional validation under different illumination and background conditions is required.
Response 3:We thank the reviewer for this important comment. We acknowledge that the reported classification metrics are based on images collected under controlled laboratory conditions, and thus primarily reflect intra-domain performance. To provide additional insight into the model’s stability and mitigate concerns regarding overfitting, we have conducted cross-validation on the dataset. The results, summarized in Table 8, demonstrate consistently high macro-precision scores across all folds, with low standard deviation, indicating that the model is robust with respect to data partitioning. These additions, including the cross-validation results and an explicit discussion of limitations, clarify the scope of the reported metrics and acknowledge the need for further validation under more realistic operating conditions.
Comments 4:All experiments were conducted on single-kernel images. No evaluation on bulk-grain scenes, overlapping kernels, or conveyor belt imagery is provided, making industrial applicability unclear. A «Limitations and Future Work» subsection must be added.
Response 4:We thank the reviewer for this comment. In the current study, all experiments were conducted on single-kernel images, which allows precise control over image quality and kernel positioning and facilitates fine-grained classification analysis. We acknowledge that no evaluation was performed on bulk-grain scenes, overlapping kernels, or conveyor belt imagery. Accordingly, the industrial applicability of the current model is not directly demonstrated. This limitation has been explicitly noted in a new “Limitations and Future Work” subsection in the revised manuscript, clarifying the scope of the reported results and the context in which the high classification accuracy is achieved.
Comments 5:Figure 10 is duplicated. The second occurrence must be renumbered and all in-text references corrected.
Response 5:We thank the reviewer for pointing out this issue. The second occurrence of Figure 10 has been renumbered, and all in-text references have been updated accordingly throughout the manuscript. This ensures consistent figure numbering and correct cross-references.
Comments 6:RAM size and GPU VRAM are not reported. Hardware configuration must be fully specified.
Response 6:We thank the reviewer for this comment. The experiments were conducted on a system equipped with 16 GB of RAM and an AMD Radeon GPU with 512 MB of VRAM. This information has been added to Section 2.7 of the revised manuscript to fully specify the hardware configuration.
Comments 7:The learning rate search range is not theoretically or empirically justified. Please add justification for the selected interval.
Response 7:We thank the reviewer for this comment. The learning-rate search interval was determined based on both transfer-learning principles and empirical observations. Since the EfficientNet-B1 backbone was initialized with ImageNet pretrained weights, excessively large learning rates may disrupt pretrained feature representations, while overly small learning rates can lead to insufficient parameter updates and slow convergence. Therefore, a moderate range commonly adopted for fine-tuning deep convolutional neural networks was explored, specifically from 1×10⁻⁴ to 6×10⁻⁴. This explanation has been added to Section 3.1 of the revised manuscript to clarify the rationale behind the selected learning-rate interval.
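A search over this interval might be organized as in the sketch below. Only the endpoints 1×10⁻⁴ and 6×10⁻⁴ come from the response; the evenly spaced grid of six values and the `evaluate` callback (which would train the model and return a validation score) are assumptions made for illustration.

```python
def lr_candidates(low=1e-4, high=6e-4, n=6):
    """Evenly spaced learning-rate candidates between low and high, inclusive."""
    step = (high - low) / (n - 1)
    return [round(low + i * step, 10) for i in range(n)]


def pick_best(candidates, evaluate):
    """Return the candidate with the highest score reported by `evaluate`."""
    return max(candidates, key=evaluate)


grid = lr_candidates()


def toy_score(lr):
    """Hypothetical scoring function used only to exercise the search loop."""
    return -abs(lr - 3e-4)


best_lr = pick_best(grid, toy_score)
```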
Comments 8:The manuscript contains multiple grammatical and stylistic inconsistencies that require professional English editing.
Response 8:We thank the reviewer for this comment. The manuscript has been carefully revised to correct grammatical errors, improve sentence structure, and ensure stylistic consistency throughout the text. Specific improvements include unifying verb tenses, standardizing scientific terminology, correcting spelling and punctuation errors, and optimizing sentence clarity for readability.
Reviewer 5 Report
Comments and Suggestions for Authors
Manuscript title:
Imperfect wheat grain recognition based on EfficientNet and attention mechanism
This manuscript addresses an enhanced recognition method for defective wheat grains, based on the EfficientNet-B1 architecture. Building upon the original EfficientNet-B1 network structure, this approach incorporates the lightweight attention mechanism known as CBAM (Convolutional Block Attention Module) to augment the model's capacity to discern features in critical regions.
The findings show that the improved model achieves a classification accuracy of 99.80% on the test set, which represents a 2.6% improvement over its performance before the improvement. Additionally, the F1-score, Precision, and Recall have all shown notable gains. The classification accuracy for sound grains, black germ grains, and mouldy grains likewise reaches 100%, as does the accuracy for recognising scab-damaged and insect-damaged grains. The goal of this paper is to provide a dependable method for the early identification of fish diseases in aquaculture by improving detection accuracy and robustness in complicated underwater environments.
The classification performance of standard EfficientNet-B1 models was compared to that of models incorporating attention mechanisms. Results indicate that the EfficientNet-B1 model with attention mechanisms exhibited superior recognition accuracy compared to the model without attention mechanisms, achieving a high accuracy rate of 99.6%.
The cited references are relevant and well-integrated, reinforcing the manuscript’s content and underscoring its contribution to wheat grain detection, attention mechanisms, deep learning, and EfficientNet.
My suggestions are as follows:
The Introduction is missing a short description of the paper's content. This needs to be added.
Page 8, Line 296: In Figure 4, the FC1 and FC2 blocks are depicted. What they represent needs to be described. Also, what are the input and output blocks?
Page 10, Line 379: In Figure 7, the third block is depicted. What does it represent? This needs to be described.
Page 12, Line 451: This subsection title is missing a word: 3.1 and Learning Rate Optimization Model Testing. It needs to be updated.
The research results, which improve the precision of automated recognition of flawed grains, are important for the intelligent and automated classification of grain quality.
Technically, the manuscript is well-structured and methodologically sound. The experiments are thorough, with clear sections, mathematical details, ablation studies, and comparisons.
Wheat quality inspection directly affects food security, agricultural automation, and grain processing.
The manuscript is generally well written, but:
- Minor grammatical and stylistic refinements are still needed.
- Some sentences are overly formal or repetitive.
From a research contribution perspective, the core idea is: EfficientNet + CBAM for image classification. This combination has been well explored in the literature across various fields, including agriculture, medical imaging, and defect detection.
Achieving nearly 100% accuracy across almost all classes is unusual in real-world agricultural datasets. No external validation dataset is used, and no cross-dataset testing is reported. My suggestions are to add cross-validation or external testing. Include comparisons with at least one non-EfficientNet architecture to strengthen claims.
Therefore, before acceptance for publication, I recommend that the manuscript undergo a major revision.
Comments for author File:
Comments.pdf
Minor grammatical and stylistic refinements are still needed. Also, some sentences are overly formal or repetitive.
Author Response
Comments 1:The Introduction is missing a short description of the paper's content. This needs to be added.
Response 1:We thank the reviewer for this comment. A concise overview of the paper’s content has been added to the end of the Introduction. Specifically, the revised text summarizes that the study adopts the EfficientNet architecture as the backbone with a convolutional attention module and replaces the traditional single-head five-class classification with a multi-binary classification strategy. A curated dataset of five wheat kernel categories was constructed, and comparative experiments along with ablation studies were conducted to evaluate classification performance and stability, as well as to benchmark the proposed method against mainstream deep learning models. The primary goal is to develop a more accurate deep learning model for defective wheat kernel identification, providing an effective technical solution for intelligent wheat quality inspection and grading.
Comments 2:Page 8, Line 296: In Figure 4, the FC1 and FC2 blocks are depicted. What they represent needs to be described. Also, what are the input and output blocks?
Response 2:We thank the reviewer for this comment. In the revised manuscript, we have clarified the meaning of the blocks in Figure 4. The FC1 and FC2 blocks correspond to the two fully connected layers in the Squeeze-and-Excitation (SE) module. The input block represents the feature map obtained from the preceding convolutional and attention modules, while the output block corresponds to the recalibrated feature map after channel-wise weighting, which is fed into the subsequent network layers for classification. These descriptions have been added to both the figure caption and the main text, ensuring that readers can clearly understand the structure and function of each block in Figure 4.
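For readers who want the block structure spelled out, the SE computation described in Response 2 can be sketched roughly as below. This is a minimal pure-Python illustration and not the authors' implementation: the function name `se_block`, the weight layouts, and the ReLU activation between FC1 and FC2 are assumptions made for the example (EfficientNet implementations may use a different activation).

```python
import math

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-Excitation sketch (hypothetical helper).

    x  : input feature map as nested lists, shape [C][H][W]
    w1 : FC1 weights, shape [C_reduced][C]; b1 : FC1 biases, length C_reduced
    w2 : FC2 weights, shape [C][C_reduced]; b2 : FC2 biases, length C
    Returns (channel_weights, recalibrated_feature_map).
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    # Squeeze: global average pooling gives one value per channel.
    z = [sum(sum(row) for row in x[c]) / (H * W) for c in range(C)]
    # FC1 followed by ReLU (activation choice is an assumption).
    h = [max(0.0, sum(w * v for w, v in zip(w1[r], z)) + b1[r])
         for r in range(len(w1))]
    # FC2 followed by a sigmoid gives per-channel weights in (0, 1).
    s = [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(w2[c], h)) + b2[c])))
         for c in range(C)]
    # Output block: the input feature map rescaled channel by channel.
    out = [[[x[c][i][j] * s[c] for j in range(W)] for i in range(H)]
           for c in range(C)]
    return s, out
```

In this sketch the "input block" of Figure 4 corresponds to `x` and the "output block" to `out`; the channel weights `s` are what FC1 and FC2 jointly produce.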
Comments 3:Page 10, Line 379: In Figure 7, the third block is depicted. What does it represent? This needs to be described.
Response 3:We thank the reviewer for this comment. In the revised manuscript, we have clarified that the third block in Figure 7 represents the convolutional layer in the spatial attention submodule of the CBAM (Convolutional Block Attention Module). This layer takes the pooled feature maps (from max-pooling and average-pooling) as input and generates the spatial attention map, which highlights salient regions of the feature map. The attention map is then multiplied with the input feature map to emphasize important spatial locations for improved feature representation and classification. A description of this block has been added to both the figure caption and the main text to ensure readers can clearly understand its role in Figure 7.
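The spatial attention step described in Response 3 can be illustrated with a short sketch. It is written in plain Python for readability and is not taken from the manuscript: the function name `spatial_attention`, the ordering of the pooled maps (average first, then max), the zero padding, and the kernel size used by the caller are all assumptions.

```python
import math

def spatial_attention(x, kernel, bias=0.0):
    """CBAM-style spatial attention sketch (hypothetical helper).

    x      : feature map as nested lists, shape [C][H][W]
    kernel : convolution weights, shape [2][k][k] with odd k;
             plane 0 acts on the average-pooled map, plane 1 on the max-pooled map
    Returns (attention_map [H][W], reweighted_feature_map [C][H][W]).
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    k = len(kernel[0])
    pad = k // 2
    # Pool across the channel axis to obtain two H x W maps.
    avg = [[sum(x[c][i][j] for c in range(C)) / C for j in range(W)] for i in range(H)]
    mx = [[max(x[c][i][j] for c in range(C)) for j in range(W)] for i in range(H)]
    pooled = [avg, mx]
    attn = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = bias
            # Convolution over the two pooled maps with zero padding.
            for p in range(2):
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < H and 0 <= jj < W:
                            acc += kernel[p][di][dj] * pooled[p][ii][jj]
            # Sigmoid squashes the response into a (0, 1) attention weight.
            attn[i][j] = 1.0 / (1.0 + math.exp(-acc))
    # Multiply the attention map with every channel of the input.
    out = [[[x[c][i][j] * attn[i][j] for j in range(W)] for i in range(H)]
           for c in range(C)]
    return attn, out
```

Under these assumptions, the "third block" of Figure 7 corresponds to the nested convolution loop, and the final list comprehension is the element-wise multiplication that emphasizes salient spatial locations.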
Comments 4:Page 12, Line 451: This subsection title is missing a word: 3.1 and Learning Rate Optimization Model Testing. It needs to be updated.
Response 4:We thank the reviewer for pointing out this issue. The title of Subsection 3.1 has been corrected in the revised manuscript to “3.1 EfficientNet-B1-CBAM Model Evaluation and Learning Rate Optimization Results”, ensuring that it accurately reflects the content of the section. All in-text references to this subsection have also been updated accordingly.
Comments 5:The manuscript is generally well written, but: Minor grammatical and stylistic refinements are still needed. Some sentences are overly formal or repetitive.
Response 5:We thank the reviewer for this constructive feedback. The manuscript has been carefully reviewed again to correct minor grammatical issues, eliminate repetitive expressions, and improve sentence flow. Stylistic refinements were applied to enhance readability while maintaining scientific accuracy, ensuring that the text is clear, concise, and accessible to the readers.
Comments 6:Achieving nearly 100% accuracy across almost all classes is unusual in real-world agricultural datasets. No external validation dataset is used, and no cross-dataset testing is reported. My suggestions are to add cross-validation or external testing. Include comparisons with at least one non-EfficientNet architecture to strengthen claims.
Response 6:We thank the reviewer for this comment. To evaluate model stability and generalization within the available dataset, cross-validation was conducted across five folds. As summarized in Table 8, the proposed model maintains consistently high macro-precision scores across all folds, with low standard deviation, indicating that the performance is robust with respect to different data partitions and not dependent on specific samples. Additionally, the proposed model was compared with several mainstream deep learning architectures, including ResNet and VGG variants, under identical training and evaluation settings. These additions provide empirical evidence of the model’s stability and comparative performance within the current dataset, while clarifying that reported results are based on intra-dataset evaluation.
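As an illustration of the fold-level summary mentioned in Response 6, the sketch below shows one common way to compute macro-precision for a fold and then the mean and sample standard deviation across folds. The helper names `macro_precision` and `summarize_folds` are hypothetical, and the convention of scoring a class with no predictions as 0 is an assumption; the values reported in Table 8 are not reproduced here.

```python
from statistics import mean, stdev

def macro_precision(y_true, y_pred, labels):
    """Unweighted mean of per-class precision (hypothetical helper).

    A class that is never predicted contributes 0.0 (assumed convention).
    """
    per_class = []
    for c in labels:
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        predicted = sum(1 for p in y_pred if p == c)
        per_class.append(true_pos / predicted if predicted else 0.0)
    return sum(per_class) / len(per_class)

def summarize_folds(fold_scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return mean(fold_scores), stdev(fold_scores)
```

A low standard deviation from `summarize_folds` over the five per-fold scores is what supports the claim that performance does not depend on a particular data partition.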
Comments 7:Minor grammatical and stylistic refinements are still needed. Also, some sentences are overly formal or repetitive.
Response 7:We thank the reviewer for this constructive feedback. The manuscript has been carefully reviewed again to correct minor grammatical issues, eliminate repetitive expressions, and improve sentence flow. Stylistic refinements were applied to enhance readability while maintaining scientific accuracy, ensuring that the text is clear, concise, and accessible to the readers.
Round 2
Reviewer 4 Report
Comments and Suggestions for Authors
Our comments and recommendations have been taken into account when updating the manuscript. The manuscript now has a sound scientific and methodological foundation. We wish you continued success!
Reviewer 5 Report
Comments and Suggestions for Authors
I do not have any other suggestions.