An Improved Single-Stage Object Detection Model and Its Application to Oil Seal Defect Detection
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper proposes an efficient oil seal defect detection model based on an improved YOLOv11 architecture, achieving performance gains by introducing the Kolmogorov-Arnold Network (KAN) and designing new convolutional modules, among other methods. However, the paper still has the following issues:
1. The images in the manuscript are blurred; it is recommended to use vector graphics.
2. The font size in Figure 1 is too small, and key parts are not clearly described (e.g., what is the purple cube?). It is suggested to supplement figure captions to clarify the content of the image.
3. Please check whether the section numbering of "2.2. Key Technologies for Real-time Object Detection" is correct. The knowledge distillation model seems irrelevant to the design of the paper. It is recommended to add references related to lightweight network structures, such as bio-inspired object detection models.
4. Have the two pathways of NAPConv been verified through ablation experiments and visualization to ensure their respective effects are consistent with the descriptions in the manuscript?
5. Figure 5 should show the specific structure of the designed C2f_Dynamic, clearly presenting the specific application positions of MADConv, CondConv, and ExpertFusion Conv2d.
6. It is recommended that the authors supplement the quantitative analysis of the model's false positive rate and false negative rate in the experimental section.
7. The YOLO series has multiple versions (e.g., n, s, m). Please specify which version of YOLO is used in the manuscript (the parameter count of the YOLO11m mentioned in the abstract seems inconsistent with that reported in the experiments).
8. It is recommended that the authors compare their model with other advanced models.
9. It is suggested to supplement the visualization of the network's detection effects and compare it with the original YOLO11n.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript shows solid innovation and has strong potential for publication. However, the missing dataset details, the absence of segmentation baselines, and the lack of interpretability experiments must be addressed before acceptance.
- Figures 6, 7, and 10 are informative but too small to read clearly.
- Several architectural components are described in lengthy paragraphs. Break into stepwise descriptions or pseudo-code.
- Please provide numbers per defect category, per severity level, and per sample type. Without this, the model’s robustness and ability to handle class imbalance cannot be evaluated.
- To clarify the motivation behind KANConv2d, please explain the advantages compared to other nonlinear feature extractors (e.g., MLP-Mixer, SIREN).
- To include detailed failure-case analysis, please provide examples and explanations for false positives/false negatives across defect types.
- Is it possible to expand the baseline comparisons to include segmentation models such as Mask R-CNN and U-Net, given the small size of the defects?
- To provide a more detailed explanation of the defect scoring scheme, clarify how thresholds were derived, and how scores correlate with industrial QC standards.
- Tables require consistent formatting (e.g., units, significant digits).
- Some figures require more descriptive captions.
Some grammatical inconsistencies need editing.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The manuscript proposes an improved single-stage oil seal defect detection model based on an enhanced YOLOv11 architecture with multiple innovations, including KANConv2d, NAPConv, and several dynamic convolution variants. The work is well-motivated for real-time industrial inspection and demonstrates clear improvements in accuracy and computational cost. The methodology is sound, and the experiments are comprehensive, including comparisons across multiple YOLO versions and ablation studies. The figures and tables—including Figures 1–10 and Tables 1–5—are informative and help illustrate the progression of the proposed model. However, there are several areas where the paper would benefit from clarification, deeper explanation, and improved organization. Some details of key modules, dataset properties, and hyperparameter decisions require more explicit justification. Additionally, the English writing can be polished to enhance precision and readability. Below are detailed Major and Minor comments. I kindly request the authors to provide detailed, point-by-point responses to each of the comments below, clarifying how the concerns have been addressed or will be addressed in a revised version of the manuscript.
Major Comments:
1. Clarify the contribution relative to existing lightweight YOLO improvements. The Introduction (pages 1–2) references many dynamic convolution and lightweight architectures, but the novelty of combining NAPConv + KANConv2d + ExpertFusionConv2d is not clearly distinguished from existing approaches. A more explicit statement is needed explaining what is fundamentally new and what gaps in recent YOLOv11-based research this work fills.
2. Provide more detail on dataset distribution and class imbalance. Pages 4–5 describe the dataset, but the exact distribution of defect classes (Dent, Burr, Distortion, Scratch, Scoop) is not provided. Since the model heavily relies on small-defect sensitivity, class frequencies and imbalance strategies should be explicitly discussed.
3. Explain the motivation for KANConv2d placement at the 11th backbone layer. The paper states this (page 11) but does not provide rationale for why this layer, specifically, benefits from KAN-based nonlinear feature extraction. Additional justification or ablation at alternative placements would strengthen the design argument.
4. Clarify the design rationale behind NAPConv. Figure 4 illustrates the module, but the text (page 6) does not explain why asynchronous pooling improves feature aggregation for oil seal defects. A discussion connecting defect characteristics (small, low contrast) to receptive field mixing would be helpful.
5. Expand the explanation of the Proposed Quality Assessment Scoring. Equations (1)–(3) on page 4 define the scoring rules, but the threshold of 60 appears arbitrary and lacks justification. Please explain how this threshold was chosen, its statistical basis, and whether it generalizes across different manufacturers or lighting conditions.
6. Include more information about training stability and convergence. The paper reports results but does not show loss curves or mention whether any models diverged (especially dynamic convolution variants). Given the complexity of MADConv and ExpertFusionConv2d, readers would benefit from insight into convergence behavior.
7. Justify the choice of 500 training epochs. Table 1 specifies 500 epochs, which is unusually large for YOLO-based models. Were early stopping or plateau criteria applied? Was overfitting observed? Please elaborate.
8. Provide more in-depth comparison with YOLOv11m or YOLOv11n variants. The paper compares YOLOv10–13, but does not compare against YOLOv11m or YOLOv11n baselines with similar computational budgets. This would help contextualize the proposed model’s efficiency compared to standard Ultralytics scales.
9. Expand on the interpretation of Figures 9 and 10. Figures 9 and 10 (pages 11–12) present visual comparisons but lack detailed explanation. For example, Figure 9 shows heatmaps, but the authors should describe the observed strengths/weaknesses of the model across defect types.
10. Improve the clarity of the ablation study narrative. Table 4 (page 10) provides detailed ablation results, but the corresponding text is brief. Provide a clearer breakdown of how each module contributes individually and synergistically, especially since the final model combines three complex modules.
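Regarding Major Comment 5: since Equations (1)–(3) are not reproduced in this report, the following is a purely hypothetical sketch of a weighted-penalty scoring rule with a pass threshold of 60, illustrating the kind of explicit, checkable scheme whose thresholds and weights the authors are being asked to justify. All defect names and penalty values here are illustrative assumptions, not taken from the manuscript.

```python
# Hypothetical weighted-penalty quality score (illustrative only; the
# penalties and threshold are NOT the manuscript's actual values).
DEFECT_PENALTIES = {"dent": 15, "burr": 10, "distortion": 20,
                    "scratch": 8, "scoop": 12}
PASS_THRESHOLD = 60  # the threshold whose statistical basis is queried above

def quality_score(detected_defects):
    """Start from 100 and subtract a fixed penalty per detected defect,
    clamping the result at zero."""
    score = 100
    for defect in detected_defects:
        score -= DEFECT_PENALTIES.get(defect, 0)
    return max(score, 0)

def passes_qc(detected_defects):
    """A part passes quality control if its score meets the threshold."""
    return quality_score(detected_defects) >= PASS_THRESHOLD
```

Under such a scheme, justifying the threshold would amount to showing how often scores near 60 separate acceptable from rejected parts on held-out inspection data.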
Minor Comments:
1. Improve grammar and clarity throughout the manuscript. Some sentences (e.g., page 3 lines 173–182) are long and difficult to follow. A language polishing pass is recommended.
2. Add references for the data augmentation methods used. Gaussian blur, perspective transform, and occlusion strategies should reference standard augmentation frameworks or prior works.
3. Clarify whether all images were RGB or grayscale. Page 4 mentions camera configuration but not whether images were preprocessed into RGB channels.
4. Provide bounding box examples in Figure 2. Figure 2 shows oil seal images but including annotated bounding boxes for defects would make the dataset clearer.
5. Add inference time results on Jetson Nano. Page 8 mentions Jetson Nano deployment but does not report FPS or latency. This is crucial for real-time applications.
6. Use consistent naming for the proposed model. It is sometimes referred to as "optimized YOLOv11," "CDK," or "Base_C2f_Dynamic_KANConv2d_NAPConv." Choose one official name.
7. Include the FLOPs and parameter count of YOLOv11 official variants. This helps contextualize the efficiency gains claimed.
8. Clarify whether softmax routing temperature in CondConv was tuned. Page 7 notes a temperature parameter but does not specify the chosen value.
9. Improve resolution and readability of Figures 3, 4, 6, 7, 8. Some module diagrams (pages 5–8) have small text that is difficult to read.
10. Add discussion of potential limitations. Although Section 6 briefly mentions future work, the manuscript should also explicitly describe limitations such as dataset homogeneity, lighting variations, or potential overfitting risks.
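As background to Minor Comment 8: the temperature parameter in CondConv-style routing typically scales the logits before the softmax that produces per-expert mixing weights. The sketch below shows the standard form of temperature-scaled softmax routing; the function and variable names are illustrative and not taken from the paper, which is why reporting the chosen value matters for reproducibility.

```python
import math

def softmax_routing(logits, temperature=1.0):
    """Temperature-scaled softmax over per-expert routing logits.
    Lower temperature sharpens the distribution toward the top expert;
    higher temperature mixes the experts more uniformly."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Example: routing weights over three hypothetical experts.
weights = softmax_routing([2.0, 1.0, 0.5], temperature=0.5)
```

Because the temperature directly controls how close the routing is to a hard expert selection, results can change noticeably with its value, so the tuned setting should be stated.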
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have moderately replied to all my questions.
Comments on the Quality of English Language
Some grammatical inconsistencies need editing.
Reviewer 3 Report
Comments and Suggestions for Authors
Thank you for your detailed and constructive revision. I have carefully reviewed your reply letter and the revised manuscript. Your responses address each of the reviewers’ comments thoroughly and with clarity. In addition, the requested modifications—including updates to the introduction, dataset description, methodological explanations, ablation analysis, convergence curves, figures, and language quality—have been fully and accurately incorporated into the revised manuscript. Overall, the revision is comprehensive, technically sound, and significantly improves the quality and clarity of the paper. I am satisfied with the authors’ responses and the updated manuscript. I believe the paper is now suitable for publication after the editorial checks.