Research on a Unified Multi-Type Defect Detection Method for Lithium Batteries Throughout Their Entire Lifecycle Based on Multimodal Fusion and Attention-Enhanced YOLOv8
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The manuscript addresses an important and timely industrial problem, namely real-time defect detection in lithium-ion batteries using deep learning, with an emphasis on multimodal data (visible light and X-ray) and edge deployment constraints. The proposed enhancement of YOLOv8 through channel attention (SE) and a multi-scale fusion module (MFM), combined with ablation studies, baseline comparisons, and a field verification experiment, demonstrates a clear effort toward practical applicability. However, the manuscript requires major revisions before it can be considered for publication. The main concerns relate to the clarity and rigor of the multimodal fusion strategy, experimental reproducibility, methodological consistency, justification of the chosen detection framework, and overall presentation quality.
Major Comments
1. Multimodal Fusion Strategy Is Insufficiently Defined
The manuscript claims to exploit both visible-light and X-ray data, but the multimodal fusion pipeline is not clearly or consistently described.
- The input is described as a standard 3-channel RGB image, which contradicts the stated multimodal setting.
- The phrase “random channel fusion of visible and X-ray images” is ambiguous and does not allow reproduction.
- The SE module is sometimes implicitly presented as enabling modality fusion, although SE is fundamentally a channel recalibration mechanism rather than a multimodal fusion operator.
- Clearly specify the fusion level (data-level, feature-level, decision-level, or dual-stream architecture).
- Explicitly describe the input representation (number of channels, normalization, alignment between visible and X-ray images).
- Add a clear schematic illustrating the complete multimodal pipeline from data acquisition to the detection head.
2. Insufficient Justification of the YOLOv8 Choice
The manuscript adopts YOLOv8 as the core detection framework, but the rationale for this choice is not sufficiently justified.
- It is unclear why YOLOv8 was preferred over other state-of-the-art detection approaches (e.g., Faster R-CNN variants, RetinaNet, EfficientDet, or transformer-based detectors).
- The advantages of YOLOv8 with respect to small defect detection, multimodal fusion, or industrial robustness are not explicitly discussed.
- Provide a clear justification for selecting YOLOv8, supported by references or empirical arguments.
- Clarify whether the choice was driven primarily by performance, deployment constraints, architectural properties, or implementation maturity.
3. Experimental Protocol and Rigor of Comparisons
The comparative evaluation against YOLOv5 and YOLOv7 is relevant, but it is not sufficient to establish the generality or robustness of the proposed method.
- Training hyperparameters, data augmentations, initialization strategies (pretrained vs. from scratch), and random seeds are not fully specified.
- It is unclear whether identical training settings were used for all baselines.
- The comparison is limited to YOLO-family models, which restricts the scientific scope.
- Provide a complete and unified description of training settings for all compared models.
- Clearly state whether identical hyperparameters and augmentation strategies were used.
- If feasible, include comparisons with at least one non-YOLO baseline (e.g., RetinaNet, Faster R-CNN, EfficientDet, or a lightweight transformer-based detector).
- Report results over multiple runs (mean ± standard deviation) or explicitly state fixed random seeds.
4. Dataset Description and Annotation Protocol Need Strengthening
Although overall dataset sizes and a train/validation/test split are provided, essential details are missing.
- Defect class definitions and per-class sample distributions are not reported.
- It is unclear whether visible and X-ray images are paired at the instance level or used independently.
- The annotation process and the meaning of “annotation accuracy ≥ 98%” are not formally defined.
- Include a table summarizing defect classes, modalities used, number of samples per class, and data splits.
- Clarify whether multimodal samples are paired and how missing modalities are handled.
- Describe the annotation protocol (number of annotators, validation procedure, agreement metric).
5. Metrics and Definitions Contain Errors and Omissions
Several metric definitions are incomplete or inconsistent.
- Recall and F1-score are not formally defined, although they are reported.
- There is a typographical error in the precision definition (FN used instead of FP).
- mAP@0.5:0.95 is defined but not consistently reported in the results.
- Correct and complete all metric definitions.
- Report mAP@0.5:0.95 in addition to mAP@0.5
- Consider providing per-class AP results or a confusion matrix to support qualitative claims.
6. Terminology Inconsistency Regarding “SE”
In the conclusion, “SE” appears to be interpreted as “Supervision-Enhanced,” whereas throughout the manuscript it refers to “Squeeze-and-Excitation.” This inconsistency is confusing. Use a single, consistent definition of SE throughout the manuscript.
7. Reproducibility, Data, and Code Availability
At the current stage, the manuscript does not provide sufficient information to ensure full reproducibility.
- The dataset used in this study (visible-light and X-ray battery defect images) is not publicly available, and no clear statement is provided regarding data accessibility or usage restrictions.
- The source code implementing the proposed YOLOv8-based architecture, including the SE and MFM modules and the training pipeline, is not shared.
The authors are strongly encouraged to make the dataset publicly available, or at least partially accessible, subject to industrial or confidentiality constraints.
If full data release is not possible, the limitations should be clearly stated and sufficient metadata provided to enable replication. The authors should also consider releasing the source code or a well-documented implementation to allow independent validation.
Providing access to data and code would significantly enhance transparency, reproducibility, and scientific credibility.
8. Presentation Quality and Figure Readability
The manuscript contains several presentation issues that affect readability and clarity.
Figures 2 to 5 are too small and lack sufficient resolution; they should be redrawn or enlarged.
Tables contain typographical artifacts and broken words.
Figure and table references are sometimes inconsistent (e.g., Table 3 labeled as ablation while presenting field verification results).
Residual template elements remain (e.g., “Journal Not Specified,” empty received/revised dates).
Redesign Figures 2–5 with improved resolution and layout.
Carefully revise all tables and captions to ensure clarity and self-containment.
Remove all template placeholders before resubmission.
Minor Comments
The ADown operation in the MFM module should be explicitly defined.
Several typographical errors remain (e.g., “tin sine test set”).
Clarify how FPS is measured (inference only vs. full pipeline including preprocessing and NMS).
Add qualitative examples illustrating typical failure cases.
The manuscript presents a promising and practically motivated contribution. However, major revisions are required, particularly regarding the justification of the chosen detection framework, the rigor of comparative evaluations, the clarity of the multimodal fusion strategy, the availability of data and code, and the overall presentation quality. Addressing these points would significantly strengthen the manuscript and its suitability for publication in Sensors.
The overall English language quality of the manuscript is understandable, and the technical content can generally be followed. However, the manuscript requires significant language revision to meet the standards of an international scientific journal.
Several issues are observed throughout the paper, including grammatical errors, awkward sentence structures, inconsistent terminology, typographical mistakes, and imprecise technical phrasing. In addition, some sentences are overly long or unclear, which affects readability and scientific precision.
It is strongly recommended that the manuscript be thoroughly revised by a fluent or native English speaker, or by a professional scientific editing service, before resubmission. Improving the clarity, consistency, and conciseness of the English will substantially enhance the overall quality and impact of the manuscript.
Author Response
Response to Reviewers
Dear Editors and Reviewers,
We really appreciate your precious time in reviewing our paper and providing valuable comments. Your valuable and insightful comments have led to improvements in the current version. The authors have carefully considered the comments and tried our best to address every one of them. The detailed corrections are listed below.
Comments 1:Multimodal Fusion Strategy Is Insufficiently Defined
The manuscript claims to exploit both visible-light and X-ray data, but the multimodal fusion pipeline is not clearly or consistently described.
The input is described as a standard 3-channel RGB image, which contradicts the stated multimodal setting.
The phrase “random channel fusion of visible and X-ray images” is ambiguous and does not allow reproduction.
The SE module is sometimes implicitly presented as enabling modality fusion, although SE is fundamentally a channel recalibration mechanism rather than a multimodal fusion operator.
Clearly specify the fusion level (data-level, feature-level, decision-level, or dual-stream architecture).
Explicitly describe the input representation (number of channels, normalization, alignment between visible and X-ray images).
Add a clear schematic illustrating the complete multimodal pipeline from data acquisition to the detection head.
Response 1: We sincerely appreciate your rigorous comments and valuable suggestions. We have clarified that the multimodal fusion strategy adopts dual fusion at the feature level and channel level: the input is a 4-channel representation (3 RGB channels + 1 X-ray channel) aligned via SIFT feature matching (with an alignment error ≤ 1 pixel). The X-ray channel is normalized to the range [0,1] using min-max normalization, while the visible light channels follow standard RGB normalization (means: 0.485/0.456/0.406; standard deviations: 0.229/0.224/0.225). We have added a complete schematic diagram of the multimodal workflow, which clearly illustrates the end-to-end pipeline from data acquisition, alignment and normalization, and dual-mode dynamic fusion (feature-level concatenation and channel-level expansion switched with a 50% probability each) to the detection head. It is explicitly clarified that the SE module is only responsible for channel weight calibration rather than serving as the fusion core, completely resolving the ambiguity in the original description. We sincerely thank you again for your guidance, which has effectively improved the clarity and reproducibility of our research.
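For illustration, a minimal sketch of this modality-specific normalization and 4-channel input construction is given below. This is not our released implementation; the function names are illustrative, and the handling of the 50% fusion-mode switch is a simplified assumption.

```python
# Illustrative sketch of the preprocessing described above (assumes an already
# aligned visible/X-ray pair); not the code used in the paper.
import random
import numpy as np
import torch

RGB_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
RGB_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_pair(rgb_u8: np.ndarray, xray: np.ndarray):
    """rgb_u8: HxWx3 uint8 visible image; xray: HxW X-ray image (any numeric dtype)."""
    rgb = rgb_u8.astype(np.float32) / 255.0
    rgb = (rgb - RGB_MEAN) / RGB_STD                          # standard RGB normalization
    xr = xray.astype(np.float32)
    xr = (xr - xr.min()) / (xr.max() - xr.min() + 1e-8)       # min-max scaling to [0, 1]
    return rgb, xr

def build_input(rgb_u8: np.ndarray, xray: np.ndarray) -> torch.Tensor:
    """Stack the aligned pair into a 4-channel tensor (3 RGB + 1 X-ray)."""
    rgb, xr = normalize_pair(rgb_u8, xray)
    fused = np.concatenate([rgb, xr[..., None]], axis=-1)     # H x W x 4
    return torch.from_numpy(fused).permute(2, 0, 1).contiguous()  # 4 x H x W

# During training, the fusion mode could be switched with 50% probability, e.g.
# channel-level expansion (the 4-channel stack above) vs. keeping the modalities
# separate for feature-level concatenation inside the network (an assumption).
use_channel_level = random.random() < 0.5
```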
Comments 2:Insufficient Justification of the YOLOv8 Choice
The manuscript adopts YOLOv8 as the core detection framework, but the rationale for this choice is not sufficiently justified.
It is unclear why YOLOv8 was preferred over other state-of-the-art detection approaches (e.g., Faster R-CNN variants, RetinaNet, EfficientDet, or transformer-based detectors).
The advantages of YOLOv8 with respect to small defect detection, multimodal fusion, or industrial robustness are not explicitly discussed.
Provide a clear justification for selecting YOLOv8, supported by references or empirical arguments.
Clarify whether the choice was driven primarily by performance, deployment constraints, architectural properties, or implementation maturity.
Response 2:Thank you for your comments. Our selection of YOLOv8 as the core framework is primarily based on four key empirically supported advantages: firstly, its modular architecture enables seamless integration of the SE and MFM modules to support multimodal fusion, outperforming the adaptation efficiency of Faster R-CNN’s two-stage architecture and Transformer-based detectors; secondly, its anchor-free design is inherently suitable for micro-defect detection, with a small-defect recall rate 8.3% higher than RetinaNet and 4.1% higher than EfficientDet; thirdly, its real-time and lightweight characteristics perfectly meet the low-latency requirement (≤50ms) and edge deployment needs of industrial scenarios, featuring 35% fewer parameters and 22% faster inference speed compared with EfficientDet; finally, its mature ecosystem supports fast transfer learning and deployment, significantly reducing the costs of industrial implementation. This choice effectively balances performance, deployment feasibility, and scalability, making YOLOv8 the optimal solution for the task of lithium battery defect detection.
Comments 3:Experimental Protocol and Rigor of Comparisons
The comparative evaluation against YOLOv5 and YOLOv7 is relevant, but it is not sufficient to establish the generality or robustness of the proposed method.
Training hyperparameters, data augmentations, initialization strategies (pretrained vs. from scratch), and random seeds are not fully specified.
It is unclear whether identical training settings were used for all baselines.
The comparison is limited to YOLO-family models, which restricts the scientific scope.
Provide a complete and unified description of training settings for all compared models.
Clearly state whether identical hyperparameters and augmentation strategies were used.
If feasible, include comparisons with at least one non-YOLO baseline (e.g., RetinaNet, Faster R-CNN, EfficientDet, or a lightweight transformer-based detector).
Report results over multiple runs (mean ± standard deviation) or explicitly state fixed random seeds.
Response 3:Thank you for your rigorous suggestions. We have supplemented and improved the experimental protocol and comparison rigor as follows: Firstly, we have publicly shared the complete training settings of all comparative models, including the AdamW optimizer (initial learning rate = 0.01), batch size of 32, 300 training epochs, data augmentation strategies (geometric/pixel/modal fusion augmentation), and pretraining initialization (COCO pretrained weights), etc.; secondly, we have clarified that all baseline models and our proposed model adopt identical training configurations to ensure fair comparison; thirdly, we have added RetinaNet and EfficientDet as non-YOLO baselines and supplemented the corresponding experimental results; fourthly, all performance metrics are reported as the mean ± standard deviation of three independent runs (with a fixed random seed = 42). These supplements have been integrated into the revised manuscript, enhancing the reproducibility and scientific rigor of the study.
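As a compact illustration of these unified settings, the configuration and seed fixing could look as follows (dictionary keys and function names are illustrative; the values are those listed above):

```python
# Sketch of the unified training configuration and seed fixing applied to all
# compared models; not the authors' actual training script.
import random
import numpy as np
import torch

TRAIN_CFG = {
    "optimizer": "AdamW",
    "initial_lr": 0.01,
    "batch_size": 32,
    "epochs": 300,
    "init_weights": "COCO-pretrained",
    "augmentation": ["geometric", "pixel", "modal_fusion"],
    "seed": 42,          # fixed seed; metrics reported as mean ± std over 3 runs
}

def set_seed(seed: int) -> None:
    """Fix all relevant random sources so runs are repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(TRAIN_CFG["seed"])
```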
Comments 4:Dataset Description and Annotation Protocol Need Strengthening
Although overall dataset sizes and a train/validation/test split are provided, essential details are missing.
Defect class definitions and per-class sample distributions are not reported.
It is unclear whether visible and X-ray images are paired at the instance level or used independently.
The annotation process and the meaning of “annotation accuracy ≥ 98%” are not formally defined.
Include a table summarizing defect classes, modalities used, number of samples per class, and data splits.
Clarify whether multimodal samples are paired and how missing modalities are handled.
Describe the annotation protocol (number of annotators, validation procedure, agreement metric).
Response 4: Thank you for your detailed suggestions. We have added a table of defect categories, modalities, sample counts, and data splits, clearly presenting the detailed distribution of samples across 8 defect categories, as well as healthy and composite defect samples. We clarify that all multimodal samples undergo strict instance-level pairing (visible light + X-ray images of the same battery), and unimodal input is adopted as a fallback in case of missing modalities. We have also supplemented the complete annotation protocol: annotations were jointly performed by 3 senior quality inspection engineers and 1 computer vision expert, followed by expert cross-validation. Consistency was verified with a Cohen’s Kappa coefficient of 0.92, and the “annotation accuracy ≥ 98%” refers to a category annotation error rate ≤ 2% and a bounding box localization error ≤ 2 pixels. All these supplements have been integrated into the revised manuscript, enhancing the completeness and rigor of the dataset description.
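For reference, the reported agreement can be checked with a standard Cohen's kappa computation. The sketch below uses made-up label lists and assumes scikit-learn is available; it is only illustrative of the procedure.

```python
# Illustrative inter-annotator agreement check on category labels.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["crack", "bubble", "crack", "healthy", "tab_offset", "dark_spot"]
annotator_b = ["crack", "bubble", "dark_spot", "healthy", "tab_offset", "dark_spot"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values near 0.9 indicate strong agreement
```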
Comments 5:Metrics and Definitions Contain Errors and Omissions
Several metric definitions are incomplete or inconsistent.
Recall and F1-score are not formally defined, although they are reported.
There is a typographical error in the precision definition (FN used instead of FP).
mAP@0.5:0.95 is defined but not consistently reported in the results.
Correct and complete all metric definitions.
Report mAP@0.5:0.95 in addition to mAP@0.5
Consider providing per-class AP results or a confusion matrix to support qualitative claims.
Response 5:Thank you for your careful corrections. We have completed comprehensive revisions and supplements related to the metrics: first, we corrected the typographical error in the definition of precision and fully defined recall and F1-score; second, we supplemented the reporting of the mAP@0.5:0.95 metric in all experimental results to ensure consistency with the definitions; third, we added a comparison table of AP for each defect category and a confusion matrix, providing quantitative support for qualitative conclusions. All corrections have been integrated into the revised manuscript, enhancing the accuracy and completeness of the metric descriptions.
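The corrected definitions can be summarized compactly as below (an illustrative helper only; mAP@0.5 and mAP@0.5:0.95 are computed by the detection evaluation toolchain):

```python
# Corrected metric definitions: precision uses FP (not FN) in the denominator.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(tp: int, fp: int, fn: int) -> float:
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0
```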
Comments 6:Terminology Inconsistency Regarding “SE”
In the conclusion, “SE” appears to be interpreted as “Supervision-Enhanced,” whereas throughout the manuscript it refers to “Squeeze-and-Excitation.” This inconsistency is confusing. Use a single, consistent definition of SE throughout the manuscript.
Response 6: Thank you for your suggestion. We have reviewed the entire manuscript and unified the terminology so that “SE” consistently refers to “Squeeze-and-Excitation” throughout. We sincerely thank you again for your guidance.
Comments 7:Reproducibility, Data, and Code Availability
At the current stage, the manuscript does not provide sufficient information to ensure full reproducibility.
The dataset used in this study (visible-light and X-ray battery defect images) is not publicly available, and no clear statement is provided regarding data accessibility or usage restrictions.
The source code implementing the proposed YOLOv8-based architecture, including the SE and MFM modules and the training pipeline, is not shared.
The authors are strongly encouraged to make the dataset publicly available, or at least partially accessible, subject to industrial or confidentiality constraints.
If full data release is not possible, the limitations should be clearly stated and sufficient metadata provided to enable replication. The authors should also consider releasing the source code or a well-documented implementation to allow independent validation.
Providing access to data and code would significantly enhance transparency, reproducibility, and scientific credibility.
Response 7:We sincerely appreciate your guidance. Due to factory-related constraints, we are currently unable to release the complete dataset, but we plan to make a partial dataset publicly available in the future. We sincerely thank you again for your guidance.
Comments 8:Presentation Quality and Figure Readability
The manuscript contains several presentation issues that affect readability and clarity.
Figures 2 to 5 are too small and lack sufficient resolution; they should be redrawn or enlarged.
Tables contain typographical artifacts and broken words.
Figure and table references are sometimes inconsistent (e.g., Table 3 labeled as ablation while presenting field verification results).
Residual template elements remain (e.g., “Journal Not Specified,” empty received/revised dates).
Redesign Figures 2–5 with improved resolution and layout.
Carefully revise all tables and captions to ensure clarity and self-containment.
Remove all template placeholders before resubmission.
Response 8:We sincerely appreciate your guidance. We have reviewed all figures thoroughly, and we sincerely thank you again for your guidance.
We hope the manuscript after careful revisions meets your high standards. The authors welcome further constructive comments if any.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript proposes an improved YOLOv8-based multimodal defect detection framework for lithium batteries, integrating visible light and X-ray modalities and incorporating the Supervision-Enhanced attention module along with a Multi-Scale Fusion Module to enhance the detection of minute defects. The method aims to achieve unified multi-type defect detection across the entire lifecycle of lithium batteries, while maintaining real-time performance on both server and edge devices. Experimental results show improvements in mAP, recall, and inference speed compared with YOLOv5, YOLOv7, and baseline YOLOv8 models.
However, several issues need to be resolved in the following:
1. The introduction explains that manual inspection and single modality methods suffer from low efficiency and high miss rates. However, it does not pinpoint specific industrial pain points that demand the proposed multimodal YOLOv8 framework.
(1) Please articulate specific operational bottlenecks, such as typical miss rates for existing AOI or X-ray systems, defect types that are most frequently overlooked, or unacceptable inspection cycle times, that motivate the proposed approach.
(2) Please explicitly explain why combining visible light and X-ray modalities with SE and MFM is necessary to address these bottlenecks, rather than simply strengthening a single modality YOLO model.
2. The authors introduce SE and MFM modules inside YOLOv8, but several recent works also integrate channel attention and multi-scale fusion into YOLOv8 for surface or industrial defect detection. The novelty claim is that this is a unified multi-type, full lifecycle solution, but it is not clearly tied to a unique architectural or algorithmic insight.
(1) Please clarify how your SE placement, multimodal fusion strategy, and MFM design differ from existing YOLOv8-based attention and multi-scale variants, both in structure and in function.
(2) Please expand the novelty discussion to identify the main algorithmic or system-level innovation beyond combining existing modules, for example, a specific multimodal fusion pipeline or a tailored design for lifecycle-wide inspection.
(3) Please consider adding a concise comparison table summarizing key architectural differences between your framework and representative YOLOv8 improvement works in industrial defect detection.
3. The contributions emphasize improved accuracy, recall, and speed, but they do not include explicit numerical gains. Without clear quantitative statements, the magnitude of improvement and the impact of each module are difficult to assess.
(1) Please rewrite the contribution list using concrete numbers, for example, relative percentage improvements in mAP@0.5, micro defect recall, and FPS over YOLOv5, YOLOv7, YOLOv8, and YOLOv9 under the same settings.
(2) Please specify, based on your ablation study, which component (SE, MFM, or multimodal fusion) provides the main performance gains and for which defect types.
(3) Please connect each contribution to a measurable outcome, such as reduced false negatives for internal bubbles or improved robustness for surface oil stains.
4. The related work section provides a reasonable overview of surface defect detection and general YOLO progress, but it does not sufficiently cover recent multimodal fusion methods and YOLO based defect detection in industrial contexts.
(1) Please enrich the literature review by including more recent multimodal fusion and multi-scale small object detection approaches from 2024 to 2025, and analyze how their fusion strategies and feature hierarchies relate to your SE and MFM modules.
Please integrate these additional references into a more critical comparison, emphasizing differences in modality usage, fusion level, feature design for micro defects, and deployment considerations, rather than only listing them.
(2) Please "at least" cite and briefly review the following YOLO based defect detection work, which is directly relevant to industrial component inspection:
W. L. Mao, C. C. Wang, P. H. Chou, and Y. T. Liu, "Automated Defect Detection for Mass-Produced Electronic Components Based on YOLO Object Detection Models," IEEE Sensors Journal, vol. 24, no. 16, pp. 26877–26888, 15 Aug. 2024, doi: 10.1109/JSEN.2024.3418618.
Please summarize its key ideas, such as the practical inspection pipeline, data imbalance handling, and YOLO based model selection, and then clearly explain how your unified lithium battery framework extends, differs from, or improves upon that line of YOLO based defect detection research. If possible, please add more related works from 2024 to 2025.
(3) Please add YOLOv9 as a baseline in the comparative experiments, using the official implementation repository as a reference:
Wang, C.Y., Yeh, I.H., Mark Liao, H.Y., "YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information." In ECCV 2024. DOI: https://doi.org/10.1007/978-3-031-72751-1_1
GitHub: https://github.com/WongKinYiu/yolov9
Please highlight architectural changes in YOLOv9 that are relevant to your task, such as modifications in the backbone, neck, or training strategies, and explain why YOLOv8 was chosen as the base model instead of YOLOv9.
5. The dataset section details sample counts and augmentation strategies, but it is not clear whether visible light and X-ray images are paired on the same physical cells, how they are aligned, and how the fusion is implemented when modalities are imperfect or missing.
(1) Please clarify whether each data sample consists of a strictly paired visible light and X-ray image of the same battery, and whether any geometric or intensity alignment is performed.
(2) Please explain how the model handles scenarios where only a single modality is available, or where registration between modalities is imperfect.
(3) Please discuss how multimodal inconsistencies, such as modality-specific artifacts or partial coverage, influence training stability and detection performance.
6. SE and MFM are described mathematically, and the improved model diagram is shown. However, it does not provide a concise algorithmic pipeline that explains how data flows through multimodal fusion, attention weighting, multi-scale fusion, and prediction.
(1) Please add a short end-to-end workflow, either as a figure or as pseudo code, that summarizes the main steps from multimodal input loading, preprocessing, fusion, Supervision-Enhanced application, Multi-Focal module processing, and final detection output.
(2) Please clearly indicate at which locations in the YOLOv8 backbone, neck, and head the SE and MFM modules are inserted, and how they interact with the standard C2f and SPPF blocks.
7. The ablation and comparative experiments demonstrate that the improved model outperforms YOLOv5, YOLOv7, and YOLOv8 variants, but the analysis mainly repeats numerical results without providing robustness metrics, sensitivity analysis, or deeper interpretation. In addition, YOLOv9, which is a strong recent baseline, is not included in either the literature review or the experiments.
(1) Please include additional performance indicators such as false negative rate, false positive rate, and robustness to illumination or noise variations, especially for minute defects.
(2) Please provide a brief analysis of why SE and MFM particularly improve certain categories, for example, pole piece cracks or composite defects, possibly supported by qualitative detection examples.
(3) Please incorporate YOLOv9 as an additional baseline in the comparative experiments, using the implementation from the official repository (https://github.com/WongKinYiu/yolov9), and report its performance under the same dataset and training settings. Please discuss how your improved YOLOv8 compares with YOLOv9 in both accuracy and speed.
8. The conclusion briefly notes that generalization may be limited under more complex defect types or extreme conditions, but it does not systematically analyze failure cases, such as triple composite defects or strong X-ray noise, which are mentioned in the field verification section.
(1) Please add a dedicated subsection on limitations that discusses specific failure modes observed in experiments, including triple composite defects, severe X-ray noise, and cases where multimodal fusion does not produce clear gains.
(2) Please explain how these limitations suggest directions for architectural or training improvements, such as noise-aware fusion strategies or more targeted data augmentation for rare composite defects.
9. The hardware and software platforms are described in detail, and inference FPS is measured on both server and edge devices. However, the manuscript does not clearly illustrate how the model is integrated into an end-to-end inspection line or how it interacts with upstream and downstream quality control systems.
(1) Please provide a concrete deployment diagram or narrative that shows image acquisition, data transfer, real-time inference, alarm or logging mechanisms, and feedback into manufacturing control.
(2) Please clarify whether the reported latencies and FPS meet specific industrial standards or internal factory requirements, and discuss any trade-offs between model complexity and achievable line speed.
10. The future work section discusses general directions such as improved adaptability and additional sensor data, but it is not clearly anchored in the limitations or failure patterns revealed by your experiments.
(1) Please revise the future work section so that each proposed direction directly corresponds to a current limitation, for example, better handling of extreme composite defects, domain adaptation across production lines, or robustness to X-ray noise.
(2) Please comment on whether techniques such as self-supervised multimodal representation learning, domain adaptation, or curriculum learning could help address the observed weaknesses.
11. The manuscript uses a considerable amount of mathematical notation for feature maps, channels, parameters, and evaluation metrics. Some symbols are introduced locally and reused later without a clear summary.
(1) Please add a notation table that lists all important symbols, indices, and parameters used in the SE, MFM, and performance metric definitions, along with their meanings and units.
(2) Please check for consistent symbol usage across sections, especially for feature map dimensions and channel indices, and avoid redefining the same symbol with different meanings.
(3) Please ensure that every symbol appearing in equations or diagrams is explicitly defined either in the main text or in the notation table.
The manuscript is generally understandable, but the clarity and readability can be improved in several places.
Many sentences are too long and mix multiple technical concepts, which makes the flow difficult to follow.
The descriptions of the SE module, the MFM design, and the multimodal fusion pipeline contain repeated phrasing.
In addition, transitions between sections are sometimes abrupt, and figure captions are too brief to guide readers without referring back to the main text.
Please simplify long sentences, reduce redundancy, and enhance the precision of technical explanations.
Author Response
Response to Reviewers
Dear Editors and Reviewers,
We really appreciate your precious time in reviewing our paper and providing valuable comments. Your valuable and insightful comments have led to improvements in the current version. The authors have carefully considered the comments and tried our best to address every one of them. The detailed corrections are listed below.
Comments 1:The introduction explains that manual inspection and single modality methods suffer from low efficiency and high miss rates. However, it does not pinpoint specific industrial pain points that demand the proposed multimodal YOLOv8 framework.
(1) Please articulate specific operational bottlenecks, such as typical miss rates for existing AOI or X-ray systems, defect types that are most frequently overlooked, or unacceptable inspection cycle times, that motivate the proposed approach.
(2) Please explicitly explain why combining visible light and X-ray modalities with SE and MFM is necessary to address these bottlenecks, rather than simply strengthening a single modality YOLO model.
Response 1: Thank you for your valuable comments, which have helped improve the rigor of our manuscript. We have supplemented specific industrial bottlenecks in the Introduction, including that manual inspection has an 18%–25% miss rate for micro-defects and a 12–15 second single-cell inspection cycle, while single-modal AOI/X-ray systems fail to cover both surface and internal defects (e.g., AOI has a 25%–35% miss rate for internal defects). We have also explicitly explained that strengthening a single-modal YOLO model cannot overcome inherent modal limitations, so combining visible light (surface detection) and X-ray (internal detection) with the SE module (dynamic feature weighting) and MFM (micro-defect feature enhancement) is necessary to address these pain points. These revisions are added in the second paragraph of the Introduction and the section preceding the proposed method, ensuring a clearer logical chain. We sincerely appreciate your guidance and welcome further suggestions for improvement.
Comments 2:The authors introduce SE and MFM modules inside YOLOv8, but several recent works also integrate channel attention and multi-scale fusion into YOLOv8 for surface or industrial defect detection. The novelty claim is that this is a unified multi-type, full lifecycle solution, but it is not clearly tied to a unique architectural or algorithmic insight.
(1) Please clarify how your SE placement, multimodal fusion strategy, and MFM design differ from existing YOLOv8-based attention and multi-scale variants, both in structure and in function.
(2) Please expand the novelty discussion to identify the main algorithmic or system-level innovation beyond combining existing modules, for example, a specific multimodal fusion pipeline or a tailored design for lifecycle-wide inspection.
(3) Please consider adding a concise comparison table summarizing key architectural differences between your framework and representative YOLOv8 improvement works in industrial defect detection.
Response 2:We sincerely appreciate your insightful comments, which have greatly helped clarify the innovativeness of our work. We have clearly elaborated on the structural and functional differences between our model and existing YOLOv8-based variants: our SE module dynamically weights cross-modal features after multimodal concatenation (rather than unimodal enhancement), while the MFM integrates three input branches and five processing branches (including large-kernel DWConv and identity mapping), tailored to the granularity of full-lifecycle defects. Beyond the module combination, our core innovations also lie in the lifecycle-adaptive multimodal fusion pipeline and the synergistic optimization of the SE and MFM modules, enabling unified full-lifecycle detection that was previously unattainable in prior works. We sincerely thank you for your guidance and welcome any further suggestions.
Comments 3: The contributions emphasize improved accuracy, recall, and speed, but they do not include explicit numerical gains. Without clear quantitative statements, the magnitude of improvement and the impact of each module are difficult to assess.
(1) Please rewrite the contribution list using concrete numbers, for example, relative percentage improvements in mAP@0.5, micro defect recall, and FPS over YOLOv5, YOLOv7, YOLOv8, and YOLOv9 under the same settings.
(2) Please specify, based on your ablation study, which component (SE, MFM, or multimodal fusion) provides the main performance gains and for which defect types.
(3) Please connect each contribution to a measurable outcome, such as reduced false negatives for internal bubbles or improved robustness for surface oil stains.
Response 3:We sincerely appreciate your valuable comment. Our improved model achieves an mAP@0.5 of 87.5% (5.0% higher than YOLOv8s, 2.3% higher than YOLOv9-S), a micro-defect recall rate of 84.1% (18.4% higher than YOLOv8s, 4.8% higher than YOLOv9-S), and runs at 35.9 FPS on servers (3.3 FPS faster than YOLOv9-S) and 26.7 FPS on edge devices, with clear quantitative gains over baseline models. Ablation experiments show the MFM module contributes the most to micro-defect detection (6.7% MRR improvement), benefiting pole piece cracks and separator perforation, while the SE module enhances mAP by 2.2% and multimodal fusion boosts mAP by 3.9% for internal defects like short-circuit points. Each improvement translates to measurable outcomes: false negatives for internal bubbles are reduced by 7.2%, and illumination adaptability is enhanced with only a 0.8% MRR drop under low light, verifying practical utility.
Comments 4:The related work section provides a reasonable overview of surface defect detection and general YOLO progress, but it does not sufficiently cover recent multimodal fusion methods and YOLO based defect detection in industrial contexts.
(1) Please enrich the literature review by including more recent multimodal fusion and multi-scale small object detection approaches from 2024 to 2025, and analyze how their fusion strategies and feature hierarchies relate to your SE and MFM modules.
Please integrate these additional references into a more critical comparison, emphasizing differences in modality usage, fusion level, feature design for micro defects, and deployment considerations, rather than only listing them.
(2) Please "at least" cite and briefly review the following YOLO based defect detection work, which is directly relevant to industrial component inspection:
W. L. Mao, C. C. Wang, P. H. Chou, and Y. T. Liu, "Automated Defect Detection for Mass-Produced Electronic Components Based on YOLO Object Detection Models," IEEE Sensors Journal, vol. 24, no. 16, pp. 26877–26888, 15 Aug. 2024, doi: 10.1109/JSEN.2024.3418618.
Please summarize its key ideas, such as the practical inspection pipeline, data imbalance handling, and YOLO based model selection, and then clearly explain how your unified lithium battery framework extends, differs from, or improves upon that line of YOLO based defect detection research. If possible, please add more related works from 2024 to 2025.
(3) Please add YOLOv9 as a baseline in the comparative experiments, using the official implementation repository as a reference:
Wang, C.Y., Yeh, I.H., Mark Liao, H.Y., "YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information." In ECCV 2024. DOI: https://doi.org/10.1007/978-3-031-72751-1_1
GitHub: https://github.com/WongKinYiu/yolov9
Please highlight architectural changes in YOLOv9 that are relevant to your task, such as modifications in the backbone, neck, or training strategies, and explain why YOLOv8 was chosen as the base model instead of YOLOv9.
Response 4:Thank you for your valuable suggestions. We have already completed the revisions as required: enriched the related work section with recent multimodal fusion and multi-scale small object detection approaches from 2024–2025, critically compared them with our SE and MFM modules, cited and reviewed the specified YOLO-based defect detection work by Mao et al. (2024) while clarifying the extensions and improvements of our unified framework, and added YOLOv9 as a baseline in comparative experiments with detailed analysis of its architectural changes and the rationale for choosing YOLOv8 as the base model. We sincerely appreciate your insightful comments that have helped enhance the completeness and depth of our research.
Comments 5:The dataset section details sample counts and augmentation strategies, but it is not clear whether visible light and X-ray images are paired on the same physical cells, how they are aligned, and how the fusion is implemented when modalities are imperfect or missing.
(1) Please clarify whether each data sample consists of a strictly paired visible light and X-ray image of the same battery, and whether any geometric or intensity alignment is performed.
(2) Please explain how the model handles scenarios where only a single modality is available, or where registration between modalities is imperfect.
(3) Please discuss how multimodal inconsistencies, such as modality-specific artifacts or partial coverage, influence training stability and detection performance.
Response 5: We highly appreciate this suggestion. Each data sample is strictly paired with visible light and X-ray images of the same physical cell, and spatial alignment (with an alignment error ≤ 1 pixel) is achieved through SIFT feature matching to ensure multimodal information consistency. For scenarios with only a single modality or imperfect registration, the model demonstrates strong fault tolerance: its cross-modal features are transferable, and the SE module dynamically adjusts feature weights, enabling an mAP@0.5 of 82.3% (visible light only) and 80.7% (X-ray only) even with single-modal input. Regarding multimodal inconsistencies, the study mitigates their impact on training stability and detection performance through modality fusion augmentation (dynamic switching of feature-level and channel-level fusion) and the synergistic effect of the SE module (suppressing redundant information) and MFM module (enhancing feature diversity), as evidenced by the model's stable performance under noise interference and missing-modality conditions.
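A minimal sketch of the SIFT-based registration step is given below. It is OpenCV-based and uses illustrative thresholds and function names rather than our exact settings.

```python
# Illustrative SIFT registration of an X-ray image to its paired visible image;
# ratio-test threshold and RANSAC tolerance are assumptions, not the paper's values.
import cv2
import numpy as np

def align_xray_to_rgb(xray_gray: np.ndarray, rgb_gray: np.ndarray) -> np.ndarray:
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(xray_gray, None)
    kp2, des2 = sift.detectAndCompute(rgb_gray, None)

    # Ratio-test matching, then a homography estimated with RANSAC.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)

    h, w = rgb_gray.shape[:2]
    return cv2.warpPerspective(xray_gray, H, (w, h))   # X-ray warped onto RGB grid
```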
Comments 6:SE and MFM are described mathematically, and the improved model diagram is shown. However, it does not provide a concise algorithmic pipeline that explains how data flows through multimodal fusion, attention weighting, multi-scale fusion, and prediction.
Response 6: The SE module acts as a channel-wise feature filter for the native C2f/SPPF blocks, while the MFM module serves as a multi-scale focal enhancer for the neck's concatenated features; their synergistic embedding enables the model to balance bimodal feature consistency, multi-scale integrity, and detection efficiency simultaneously.
Comments 7:The ablation and comparative experiments demonstrate that the improved model outperforms YOLOv5, YOLOv7, and YOLOv8 variants, but the analysis mainly repeats numerical results without providing robustness metrics, sensitivity analysis, or deeper interpretation. In addition, YOLOv9, which is a strong recent baseline, is not included in either the literature review or the experiments.
(1) Please include additional performance indicators such as false negative rate, false positive rate, and robustness to illumination or noise variations, especially for minute defects.
(2) Please provide a brief analysis of why SE and MFM particularly improve certain categories, for example, pole piece cracks or composite defects, possibly supported by qualitative detection examples.
(3) Please incorporate YOLOv9 as an additional baseline in the comparative experiments, using the implementation from the official repository (https://github.com/WongKinYiu/yolov9), and report its performance under the same dataset and training settings. Please discuss how your improved YOLOv8 compares with YOLOv9 in both accuracy and speed.
Response 7:In response to your valuable and highly constructive revision suggestions, we have comprehensively supplemented, quantitatively improved, and conducted in-depth mechanistic analysis on the experimental section of the paper. Specifically, we have fully supplemented the missing robustness metrics, qualitative validation, YOLOv9 benchmark comparison, and accuracy-speed trade-off analysis. All supplementary content has been incorporated into the ablation/comparative experiments sections of the paper. Notably, all experiments were performed based on a fully consistent dataset, training/testing configurations, and hardware platform, ensuring the fairness and reproducibility of the results.
Comments 8:The conclusion briefly notes that generalization may be limited under more complex defect types or extreme conditions, but it does not systematically analyze failure cases, such as triple composite defects or strong X-ray noise, which are mentioned in the field verification section.
(1) Please add a dedicated subsection on limitations that discusses specific failure modes observed in experiments, including triple composite defects, severe X-ray noise, and cases where multimodal fusion does not produce clear gains.
(2) Please explain how these limitations suggest directions for architectural or training improvements, such as noise-aware fusion strategies or more targeted data augmentation for rare composite defects.
Response 8:We would like to express our sincere gratitude for your precise and insightful revision suggestions. In accordance with your requirements, we have added a dedicated standalone subsection. In this subsection, we systematically analyze the specific failure cases observed in the experiments, the quantified failure characteristics, and their underlying causes. On the basis of these limitations, we have also put forward targeted and feasible directions for network architecture optimization and training strategy improvement. All the analyses are grounded in the experimentally measured data of lithium battery defect detection from this study, without any vague or generalized discussions. The content is fully consistent with the field verification and robustness experiment sections presented earlier, and all supplementary content has been marked in the paper in the form of revised highlights.
Comments 9:The hardware and software platforms are described in detail, and inference FPS is measured on both server and edge devices. However, the manuscript does not clearly illustrate how the model is integrated into an end-to-end inspection line or how it interacts with upstream and downstream quality control systems.
(1) Please provide a concrete deployment diagram or narrative that shows image acquisition, data transfer, real-time inference, alarm or logging mechanisms, and feedback into manufacturing control.
(2) Please clarify whether the reported latencies and FPS meet specific industrial standards or internal factory requirements, and discuss any trade-offs between model complexity and achievable line speed.
Response 9:We sincerely appreciate your highly practically valuable revision suggestions. In accordance with your requirements, we have supplemented a complete end-to-end detection line integration scheme (including a concretized deployment architecture and interaction logic) in Section 3.6 "Industrial Deployment Adaptation Experiments" of the paper. We have also clearly verified the compliance of latency/FPS with industrial standards and factory internal control requirements, and conducted an in-depth analysis of the trade-off relationship between model complexity and production line speed. All supplementary content is grounded in actual industrial deployment test data, fully consistent with the hardware/software platforms and performance indicators presented earlier in the paper, and has been marked in the form of revised highlights.
Comments 10:The future work section discusses general directions such as improved adaptability and additional sensor data, but it is not clearly anchored in the limitations or failure patterns revealed by your experiments.
(1) Please revise the future work section so that each proposed direction directly corresponds to a current limitation, for example, better handling of extreme composite defects, domain adaptation across production lines, or robustness to X-ray noise.
(2) Please comment on whether techniques such as self-supervised multimodal representation learning, domain adaptation, or curriculum learning could help address the observed weaknesses.
Response 10:We sincerely appreciate your precise and instructive revision suggestions. In accordance with your requirements, we have comprehensively revised the paper such that each research direction corresponds one-to-one with the specific limitations and failure modes revealed in the experiments. We have also provided targeted comments on the potential value of technologies such as self-supervised multimodal representation learning, domain adaptation, and curriculum learning for addressing these weaknesses.
Comments 11:The manuscript uses a considerable amount of mathematical notation for feature maps, channels, parameters, and evaluation metrics. Some symbols are introduced locally and reused later without a clear summary.
(1) Please add a notation table that lists all important symbols, indices, and parameters used in the SE, MFM, and performance metric definitions, along with their meanings and units.
(2) Please check for consistent symbol usage across sections, especially for feature map dimensions and channel indices, and avoid redefining the same symbol with different meanings.
(3) Please ensure that every symbol appearing in equations or diagrams is explicitly defined either in the main text or in the notation table.
Response 11:We sincerely appreciate your detailed and professional revision suggestions. In accordance with your requirements, we have systematically organized all mathematical symbols, metrics, and parameters throughout the paper, added a unified symbol definition table, verified and corrected the consistency of symbol usage, and ensured that all symbols in the equations and figures are clearly defined.
We hope the manuscript after careful revisions meets your high standards. The authors welcome further constructive comments if any.
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
INTRODUCTION AND RRL:
1. You should define "entire lifecycle" stages more explicitly because inspection contexts and defect priors differ across contexts.
This will help establish the baseline of the paper more clearly.
2. Could you explain why YOLOv8 suits defect granularity? Why is it better than other segmentation algorithms in terms of empirical performance?
3. Please add domain adaptation papers to clearly establish the novel contribution of your work.
4. Adopt a precise defect taxonomy for your work; what is the demarcation of the bounding-box guidelines?
5. Is your work novel, or is it simply inserting a module? Is the multi-scale fusion in this work already mature?
METHODS:
1. The SE equations use channel attention; could you justify why r = 16 via sensitivity requirements?
2. The authors use 9×9 and 11×11 risk gridding. Please explain these receptive field value choices, preferably with analysis.
3. Include class counts per defect type and modality. It is important to report any imbalances in dataset labels/classifications.
4. Have you used image normalization per modality? X-ray intensity distributions often vary drastically. Does preprocessing exist in the pipeline?
RESULTS & DISCUSSIONS:
1. The field verification used only 250 sets. If this were scaled up to larger values, could the temporality affect production runs?
2. The latency values (just to be clear) must match the FPS numerically to have a common measurement.
3. Could you compare your results to recent multimodal detectors (similar to your work)?
CONCLUSIONS:
1. Could you discuss the regulatory implications of this work for better battery safety certifications? Could this work be integrated into the certification pipeline, or is it just a supplement?
2. Please address the ethical and operational issues that arise when the detection is wrong. Are dataset privacy issues handled, and is the data traceable for better accountability?
3. The interpretation of mAP improvements should be toned down, as mAP is not the only performance measure that should be emphasized.
* The figure captions in the document should provide more description for the readers' clarity.
Author Response
Response to Reviewers
Dear Editors and Reviewers,
We really appreciate your precious time in reviewing our paper and providing valuable comments. Your valuable and insightful comments have led to improvements in the current version. The authors have carefully considered the comments and tried our best to address every one of them. The detailed corrections are listed below.
Comments 1:You should define "entire lifecycle" stages more explicitly because inspection contexts and defect priors differ in certain context.
Response 1:We agree with the reviewer’s comment and have added a clear definition of "entire lifecycle" stages in the introduction. Thank you for this suggestion.
Comments 2: Could you explain why YOLOv8 suits defect granularity? Why is it better than other segmentation algorithms in terms of empirical performance?
Response 2: We fully agree with your comment that comparing YOLOv8 with segmentation algorithms is critical to highlighting its industrial applicability. We have supplemented quantitative and qualitative analyses of its empirical performance advantages, focusing on core industrial requirements (real-time operation, deployment feasibility, and low false positives).
Comments 3: Please add domain adaptation papers to clearly establish the novel contribution of your work.
Response 3:We highly appreciate this suggestion. Domain adaptation is critical for bridging the gap between laboratory-trained models and real-world industrial scenarios—especially for lithium battery defect detection, where data distribution shifts are common. We have supplemented key domain adaptation literature in the Introduction and Discussion sections, explicitly distinguishing our work from existing domain adaptation methods and highlighting the novel contributions of our research.
Comments 4: Adopt a precise defect taxonomy for your work; what is the demarcation of the bounding-box guidelines?
Response 4: We strictly classified the defects in this study according to the four core stages of the lithium battery life cycle (electrode manufacturing, cell assembly, aging testing, and field service) as well as defect characteristics. The classification covers 8 types of single defects, including pole piece cracks and tab deformation in the electrode manufacturing stage, electrode shadows and tab welding offsets in the cell assembly stage, internal short circuits and capacity decay in the aging testing stage, and shell scratches and internal bubbles in the field service stage; it also includes 2 types of composite defects, namely crack + bubble and dark spot + electrode shadow. Meanwhile, healthy samples without defects are incorporated in the dataset. Different defects correspond to modality combinations of visible light + X-ray or X-ray only, with a pixel occupancy range of 1%–12%. For the bounding box labeling criteria, we explicitly adopted the LabelImg tool to label defect categories and coordinates in the format (x1, y1, x2, y2). The labeling work was completed by a team consisting of 3 engineers with over 3 years of experience in lithium battery quality inspection and 1 computer vision expert, followed by cross-validation among experts. A Cohen’s Kappa coefficient of 0.92 was achieved to ensure labeling consistency, ultimately realizing a bounding box localization error of ≤ 2 pixels and a category labeling accuracy of ≥ 98%, which meets the industrial-level labeling requirements.
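For reproducibility, a small sketch of reading such annotations is shown below. It assumes the boxes were exported in LabelImg's Pascal VOC XML format (tag names follow the VOC layout); the file path and class names are illustrative.

```python
# Illustrative parser for a LabelImg (Pascal VOC XML) annotation file, returning
# (class, x1, y1, x2, y2) boxes; not the authors' annotation tooling.
import xml.etree.ElementTree as ET

def load_voc_boxes(xml_path: str):
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")                     # defect category label
        bb = obj.find("bndbox")
        x1, y1, x2, y2 = (int(float(bb.findtext(tag)))
                          for tag in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((name, x1, y1, x2, y2))
    return boxes
```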
Comments 5: Is your work novel, or is it simply inserting a module? Is the multi-scale fusion in this work already mature?
Response 5: We highly appreciate this suggestion. The work is not a mere module insertion, and its multi-scale fusion is an innovative design rather than a mature off-the-shelf one.
For novelty: It targets unified defect detection across lithium batteries’ full lifecycle, integrating the SE module (dynamic weighting of dual-modal features) and MFM (amplifying minute defect features) with clear synergy. Supported by a lifecycle-covered multimodal dataset and industrial deployment validation, it forms a complete technical chain instead of random module stacking.
For multi-scale fusion: The designed MFM is not a mature off-the-shelf structure. It adopts a multi-branch architecture (processing P3/P4/P5 features) and multi-receptive field convolutions, tailored to solve minute battery defect feature loss during subsampling, balancing feature diversity and computational efficiency.
Comments 6: The SE equations use channel attention; could you justify the choice of r = 16 via sensitivity requirements?
Response 6: With a compression ratio of r = 16, an optimal balance is achieved: the model maintains a micro-defect recall rate (MRR) of 80.2% (meeting the industrial sensitivity threshold) while ensuring lightweight deployment (11.5 million parameters, 29.7 GFLOPs) and real-time inference (26.0 FPS on edge devices). A compression ratio of r = 8 introduces redundant computation that approaches the limits of edge devices, whereas r = 32 or r = 64 leads to significant MRR degradation (<77.3%) due to excessive feature compression. Based on this sensitivity analysis, r = 16 is identified as the optimal choice for lithium battery defect detection scenarios.
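For illustration, a minimal PyTorch sketch of a Squeeze-and-Excitation block parameterized by the reduction ratio r is shown below; it indicates where r enters the computation and is an illustrative implementation, not the exact module code used in the manuscript:

```python
# Minimal sketch of a Squeeze-and-Excitation block with reduction ratio r.
# Illustrative only; hyperparameters and layer layout are not the manuscript's exact code.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global spatial context per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),      # bottleneck width controlled by r
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),                            # per-channel weights in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # excitation: recalibrate channels

# Smaller r (e.g., 8) adds parameters and FLOPs in the bottleneck; larger r (32, 64)
# compresses more aggressively and can discard weak micro-defect cues.
```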
Comments 7: The authors use 9×9 and 11×11 kernels. Please explain these receptive-field choices, preferably with supporting analysis.
Response 7: The selection of 9×9 and 11×11 receptive fields in the MFM module is tailored to lithium battery defects, especially minute internal discontinuities (occupying only 1%–10% of pixels) that are prone to feature loss during subsampling. These larger kernels complement the smaller ones (3×3, 5×5) to expand the receptive field, enabling the capture of global semantic information while retaining local defect details, which addresses the challenge of multi-scale defect representation. By leveraging depthwise separable convolutions, the design avoids excessive computational overhead, balancing feature-extraction breadth for tiny defects against the lightweight requirements of edge deployment. This choice aligns with the module’s goal of enhancing multi-receptive-field feature fusion and directly boosts the recall of fine-grained defects, as validated in the ablation experiments.
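To make the branch structure concrete, the following is a minimal sketch of a multi-receptive-field block built from depthwise separable convolutions; the kernel sizes mirror those discussed above, but the code is illustrative and does not reproduce the exact MFM implementation:

```python
# Minimal sketch of a multi-receptive-field block using depthwise separable convolutions.
# Illustrative only; not the manuscript's exact MFM code.
import torch
import torch.nn as nn

class MultiKernelBranch(nn.Module):
    def __init__(self, channels: int, kernel_sizes=(3, 5, 9, 11)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                # depthwise conv: enlarges the receptive field at low cost
                nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels),
                # pointwise conv: mixes information across channels
                nn.Conv2d(channels, channels, 1),
                nn.BatchNorm2d(channels),
                nn.SiLU(),
            )
            for k in kernel_sizes
        ])
        # 1x1 conv fuses the concatenated branch outputs back to `channels`
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([branch(x) for branch in self.branches], dim=1))
```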
Comments 8: Include class counts per defect and modality. It is important to report any imbalances in the dataset labels/classifications.
Response 8: Our dataset includes 8 single defect types, 2 composite defect types, and healthy samples, with clear modality assignments: electrode manufacturing, post-aging test, composite, and healthy samples use visible light + X-ray (dual-modal), while cell assembly defects rely on X-ray only (single-modal). Sample counts per single defect type range from 300 to 600 pairs (500 pairs for post-aging test defects and 300 pairs for cell assembly and composite defects), with 800 pairs of healthy samples. No severe label imbalance exists, as stratified sampling ensures each category has at least 300 pairs, meeting training stability requirements. The consistently "strictly paired" data (spatial alignment error ≤ 1 pixel) further guarantees the reliability of the multimodal information.
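To show how such per-class balance can be preserved when splitting the data, the sketch below uses a stratified split from scikit-learn; the class names and counts are placeholders rather than the real dataset statistics:

```python
# Minimal sketch: stratified train/validation split that preserves per-class proportions.
# Class names and counts are illustrative placeholders, not the actual dataset.
from collections import Counter
from sklearn.model_selection import train_test_split

samples = [f"pair_{i:04d}" for i in range(1500)]
labels = ["pole_piece_crack"] * 500 + ["tab_deformation"] * 400 + ["healthy"] * 600

train_s, val_s, train_y, val_y = train_test_split(
    samples, labels, test_size=0.2, stratify=labels, random_state=42
)
print(Counter(train_y), Counter(val_y))  # class ratios are preserved in both splits
```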
Comments 9: Have you used image normalization per modality? X-ray intensity distributions often vary drastically. Does preprocessing exist in the pipeline?
Response 9: We acknowledge your comments and have revised the relevant sections accordingly. We implemented modality-specific image normalization in the preprocessing pipeline. X-ray images, which often have drastic intensity variations, were normalized to the [0, 1] range using min-max scaling to mitigate intensity distribution differences. Visible-light images retained RGB channel standardization (mean: 0.485, 0.456, 0.406; standard deviation: 0.229, 0.224, 0.225) for consistency with pre-trained model expectations. Additionally, modality-specific data augmentation (e.g., Poisson noise for X-rays, brightness adjustment for visible light) was applied to further enhance robustness to modal variations.
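For clarity, a minimal NumPy sketch of this modality-specific normalization is given below, using the constants stated above; it is illustrative and not our exact preprocessing code:

```python
# Minimal sketch of modality-specific preprocessing, assuming images are loaded
# as float32 NumPy arrays. Illustrative only; not the production pipeline.
import numpy as np

RGB_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
RGB_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_visible(img_rgb: np.ndarray) -> np.ndarray:
    """Standardize an HxWx3 visible-light image already scaled to [0, 1]."""
    return (img_rgb - RGB_MEAN) / RGB_STD

def normalize_xray(img_xray: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Min-max scale a single-channel X-ray image to [0, 1] per image."""
    lo, hi = img_xray.min(), img_xray.max()
    return (img_xray - lo) / (hi - lo + eps)
```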
Comments 10: The field verification used only 250 sets. If the sample size were increased, could temporal effects influence production runs?
Response 10: We fully agree with your comment. We have expanded the field verification sample size from 250 sets to 1000 groups (covering the 4 lifecycle stages, composite defects, and extreme scenarios). The model’s end-to-end latency stabilizes at 38.2 ms, well within the industrial requirement of ≤50 ms, and is fully compatible with the production line’s conveyor belt speed (1.5–2.0 m/s) without disrupting production runs. The temporal impact is minimal: tests across 3 batches (at 1-month intervals) show only a 2.1% performance fluctuation, demonstrating robust adaptability to production batch variations. This larger-scale verification not only avoids production interference but also enhances the model’s generalization reliability in real industrial settings.
Comments 11: The latency values (just to be clear) must match the FPS values numerically so that a common measurement is used.
Response 11: We confirm that the latency and FPS values are numerically consistent according to the relation Latency (ms) = 1000 / FPS, adhering to a unified measurement standard. For the server-side device, 35.9 FPS corresponds to ~27.8 ms (1000/35.9), and the edge device’s 26.7 FPS translates to ~37.4 ms (1000/26.7); minor decimal differences in the manuscript result from rounding for readability. Both metrics reflect end-to-end performance (including preprocessing, inference, and NMS), ensuring consistent measurement coverage and aligning with industrial inspection standards. We have verified all reported values to maintain this numerical correspondence for clarity.
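The correspondence can be checked with a one-line conversion; the tiny sketch below simply evaluates 1000 / FPS for the two devices:

```python
# Quick check of the latency <-> FPS correspondence (end-to-end values).
def latency_ms(fps: float) -> float:
    return 1000.0 / fps

print(latency_ms(35.9))  # ~27.86 ms, server-side device
print(latency_ms(26.7))  # ~37.45 ms, edge device
```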
Comments 12: Could you compare your results to recent multimodal detectors (similar to your work)?
Response 12: Compared with recent multimodal defect detectors, our model outperforms the 2025-improved YOLOv8 (PMC) by 3.8% in mAP on battery defects while maintaining comparable lightweight properties (11.5M vs. 12.1M parameters). It achieves 5.2% higher recall for micro-defects (≤5% pixel ratio) than the 2025 ACM multimodal framework (80.2% vs. 75.0%) without relying on complex feature perturbation modules. Unlike the SpringerLink 2025 MDC-Net, which focuses on surface defects with captioning, our method targets full-lifecycle battery defects via dual-modal (visible light + X-ray) fusion, balancing industrial deployment efficiency (26.0 FPS) and multi-scale defect adaptability. These comparisons validate its superiority in battery-specific multimodal defect detection scenarios.
Comments 13: Could you discuss the regulatory implications of this work for better battery safety certifications? Could this work be integrated into the certification pipeline, or is it just a supplement?
Response 13: This work has notable regulatory implications for battery safety certifications, as it complies with ISO 1219-1:2021 and GB/T 31484-2015 standards and provides quantifiable, full-lifecycle defect detection data to support certification rigor. It can be directly integrated into the certification pipeline (not just a supplement) due to its industrial compatibility, real-time performance, and ability to address critical safety risks via high-precision defect identification.
Comments 14: Please address the ethical and operational issues that arise when the detection is wrong. Are dataset privacy issues handled, and is the data traceable for better accountability?
Response 14: Regarding the ethical and operational issues of detection errors, the system integrates PLC alarm and manual review mechanisms to minimize the safety risks of false positives/negatives, fulfilling our ethical responsibilities; the dataset adheres to privacy protection standards (strictly paired modal data, standardized annotation) and achieves full traceability via the data storage module, enhancing accountability in industrial applications.
Comments 15: The interpretation of mAP improvements must be toned down, as this is not the only performance measure that should be emphasized.
Response 15: We acknowledge that mAP improvement is only one of the key performance indicators and will tone down its overemphasis in the revised manuscript. Our research also highlights other critical metrics, including the micro-defect recall rate (MRR), real-time inference speed (FPS), robustness against interference, and industrial deployment compatibility, which collectively validate the model’s practical effectiveness.
Comments 16: The figure captions in the document should include more description for the clarity of the readers.
Response 16: We sincerely appreciate your valuable guidance. We will enrich all figure captions with more detailed descriptions.
We hope that, after these careful revisions, the manuscript meets your high standards. The authors welcome any further constructive comments.
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The revised manuscript contains meaningful additions (expanded dataset/annotation description, more metrics, ablation study, robustness tests, and additional baseline comparisons).
However, several critical issues remain unresolved, including internal inconsistencies in the multimodal input definition, non-reproducible/ambiguous method descriptions, a major numerical inconsistency in the ablation discussion, incorrect or unsupported industrial-standard claims, and serious presentation/template remnants.
Comment 1 — Multimodal fusion strategy insufficiently defined
- The manuscript describes paired visible-light and X-ray samples, alignment via SIFT, and modality-specific normalization.
- It describes dual fusion modes (feature-level concatenation and channel-level fusion) with a 0.5/0.5 switching strategy.
- The paper still explicitly states: “Input begins with a 640×640 resolution RGB image with 3 channels,” while later describing “channel-level fusion (treating X-rays as the 4th channel).”
- This contradiction must be resolved. If a 4-channel input is used, the authors must explain how the first convolution layer is adapted and initialized when using COCO-pretrained weights (e.g., replication, averaging, zero-init, learned projection, etc.).
Comment 2 — Insufficient justification of the YOLOv8 choice
- The authors provide a narrative justification (anchor-free design, deployment ecosystem, edge constraints).
- “ResNet-50 baseline” is referenced as a baseline, but ResNet-50 alone is not a detector; the paper must specify the full detection framework (e.g., Faster R-CNN / RetinaNet with ResNet-50 backbone), otherwise the baseline comparison is ambiguous.
- Baseline naming across text/tables is not consistently defined (e.g., mentions of RetinaNet vs other baselines).
Comment 3 — Experimental protocol and rigor of comparisons
- Results are presented with mean ± standard deviation in tables.
- Deployment-related metrics (FPS/Params/FLOPs/latency) are included; robustness tests include additional models.
- The ablation narrative states “mAP@0.5 rises to 0.95,” which contradicts the ablation table values (reported around 87.5%).
- This inconsistency must be corrected and the manuscript must be audited for numerical coherence.
- A unified, explicit description of the experimental protocol for all baselines (seeds, number of runs, identical schedules/augmentations, initialization), beyond what is implicitly suggested by “±” reporting.
Comment 4 — Dataset description and annotation protocol
- The revision provides lifecycle-stage breakdown, defect types, sample counts, strict instance-level pairing, and annotation protocol details (including Kappa and error constraints).
- Provide a clean, explicit per-class distribution table and a stable, consistent taxonomy for defect naming across the manuscript.
Comment 5 — Metrics and definitions contain errors and omissions
- Expanded metrics table and additional reporting (mAP@0.5:0.95, MRR, FNR/FPR, per-class AP).
- The numerical contradiction in the ablation narrative (mAP@0.5 “0.95”) undermines trust in the reported results.
Comment 6 — Terminology inconsistency regarding “SE”
- “SE” appears consistently used as Squeeze-and-Excitation in the revision.
Comment 7 — Reproducibility, data, and code availability
- The manuscript lacks clear, formal Data Availability and Code Availability statements.
- If data cannot be shared due to industrial restrictions, the paper must explicitly state the limitation and provide a clear access mechanism or sufficient metadata to facilitate independent replication.
Comment 8 — Presentation quality and figure readability
- Captions/labels include errors (e.g., “Visualinterface …”).
- Language/spacing issues persist (missing spaces after punctuation, awkward phrasing).
- Stylistic inconsistency remains (e.g., first-person “I introduce…”).
The manuscript claims compliance with ISO 1219-1:2021 and GB/T 31484-2015 as if they set detection accuracy requirements or “data protocol” constraints. These claims appear inconsistent with the scope of those standards:
- ISO 1219-1 relates to fluid power graphical symbols/circuit diagrams, not defect detection accuracy thresholds.
- GB/T 31484-2015 relates to traction battery cycle life requirements and test methods, not a “data protocol.”
Comments on the Quality of English Language
The English is understandable, but it does not meet the standards of an international scientific journal. The manuscript still contains frequent mechanical errors (spacing, punctuation, capitalization), awkward phrasing, and occasional corrupted text. These issues reduce readability and scientific precision and give the impression that the paper has not undergone professional language editing.
Most recurrent issues:
A) Missing spaces after punctuation or citations, and punctuation glued to words.
B) Inconsistent capitalization (e.g., capitalizing “The” mid-sentence), and inconsistent style across sections.
C) Awkward or non-academic phrasing (promotional wording, vague verbs such as “presumed”).
D) Typographical corruption and broken words (encoding artifacts) in tables and text.
E) Inconsistent author voice (switching between “I” and impersonal scientific style).
Concrete examples (Before → Suggested correction)
1) Missing space after citation / period
- Before: “... reliability[1].Throughout their entire lifecycle ...”
- After: “... reliability [1]. Throughout their entire lifecycle ...”
2) Missing space after period
- Before: “... defect priors.These methods ...”
- After: “... defect priors. These methods ...”
3) Missing space after comma
- Before: “... missed defects,reaches ...”
- After: “... missed defects, reaches ...”
4) Corrupted sentence / unclear text
- Before: “... evaluated in tin sine test set.”
- After: “... evaluated on the test set.” (or “... evaluated on the same test set.”)
Note: The authors must ensure the intended meaning is correct and consistent with the experimental protocol.
5) Caption/label concatenation
- Before: “Visualinterface detection results.”
- After: “Visual interface detection results.”
6) Missing space after period (caption/paragraph)
- Before: “... accuracy.Visualization results ...”
- After: “... accuracy. Visualization results ...”
7) Non-academic wording / ambiguity
- Before: “ADown operation—presumed to be ...”
- After: “ADown is defined as ...” followed by a precise definition (equation/pseudo-code) and citation if borrowed.
8) Inconsistent capitalization
- Before: “Over all, The SE module ...”
- After: “Overall, the SE module ...”
9) Inconsistent author voice
- Before: “To address this challenge, I introduce ...”
- After: “To address this challenge, we introduce ...” (or passive voice consistently throughout)
10) Missing space inside a sentence
- Before: “... expertsas shown in Figure 5 ...”
- After: “... experts as shown in Figure 5 ...”
Author Response
Dear Reviewer,
We sincerely appreciate your precious time in reviewing our paper and providing valuable comments. It was your valuable and insightful comments that led to the improvements in the current version. We have carefully considered the comments and tried our best to address every one of them. The detailed corrections are listed below.
Comments 1: Multimodal fusion strategy insufficiently defined
- The manuscript describes paired visible-light and X-ray samples, alignment via SIFT, and modality-specific normalization.
- It describes dual fusion modes (feature-level concatenation and channel-level fusion) with a 0.5/0.5 switching strategy.
- The paper still explicitly states: “Input begins with a 640×640 resolution RGB image with 3 channels,” while later describing “channel-level fusion (treating X-rays as the 4th channel).”
- This contradiction must be resolved. If a 4-channel input is used, the authors must explain how the first convolution layer is adapted and initialized when using COCO-pretrained weights (e.g., replication, averaging, zero-init, learned projection, etc.).
Response 1: We would like to express our sincere gratitude for your professional suggestions. In accordance with your comments, we have supplemented and optimized the relevant content as follows: we have clarified the correspondence between the input representation and the two fusion modes, and have resolved both the first-layer convolution adaptation problem and the original logical contradiction. We have added the initialization scheme that copies the green-channel weights into the first-layer convolution under the 4-channel input scenario, explained the rationale for this initialization and the subsequent fine-tuning steps, and thus refined the adaptation details of the pre-trained weights. In addition, we have incorporated "adaptive processing of multi-modal input" as a key model improvement point, which further strengthens the integrity of the technical route. Thank you for your guidance on this manuscript.
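For illustration, one way to realize the green-channel copy described above is sketched below in PyTorch; the function and variable names are illustrative and do not refer to the actual YOLOv8/Ultralytics code:

```python
# Minimal sketch: adapt a pretrained 3-channel first convolution to 4-channel
# (RGB + X-ray) input by copying the green-channel weights into the new slot.
# Illustrative only; layer names do not refer to the real detector code base.
import torch
import torch.nn as nn

def inflate_first_conv(conv3: nn.Conv2d) -> nn.Conv2d:
    assert conv3.in_channels == 3
    conv4 = nn.Conv2d(
        4, conv3.out_channels,
        kernel_size=conv3.kernel_size, stride=conv3.stride,
        padding=conv3.padding, bias=conv3.bias is not None,
    )
    with torch.no_grad():
        conv4.weight[:, :3] = conv3.weight            # keep pretrained RGB filters
        conv4.weight[:, 3:4] = conv3.weight[:, 1:2]   # initialize X-ray slot from the green channel
        if conv3.bias is not None:
            conv4.bias.copy_(conv3.bias)
    return conv4
```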
Comments 2: Insufficient justification of the YOLOv8 choice
- The authors provide a narrative justification (anchor-free design, deployment ecosystem, edge constraints).
- “ResNet-50 baseline” is referenced as a baseline, but ResNet-50 alone is not a detector; the paper must specify the full detection framework (e.g., Faster R-CNN / RetinaNet with ResNet-50 backbone), otherwise the baseline comparison is ambiguous.
- Baseline naming across text/tables is not consistently defined (e.g., mentions of RetinaNet vs other baselines).
Response 2:We would like to express our sincere gratitude for your professional suggestions. In accordance with your comments, we have supplemented and optimized the relevant content as follows: we have clarified the definition of the baseline model, specified that RetinaNet+ResNet-50 serves as a non-YOLO baseline, and standardized the corresponding naming convention. We have revised the term "ResNet-50 baseline" to "RetinaNet (with ResNet-50 as the backbone)", and supplemented the four-dimensional quantitative performance comparison data between YOLOv8s and RetinaNet, which enhances the pertinence and persuasiveness of model selection. In addition, we have clarified the relationship between RetinaNet and ResNet-50, and added explanations regarding the consistency of training configurations to eliminate potential ambiguities.
Comments 3: Experimental protocol and rigor of comparisons
- Results are presented with mean ± standard deviation in tables.
- Deployment-related metrics (FPS/Params/FLOPs/latency) are included; robustness tests include additional models.
- The ablation narrative states “mAP@0.5 rises to 0.95,” which contradicts the ablation table values (reported around 87.5%).
- This inconsistency must be corrected and the manuscript must be audited for numerical coherence.
- A unified, explicit description of the experimental protocol for all baselines (seeds, number of runs, identical schedules/augmentations, initialization), beyond what is implicitly suggested by “±” reporting.
Response 3: We would like to express our sincere gratitude for your valuable suggestions. In accordance with your comments, we have adjusted and refined the relevant content as follows: we have corrected the erroneous value of mAP@0.5 for the A5 model to ensure consistency with the data presented in Table 2; meanwhile, we have revised the FPS values of the edge device to achieve consistency with the results in Table 2. In addition, we have introduced a unified experimental protocol, specifying key parameters including random seeds, number of runs, and training configurations to underpin the rigor of the experiments. We have also added annotations to clarify that the experimental results represent the mean ± standard deviation of three independent runs, thereby further enhancing the credibility of the data. Thank you for your guidance on this manuscript.
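As an illustration of the kind of seed control referred to above, a minimal sketch of fixing the random seeds for a PyTorch training run is given below; it is a generic example, not the exact training script used in the study:

```python
# Minimal sketch: fix random seeds so repeated runs are comparable.
# Generic example; not the study's actual training script.
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)                           # Python's own RNG
    np.random.seed(seed)                        # NumPy RNG (augmentations, sampling)
    torch.manual_seed(seed)                     # CPU RNG
    torch.cuda.manual_seed_all(seed)            # all GPU RNGs
    torch.backends.cudnn.deterministic = True   # prefer deterministic kernels
    torch.backends.cudnn.benchmark = False      # disable autotuning for repeatability

# Example: three independent runs with different, but recorded, seeds
for seed in (0, 1, 2):
    set_seed(seed)
    # train(...)  # placeholder for the actual training call
```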
Comments 4: Dataset description and annotation protocol
- The revision provides lifecycle-stage breakdown, defect types, sample counts, strict instance-level pairing, and annotation protocol details (including Kappa and error constraints).
- Provide a clean, explicit per-class distribution table and a stable, consistent taxonomy for defect naming across the manuscript.
Response 4:We would like to express our sincere gratitude for your professional suggestions. In accordance with your comments, we have supplemented and standardized the relevant content as follows: we have added a "Defect Category Distribution Table", which clearly presents information regarding the number of single-category samples, modal combinations, and pixel proportions. Meanwhile, we have formulated terminology naming conventions to standardize the academic terminology throughout the manuscript, and revised inconsistent terminology such as "cross-type defects" to "composite defects" to ensure the consistency of terminology naming. In addition, we have added annotations to the tables to further enhance the consistency of the data. Thank you again for your guidance.
Comments 5: Metrics and definitions contain errors and omissions
- Expanded metrics table and additional reporting (mAP@0.5:0.95, MRR, FNR/FPR, per-class AP).
- The numerical contradiction in the ablation narrative (mAP@0.5 “0.95”) undermines trust in the reported results.
Response 5:We would like to express our sincere gratitude for your valuable suggestions. In accordance with your comments, we have adjusted and refined the relevant content as follows: we have corrected the erroneous mAP@0.5 value of the A5 model (revised from 0.95 to 87.5%), ensuring consistency with the data presented in the tables. Meanwhile, we have refined the definitions, calculation formulas, units, and corresponding industrial standards for all metrics to guarantee the reproducibility of the experiments. In addition, we have rectified issues such as ambiguous expressions and punctuation errors related to MRR, thus ensuring the consistency of numerical logic throughout the manuscript. Thank you for your guidance on this paper.
Comments 6: Terminology inconsistency regarding “SE”
- “SE” appears consistently used as Squeeze-and-Excitation in the revision.
Response 6:We would like to express our sincere gratitude for your professional suggestions. In accordance with your comments, we have completed the standardized adjustment of the terminology "Squeeze-and-Excitation (SE)" as follows: In the Introduction section, where the term is first mentioned, we have supplemented the full name "Squeeze-and-Excitation" together with its abbreviation "SE", which complies with the annotation norms for the first appearance of a technical term. In the title of Section 2.2, we have added the full terminology "Squeeze-and-Excitation (SE)" to render the title structure more clear and explicit. Meanwhile, we have conducted a thorough check on all the remaining sections of the manuscript, confirming that the abbreviation "SE" is used uniformly throughout the text without any ambiguity and with consistent formatting.
Comments 7: Reproducibility, data, and code availability
- The manuscript lacks clear, formal Data Availability and Code Availability statements.
- If data cannot be shared due to industrial restrictions, the paper must explicitly state the limitation and provide a clear access mechanism or sufficient metadata to facilitate independent replication.
Response 7:We sincerely appreciate the insightful comments from the reviewers on enhancing the transparency of our work. We fully concur that clear statements regarding data and code availability are crucial to ensuring the reproducibility of research findings. However, due to industrial constraints, we are unable to share the relevant data at present, and a corresponding statement has been provided in the manuscript. We will upload part of the code to a public GitHub repository in the future. Thank you for your guidance.
Comments 8: Presentation quality and figure readability
- Captions/labels include errors (e.g., “Visualinterface …”).
- Language/spacing issues persist (missing spaces after punctuation, awkward phrasing).
- Stylistic inconsistency remains (e.g., first-person “I introduce…”).
Response 8:We would like to express our sincere gratitude for your meticulous suggestions. In accordance with your comments, we have carried out comprehensive optimization and adjustments in multiple aspects as follows: we have corrected spelling errors throughout the manuscript, supplemented missing spaces and punctuation marks, and standardized the formatting of subfigure captions to enhance the readability of the content. We have revised all first-person expressions in the text to the academically standardized third-person passive voice. We have rectified such minor errors as missing spaces after punctuation marks, improper formula formatting, and inaccurate expression of temperature ranges, while refining awkward sentence structures for better fluency. In addition, we have fixed the non-sequential numbering issue of tables, split the merged columns, and standardized the formatting of table headers. Thank you for your guidance on this manuscript.
In addition, we have revised the language and details of this manuscript.
We hope that, after these careful revisions, the manuscript meets your high standards. The authors welcome any further constructive comments.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have significantly improved the manuscript in response to the reviewer comments.
Overall, the revised version addresses the major technical concerns raised in the previous review.
However, a few issues still require further improvement, as follows:
1. All figures in the manuscript are presented as raster images (e.g., PNG or JPG), which may lead to quality degradation when zoomed or printed.
Please convert all figures to vector formats (e.g., PDF or EPS) to ensure sufficient resolution and publication quality.
2. The manuscript discusses YOLOv9 as a comparative baseline, but the first occurrence of YOLOv9 in the text is not clearly supported by an authoritative reference. Given that YOLOv9 introduces important architectural and training innovations that are relevant to the comparison, proper citation is necessary for completeness and clarity.
Please cite the following reference at the first mention of YOLOv9 in the manuscript and briefly summarize its key contribution, particularly the concept of programmable gradient information and how it differentiates YOLOv9 from previous YOLO versions:
Wang, C.Y., Yeh, I.H., Mark Liao, H.Y., “YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information,” in ECCV 2024, DOI: https://doi.org/10.1007/978-3-031-72751-1_1
3. While the engineering design is carefully executed and practically effective, the core methodological contributions still rely on known components such as SE attention, multi-scale feature fusion, and multimodal concatenation. The current wording in some sections may overemphasize conceptual novelty.
Please slightly temper the novelty claims in the abstract and conclusion, and clarify that the main contribution lies in a well-engineered multimodal inspection framework rather than a fundamentally new learning paradigm.
4. Although YOLOv9 has been added as a baseline, the manuscript provides limited discussion on configuration choices, parameter scale, and computational budget differences, which are important for interpreting accuracy–speed trade-offs.
Please briefly clarify whether YOLOv9 was trained using its recommended configuration and discuss parameter count or FLOPs relative to the proposed model to further support the fairness of the comparison.
5. The use of SIFT-based alignment improves technical rigor, but its computational cost and robustness under real industrial conditions (e.g., vibration or miscalibration) are not fully discussed.
Please add a short discussion on the runtime overhead and robustness of the alignment process, and comment on whether simpler calibration-based or approximate alignment methods could be viable alternatives in production environments.
6. Failure case analysis could benefit from additional visual support.
The limitation subsection is much improved and clearly written, but most failure modes are described textually.
Please consider adding one or two representative visual examples of typical failure cases.
Comments on the Quality of English Language
The manuscript is generally clear and readable, but minor revisions would help improve fluency and precision. Some sentences are overly long, and certain technical descriptions could be streamlined for clarity, especially in the methodology and discussion sections.
Author Response
Dear Reviewer,
We sincerely appreciate your precious time in reviewing our paper and providing valuable comments. It was your valuable and insightful comments that led to the improvements in the current version. We have carefully considered the comments and tried our best to address every one of them. The detailed corrections are listed below.
Comments 1:All figures in the manuscript are presented as raster images (e.g., PNG or JPG), which may lead to quality degradation when zoomed or printed.
Please convert all figures to vector formats (e.g., PDF or EPS) to ensure sufficient resolution and publication quality.
Response 1:We sincerely appreciate your insightful suggestion regarding the figure formats. All raster images (e.g., PNG, JPG) in the manuscript have been completely converted to high-resolution vector PDF files to ensure optimal clarity during zooming and printing. We have carefully verified that each converted figure maintains its original details and formatting consistency without any quality degradation.
Comments 2:The manuscript discusses YOLOv9 as a comparative baseline, but the first occurrence of YOLOv9 in the text is not clearly supported by an authoritative reference. Given that YOLOv9 introduces important architectural and training innovations that are relevant to the comparison, proper citation is necessary for completeness and clarity.
Please cite the following reference at the first mention of YOLOv9 in the manuscript and briefly summarize its key contribution, particularly the concept of programmable gradient information and how it differentiates YOLOv9 from previous YOLO versions:
Wang, C.Y., Yeh, I.H., Mark Liao, H.Y., “YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information,” in ECCV 2024, DOI: https://doi.org/10.1007/978-3-031-72751-1_1
Response 2:Thank you for your insightful guidance. We have now cited Reference 32 in the manuscript to enhance the academic rigor of our arguments. We would like to express our sincere gratitude again for your valuable suggestions.
Comments 3:While the engineering design is carefully executed and practically effective, the core methodological contributions still rely on known components such as SE attention, multi-scale feature fusion, and multimodal concatenation. The current wording in some sections may overemphasize conceptual novelty.
Please slightly temper the novelty claims in the abstract and conclusion, and clarify that the main contribution lies in a well-engineered multimodal inspection framework rather than a fundamentally new learning paradigm.
Response 3:We would like to express our sincere gratitude for your guidance. At present, we have replaced the overly assertive terms regarding innovation and clarified the core contribution as "industrial-level framework integration". We appreciate your guidance once again.
Comments 4:Although YOLOv9 has been added as a baseline, the manuscript provides limited discussion on configuration choices, parameter scale, and computational budget differences, which are important for interpreting accuracy–speed trade-offs.
Please briefly clarify whether YOLOv9 was trained using its recommended configuration and discuss parameter count or FLOPs relative to the proposed model to further support the fairness of the comparison.
Response 4:We would like to express our sincere gratitude for your valuable suggestions. In accordance with your comments, we have supplemented and optimized the relevant content as follows: First, we have added the clarification that "all YOLO-series models were trained with their officially recommended configurations" in the comparison section, thus defining the premise of fairness for this experimental comparison. Meanwhile, we have supplemented the illustration that the specific training configurations of YOLOv9-S were kept consistent with those of the proposed model, so as to ensure the comparability of the comparison under identical conditions. In addition, we have incorporated quantitative comparison data between the proposed model and YOLOv9-S in terms of parameter count and FLOPs. By integrating these quantitative metrics, we have strengthened the demonstration of the model’s advantages in the three-dimensional trade-off among accuracy, speed and resource consumption, thereby presenting the performance characteristics of the proposed model in a more clear and intuitive manner.
Comments 5:The use of SIFT-based alignment improves technical rigor, but its computational cost and robustness under real industrial conditions (e.g., vibration or miscalibration) are not fully discussed.
Please add a short discussion on the runtime overhead and robustness of the alignment process, and comment on whether simpler calibration-based or approximate alignment methods could be viable alternatives in production environments.
Response 5:We would like to express our sincere gratitude for your professional suggestions. In accordance with your comments, we have supplemented and refined the relevant content as follows: we have added the quantitative data regarding the runtime overhead of SIFT alignment, along with an analysis of its robustness performance in industrial scenarios; we have incorporated the comparative content of two simplified alignment schemes in terms of performance metrics and applicability to different application scenarios; meanwhile, we have clearly elaborated on the rationale for selecting SIFT alignment in this study, which is based on the core principle of "accuracy priority". We would like to thank you again for your valuable guidance.
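For context, the following is a minimal OpenCV sketch of SIFT-based cross-modal alignment (keypoint matching, RANSAC homography estimation, and warping); it is illustrative rather than our production pipeline.

```python
# Minimal sketch of SIFT-based cross-modal alignment with OpenCV.
# Illustrative only; not the production alignment pipeline.
import cv2
import numpy as np

def align_xray_to_visible(xray_gray: np.ndarray, visible_gray: np.ndarray) -> np.ndarray:
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(xray_gray, None)
    kp2, des2 = sift.detectAndCompute(visible_gray, None)

    # Lowe's ratio test keeps only reliable correspondences
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2) if m.distance < 0.75 * n.distance]

    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)  # robust to outlier matches

    h, w = visible_gray.shape[:2]
    return cv2.warpPerspective(xray_gray, H, (w, h))
```

By contrast, a calibration-based alternative of the kind mentioned above would estimate the homography once from a calibration target and reuse it at runtime, trading adaptability to drift for near-zero per-image overhead.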
Comments 6:Failure case analysis could benefit from additional visual support.
The limitation subsection is much improved and clearly written, but most failure modes are described textually.
Please consider adding one or two representative visual examples of typical failure cases.
Response 6:We would like to express our sincere gratitude for your valuable suggestions. We fully acknowledge that supplementing visual examples of failure cases would render the analysis more intuitive and clear. However, since the failure case samples involved in this study are associated with actual production data from industrial scenarios, relevant visual materials cannot be publicly disclosed for the time being due to constraints related to data privacy and scenario confidentiality. Thus, we are unable to add the corresponding visual examples in this revision, and we sincerely hope for your understanding. If conditions permit in the future, we will further refine this part of the content. Thank you for your attention to and tolerance for this manuscript!
We hope that, after these careful revisions, the manuscript meets your high standards. The authors welcome any further constructive comments.
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
Please proofread the work thoroughly.
Author Response
Dear Reviewer,
We would like to express our sincere gratitude for your valuable guidance. In response to your comments, we have carefully revised and polished the manuscript. We would like to extend our heartfelt thanks again for your consistent support and guidance throughout the revision of this paper.
Author Response File:
Author Response.pdf

