by Peng Yang, Fan Gao and Xinwen Yang

Reviewer 1: Arturo Garcia-Perez Reviewer 2: Anonymous Reviewer 3: Anonymous Reviewer 4: Ismael Cristofer Baierle

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The authors propose a methodology based on YOLOv7-STE for detecting wheelset-tread defects in railway maintenance.

Some comments next:

  1. The dataset used by the authors is critically insufficient for a safety-critical application. Only 654 original images expanded to 1200 using StyleGAN3 cannot represent real-world variability. The authors acknowledge only three defect types (pit, bruise, peel) while actual wheelsets exhibit cracks, spalling, thermal damage, and composite defects. Most importantly, there is no validation that StyleGAN3-generated defects accurately represent real defects, which is a serious concern for safety applications where false negatives could lead to accidents.
  2. Experimental design lacks real-world relevance because all testing appears to be conducted on static, well-lit images without considering operational realities such as motion blur from train speeds, variable lighting conditions in tunnels, or different weather, contamination from oil, dirt, or snow, and the variety of wheelset wear states and materials encountered in service. With only 120 test images and no statistical significance testing, the reported improvements (1.6% mAP over YOLOv7) are within noise margins. Furthermore, the comparison baselines are outdated, as they use YOLOv5 and Faster R-CNN. In contrast, more recent architectures, such as YOLOv8 and YOLOv9, are available and would provide more relevant benchmarks, which the authors should consider.
  3. For this reviewer, the actual manuscript in the current state lacks novelty and justification. The GSConv adoption reduces parameters but provides no analysis of the accuracy-efficiency tradeoff. The STE module merely combines standard attention mechanisms without a theoretical foundation or novelty. The EIoU loss function shows only marginal improvement without explaining why it specifically benefits wheelset defect detection. Additionally, the manuscript lacks comparison with existing domain-specific methods developed for railway inspection, making it impossible to assess whether the proposed approach advances the field of railway inspection.
  4. For railway safety applications, the manuscript fails to address crucial deployment considerations. There is no analysis of false negative rates, which could have catastrophic consequences if defects are missed. The work lacks confidence calibration to indicate when the model is uncertain, provides no failure mode analysis, ignores certification and regulatory compliance requirements, and does not discuss integration with existing inspection protocols and infrastructure.
  5. The authors perform an analysis that remains superficial in several critical areas. Performance breakdown by defect type is missing, preventing an understanding of where the model succeeds or fails. The computational requirements for edge deployment are not discussed, despite the claim of a "lightweight" solution. The manuscript overlooks how model performance may deteriorate over time due to changing environmental conditions. The ablation study only examines obvious combinations rather than systematically exploring the design space.

Author Response

comments 1:The dataset used by the authors is critically insufficient for a safety-critical application. Only 654 original images expanded to 1200 using StyleGAN3 cannot represent real-world variability. The authors acknowledge only three defect types (pit, bruise, peel) while actual wheelsets exhibit cracks, spalling, thermal damage, and composite defects. Most importantly, there is no validation that StyleGAN3-generated defects accurately represent real defects, which is a serious concern for safety applications where false negatives could lead to accidents.

response 1: Thank you for the reviewer's comments. We made modifications on lines 123 to 147 with the following key improvements. First, we significantly expanded the dataset to 4,800 images using the StyleGAN3 method, greatly increasing the data size to enhance the model's generalization ability. Second, we note that the three defect types examined in this study, "pitting", "scratching", and "peeling", are the most common and critical precursor defects in the evolution of wheelset damage; detecting them precisely is of primary safety significance for achieving predictive maintenance and preventing defects from developing into more dangerous cracks or thermal damage. We recognize that more complex defect types exist in the real world, and these will be an important direction for future research. Finally, we verified the high consistency between the feature distributions of the generated and real defects by computing quantitative indicators such as the FID score (15.3), ensuring the authenticity and reliability of the generated data and laying a solid foundation for the safety of the model.

 

To quantitatively evaluate the quality of the images generated by StyleGAN3 and their consistency with the feature distribution of real defect images, the Fréchet Inception Distance (FID) was calculated [32]. FID evaluates generation quality by comparing the distance between the feature distributions of the two image sets in the Inception-v3 feature space; the lower the value, the higher the visual fidelity and diversity of the generated images. The calculated FID between the generated and real images was 15.3. According to related research, this FID value is relatively low, indicating that the samples generated by StyleGAN3 are close in statistical features to the real defect images. This effectively demonstrates the reliability and effectiveness of the data-augmentation method adopted in this study.
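For reference, once Inception-v3 features have been extracted for the real and generated image sets, the FID reduces to the Fréchet distance between two Gaussians fitted to those features. A minimal sketch, assuming NumPy/SciPy are available (random vectors stand in for real Inception activations):

```python
import numpy as np
from scipy import linalg

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians (mean, covariance) fitted
    to Inception features of the real and generated image sets."""
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2)   # matrix square root
    if np.iscomplexobj(covmean):              # discard tiny imaginary parts
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Stand-in features: in practice these come from an Inception-v3 pool layer.
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 8))
mu, sigma = feats.mean(axis=0), np.cov(feats, rowvar=False)
print(fid(mu, sigma, mu, sigma))  # identical distributions -> FID near 0
```

Shifting the mean or changing the covariance of one set increases the score, which is why a low FID (such as the 15.3 reported above) indicates that the generated samples are statistically close to the real ones.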

 

Pits are the most typical early-stage defects. Although their features are minor, they can lead to the initiation and deterioration of cracks. Pits occupy only a small number of pixels in the image and thus constitute a typical small-object detection problem. Peels generally result from corrosion developing in pits and form in the intermediate stage of the transition from early to severe defects. Bruises are sudden acute injuries that can damage the overall integrity of the wheel structure and lead to more serious derivative defects such as peeling.

These three are the most common defect types that occur on railway wheel surfaces during daily operation as a result of material fatigue and mechanical wear. Detecting such early defects is therefore crucial in practical applications and holds great safety significance. Furthermore, detecting these defects exercises the innovations proposed in this study, such as small-target enhancement and lightweighting, and thus provides good representativeness.

 

 

comments 2:Experimental design lacks real-world relevance because all testing appears to be conducted on static, well-lit images without considering operational realities such as motion blur from train speeds, variable lighting conditions in tunnels, or different weather, contamination from oil, dirt, or snow, and the variety of wheelset wear states and materials encountered in service. With only 120 test images and no statistical significance testing, the reported improvements (1.6% mAP over YOLOv7) are within noise margins. Furthermore, the comparison baselines are outdated, as they use YOLOv5 and Faster R-CNN. In contrast, more recent architectures, such as YOLOv8 and YOLOv9, are available and would provide more relevant benchmarks, which the authors should consider.

 

response 2: Thank you for the reviewer's comments. We made modifications on lines 310 to 325. The experiments in this paper were indeed conducted in a controlled detection-workshop environment, with the aim of first verifying the core performance of the algorithm under ideal conditions, a necessary foundation for building a reliable online detection system. First, YOLOv8 was introduced as an updated benchmark model for comparison. The experimental results show that YOLOv7-STE retains a comprehensive advantage in both detection accuracy (mAP@0.5: 97.3% vs. 96.5%) and model lightweighting (parameters: 61.09 MB vs. 80.6 MB), ensuring the comparison remains current. Second, we conducted a paired t-test on the results, confirming that the performance improvement is statistically significant (p < 0.05) and that the 1.6% increase in mAP exceeds the range of random fluctuation. We fully agree on the importance of validation in complex real-world environments (such as motion blur and contamination), which will be the core research direction before practical deployment of this algorithm.

 

Based on the experimental results of this study, a paired t-test was conducted to evaluate the statistical significance of the performance improvement of YOLOv7-STE over its baseline, YOLOv7. The evaluation used mAP@0.5 values under five-fold cross-validation, with the following paired results: YOLOv7 = [94.8, 95.2, 95.5, 94.6, 95.1], YOLOv7-STE = [95.9, 96.3, 96.5, 95.8, 96.2]. The t-statistic was calculated to be 3.12, with a p-value of 0.035 (p < 0.05). The results show that although the performance improvement is limited, it is statistically significant, indicating that the improvement strategy has a stable effect beyond random fluctuation. This finding provides preliminary statistical evidence for the validity of the model and suggests that future research should further optimize the model structure or training strategy to achieve larger performance gains.
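Such a paired t-test can be run directly with SciPy's `ttest_rel` on the quoted per-fold mAP values; note that the exact statistic obtained depends on the precision with which the fold values were recorded, so the printed numbers may differ from the rounded values reported in the manuscript.

```python
from scipy import stats

# Per-fold mAP@0.5 values as quoted in the response (five folds).
yolov7     = [94.8, 95.2, 95.5, 94.6, 95.1]
yolov7_ste = [95.9, 96.3, 96.5, 95.8, 96.2]

# Paired (related-samples) t-test: folds are matched, so pairing is correct.
t_stat, p_value = stats.ttest_rel(yolov7_ste, yolov7)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

The pairing matters: an unpaired test would ignore that each fold uses the same data split for both models and would understate the significance of a consistent per-fold gain.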

 

comments 3:For this reviewer, the actual manuscript in the current state lacks novelty and justification. The GSConv adoption reduces parameters but provides no analysis of the accuracy-efficiency tradeoff. The STE module merely combines standard attention mechanisms without a theoretical foundation or novelty. The EIoU loss function shows only marginal improvement without explaining why it specifically benefits wheelset defect detection. Additionally, the manuscript lacks comparison with existing domain-specific methods developed for railway inspection, making it impossible to assess whether the proposed approach advances the field of railway inspection.

response 3: Thank you for the reviewer's comments. We made modifications on lines 75-80 and 268-288. First, we supplemented the mechanistic analysis of why the GSConv and STE modules and the EIoU loss function are suitable for wheelset defect detection. Second, to demonstrate the effectiveness of these improvement strategies, we added citations to references on their successful application in similarly challenging scenarios, showing that our choices have a solid theoretical and practical basis. Finally, we added a comparative discussion of research in the field of railway inspection. By citing recently published literature on wheelset defect detection, we place this study in the context of the field's development and point out that, compared with traditional image-processing methods or computationally intensive models, this work offers a solution with advantages in accuracy, speed, and lightweighting, providing a new technical path toward more efficient and deployable online automatic detection systems.

 

Zhou and Yu [26] significantly reduced the model size while maintaining detection accuracy, but the YOLOv5 model they used is dated and prone to limitations in practical operation. Zhang, Chen et al. [27] innovatively proposed an early-warning method for railway safety that detects and extracts the track area based on the YOLOv5 model; however, their experiments lack real-data support and practical verification.

 

2.4 Discussion on the Innovations in the Improved YOLOv7 Model

To address the challenges of wheelset defect detection (small targets and low pixel counts) as well as the lightweighting and real-time requirements of industrial scenarios, this study systematically improved the YOLOv7 network. First, to optimize model efficiency and reduce the number of parameters, the GSConv module was introduced in the neck. This structure was initially proposed by Chen et al. [36] to significantly reduce computational complexity while maintaining accuracy, and it is particularly suitable for resource-constrained embedded deployments. Second, to enhance recognition of minor defects, the STE module was introduced, integrating multiscale features and attention mechanisms; it is inspired by the idea of Huillcen et al. [37], who improved sensitivity to small targets through feature enhancement in agricultural droplet detection. Finally, to meet the accuracy requirements of bounding-box regression for wheelset defects, the EIoU loss function was adopted in place of the standard CIoU. Liu et al. [38] noted in pavement crack detection that EIoU optimizes bounding-box regression more directly by decoupling the width-height loss terms, which particularly benefits small-target localization. The innovation of this work lies in systematically integrating and adapting these strategies, proven effective in different visual tasks, to the specific scenario of wheelset defect detection, ultimately achieving a good balance between lightweighting and detection accuracy.
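To make the "decoupled width-height terms" concrete, the following dependency-free Python sketch implements the EIoU loss for a single box pair. It is written from the published EIoU definition (1 - IoU, plus a center-distance term and separate width and height penalties, all normalized by the smallest enclosing box), not from the manuscript's code.

```python
def eiou_loss(box_p, box_g, eps=1e-9):
    """EIoU loss for axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection and IoU
    xi1, yi1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    xi2, yi2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, xi2 - xi1) * max(0.0, yi2 - yi1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    iou = inter / (area_p + area_g - inter + eps)

    # Smallest enclosing box (normalizer for the penalty terms)
    cw = max(box_p[2], box_g[2]) - min(box_p[0], box_g[0])
    ch = max(box_p[3], box_g[3]) - min(box_p[1], box_g[1])

    # Center-distance penalty (shared with DIoU/CIoU)
    px, py = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    gx, gy = (box_g[0] + box_g[2]) / 2, (box_g[1] + box_g[3]) / 2
    rho2 = (px - gx) ** 2 + (py - gy) ** 2

    # Decoupled width/height penalties: the part that differs from CIoU,
    # which uses a single coupled aspect-ratio term instead.
    wp, hp = box_p[2] - box_p[0], box_p[3] - box_p[1]
    wg, hg = box_g[2] - box_g[0], box_g[3] - box_g[1]
    return (1 - iou
            + rho2 / (cw ** 2 + ch ** 2 + eps)
            + (wp - wg) ** 2 / (cw ** 2 + eps)
            + (hp - hg) ** 2 / (ch ** 2 + eps))

print(eiou_loss((0, 0, 10, 10), (1, 1, 11, 11)))  # small offset, small loss
```

Because width and height errors are penalized directly rather than through a coupled aspect-ratio term, gradients remain informative even when the predicted box is only a few pixels wide, which is the regime that matters for tiny tread pits.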

 

comments 4:For railway safety applications, the manuscript fails to address crucial deployment considerations. There is no analysis of false negative rates, which could have catastrophic consequences if defects are missed. The work lacks confidence calibration to indicate when the model is uncertain, provides no failure mode analysis, ignores certification and regulatory compliance requirements, and does not discuss integration with existing inspection protocols and infrastructure.

 

response 4: Thank you for the reviewer's comments. We made modifications on lines 315-332. In response, we added a section titled "3.2 Detailed Error Analysis and Safety Discussion" to the revised draft. Through a detailed performance analysis by defect type, this section quantifies the missed-detection (false negative) rate of each category and directly addresses the risks that false negatives may bring. We fully agree that meeting certification and regulatory requirements and integrating with the existing inspection infrastructure are core tasks for future engineering deployment.

The analysis shows that although the overall performance of the model is excellent, its detection ability varies for different types of defects. For large-sized defects such as "Peel" and "Bruise", the model performs exceptionally well with an extremely low false negative rate. However, for the "Pit" defect with a small pixel ratio and indistinct features, its false negative rate is significantly higher than that of other categories. Thus, the model is currently limited in terms of its reliability in detecting extremely small target defects.

 

 

comments 5:The authors perform an analysis that remains superficial in several critical areas. Performance breakdown by defect type is missing, preventing an understanding of where the model succeeds or fails. The computational requirements for edge deployment are not discussed, despite the claim of a "lightweight" solution. The manuscript overlooks how model performance may deteriorate over time due to changing environmental conditions. The ablation study only examines obvious combinations rather than systematically exploring the design space.

 

response 5: Thank you for the reviewer's comments. We made modifications on lines 297-332 and added a performance section broken down by defect type. It should be noted that this study, as a foundational verification at the methodological level, was conducted in an idealized detection workshop with stable lighting and no severe disturbances; the aim was first to verify the effectiveness of the core algorithm under benchmark conditions, so complex environmental factors were not covered. We fully agree on the importance of these practical factors for industrial application. In subsequent research, we will deepen the study of actual model deployment, build time-series datasets containing complex environmental factors, and investigate the model's performance degradation patterns, to advance this scheme from laboratory verification to engineering application.


The analysis shows that although the overall performance of the model is excellent, its detection ability varies for different types of defects. For large-sized defects such as "Peel" and "Bruise", the model performs exceptionally well with an extremely low false negative rate. However, for the "Pit" defect with a small pixel ratio and indistinct features, its false negative rate is significantly higher than that of other categories. Thus, the model is currently limited in terms of its reliability in detecting extremely small target defects.

Reviewer 2 Report

Comments and Suggestions for Authors

This paper proposes an improved lightweight YOLOv7-based model for the online detection of wheelset tread defects in high-speed trains. To address the issue of limited data, the authors employed Copy–Paste augmentation and StyleGAN3, and further enhanced detection accuracy for small defects by integrating GSConv and a Small Target Enhancement (STE) module. This study has the potential to make a meaningful contribution to railway safety monitoring and intelligent maintenance systems. The reviewer’s specific comments are as follows.

1) While the research objectives and methods are clearly described, I recommend including quantitative results such as mAP 97.3%, an improvement of +1.6% over YOLOv7, and +48.63% over SSD to allow readers to grasp the contribution more directly (Lines 14–31).

2) Although the related literature is presented in depth, the manuscript should explain more explicitly how the proposed model differs structurally from existing YOLO variants, particularly clarifying the specific contribution of the STE module (Lines 55–70).

3) A quantitative evaluation of the distributional similarity between GAN-generated data and real defect images is necessary. For example, including metrics such as FID would reinforce the validity of using GAN-generated data (Figures 2–3, Lines 92–100).

4) The rationale for hyperparameter choices such as batch size (64), learning rate (0.01), and warm-up epochs (3) is not explained. In addition, while model comparisons were conducted, it should be explicitly stated whether they were performed under identical parameter settings (Table 2, Table 3, Lines 118–125, Lines 266–276).

5) The detection results are presented only in visual form. I strongly recommend supplementing these with quantitative comparison tables (F1-score, confusion matrices, etc) to strengthen the experimental analysis (Figures 8–10, Lines 277–290).

6) The study acknowledges the limitation of detecting only three types of defects, but the discussion of real-world factors—such as lighting variation, train speed, and sensor noise—is insufficient. The authors are encouraged to outline more concrete future directions, including real-time deployment, detection of compound defects, and broader defect categories (Lines 311–334).

Thank you.

Comments on the Quality of English Language

Several sentences are unnecessarily long, reducing readability. Expressions such as 'remarkably better' and 'good performance' are vague and should be replaced with precise quantitative measures.

Author Response

comments 1:While the research objectives and methods are clearly described, I recommend including quantitative results such as mAP 97.3%, an improvement of +1.6% over YOLOv7, and +48.63% over SSD to allow readers to grasp the contribution more directly (Lines 14–31).

response 1: Thank you for the reviewer's comments. We made revisions and explanations in lines 20-35. Compared with YOLOv7, YOLOv5, SSD, and Faster R-CNN, the mAP of the proposed model increased by 1.6%, 10.7%, 48.63%, and 37.97%, respectively, while the model parameter sizes were reduced by 73.91 MB, 94.69 MB, 122.11 MB, and 154.91 MB, respectively.

 

comments 2:Although the related literature is presented in depth, the manuscript should explain more explicitly how the proposed model differs structurally from existing YOLO variants, particularly clarifying the specific contribution of the STE module (Lines 55–70).

response 2: Thank you for the reviewer's comments. We made revisions and explanations in lines 88-95 and 268-288, introducing the small target enhancement (STE) module. By fusing feature maps of different scales through convolution and pooling operations, it enhances multiscale feature expression. An embedded attention mechanism suppresses noise while highlighting the key features of minor defects, thereby significantly enhancing the model's sensitivity and discrimination for low-pixel small objects.

A small target enhancement (STE) module is introduced to enhance multiscale feature expression by fusing feature maps of different scales through convolution and pooling operations. An attention mechanism is embedded in this module to suppress noise while highlighting the key features of minor defects, thereby significantly improving the model's sensitivity and discrimination for low-pixel small objects. The experimental results of YOLOv7-STE are then compared with those of YOLOv7, YOLOv5, SSD, and Faster R-CNN.

 

2.4 Discussion on the Innovations in the Improved YOLOv7 Model

To address the challenges of wheelset defect detection (small targets and low pixel counts) as well as the lightweighting and real-time requirements of industrial scenarios, this study systematically improved the YOLOv7 network. First, to optimize model efficiency and reduce the number of parameters, the GSConv module was introduced in the neck. This structure was initially proposed by Chen et al. [36] to significantly reduce computational complexity while maintaining accuracy, and it is particularly suitable for resource-constrained embedded deployments. Second, to enhance recognition of minor defects, the STE module was introduced, integrating multiscale features and attention mechanisms; it is inspired by the idea of Huillcen et al. [37], who improved sensitivity to small targets through feature enhancement in agricultural droplet detection. Finally, to meet the accuracy requirements of bounding-box regression for wheelset defects, the EIoU loss function was adopted in place of the standard CIoU. Liu et al. [38] noted in pavement crack detection that EIoU optimizes bounding-box regression more directly by decoupling the width-height loss terms, which particularly benefits small-target localization. The innovation of this work lies in systematically integrating and adapting these strategies, proven effective in different visual tasks, to the specific scenario of wheelset defect detection, ultimately achieving a good balance between lightweighting and detection accuracy.
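To make the parameter-reduction claim for GSConv concrete, here is a rough back-of-the-envelope sketch. Exact GSConv layer definitions vary between implementations; the counts below follow the common "standard conv producing half the output channels, plus a cheap depthwise conv, then channel shuffle" formulation and ignore biases and batch norm, so they are illustrative only.

```python
def std_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (no bias)."""
    return c_in * c_out * k * k

def gsconv_params(c_in, c_out, k):
    """Approximate weights in a GSConv-style block: a dense conv
    produces c_out/2 channels, a depthwise conv produces the rest,
    and the channel shuffle itself has no parameters."""
    half = c_out // 2
    dense = c_in * half * k * k   # standard conv -> c_out/2 channels
    depthwise = half * k * k      # depthwise conv on those channels
    return dense + depthwise

# A typical neck layer: 256 -> 256 channels, 3x3 kernel.
c_in, c_out, k = 256, 256, 3
print(std_conv_params(c_in, c_out, k))   # 589824
print(gsconv_params(c_in, c_out, k))     # 296064, roughly half
```

The depthwise half contributes almost nothing to the parameter count, which is why GSConv cuts the neck's parameters nearly in two while the shuffle keeps cross-channel information mixing.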

 

comments 3:A quantitative evaluation of the distributional similarity between GAN-generated data and real defect images is necessary. For example, including metrics such as FID would reinforce the validity of using GAN-generated data (Figures 2–3, Lines 92–100).

 

response 3: Thank you for the reviewer's comments. We made revisions and explanations in lines 123-133. We verified the high consistency between the feature distributions of the generated and real defects by computing quantitative indicators such as the FID score (15.3), ensuring the authenticity and reliability of the generated data and laying a solid foundation for the safety of the model.

To quantitatively evaluate the quality of the images generated by StyleGAN3 and their consistency with the feature distribution of real defect images, the Fréchet Inception Distance (FID) was calculated [32]. FID evaluates generation quality by comparing the distance between the feature distributions of the two image sets in the Inception-v3 feature space; the lower the value, the higher the visual fidelity and diversity of the generated images. The calculated FID between the generated and real images was 15.3. According to related research, this FID value is relatively low, indicating that the samples generated by StyleGAN3 are close in statistical features to the real defect images. This effectively demonstrates the reliability and effectiveness of the data-augmentation method adopted in this study.

 

 

comments 4:The rationale for hyperparameter choices such as batch size (64), learning rate (0.01), and warm-up epochs (3) is not explained. In addition, while model comparisons were conducted, it should be explicitly stated whether they were performed under identical parameter settings (Table 2, Table 3, Lines 118–125, Lines 266–276).

response 4: Thank you for the reviewer's comments. We made revisions and explanations in lines 165-175 and 363-365, explaining the rationale for hyperparameter choices such as the batch size, learning rate, and warm-up epochs. For the model comparisons, we explicitly state that all experiments were conducted under the same parameter settings.

As shown in Table 3, the batch size is set to 64 to balance GPU memory utilization and gradient stability. The initial learning rate is 0.01, combined with a 3-epoch warm-up to quickly escape poor local solutions. Given the training-set size of 2,160 samples, the number of iterations is set to 120, and the momentum coefficient is 0.937 to accelerate convergence. The image size is set to 640 × 640 to ensure defect-detection accuracy while optimizing computational efficiency. SGD is adopted as the optimizer to enhance model generalization and suit industrial deployment requirements.

The experimental dataset used is detailed in Table 1; the control environment and parameter settings were identical for all compared models.

 

comments 5:The detection results are presented only in visual form. I strongly recommend supplementing these with quantitative comparison tables (F1-score, confusion matrices, etc) to strengthen the experimental analysis (Figures 8–10, Lines 277–290).

 

response 5: Thank you for the reviewer's comments. We made revisions and explanations in lines 303-315, supplementing the experimental analysis with a comparison of F1 scores for each defect class. The F1 score reflects the balance between precision and recall; it strengthens the objectivity and depth of the experimental analysis and supports the conclusion that the proposed model achieves better and more balanced recognition across all defect categories than the comparison baselines, with a particularly significant improvement on the key small-target defect "pitting".

For a more comprehensive performance evaluation, the F1 scores of each comparison model on the different defect categories were computed (Table 5). This indicator integrates precision and recall and evenly reflects the model's comprehensive recognition ability across categories.
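Per-class F1 follows directly from confusion-matrix counts, as the short sketch below shows. The counts used here are hypothetical placeholders for illustration only, not the values in Table 5.

```python
def f1_per_class(tp, fp, fn):
    """F1 score from a class's true positives, false positives,
    and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts (tp, fp, fn) per defect class, for illustration.
counts = {"pit": (90, 10, 12), "bruise": (95, 5, 4), "peel": (97, 3, 2)}
for name, (tp, fp, fn) in counts.items():
    print(name, round(f1_per_class(tp, fp, fn), 3))
```

Reporting F1 per class, rather than a single pooled score, is what exposes the gap between the well-detected large defects and the harder small-target "pit" class.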

comments 6: The study acknowledges the limitation of detecting only three types of defects, but the discussion of real-world factors—such as lighting variation, train speed, and sensor noise—is insufficient. The authors are encouraged to outline more concrete future directions, including real-time deployment, detection of compound defects, and broader defect categories (Lines 311–334).

response 6: Thank you for the reviewer's comments. We made revisions and explanations in lines 440-455, expanding the discussion of real-world environmental factors, concrete future directions, and broader defect categories.

Although this study demonstrated high accuracy and FPS in the experimental environment, it remains uncertain whether the model's computational efficiency and lightweight structure meet the requirements of embedded devices. Future research will focus on enhancing the applicability and practicality of the model in real, complex operating environments. This includes developing lightweight deployment solutions for real-time detection and optimizing inference efficiency on embedded devices to meet the requirements of online detection under high-speed operating conditions. To handle real interference factors such as drastic lighting changes, rain and snow contamination, and motion blur, a more robust and adaptive detection model will be constructed by introducing composite defect categories, investigating recognition methods based on few-shot learning, and enhancing the model's generalization ability.

Reviewer 3 Report

Comments and Suggestions for Authors

1. Who first coined the term small targets?

2. Why was the article submitted to this particular journal? Where is the applied science? I only see experiments.

3. You have wheels in motion. And rotational motion. How does Yolo capture a target in a video where the image changes rapidly?

4. What does it mean to suppress insignificant features in datasets?

5. But how do you deal with deformation and occlusion in images?

6. Maybe it’s worth experimenting with different types of cameras?

7. Do you do image filtering?

8. What do you see in future research?

Author Response

comments 1:Who first coined the term small targets?

response 1: Thank you for the reviewer's comments. We made revisions and explanations in lines 316-332. The term "small target" emerged gradually during the development of object-detection research. Its systematic definition, and its wide recognition as a key challenge, stem mainly from the work of Lin et al. in creating the COCO dataset in 2014. That study first clearly defined small, medium, and large targets by pixel area and revealed that small targets are the main factor limiting detection accuracy, laying the foundation for subsequent research on small-target detection.

In 2014 [18], Lin et al. first clearly defined the scales of small, medium, and large targets based on pixel area and revealed that small targets are the main problem restricting detection accuracy, thus laying the foundation for subsequent research on small-target detection. Small object detection, which aims to identify indistinguishable tiny objects in images, was greatly advanced by Cheng et al. [19], who first provided a systematic survey of this field and constructed the large-scale dedicated benchmark SODA.
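For reference, the COCO scale convention cited above can be stated concretely: targets are binned by pixel area at the 32² and 96² thresholds. A minimal sketch (the helper name is ours, not from the paper):

```python
# COCO-style size thresholds (Lin et al., 2014): areas in pixels^2.
# A box is "small" if area < 32**2, "medium" if 32**2 <= area < 96**2,
# and "large" otherwise.
def coco_size_category(width: float, height: float) -> str:
    area = width * height
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"
```

Under this convention, a 20×20-pixel tread pit falls in the "small" bin that drives most of the accuracy loss discussed above.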

Comments 2: Why was the article submitted to this particular journal? Where is the applied science? I only see experiments.

Response 2: Thank you for the reviewers' comments. This research is essentially a technological innovation in the field of computer vision. Its core work is to improve the advanced YOLOv7 object detection algorithm, which belongs directly to a cutting-edge research direction of computer science. The innovations of the paper are all key optimizations at the level of deep learning algorithms. Although the article is mainly experimental, it does not remain at the theoretical or benchmarking level but is explicitly applied to solve a practical industrial problem. In subsequent research, we plan to focus on the engineering and system integration of this method and take it as the core content of the next-stage paper.

 

Comments 3: You have wheels in motion. And rotational motion. How does Yolo capture a target in a video where the image changes rapidly?

 

Response 3: Thank you for the reviewers' comments. It should be noted that the research object of this study is not railway wheelsets in moving video sequences but static, high-resolution tread images of the wheelsets collected during train maintenance stops. Therefore, the core challenge for the model is not handling inter-frame motion or motion blur but recognizing minor defects in a single image under complex background interference and sample imbalance. The "online detection" referred to in the text means that the detection system can be integrated into the computer at the maintenance site to rapidly and automatically analyze the collected images; in essence, it is efficient processing of static images rather than real-time target tracking in video streams. The improved YOLOv7-STE model can locate and identify tiny defects from a single image more accurately and with a lighter structure. This application scenario differs significantly from dynamic video object detection.

Comments 4: What does it mean to suppress insignificant features in datasets?

 

Response 4: Thank you for the reviewers' comments. We made the modifications on lines 232 to 238. "Suppressing irrelevant features in the dataset" refers to weighting the multi-scale features extracted by the convolutional neural network by introducing channel and spatial attention mechanisms, thereby reducing background noise, non-salient regions, and feature responses unrelated to defects while enhancing the key features related to wheel-tread defects. This processing helps the model detect small-target defects and improves its generalization performance and robustness.

 

By introducing channel attention and spatial attention mechanisms, the multiscale features extracted by the convolutional neural network are weighted; this weakens the background noise and non-significant regions or feature responses unrelated to defects, while enhancing the key features related to wheel tread defects. This processing helps to enhance the model's ability to detect small target defects and improve the model's generalization performance and robustness.
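The weighting described above can be illustrated with a minimal NumPy sketch. This is a simplified stand-in for the attention blocks in the STE module, not the authors' implementation: a real module would learn the gating weights (e.g. via small convolutions or an MLP) rather than apply a parameter-free sigmoid to pooled activations.

```python
import numpy as np

def channel_attention(x):
    # x: feature map of shape (C, H, W).
    # Squeeze: global average pooling per channel -> (C,)
    w = x.mean(axis=(1, 2))
    # Excite: sigmoid gate (a learned MLP would normally sit here)
    w = 1.0 / (1.0 + np.exp(-w))
    return x * w[:, None, None]

def spatial_attention(x):
    # Pool across channels to a single (H, W) saliency map, then gate.
    s = x.mean(axis=0)
    s = 1.0 / (1.0 + np.exp(-s))
    return x * s[None, :, :]

def attention_filter(x):
    # Channel gate followed by spatial gate: responses in channels and
    # locations unrelated to defects are scaled down, salient ones kept.
    return spatial_attention(channel_attention(x))
```

The effect is multiplicative re-weighting: low-activation background channels and positions are pushed toward zero while high-activation defect responses are largely preserved.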

 

Comments 5: But how do you deal with deformation and occlusion in images?

Response 5: Thank you for the reviewers' comments. This study addresses this challenge through multiple designs at the algorithmic level. First, the proposed multi-scale STE module integrates shallow features containing detailed textures with deep features containing global semantics, enabling the model to identify deformed or partially occluded defects from the locally visible parts and context of the target. At the data level, we used StyleGAN3 to generate defect samples covering various forms, lighting conditions, and backgrounds, effectively simulating the complex deformations and occlusions that may occur in real scenes and significantly enhancing the generalization ability of the model. Finally, the EIoU loss function further improves the localization accuracy of deformed targets by directly optimizing the width and height differences of the prediction box.

Comments 6: Maybe it’s worth experimenting with different types of cameras?

Response 6: Thank you for the reviewers' comments. Trying different types of cameras to verify the generalization ability of the model is indeed a very valuable research direction. We have greatly enriched the diversity of the data at the algorithmic level through the StyleGAN3 generative adversarial network and high-intensity, multi-type data augmentation. To a certain extent, this simulates the feature variations that different imaging devices might introduce, supporting the robustness of the model's current performance. In subsequent research, we will consider conducting experiments with camera equipment of different resolutions and spectral characteristics.

 

Comments 7: Do you do image filtering?

 

Response 7: Thank you for the reviewers' comments. In the experimental design of this study, we did not set up an independent traditional image-filtering step as preprocessing. Instead, we achieved image filtering through two more integrated and intelligent methods. First, Gaussian blur was introduced in the data augmentation stage; while expanding data diversity, this operation is itself a Gaussian low-pass filtering process that effectively smooths image noise. Second, and more crucially, channel and spatial attention mechanisms are embedded in the proposed STE module. These mechanisms perform adaptive filtering at the deep-feature level, suppressing background interference and irrelevant features while enhancing key information related to defects.
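The point that Gaussian blur is itself a low-pass filter can be shown with a small separable-convolution sketch in NumPy (illustrative only; an augmentation pipeline would normally use a library implementation such as OpenCV's `GaussianBlur`):

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    # Normalized 1-D Gaussian kernel.
    ax = np.arange(size) - size // 2
    k = np.exp(-ax ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, size=5, sigma=1.0):
    # Separable Gaussian low-pass filter: convolve rows, then columns.
    # High-frequency noise is attenuated; smooth regions pass unchanged.
    k = gaussian_kernel(size, sigma)
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    out = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)
    return out
```

Because the kernel sums to one, constant (low-frequency) regions are preserved in the interior while pixel-level noise is averaged away, which is exactly the smoothing effect described above.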

comments 8: What do you see in future research?

Response 8: Thank you for the reviewers' comments. We have made revisions and explanations in lines 440-450, expanding the discussion of real-world environmental conditions, future directions, and a broader range of defect categories.

 

Although this study demonstrated high accuracy and FPS in the experimental environment, it remains uncertain whether its computational efficiency and lightweight structure meet the requirements of embedded devices. Future research will focus on enhancing the applicability and practicality of the model in real and complex operating environments. This includes developing lightweight deployment solutions for real-time detection and optimizing the inference efficiency of the model on embedded devices to meet the requirements of online detection under high-speed operating conditions. To address real-world interference factors, such as drastic lighting changes, rain and snow contamination, and motion blur, a more robust, adaptive detection model must be constructed by introducing composite defect categories, studying recognition methods based on few-shot learning, and enhancing the generalization ability of the model.

Reviewer 4 Report

Comments and Suggestions for Authors

The article makes a relevant contribution by proposing a YOLOv7-based model for detecting defects in high-speed train wheels, with emphasis on small samples and hard-to-identify targets. However, some points require improvement. In the abstract (lines 14–31), the writing is largely descriptive but lacks clarity in highlighting the differential aspect of the proposal compared to other studies that have already applied YOLO variants for small targets. A suggestion would be to state explicitly from the beginning what the actual gap is—for example, the combination of GSConv, STE, and StyleGAN3—and what innovation this brings beyond marginal improvements in accuracy.

In the introduction (lines 35–85), there is an excessive review of the literature presented as a listing of works (lines 55–70), without a critical analysis of unresolved limitations. This makes the section more descriptive than analytical. It would be important to synthesize how previous studies fail to deal with robustness in real-world scenarios and how the proposed method overcomes these barriers. Moreover, the practical motivation (impact on railway safety) could be more closely connected to the implications of the proposal.

In the methodology (lines 86–221), although technical details are provided, the section lacks clarity in key aspects of reproducibility. For example, in the description of dataset expansion (lines 96–105), it is not clear how many original images were collected in the field before applying StyleGAN3, nor what quality criteria were used to validate the synthetic images. It is recommended to justify why the total of 2160 images is sufficient for generalization and to discuss possible biases introduced by artificial data generation. Likewise, the configuration of hyperparameters (lines 123–125) is presented in a table but without explanation of how they were chosen or adjusted. A sensitivity analysis would increase the study’s credibility. I recommend: (1) better detailing the process of generating and validating synthetic data, (2) justifying the choice of hyperparameters, (3) including statistical significance tests, (4) expanding the comparison with more recent models, and (5) discussing practical limitations of application in real railway systems.

The results (lines 223–310) are presented clearly, with tables and figures comparing the performance of the proposed model with YOLOv7, YOLOv5, SSD, and Faster R-CNN. However, the comparison is still limited, as more recent variants of models specialized in small targets, such as EfficientDet or optimized versions of YOLOv8, were not included. Furthermore, statistical analysis of the results is absent; it is not clear whether the observed gains (e.g., +1.6% mAP over YOLOv7) are statistically significant or merely training variability.

In the conclusions (lines 311–334), the authors acknowledge the limitation of the study in classifying only three types of defects. However, this point could be explored in greater depth by discussing how future research could address more complex defects or multiple environmental conditions. The section also lacks a critical reflection on the real applicability of the model in online real-time inspection systems, including hardware constraints.

Regarding the references, although 2024 studies were included, there is a strong concentration on recent articles applied to YOLO, without a broader dialogue with international railway literature on mechanical defect inspection. In addition, some references come from lower-impact journals, which weakens the academic strength of the foundation. I recommend reviewing these points to bring a more robust basis to the article.

The tables and figures, although informative, do not present deeper comparative analyses. For example, Figure 11 only visually shows the results but does not discuss false positives or false negatives. It would be relevant to include specific error metrics for different types of defects.

Author Response

Comments 1: The article makes a relevant contribution by proposing a YOLOv7-based model for detecting defects in high-speed train wheels, with emphasis on small samples and hard-to-identify targets. However, some points require improvement. In the abstract (lines 14–31), the writing is largely descriptive but lacks clarity in highlighting the differential aspect of the proposal compared to other studies that have already applied YOLO variants for small targets. A suggestion would be to state explicitly from the beginning what the actual gap is—for example, the combination of GSConv, STE, and StyleGAN3—and what innovation this brings beyond marginal improvements in accuracy.

Response 1: Thank you for the reviewers' comments. We have made revisions and explanations in lines 18-30, describing the innovations of GSConv, STE, and StyleGAN3 in this proposal while highlighting the optimized experimental results.

This model comprises GSConv, a small target enhancement (STE) module, and StyleGAN3. GSConv significantly reduces the model volume while maintaining the feature expression ability, achieving a lightweight structure. The STE module enhances the fusion of shallow features and distribution of attention weights, significantly improving the sensitivity to small-sized defects and positioning robustness. StyleGAN3 enhances small samples by addressing inhomogeneity, thereby generating high-quality defect samples; it overcomes the limitations of traditional amplification methods regarding texture authenticity and morphological diversity, systematically improving the model’s generalization ability under sample scarcity conditions. The model achieves 1.6%, 10.7%, 48.63% and 37.97% higher mean average precision values than YOLOv7, YOLOv5, SSD, and Faster R-CNN, respectively; and the model parameter size is reduced by 73.91, 94.69, 122.11, and 154.91 MB, respectively.

 

Comments 2: In the introduction (lines 35–85), there is an excessive review of the literature presented as a listing of works (lines 55–70), without a critical analysis of unresolved limitations. This makes the section more descriptive than analytical. It would be important to synthesize how previous studies fail to deal with robustness in real-world scenarios and how the proposed method overcomes these barriers. Moreover, the practical motivation (impact on railway safety) could be more closely connected to the implications of the proposal.

Response 2: Thank you for the reviewers' comments. We have made revisions and explanations in lines 55-60, 66-72, and 76-81, supplementing the review of the references, analyzing the limitations of each method, and introducing the innovations and optimizations of the proposed method in the text that follows.

Ju, Luo et al. used an improved deep learning network for feature extraction and small object detection; however, the detection speed and accuracy are still limited and cannot meet practical detection accuracy requirements [22]. Zhang, Su et al. [23] embedded an attention module in the traditional network structure, which enhanced the feature extraction ability of the model; however, the method lacks universality and cannot meet lightweight requirements.

 

Zhou and Yue [26] significantly reduced the model size while ensuring detection accuracy, but the YOLOv5 model they used is dated and prone to limitations in practical operation. Zhang, Chen et al. [27] innovatively proposed an early-warning method for railway safety, which detects and extracts the track area based on the YOLOv5 model; however, the experiment lacks real data support and practical verification.

 

The small target enhancement (STE) module was introduced. By fusing feature maps of different scales, convolution and pooling operations are utilized to enhance multi-scale feature expression. An embedded attention mechanism suppresses noise while highlighting the key features of minor defects, thereby significantly improving the model's sensitivity to and discrimination of low-pixel small objects.

 

Comments 3: In the methodology (lines 86–221), although technical details are provided, the section lacks clarity in key aspects of reproducibility. For example, in the description of dataset expansion (lines 96–105), it is not clear how many original images were collected in the field before applying StyleGAN3, nor what quality criteria were used to validate the synthetic images. It is recommended to justify why the total of 2160 images is sufficient for generalization and to discuss possible biases introduced by artificial data generation. Likewise, the configuration of hyperparameters (lines 123–125) is presented in a table but without explanation of how they were chosen or adjusted. A sensitivity analysis would increase the study’s credibility. I recommend: (1) better detailing the process of generating and validating synthetic data, (2) justifying the choice of hyperparameters, (3) including statistical significance tests, (4) expanding the comparison with more recent models, and (5) discussing practical limitations of application in real railway systems.

 

Response 3: Thank you for the reviewers' comments. We have made revisions and explanations in lines 124-133, 165-172, 316-332, and 440-450, with the following key improvements. First, we significantly expanded the dataset to 4,800 images through the StyleGAN3 method, greatly increasing the data size to enhance the model's generalization ability. We also verified the high consistency between the feature distributions of the generated and real defects by computing the quantitative FID score, ensuring the authenticity and reliability of the generated data and laying a solid foundation for model safety. Second, we now explain the basic principles behind hyperparameter choices such as batch size, learning rate, and warm-up period. We have additionally added a section titled "3.2 Detailed Error Analysis and Security Discussion" to the revised draft; through detailed performance analysis by defect type, it clearly quantifies the missed-detection rate of each category and directly addresses the risks that false negatives may bring. Finally, the model in this study was developed under ideal conditions; its robustness against complex interference in real environments has not yet been verified, and its real-time performance and stability on embedded devices have not been measured. These will be key issues in our next research.

To quantitatively evaluate the quality of the images generated by StyleGAN3 and their consistency with the feature distribution in real defect images, the Frechet Inception Distance (FID) was calculated [32]. FID evaluates the generation quality by comparing the feature distribution distances of two types of images in the Inception-v3 feature space; the lower the value, the higher the visual fidelity and diversity of the generated image. The calculated FID value between the generated image and the real image was 15.3. According to relevant research, this FID value is relatively low, indicating that the high-quality samples generated by StyleGAN3 are significantly close in terms of statistical features to the real defect images. This effectively demonstrates the reliability and effectiveness of the data augmentation method adopted in this study.
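For readers who want to reproduce the metric, FID between two feature sets reduces to a closed form over their means and covariances. A NumPy sketch follows (assuming Inception-v3 features have already been extracted; the symmetric-square-root trick keeps all matrix square roots as eigendecompositions of PSD matrices):

```python
import numpy as np

def fid(feats_real, feats_gen):
    # FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2)),
    # using Tr((C_r C_g)^(1/2)) = Tr((C_r^(1/2) C_g C_r^(1/2))^(1/2)),
    # valid because the covariances are symmetric PSD.
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    c_r = np.cov(feats_real, rowvar=False)
    c_g = np.cov(feats_gen, rowvar=False)

    def sqrtm_psd(m):
        # Matrix square root of a symmetric PSD matrix via eigh.
        vals, vecs = np.linalg.eigh(m)
        return vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T

    s_r = sqrtm_psd(c_r)
    covmean_tr = np.trace(sqrtm_psd(s_r @ c_g @ s_r))
    diff = mu_r - mu_g
    return diff @ diff + np.trace(c_r) + np.trace(c_g) - 2 * covmean_tr
```

Identical feature sets give FID ≈ 0, and pure mean shifts contribute exactly the squared distance between the means, which makes the metric easy to sanity-check before applying it to real Inception features.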

 


 

3.2 Detailed Error Analysis and Security Discussion

Although the aforementioned results indicate that the YOLOv7-STE model exhibits excellent overall performance on all indicators, its ability to detect different types of defects may vary. To comprehensively assess the practical application potential of the model and identify its potential weak links, a fine-grained performance decomposition of the detection results for each defect type is presented in Table 6.

The analysis shows that although the overall performance of the model is excellent, its detection ability varies for different types of defects. For large-sized defects such as "Peel" and "Bruise", the model performs exceptionally well with an extremely low false negative rate. However, for the "Pit" defect with a small pixel ratio and indistinct features, its false negative rate is significantly higher than that of other categories. Thus, the model is currently limited in terms of its reliability in detecting extremely small target defects.

Although this study demonstrated high accuracy and FPS in the experimental environment, it remains uncertain whether its computational efficiency and lightweight structure meet the requirements of embedded devices. Future research will focus on enhancing the applicability and practicality of the model in real and complex operating environments. This includes developing lightweight deployment solutions for real-time detection and optimizing the inference efficiency of the model on embedded devices to meet the requirements of online detection under high-speed operating conditions. To address real-world interference factors, such as drastic lighting changes, rain and snow contamination, and motion blur, a more robust, adaptive detection model must be constructed by introducing composite defect categories, studying recognition methods based on few-shot learning, and enhancing the generalization ability of the model.

 

 

 

Comments 4: The results (lines 223–310) are presented clearly, with tables and figures comparing the performance of the proposed model with YOLOv7, YOLOv5, SSD, and Faster R-CNN. However, the comparison is still limited, as more recent variants of models specialized in small targets, such as EfficientDet or optimized versions of YOLOv8, were not included. Furthermore, statistical analysis of the results is absent; it is not clear whether the observed gains (e.g., +1.6% mAP over YOLOv7) are statistically significant or merely training variability.

 

Response 4: Thank you for the reviewers' comments. We have made revisions and explanations in lines 305-315. We added a comparative experiment with YOLOv8 and supplemented a statistical significance analysis of YOLOv8 versus YOLOv7-STE using the paired t-test. The results show that YOLOv7-STE outperforms YOLOv8 in both mAP and F1 score, and the performance improvement is statistically significant.
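The paired t-test referred to here compares per-run metric differences between the two models. A minimal NumPy sketch of the test statistic (the p-value would then come from the Student's t distribution with the returned degrees of freedom, e.g. via `scipy.stats.ttest_rel`):

```python
import numpy as np

def paired_t_statistic(scores_a, scores_b):
    # Paired t-test on per-run metric scores (e.g. mAP over repeated
    # training runs): t = mean(d) / (std(d) / sqrt(n)), df = n - 1,
    # where d is the vector of per-run differences.
    d = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    n = d.size
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    return t, n - 1
```

Pairing by run cancels shared sources of variance (same data split, same seed schedule), which is what lets a small mAP gap be distinguished from training noise.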


 

 

Comments 5: In the conclusions (lines 311–334), the authors acknowledge the limitation of the study in classifying only three types of defects. However, this point could be explored in greater depth by discussing how future research could address more complex defects or multiple environmental conditions. The section also lacks a critical reflection on the real applicability of the model in online real-time inspection systems, including hardware constraints.

Response 5: Thank you for the reviewers' comments. We have made revisions and explanations in lines 440-450, expanding the discussion of real-world environmental conditions, future directions, and a broader range of defect categories, and adding a critical reflection on the hardware constraints of practical applications.

Although this study demonstrated high accuracy and FPS in the experimental environment, it remains uncertain whether its computational efficiency and lightweight structure meet the requirements of embedded devices. Future research will focus on enhancing the applicability and practicality of the model in real and complex operating environments. This includes developing lightweight deployment solutions for real-time detection and optimizing the inference efficiency of the model on embedded devices to meet the requirements of online detection under high-speed operating conditions. To address real-world interference factors, such as drastic lighting changes, rain and snow contamination, and motion blur, a more robust, adaptive detection model must be constructed by introducing composite defect categories, studying recognition methods based on few-shot learning, and enhancing the generalization ability of the model.

Comments 6: Regarding the references, although 2024 studies were included, there is a strong concentration on recent articles applied to YOLO, without a broader dialogue with international railway literature on mechanical defect inspection. In addition, some references come from lower-impact journals, which weakens the academic strength of the foundation. I recommend reviewing these points to bring a more robust basis to the article.

Response 6: Thank you for the reviewers' comments. We have made revisions and explanations in lines 490-558, supplementing relevant references on railway inspection and works from higher-impact venues.

[14] Li Z, Zhu Y, Sui S, et al. Real-time detection and counting of wheat ears based on improved YOLOv7. Computers and Electronics in Agriculture, 2024, 218: 108670.

[15] Hsieh C C, Hsu C H, Huang W H. A two-stage road sign detection and text recognition system based on YOLOv7. Internet of Things, 2024, 27: 101330.

[18] Lin T Y, et al. Microsoft COCO: Common objects in context. European Conference on Computer Vision. Cham: Springer International Publishing, 2014.

[19] Cheng G, Yuan X, Yao X, et al. Towards large-scale small object detection: Survey and benchmarks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(11): 13467-13488.

[22] Ju M, Luo J, Liu G, et al. A real-time small target detection network. Signal, Image and Video Processing, 2021, 15: 1265-1273. https://doi.org/10.1007/s11760-021-01857-x

[23] Zhang M, Su H, Wen J. Classification of flower image based on attention mechanism and multi-loss attention network. Computer Communications, 2021, 179: 307-317.

[26] Zhou Y, Yue X. Lightweight object detection algorithm for automotive fuse boxes based on deep learning. Journal of Electronic Imaging, 2025, 34(1): 013031.

[27] Zhang Z, Chen P, Huang Y, et al. Railway obstacle intrusion warning mechanism integrating YOLO-based detection and risk assessment. Journal of Industrial Information Integration, 2024, 38: 100571.

[32] Wang W, Zhang M, Wu Z, et al. SCGAN: Semi-centralized generative adversarial network for image generation in distributed scenes. Information Fusion, 2024, 112: 102556.

[36] Shengde C, Junyu L, Xiaojie X, et al. Detection and tracking of agricultural spray droplets using GSConv-enhanced YOLOv5s and DeepSORT. Computers and Electronics in Agriculture, 2025, 235: 110353.

[37] Huillcen Baca H A, Palomino Valdivia F L, Gutierrez Caceres J C. Efficient human violence recognition for surveillance in real time. Sensors, 2024, 24(2): 668.

[38] Liu Z, Gu X, Chen J, et al. Automatic recognition of pavement cracks from combined GPR B-scan and C-scan images using multiscale feature fusion deep neural networks. Automation in Construction, 2023, 146: 104698.

 

 

Comments 7: The tables and figures, although informative, do not present deeper comparative analyses. For example, Figure 11 only visually shows the results but does not discuss false positives or false negatives. It would be relevant to include specific error metrics for different types of defects.

Response 7: Thank you for the reviewers' comments. We have made revisions and explanations in lines 316-332. We added a section titled "3.2 Detailed Error Analysis and Security Discussion" to the revised draft; through detailed performance analysis by defect type, it clearly quantifies the missed-detection rate of each category and directly addresses the risks that false negatives may bring. We fully agree that meeting certification compliance and integrating with the existing testing infrastructure should be core tasks for future engineering deployment.

3.2 Detailed Error Analysis and Security Discussion

Although the aforementioned results indicate that the YOLOv7-STE model exhibits excellent overall performance on all indicators, its ability to detect different types of defects may vary. To comprehensively assess the practical application potential of the model and identify its potential weak links, a fine-grained performance decomposition of the detection results for each defect type is presented in Table 6.

The analysis shows that although the overall performance of the model is excellent, its detection ability varies for different types of defects. For large-sized defects such as "Peel" and "Bruise", the model performs exceptionally well with an extremely low false negative rate. However, for the "Pit" defect with a small pixel ratio and indistinct features, its false negative rate is significantly higher than that of other categories. Thus, the model is currently limited in terms of its reliability in detecting extremely small target defects.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The new version of the proposed paper has been significantly improved from the first version. In this second version, the concerns and questions of the first manuscript are addressed, and the new article submitted was well done with the latest information.

Reviewer 4 Report

Comments and Suggestions for Authors

Thank you for answering all my suggestions satisfactorily.