Article
Peer-Review Record

Hamstring Strain Injury Risk in Soccer: An Exploratory, Hypothesis-Generating Prediction Model

by Afxentios Kekelekis 1,2,*, Rabiu Muazu Musa 3, Pantelis T. Nikolaidis 4, Filipe Manuel Clemente 5 and Eleftherios Kellis 1
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 17 September 2025 / Revised: 10 October 2025 / Accepted: 27 October 2025 / Published: 4 November 2025

Round 1

Reviewer 1 Report (Previous Reviewer 2)

Comments and Suggestions for Authors

The authors have made substantial improvements to the manuscript, and most major concerns from Round 1 have been satisfactorily addressed. The paper now provides a transparent and methodologically cautious account of a hypothesis-generating model for hamstring injury risk. I recommend acceptance after minor revisions. While the limitations are acknowledged, the very low events-per-variable ratio (≈2.1) remains a critical concern, and it would be helpful to emphasize more strongly that predictive performance metrics such as AUC and calibration are highly uncertain under these conditions. The issue of generalizability also deserves greater clarity: because the sample is restricted to young male amateur players, the clinical relevance section would benefit from a more explicit contrast with elite settings and a clear statement that results cannot be extrapolated to professional or female cohorts. Furthermore, the analysis is limited to isometric strength variables; it would strengthen the conclusion to highlight that eccentric strength, dynamic neuromuscular assessments, and workload-related measures are likely to provide more accurate real-world predictors in future studies. Figures such as the calibration and permutation importance plots could be made more reader-friendly with concise captions that highlight the key take-home messages. Finally, although the exploratory nature of the study is well framed, the clinical relevance section could be improved by briefly outlining how these findings may inform the design of future research or screening protocols, while reinforcing that the results are not yet suitable for direct clinical application.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report (Previous Reviewer 1)

Comments and Suggestions for Authors

Thank you for addressing my comments; you have made substantial and constructive revisions. The manuscript is now suitable for publication following a few very minor adjustments:

  • The sample size calculation via G*Power is still presented; this section is somewhat redundant given the stronger emphasis on EPV and optimism. A brief clarification or condensation could help avoid confusion.

  • Terminology for “logistic regression with elastic-net penalty” vs. “logistic regression” could be made consistent throughout, although this is a minor comment.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Thank you for inviting me to review your manuscript. Machine learning is a topic of interest of mine, so I am grateful to be part of the peer-review process.

You report that, despite a small sample, your power calculation criteria were met. While this may suffice for a traditional hypothesis test, I’ve reviewed current trends in prediction modelling, and for machine learning such a calculation is insufficient. Contemporary methodological standards emphasize that it is the number of events per predictor parameter and the anticipated model optimism, not overall sample size, that drive validity. Frameworks like TRIPOD+AI and PROBAST+AI recommend ensuring a shrinkage factor ≥ 0.9 to control overfitting. Please read (Riley et al., 2020; Riley et al., 2021; Collins et al., 2024; Moons et al., 2025). It may also be helpful to review systematic assessments of bias in prediction modelling (e.g. Shiferaw et al., 2024).

Please consider reporting:
• Exact number of hamstring injury events and count of candidate predictors after preprocessing.
• Events per predictor parameter ratio.
• Whether tuning or feature selection was carried out using strictly nested resampling to avoid data leakage, and how optimism was addressed in calibration and discrimination.
• A PROBAST+AI risk of bias assessment, which is now recommended for this type of work.

You collected all data only at the start of the season. Injury risk is dynamic, and the literature consistently shows that baseline-only measures have limited predictive power (Bahr, 2016; Toohey et al., 2017). A static snapshot fails to capture evolving exposures, training loads, or fatigue. This limitation should be acknowledged and should temper your conclusions.

Your finding that hip abductor weakness is the strongest predictor of hamstring injury is intriguing but conflicts with predominant evidence, which supports previous hamstring injury, older age, and eccentric hamstring weakness as the most consistent predictors. Where abductors are implicated, evidence is mixed. The absence of prior injury and hamstring strength as predictors in your model suggests instability from small event numbers, variable misclassification, or methodological artifacts like measurement error or over-adjustment.

I would therefore recommend:
• Reporting univariable and multivariable effect estimates for previous injury, age, hamstring strength, and hip abduction strength, with confidence intervals.
• Conducting sensitivity analyses using broader look-back windows for previous injury.
• Evaluating the stability of the abductor signal with bootstrap resampling or permutation importance. By stability, I mean testing whether hip abductor weakness consistently ranks as important when the model is re-run on many resampled versions of your dataset (bootstrap), or whether shuffling its values causes a meaningful drop in performance (permutation). If its importance fluctuates widely or performance barely changes, this would indicate the finding is unstable and should be treated as preliminary.
• Where possible, testing the model on an external dataset or at least a temporally separated cohort.
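
The permutation check described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the data, the `toy_model` rule, and all function names are hypothetical, and accuracy stands in for whichever performance metric the study reports.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) equals the label."""
    hits = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return hits / len(labels)


def permutation_importance(predict, rows, labels, column, repeats=200, seed=0):
    """Mean drop in accuracy after shuffling one column across rows."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        permuted = [
            row[:column] + [value] + row[column + 1:]
            for row, value in zip(rows, shuffled)
        ]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / repeats


# Hypothetical toy data: column 0 drives the label, column 1 is noise.
rows = [[0.2, 5.0], [0.9, 1.0], [0.1, 3.0], [0.8, 4.0], [0.3, 2.0], [0.7, 6.0]]
labels = [1, 0, 1, 0, 1, 0]


def toy_model(row):
    # Hypothetical decision rule standing in for a fitted classifier.
    return 1 if row[0] < 0.5 else 0


drop_signal = permutation_importance(toy_model, rows, labels, column=0)
drop_noise = permutation_importance(toy_model, rows, labels, column=1)
```

A predictor whose shuffling barely changes performance (here `drop_noise`) carries little information for the model, whereas a clear drop (here `drop_signal`) indicates the model relies on it. Repeating the same check across bootstrap resamples would show whether that reliance is stable.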

Your results open with “21 players sustained 32 hamstring injuries,” yet the modelling treats the outcome at the player level as injured versus not injured. This collapses recurrent injuries and discards event timing, which diminishes the value of established predictors such as prior injury and age. The subsequent split is consistent with a player outcome (training 16 injured, 67 not injured; test 5 injured, 32 not injured), but the text and figures then mix “injuries” with “injured players,” creating contradictions. In Figure 2b, the test set confusion matrix shows 32 players who were not injured (21 correctly classified as not injured and 11 incorrectly classified as injured). However, the text refers to 21 uninjured players, which is inconsistent. It appears you have equated the number of true negatives (21) with the total number of uninjured players, but the false positives (11) also belong to the “not injured” class. This misinterpretation leads to errors in your reported sample composition and undermines the accuracy of subsequent performance claims. I recommend revisiting the counts and ensuring that class totals reflect both correct and incorrect classifications. Either reanalyse using a time-to-event framework that accommodates recurrent events or state that this is a player-level, single time point, exploratory model and temper claims accordingly.
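
The counting point in the comment above can be checked with a short sketch. It uses only the two test-set counts quoted there (21 true negatives, 11 false positives); the function names are illustrative.

```python
def negative_class_total(true_negatives, false_positives):
    """All truly uninjured players, however the model classified them."""
    return true_negatives + false_positives


def specificity(true_negatives, false_positives):
    """Share of uninjured players correctly classified as uninjured."""
    return true_negatives / (true_negatives + false_positives)


tn, fp = 21, 11  # test-set counts quoted in the comment above
not_injured = negative_class_total(tn, fp)  # 32 players, not 21
spec = specificity(tn, fp)                  # 21 / 32 = 0.65625
```

The 21 true negatives are therefore a subset of the 32 uninjured players in the test set, and the implied specificity is roughly 0.66.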

Lines 122-125 Injury mechanism was self-reported, with no GPS or video confirmation, and medical diagnosis details were incomplete for some cases. While you acknowledge this, the discussion still makes mechanistic interpretations that your data cannot support.

Lines 177-180 Please provide a reference for the HHD ICC values.

Absolute strength values and multiple side-to-side or cross-muscle ratios are included together without variance inflation checks or a principled feature reduction strategy. This creates redundancy, high collinearity, and unstable model estimates, especially with a small sample. Collinearity is well known to distort regression coefficients, inflate variance, and exaggerate predictor importance, which in turn undermines the stability of machine learning models when events per variable are limited. Best practice would be to screen for collinearity (e.g., using variance inflation factors), reduce dimensionality through penalisation or variable clustering, or justify inclusion based on prior evidence rather than entering multiple highly correlated variables simultaneously.

For further guidance, see Babyak (2004), who explains how overfitting and redundant predictors destabilise regression-type models; Steyerberg (2019), whose text on clinical prediction models outlines recommended approaches for handling collinearity and redundancy; Dormann et al. (2013), which reviews practical strategies for identifying and addressing collinearity; and Riley et al. (2020), who highlight the interaction between events-per-variable ratios, redundant predictors, and model overfitting.
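
As a rough illustration of the collinearity screen suggested above, the sketch below computes a variance inflation factor for the simplest case of two predictors, where VIF = 1 / (1 - r²). The strength values are hypothetical, not study data, and a full screen across all candidate predictors would regress each predictor on all of the others.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x, y):
    """VIF when the model contains exactly these two predictors."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Hypothetical dominant and non-dominant side strength values (arbitrary units).
dominant = [2.1, 2.4, 1.9, 2.8, 2.5, 2.2, 3.0, 2.6]
non_dominant = [2.0, 2.5, 1.8, 2.7, 2.4, 2.3, 2.9, 2.5]

r = pearson_r(dominant, non_dominant)
vif = vif_two_predictors(dominant, non_dominant)
```

Values of this kind, where two sides of the same strength test move almost together, produce a VIF far above the commonly quoted rule-of-thumb thresholds of 5 or 10, which is the redundancy the comment warns about.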

Line 239 You report 32 injuries in 21 players, with 28% reinjuries, yet model a binary outcome of “injured vs not injured.” This collapses multiple events into one and does not distinguish between index and recurrent injuries, limiting the value of predictors like previous injury.

Lines 203-208 You state that pre-processing, including min-max normalisation, was performed “prior to the full analysis.” This implies scaling and variable ranking were carried out on the full dataset before splitting. If so, this introduces data leakage, inflates performance, and undermines validity. All pre-processing should be nested within the training fold only. Please clarify, and if necessary, re-run the analysis.
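
A minimal sketch of leakage-free min-max normalisation, as requested above: the scaling parameters are learned from the training values only and then applied unchanged to the held-out values. The numbers and function names are hypothetical, and in a cross-validated analysis the same fit-then-apply step would be repeated inside every fold.

```python
def fit_min_max(train_values):
    """Learn the scaling range from the training data only."""
    return min(train_values), max(train_values)


def apply_min_max(values, low, high):
    """Scale values with a previously learned range; may fall outside [0, 1]."""
    span = high - low
    return [(v - low) / span if span else 0.0 for v in values]


train = [10.0, 12.0, 15.0, 20.0]  # hypothetical training-fold values
test = [8.0, 18.0, 22.0]          # hypothetical held-out values

low, high = fit_min_max(train)
train_scaled = apply_min_max(train, low, high)
test_scaled = apply_min_max(test, low, high)
```

Held-out values can legitimately land below 0 or above 1 under this scheme. If every scaled value in the test set sits inside [0, 1], that is often a sign the range was computed on the full dataset before splitting.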

Lines 263-275 You present mean AUC and accuracy from cross-validation, then separate results from a 70:30 train/test split. The claim that the model “correctly identified more than 90% of positive cases” refers to precision, not sensitivity/recall, and is therefore misleading.
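
The distinction between precision and sensitivity drawn above can be made concrete with a short sketch. The counts are hypothetical and chosen only to show how far the two metrics can diverge; they are not taken from the manuscript.

```python
def precision(true_positives, false_positives):
    """Of the players flagged as injured, the share who were injured."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Of the players who were injured, the share the model flagged."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical counts, for illustration only.
tp, fp, fn = 9, 1, 11

prec = precision(tp, fp)  # 0.9
sens = recall(tp, fn)     # 0.45
```

With these counts, 90% of positive predictions are correct, yet fewer than half of the actual positive cases are identified. "Correctly identified more than 90% of positive cases" describes the second quantity, not the first.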

Lines 303-304 The term “Negelkerke” is misspelled. AUC values differ between text, tables, and figures, and metric terminology is inconsistent. These errors undermine clarity and quality control.

With 21 injured players and around 10 predictors, events-per-variable (EPV) is ~2, far below accepted thresholds. Without penalisation or shrinkage, overfitting is almost certain. Please see Riley et al. (2020): although they caution against rigid cut-offs, EPV values this low almost invariably lead to overfitting unless penalisation or shrinkage is applied.
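
The EPV figure quoted above follows from a one-line calculation. In this sketch the function name is illustrative and the value 10 reflects the "around 10 predictors" stated in the comment, so the result is approximate.

```python
def events_per_variable(n_events, n_predictor_parameters):
    """Outcome events divided by candidate predictor parameters."""
    if n_predictor_parameters <= 0:
        raise ValueError("need at least one predictor parameter")
    return n_events / n_predictor_parameters


# 21 injured players and roughly 10 candidate predictors, as stated above.
epv = events_per_variable(21, 10)  # 2.1
```

Note that the denominator counts candidate predictor parameters considered during model building, not only those retained in the final model.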

Line 308 and Table 2 For dominant-side hip abductor strength, you report OR = 0.818, p = .016, CI = 0.695-0.964, but also CI = 0.695-1.038 in the same paragraph. A CI crossing 1 cannot yield p = .016, suggesting a reporting or analysis error.
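
The inconsistency noted above can be checked by back-calculating from the reported values. This sketch assumes a Wald-type 95% confidence interval that is symmetric on the log-odds scale, which is an assumption about how the interval was computed; the function names are illustrative.

```python
import math

Z_975 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wald_p_from_ci(odds_ratio, lower, upper):
    """Two-sided p-value implied by an OR and its 95% Wald interval."""
    standard_error = (math.log(upper) - math.log(lower)) / (2 * Z_975)
    z = math.log(odds_ratio) / standard_error
    return math.erfc(abs(z) / math.sqrt(2))


def ci_centre(lower, upper):
    """Geometric midpoint: the OR a symmetric log-scale interval implies."""
    return math.sqrt(lower * upper)


p_first = wald_p_from_ci(0.818, 0.695, 0.964)
centre_first = ci_centre(0.695, 0.964)
centre_second = ci_centre(0.695, 1.038)
```

Under that assumption, the first interval (0.695-0.964) is centred on the reported OR of 0.818 and reproduces a p-value of about 0.016. The second interval (0.695-1.038) is centred near 0.85, so it does not belong to the same estimate and most likely reflects a transcription or analysis error.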

When discussing your findings, you compare them with Ayala et al. (2019), who also developed a preseason hamstring injury prediction model. While their AUC (0.837) and sensitivity/specificity appear strong, their study shares the same weaknesses as yours: small sample, single preseason measure, no external validation. Their use of resampling and cost-sensitive learning to address class imbalance is worth noting, but these techniques do not overcome fundamental issues of overfitting, lack of calibration, and uncertain predictor stability. Drawing parallels to methodologically weak studies does not strengthen your results; instead it highlights a lack of critical appraisal.

In addition, you compare your current hamstring injury model (AUC 0.79 with logistic regression) to a previous k-NN model for predicting groin injuries (AUC 0.4228) and conclude this demonstrates the superiority of logistic regression. This comparison is not valid. Different outcomes, epidemiology, and predictors mean the AUC values are not comparable. Any performance difference could be driven by outcome type or predictor set, not algorithm. Framing the logistic regression model as “impressive” is premature without optimism-adjusted calibration or external validation, especially with such limited data. This reflects a broader issue: positive findings are highlighted, but limitations are not scrutinised with equal weight.

Lines 450-452 You describe ML-based models as potentially “representing a methodological gold standard,” despite small sample size, single baseline measure, lack of calibration, and no external validation. This is overstated.

Lastly, the current title overstates the certainty and predictive claim of your findings. Given the limitations, I suggest a more cautious framing. Two possible alternatives:
• “Exploratory analysis of baseline strength measures and hamstring injury incidence in professional footballers using machine learning”
• “Baseline hip abductor strength as a candidate predictor of hamstring injury: an exploratory machine learning analysis”

These titles better align the claims with the evidence, avoid implying causation or definitive prediction, and still retain reader interest.

As presented, the study supports model development with internal validation only. The small number of events, single baseline time point, potential data leakage in preprocessing, and lack of external validation limit the strength and generalisability of the findings. Key predictors are inconsistent with the established literature, and the modelling decisions introduce instability and risk of overfitting. In other words, rather than trying to explain why your result may be a novel finding, please consider that it may instead be the product of methodological flaws.

 

While the manuscript addresses an important question and is timely given the interest in machine learning in sports medicine, the methodological and reporting issues are too substantial for this version to be publishable. My recommendation is to reject in its current form but encourage resubmission after major revision. If the editor simply wants to allow major revisions then please consider all the points of the document before you re-submit. A resubmitted version should include: clarification and correction of outcome definitions and counts, reanalysis with preprocessing and feature selection nested within resampling, stronger transparency around events per variable and model optimism, and external or at least temporally separated validation. With these steps, the work could make a meaningful contribution, but as it stands the conclusions are overstated relative to the evidence.

 

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript entitled “Isometric Hip Strength as a Predictor of Hamstring Injury Risk in Soccer: A Machine Learning–Based Analysis” addresses an important and timely topic in sports science and rehabilitation. The integration of muscle strength assessment with machine learning approaches to predict injury risk represents an innovative direction that aligns well with the scope of the Special Issue.

Comment 1:

The abstract could be improved by balancing detail and readability. For example, while the results are accurately reported, the clinical and practical implications should be more explicitly emphasized. The current abstract gives the impression of a strong predictive model, but the moderate sensitivity and limitations of the machine learning approach deserve more weight in order to avoid overstating the applicability of the findings.

Comment 2:

The introduction is somewhat lengthy and repetitive in places, particularly concerning the importance of lumbo-pelvic stability and the role of hip abductors. Streamlining these paragraphs would improve readability. In addition, the timeliness of the study could be enhanced by incorporating more recent references from 2023–2024 on artificial intelligence and sports injury prediction, which would place the work in a more updated scientific context.

Comment 3:

In the methods section, the exclusion of hip extensor strength from the testing protocol should be more thoroughly justified, as these muscles are highly relevant to hamstring mechanics and injury risk. The handling of missing data, particularly incomplete injury histories, is not sufficiently described and should be clarified to ensure transparency. While the supplementary materials provide valuable details, some figures showing testing positions should be incorporated directly into the main text to improve accessibility for readers.

Comment 4:

In the results, the presentation could be simplified for clarity. Figures such as the validation curves and confusion matrices are technically accurate but might be difficult for non-technical readers to interpret; clearer legends and explanatory notes would make them more accessible. Moreover, while the study highlights hip abductor strength as a significant predictor, the clinical implications of this finding should be more explicitly connected to the moderate predictive sensitivity of the model, which currently limits its stand-alone utility in applied settings.

Comment 5:

In the discussion, there is noticeable repetition in the emphasis placed on the role of the hip abductors, which could be streamlined for conciseness. More importantly, the discussion would benefit from a stronger focus on practical applications, such as how the findings could be integrated into real-world pre-season screening protocols or strength training interventions. Furthermore, the limitations of the machine learning approach, especially its relatively low sensitivity in detecting true injury cases, should be acknowledged more explicitly, as this constrains the immediate clinical applicability of the model.

Comment 6:

The limitations section is addressed, but it should be emphasized more strongly. In particular, the relatively small sample size, reliance on self-reported injury mechanisms without objective verification, and the omission of hip extensor testing are important weaknesses that should be highlighted with greater transparency. The lack of external load monitoring and incomplete injury histories also limit the strength of the conclusions. These points should be given more prominence to balance the claims of the study.

Comment 7:

The conclusion is coherent and aligns with the results. However, it is somewhat assertive in suggesting the predictive value of machine learning as a gold standard. Given the modest predictive accuracy and sensitivity observed, a more cautious formulation would be appropriate. The conclusion should acknowledge the promise of machine learning while also stressing the need for larger datasets, more comprehensive variable inclusion, and validation in different cohorts before clinical translation can be realized.

Comments for author File: Comments.pdf
