Novel Study for the Early Identification of Injury Risks in Athletes Using Machine Learning Techniques

Abstract: This innovative study addresses the prevalent issue of sports injuries, particularly focusing on ankle injuries, utilizing advanced analytical tools such as artificial intelligence (AI) and machine learning (ML). Employing a logistic regression model, the research achieves an accuracy of 90.0%, providing a robust predictive tool for identifying and classifying athletes with injuries. The comprehensive evaluation of performance metrics, including recall, precision, and F1-Score, emphasizes the model’s reliability. Key determinants like practicing sports with injury risk and kinesiophobia reveal significant associations, offering vital insights for early risk detection and personalized preventive strategies. The study’s contribution extends beyond predictive modeling, incorporating a predictive factors analysis that sheds light on the nuanced relationships between various predictors and the occurrence of injuries. In essence, this research not only advances our understanding of sports injuries but also presents a potent tool with practical implications for injury prevention in athletes, bridging the gap between data-driven insights and actionable strategies.


Introduction
Sports injuries, such as ankle injuries, are a common and recurring problem for many athletes and are the leading cause of sports-related disability worldwide, accounting for 31% of the injuries in football players and 45% in basketball players [1]. These injuries can be debilitating and require a significant recovery time, taking athletes out of the game for 3 or more weeks [2]. Among the most common ankle injuries are sprains, which cause chronic symptoms such as pain, swelling, and instability in 40% of individuals; these symptoms impair motor control of the ankle joint and can lead to functional disability [3].
Age is a factor that can impact an athlete, given that the risk of injury increases with age. This is because the body's tissues begin to age after 35, leading to a loss of elasticity and slower tissue repair [4].
Gender has a minimal impact on the risk of injury as the human body has an equal likelihood of getting injured regardless of gender [5].
The number of hours and days of training has a low impact on the incidence of injuries since the time dedicated to training contributes to muscle strengthening and memory [6].
Appl. Sci. 2024, 14, 570
Previous injuries are a variable that may yield a low result, as athletes with injuries undergo active recovery treatment, play with reduced intensity, and manage play time to prevent fatigue [7].
Corrective treatment is a variable with a small possibility of influencing injuries, as athletes either have undergone or are undergoing specific treatment for their injuries. Athletes who are injured tend to play with less intensity and in positions with lower physical demands. However, athletes with preventive treatment may have muscles and tissues that have been prepared for physical demands, potentially reducing the incidence of injuries [8,9].
Hydration is also an important factor in the possibility of getting injured, because it is involved in metabolism, nutrient transport, blood circulation, and body temperature regulation [10]. In terms of sports, hypohydration plays an important part in the weight loss of athletes, because plasma, blood flow, blood volume, cardiovascular functions, and thermoregulatory capacity are all affected [10,11]. During training and games, athletes lose body fluids, mainly through sweating, and the amount lost depends on the exercise intensity and the weight of the individual.
Kinesiophobia, known as the fear of performing a movement or an activity, plays a role in functional ankle instability. This fear induces negative thoughts, leading to mechanical alterations that impede proper joint functioning. Consequently, the repercussions for the ankle are a loss of strength, poorer postural balance, and altered proprioception associated with musculoskeletal disorders [3,12–16].
Fatigue in athletes can have a significant impact on their performance and increase the risk of injuries [17]. Fatigue can be the result of a variety of factors, including physical exertion, lack of sleep, poor nutrition, and mental stress [7,18]. Athletes who experience fatigue may have an increased risk of injuries due to several factors: alterations in coordination, reduction in performance capacity [19,20], and increased vulnerability to infection, also known as chronic fatigue [21,22].
Hydration is a factor that can have a high impact because muscles not adequately hydrated are prone to fatigue and injuries [23].
Artificial intelligence (AI) and machine learning (ML) are advancing in numerous fields, including medicine [24–27]. One of the most promising applications of AI in this field is the prediction and prevention of sports injuries.
ML is a subset of AI and is divided into supervised and unsupervised learning. In supervised learning, a model is trained on a labeled dataset in which each input is associated with a correct output [28–30]. The model learns from these data and is then ready to predict the output for new data. In health sciences, supervised learning can be useful for predicting diseases based on certain symptoms or risk factors [31,32]. See Figure 1.

Regression
Regression is a statistical technique used to identify the mathematical behavior of an unknown model. Its fundamental purpose is to identify a mathematical formula that clarifies the correlation between the dependent and independent variables and projects the value of the dependent variable from the specific values assumed by the independent variables [33,34]. This procedure seeks to provide a deeper understanding of the underlying relationship, thus allowing for more accurate predictions about the dependent variable based on observations of the independent variables [35].

Logistic Regression Model
Logistic regression is a statistical method used to predict the likelihood of a dependent variable being in a certain category, based on one or more independent variables. For example, logistic regression can be used to estimate the probability of a patient having a disease, based on their symptoms, age, sex, etc. Logistic regression is based on the logistic function, which transforms input values into a range between 0 and 1 that is interpreted as the probability of belonging to the positive category. Logistic regression is applied to binary classification problems (where the dependent variable has only two possible values) or to multiclass problems (where the dependent variable has more than two possible values) [36–39].
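The mapping from predictors to a probability described above can be illustrated with a minimal Python sketch. The study itself used MATLAB; the function names, the coefficients, and the example athlete below are hypothetical and serve only to show how the logistic function turns a linear score into a value between 0 and 1.

```python
import math

def logistic(z: float) -> float:
    """Map any real-valued score z to a probability in [0, 1]."""
    # Two branches avoid overflow in math.exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_probability(features, weights, intercept):
    """Linear score (intercept + sum of weight * feature) passed through the logistic function."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return logistic(z)

# Hypothetical coefficients and one hypothetical athlete (not values from the study).
weights = [0.8, -0.5]
intercept = -0.2
p = predict_probability([1.0, 2.0], weights, intercept)

# A 0.5 cut-off is a common default for turning the probability into a class label.
label = 1 if p >= 0.5 else 0
print(f"probability = {p:.4f}, predicted class = {label}")
```

In a binary problem the probability is compared against a threshold to obtain the class; the threshold is a modelling choice rather than part of the logistic function itself.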
The University of Valle de Mexico is home to accomplished Olympic athletes, professionals, and high-performance individuals. Annually, a massive event known as "Interlinces" takes place, featuring various sports such as soccer, tennis, American football, touch football, basketball, swimming, animation, taekwondo, gymnastics, and volleyball, and it involved a total of 3500 athletes in 2022 [40]. In the present work, a logistic regression model played a key role in the analysis and prediction of injuries in athletes. We surveyed 500 athletes and took into consideration the different parameters of each individual, as presented in Section 2.
These parameters influence injury prediction, that is, whether an individual is classified as injured or uninjured.
In this article, our contributions are:
• A Cutting-Edge Predictive Model: We developed a logistic regression model with an accuracy of 90.0%, standing out as a leading tool in predicting sports injuries.
• Identification of Determining Factors: We revealed significant associations, such as practicing sports with a risk of injury and kinesiophobia, providing crucial insights for early risk detection and personalized preventive strategies.
• A Comprehensive Performance Evaluation: We conducted a thorough analysis of various machine learning models, highlighting the versatility of the logistic regression model and supporting its practical utility and reliability in medical and sports environments.
• Detailed Performance Metrics: Beyond high accuracy, we provided a detailed analysis with metrics such as recall and precision, offering a comprehensive evaluation of the model's performance in crucial situations of accurate injury detection in athletes.

Materials and Methods
In this research, a sample of 400 athletes who participated in the comprehensive survey was analyzed; it is presented in Table 1. It is stated that 50 independent data points were employed. In the survey database, we collected information on age, gender, the number of hours dedicated to daily training, the frequency of weekly workouts, the history of previous ankle injuries, the medical treatment received after an injury, participation in sports practice despite injuries, the levels of stress experienced during sports activity, the presence of kinesiophobia, the fatigue experienced, the daily hydration average, and the amount of hydration during specific sports events.

Data Collection and Analysis Tools
The data obtained from the survey were refined into a database, providing a comprehensive view of each individual's health and sports practices while ensuring each individual's privacy. The outcome variable was classified into two categories, injured and uninjured, across the 400 athletes. MATLAB R2023a was used as the main tool to carry out the analysis. MATLAB was chosen for its robust capability for efficient data manipulation and the application of machine learning techniques such as Fine Tree, Linear Discriminant, Binary GLM Logistic Regression, Gaussian Naive Bayes, Linear SVM, Fine KNN, SVM Kernel, Boosted Trees, and Logistic Regression.

Collection Dataset
The system was purposefully designed for data acquisition and comprehension, facilitating the examination of values, patterns, and trends that could contribute to ankle injuries in athletes. This functionality enhances the ability to predict and evaluate outcomes. A detailed description of the dataset is presented in Table 2. The data set related to the injuries consists of 357 uninjured individuals and 43 injured individuals (see Table 3). It is stated that 50 independent data points were employed.

Variable | Value | Type
… | Numerical value related to hydration on the day of the event. | Numeric
Outcome | Binary variable indicating whether the athlete has experienced an injury. | Binary

Data Preprocessing
Data preprocessing is a crucial phase in data analysis and modeling and plays a key role in the quality and effectiveness of the results obtained. In this research on injuries to athletes, preprocessing addressed several tasks to ensure that the data were accurate, reliable, and ready for analysis before the machine learning algorithms were applied.
To address the challenge posed by class imbalance in our dataset, specific techniques were implemented during the model training process. Class weighting strategies were employed to assign greater importance to the minority class, and experimentation was conducted with resampling methods, including the application of the synthetic minority over-sampling technique (SMOTE) [41,42]. This approach was implemented to alleviate the impact of class imbalance on logistic regression. Furthermore, performance metrics such as precision, recall, and the F1-Score were assessed to comprehensively capture the model's effectiveness in detecting injuries in athletes.
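The idea behind class weighting can be sketched as follows. The text does not state which weighting formula was used, so this Python example assumes the common "balanced" heuristic, in which each class receives a weight of n_samples / (n_classes × n_class_samples). The function name is hypothetical, and the class counts are the 357 uninjured and 43 injured athletes reported for the dataset.

```python
def balanced_class_weights(counts):
    """Weight each class inversely to its frequency (assumed 'balanced' heuristic)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

# Class counts as reported for the dataset.
counts = {"uninjured": 357, "injured": 43}
weights = balanced_class_weights(counts)

for label, w in weights.items():
    # The weighted totals of both classes end up equal (200 each here).
    print(f"{label}: weight = {w:.4f}, weighted total = {w * counts[label]:.1f}")
```

Under this heuristic each injured athlete contributes roughly eight times as much to the training loss as an uninjured one, which is what "assigning greater importance to the minority class" amounts to in practice.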
The initial database comprised 500 surveyed athletes. The database underwent a cleaning process to ensure the integrity and consistency of the data. A total of 50 outliers, duplicates, and inconsistent records were identified and addressed. Data cleaning is essential to avoid biases and errors in the subsequent analysis.

Results of Data Training and Discussion
The architecture of the injury prediction model includes four different modules:

Logistic Regression Modeling
An exhaustive evaluation of various models was performed, and the accuracy achieved by each model is presented in Table 4. The features were age, gender, hours of training, days of training, previous injuries, corrective treatment, sport with injury, preventive treatment, stress, kinesiophobia, fatigue, previous warmup, and average hydration on event day, together with the outcome. The logistic regression model achieved the highest accuracy, 90.0%, which reflects a strong ability to classify athletes by the presence or absence of injuries. The SVM Kernel, Linear Discriminant, and Binary GLM Logistic Regression models also showed high accuracy, reaching 89.2%, 89.0%, and 89.0%, respectively. In contrast, the models with the lowest accuracy were the Fine Tree and the Gaussian Naive Bayes, with 83.2% and 84.8%, respectively. The application of logistic regression, known for its reliability, is crucial to achieving the desired results, and these findings have significant implications for identifying and preventing injuries in athletes.
The confusion matrix for the logistic regression model is given in Table 5. It reveals that the model correctly predicted 6 cases without injuries and 354 cases with injuries. However, there were 3 cases where the model incorrectly predicted no injury when there was one (false negatives) and 37 cases where it incorrectly predicted an injury when there was none (false positives). The accuracy obtained from Equation (2) is 0.90, which indicates that the model correctly classified 90.0% of the cases, both negative and positive. This result is of great relevance as it suggests a robust ability of the model to discriminate between athletes with and without injuries.

Recall
Recall focuses on the ability of the model to correctly identify the positive cases. It is calculated using Equation (3), Recall = TP / (TP + FN), where TP is the number of true positives and FN is the number of false negatives.
As shown in Table 6, the high accuracy suggests that the model generally performs well in classifying cases. However, upon examining the recall, we observe that while it is high, there are some instances of false negatives (FNs). This means there are situations where the model did not correctly identify the presence of injuries, although their number remains within an acceptable range in the context of athletes' health. The recall is approximately 0.9916, meaning the model correctly identified 99.16% of injury cases among all actual injury cases in athletes. This metric is crucial in contexts where identifying positive cases is of particular importance, such as in preventing injuries to athletes. The precision, with a value of approximately 0.9054, indicates that the model has a fairly high ability to correctly classify positive cases (athletes with injuries). In other words, when the model predicts that an athlete has an injury, there is a 90.54% chance that they have an injury. The F1-Score is approximately 0.9447. This score is relatively high and suggests a reasonable balance between precision and recall.
It is important to note that although the accuracy is high, a detailed analysis of other metrics like recall and precision provides a more complete view of the model's performance, especially in situations where identifying positive cases is critical. That is why it is essential to consider these metrics together to obtain a comprehensive evaluation of the model's performance. In medical and sports applications, where accurate identification of injuries is crucial, this detailed analysis allows for informed decisions about the practical utility of the model in the specific research domain.
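The metrics discussed above all follow from the four cells of the confusion matrix, as the short Python sketch below shows. The function name is hypothetical. The counts (TP = 354, TN = 6, FP = 37, FN = 3) are an assumption: they are the assignment of the reported cells that reproduces the stated accuracy, recall, and precision. With these counts the F1-Score evaluates to about 0.9465, slightly above the 0.9447 quoted in the text.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard accuracy, recall, precision, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)        # share of actual positives that were found
    precision = tp / (tp + fp)     # share of predicted positives that were correct
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}

# Assumed cell assignment consistent with the reported recall and precision.
metrics = classification_metrics(tp=354, tn=6, fp=37, fn=3)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the classes are imbalanced, accuracy alone says little about the smaller class, which is why recall, precision, and F1 are reported alongside it.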
It is worth mentioning that 50 independent athletes were considered as a training sample. This supports the claim that the model works correctly for this study.

Receiver Operating Characteristics (ROC) and Area under the Curve (AUC)
The receiver operating characteristic (ROC) curve is a graphical representation of the relationship between the true positive rate (TPR, or recall) and the false positive rate (FPR) for different classification thresholds. The area under the ROC curve (AUC) quantifies the model's ability to distinguish between classes. An AUC of 79.15% indicates reasonable performance.
As shown in Figure 2, the decision-making threshold selected is 0.99804. This threshold influences how the model classifies instances, with an emphasis on precision.
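How an ROC curve and its AUC are obtained from predicted probabilities can be sketched in a few lines of Python. The labels and scores below are hypothetical toy data, not the study's predictions, and the function names are illustrative. The sketch sweeps the threshold over the observed scores, records one (FPR, TPR) point per threshold, and integrates the curve with the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the threshold over the scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical toy data: 1 = positive class, scores = predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.10]

points = roc_points(labels, scores)
area = auc(points)
print(f"AUC = {area:.4f}")

# A very strict threshold, such as the 0.99804 reported above, labels a case
# positive only when the model is almost certain.
predicted = [1 if s >= 0.99804 else 0 for s in scores]
print(predicted)
```

On this toy data no score reaches 0.99804, so nothing is classified as positive, which illustrates how strongly the choice of threshold shapes the resulting predictions.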

Predictive Factors Analysis
In this analysis, a generalized linear regression model with binomial distribution was applied to evaluate the relationship between an outcome variable and 14 potential predictors. Table 7 shows the estimated coefficients, which provide key information about the strength and direction of these associations, while additional measures such as the p-value, chi-squared statistic, and dispersion offer insights into the model's goodness of fit.
Athletes who practice their sport with an active injury have a low probability of getting injured because coaches decide to allow them to play for a shorter period or in positions with lower physical demands. During the development of academy soccer players (ASPs), specific skills or physical qualities can lead to players being selected for certain playing positions due to variations in the tactical and physiological requirements of those positions. In professional soccer, goalkeepers perform mostly low-intensity actions, unlike outfield players, who exhibit more running, ball possession, and high-intensity activity. However, the distance covered and the frequency of game actions within the match among outfield positions may contribute to the different physical demands experienced by outfield ASPs [43]. Consistent with this, the variable "Sport with injury" presents a coefficient of −1.0194 and a p-value of 0.03202, indicating a significant negative association. Athletes in the "Sport with injury" group may therefore have a lower probability of the outcome.
On the other hand, "Kinesiophobia" shows a significant positive association (coefficient = 0.58079, p-value = 0.00056105), suggesting that kinesiophobia is positively related to the outcome variable.
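One common way to read logistic regression coefficients such as these is to exponentiate them into odds ratios. The Python sketch below does this for the two significant coefficients reported in the text. The conversion is a standard property of the logit link and is offered as an illustration, not as a step the authors report having performed.

```python
import math

# Coefficients as reported in the text for the two significant predictors.
coefficients = {
    "Sport with injury": -1.0194,
    "Kinesiophobia": 0.58079,
}

# exp(beta) is the multiplicative change in the odds of the outcome
# for a one-unit increase in the predictor, holding the others fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio = {ratio:.3f} ({direction} the odds of the outcome)")
```

An odds ratio above 1 corresponds to a positive coefficient and one below 1 to a negative coefficient, matching the directions of association described above.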
In addition, observation of action interventions and game techniques can be effective in improving the rehabilitation outcomes of lower limb injuries. Therefore, their application should be considered along with standard treatment protocols. This makes it possible to apply specific strengthening of the injured muscles and to correct game technique, which reduces athletes' probability of injury [44].

Non-Significant Variables
Several variables, such as "Age", "Training Hours", "Previous Injuries", and others, show no significant association as their p-values are greater than 0.05.These results indicate that these variables may not be determining factors in the outcome.

Overall Model Evaluation
The model as a whole is evaluated by the chi-squared test, whose p-value of 6.15 × 10^-6 indicates that at least one of the predictor variables has a significant effect on the outcome variable. A dispersion of 1 suggests that the model fits the binomial distribution adequately.

Conclusions
This novel study with original data provides an effective tool for predicting injuries in athletes and highlights the importance of considering detailed metrics for a comprehensive evaluation of the model's performance. We, the authors, consider this work to be of great importance as it offers key perspectives that could improve injury prevention in athletes, contributing to their health and optimal performance.
The culmination of this study, which addressed the prediction of injuries in athletes using a logistic regression model, represents a significant advance in understanding and preventing sports risks.The robustness of the model, backed by an accuracy of 90.0%, underscores its effectiveness in classifying athletes based on the presence or absence of injuries.
In the detailed analysis of performance metrics, a high recall of 99.16% was observed, indicating the model's ability to correctly identify athletes with injuries.This metric is essential in contexts where accurate detection of positive cases is crucial, such as in injury prevention in sports.
The precision of 90.54% reinforces confidence in the model's ability to correctly classify positive cases, underlining its practical utility.The F1-Score, which combines precision and recall, showed a reasonable balance of 94.47%, consolidating the overall effectiveness of the model in situations where both metrics are fundamental.
It is relevant to highlight the importance of key variables, such as practicing sports with injury and kinesiophobia, which demonstrated significant associations with sports injuries.These findings offer valuable information that can be fundamental in the early identification of risk factors and the implementation of personalized preventive strategies.

Future Work
In the near future, a second survey will be conducted with a larger sample of athletes, aiming to expand the database, refine the acceptance percentages of the injury prediction algorithm, and include the sport that each athlete practices. This initiative aims not only to validate and improve the robustness of the model but also to provide a more complete and representative view of various sports conditions and practices. In addition, the development of a dedicated mobile application that will allow real-time injury prediction is being considered. This application would be a valuable tool at sports events, offering the ability to anticipate possible injuries and provide timely preventive measures, thus reinforcing the comprehensive care of athletes' health in high-performance situations. This approach integrates technology and research to advance the prevention and care of sports injuries. Finally, a detailed investigation will be carried out regarding the age range most affected by kinesiophobia.

Figure 1. Flow chart of the different machine learning techniques used.


Table 1. Ankle injury status by sport.

Table 7. Factors affecting sports injuries.