1. Introduction
Balancing training and competitive loads against recovery is important for athletes to achieve maximum performance [1]. The endurance of athletes largely depends on the effective recovery of the body after the competition. Active endurance sports have physiological, immunological, and metabolic effects on the bodies of athletes [2]. Restoration refers to the process in which an altered biological system, including the metabolome, returns to its original state [3]. Endurance athletes are characterized by higher intensities of exercise and training due to modified muscle metabolism. As training levels are different in endurance and non-endurance athletes, exercise performance is also different, because endurance sports are characterized by repeated contractions of skeletal muscle. Typical endurance sports include cross-country, long-distance swimming, cycling, skiing, and long-distance track events, whereas non-endurance events include baseball, tennis, volleyball, softball, and short-distance/sprint events for track/field and swimming [4]. In athletes, metabolic recovery occurs in two stages: a fast stage, in which oxygen, ATP, and phosphocreatine are replenished, and a subsequent slow stage, in which the adaptations of innate metabolism are restored [1,3].
Training and competitive periods can temporarily impair athlete performance. This disturbance can be short-term (several minutes or hours after exercise) or long-term, lasting up to several days [1]. Recovery depends on glycogen replenishment, usually within 24 h of strenuous exercise [5], and rehydration [6]. Long-term impairment may be due to muscle damage or delayed-onset muscle soreness (DOMS) [7]. The mechanisms mediating athlete recovery are not entirely clear; however, an imbalance between exercise stress and recovery over extended periods of time can affect the performance of an athlete. This imbalance can have long-term debilitating consequences in the form of overtraining [1].
Increased fatigue also increases the risk of injury in athletes. To accelerate the recovery process, athletes often perform structured post-competition recovery sessions. These sessions are designed to maintain the balance between exercise-induced stress and recovery. Therefore, monitoring the effectiveness of the recovery period is crucial for each athlete to take additional and timely measures to correct this period, if necessary.
Under prolonged loads, catabolic processes are mainly activated, allowing the quick and effective mobilization of the body’s energy resources to achieve a sports-related result. In contrast, anabolic processes predominate during the recovery period, allowing the body to compensate for its losses [8].
Increasingly, biomedical research is using machine learning techniques to analyze large sets of clinical data to predict the health of the human body [9]. Many studies demonstrate the promise of machine learning methods as an auxiliary tool for clinical decision-making. The traditional analysis and interpretation of the results of clinical observations is a laborious process that often depends on the practical experience of the doctor. The literature currently discusses approaches to applying machine learning methods to the classification and diagnosis of multifactorial diseases, such as coronary heart disease, cancer, diabetes mellitus, and many other pathologies [9,10,11,12,13]. Such data mining is particularly important for large clinical datasets on the health status of athletes, whose clinical indicators are significantly shifted relative to the generally accepted reference ranges obtained for a sedentary population, reflecting the body’s adaptations to regular and prolonged physical activity.
Despite the actively developing field of data mining, there is no “gold” standard in methodological approaches yet. Depending on the task and the dimension of the dataset under study, the most popular machine learning approaches are logistic regression, support vector classifier, decision tree, multinomial naive Bayes, random forest, and multinomial regression [10,11,14,15].
The goal of this study is to examine the effectiveness of the recovery period of elite athletes in the post-competitive period. This study involved 3661 athletes who completed the competitive period and underwent an in-depth medical examination (IME). The IME of an athlete was performed to obtain complete and comprehensive information about the physical state of the body, assessing the state of health, the functional state of the body, and indicators of its physical performance. Health monitoring was performed according to the anthropometric and biochemical data by classifying the phenotype according to the type of “catabolism” and “anabolism”.
The design of a study in the absence of a generally accepted approach for analyzing the results of a clinical trial traditionally includes procedures for substantiating the choice of a machine learning method, a comparative analysis of the use of two or more methods to solve a biomedical classification problem, and assessing the reliability of the results obtained.
Our study is built according to the traditional scheme and is aimed primarily at identifying the most important “predictors”, or clinical indicators, for assessing the health status of a professional athlete. The training process and diet are especially important for the possible adjustment of the recovery process. Random forest and multinomial logistic regression machine learning algorithms were used to classify catabolic or anabolic phenotypes.
2. Materials and Methods
2.1. Ethics Statement
All participants were informed of the risks and discomfort associated with the investigation and signed a written consent form to participate. The study was approved by the Board for Ethical Questions in A. I. Burnazyan State Research Center of the FMBA of Russia (Protocol No. 40 from 18 November 2020) according to the principles expressed in the Declaration of Helsinki.
2.2. Subjects
Healthy and trained athletes (n = 3661) participated in this study. Participants were excluded if they had a history of muscle disorder, cardiac or kidney disease, or if they were taking medication (including anti-inflammatory drugs, antibiotics, and supplements) or using nicotine. Each participant completed a questionnaire on their medical history and previous training. The baseline characteristics selected for this investigation are shown in Table 1.
At the time of inclusion in the study, the athletes were considered healthy based on the results of a previously completed in-depth medical examination, which included instrumental (fluorography, ultrasound examination of the abdominal cavity and pelvic organs, echocardiography, electrocardiography, and stress testing “to failure”) and laboratory examinations (general analysis of urine and biochemical and general clinical analysis of blood), as well as examinations by specialists (ophthalmologists, otolaryngologists, surgeons, cardiologists, neurologists, dentists, gynecologists (women), endocrinologists, and therapists). During the recovery period, athletes did not take pharmacological drugs or biologically active additives that could affect the biochemical analyses of the body. The physical activity took place once a day for two hours, six days a week.
The identification of the predominance of catabolic or anabolic processes (classification of four comparison groups in Table 1, column “Phenotype”) in the metabolic mechanisms of the body during the annual macrocycle was based on generally accepted (in elite sports) biomarkers of blood serum, clinical picture, and concentration of hormones of the hypothalamic–pituitary–adrenal axis (Supplementary materials Table S1).
The severity of catabolic or anabolic regulatory mechanisms is determined by the intensity and duration of physical activity. These changes in the body of athletes are combined with high activity of the hypothalamic–pituitary–adrenal axis and are confirmed by an increase in the blood concentrations of catabolic products of protein and amino acid compounds, lipid peroxidation, and nucleic acids. Intense and prolonged physical activity is accompanied by changes in the rate of regulation of the body’s metabolism and biochemical changes in the blood plasma, confirming the violation of metabolic processes in muscle tissues and internal organs.
The first class was characterized by an absolute predominance of catabolic processes of regulation of the body with signs of overwork in athletes and violations of the mechanisms of regulation of the cardiovascular, central nervous, and endocrine systems with signs of pronounced stress (most often debilitating) (catabolism in muscles and liver; Figure 1).
The second class was characterized by a slight predominance of catabolic over anabolic processes of body regulation with signs of fatigue in athletes. In this group, there was a depletion of neurohumoral mechanisms of regulation and a decrease in the functional reserves of the body by half compared to that during the optimal state of health. During stress testing, a 1.5-fold decrease in the functional performance of the body was recorded compared to the optimal level of the functional state of the body (catabolism in the muscles and liver; Figure 1).
The third class was characterized by a slight predominance of anabolic over catabolic processes of body regulation, with signs of a satisfactory condition in the examined athletes. This group easily endured intense physical and psychological stress (anabolism in the muscles and liver; Figure 1).
The fourth class was characterized by an absolute predominance of anabolic processes of regulation of the body (anabolism in the muscles and liver; Figure 1). Athletes in this group had a high level of functional reserves, and the mechanisms of metabolic and neuroendocrine interaction of body systems demonstrated optimal regulation.
2.3. Analysis of Blood Parameters
Blood sampling from elite athletes was carried out strictly on an empty stomach, according to the standard method [16], from 8 to 10 a.m., at the clinical diagnostic laboratory of the State Scientific Center of the Federal Medical and Biological Center named after A.I. Burnazyan FMBA of Russia.
Vacutainers containing K2EDTA as an anticoagulant were pre-labeled. The resulting biomaterial in a vacuum tube was centrifuged at 3500 rpm, and the supernatant was transferred into pre-labeled polypropylene tubes. Before performing quantitative biochemical analysis, blood plasma samples were frozen at a temperature no higher than −20 °C. Biochemical parameters of blood were studied using the modular platform Cobas 6000 (Roche Diagnostics, Mannheim, Germany). When analyzing peripheral blood, the following indicators were determined: acid phosphatase (U/L), lactate (mmol/L), total protein (g/L), albumin (g/L), creatinine (µmol/L), urea (µmol/L), uric acid (mmol/L), amylase (U/L), triglycerides (mmol/L), total cholesterol (mmol/L), high-density lipoproteins (HDL, mmol/L), total bilirubin (mmol/L), direct bilirubin (mmol/L), alanine aminotransferase (ALAT, IU/L), aspartate aminotransferase (ASAT, IU/L), creatine kinase (IU/L), creatine kinase-MB (IU/L), lactate dehydrogenase (IU/L), gamma-glutamyl transpeptidase (gamma-GT, IU/L), alkaline phosphatase (IU/L), total calcium (mmol/L), phosphorus (mmol/L), magnesium (mmol/L), iron (mmol/L), somatotropic hormone (GH, ng/mL), thyroid-stimulating hormone (TSH, IU/L), free T4 (IU/L), prolactin (IU/L), total testosterone (nmol/L), and the bone resorption marker β-CrossLaps (ng/mL). The results of the laboratory analyses were entered by a senior researcher at the TsSMiR FMBTS im. A. I. Burnazyan into individual cards of athletes passing the IME.
2.4. Urine Analysis
Biomaterial sampling was performed in the morning on an empty stomach. Before collecting urine, a thorough examination of the external genital organs was carried out, and an average portion of urine was collected in a sterile container. Pre-labeled polypropylene sample tubes were carefully delivered to the laboratory. Prior to analysis, the biomaterial in these test tubes was quickly frozen at temperatures not higher than −20 °C. Microscopic studies of urine were performed using AXIO microscopes with an Imager A1 video system (Axiostar Plus; Carl Zeiss, Oberkochen, Germany). Physicochemical properties were studied using an Aution Elevan AE-4020 urine analyzer (Arkray Inc., Kyoto, Japan). The results of the laboratory analyses were entered by a senior researcher at the TsSMiR FMBTS im. A. I. Burnazyan into individual cards of athletes passing the IME.
2.5. Statistical Analysis
A pairwise comparison of phenotypic classes for each metabolite was performed using Student’s t-test with Bonferroni correction. The analysis was performed using R [17] (https://www.R-project.org/, accessed on 31 August 2022) and the package rstatix [18].
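The Bonferroni step can be illustrated with a minimal sketch. It assumes that the correction is applied across the six pairwise comparisons among the four phenotypic classes for a single metabolite; the raw p-values are hypothetical and the function name `bonferroni_adjust` is introduced here only for illustration (the study itself used rstatix in R).

```python
from itertools import combinations


def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


classes = [1, 2, 3, 4]
pairs = list(combinations(classes, 2))  # the 6 pairwise class comparisons

# Hypothetical raw p-values for one metabolite, one per pair (not study data).
raw_p = [0.001, 0.004, 0.02, 0.03, 0.2, 0.5]
adjusted = dict(zip(pairs, bonferroni_adjust(raw_p)))
```

With four classes, each raw p-value is multiplied by six, so a comparison must reach roughly p < 0.0083 before adjustment to remain below 0.05 afterwards.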
2.6. Data Preparation
Before using the data for machine learning models, they were normalized and brought to the same range. The data conversion process consisted of the following steps.
All string data, such as “Gender,” “Type of load,” and “Type of sport,” were converted into categorical features.
Missing values in the numerical features were replaced by the mean of that feature, and missing values in the categorical features were replaced by the mode of that feature.
All numerical features were normalized using Z-scaling of the data based on mean and standard deviation (Equation (1)):
$$z = \frac{x - \mu}{\sigma} \tag{1}$$
where x is the original value of the feature, μ is the mean of the feature, and σ is its standard deviation.
To train and validate the model, the input data were divided into training and validation sets at the following proportions: 60 and 40% for the training and validation set, respectively.
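The preparation steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than the pipeline used in the study: the helper names and toy values are hypothetical, the population standard deviation is used for σ, and the 60/40 split is assumed to be random.

```python
import random
import statistics
from collections import Counter


def impute_numeric(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def impute_categorical(values):
    """Replace None with the most frequent observed category (the mode)."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def z_scale(values):
    """Apply z = (x - mu) / sigma to every value of one feature."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumption: population SD
    return [(x - mu) / sigma for x in values]


def train_validation_split(rows, train_fraction=0.6, seed=0):
    """Shuffle the rows (assumed random split) and cut them 60/40."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy values, not taken from the study.
lactate = impute_numeric([1.0, None, 3.0, 2.0])      # None becomes the mean, 2.0
gender = impute_categorical(["F", "M", None, "M"])  # None becomes the mode, "M"
scaled = z_scale(lactate)
train, validation = train_validation_split(list(range(10)))
```

After scaling, each numerical feature has zero mean and unit standard deviation, which keeps features measured in different units (for example, U/L and mmol/L) on a comparable scale.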
2.7. Data Classification Using the Random Forest Algorithm
The total number of decision trees in the model and the maximum depth of trees were set as the regularization parameters. The following metrics were used to evaluate the classification results:
Overall accuracy (Equation (2)) is defined as the number of correctly predicted items (true positives (TP) and true negatives (TN)) over the total number of items to predict (true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN)),
where n is the number of classes (in our case, n = 4).
$$\text{Average accuracy} = \frac{1}{n}\sum_{i=1}^{n} \text{Accuracy}_i \tag{3}$$
where n is the number of classes (in our case, n = 4) and Accuracy_i is the accuracy of the i-th class, defined as the number of correctly predicted items for each class over the total number of items to predict (Equation (4)):
$$\text{Accuracy}_i = \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i} \tag{4}$$
$$\text{Macro-averaged precision} = \frac{1}{n}\sum_{i=1}^{n} \text{Precision}_i \tag{5}$$
where n is the number of classes (in our case, n = 4) and Precision_i is the precision of the i-th class, defined as the number of true positives (TP) over the number of true positives plus the number of false positives (FP), as described in Equation (6):
$$\text{Precision}_i = \frac{TP_i}{TP_i + FP_i} \tag{6}$$
$$\text{Macro-averaged recall} = \frac{1}{n}\sum_{i=1}^{n} \text{Recall}_i \tag{7}$$
where n is the number of classes (in our case, n = 4) and Recall_i is the recall of the i-th class, defined as the number of true positives (TP) over the number of true positives plus the number of false negatives (FN), as described in Equation (8):
$$\text{Recall}_i = \frac{TP_i}{TP_i + FN_i} \tag{8}$$
where n is the number of classes (in our case, n = 4).
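As a worked illustration of these metrics, the sketch below derives them from a multiclass confusion matrix. The matrix is hypothetical, the function name is introduced here for illustration, and overall accuracy is taken as the share of items on the diagonal, which is one common reading of that definition and an assumption on our part.

```python
def multiclass_metrics(confusion):
    """confusion[i][j] = number of items of true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy, precision, recall = [], [], []
    for i in range(n):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                        # class i items missed
        fp = sum(confusion[r][i] for r in range(n)) - tp   # other items labelled i
        tn = total - tp - fn - fp
        accuracy.append((tp + tn) / total)
        precision.append(tp / (tp + fp) if tp + fp else 0.0)
        recall.append(tp / (tp + fn) if tp + fn else 0.0)
    return {
        # Assumption: overall accuracy = correctly classified items / all items.
        "overall_accuracy": sum(confusion[i][i] for i in range(n)) / total,
        "average_accuracy": sum(accuracy) / n,
        "macro_precision": sum(precision) / n,
        "macro_recall": sum(recall) / n,
    }


# Hypothetical confusion matrix for the four phenotype classes (not study data).
confusion = [
    [4, 1, 0, 0],
    [1, 3, 1, 0],
    [0, 1, 3, 1],
    [0, 0, 1, 4],
]
metrics = multiclass_metrics(confusion)
```

Because true negatives dominate when there are several classes, the per-class accuracies (and hence their average) are typically higher than the share of correctly classified items, which is why both values are worth reporting.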
Microsoft Azure ML Machine Learning Studio was used to implement the random forest model.
The following model regularization parameters were set for training (Table 2).
2.8. Classification Using the Multinomial Logistic Regression Algorithm
Multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, that is, problems with more than two possible discrete outcomes. In the multinomial logistic regression model, a binary logistic regression equation is built for each category of the dependent variable. One of the categories of the dependent variable serves as the reference category, and all the other categories are compared with it.
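In general, the multinomial logistic regression equation can be written as in Equation (10):
$$P(y = j \mid x; \theta) = \frac{\exp(\theta_j^{\top} x)}{\sum_{l=1}^{k} \exp(\theta_l^{\top} x)}, \quad j = 1, \dots, k \tag{10}$$
where x is the vector of regressors, y is the dependent variable taking the values {1, 2, …, k}, and θ is the regression parameter determined using machine learning methods.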
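A minimal sketch of how class probabilities follow from the regression parameters, assuming the usual softmax form of multinomial logistic regression with one parameter vector per class. The function name and the parameter values are hypothetical.

```python
import math


def class_probabilities(theta, x):
    """Return P(y = j | x; theta) for j = 1..k; theta holds k parameter vectors."""
    scores = [sum(t * xi for t, xi in zip(theta_j, x)) for theta_j in theta]
    top = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


# Hypothetical parameters for k = 4 classes and two normalized features.
theta = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
probabilities = class_probabilities(theta, [1.5, -0.3])
```

With all parameters at zero, the model has no information and assigns each of the four classes a probability of 0.25; training moves the parameters away from this uniform starting point.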
A classification model using multinomial regression was built using tools provided by the MS Azure ML Machine Learning Studio.
In relation to our problem, the regressors x represent the normalized features of the original dataset, and the dependent variable y takes values in {1, 2, 3, 4}. The optimal regression parameter θ was determined using a machine learning studio that uses the gradient descent method to minimize the error function, which can be expressed as (Equation (11)):
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{k} \mathbb{1}\{y^{(i)} = j\} \log \frac{\exp(\theta_j^{\top} x^{(i)})}{\sum_{l=1}^{k} \exp(\theta_l^{\top} x^{(i)})} + \alpha \lVert \theta \rVert_F^{2} \tag{11}$$
where m is the number of elements in the training set, x^(i) and y^(i) are the regressors and the class of the i-th element, k is the number of classes (in our case, k = 4), α is the regularization parameter, and ||θ||F is the Frobenius norm of matrix [θ1, …, θk].
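The error function can be sketched as follows. This is an illustration under assumptions, not the implementation inside the machine learning studio: it assumes the common softmax cross-entropy averaged over the m training elements plus α times the squared Frobenius norm of the parameter matrix, and all names and values are hypothetical.

```python
import math


def regularized_loss(theta, X, y, alpha):
    """Mean cross-entropy over the training set plus alpha * ||theta||_F^2."""
    m = len(X)
    data_term = 0.0
    for x_i, y_i in zip(X, y):
        scores = [sum(t * v for t, v in zip(theta_j, x_i)) for theta_j in theta]
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        # Classes are labelled 1..k, so class y_i uses parameter vector y_i - 1.
        data_term -= scores[y_i - 1] - log_norm
    # Assumption: the penalty is the squared Frobenius norm of [theta_1..theta_k].
    frobenius_sq = sum(w * w for theta_j in theta for w in theta_j)
    return data_term / m + alpha * frobenius_sq


# Hypothetical example: k = 4 classes, two features, untrained parameters.
theta = [[0.0, 0.0] for _ in range(4)]
X = [[0.5, -1.0], [1.2, 0.3]]
y = [1, 3]
loss = regularized_loss(theta, X, y, alpha=0.1)
```

For untrained (all-zero) parameters, the loss equals log 4 ≈ 1.386, the cost of guessing uniformly among four classes; gradient descent then lowers the data term while the penalty discourages large parameter values.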
Blood and urine biochemistry indicators were scored and weighted by the feature importance metric, which was calculated using the following algorithm:
The reference score s of the model (for instance, the accuracy) is computed on the test dataset D.
For each feature j (column of D):
  For each repetition k in 1…K:
    Column j of dataset D is randomly shuffled to generate a corrupted version of the data, D̃_{k,j}.
    The score s_{k,j} (accuracy) of the model is computed on the corrupted data D̃_{k,j}.
  The importance score i_j for feature j is computed as (Equation (12)):
$$i_j = s - \frac{1}{K}\sum_{k=1}^{K} s_{k,j} \tag{12}$$
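The permutation procedure described above can be sketched in plain Python. The toy model, data, and function names are hypothetical and serve only to show the mechanics; the score is the accuracy, as in the text.

```python
import random


def permutation_importance(score_fn, X, y, n_repeats=5, seed=0):
    """Importance of feature j = reference score minus mean score after shuffling column j."""
    rng = random.Random(seed)
    reference = score_fn(X, y)
    importances = []
    for j in range(len(X[0])):
        corrupted_scores = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the labels
            X_corrupted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            corrupted_scores.append(score_fn(X_corrupted, y))
        importances.append(reference - sum(corrupted_scores) / n_repeats)
    return importances


def toy_accuracy(X, y):
    """Accuracy of a hypothetical model that only looks at feature 0."""
    predictions = [1 if row[0] > 0 else 0 for row in X]
    return sum(p == t for p, t in zip(predictions, y)) / len(y)


# Hypothetical data: feature 0 determines the class, feature 1 is ignored by the model.
X = [[-2.0, 5.0], [-1.0, 3.0], [-3.0, 4.0], [-0.5, 6.0],
     [1.0, 4.0], [2.0, 6.0], [0.5, 3.0], [3.0, 5.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
importances = permutation_importance(toy_accuracy, X, y)
```

Shuffling a feature that the model does not use leaves the score unchanged, so its importance is exactly zero, whereas shuffling an informative feature can only lower the accuracy of this toy model.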
Machine learning algorithms were implemented using Microsoft Machine Learning Studio. In order to prevent overfitting, tools for regularization and hyperparameter optimization were used. A package for performing regularization and hyperparameter optimization was provided by the Microsoft Machine Learning Studio, based on the “Cross-validation with a parameter sweep” algorithm. For the decision forest model, the following hyperparameters were regularized:
For multinomial regression:
For cross-validation, the dataset was split into 5 consecutive folds.
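A minimal sketch of cross-validation with a parameter sweep over five consecutive (unshuffled) folds. The function names, the candidate grid, and the `evaluate` callback are hypothetical stand-ins; they do not reproduce the studio's internal algorithm.

```python
def consecutive_folds(n_samples, n_folds=5):
    """Yield (train_indices, validation_indices) for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # first folds absorb the remainder
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def parameter_sweep(grid, evaluate, n_samples, n_folds=5):
    """Return the candidate with the highest mean validation score across folds."""
    best_params, best_score = None, float("-inf")
    for params in grid:
        fold_scores = [evaluate(params, train, validation)
                       for train, validation in consecutive_folds(n_samples, n_folds)]
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Fold sizes for a dataset of 3661 athletes.
fold_sizes = [len(validation) for _, validation in consecutive_folds(3661)]
```

Each candidate setting is scored on data it was not trained on, so the selected hyperparameters are less likely to reflect overfitting to a single split.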
4. Discussion
The primary reason for the decrease in the physical performance of athletes is the reprogramming of metabolic mechanisms of regulation due to changes in biochemical processes and the work of bioregulation systems of the body, leading to structural disorders. The severity of catabolic or anabolic mechanisms of regulation is determined by the intensity and duration of physical activity. When the function of the adrenal glands is inhibited, the concentration of the products of catabolism of protein and amino acid compounds (hepatic metabolism) increases in the blood, whereas the content of metabolites of lipid peroxidation and nucleic acids increases in the muscle tissue, with a pronounced decrease in the activity of respiratory enzymes, which is extremely important in sports medicine. Increased catabolism in the body of athletes contributes to the development of detoxification system blockade and changes in blood plasma, confirming the violation of metabolic processes in the muscle tissues and internal organs.
Predicting the predominance of anabolic or catabolic processes in the athlete’s body, both at the current time and over the course of a four-year macrocycle, is a paramount task in sports medicine and in planning the recovery phase of an athlete. In this study, clinical data consist of a large set of heterogeneous features. Such a large number of biochemical indicators, most of which are beyond the acceptable ranges, complicates decision-making by physicians. It is therefore necessary to establish the most significant indicators of successful post-competitive recovery and performance.
The proposed approach, based on classification algorithms, is free of observer bias and enables the monitoring of catabolism and anabolism to support the effective management of the training activities of athletes. The most significant biochemical indicators can be identified within a relatively short computation time. Among the machine learning algorithms suitable for multiclass classification problems, the decision forest algorithm is one of the most popular tools. This algorithm can process binary, categorical, and numerical features. In this case, a simple preliminary preparation of the input data is sufficient, with no scaling or transformation. Calculations can be parallelized across several processes, which significantly reduces the calculation time. The decision forest method is suitable for multidimensional data because it operates on subsets of the data. The method is robust to outliers and to non-linear behavior of features, and it balances errors in imbalanced class sets, which reduces the overall error rate. Individually, each decision tree has high variance and low bias. Averaging all the trees in the random forest reduces the variance, resulting in a model with low bias and moderate variance.
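The variance argument can be made explicit with a standard result for averaged estimators. The symbols B (number of trees), σ² (variance of a single tree's prediction T_b(x)), and ρ (pairwise correlation between trees) are introduced here for illustration and assume identically distributed trees; they are not part of the original analysis.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

As B grows, the second term vanishes, so the remaining variance is governed by the correlation between trees, which training each tree on a different subset is intended to reduce.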
According to the model indicators (presented in Section 3), the decision forest algorithm showed the best results. However, the multinomial regression algorithm can also reveal the significant features that contribute to the successful recovery of elite athletes. The predominance of anabolic processes amongst examined individuals indicates an excellent functional state and good adaptive reserves of the body, sufficient to overcome intense and prolonged physical exertion. In connection with these aforementioned factors, this study, dedicated to the search for new informative predictors that characterize athletes’ homeostasis, is still relevant at present. To identify the most significant contributing predictors and to evaluate the prevalence of anabolic or catabolic processes, dozens of biochemical indicators (see Section 2.3 and Section 2.4; and Supplementary Materials Table S1) were collected and subsequently included in the model.
The phenotype “catabolism” for muscle and liver metabolism (classes 1 and 2) was characterized by increased aspartate aminotransferase (AST), creatine kinase, lactate dehydrogenase (LDH), and myoglobin compared to the phenotype “anabolism” (Figure 5). The reference ranges of blood and urine biochemistry for all classes correspond to the observations regarding study participants’ uric acid, urea, and creatinine. Class 1 was characterized by the absolute predominance of catabolic processes with signs of overwork in athletes, such as violations of the mechanisms of regulation of the cardiovascular, central nervous, and endocrine systems, with pronounced stress (most often debilitating). Class 2 showed a slight predominance of catabolic processes with signs of fatigue in athletes, the depletion of neurohumoral regulatory mechanisms, and a two-fold decrease in functional reserves compared with the optimal state.
A significant increase in the levels of the liver function indicators (AST and ALT), LDH, creatine kinase, and myoglobin in the blood has been shown in athletes under fatigue conditions seven days after intensive exercise (p < 0.01) [19,20,21,22]. Indeed, transaminitis in athletes is often mediated by damaged muscle tissue, as demonstrated in this study. The level of liver function indicators can be increased by an order of magnitude compared to the reference values [23,24]. Cessation of intensive training returns the liver function indicators to normal levels, which points to muscle tissue damage in athletes, as does the significant correlation of these indicators with creatine kinase levels for the catabolic phenotype (classes 1 and 2, Figure 5b). However, liver injury cannot be completely ruled out, because the level of alkaline phosphatase is elevated by 17% (Supplementary materials Table S1).
Almost 70% of the examined athletes participated in minor or moderate exercise, regardless of their phenotype (anabolism or catabolism), whereas the other 30% retained a high intensity of physical activity. We assumed that anabolism under the conditions of moderate physical activity reflects the delayed normalization of indicators after an intense period of exercise, which exceeds 11 days under conditions of overwork [25].
The pronounced catabolism phenotype is characterized by markedly elevated myoglobin, which is linked to myalgia and, in severe cases, rhabdomyolysis, and is closely related to the intensity and duration of physical activity. Rhabdomyolysis is characterized by the destruction of the integrity of skeletal muscles after physical activity [26,27]. Furthermore, we found increased levels of uric acid and creatinine in athletes with the catabolism phenotype, which is generally expected after exercise [28]. We assume that this effect is caused by compensatory mechanisms after exercise-mediated energy depletion, when a decrease in the ATP/ADP ratio triggers increased purine synthesis and the degradation and elimination of adenine nucleotides [28].
5. Conclusions
Monitoring blood biomarkers in athletes makes it possible to neutralize negative effects in the post-competitive period by adjusting diet, training load, and recovery strategy. In this regard, population-based studies of athletes are important to assess the effectiveness of the recovery strategy in accordance with clinical reference values and between the phenotypes “anabolism” (recovery state of the body) and “catabolism” (stressed state and overwork as a result of physical exertion).
We applied the decision forest and multinomial regression models to identify the most significant indicators of blood and urine biochemistry affecting the recovery processes of athletes in the post-competitive period.
This study demonstrates that laboratory test values in athletes fall outside the generally accepted reference limits, because those limits are calculated for a sedentary population, whereas athletes’ values reflect adaptations to regular and prolonged physical activity. We also demonstrated that muscle-related metabolites (AST, CK, LDH, and ALT levels, and indicators of the ornithine cycle, such as creatinine, uric acid, and urea) are significant indicators for the classification of the “catabolism” and “anabolism” phenotypes. The present model for stratifying the prevalence of catabolism or anabolism needs further adjustment. Although the proposed model efficiently handles a large number of biochemical indicators and is capable of establishing the most significant contributing factors, some factors, such as gender, contribute less to distinguishing these phenotypes. This may limit the use of the model when deciding how to adjust post-competitive recovery to achieve efficient athlete performance.