Biomechanical and Psychological Predictors of Failure in the Air Force Physical Fitness Test

Physical fitness is a pillar of U.S. Air Force (USAF) readiness and ensures that Airmen can fulfill their assigned mission and be fit to deploy in any environment. The USAF assesses the fitness of service members on a periodic basis, and discharge can result from failed assessments. In this study, a 21-feature dataset describing 223 active-duty Airmen who participated in a comprehensive mental and social health survey, body composition assessment, and physical performance battery was analyzed. Graphical analysis revealed pass/fail trends related to body composition and obesity. Logistic regression and limited-capacity neural network algorithms were then applied to predict fitness test performance using these biomechanical and psychological variables. The logistic regression model achieved a high level of significance (p < 0.01) with an accuracy of 0.84 and AUC of 0.89 on the holdout dataset. This model yielded important inferences that Airmen with poor sleep quality, a recent history of injury, higher BMI, and low fitness satisfaction tend to be at greater risk for fitness test failure. The neural network model demonstrated the best performance, with 0.93 accuracy and 0.97 AUC on the holdout dataset. This study is the first application of psychological features and neural networks to predict fitness test performance, and it obtained higher predictive accuracy than prior work. Accurate prediction of Airmen at risk of failing the USAF fitness test can enable early intervention and prevent workplace injury, absenteeism, inability to deploy, and attrition.


Introduction
The purpose of the Air Force Physical Fitness Test (APFT) is to promote year-round physical conditioning resulting in increased productivity, decreased absenteeism, and optimized readiness [1,2]. Failure to pass the APFT can directly impact mission capabilities, ability to deploy, workplace injuries, and attrition. The U.S. Army Public Health Center produces statistics annually for the U.S. Army documenting the rising rates of inactivity, obesity, and Army Physical Fitness Test failures. The 2015 report identified that one in 20 active-duty Army Soldiers fails to pass their annual fitness test, resulting in a three-fold increased risk of inability to deploy [3]. The report also noted that it costs the U.S. Army $137 million annually to recruit and train soldiers discharged due to test failure ($76,000 per recruit). An accurate predictive model could refer specific Airmen to targeted wellness initiatives through established installation human performance services, such as the Health Promotions Office, Fitness Facility, Aerospace Physiology, and Integrated Operational Support Teams [4,5].
There is a growing need for studies to evaluate predictors of military fitness test failure to address and reduce the impacts on health and deployment readiness [2]. One study investigated attrition in the civilian recruit population prior to and during basic military training and reviewed demographic, cognitive, mental health, physical health, and personal fitness [6]. While valuable for screening recruits, this study did not apply machine learning algorithms to predict APFT performance.
Three prior studies used machine learning algorithms to predict either test or test component performance, and they are summarized in Table 1. Orr et al. (2020) used age, height, weight, initial fitness performance, and course duration to predict fitness test failure in the Australian Army [7]. Sih and Negus (2016) used a time-series fitness and fatigue phenomenological model to predict Army Basic Combat Training two-mile run time, which is a component of the U.S. Army fitness assessment [8]. Finally, Allison, Knapik, and Sharp (2006) used age, height, weight, initial test results, test scores, education, number of dependents, and demographic data to predict U.S. Army fitness assessment failure [9].
The prior works incorporated demographic and biomechanical variables into their prediction models but did not consider psychological and social variables as predictors, such as workplace sleepiness, burnout, traumatic stress, and social wellness. They also did not apply neural networks as a predictive algorithm. There is supporting evidence that psychological factors are associated with physical activity engagement [10] and improved training effectiveness [11], and can predict dropout in military populations [12].
This study examines both biomechanical and psychological variables to predict fitness test failure, and focuses on the following research question: Using available biopsychosocial data, how well can classical machine learning and neural network algorithms predict failure in the Air Force Physical Fitness Test in a population of active-duty Airmen? It was hypothesized that combining biomechanical and psychosocial features would yield a better predictive model with included features that are significant (α ≤ 0.05), an area under the receiver operator characteristic curve (AUC) that improves upon the prior work in Table 1 (AUC ≥ 0.80), and accuracy > 0.90, which is consistent with other medical studies [13]. Additionally, it was hypothesized that applying a neural network model would yield better model performance than a logistic regression model.

Methods
This study was conducted on an existing dataset collected from a support squadron at a U.S. Air Force Base [14,15]. This support squadron comprised approximately 280 active-duty personnel. Data collection occurred across multiple days in February and July 2021. Advertisement for participation was conducted by word of mouth, squadron emails, and flyers. All potential participants were informed that the data collected would be anonymous and that the fitness testing would not be counted as their official diagnostic APFT. This was done to reduce fear of reprisal and boost participation. All data were collected as a part of the operational mission of an embedded, multidisciplinary health and wellness team [15]. Informed consent was collected from all participants, and data were deidentified at the point of collection. The 223-sample dataset is summarized in Table 2 and includes demographics, mental health surveys, fitness participation surveys, injury history surveys, physical performance measures, and body composition assessments.

Data Understanding and Preparation
The data were prepared, processed, and analyzed using the Scikit-learn, Keras, and TensorFlow frameworks within Python version 3.7. The cross-industry standard process for data mining (CRISP-DM) was followed, with the phases of data understanding, data preparation, modeling, and evaluation [23].
An interview with the embedded health team that provided the data described the protocol of the testing day [15].

• Participants started with the physical and mental health questionnaires. The mental health questionnaires were proctored by a licensed clinical psychologist and were immediately followed by a body composition assessment proctored by a registered dietitian.
• The physical questionnaires assessed (1) the member's six-month history of musculoskeletal injury and whether the member sought medical evaluation for that injury, (2) whether the member sustained an injury within the last six months and, if so, whether it impacted their participation in physical activities, (3) whether or not they were currently on a duty-limiting medical profile, and (4) the member's perceived satisfaction with their current fitness level on a five-point Likert scale.
• BMI, body fat percentage, and muscle mass percentage were assessed using the InBody 230 (InBody Ltd., Seoul, Republic of Korea) bioelectrical impedance analyzer [20].
• Next, the participants performed the Functional Movement Screen (FMS), proctored by a certified and credentialed athletic trainer. The FMS protocol is well described in the literature [21,22,24-26].
• The assessment concluded with the administration of the APFT, which is further described below.
The APFT comprises a timed 1.5-mile run, one minute of push-ups, and one minute of sit-ups; all components were performed on the same day as the collected measurements in accordance with the documented protocol within the Air Force manual for Fitness Testing [27]. The 1.5-mile timed run was performed outdoors on a standard 400-m track. The push-ups and sit-ups were performed indoors with options for a 1-inch pad and/or toe-bar. The APFT was performed and proctored by certified Air Force Physical Training Leaders. Raw APFT scores were collected, scored according to the manual, and then categorized into a binary category of either pass (≥75% composite score and passed all components) or fail (<75% composite score or failed a component).
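The binary categorization above can be sketched as a small helper; the function name and inputs are illustrative, not the study's actual code:

```python
def categorize_apft(composite_score, component_passes):
    """Binary pass/fail per the criteria above: pass (1) requires a
    composite score >= 75% AND passing every component; otherwise
    fail (0). Inputs are illustrative placeholders."""
    return 1 if composite_score >= 75.0 and all(component_passes) else 0

# An 80% composite with a failed component is still a fail:
categorize_apft(80.0, [True, True, False])  # -> 0
```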
The binary dependent variable was whether the participant successfully passed all components of the APFT. This variable was unbalanced, as 70.4% of the participants passed the APFT. Preliminary modeling was conducted as a part of data preparation and selected categorical input variables were removed. Rank, Section, and Flight were removed as they each contained numerous categories and had a minimal influence on performance. The remaining numeric and ordinal categorical features were graphically analyzed for normality. Age and Outcome Rating Scale were identified as not normal and were log-transformed to increase their normality. While the Post-Traumatic Stress Disorder Checklist (PCL-5) feature was right-skewed, it was not transformed as various transformations did not create a normal distribution. The feature distributions are shown in the Figure 1 raincloud plot. To facilitate the visualization of the histograms, the features were temporarily min-max standardized prior to creating the figure.
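The log transformations and the temporary min-max scaling used for Figure 1 might look like the following sketch; the column names are illustrative assumptions, and np.log1p is one of several possible log transforms:

```python
import numpy as np
import pandas as pd

def prepare_for_plot(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the preparation described above. Log-transform the
    features identified as non-normal, then min-max scale every
    column to [0, 1] for the raincloud visualization only."""
    out = df.copy()
    for col in ("Age", "OutcomeRatingScale"):  # illustrative column names
        out[col] = np.log1p(out[col])
    return (out - out.min()) / (out.max() - out.min())
```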

Metrics
In this analysis, the classification metrics of AUC, precision, recall, and accuracy are used to measure and compare the performance of the classical and neural network models. In the dataset, the positive class (1) is passing the APFT. In the case where an Airman failed the APFT, it is critical that a model prediction of passing (a false positive) be avoided; this would result in the negative consequence of an at-risk Airman not receiving assistance. As a result, a high value of precision is important, since precision penalizes false positives. The confusion matrix false positives and the receiver operator characteristic (ROC) curve false positive rate were examined. Finally, precision and accuracy were selected as the primary comparison metrics, both to emphasize true positives and to facilitate a comparison to prior work.
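To make the precision argument concrete, the confusion-matrix cells and derived metrics can be computed directly; the labels below are invented for illustration only (1 = pass):

```python
def confusion_counts(y_true, y_pred):
    """Confusion-matrix cells with positive class (1) = passing the APFT.
    A false positive (predicted pass, actual fail) is the costly error,
    as it leaves an at-risk Airman without assistance."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def precision(tp, fp):
    # High precision means few failing Airmen are mislabeled as passing.
    return tp / (tp + fp)

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# Illustrative predictions only (not study data):
tp, fp, fn, tn = confusion_counts([1, 1, 1, 0, 0, 1, 0, 1],
                                  [1, 1, 1, 1, 0, 1, 0, 0])
```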

Classical Modeling
For the classical modeling portion of the study, the Logit logistic regression algorithm in Python was used. The original dataset was split into a 70/30 train/test split for performance evaluation and overfitting monitoring. The split was stratified to ensure an equal percentage of pass/fail results in each set.
Several feature selection approaches in logistic regression modeling were used to provide a robust comparison. Two models were created with a backwards stepwise feature selection approach, using an α = 0.05 p-value selection method. Recursive feature elimination (RFE) was also investigated, which fits the model, rank orders all features by importance, and removes the weakest feature recursively. Lastly, the select-K-best method was used; this method selected the best predictors of the dependent variable using a χ2 (chi-squared) score function.
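A minimal sketch of the RFE and select-K-best approaches, using scikit-learn on a synthetic stand-in dataset (the study's actual features and data are not reproduced here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 21-feature dataset (illustrative only).
X, y = make_classification(n_samples=223, n_features=21,
                           n_informative=4, random_state=0)

# RFE: fit, rank features by importance, and recursively drop the
# weakest feature until the requested number remain.
rfe = RFE(LogisticRegression(max_iter=1000),
          n_features_to_select=4).fit(X, y)
rfe_features = np.flatnonzero(rfe.support_)

# Select-K-best with a chi-squared score function; chi2 requires
# non-negative inputs, hence the min-max rescaling first.
X_pos = MinMaxScaler().fit_transform(X)
skb = SelectKBest(chi2, k=4).fit(X_pos, y)
skb_features = np.flatnonzero(skb.get_support())
```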
A summary of the logistic regression variations and their respective results are included in Table 3.

Neural Network Modeling
The Python 3.7 TensorFlow and Keras libraries were used to create neural networks to predict whether an Airman will pass their APFT. Binary cross-entropy was selected as the loss function since this was a binary classification problem. The Adaptive Moment Estimation (Adam) optimization algorithm was used as it is recommended for multi-layer networks [28,29]. The rectified linear unit (ReLU) activation function was used for all layers except for the final layer, which used a sigmoid activation function [30]. All non-output layers used L2 regularization with λ = 0.001. The same log transformations were applied as in the classical modeling, and then the data were z-score normalized.
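The architecture described above might be sketched in Keras as follows, assuming a single hidden layer; the layer width and learning rate are placeholders, not the study's selected hyperparameters:

```python
from tensorflow import keras

def build_model(n_inputs: int, n_neurons: int, learning_rate: float = 1e-3):
    """Sketch of the described architecture: ReLU hidden layer with
    L2 regularization (lambda = 0.001), sigmoid output, binary
    cross-entropy loss, and the Adam optimizer."""
    model = keras.Sequential([
        keras.Input(shape=(n_inputs,)),
        keras.layers.Dense(n_neurons, activation="relu",
                           kernel_regularizer=keras.regularizers.l2(0.001)),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```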
One concern with this dataset was the relatively low ratio of 223 data points to 21 input variables. This concern was mitigated in two ways, and the first was to limit the capacity of the neural network. In the 1st International Conference on Neural Networks, Widrow proposed that the recommended number of datapoints P is the number of weights N_w = neurons × (inputs + 1) divided by the desired error level ε, i.e., P = N_w/ε (Equation (1)) [31]. For the goal accuracy of this study (0.90, i.e., ε = 0.10), the recommendation is 1 neuron for 21 inputs and 4 neurons for 5 inputs. As a result, the capacity of the neural network was limited to within an order of magnitude of these recommendations, and potential overfitting was closely monitored.
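The Widrow rule of thumb (recommended datapoints P = weights / error, with weights = neurons × (inputs + 1)) can be checked numerically; rounding to the nearest integer is an assumed convention that reproduces the quoted capacities:

```python
def recommended_neurons(n_samples: int, n_inputs: int,
                        target_accuracy: float) -> int:
    """Solve Widrow's rule P = N_w / eps for the neuron count, where
    N_w = neurons * (inputs + 1) and eps = 1 - target accuracy.
    Rounding to the nearest integer is an assumption."""
    eps = 1.0 - target_accuracy
    return round(n_samples * eps / (n_inputs + 1))

recommended_neurons(223, 21, 0.90)  # -> 1 neuron for the full 21 inputs
recommended_neurons(223, 5, 0.90)   # -> 4 neurons for 5 inputs
```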
The second method to address the low ratio of data points to input variables was to use a two-way dataset split with cross validation instead of the typical three-way train/test/holdout split used in neural networks. This maximized the number of datapoints available for training. The same 70/30 train/test stratified split was used as in the classical modeling, and three-fold cross validation was applied in the following manner:
• A multi-dimensional hyperparameter search (neurons, layers, learning rate, epochs, and batch size) was performed on the training dataset using three-fold cross validation. Total neuron count was limited by the Widrow recommendation [31].
• Overfitting was monitored by comparing the accuracy of each fold. Models with >5% inter-fold accuracy variance were not considered for selection.
• An optimal set of hyperparameters was determined by the highest mean fold accuracy of the remaining models.
• Using the optimal hyperparameters, the model was then retrained on the entire training dataset.
• The model was validated by measuring metrics on the holdout dataset.
This process was used twice: with all features and ≤10 total neurons and also on the four features from the best classical model (Sleep Score, BMI, Fitness Satisfaction, Physical Restriction) with ≤40 total neurons.
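The two-way split with a cross-validated hyperparameter search might be sketched as follows; for brevity this uses scikit-learn's MLPClassifier and synthetic data rather than the study's Keras models, and the search grid is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the study data (223 samples, 21 features).
X, y = make_classification(n_samples=223, n_features=21, random_state=0)

# Two-way stratified 70/30 split -- no separate validation partition.
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# Three-fold CV hyperparameter search on the training set only; with
# refit=True the winning hyperparameters are retrained on all of X_tr.
search = GridSearchCV(
    MLPClassifier(max_iter=2000, random_state=0),
    param_grid={"hidden_layer_sizes": [(1,), (4,)],
                "learning_rate_init": [1e-2, 1e-3]},
    cv=3, scoring="accuracy", refit=True)
search.fit(X_tr, y_tr)

# Final validation: metrics measured once on the holdout set.
holdout_acc = search.score(X_ho, y_ho)
```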

Results
Graphical analysis revealed two trends that are highlighted in Figures 2 and 3. The first trend is the influence of body composition on the ability to pass the APFT. As shown in Figure 2a, there was a clear correlation between increased muscle percentage and passing the APFT. This trend plateaued once the member's muscle percentage reached 40%. As a corollary, Figure 2b shows an increase in the failure rate once the member's body fat percentage increased past 20%. The second trend is shown in Figure 3, which shows the relationship of obesity to passing the APFT. The figure is overlaid with biological gender-based body fat norms for men (Figure 3a) and women (Figure 3b) from the American Council on Exercise (ACE) [32].
A significant proportion of Airmen in the dataset are obese according to the ACE body fat norms, and a trend is apparent in Figure 3 that there is a much higher failure rate for those service members that are classified as obese. This is an important finding not only due to the correlation between higher body fat and failing the APFT, but also because of the health risks associated with obesity, such as heart disease and type 2 diabetes.
While 64% of overall Airmen were obese and 30% of overall Airmen failed the APFT, 82% of female Airmen were obese and 51% of female Airmen failed the APFT. Female Airmen also rated their satisfaction with their fitness lower than male Airmen: Only 21% of females rated their fitness satisfaction a 4 or higher (out of 5), while 41% of males did. While it is acknowledged this dataset is small and is not representative of the entire USAF, the trends regarding obesity and lack of fitness, especially for female Airmen, are noteworthy.

Model Results
The performance metrics are presented in this section for the five variations of classical logistic regression and three variations of neural network modeling. For comparison, metrics are also calculated for two trivial models: One that predicts randomly (Chance) and one that always predicts the majority class (Always Predicts Pass), also known as the no information rate model.

Classical Modeling
The five classical models developed in this work were significant with a p-value < 0.01, and their AUC, precision, recall, and accuracy are presented in Table 3. For comparison, these performance metrics are also presented for the two trivial models, Chance and Always Predicts Pass. The goals of this study are also shown in the bottom row of the table.
The four-feature p-value model was found to be the best classical model, with a higher combination of accuracy, precision, recall, F1-score, and AUC compared to all other models. The following input variables were contained in this model: Sleep (p-value = 0.014), body mass index (BMI, p-value = 0.015), self-report of fitness satisfaction (FitSat, p-value < 0.01), and recent injury resulting in physical restriction (PhysRestr, p-value < 0.01). This model was better at predicting APFT pass (f1 = 0.89) than APFT failure (f1 = 0.77).
For the four-feature p-value model, the confusion matrix from the holdout dataset is shown in Figure 4a along with the ROC curves from both the training and holdout datasets in Figure 4b. The confusion matrix showed that this model had eight false positives out of 19 APFT failures and three false negatives out of 48 APFT passing results. The ROC curve shows an acceptable 5% level of overfitting when comparing the AUC from the train and holdout datasets. While the selected classical model possessed good performance, it did not meet all the study goals.

Neural Network Modeling
Three neural network models are presented in this work: a baseline model on the entire dataset followed by models selected from multidimensional hyperparameter sweeps on two sets of input features. The baseline model was created with inputs to a single 10-neuron ReLU layer and a single-neuron sigmoid output layer. The first dataset for the hyperparameter sweeps was the entire 21-feature dataset, and the second dataset contained the 4 features from the best classical model: Sleep Score, BMI, Fitness Satisfaction, and Physical Restriction. The hyperparameter sweeps were performed using the GridSearchCV algorithm within the ranges shown in Table 4.
For each family of models, metrics were collected and used to select the best model. As three-fold cross validation was used in all modeling, the accuracy for each fold was retained and used in two ways: (1) the mean accuracy was calculated across the three folds, and (2) the maximum variance was determined between the three folds. The best model was selected as the model with the highest mean accuracy that possessed the lowest inter-fold variance. These metrics for the families of models that resulted from the two variations are shown in Figure 5, along with an arrow that indicates the best model. Using these criteria, the five neural network models that possessed the highest mean fold accuracy are presented in Table 5 for the full 21-feature dataset (top) and the limited four-feature dataset (bottom). Their associated hyperparameters and inter-fold variance are shown as well. Bold text indicates the best models, which had an excellent combination of performance and limited complexity.
Concerning the selected "best" models, the neural network structure, accuracy vs. epoch training curves, ROC curves and AUC, and confusion matrix are shown in Figure 6. The left side of Figure 6 shows the model information for the full-dataset model, and the right side shows the limited-dataset model. There is a difference between the three-fold cross validation metrics and the final neural network performance metrics; for example, the accuracy for the full model in Table 5 (84.6%) is lower than the accuracy shown in Figure 6 panel b1 (93%). The authors propose this is due to the variation of training dataset size between the models: three-fold cross validation (used for hyperparameter selection) used subdivisions of the training dataset, while the final model used those hyperparameters on the full training dataset. The impacts of normalization on the neural network models were then investigated. The performance shown in Table 5 includes z-score normalization, and the hyperparameter search was repeated with normalization removed. Models were discarded that contained more than 10 neurons (full dataset model) or 40 neurons (limited dataset model). In this case, the best full model with <5% inter-fold variance possessed a mean fold accuracy of 80.8%, which is 3.8% less than the model created from the z-score normalized dataset. The best limited model with <5% inter-fold variance possessed a mean fold accuracy of 84.0%, which is 7.7% less than the model created from the z-score normalized dataset.
The metrics for the neural network models presented in this work are summarized in Table 6 and were measured on the holdout dataset. Precision and recall are calculated as the weighted average between the majority and minority classes.

Discussion
In the current study, predictors of Air Force Physical Fitness Test failure were evaluated using logistic regression and limited-capacity neural networks. The models developed in this study demonstrated better performance than prior work to predict physical fitness test failure [7][8][9], and the neural network model achieved the highest level of performance. All models demonstrated an acceptable <5% level of overfitting between the training and holdout datasets.
The best logistic regression model was significant (p-value < 0.01), but it did not achieve the precision and accuracy goals of this study. While the logistic regression model underperformed the neural network, it yielded valuable inferences that poorer sleep quality, higher BMI, recent history of an injury, and a self-report of lower fitness satisfaction are potential indicators of APFT failure. Importantly, these features are modifiable with health promotion and workplace wellness interventions, which has the potential to change the likelihood of APFT failure.
A method of hyperparameter sweeps, regularization, and model selection enabled the neural network to outperform the logistic regression model. The 21-feature full dataset model yielded the best combination of AUC, precision, recall, and accuracy, as shown in Table 6. Notably, the near-identical performance from the four-feature limited dataset model (sleep quality, BMI, self-report of fitness satisfaction, and injury resulting in physical activity restriction) suggests that strong APFT prediction can be obtained with minimal data collection. This has the potential to enable wider implementation of this modeling approach as the full battery of tests requires a significant resource dedication. Future research investigating predictive modeling in sports science and injury prevention should consider limited-capacity neural network models.
If Air Force unit commanders and installation Health Promotion Offices implement unit-based surveys and body composition assessments of Airmen to assist in identification of individuals at greater risk of APFT failure, they could prioritize workplace interventions and installation health promotion assets to those individuals and units with the highest degree of need. These assets include the Health Promotions Office, Fitness Facility, Aerospace Physiology, and Integrated Operational Support Teams [4,5]. Changing from a reactive-based post-APFT failure model towards a proactive preventative model could improve retention and morale of military members.
Additionally, there is the potential for significant savings by preventing APFT failure. Successfully passing physical fitness testing and adhering to standardized exercise programming have been recognized as major contributors in noncombat injury risk reduction (33-45%) among active-duty members [33,34]. This is significant, as noncombat musculoskeletal injuries account for more than 80% of all injuries, 60% of limited duty days, and 65% of medically nondeployable active-duty members across the U.S. military [1]. Furthermore, the cost burden is immense for training and replacing warfighters discharged due to APFT failure: $137 million annually [3]. The USAF could save tens of thousands of dollars per Airman not lost to attrition through early identification of a servicemember likely to fail their APFT and early implementation of targeted wellness initiatives through established installation human performance services. Further research should be conducted evaluating similar biomechanical and psychosocial variables on more varied work centers in the USAF as well as other military branches.
A known limitation of this work is that participants were recruited from within a single USAF support squadron. This limits generalizability, and future research could include larger sample sizes and a greater variety of military occupations to validate the modeling approach. The difference in sample size (223) compared to the population size (280) is likely attributed to the optional nature of the study and the timing of data collection relative to shift work or approved leave.

Conclusions
The results of this study indicated that combining biomechanical and psychosocial variables yielded better prediction of failure on the USAF Fitness Test than previous work. In this work, biomechanical, psychological, and APFT performance data related to 223 active-duty Airmen were graphically analyzed to show pass/fail trends related to body composition and obesity. Multiple machine learning algorithms were then applied to predict fitness test performance using these variables. The logistic regression model achieved a high level of significance with an accuracy of 0.84 and AUC of 0.89 on the holdout dataset. This model showed that Airmen with poor sleep quality, a recent history of injury, higher BMI, and low fitness satisfaction tend to be at greater risk for fitness test failure. A limited-capacity neural network approach to predictive modeling yielded better performance than classical logistic regression: 0.93 accuracy and 0.97 AUC on the holdout dataset. Given this greater understanding of the biopsychosocial variables that appear to predict failure in the USAF Fitness Test, the U.S. Department of Defense could achieve lower numbers of fitness test failures by mobilizing health promotion and workplace wellness campaigns targeting these identified variables.