Predicting Fatigue in Long Duration Mountain Events with a Single Sensor and Deep Learning Model

Aim: To determine whether an AI model and a single sensor measuring acceleration and ECG could model cognitive and physical fatigue for a self-paced trail run. Methods: A field-based protocol, repeated hourly, induced continuous physical (~45 min) and cognitive (~10 min) fatigue in one healthy participant. The physical load was a 3.8 km trail run with 200 m vertical gain, with acceleration and electrocardiogram (ECG) data collected using a single sensor. The cognitive load was a Multi Attribute Test Battery (MATB), and a separate assessment battery included the Finger Tap Test (FTT), Stroop, Trail Making A and B, Spatial Memory, Paced Visual Serial Addition Test (PVSAT), and a vertical jump. A fatigue prediction model was implemented using a Convolutional Neural Network (CNN). Results: When the fatigue test battery results were compared for sensitivity to the protocol load, FTT right hand (R² 0.71) and Jump Height (R² 0.78) were the most sensitive, while the other tests were less sensitive (R² values: Stroop 0.49, Trail Making A 0.29, Trail Making B 0.05, PVSAT 0.03, spatial memory 0.003). The best prediction results were achieved with a rolling average of 200 predictions (102.4 s) during set activity types: mean absolute error was lowest for ‘walk up’ (MAE200 12.5%), and range of absolute error was lowest for ‘run down’ (RAE200 16.7%). Conclusions: We were able to measure cognitive and physical fatigue using a single wearable sensor during a practical field protocol, including contextual factors, in conjunction with a neural network model. This research has practical application to fatigue research in the field.


Why We Need to Measure Physical and Cognitive Fatigue in the Field
Measures of physical and cognitive fatigue are needed in the field to improve performance and support safe participation in outdoor environments.
Physiological and cognitive fatigue in field environments directly affects performance as a person modulates decisions based on contextual input to maintain resources [1]. Various fields where operational safety is related to fatigue have been investigated, including pilots [2,3], motor vehicle drivers [4-9], firefighters [10,11], and shift workers [12]. Physical fatigue relates to reduced force, endurance, level of effort, strength, speed, and coordination [13]. Levels of performance may be modulated by physical load, sleep, nutrition, and psychological factors based on mission duration, pain, levels of perceived exertion [14-17], intensity, and time on task [18]. Hill [19] won the Nobel Prize for his work on skeletal muscle and maximum oxygen uptake.
Cognitive fatigue can be viewed as a combination of goal, adaptation, and reward trade-offs, including the energetic requirements to achieve a goal [23,24]. Performance psychology [25,26] describes performance as recalling one's knowledge, skills, and abilities during an event. Cognitive and physical fatigue have a complex interaction of overlapping redundant systems [27].

How We Can Measure Physical and Cognitive Fatigue in the Lab and the Field
Mental and physical fatigue have been researched in the lab using different sensing modalities, including computer interaction [28], accelerometry, electroencephalography (EEG), electrooculography (EOG) [29], electromyography (EMG), and electrocardiography (ECG) [16,30-32]. However, these techniques are not always practical in a field setting.
Assessment of performance and fatigue has been studied [3] with multiple sensors and neural networks. However, these approaches have not been validated in the field with noise sources such as terrain, slope, and obstacles. Enoka [33] noted that lab-based experiments such as maximum voluntary contractions (MVC) result in task dependency that does not translate into field performance. The reduction of separate effects does not equate to overall performance. The only way to determine performance reductions from fatigue is to measure the response to loads in the field.
Field applications require the number of sensors to be minimized while performing challenging multiday events, so that they do not distract the operator from mission tasks or add to logistical loads when deploying technology into an operational environment. Although multiple sensors would aid accuracy and redundancy, they may lead to the entire system not being deployed; hence a minimum viable solution that maximizes use by operators is desirable. A review of sensors used for measuring occupational fatigue [34] showed that the most effective sensors were heart rate and accelerometry. Smartphones with multi-channel inertial sensors and deep learning models have been used for human activity recognition [35,36] in controlled environments for complex activity types. A review of physical and cognitive fatigue has shown a relationship of heart rate and accelerometry with muscle activity, proprioception, and changes in gait [37-39]. Gait and physical performance have been shown to change with increased mental fatigue [9,16,40], goals [41], and reduced executive function [42]. Terrain has been shown to influence gait and accelerometry readings [43].
Traditional machine learning with feature extraction has been used in applications such as human activity recognition [43,44]; however, this approach assumes the features of interest are known and calculable. Deep learning uses models that automatically determine feature morphology and significance in the data, which may not be observable with traditional statistics and data analysis. Deep learning has been used for areas such as wakefulness detection with accelerometry and ECG [45], and fatigue estimation by Gordienko et al. [46] showed positive results with repetitive exercises in the gym. Recurrent neural network (RNN) and long short-term memory (LSTM) models are often cited as the preferred models for time series data [47]. Convolutional neural networks (CNN) have also been used for time series data [43,48] and do not suffer from the stability issues of RNNs, while enabling parallel processing that is not possible with RNN-type models. CNN models have shown good performance on physiological time series data for emotion classification [49] and for mental fatigue [50] using EOG, which is not generally practical in field operations with high levels of activity. Accelerometry has been shown to be affected by cognitive fatigue [51].
The aims of this study were to:
- Determine whether cognitive and physical fatigue could be accurately predicted by an AI model using data from a single sensor capable of being worn in an endurance activity for multiple days, measuring acceleration and ECG in an outdoor environment with voluntary activity.
- Propose a protocol for data collection in an unsupervised remote environment with no manual labelling by the participant.
- Determine whether environmental parameters would affect accuracy, including random activity, self-pacing, terrain surface (concrete, gravel, dirt, mud, grass), and slope (flat, up, and down slopes).

Ethics
The researcher's university ethics committee (AUTEC 18/412) approved all procedures in the study and the participant gave written informed consent prior to participating in the study.

Protocol-Physical and Cognitive Load and Performance Assessments
A protocol was developed that included self-paced running in an unstructured mountain environment and standard performance assessments with no distractions in a laboratory for comparison.
The protocol was developed using physical and cognitive loads in excess of the participant's critical power [52] to induce fatigue. A one-hour period of fixed load was repeated until the participant voluntarily ceased the protocol. No restart was allowed. Physical load was provided by a trail run (3.8 km, 200 m vertical gain), and cognitive load was provided by a 10 min Multi Attribute Test Battery (MATB) [53] (Figure 1). A goal was set of 100 km distance, 5200 m (17,000 feet) total climb, and 26 h duration in order to address motivation [14] and psychological perception of pain [54]. The course was prescribed to cover various slope angles, terrain types (concrete, gravel, dirt, grass, boulders), and obstacles (trees, river, gate, fence), and to not require active navigation, for safety under fatigue and reduced decision-making capacity [55]. Speed was rewarded by earlier completion of the hourly protocol, resulting in a larger rest period per hour.
For clinical comparison, a battery of performance assessments was completed on an iPad Pro (Apple, Cupertino, CA, USA) using a custom application implementing tests built with Apple ResearchKit [56]. The assessments were chosen because they have previously shown sensitivity to the protocol loads and fatigue-related diseases (Table 1). These included assessments used for fibromyalgia [57], Parkinson's [58], and physical [16,59] and cognitive fatigue [60]. Assessments used were Stroop, Finger Tap Test (FTT), Trail Making A, Trail Making B, Paced Visual Serial Addition Test (PVSAT), spatial memory, and jump height.
The trail was divided into twenty-three sections separated by waypoints defined by a change in terrain surface, slope, or obstacle. Terrain descriptors were validated against video (GoPro Hero 4, Garmin, KS, USA). Slope was determined from a mean of GPS altitude measurements at each waypoint. Waypoint location was determined from Google Maps to an accuracy of 10 cm. Time at a waypoint was taken as the time when the subject was closest to it. Walk and Run activity labels were defined by cadence from vertical-axis accelerometry zero crossings (100 < Walk < 150 < Run steps per minute) as described in Russel et al. for human activity recognition [43]. Identification of obstacle crossings was based on geographic location and manual observation of the acceleration waveforms (Figure 2). Time resolution for labelling was one second.
Figure 2 shows the multi-channel 1-D Convolutional Neural Network (CNN) that was selected to allow learning on separate channels and cross correlation into a single regression output value. The training label was FTT up-sampled to 250 Hz. Data were split by activity type and segmented by input window length. The initial model width for all hidden layers was set at 256, which was approximately one second of data. The model used the Adam optimizer, with mean absolute error (MAE) as the error term during training. The randomized train-test split ratio was 0.33. Hyperparameter tuning, including window size for each activity type (64, 128, 256, 512), was performed. The activity with the lowest MAE was selected for further model optimization of hidden layer widths. Optimization was performed separately for three datasets: acceleration; ECG; and combined acceleration and ECG. The final model for comparison was selected for lowest MAE. Performance was assessed using the mean absolute difference (MAE200) and the range of absolute difference (RAE200) between the label values and the average of 200 predictions.
RAE was of interest as it indicated the largest error possible when the trained model was used to predict a fatigue value in the future.
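The cadence thresholds used for the Walk and Run labels can be illustrated with a minimal sketch. It assumes that one upward zero crossing of the mean-removed vertical acceleration corresponds to one step, and it treats cadences at or below 100 steps per minute as "other"; the function names, the mean removal, and the synthetic cosine test signal are illustrative assumptions rather than the processing used in the study.

```python
import math


def cadence_spm(vertical_acc, fs_hz):
    """Estimate cadence (steps per minute) from upward zero crossings."""
    mean = sum(vertical_acc) / len(vertical_acc)
    centred = [a - mean for a in vertical_acc]  # remove offset (assumption)
    # Assumption: one negative-to-positive crossing per step.
    steps = sum(1 for prev, cur in zip(centred, centred[1:]) if prev < 0 <= cur)
    duration_min = len(vertical_acc) / fs_hz / 60.0
    return steps / duration_min


def label_activity(spm, walk_min=100.0, run_min=150.0):
    """Apply the 100 < Walk < 150 < Run steps-per-minute thresholds."""
    if spm > run_min:
        return "run"
    if spm > walk_min:
        return "walk"
    return "other"  # hypothetical label for cadences below the walk threshold


def synthetic_gait(step_hz, fs_hz=250, seconds=10):
    """Hypothetical test signal: a cosine with one cycle per step."""
    n = int(fs_hz * seconds)
    return [math.cos(2 * math.pi * step_hz * i / fs_hz) for i in range(n)]


if __name__ == "__main__":
    for step_hz in (2.0, 3.0):
        spm = cadence_spm(synthetic_gait(step_hz), 250)
        print(step_hz, round(spm), label_activity(spm))
```

Real gait traces contain noise and multiple crossings per step, so a practical implementation would likely need filtering before counting crossings.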

Statistics
Linear regression (Pearson correlation R²) was performed on each performance test to assess its sensitivity to the protocol. The performance test results were normalized across the protocol and linearly interpolated to give a long-term linear fit (LTLF). The tests with the highest R² were also up-sampled to 250 Hz using inter-test interpolation (ITI), as ITI includes short-term fatigue and recovery. LTLF is more representative of long-term fatigue but is only possible with a research protocol designed with a constant load over time. ITI is needed for random field predictions where no assumptions can be made about overall loads.
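The two labelling approaches can be sketched as follows. This is an illustrative interpretation only: it assumes LTLF is a least-squares line through the test scores and ITI is piecewise-linear interpolation between consecutive tests, and all function names and example values are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def long_term_linear_fit(times_s, scores):
    """Least-squares line through the scores; returns (slope, intercept)."""
    n = len(times_s)
    mt, ms = sum(times_s) / n, sum(scores) / n
    sxy = sum((t - mt) * (s - ms) for t, s in zip(times_s, scores))
    sxx = sum((t - mt) ** 2 for t in times_s)
    slope = sxy / sxx
    return slope, ms - slope * mt


def inter_test_interpolation(times_s, scores, fs_hz):
    """Piecewise-linear labels at fs_hz between consecutive test scores."""
    labels = []
    for i in range(len(times_s) - 1):
        n = int(round((times_s[i + 1] - times_s[i]) * fs_hz))
        step = (scores[i + 1] - scores[i]) / n
        labels.extend(scores[i] + step * k for k in range(n))
    labels.append(scores[-1])  # include the final test score itself
    return labels


if __name__ == "__main__":
    # Hypothetical normalized test scores taken at the end of each hour.
    times = [0, 3600, 7200, 10800]
    scores = [1.0, 0.8, 0.85, 0.6]
    print("R^2:", round(r_squared(times, scores), 3))
    print("LTLF slope, intercept:", long_term_linear_fit(times, scores))
    print("ITI samples:", len(inter_test_interpolation(times, scores, 250)))
```

The ITI labels follow each rise and fall between tests, whereas the LTLF labels only carry the overall trend.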
Time series data were normalized using feature scaling via Equation (1) in preparation for training the CNN. ECG data were baseline-corrected. All accelerometer axes (x, y, z) and ECG data were transformed into an array (D, W, F) with D rows, W window width, and F number of features.
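A minimal sketch of this preparation step is given below. It assumes that the feature scaling is min-max normalization to the range [0, 1] (other forms of feature scaling exist and may have been used) and that windows do not overlap; the function names and the toy data are hypothetical.

```python
def min_max_scale(channel):
    """Scale one channel to [0, 1]; min-max scaling is an assumption here."""
    lo, hi = min(channel), max(channel)
    span = hi - lo
    if span == 0:
        return [0.0] * len(channel)  # constant channel carries no information
    return [(v - lo) / span for v in channel]


def to_windows(channels, width):
    """Stack F channels into D non-overlapping windows of W samples.

    Returns a nested list with shape (D, W, F); trailing samples that do
    not fill a whole window are dropped.
    """
    n = min(len(c) for c in channels)
    d = n // width
    return [
        [[c[i * width + j] for c in channels] for j in range(width)]
        for i in range(d)
    ]


if __name__ == "__main__":
    # Toy data standing in for acc x, y, z and ECG (F = 4 features).
    raw = [list(range(10)), list(range(10, 20)), [5] * 10, list(range(0, 20, 2))]
    scaled = [min_max_scale(c) for c in raw]
    windows = to_windows(scaled, width=4)
    print(len(windows), len(windows[0]), len(windows[0][0]))  # D, W, F
```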

Results
The participant voluntarily ceased the protocol at 11 h (2200 m vertical climb, 41.8 km) due to perceived exhaustion. Figure 3 shows representative input to the CNN: gait waveforms of vertical acceleration on tarseal and dirt at different fatigue levels. Each plot is 50 steps triggered at zero g and plotted with the median waveform in a thick black line. Inter-step variation in acceleration and morphology can be observed between surfaces (a) tarseal and (b) dirt. The changes in waveform shape between surfaces were likely due to surface hardness and variations in surface texture uniformity. Across the protocol, variation was likely due to fatigue reducing peak forces and subsequent gait adaptation, as seen on the plots at point (c).
Jump height, shown in Figure 4b, was performed after each physical load period and showed high correlation (R² 0.78) with the protocol (Table 2). Stroop, shown in Figure 4c, had two outliers and showed moderate correlation (R² 0.5) with the outliers removed. PVSAT, shown in Figure 4d, was not correlated with the protocol load. Trail Making A (R² 0.29) and spatial memory (R² 0.28) were somewhat correlated to post-cognitive load. Trail Making B (R² 0.22) was somewhat correlated to post-physical load. Figure 5 shows the variation of gait for four periods in the protocol, illustrating the variation in the accelerometer waveforms for both fatigue levels and terrain.
A training result is shown for a single activity, 'run down', in Figure 6 for data window 128 and epoch 100, with individual predictions (light grey) and a rolling average of 200 predictions (black). The label is the FTT inter-test linear interpolation (red), with discontinuities between time periods due to concatenation. A total of 108 machine learning experiments were performed to test which input data width and activity type gave the best MAE. Initially, a fixed CNN topology was used (Epoch 50, Batch 256, layer 1 filter 256, layer 2 filter 256, dense layer 128, overlap = 0).
Results were compared for three data groups: acceleration, ECG, and combined acceleration with ECG. These three conditions were tested for each activity type ('run', 'walk', etc.) over four data window widths (64, 128, 256, and 512). The results for these experiments are shown in Figure 7 by activity, where circle diameter is data window width. Minimum MAE was at 'walk up' (window width 256, MAE 0.105, samples 1,534,500, windows 5994) and 'sit' (window width 256, MAE 0.116, samples 2,662,750, windows 10,401). However, 'sit' was not included as it took place in the lab for cognitive testing. Samples were more numerous for 'run down' (window width 256, MAE 0.181, samples 1,843,749, windows 7202), which still gave a larger minimum MAE. This indicates that total sample count is not the main influence on MAE; however, the activity with considerably fewer samples, 'walk down', did show larger MAE values (window width 512, MAE 0.309, samples 20,000, windows 78).
Further experiments were performed for acceleration and ECG with 'walk up' to optimize the CNN model hyperparameters, varying the widths of the first two convolutional layers and the dense layer. The lowest MAE was found with the following model: Conv1D 128, Conv1D 128, max_pooling, flatten, dense 128, dense 1. Table 3 shows the total samples per activity and the MAE and RAE results for training labels produced by two methods, linear fit and inter-test interpolation, with window width 128, epoch 100, batch size 256, and a rolling window average of 200 predictions. There was no result for the activity 'walk down', as the total samples divided by the window width of 128 gave 156 windows, which was fewer than the 200 predictions required for the rolling average. Activity 'Walk Up' gave the lowest MAE for both linear interpolation and inter-test interpolation of label data.
Activity 'Run Down' gave the lowest range of errors, indicating it may be a better activity for field prediction.
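The rolling-average error measures can be sketched as below. This is one interpretation, not the study's code: it assumes each averaged prediction is compared with the label at the end of its window, and that the range of absolute error is the maximum minus the minimum absolute error. The function names are hypothetical.

```python
def rolling_mean(values, n):
    """Mean of each run of n consecutive values."""
    if len(values) < n:
        return []
    total = sum(values[:n])
    out = [total / n]
    for i in range(n, len(values)):
        total += values[i] - values[i - n]
        out.append(total / n)
    return out


def mae_rae(predictions, labels, n=200):
    """Return (MAE_n, RAE_n), or None if there are fewer than n predictions."""
    smoothed = rolling_mean(predictions, n)
    if not smoothed:
        return None
    targets = labels[n - 1:]  # assumption: align with the end of each window
    abs_err = [abs(p - t) for p, t in zip(smoothed, targets)]
    mae = sum(abs_err) / len(abs_err)
    rae = max(abs_err) - min(abs_err)  # assumption: range = max - min
    return mae, rae


if __name__ == "__main__":
    noisy = [0.6, 0.4] * 200   # predictions scattered around 0.5
    labels = [0.5] * 400
    print(mae_rae(noisy, labels, n=200))
    # Fewer than 200 predictions gives no result, as for 'walk down'.
    print(mae_rae([0.5] * 156, [0.5] * 156, n=200))
```

Averaging over many consecutive predictions suppresses step-to-step scatter, which is why an activity with too few windows cannot produce a rolling result.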

Discussion
A protocol for cognitive and physical fatigue was performed in the field, with voluntary activity selection and voluntary pacing over various terrain slopes and surfaces. Jump height and FTT-dominant-hand were most sensitive to the protocol. FTT-non-dominant-hand and Stroop were moderately sensitive. FTT was the most sensitive and biomechanically non-specific, as the legs were exposed to physical load and the arms-hand-fingers were tested for neuromuscular performance. It is likely Stroop would be more sensitive if the protocol included sleep deprivation. Spatial Memory was mildly correlated to the cognitive load.
The experiment showed that a field protocol of cognitive and physical load in excess of a critical power will cause failure and modulate standard objective measures of cognitive and physical performance. Mental and physical fatigue led to earlier-than-anticipated termination of the protocol, which aligned with previous studies [16,40].
The use of a machine learning model was required due to the complex gait waveform morphology variations throughout the protocol. The results for acceleration, ECG, and combined acceleration and ECG are shown in Figure 7 across window widths from 64 to 512 samples. While the activity 'sit' had low MAE, showing how a controlled environment could give good results, our work aimed to determine whether this was possible in an uncontrolled field-based environment. Activity 'walk up' had low MAE for both inter-test interpolation and long-term linear fit. 'Run down' had the lowest RAE. It is recommended that RAE is used, as this represents the error range to be expected when the model is used for inference in the future. This experiment showed how a single sensor could be used in conjunction with a CNN model to give accurate results of cognitive and physical fatigue equivalent to gold standard objective tests (FTT and Vertical Jump Test). Best results were obtained when model training was specific to activities such as 'run down' and 'walk up'. MAE and RAE performed well for a rolling window of 200 consecutive predictions (102 s). This makes intuitive sense: any one step in a person's gait may be influenced by objects, surface, and other distractions, so it is best to use multiple steps of a person's gait to determine a fatigue result. Winter [70] showed that the cadence in steps per minute on a uniform surface varied from 84.7 ± 10.4 for slow to 121.6 ± 5.3 for fast.
The input window size of the CNN model has an optimum size. Too small does not allow a full gait or ECG waveform to be analyzed, and too large significantly reduces the number of training samples.
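The trade-off can be made concrete with a short calculation, assuming the 250 Hz sample rate and non-overlapping windows reported for this study. The function name is hypothetical; the sample counts are those reported for 'walk up', 'run down', and 'walk down'.

```python
FS_HZ = 250  # sample rate of the up-sampled labels and sensor data


def window_stats(total_samples, width, fs_hz=FS_HZ, rolling_n=200):
    """Summarize how a window width affects windows and time coverage."""
    windows = total_samples // width  # non-overlapping windows
    return {
        "windows": windows,
        "seconds_per_window": width / fs_hz,
        "rolling_span_s": rolling_n * width / fs_hz,
        "enough_for_rolling": windows >= rolling_n,
    }


if __name__ == "__main__":
    samples = {"walk up": 1_534_500, "run down": 1_843_749, "walk down": 20_000}
    for activity, total in samples.items():
        for width in (64, 128, 256, 512):
            print(activity, width, window_stats(total, width))
```

With a width of 128, one window spans about half a second and 200 consecutive predictions span 102.4 s, while 'walk down' yields only 156 windows, too few for the rolling average.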
Tests that had the highest sensitivity to the protocol, and indicated a central fatigue component, were the jump test (high physical load on the legs) and the FTT (utilized hand digits which were not significantly utilized during running). Cognitive tests were less sensitive to the protocol, indicating there may have been a mismatch between cognitive and physical loads.
The effectiveness of the protocol was encouraging as it provided proof of concept for translational research to be undertaken in outdoor environments. Future work could examine how team workload and tactical decision-making can be adjusted for cognitive and physical fatigue in real time with no additional data entry for soldiers on multiday missions. Recovery during training missions could be assessed without researchers being present. Adventure sports people could gain insight into their cognitive and physical fatigue, enabling informed training plans. Work rest cycles could be adjusted, and critical tactical and navigation decisions can be chosen based on periods of highest cognitive performance.
This feasibility study researched approaches of protocol design, error sources, calibration techniques, data collection, validation, labelling, and data processing. Given the lessons learnt, data gathering and processing needs to be more automated to reduce the high processing load that occurred for the one participant in this study. Further work is needed to test inter-subject variability to the protocol, test-retest accuracy of the prediction model, longer duration, and additional fatigue modulators including sleep, pain, discomfort, and nutrition.

Limitations
Limitations in validating the experimental objective include the linear protocol and the limited number of comparison tests; however, this is a natural limitation of cognitive assessment in the field. A long-term linear fit was appropriate for this protocol as the repetitive load could be assumed constant over the longer-term time frame. A random field assessment with no defined load protocol would require training using inter-test interpolation to allow for stochastic loads and recovery cycles. A constant long-term load was required to fit a machine learning model. Future work could compare the results in a long-term non-periodic protocol.
The limitations of this test were the duration and the use of a single participant to initially prove the feasibility of the protocol and approach. Further research is required around increasing the duration of the protocol, possibly by reducing the hourly physical load. Additional studies over longer periods are required to generate cognitive fatigue that includes sleep deprivation. The test battery should include assessments immediately after large vertical ascents to gather insight into short-term recovery. The addition of cognitive loads and assessment significantly affected the rate of perceived exertion. Future protocols should halve the physical load to lengthen the time to failure. Additionally, this method requires more participants to compare inter-person sensitivity and variability.

Conclusions
This paper showed that a single wearable sensor could be used in conjunction with a neural network model to determine cognitive and physical fatigue without performance tests being required during an operation in an unstructured outdoor environment. This research has the potential to increase safety and operational performance in high-risk environments by indicating the possibility of replacing traditional performance tests with a single wearable device. This work is novel, to the knowledge of the authors, in developing a field-based protocol for human performance with no direct supervision and with modulation from ground surface, slope, fatigue, and task motivation. Future research with more participants is required and will need further automation of data labelling to process field data with self-paced activities.