Multi-Task Classification of Physical Activity and Acute Psychological Stress for Advanced Diabetes Treatment

Abstract: Wearable sensor data can be integrated and interpreted to improve the treatment of chronic conditions, such as diabetes, by enabling adjustments in treatment decisions based on physical activity and psychological stress assessments. The challenges in using biological analytes to frequently detect physical activity (PA) and acute psychological stress (APS) in daily life necessitate the use of data from noninvasive sensors in wearable devices, such as wristbands. We developed a recurrent multi-task deep neural network (NN) with long short-term memory (LSTM) architecture to integrate data from multiple sensors (blood volume pulse, skin temperature, galvanic skin response, three-axis accelerometers) and simultaneously detect and classify the type of PA (sedentary state, treadmill run, stationary bike) and the type of APS (non-stress, emotional anxiety stress, mental stress), and estimate the energy expenditure (EE). The objective was to assess the feasibility of using a multi-task recurrent NN (RNN) rather than independent RNNs for detection and classification of PA and APS. The multi-task RNN achieves performance comparable to that of independent RNNs, with F1 scores of 98.00% for PA and 98.97% for APS, and a root mean square error (RMSE) of 0.728 cal/(hr·kg) for EE estimation on testing data. The independent RNNs have F1 scores of 99.64% for PA and 98.83% for APS, and an RMSE of 0.666 cal/(hr·kg) for EE estimation. The results indicate that a multi-task RNN can effectively interpret the signals from wearable sensors. Additionally, we developed individual and multi-task extreme gradient boosting (XGBoost) models for separate and simultaneous classification of PA types and APS types. Multi-task XGBoost achieved F1 scores of 99.89% and 98.31% for the classification of PA types and APS types, respectively, while the independent XGBoost models achieved F1 scores of 99.68% and 96.77%, respectively.
The results indicate that both multi-task RNN and XGBoost can be used for the detection and classification of PA and APS without loss of performance with respect to separate, task-specific classification systems. People with diabetes can achieve better outcomes and quality of life by including physical activity and psychological stress assessments in treatment decision-making.


Introduction
Signals 2023, 4 168
Chronic diseases, such as diabetes, require frequent adjustments to treatment decisions to tailor and personalize the treatment to individual patients for improved outcomes. More frequent assessment of the instantaneous state and conditions of the subject can further enable and enhance precision diabetes treatment. People with Type 1 diabetes (T1D) can keep their blood glucose values in a desired range by incorporating their physical activity (PA) and acute psychological stress (APS) information in their insulin dosing decisions. The type and intensity of PA and the nature of APS experienced by an individual affect a range of endocrine and metabolic pathways. Frequently measuring the variations in biological analytes in free-living conditions and throughout daily life is not practical, which necessitates different modalities for noninvasive sensing to infer information on PA and APS required to adjust the diabetes therapy [1].
The need to assess PA and APS information from noninvasive sensors has spurred the development of novel wearable devices with advanced sensors and new algorithms to interpret the raw data. Sensors such as three-axis accelerometers (ACC) and heart rate (HR) monitors based on photoplethysmography, which measures blood volume pulse (BVP), have enabled noninvasive detection of PA. Detection of APS requires additional biosignals such as electrodermal activity (EDA), measured by a galvanic skin response (GSR) sensor, and skin temperature (ST) [2]. The data generated by these sensors need to be cleaned of the noise and artifacts corrupting the signals to enhance the information extracted from them.
After cleaning the raw data, the signals must be mapped to features that can inform the algorithms on the type and intensity of PA and nature of the APS. Various machine learning algorithms have been developed and trained in the literature to detect PA and APS, including naïve Bayes classification, nearest neighbor methods, logistic regression, decision trees, support vector machines, and neural networks (NN) [3][4][5].
Novel NN architectures and training algorithms can identify intricate and hidden patterns in the signals to gain information on PA and APS. The use of recurrent neural networks (RNN) with long short-term memory (LSTM) has shown promising results in predicting the type and intensity of PA and the type of APS episode [6,7]. An issue with the data collected to train the models is the class imbalances, which can bias the performance of the algorithms to favor the majority class at the expense of lower accuracy for the minority class. Although the sizes of the classes may be balanced by either downsampling the majority class or upsampling the minority class, it risks discarding useful information if samples are removed or biasing the algorithms towards the samples that are repeated multiple times when upsampled [8]. Addressing the class imbalances requires more sophisticated upsampling algorithms that generate new data samples or incorporating weighted learning when training the model.
Training independent models to predict the type of PA and APS without connecting the shared information between the two tasks can require more training data and longer training time. Exploiting the shared representations in the data by training one model to predict the related tasks jointly can potentially improve data efficiency and reduce the training time. However, learning multiple tasks simultaneously can be challenging [9][10][11][12]. The combination of the tasks must be considered when handling the class imbalances. The tasks must also be at least partially related, with overlapping feature maps, to reinforce the joint learning of multiple tasks. We showed in previous work that a unified common feature map can encompass the features required to predict the types of PA and APS [6].
The detection of PA and APS, whether they occur alone or simultaneously, can affect the treatment decision for T1D [13]. People with T1D must continuously monitor their blood glucose levels using continuous glucose monitors (CGM) and evaluate their insulin requirements based on their glucose levels, meal, PA, and APS information. Incorporating all these diverse sources of information to continuously adjust insulin administration is an arduous process. Artificial pancreas (AP) systems connect a CGM sensor to an insulin pump via an algorithm to calculate and administer insulin accordingly in people with Type 1 diabetes [14]. AP systems developed by our research group have extended the traditional AP structure (based exclusively on CGM and insulin information collected automatically and manual entries of meal and exercise information) by incorporating additional signals from wearable devices, such as wristbands, to provide information on PA and adjust insulin dosing accordingly [15,16]. Although PA and APS are both similar in their effects on some signals, such as increasing HR, they must be accurately classified to avoid adverse outcomes. Moderate intensity PA usually lowers blood glucose levels, which requires a decrease in the insulin dose to maintain stable blood glucose levels within the safe target range. APS increases blood glucose levels, which may necessitate an increase in the insulin dose to maintain the glucose levels in the target range. Despite their opposing effects on blood glucose levels, the presence of PA and APS can be easily misinterpreted if the classification decision relies on only a limited set of measurements, such as relying solely on HR.
Motivated by the above considerations, the main contributions of this work are:
• Multi-task learning of RNN with LSTM architecture for simultaneously classifying the type and intensity (i.e., energy expenditure) of physical activity events (sedentary state, stationary bike, or treadmill run) and type of acute psychological stress events (non-stress, emotional anxiety stress, or mental stress) using a common feature map and comparing the performance of the multi-task model with the independent models for each task.
• Multi-task learning of extreme gradient boosting (XGBoost) for simultaneously classifying the type of PA (sedentary state, stationary bike, or treadmill run) and the type of APS (non-stress, emotional anxiety stress, or mental stress) using a common feature map and comparing the performance of the multi-task XGBoost to the independent XGBoost models for each task.
Section 2 details the methods for collecting the data, preprocessing the signals, extracting feature maps, selecting the informative features useful for the multi-task learning, handling class imbalances, and the architecture of the trained recurrent NN models and the XGBoost model. Section 3 presents the results of the multi-task RNN with LSTM and XGBoost algorithms, and comparatively evaluates their performance against their respective independent models. Section 4 provides a discussion on the advantages of the approach and possible improvements in future works. Finally, Section 5 provides the concluding remarks.

Materials and Methods
Many physiological variables can be valuable for distinguishing the occurrence of PA from APS [17][18][19][20][21][22][23], such as hormonal changes of lactate and cortisol levels, eye-tracking [24], and speech wave analysis. However, these variables currently cannot be measured noninvasively and frequently in free living. In this work, we used data collected noninvasively by the Empatica E4 wristband. The Empatica E4 has a 3-axis ACC that captures motion-based activity, a photoplethysmography (PPG) sensor that measures BVP, from which HR and HR variability are derived by an internal algorithm of the E4, an infrared thermopile to read peripheral ST, and an EDA (also known as GSR) sensor to measure the electrical activity conducted through sweat glands in the skin. A Cosmed K5 wearable metabolic system is used to measure energy expenditure (EE) to determine the intensity of the PA (the ground truth) [25] for comparison with the EE estimated from E4 signals. A limited number of experiments are conducted using the Bioplux finger-tip PPG sensor device, which provides higher-accuracy PPG and electrocardiogram (ECG) signals as the ground-truth measurement [26]. The characteristics of physiological variables recorded by Empatica E4, Cosmed K5, and Bioplux are summarized in Table 1. The signals collected from the Empatica E4 wristband are preprocessed to remove noise and artifacts. Random convolutional kernel transformation (ROCKET) is utilized to extract a large number of feature maps. Features with the most predictive power are selected using partial least squares discriminant analysis (PLS-DA) and partial least squares (PLS) for classification and regression tasks, respectively. The selected features are used to train the machine learning (ML) algorithms, including a multi-task RNN with an LSTM layer for simultaneously classifying the type and intensity of PA and the type of APS.
To deal with imbalanced class sizes and avoid bias in model training, we used adaptive synthetic sampling (ADASYN) [27,28] and weighted training.

Data Collection
A total of 34 subjects participated in 166 clinical experiments approved by the Institutional Review Boards (IRB) at the universities conducting the experiments. Table 2 shows a general overview of the participants' demographics. The experiments involve being in a sedentary state (SS) or performing one of two types of PA, either treadmill running (TR) or stationary bike (SB). Subjects perform PA either without a psychological stressor (non-stress, NS) or under the influence of stressors that induce APS, either mental stress (MS) or emotional anxiety stress (EAS). The APS inducement methods are standard, reliable techniques that have been reported in previous studies [1,19,22,23,29,30,31,32,33].
The SS experiments are divided into three subcategories: NS events, EAS inducement, and MS inducement. In NS, subjects perform free-living activities such as reading books, watching neutral videos, or surfing the internet. In EAS inducement, subjects meet with their supervisors to report progress of their work, drive a car, and solve test problems in a specific time frame. In MS inducement, subjects take a mental arithmetic or mathematics exam or an IQ test, solve puzzle games, or perform the Stroop test. Similarly, APS inducement during PA (TR and SB experiments) is split into three subcategories. An NS experiment involves watching natural videos or listening to music. During EAS inducement sessions, subjects watch surgery videos or car crash videos, while in MS inducement experiments, they solve mental math problems. Figure 1 describes the data acquisition system for the data collection.
The Cosmed K5 portable indirect calorimetry system is used to measure the EE (the ground truth). To ensure the PA is consistent across all experiments, the EE was compared across NS, EAS, and MS. In addition, the State-Trait Anxiety Inventory Trait (STAI-T) and State-Trait Anxiety Inventory State (STAI-S) scores are calculated for each participant to assess the anxiety response [34][35][36]. Before and after each non-stress and emotional anxiety stress inducement experiment, the State-Trait Anxiety Inventory (STAI) self-reported questionnaire is collected. The STAI-T scale consists of 20 statements that ask people to describe how they generally feel. It describes how stressed, anxious, or uncomfortable one feels on a daily basis.
The STAI-S scale also consists of 20 statements, but the instructions require subjects to indicate how they feel at a particular moment in time. It is used to determine the actual levels of anxiety intensity induced by the stressful experiment. Table 3 lists the experiments conducted for data collection.

Signal Processing
Determining the label of each event, namely, PA or APS and its type, requires a specific duration of biosignals recorded by the wristband sensor. Signal segmentation enables the trained model to be evaluated frequently. The signal segmentation splits a long recording of biosignals into consecutive, overlapping segments. Recursively estimating the labels of different PA and APS requires information from the current time window of biosignals as well as several past segments of the signal. Hence, all biosignals are split into segments with a duration of 10 s, and each observation of the biosignal is made of 5 overlapping segments. Every two consecutive time windows have a 50% overlap, which corresponds to 5 s of shared samples between consecutive segments. The label of each segment was determined from the label of the last second of the segment. This arrangement of the data is suited to training NN models that are capable of capturing the time dependency in the data. Therefore, RNNs with LSTM architectures are a natural choice for this purpose. Figure 2 illustrates this notation for labeling each segment of the signal and demonstrates the process of stacking samples in chronological order for training an RNN model with LSTM architecture [6].
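The segmentation and stacking scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `segment_signal`, the majority vote over the last second, and the choice to label an observation by its most recent segment are assumptions. In the actual pipeline the stacked items would be per-segment feature vectors rather than raw samples.

```python
import numpy as np

def segment_signal(x, labels, fs, seg_s=10, overlap=0.5, n_stack=5):
    """Split a 1-D signal into overlapping segments and stack consecutive ones.

    x      : 1-D array of samples
    labels : 1-D array of non-negative integer labels, one per sample
    fs     : sampling rate in Hz
    Returns (obs, obs_labels) with shapes (n_obs, n_stack, seg_len) and (n_obs,).
    """
    x = np.asarray(x)
    labels = np.asarray(labels)
    seg_len = int(seg_s * fs)
    hop = int(seg_len * (1 - overlap))            # 5 s hop for a 50% overlap
    starts = np.arange(0, len(x) - seg_len + 1, hop)
    segments = np.stack([x[s:s + seg_len] for s in starts])
    # Segment label: most frequent label within the last second (assumption;
    # the text only states that the label comes from the last second).
    last = int(fs)
    seg_labels = np.array([
        np.bincount(labels[s + seg_len - last:s + seg_len]).argmax()
        for s in starts
    ])
    # Each observation: n_stack consecutive segments in chronological order,
    # labelled by its most recent segment (assumption).
    n_obs = len(segments) - n_stack + 1
    obs = np.stack([segments[i:i + n_stack] for i in range(n_obs)])
    obs_labels = seg_labels[n_stack - 1:]
    return obs, obs_labels
```

With a 60 s recording sampled at 4 Hz, this produces 11 segments of 40 samples and 7 observations of shape (5, 40).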
Due to the sensitivity of the PPG sensor to position on the wrist and movement, Empatica E4 signals are corrupted by noise and motion artifacts. A number of factors, such as sensor detachment or communication loss, may result in missing information in raw signals. Signal processing is used to remove noise and artifacts and to impute missing data.
The 3-axis ACC provides the main signal used to capture and discriminate between different types of PA. Since almost all of the human activity frequencies lie between 0 and 10 Hz [37], a low-pass filter or a band-pass filter with a lower frequency close to zero can be used to reject frequencies that are not associated with body movement. We used a 4th order Butterworth bandpass filter with cutoff frequencies 0.1-10 Hz.
The variables that are most informative for determining the type of PA or APS are the estimated HR, the HR variability, and the breathing rate. Since HR values can range from 40 to 200 BPM, components outside of this range are likely to be either high-frequency noise or motion artifacts. Therefore, we passed the BVP signal through a 4th order Butterworth bandpass filter with cutoff frequencies 0.2-3.3 Hz to remove all oscillations and noise outside of this range.
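The two band-pass filters described above could be implemented along the following lines. This is a sketch under stated assumptions: the sampling rates `FS_ACC` and `FS_BVP` are placeholder values that should be checked against Table 1, zero-phase (forward-backward) filtering is our choice, and the helper names are hypothetical.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(x, fs, low_hz, high_hz, order=4):
    """Zero-phase Butterworth band-pass filter in second-order sections.

    Note: for band-pass designs, scipy's butter(order, ...) returns a filter of
    order 2*order, and forward-backward filtering doubles the effective order
    again. Whether "4th order" in the text refers to the prototype or to the
    final filter is not stated, so `order` is an assumption.
    """
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

# Assumed sampling rates in Hz (placeholders; verify against the device data).
FS_ACC, FS_BVP = 32.0, 64.0

def clean_acc(acc_axis):
    """Keep the 0.1-10 Hz band associated with body movement."""
    return bandpass(acc_axis, FS_ACC, 0.1, 10.0)

def clean_bvp(bvp):
    """Keep the 0.2-3.3 Hz band that contains the cardiac component."""
    return bandpass(bvp, FS_BVP, 0.2, 3.3)
```

Second-order sections are used because they tend to be numerically better behaved than transfer-function coefficients when a cutoff is very low relative to the sampling rate.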
There are two types of information in the EDA signals: the tonic skin conductance level (SCL) and the phasic skin conductance response (SCR). The SCL can be considered as the baseline for evaluating EDA changes. In contrast, the SCR occurs as a result of rapid changes in short-term environmental stimuli, such as sight and noise, as well as other factors that precede participation, such as fear, anticipation, and decision-making. Upsampling the signal and estimating the baseline are the primary steps in the preprocessing of EDA, after which the SCL and SCR components are separated from the signal. Figure 3 summarizes the pipeline of signal preprocessing of each physiological variable [6].

Feature Extraction
Following the cleaning of the raw data, the signals must be mapped to features that are processed by the algorithms in determining the type and intensity of PA and the type of APS. Calculating features and different fingerprints from biosignals is crucial for two main reasons: First, different biosignals are calculated and streamed at different sampling rates and they need to be fed to the NN model with a similar frequency. Second, raw signals need to be transformed into a new feature space to better represent the target variables (i.e., the class labels). The new feature space introduces nonlinearity to the data and hence, more complex patterns between input and the class labels are used to develop and train the model. We utilized random convolutional kernel transformation (ROCKET) [38] to extract 1800 features from the time-series signals [39]. By generating random convolutional kernels of random length, weight, bias, dilation, and padding, ROCKET extracts feature vectors. In addition, deep convolutional LSTM NN models can also be used for this step [40]. Convolution layers incorporated into 1D convolutional LSTM RNN models require a large number of data samples, and GPUs are not yet optimized to run LSTM layers efficiently. ROCKET runs faster, is resistant to dilation, and is more flexible by applying convolutional kernels with different sizes, padding, etc.
Using Equation (1), we extracted a dilated-convolution-based feature map from each segment of the signals by calculating the maximum and the proportion of positive values of the filtered signal [38][39][40][41].
In Equation (1), the 1D signal is $X \in \mathbb{R}^{10 \times f_{s_I}}$ and the kernel filter is $f : \{0, \ldots, m-1\} \to \mathbb{R}$. The length $m$ of each kernel filter is selected as $2 \times f_{s_I}$, where $f_{s_I}$ is the sampling rate of biosignal $I$. The variable $d$ is the dilation factor. In addition, we transformed the BVP signal into the frequency domain using the fast Fourier transform in order to extract the modified power spectrum peaks orthogonal to the 3-axis ACC signals (Equation (2) [42,43]).
In Equation (2), $N_{\mathrm{BVP}}$ and $N_{\mathrm{ACC}_t}$, $t \in \{x, y, z\}$, represent the normalized power spectra of the BVP and 3-axis ACC signals, respectively; $I_{(n_{hf} - n_{lf})}$ represents the identity matrix; and $n_{hf} > n_{lf} > 0$ are the indexes of the spectral bins, expressed in BPM, corresponding to the highest and the lowest frequencies of heart beats. In addition, the frequency, height, width, and prominence of the highest peak of the artifact-free power spectrum $N_{\mathrm{BVP} \perp \mathrm{ACC}_t}$ are also integrated into the set of all feature maps.
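To make the dilated-convolution features concrete, the sketch below computes the two summary statistics named in the text (the maximum and the proportion of positive values) for a single random kernel. It is an illustration in the spirit of ROCKET rather than the library implementation: the function names are hypothetical, padding is omitted, and the kernel sampling details are assumptions.

```python
import numpy as np

def random_kernel(rng, length, signal_len):
    """Draw one random kernel: mean-centred weights, a bias, and a dilation."""
    weights = rng.normal(size=length)
    weights -= weights.mean()
    bias = rng.uniform(-1.0, 1.0)
    # Dilation drawn on an exponential scale so the dilated kernel still fits
    # inside the signal.
    max_exp = np.log2((signal_len - 1) / (length - 1))
    dilation = int(2 ** rng.uniform(0, max_exp))
    return weights, bias, dilation

def dilated_conv_features(x, weights, bias, dilation):
    """Return (max, proportion of positive values) of the dilated convolution."""
    x = np.asarray(x, dtype=float)
    m = len(weights)
    span = (m - 1) * dilation
    n_out = len(x) - span                      # 'valid' output, no padding
    # Row i holds the samples x[i], x[i + d], ..., x[i + (m - 1) d].
    idx = np.arange(n_out)[:, None] + dilation * np.arange(m)[None, :]
    out = x[idx] @ weights + bias
    return out.max(), np.mean(out > 0)
```

Applying many such kernels to each 10 s segment and concatenating the resulting pairs yields a feature vector of the kind described in this section; 900 kernels would give 1800 features, although the number of kernels used by the authors is not stated here.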

Sample Imputation
The raw data collected by Empatica E4 wristband may have missing samples due to factors such as sensor detachment and loss of communication. Data imputation is essential to replace the missing samples with meaningful values before training the ML models. The sample imputation is performed after extracting the feature variables by ROCKET to leverage the calculated feature maps and the relations among the features in estimating the missing samples. Imputation could be performed by simple methods such as replacing the missing values with the mean or the median values or by more advanced approaches such as splines or probabilistic principal component analysis (PPCA) [44,45]. In this work, we used PPCA with 5 principal components to estimate the missing samples.
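As a rough illustration of low-rank imputation on the feature matrix, the sketch below alternates a rank-k PCA reconstruction with re-filling of the missing entries. It is a simplified, EM-like stand-in for PPCA (it has no noise-variance term) and is not the authors' implementation; the function name `pca_impute` and its defaults other than the 5 components are assumptions.

```python
import numpy as np

def pca_impute(X, n_components=5, n_iter=200, tol=1e-6):
    """Fill NaN entries by iterating a rank-k PCA reconstruction.

    X : 2-D array (samples x features) with NaN marking missing values.
    Observed entries are returned unchanged.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from column means of the observed values.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    if not missing.any():
        return filled
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        recon = mu + (U[:, :n_components] * s[:n_components]) @ Vt[:n_components]
        change = np.max(np.abs(recon[missing] - filled[missing]))
        filled[missing] = recon[missing]       # update only the missing entries
        if change < tol:
            break
    return filled
```

Because the reconstruction uses the correlations among features, it can recover missing values more faithfully than mean or median replacement when the feature matrix is close to low rank.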

Feature Selection
A feature selection method is needed to select the most informative features that correlate with the output targets from the 1800 features extracted by ROCKET.
Uninformative feature variables are identified and excluded from the model. Reducing the number of feature variables not only enhances the predictive power of the model, but also reduces the computational complexity of the pipeline. First, we excluded 551 features with the highest collinearity index (Pearson correlation coefficient). We then applied the PLS-DA and PLS feature selection methods to the remaining 1249 features for the classification tasks and the regression task, respectively, and selected the 200 most informative features for each output target. Specifically, we selected the 200 features corresponding to the largest variable importance in projection (VIP) scores of the PLS-DA (the largest 200 absolute coefficients of the PLS-DA).
PLS is a cross-decomposition technique. It derives the latent variables (LV) by maximizing the covariance between the features and the output variable; as a result, PLS will ensure that the first LV has the highest degree of correlation with the response variable(s). PLS-DA is an extension of PLS to deal with datasets with categorical target variables (i.e., class labels). PLS-DA is used to determine class separation and to identify the variables containing class-defining information [46].
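For reference, a commonly used definition of the VIP score of feature $j$ in a PLS model with $A$ latent variables is given below. The notation is ours and is an assumption, not taken from the paper: $p$ is the number of features, $\mathbf{w}_a$ is the weight vector of the $a$-th LV, $w_{ja}$ is its $j$-th element, and $SS_a$ is the sum of squares of the response explained by the $a$-th LV.

```latex
\mathrm{VIP}_j = \sqrt{\frac{p \sum_{a=1}^{A} SS_a \left( w_{ja} / \lVert \mathbf{w}_a \rVert \right)^2}{\sum_{a=1}^{A} SS_a}}
```

Under this definition the squared VIP scores average to one across features, so features with a score well above one contribute more than average to explaining the response.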
A total of 244 features are selected by combining features from APS and the pair-wise mutual features from PA and EE to train the multi-task LSTM RNN model that makes simultaneous classification of APS types (NS, EAS) and PA types (SS, SB, TR) and EE estimation. A total of 296 features are selected by combining features from APS and PA for use in the multi-task LSTM RNN model and multi-task XGBoost for the simultaneous classification of APS types (NS, EAS, MS) and PA types (SS, SB, TR).

Multi-Task RNN Models with LSTM
We used three different model architectures, including a multi-task LSTM RNN model that simultaneously classifies APS types (NS, EAS) and PA types (SS, SB, TR) and estimates EE (Figure 4a) [6]. The nodes in RNN networks are connected in a cycle, so that the output from one node affects the input to another, causing RNNs to demonstrate dynamic behavior over time [47,48]. Ordinary RNNs suffer from the vanishing and exploding gradient problems. LSTM is a class of RNN that is capable of learning long-term dependencies. Unlike an ordinary RNN unit, the LSTM unit is able to handle the vanishing and exploding gradient problems [49,50]. The RNN models with LSTM used in this study have several layers (Figure 4): an input layer, an LSTM layer with 40 units, a dropout layer with a 20% rate, a fully connected layer with 40 units, a second dropout layer with a 20% rate, and output layers. For classification tasks, the output layer uses softmax as the activation function to predict the probability distribution of the target classes [50,51]. The model parameters are summarized in Table 4.
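The layer stack described above can be illustrated with a forward pass written in plain NumPy. This is only a structural sketch with random weights, not the trained TensorFlow model: dropout is omitted because it is inactive at inference time, and the ReLU activation of the shared dense layer, the gate ordering, and all function names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_params(n_features, rng, n_units=40, n_pa=3, n_aps=2):
    """Random weights for LSTM(40) -> Dense(40) -> three task-specific heads."""
    def w(*shape):
        return rng.normal(scale=0.1, size=shape)
    return {
        "Wx": w(n_features, 4 * n_units),      # input weights for gates i, f, g, o
        "Wh": w(n_units, 4 * n_units),         # recurrent weights
        "b": np.zeros(4 * n_units),
        "Wd": w(n_units, n_units), "bd": np.zeros(n_units),
        "Wpa": w(n_units, n_pa), "bpa": np.zeros(n_pa),
        "Waps": w(n_units, n_aps), "baps": np.zeros(n_aps),
        "Wee": w(n_units, 1), "bee": np.zeros(1),
    }

def forward(x, p):
    """x: (batch, time_steps, n_features). Returns (PA probs, APS probs, EE)."""
    batch, steps, _ = x.shape
    n_units = p["Wh"].shape[0]
    h = np.zeros((batch, n_units))
    c = np.zeros((batch, n_units))
    for t in range(steps):
        z = x[:, t] @ p["Wx"] + h @ p["Wh"] + p["b"]
        i, f, g, o = np.split(z, 4, axis=1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # cell state update
        h = sigmoid(o) * np.tanh(c)                    # hidden state update
    shared = np.maximum(0.0, h @ p["Wd"] + p["bd"])    # shared dense layer
    pa = softmax(shared @ p["Wpa"] + p["bpa"])         # PA type head
    aps = softmax(shared @ p["Waps"] + p["baps"])      # APS type head
    ee = (shared @ p["Wee"] + p["bee"])[:, 0]          # linear EE regression head
    return pa, aps, ee
```

The three heads share the LSTM and dense layers, which is what allows one set of learned representations to serve both classification tasks and the regression task.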

Class Imbalances
Training NN models without accounting for the relative weight of each class distribution will result in poor performance for samples from minority classes, since during training, the model weights are updated relatively more according to the majority class. To address the issue of the imbalanced classes, two different approaches are employed: weighted training and ADASYN. Weighted training (cost-sensitive optimization) involves modifying the loss function used to update the model parameters so that samples are weighted inversely proportional to the number of samples in each class [52,53]. ADASYN generates synthetic samples based on density distributions, where more synthetic samples are generated for minority-class samples that are harder to learn than for those that are easier to learn. Table 5 shows the size of training splits before and after applying ADASYN for balancing the training split of the data for classification tasks. When balancing the training splits using ADASYN, we considered all 9 combinations of PA and APS types.
As an alternative, we developed a multi-task XGBoost classification of APS and PA and compared its performance against the independent XGBoost models and RNNs. XGBoost is a scalable and efficient tree boosting supervised ML algorithm [54]. XGBoost is a variant of gradient boosted decision trees (GBM). Boosting is an ensemble learning method that works by constructing a strong classifier from various weak classifiers. Ensembles are constructed from decision tree (DT) models as the weak learners, where DTs are added sequentially to the ensemble and fit to reduce the prediction errors made by the preceding models. Models are fit by gradient boosting using a gradient descent optimization algorithm. XGBoost is designed to enhance the accuracy and to reduce the computational time relative to alternative boosting ML algorithms.
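The inverse-frequency weighting used in weighted training can be sketched as follows. The normalization (number of samples divided by the number of classes times the class count) is a common convention and an assumption here; the function name is hypothetical.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Return (class_weight dict, per-sample weights) inversely proportional
    to class frequency: weight_c = n_samples / (n_classes * count_c)."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    class_weight = dict(zip(classes.tolist(), weights.tolist()))
    # classes is sorted, so searchsorted maps each label to its class index.
    sample_weight = weights[np.searchsorted(classes, labels)]
    return class_weight, sample_weight
```

With this normalization every class contributes the same total weight to the loss, and the sample weights sum to the number of samples. For the multi-task setting, the labels passed in could be the combined PA and APS label so that all 9 combinations are weighted.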

Results
We used a stratified shuffle split approach for each dataset with the proportion of 75:15:10 corresponding to training, validation, and testing, respectively. Then, we used the two alternative approaches, ADASYN and weighted training (cost-sensitive optimization), to address imbalanced classes in the training set, as discussed in the previous section. To evaluate the performance of the ML models in predicting the class labels for PA and APS classification, we used the precision, recall, and F1 score (Equations (3)-(5)):
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (3)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (4)$$
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (5)$$
where TP is true positive, FN is false negative, and FP is false positive. Table 6 summarizes the F1 score for PA and APS classification using LSTM models. Table A1 summarizes precision, recall, and F1 score for PA classification using LSTM models. Table A2 summarizes precision, recall, and F1 score for APS classification using LSTM models. The root mean squared error (RMSE) is used to assess the performance of EE regression, Equation (6):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \quad (6)$$
where n is the number of testing samples, and $y_i$ and $\hat{y}_i$ denote the measured and estimated EE of sample $i$, respectively. All numerical studies are performed using the TensorFlow 2.0 environment. In addition, several other Python libraries were used for data preprocessing [39,55,56].
Additionally, we compared the performance of the multi-task XGBoost classification of APS types (NS, EAS, MS) and PA types (SS, SB, TR) to the independent XGBoost classification of APS types (NS, EAS, MS) and the independent XGBoost classification of PA types (SS, SB, TR). We used ADASYN to address imbalanced classes in the training set. Table 7 summarizes the F1 score for PA and APS classification using XGBoost models. Table A3 summarizes precision, recall, and F1 score for PA classification using XGBoost models. Table A4 summarizes precision, recall, and F1 score for APS classification using XGBoost models.
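The evaluation metrics used in this section can be computed from predicted and true labels as in the sketch below. It uses the standard per-class definitions; how the per-class scores are averaged into the single F1 values reported in the tables is not stated in the text, so no averaging is applied here. The function names are hypothetical.

```python
import numpy as np

def per_class_prf(y_true, y_pred, n_classes):
    """Per-class precision, recall, and F1 from integer class labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)         # rows: true class, columns: predicted
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp                   # predicted as the class but wrong
    fn = cm.sum(axis=1) - tp                   # belonging to the class but missed
    precision = tp / np.maximum(tp + fp, 1)    # guard against empty classes
    recall = tp / np.maximum(tp + fn, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return precision, recall, f1

def rmse(y_true, y_pred):
    """Root mean squared error between measured and estimated values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```

Reporting the per-class values alongside any averaged score is useful with imbalanced classes, because a high average can hide poor performance on a minority class.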

Multi-Task Classification of PA Types, APS Types and EE Estimation
Figure 5a shows the confusion matrix of PA types classification using the multi-task LSTM RNN model designed to simultaneously perform classification of APS types, classification of PA types, and estimation of EE. Figure 5b depicts the confusion matrix of the corresponding APS classes estimated from the multi-task LSTM RNN model. The results for mental stress are excluded because not enough EE data were collected during mental stress sessions. PLS-DA is used for feature selection for the classification task. A total of 244 features were selected by combining features from APS and the pair-wise mutual features from PA and EE. Weighted training is used to handle the imbalanced classes. The model achieved an RMSE of 0.728 cal/(hr·kg) for EE estimation. The architecture of the multi-task LSTM RNN for classification of APS types and PA types and estimation of EE is shown in Figure 4a.

Figure 6a shows the confusion matrix of PA types classification (SS, SB, TR) obtained from the multi-task RNN model for classification of APS types (NS, EAS, MS) and PA types (SS, SB, TR) trained with weighted training. Figure 6b shows the confusion matrix of APS types classification (NS, EAS, MS) obtained from the same multi-task model with weighted training. PLS-DA is used for feature selection for the classification tasks. A total of 296 features were selected by combining features from APS and PA. Weighted training is used to handle the issue of imbalanced classes.

Figure 6. (a) Confusion matrix of PA types classification using the multi-task LSTM RNN classification of APS types and PA types (class imbalance mitigated by weighted training); (b) Confusion matrix of APS types classification using the multi-task LSTM RNN classification of APS types and PA types (weighted training).

Multi-Task Classification of APS Types and PA Types with ADASYN
Figure 7a shows the confusion matrix of PA types classification using the dual-task RNN classifier with ADASYN. Figure 7b shows the confusion matrix of APS types classification using the dual-task classification of APS types and PA types, with the ADASYN technique addressing the problem of imbalanced classes. PLS-DA is also used for feature selection for the classification tasks.

Figure 8a shows the confusion matrix of PA types classification using the independent LSTM RNN. Figure 8b shows the confusion matrix of APS types classification using the independent LSTM RNN model. PLS-DA is used as a feature selection method to select the 200 most informative features for the classification tasks. Weighted training is used to handle the class imbalance. The architecture of the independent LSTM RNN is shown in Figure 4c.

Figure 9a,b display the confusion matrices of the PA and APS classification tasks. Both confusion matrices were calculated from predictions made by two independent RNN classifiers that discriminate different types of PA and APS. Synthetic samples of the minority classes were generated (ADASYN) for unbiased model training. The architecture of the independent LSTM RNN is shown in Figure 4c.

Independent LSTM RNN for EE Estimation
In the independent EE regression task, PLS is used to select the most informative features. The regression model achieved an RMSE of 0.666 cal/(hr·kg). Figure 4d shows the architecture of the independent LSTM RNN model for regression. Figure 10a compares the EE estimated by the independent LSTM RNN model with the EE measured by the indirect calorimeter (Cosmed K5) on independent testing data for an individual subject running on the treadmill and experiencing EAS. Figure 10b compares the EE estimated by the multi-task LSTM RNN model with the EE measured by the indirect calorimeter for the same subject.

Multi-Task XGBoost Classification of PA Types and APS Types (ADASYN)
To provide a comparison for the multi-task RNN classifiers, a multi-task XGBoost model was trained on the same training splits, and confusion matrices were calculated for each classification task. Figure 11a shows the confusion matrix of PA types classification using the multi-task XGBoost classification of APS types and PA types. Figure 11b shows the confusion matrix of APS types classification using the multi-task XGBoost classification of APS types and PA types. PLS-DA is used for feature selection for the classification tasks and ADASYN is used to handle the class imbalance.

Independent XGBoost Classification of PA Types and APS Types with ADASYN
Independent classification of PA and APS was also studied for comparison with the independent RNN classifiers. Figure 12a shows the confusion matrix of PA types classification using the independent XGBoost classification of PA types with ADASYN. Figure 12b shows the confusion matrix of APS types classification using the independent XGBoost classification of APS types with ADASYN. PLS-DA is used for feature selection for the classification tasks and ADASYN is used to handle the class imbalance.

Discussion
In this work, we used a multi-task learning approach to train both an RNN with LSTM architecture and XGBoost for simultaneously classifying the type and intensity of PA and the type of APS using a common feature map. We used data collected during activities of daily living and exercise sessions, relying only on the physiological signals measured noninvasively by the Empatica E4 wristband. The biosignals used to discriminate between the different APS and PA types include 3-axis accelerometer signals, BVP, ST, and GSR (HR is reported by the E4 based on BVP). The data obtained from the wristband are processed to impute missing values and to reduce the noise and artifacts that compromise data quality. We employed random convolutional kernel transformation to extract a large number of features from the time series signals. We used two different feature selection techniques to select the most informative features: PLS-DA for the classification tasks and PLS for the regression tasks. To address the issue of imbalanced classes, two different approaches are employed: weighted training and ADASYN.
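To make the feature extraction step more concrete, the following is a simplified, univariate sketch of a random convolutional kernel transform. The sampling choices (kernel lengths 7, 9, or 11; mean-centered Gaussian weights; uniform bias; exponentially sampled dilation; optional padding; maximum and proportion-of-positive-values features) follow the commonly described recipe for this family of methods and are assumptions here, not the configuration used in this work. All function names are hypothetical.

```python
import math
import random


def random_kernel(rng, series_length):
    """Sample one random kernel as (weights, bias, dilation, padding).

    Assumes series_length >= 11 so the longest kernel fits in the series.
    """
    length = rng.choice((7, 9, 11))
    weights = [rng.gauss(0.0, 1.0) for _ in range(length)]
    mean = sum(weights) / length
    weights = [w - mean for w in weights]  # mean-center the weights
    bias = rng.uniform(-1.0, 1.0)
    # Dilation is sampled on an exponential scale so that the dilated kernel
    # never spans more than the series itself.
    max_exponent = math.log2((series_length - 1) / (length - 1))
    dilation = int(2 ** rng.uniform(0.0, max_exponent))
    padding = ((length - 1) * dilation) // 2 if rng.random() < 0.5 else 0
    return weights, bias, dilation, padding


def apply_kernel(series, kernel):
    """Convolve one kernel with the series; return (max value, proportion of positive values)."""
    weights, bias, dilation, padding = kernel
    n = len(series)
    span = (len(weights) - 1) * dilation
    max_value = float("-inf")
    positives = 0
    outputs = 0
    for start in range(-padding, n + padding - span):
        total = bias
        for j, w in enumerate(weights):
            idx = start + j * dilation
            if 0 <= idx < n:  # positions outside the series act as zero padding
                total += w * series[idx]
        max_value = max(max_value, total)
        positives += total > 0
        outputs += 1
    return max_value, positives / outputs


def transform(series, kernels):
    """Two features per kernel, concatenated into one feature vector."""
    features = []
    for kernel in kernels:
        features.extend(apply_kernel(series, kernel))
    return features


# Illustrative signal segment (a synthetic sine wave, not wristband data).
segment = [math.sin(0.3 * t) for t in range(64)]
rng = random.Random(0)
kernels = [random_kernel(rng, len(segment)) for _ in range(100)]
features = transform(segment, kernels)
print(len(features))  # two features per kernel
```

Because many kernels are applied to each signal, the transform yields a large feature vector, which is why a feature selection step such as PLS-DA or PLS follows it.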
The advantage of the multi-task RNN model is that only a single model is developed and maintained rather than many independent classification and regression models. Moreover, in cases where there is similarity between the tasks, multi-task learning can provide consistency in the predictions. Additionally, mutual features were used for the multi-task classification and regression tasks, which enhances the predictive power of the model and reduces the computational complexity, making it a good candidate for real-time implementation on platforms with low computational power.
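One way to picture how a single model serves several tasks is through a joint training objective. The sketch below combines two classification losses (PA and APS) and one regression loss (EE) into a single scalar for one sample. How the task losses are combined in this work is not stated, so the weighted sum and the `task_weights` parameter are assumptions made purely for illustration.

```python
import math


def cross_entropy(class_probabilities, true_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(class_probabilities[true_index])


def multitask_loss(pa_probs, pa_true, aps_probs, aps_true, ee_pred, ee_true,
                   task_weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the PA loss, the APS loss, and the squared EE error (assumed form)."""
    w_pa, w_aps, w_ee = task_weights
    return (w_pa * cross_entropy(pa_probs, pa_true)
            + w_aps * cross_entropy(aps_probs, aps_true)
            + w_ee * (ee_pred - ee_true) ** 2)


# Illustrative outputs of three task heads for a single sample.
loss = multitask_loss(
    pa_probs=[0.5, 0.25, 0.25], pa_true=0,   # PA head over (SS, SB, TR)
    aps_probs=[0.5, 0.5], aps_true=1,        # APS head over (NS, EAS)
    ee_pred=3.0, ee_true=2.0,                # EE head
)
print(round(loss, 4))
```

Because all three terms are minimized together, the layers shared by the task heads are pushed toward a representation that is useful for every task at once.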
The multi-task LSTM RNN model designed to simultaneously perform classification of APS types (NS, EAS), classification of PA types (SS, SB, TR), and estimation of EE achieves comparable performance to the independent RNNs, with the multi-task RNN having F1 scores of 98.00% for PA and 98.97% for APS, and an RMSE of 0.728 cal/(hr·kg) for EE estimation using independent testing data. In contrast, the independent RNNs have F1 scores of 99.64% for PA and 98.83% for APS, and an RMSE of 0.666 cal/(hr·kg) for EE estimation. Multi-task XGBoost achieved F1 scores of 99.89% and 98.31% for the classification of PA types and APS types, respectively, while the independent XGBoost achieved F1 scores of 99.68% and 96.77%, respectively. The results illustrate that multi-task NN and multi-task XGBoost can effectively assess the signals from wearable sensors and effectively enhance the detection of PA and APS. This can be explained by the potential for improved data efficiency in exploiting the shared representations in the data by training one model to predict the related tasks PA and APS jointly. Training independent models to predict the type of PA and APS without connecting the shared information between the two tasks may require more training data and longer training time to achieve a high level of accuracy.
It is crucial to consider the relative risk that misclassification of the different types of PA and APS poses to patients with diabetes. For instance, if an APS event (MS or EAS) is misclassified as an NS event, the AP system will not take the proper action to regulate the blood glucose concentration, and consequently, hyperglycemia may occur. Conversely, misclassification of NS as MS or EAS is harmful since the AP will incorrectly inject additional insulin in an attempt to mitigate APS, leading to hypoglycemia. Similarly, misclassification of SS as SB or TR will lead to hyperglycemia due to the reduction of insulin injection by the AP, while misclassification of SB or TR as SS is dangerous since the AP will not reduce insulin infusion during PA, leading to hypoglycemia or potentially severe hypoglycemia.
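The misclassification risks described above can be summarized as a simple lookup. The sketch below encodes only the cases stated in the text; confusions that the text does not discuss (for example, EAS predicted as MS, or SB predicted as TR) are returned as "unspecified" rather than guessed. The function name and return strings are hypothetical.

```python
STRESS_TYPES = {"EAS", "MS"}
EXERCISE_TYPES = {"SB", "TR"}


def glycemic_risk(task, true_label, predicted_label):
    """Return the glycemic risk associated with one (true, predicted) pair.

    task is "APS" or "PA". Only the cases described in the text are encoded.
    """
    if true_label == predicted_label:
        return "none"
    if task == "APS":
        if true_label in STRESS_TYPES and predicted_label == "NS":
            return "hyperglycemia"   # stress missed: no corrective action is taken
        if true_label == "NS" and predicted_label in STRESS_TYPES:
            return "hypoglycemia"    # false stress: additional insulin is injected
    elif task == "PA":
        if true_label == "SS" and predicted_label in EXERCISE_TYPES:
            return "hyperglycemia"   # false exercise: insulin is reduced unnecessarily
        if true_label in EXERCISE_TYPES and predicted_label == "SS":
            return "hypoglycemia"    # exercise missed: insulin is not reduced
    return "unspecified"


print(glycemic_risk("APS", "EAS", "NS"))
print(glycemic_risk("PA", "TR", "SS"))
```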
A few of the EAS inducement samples were misclassified as NS, as shown in the confusion matrix of APS classification in Figure 5b (i.e., EAS recall = 98.43%, as indicated in Table A2). The main reason that some APS samples are predicted as non-stressful episodes is likely over-smoothing of the biosignals, especially BVP, since the variation of IBI is the main biosignal conveying information on psychological stress. Additionally, the experiments were conducted under the review and monitoring of the IRB to ensure the safety and welfare of the subjects. As a result, the APS experiments are limited to mild APS inducement; consequently, some of the physiological variables during EAS resemble NS events.
Classification of different APS types is a challenging task. One reason is that different classes, in particular milder MS and EAS, can be mistaken for each other. Another factor in our data is the difference in class sizes: the number of EAS samples dominates the other class labels (NS and MS). Usually, handling imbalanced labels in the training split improves the performance of the model on unseen data. Yet, the confusion between NS and EAS indicates a low signal-to-noise ratio in some collected samples. The noise in the data causes the estimated probability of each class to lie close to the threshold value. Hence, the trained model classifies such samples with low confidence. Similarly, during SB sessions, the variation of the 3D accelerometer signals can be similar to the SS condition, and therefore other biosignals such as BVP become important for distinguishing between SS and SB.
SB and TR samples are readily distinguishable from each other. Figures 5a-9a and 11a show no misclassification between SB and TR, because the SB experiments do not contain the high-magnitude measurements from the three-axis ACC that distinguish the TR experiments.
Overall F1 scores for classification of PA types (SS, SB, TR) are higher than classification of APS types (NS, EAS, MS) for all models considered. The 3-axis accelerometer signal is the main signal contributing only to discrimination of PA types while the 3-axis accelerometer signal is not correlated with APS types. Sympathetic activation stimulates the sweat glands. Hence, EDA is an indicator of sweating rate, and is strongly correlated with PA intensity as well as APS level. The magnitude of the physiological variables such as HR in response to PA is more pronounced compared to APS.
The multi-task LSTM RNN model for classification of APS types (NS, EAS, MS) and PA types (SS, SB, TR) with weighted training had F1 scores of 99.8% for PA and 99.3% for APS. In comparison, the multi-task LSTM RNN model for classification of APS types (NS, EAS, MS) and PA types (SS, SB, TR) with ADASYN had F1 scores of 99.69% for PA and 98.83% for APS. A comparison between Figures 6 and 7 reveals that adding synthetic samples based on the density and similarity between samples does not address the issue of imbalanced samples as effectively as weighted training. Adding synthetic samples using ADASYN assumes that observations with high similarity have similar labels. Although this assumption is valid in many applications, different types of APS often show similar behavior, and these three classes are not simply separable. Therefore, synthetic samples that receive unreliable labels are considered the main reason for the increased number of misclassified samples in comparison to the weighted training technique for handling the imbalanced classes.
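The similarity assumption behind ADASYN can be seen in a minimal two-class sketch of the commonly described procedure: each minority sample receives a share of the synthetic samples proportional to the fraction of majority samples among its k nearest neighbors, and each synthetic sample is an interpolation between a minority sample and one of its minority neighbors, inheriting the minority label. This is an illustrative simplification, not the implementation used in this work (which balanced nine PA and APS combinations); the function name, parameters, and data points are hypothetical.

```python
import math
import random


def adasyn(minority, majority, k=3, beta=1.0, seed=0):
    """Generate synthetic minority samples (simplified two-class ADASYN sketch).

    Assumes at least two minority samples. beta = 1.0 targets full balance.
    """
    rng = random.Random(seed)
    everything = minority + majority  # minority samples occupy the first indices
    n_min = len(minority)

    # Step 1: per-sample "difficulty" = share of majority samples among the k nearest neighbors.
    ratios = []
    for i, x in enumerate(minority):
        others = [j for j in range(len(everything)) if j != i]
        neighbors = sorted(others, key=lambda j: math.dist(x, everything[j]))[:k]
        ratios.append(sum(j >= n_min for j in neighbors) / k)
    total = sum(ratios)
    if total == 0:
        return []  # no minority sample has majority neighbors

    # Step 2: distribute the synthetic samples according to the normalized difficulty.
    n_to_generate = beta * (len(majority) - n_min)
    synthetic = []
    for i, x in enumerate(minority):
        count = round(ratios[i] / total * n_to_generate)
        candidates = [j for j in range(n_min) if j != i]
        min_neighbors = sorted(candidates, key=lambda j: math.dist(x, minority[j]))[:k]
        for _ in range(count):
            z = minority[rng.choice(min_neighbors)]
            lam = rng.random()
            # Step 3: interpolate between the sample and a minority neighbor.
            synthetic.append([a + lam * (b - a) for a, b in zip(x, z)])
    return synthetic


# Illustrative 2-D points: one minority sample sits inside the majority cluster.
minority = [[0.0, 0.0], [0.2, 0.1], [2.0, 2.0]]
majority = [[2.1, 2.0], [1.9, 2.2], [2.2, 1.8], [3.0, 3.0], [3.1, 2.9], [2.8, 3.2]]
new_samples = adasyn(minority, majority, k=2)
print(len(new_samples))
```

In this toy example, all synthetic samples originate from the one minority point that lies among majority points. When classes overlap, as described above for the APS types, such interpolated points may fall in regions where the assigned label is unreliable.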
Similarly, the performance of models trained with ADASYN-balanced samples and with weighted classes was compared for the independent model architectures. As expected, the ADASYN technique is not the optimal solution for addressing the problem of imbalanced samples in the data. The high similarity between the different types of APS and the NS class biases the model trained on the synthetic samples.
A comparison between Figures 6 and 8 illustrates the effect of the difference in architecture between the two models. In the multi-task architecture, mutual features contributing to both classes were used for the classification task. It should be noted that the number of model layers, trainable units, and other hyperparameters remain invariant. Hence, overfitting of the "data-hungry" LSTM layers degrades the classification performance. However, this issue can be mitigated by introducing regularization in the trainable layers. The independent models also require repeating the feature engineering for each model, which can be troublesome for real-time model deployment.
The estimated EE values for an individual subject running on the treadmill under the influence of EAS are illustrated in Figure 10. The EE estimation algorithms for both the independent LSTM RNN model and the multi-task LSTM RNN model are able to track the EE measured by the Cosmed K5 Calorimeter with high accuracy.
A significant drop in the performance of the XGBoost models can be observed as compared with both the independent and multi-task NN architectures. Training XGBoost trees is a challenging task and requires constant monitoring of the models to avoid overfitting as well as biased model training. For both XGBoost models, the off-diagonal predictions in the confusion matrices (Figures 11b and 12b) increased drastically. Apart from biased predictions that stemmed from synthetic samples, a number of samples were misclassified as NS events. A major difference between the two model families is the time series structure of the RNN models, whereas the XGBoost models were trained only on a single slice of the data and no past time windows were used. Since different episodes of APS and PA take place in piece-wise patterns, the recurrent model is a better choice for modeling these types of measurements, as RNN models capture the dynamic behavior in estimating the probability of the classes. In contrast, the trained XGBoost model predicts probabilities based only on the current snapshot of the biosignals and, as a consequence, non-smooth predictions and more oscillations between the predicted classes are anticipated.
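The difference between the two input structures can be sketched as follows: a recurrent model receives, for each prediction, a window containing the current feature vector and several preceding ones, whereas a snapshot model receives only the current feature vector. The window length and the helper name are hypothetical; the actual window configuration used in this work is not restated here.

```python
def sliding_windows(samples, window_length):
    """Return overlapping windows, each ending at one sample and including
    the window_length - 1 samples that precede it."""
    return [
        samples[end - window_length + 1 : end + 1]
        for end in range(window_length - 1, len(samples))
    ]


# Illustrative sequence of per-time-step feature values.
feature_sequence = [1, 2, 3, 4, 5]

# Input to a recurrent model: each prediction sees past context.
recurrent_inputs = sliding_windows(feature_sequence, window_length=3)
print(recurrent_inputs)

# Input to a snapshot model: each prediction sees one time step only.
snapshot_inputs = sliding_windows(feature_sequence, window_length=1)
print(snapshot_inputs)
```

Consecutive recurrent inputs share most of their content, which tends to make consecutive predictions change gradually; snapshot inputs share nothing, so predictions can switch from one time step to the next.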
A limitation of the current work is that the EE data collected during MS experiments are not sufficient to include MS type in the case of multi-task LSTM RNN classification of APS types and PA types and estimation of EE. The presented approach can cover the common PA in daily life. Future work will extend the presented approach to include other classification scenarios to obtain an accurate classification during all kinds of daily activities.
Incorporating information on the type and intensity of PA in diabetes therapies improves time in range (TIR) and prevents hypoglycemia in people with T1D by modulating the insulin requirements to counteract the effects of PA on the blood glucose dynamics [15,57]. Additionally, incorporating information on the APS can improve treatment outcomes. Researchers have documented that athletic competition stress increases blood glucose levels and reduces insulin sensitivity in individuals with type 1 diabetes preceding and during an athletic competition in comparison to the same physical activity performed in training at the same intensity [58]. In addition to incorporation of PA information, future work will incorporate APS information in AP systems to adjust the insulin dosage in people with diabetes to account for the glycemic disturbance effects of both PA and APS.

Conclusions
The advantage of the multi-task learning approach is that a single model is developed and maintained instead of many independent classification and regression models. Exploiting the shared representations in the data by training one model to predict the related tasks of PA and APS jointly can improve data efficiency. We used data collected during exercise sessions and daily life activities, relying only on the physiological signals measured noninvasively by the Empatica E4 wristband. Random convolutional kernel transformation is employed to extract a large number of features from the time series signals. Two different feature selection techniques are used to select the most informative features: PLS-DA for the classification tasks and PLS for the regression task. To address the issue of imbalanced classes, two different approaches are employed: weighted training and ADASYN. The multi-task RNN model with LSTM is developed to simultaneously classify the type of PA, estimate its intensity, and classify the type of APS. Multi-task LSTM RNN classification of APS types (NS, EAS, MS) and PA types (SS, SB, TR) with weighted training achieved the highest F1 score for both APS and PA types. A multi-task XGBoost model is developed to simultaneously classify the type of PA and the type of APS, and the multi-task XGBoost achieved higher F1 scores than the independent XGBoost. The results illustrate that multi-task NN and multi-task XGBoost can effectively assess the signals from wearable sensors and effectively enhance the detection of PA and APS.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Table A1. Precision, recall, and F1 score for PA classification using LSTM models.