Human Activity Recognition Algorithm with Physiological and Inertial Signals Fusion: Photoplethysmography, Electrodermal Activity, and Accelerometry

Inertial signals are the most widely used signals in human activity recognition (HAR) applications, and extensive research has been performed on developing HAR classifiers using accelerometer and gyroscope data. This study aimed to investigate the potential enhancement of HAR models through the fusion of biological signals with inertial signals. The classification of eight common low-, medium-, and high-intensity activities was assessed using machine learning (ML) algorithms, trained on accelerometer (ACC), blood volume pulse (BVP), and electrodermal activity (EDA) data obtained from a wrist-worn sensor. Two types of ML algorithms were employed: a random forest (RF) trained on features; and a pre-trained deep learning (DL) network (ResNet-18) trained on spectrogram images. Evaluation was conducted on both individual activities and more generalized activity groups, based on similar intensity. Results indicated that RF classifiers outperformed corresponding DL classifiers at both individual and grouped levels. However, the fusion of EDA and BVP signals with ACC data improved DL classifier performance compared to a baseline DL model with ACC-only data. The best performance was achieved by a classifier trained on a combination of ACC, EDA, and BVP images, yielding F1-scores of 69% and 87% for individual and grouped activity classifications, respectively. For DL models trained with additional biological signals, almost all individual activity classifications showed improvement (p-value < 0.05). In grouped activity classifications, DL model performance was enhanced for low- and medium-intensity activities. Exploring the classification of two specific activities, ascending/descending stairs and cycling, revealed significantly improved results using a DL model trained on combined ACC, BVP, and EDA spectrogram images (p-value < 0.05).


Introduction
With recent technological advancements in developing multi-sensor wearable devices, research related to physiological signals has grown rapidly. Classifying human physical activities has recently received a lot of attention due to its connection to physical and mental health. Physical activity (PA) is not only essential for preventing obesity; it might also confer neuroprotection in Alzheimer's (AD), Parkinson's (PD), and Huntington's (HD) diseases via the upregulation of synaptic signaling pathways [1]. It also helps in rehabilitation processes such as cardiorespiratory fitness (CRF) [2]. Wearable PA monitoring systems have also been developed for monitoring the activity levels of a specific population, such as athletes or elderly individuals. E-textile sensors combining accelerometer and goniometer data assist in estimating joint angles; smart insoles are utilized for gait analysis; and smart garments are used in measuring respiratory rate during PA [3][4][5]. In general, knowledge of the frequency and intensity of an individual's activity plays a significant role in improving his or her lifestyle, which is achievable using a reliable classifier.
Sensors 2024, 24, 3005
Several factors must be considered in improving HAR classification performance, including the placement of the wearable device [6][7][8][9]. Using data acquired from the wrist is challenging due to the often non-activity-related natural hand/wrist movements. However, the wrist is the recommended collection site for increased wear time [7,8].
Among physiological signals, accelerometer (ACC) data are the main signal in proposed algorithms. In a recent survey of 163 selected HAR studies, it was shown that 149 deployed an accelerometer, either alone or in conjunction with another sensing modality (gyroscope, magnetometer, body temperature sensor, electrocardiograph, electromyography, etc.) [10]. The prevalent use of accelerometers in HAR may be explained by their low cost and small size as standalone sensors; additionally, their low energy consumption and high feature performance, as demonstrated in modern smartphones, further contribute to their widespread adoption [11,12].
Recently, the concept of multi-modal HAR has been explored, and it was shown that multi-modal-based methods can improve classification results by leveraging the complementary characteristics between the single modes [13]. For example, accelerometers were used in conjunction with other sensors, including gyroscopes [14], magnetometers [15], heart rate (HR) and respiration rate sensors [16], electrocardiographs [17], surface electromyography (sEMG) [18], and skin conductance measurement (EDA) [19].
Jia and Liu [20] proposed a classification method to harness data from a 7-lead ECG worn on the chest and an accelerometer magnitude vector recorded from the waist. Time- and frequency-domain features were calculated for classifying lying, sitting, standing, walking, walking upstairs, walking downstairs, and running activities. Average classification accuracy from accelerometer data alone was 93.83%, and it increased to 99% when fused with ECG data.
Photoplethysmography (PPG) is a less-obtrusive alternative to ECG for obtaining a continuous cardiac signal. It was also shown that ECG signals can be reconstructed from PPG signals [21]. Furthermore, advancements in optical sensors have made PPG an alternative for measuring pulse rate variability features, as surrogates for heart rate variability (HRV) indexes [22]. These techniques offer greater convenience and less intrusion compared to traditional methods.
The derived signal from the PPG sensor is called the blood volume pulse (BVP), and it is the measurement of changes in blood volume in arteries and capillaries. Interestingly, changes in the BVP signal can reflect changes in blood pressure [23], which may correspond to levels of exertion characterizing some types of activities [24]. These attributes make it a promising modality for a wide array of applications, including but not limited to human activity monitoring [17,[25][26][27].
However, classifier performance improvement related to non-ACC signal data may be activity-dependent. The use of accelerometer data alone can be insufficient for recognition of activities that require little movement but considerable energy consumption [28]. An example is an activity such as cycling, during which a wrist-worn accelerometer measures mostly the slight movements of the wrist. Interestingly, a study conducted by Weippert et al. demonstrated that heart rate variability measures could be used to distinguish static exercise (supine leg press) from dynamic exercise (cycling) when the heart rate for each was similar [29]. Another study showed that combining selected HRV features from a single-lead ECG resulted in improved performance in separating sitting from standing and walking from ascending-walking classes over that obtained from an accelerometer alone [28]. ECG-derived signals and PPG data, although not proven to be informative enough on their own as an effectively discriminating input, have provided the capability to improve the separation of activities (when combined with a triaxial accelerometer) involving similar movement yet requiring different levels of exertion, such as cycling and cycling with added resistance [30].
A recent study employed an ensemble of pre-trained deep learning models (ResNet50V2, MobileNetV2, and Xception), fused at the feature level, to classify four activities (running, walking, and high/low-resistance cycling), for the purpose of comparing the efficacy of PPG input data to ECG [25]. The resulting accuracies of the ensemble model tested on the ECG and PPG data were around 94% and 89%, respectively, demonstrating a potential significant improvement in classification accuracy when PPG is used in deep learning models (CNNs) over machine learning models.
Recording a high-quality PPG/BVP signal is complex, since it is influenced by changes in ambient light, electrical noise from the device, and sensor movement [26]. Methods to alleviate the motion artifact (MA) contained in PPG signals due to physical activity have been proposed, at the cost of added complexity in the signal processing phase of activity classification [31,32]. However, the PPG signal's susceptibility to corruption by MA-induced noise has been exploited to predict activities [27].
Another additional signal that has the potential to improve HAR performance is electrodermal activity (EDA). EDA represents changes in the electrical conductivity of the skin. The most salient characteristic of an EDA signal is the skin conductance responses (SCRs) resulting from an underlying sympathetic reaction to a stimulus [33]. The SCRs are the rapid-onset and smooth, exponential-like, transient events noticeable in the EDA signal. EDA data have been used in the prediction of seizures [34], human emotion recognition [35], menstrual cycle analysis [36], and in affective computing [37]. Somewhat recently, EDA has been used in the classification of activity intensity as perceived by the subject (relative PA intensity) [19,38].
Poli et al. [19] trained SVM and bagged tree models on features taken from triaxial accelerometry, HR, inter-beat interval (IBI), skin temperature, and EDA. Using the bagged tree classification model trained on EDA data alone resulted in an average F1-score of 73.8%, which increased to 93.9% when all modalities were included, for classification of sedentary, moderate, and vigorous activities.
Despite the popularity of classifying physical activities using time-series data, some studies have focused on using image representations of activities. Such a classification system typically employs a CNN for class prediction. Short-time Fourier transform (STFT), reduced interference distribution with Hanning kernel (RIDHK), and smoothed pseudo Wigner-Ville distribution (SPWVD) images have been used in the classification of radar echo signals produced by various physical activities, with average accuracies (over six physical activities) of 96.61%, 94.72%, and 91.06%, respectively [39]. Accuracy increased to 98.11% when all three images were vertically concatenated and used as input to a VGG16 network.
In another study, accelerometer and gyroscope data from the Sussex-Huawei Locomotion (SHL) dataset [40] were used to generate FFT spectrograms as inputs to a NN classifier [41]. The network was trained separately on spectrograms corresponding to each axis of both sensors, resulting in F1-scores of 90.5%, 91.1%, 90.1%, and 92.8% for accelerometer spectrograms (x, y, z, and magnitude, respectively) and 84.7%, 87.8%, and 83.7% for gyroscope spectrograms (x, y, and z axes, respectively). It should be noted that the SHL dataset contains certain activity classes (riding in a car, bus, train, and on the subway) which are not human physical activities per se.
This study aims to explore the significance of non-inertial signals in enhancing the recognition of human activities within both feature-based and deep learning classifiers. The subsequent sections delve into the processes of collecting physiological data, preprocessing signals, extracting features, generating spectrogram images, and employing classification techniques. These methods are then evaluated against inertial-based classifiers to assess their performance. Furthermore, the impact of blood volume pulse (BVP) and electrodermal activity (EDA) signals on enhancing the recognition of cycling and stairs-climbing activities is investigated and reported.

Materials and Methods
The interest of this study is in determining whether a classification system can be improved by the fusion of multiple physiological inputs with triaxial ACC data. The performances of a random forest (RF) classifier and a pre-trained deep learning (DL) model (ResNet-18) were analyzed using combinations of physiological signals. RF models were trained on features, and DL models were trained on short-time Fourier transform (STFT) images, engineered from the following modality combinations: 0: ACC; 1: ACC and BVP; 2: ACC and EDA; and 3: ACC, BVP, and EDA. These datasets will be referred to as 0, 1, 2, and 3, respectively, and the corresponding classification models will be referred to as C0, C1, C2, and C3 for both RF and DL classifications. The workflow of the HAR algorithm development, including its general steps (data collection, data processing, and dataset generation) and the two approaches taken in this paper to develop the RF and DL algorithms, is shown in Figure 1. Each step is described in detail in the following sections.



Physiological Data Collection
In this study, the Empatica E4 (Empatica Inc., Boston, MA, USA) wristband was used to collect physiological data from subjects while performing a set of activities. E4 sensors include (1) a MEMS-type triaxial (x, y, and z) accelerometer that measures continuous gravitational force on the scale of ±2 g with a sampling frequency of 32 Hz; (2) a PPG sensor, from which BVP and HR data are derived (sampling frequencies of 64 Hz and 1 Hz, respectively); (3) an EDA sensor recording data at a sampling frequency of 4 Hz; and (4) a skin temperature sensor recording data at a sampling frequency of 4 Hz [42].
An Institutional Review Board protocol (#1796115) was approved for data collection at the University of North Florida and the University of Central Florida. Twenty-three subjects (20.7 ± 1.6 years old), including ten females and thirteen males, were recruited to wear the wristband during the experiment. Before starting the recording session, subjects were provided with all information regarding this study, as well as the consent form to review and sign. The subjects were helped to set up the wristband and to ensure proper placement. The device was worn on the non-dominant wrist. A fifteen-minute warm-up period was allowed for the device before starting the session. Subjects were asked to perform a series of activities, each of five-minute duration or, if less, for a duration that the subject was comfortable performing a given activity. Subjects were instructed on how to perform each activity; however, to simulate free-living conditions, there were no strict requirements defining each activity. A short break between activities was given for rest, hydration, and to allow the subject's heart rate to return to as close to his/her baseline as possible. The baseline heart rate was obtained during the sitting activity. The activities were performed in the following order: sitting, standing, lying, walking, brisk walking, jogging, running (on the indoor track), cycling (stationary bicycle at a rate of around 80 revolutions per minute without resistance), and using stairs (ascending and descending), with the non-dominant hand free to swing. The total data recorded (756. All activities were performed in the University of North Florida Student Wellness Complex or the University of Central Florida Recreation and Wellness Center. All activity start times were labeled by tapping the E4's button (<1 s), and the type of activity was recorded by a proctor. The timestamps were later extracted from a downloaded .csv file that contains a record of the times of the events marked during a session.
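The exported session files can be parsed with a few lines of Python. The sketch below assumes the single-channel CSV layout described in Empatica's export documentation (first row: session start as a UNIX timestamp; second row: sampling rate in Hz; remaining rows: samples); the function name and demo values are illustrative, and the layout should be verified against your own export (ACC.csv has three sample columns, and tags.csv is a bare list of event timestamps).

```python
import io
import numpy as np

def load_e4_channel(csv_text):
    """Parse a single-channel Empatica E4 CSV export (e.g., EDA.csv, BVP.csv).

    Assumed layout, per Empatica's export documentation (verify against your
    own files): row 1 = session start time (UNIX, UTC), row 2 = sampling
    rate in Hz, remaining rows = samples.
    Returns (start_time, fs, samples, per-sample timestamps).
    """
    rows = np.loadtxt(io.StringIO(csv_text))
    start, fs = rows[0], rows[1]
    samples = rows[2:]
    t = start + np.arange(samples.size) / fs
    return start, fs, samples, t

# Tiny synthetic example in the documented layout (not real data):
demo = "1700000000.0\n4.0\n0.11\n0.12\n0.13\n0.14\n"
start, fs, x, t = load_e4_channel(demo)
```

The per-sample timestamps make it straightforward to cut each channel at the event marks recorded in tags.csv.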

Data Processing
Upon completion of the activity sessions, the subject's data were inspected via the E4 Connect session viewer. Event markers denoting the start and stop time of each activity performed were verified against the times marked by hand in the session. The data corresponding to the first and last five seconds of an activity were discarded, along with occasional segments of the recording during which it was noted that the subject either momentarily stopped performing the activity or the subject's wrist movement (sensor-worn wrist) became excessive (for example, during standing or lying).
For each session, a timestamp series was generated, determining the start and end time of each activity. For data segmentation, window sizes ranging from 5 to 20 s, in steps of 5 s, were initially generated and tested. Ultimately, we selected a 10-second window size to ensure sufficient frequency resolution for electrodermal activity (EDA) features and to improve overall performance. Additionally, a 50% overlap was used to achieve both an informative window size and an acceptable dataset size [43].
All pre-processing steps (filtering and segmentation) and feature computations were completed using Python v. 3.8.11. A second-order Butterworth band-pass filter (0.3-10 Hz) was used to remove the DC component and frequencies above 10 Hz in the ACC data (x, y, and z components), noting that 98% of the spectral amplitudes corresponding to human activities lie below 10 Hz [44]. The PPG and EDA signals were filtered using a fourth-order Chebyshev-II band-pass filter (0.4-5 Hz [45]) and a fourth-order Butterworth high-pass filter (0.05-2 Hz), respectively. Datasets 0, 1, 2, and 3 were generated from the processed and segmented data for all subjects. Time- and frequency-domain features computed for RF classification are listed by modality in Table S8. STFT images were generated from 10-second data segments of BVP, EDA, and ACC magnitude, with a 50% overlap.
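The filtering and windowing steps above can be sketched with SciPy as follows. Two details are assumptions not stated in the text: the Chebyshev-II stopband attenuation (40 dB here), and the EDA filter's upper edge, which is set just below 2 Hz because 2 Hz is exactly the Nyquist frequency at the 4 Hz sampling rate.

```python
import numpy as np
from scipy.signal import butter, cheby2, sosfiltfilt

FS_ACC, FS_BVP, FS_EDA = 32, 64, 4  # E4 sampling rates (Hz)

# Second-order Butterworth band-pass, 0.3-10 Hz, applied to each ACC axis.
sos_acc = butter(2, [0.3, 10], btype="bandpass", fs=FS_ACC, output="sos")
# Fourth-order Chebyshev-II band-pass, 0.4-5 Hz, for BVP
# (40 dB stopband attenuation is assumed, not given in the paper).
sos_bvp = cheby2(4, 40, [0.4, 5], btype="bandpass", fs=FS_BVP, output="sos")
# Fourth-order Butterworth for EDA; the stated 2 Hz edge equals the
# Nyquist frequency at fs = 4 Hz, so it is nudged just below it here.
sos_eda = butter(4, [0.05, 1.9], btype="bandpass", fs=FS_EDA, output="sos")

def segment(x, fs, win_s=10, overlap=0.5):
    """Split a 1-D signal into win_s-second windows with the given overlap."""
    win = int(win_s * fs)
    hop = int(win * (1 - overlap))
    n = 1 + max(0, len(x) - win) // hop
    return np.stack([x[i * hop : i * hop + win] for i in range(n)])

# Example: one minute of synthetic ACC-x data -> 11 half-overlapping windows.
acc_x = np.random.randn(60 * FS_ACC)
windows = segment(sosfiltfilt(sos_acc, acc_x), FS_ACC)
```

`sosfiltfilt` applies the filter forward and backward, which avoids phase distortion at the cost of not being causal; an online system would use `sosfilt` instead.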

HAR Classifier Development
A HAR system which included two independent algorithms was assembled. First, a random forest algorithm, which is among the popular classification algorithms for HAR models [17,46], was trained on features extracted from 10-second data segments. Then, as a DL classifier, a convolutional neural network obtained via transfer learning was trained on spectrogram images of 10-second pre-processed ACC, BVP, and EDA data segments.

Random Forest Algorithm
The RF classifier from the sklearn.ensemble module, with 500 estimators (trees), was used in this study. A grid search was performed over the range of 100 to 1000, with a step size of 100, to optimize this hyperparameter. The remaining default parameters were used, including a min_samples_split of 2 and max_features of sqrt. Four RF algorithms (C0, C1, C2, and C3) were trained to classify eight individual activities (lying, standing, walking, brisk walking, jogging, running, stairs, and cycling) using features from datasets 0 through 3, respectively. Next, activities were grouped according to the body movement involved in the activity and activity-intensity overlap, which was likely induced by the relaxed experimental environment. Therefore, standing and lying, walking and brisk walking, and jogging and running formed the low-, medium-, and high-intensity groups. Cycling and stairs remained separate activity classes. RF algorithms C0 through C3 were also trained on the grouped-activity datasets.
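A minimal scikit-learn sketch of this setup, with synthetic data standing in for the real feature tables; the grid actually searched is shown as a dictionary but not executed here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Grid searched in the paper (n_estimators 100..1000, step 100); 500 was
# selected, with the remaining scikit-learn defaults kept
# (min_samples_split=2, max_features="sqrt").  The full GridSearchCV run
# is omitted here for brevity.
param_grid = {"n_estimators": range(100, 1001, 100)}

rf = RandomForestClassifier(
    n_estimators=500,
    min_samples_split=2,
    max_features="sqrt",
    random_state=0,
)

# Stand-in data: 200 feature vectors (40 features) over 8 activity classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = rng.integers(0, 8, size=200)
rf.fit(X, y)
pred = rf.predict(X)
```

In the real pipeline, `X` would be the time- and frequency-domain features of Table S8 computed per 10-second segment, and `y` the activity (or activity-group) labels.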
Features, both low-level and handcrafted, extracted from the ACC, BVP, and EDA data are reported in Table S8. The feature set is represented by statistical estimators, such as those used in previous HAR studies [17,19,47], and some experimentally and intuitively derived features [48]; for example, a range of frequency bands associated with respective levels of cardiac output.
Training and testing were conducted in leave-one-subject-out mode. Training data were further divided into 90%/10% randomized train/validation datasets. Data were then normalized (z-score, sklearn.preprocessing.StandardScaler), and a single test was conducted for each test subject.
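The leave-one-subject-out protocol, with per-fold z-score normalization fitted on the training subjects only, can be expressed with scikit-learn's LeaveOneGroupOut. The data, subject IDs, and reduced tree count (50 instead of 500, for speed) below are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(230, 12))            # 10 segments x 23 subjects (synthetic)
y = rng.integers(0, 8, size=230)          # 8 activity labels
subjects = np.repeat(np.arange(23), 10)   # one group ID per subject

logo = LeaveOneGroupOut()
scores = []
for train_idx, test_idx in logo.split(X, y, groups=subjects):
    # Fit the scaler on training subjects only, to avoid leaking the
    # held-out subject's statistics into normalization.
    scaler = StandardScaler().fit(X[train_idx])
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(scaler.transform(X[train_idx]), y[train_idx])
    scores.append(clf.score(scaler.transform(X[test_idx]), y[test_idx]))
```

One fold is produced per subject, so `scores` holds 23 per-subject accuracies that can be averaged or compared across models.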

Convolutional Neural Network (CNN)
Pre-trained DL image classification neural networks eliminate the laborious task of designing a DL network and provide a re-trainable model which has learned rich feature representations from more than a million images. MATLAB's (R2021b) Deep Learning Toolbox Model for ResNet-18 [49] was used in this study to perform the DL experiments (individual activity classification and group-based classification), analogous to the machine learning experiments described in Section 2.3.1.

STFT images were generated from datasets 0 through 3. The magnitude vector of ACC segments (i.e., (1/3)·sqrt(ACCx² + ACCy² + ACCz²)) was used to construct STFT images for dataset 0. Datasets 1 through 3 were created through vertical concatenation of the STFT images. The original ACC and EDA image size was 220 × 220 pixels; however, the BVP spectrogram image was cropped to a 220-width-by-110-height size due to the excessive near-zero-valued amplitudes present in the STFT higher frequencies. After vertical concatenation, the combined image was resized to 224 × 224, as required by the ResNet-18 input shape. New classification and fully connected layers were created, and they replaced the two last ResNet-18 network layers to reduce the number of outputs from one thousand to eight classes. No other changes were made to the network. Time series signals are shown in Figure 2, along with their spectrograms, for eight activities.

The initial learning rate and batch size were set to 0.00125 and 75, after implementing a hyperparameter grid search over the ranges of 0.0005 to 0.002 with a step size of 0.00025, and 25 to 125 with a step size of 25, respectively. Training data were divided into 90%/10% randomized train/validation datasets. All data were normalized, applying zero-center normalization, which normalizes each pixel value between [-1, 1]. Classifiers were trained for 50 epochs using the Adam optimizer.
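A rough NumPy/SciPy analogue of this image construction (the study used MATLAB, so the STFT window length, the stacking order of the modalities, and the nearest-neighbour resize below are assumptions for illustration only):

```python
import numpy as np
from scipy.signal import stft

def spectrogram_image(x, fs, out_hw):
    """Log-magnitude STFT of a 10 s segment, nearest-neighbour resized
    to out_hw = (height, width) pixels.  The ~1 s window is assumed."""
    _, _, Z = stft(x, fs=fs, nperseg=max(8, int(fs)))
    img = np.log1p(np.abs(Z))
    h, w = img.shape
    rows = np.arange(out_hw[0]) * h // out_hw[0]
    cols = np.arange(out_hw[1]) * w // out_hw[1]
    return img[np.ix_(rows, cols)]

def dataset3_image(acc_xyz, bvp, eda):
    """Build one dataset-3 input: vertically concatenated ACC-magnitude,
    EDA, and BVP spectrograms (220x220, 220x220, 110x220; stacking order
    assumed), resized to ResNet-18's 224x224 and zero-center normalized."""
    acc_mag = np.sqrt((acc_xyz ** 2).sum(axis=0)) / 3  # (1/3)sqrt(x^2+y^2+z^2)
    stack = np.vstack([
        spectrogram_image(acc_mag, 32, (220, 220)),
        spectrogram_image(eda, 4, (220, 220)),
        spectrogram_image(bvp, 64, (110, 220)),  # BVP cropped to half height
    ])
    rows = np.arange(224) * stack.shape[0] // 224
    cols = np.arange(224) * stack.shape[1] // 224
    img = stack[np.ix_(rows, cols)]
    # Zero-center normalization of pixel values to [-1, 1].
    return 2 * (img - img.min()) / (img.max() - img.min() + 1e-12) - 1

# One synthetic 10 s segment: ACC at 32 Hz, BVP at 64 Hz, EDA at 4 Hz.
img = dataset3_image(np.random.randn(3, 320), np.random.randn(640), np.random.randn(40))
```

The resulting 224 × 224 array corresponds to one network input; datasets 1 and 2 would stack only two of the three spectrograms before resizing.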

Results
Data from twenty-three subjects performing eight activities (lying, standing, walking, brisk walking, jogging, running, ascending/descending stairs, and cycling) were included in the analysis of HAR classifier performances. First, individual activity classification was conducted. Then, for a less-granular classification task, six of the eight activities were grouped by their similar intensity (standing/lying (low), brisk walking/walking (medium), and jogging/running (high)). RF and CNN algorithms were trained to classify activities at both individual and group levels. Classifiers were trained on ACC (C0); ACC and BVP (C1); ACC and EDA (C2); and ACC, BVP, and EDA (C3) datasets. The CNN was trained on STFT spectrogram images, while the RF was trained on featurized data. To assess and compare classifier performances, the area under the ROC curve and the F1-score were calculated. Table 1 presents the performance of each classifier by input, both for individual and grouped activities. Due to variation in CNN model prediction from trial to trial, the average of three CNN trials is reported. The results show that the RF classifier outperforms the CNN; however, the EDA and BVP signals improve the performance of the CNN classifier. AUC-ROC and F1-scores are reported in Tables 2-5 for individual and grouped activity classification. Additional signals do not appear to improve RF classification results in individual or grouped activities (Tables 2 and 4); however, for the CNN model, nearly all individual activity classifications were improved with additional signals (Table 3). In grouped activity classifications, CNN performance was improved for low- and medium-intensity activities, as well as for cycling and stairs (Table 5). (Table footnotes: 1 indicates significant improvement with added BVP; 2 indicates significant improvement with added EDA; 3 indicates significant improvement with added BVP and EDA.)

To determine if the improvement in performance was significant, a comparison of the three signal-augmented models to a baseline ACC-only CNN model was conducted using a mid-p-value McNemar test at the 5% significance level. The McNemar test showed statistically significant improvement in the classification results (p < 0.05) of C3 for all classes (as shown in Tables 3 and S2), except running, when compared to C0. Statistically significant improvement (p < 0.005) was also observed in the grouped low- and medium-intensity, stairs, and cycling activities for C1, and in cycling for C2, when compared to C0 (Tables 5 and S3). Adding both EDA and BVP inputs (C3) significantly improved the classification performance of all groups (p < 0.005), except high-intensity activities.
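The mid-p McNemar test compares two classifiers through their discordant predictions on the same test segments; a sketch with SciPy, using illustrative counts rather than the paper's:

```python
from scipy.stats import binom

def mcnemar_midp(b, c):
    """Mid-p-value McNemar test for paired classifier predictions.

    b = segments only the baseline got right, c = segments only the
    augmented model got right.  With X ~ Binomial(b + c, 0.5) and
    k = min(b, c), the two-sided mid-p value is 2*P(X <= k) - P(X = k).
    """
    n, k = b + c, min(b, c)
    return min(1.0, 2 * binom.cdf(k, n, 0.5) - binom.pmf(k, n, 0.5))

# Illustrative counts (not the paper's): baseline-only correct = 2,
# augmented-only correct = 10 -> significant at the 5% level.
p = mcnemar_midp(2, 10)
```

Only the discordant pairs enter the test; segments that both models classify identically carry no information about which model is better.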
Among common activities [10], stationary cycling and ascending/descending stairs appear to be two of the more difficult classes to discriminate from other modes of ambulation (or, in the case of cycling, from sedentary activities such as sitting) when using accelerometer data acquired from the wrist [7,17,50,51]. At the granular individual activity classification level, a Wilcoxon rank sum test was conducted to compare the classification accuracy of both cycling and stairs, comparing the signal-augmented models to a baseline ACC-only CNN model. In C3, misclassification of stairs as standing and walking activities (p < 0.05) and of cycling as lying and standing activities (p < 0.005) was significantly reduced (Tables S4 and S5). Misclassification of cycling as standing was also decreased using C2 (p < 0.05). Misclassification of stairs and cycling with respect to the rest of the activities was negligible.
For grouped activity classification, a similar rank sum test was conducted. Using C3, misclassification of stairs as the low-intensity activity group (p < 0.05) and of cycling as low- and medium-intensity activities (p < 0.005 and p < 0.05, respectively) was reduced. The combined ACC and BVP inputs improved the classification error rates of both cycling and stairs, which were misclassified as low-intensity activities by C0 (Tables S6 and S7). Misclassification of stairs and cycling with respect to other activities not mentioned above was negligible.

Discussion and Conclusions
The application of wearable sensors in HAR system development continues to enjoy multi-disciplinary interest due to its importance in the monitoring of human well-being. Although most of the available algorithms rely on accelerometer sensor data, the purpose of this paper is to investigate the importance of BVP and EDA signals in HAR algorithms. Therefore, a multi-modal sensing wristband was employed for data collection. It should be noted that other signals, such as EMG and ECG, have the potential to improve PA classification results; however, the purpose of this study was limited to the data derived from signals obtained from a single Empatica E4 wristband.
Data from the PPG, EDA, and triaxial ACC sensors were preprocessed and used to train machine learning and deep learning classifiers, with input modalities including only ACC data (C0); ACC and BVP (C1); ACC and EDA (C2); and ACC, EDA, and BVP (C3), to examine the importance of these non-ACC signals.
No significant improvement was observed in the RF classifier by adding additional signals. However, significant improvements were observed in overall F1-scores for CNN models trained on additional bio-signals and were confirmed by applying a McNemar test (p < 0.05). In individual activity classification, the average CNN F1-score improved from 64.22% with C0 to 69.01% with C1, 67.7% with C2, and 69.89% with C3. The corresponding F1-scores for group classification also improved, from 80.65% with C0 to 86.34% with C1, 83.77% with C2, and 87.51% with C3. Regarding individual activity classification, a mid-p-value McNemar test showed that the performance of C3 was improved over C0 for all activities except running (p < 0.005); the performance of C1 was improved over C0 for the standing, lying, cycling, and stairs classes (p < 0.05); and the performance of C2 was improved over C0 for the cycling, jogging, lying, and stairs classes (p < 0.05). For the grouped, five-activity classification task, the McNemar test showed that the performance of both C1 and C3 was improved over C0 for all grouped activity classes except high-intensity (p < 0.005), and the performance of C2 was improved over C0 for the cycling class (p < 0.05). This may suggest that there is a link between improved classification of cycling activity and EDA data, facilitated by minimal wrist movement and, consequently, lower motion artifact. It is shown in Table 1 (also Table S1) that training the CNN model on BVP and/or EDA inputs consistently improves the average F1-scores, Matthews correlation coefficient (MCC), and Cohen's kappa score.
Although no significant improvement in classification was observed for a feature-based classifier such as the RF trained on bio-signal features, a statistically significant improvement in classification by a multi-modal CNN model was observed. This improvement might be due to the ability of DL models to learn robust features directly from raw data for specific applications, whereas ML models such as RF require expertly extracted or engineered features for satisfactory performance [52]. Specifically for EDA and PPG, which are susceptible to noise, statistical features are less suitable. Other ML classification challenges may be (1) the issue of "overlapping" statistical characteristics of time- and frequency-domain signals corresponding to certain activities (e.g., running and jogging) in free-living conditions (a reduction of inter-class variability); and (2) that during time spent in an activity, even one as moderate as walking, cardiac output increases initially over a period of time and then tends to plateau [53]. This implies that the beginning and end of a single activity may vary significantly in bio-signal amplitudes.
Motion artifacts also affect the performance of multimodal classifiers. In the BVP signal, the main artifacts, composed of wrist acceleration in the x, y, and z directions and changes in the distance between the skin and the PPG sensor, are, if considerably regular or periodic, likely correlated with motion intensity, which may characterize the subject's activity. A spectral representation of a PPG signal has a dominant quasi-periodic component resulting from rhythmic wrist movement/foot-ground contact, and another component corresponding to the heartbeat. If the two components have close periods, then, due to leakage, the spectral peak associated with the heartbeat may not be distinguishable from the peak associated with the wrist-swing rhythm [26]. This condition may not be helpful to ML models but could be exploited by DL models.
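The leakage effect described above can be reproduced numerically: two tones with close periods in a short analysis window merge into a single spectral peak, while well-separated tones stay resolvable. The frequencies, window length, and peak-detection threshold below are assumed for illustration only.

```python
import numpy as np

fs = 64.0                      # Hz (the E4 BVP sampling rate)
t = np.arange(0, 4, 1 / fs)    # short 4 s window -> 0.25 Hz resolution

# A heartbeat-like tone next to a wrist-swing tone with a close period
# (frequencies chosen for illustration), versus a well-separated pair.
x_close = np.sin(2 * np.pi * 1.50 * t) + np.sin(2 * np.pi * 1.65 * t)
x_far = np.sin(2 * np.pi * 1.50 * t) + np.sin(2 * np.pi * 3.00 * t)

def n_spectral_peaks(x):
    """Count prominent local maxima in the windowed magnitude spectrum."""
    mag = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    peaks = (mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:]) \
            & (mag[1:-1] > 0.3 * mag.max())
    return int(peaks.sum())

print(n_spectral_peaks(x_close), n_spectral_peaks(x_far))
```

With a 4 s window the 0.15 Hz spacing of the close pair is below the spectral resolution, so the two components appear as one broad peak.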
Our finding of improved CNN classification performance despite training on MA-corrupted BVP inputs may be supported by conclusions from research that used a combined convolutional and recurrent layer network to classify standing, walking, jogging, jumping, and sitting classes [32]. In that study, each subject's PPG signal was decomposed into a cardiac component (0.5-4 Hz), a respiratory component (0.2-0.35 Hz), and an MA noise component (≥0.1 Hz) and converted to the frequency domain using the FFT. When comparing models trained on data composed of the cardiac, respiratory, and MA components (individually and in combination), the results suggested that the MA signal is a better predictor than the cardiac and respiration signals. Further, the predictability of activities from PPG results not only from the elevation of heart and respiration rates but also from the MA generated when performing the activities [32]. Results from the C1 classifier in our study might suggest that the added BVP data could make classification by DL models robust against MA, given that the model was trained on noisy BVP data yet model accuracy improved significantly.
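The band decomposition described in [32] can be approximated with zero-phase band-pass filters. A minimal sketch on a simulated PPG-like signal follows; the band edges come from the text above, but the signal, component frequencies, and filter order are assumptions for illustration.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 64.0  # Hz (the E4 BVP sampling rate)

def bandpass(x, lo, hi, fs, order=3):
    # Second-order sections keep the narrow low-frequency band numerically
    # stable; sosfiltfilt gives zero-phase filtering.
    sos = butter(order, [lo, hi], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

# Synthetic PPG-like signal: a cardiac tone (1.2 Hz ~ 72 bpm), a
# respiratory tone (0.25 Hz ~ 15 breaths/min), and wideband noise
# standing in for motion artifact.
rng = np.random.default_rng(0)
t = np.arange(0, 30, 1 / fs)
ppg = (np.sin(2 * np.pi * 1.2 * t)
       + 0.5 * np.sin(2 * np.pi * 0.25 * t)
       + 0.3 * rng.standard_normal(t.size))

cardiac = bandpass(ppg, 0.5, 4.0, fs)      # cardiac band from [32]
respiratory = bandpass(ppg, 0.2, 0.35, fs) # respiratory band from [32]
```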
Among common activities in ML HAR datasets [10], stationary cycling and ascending/descending stairs appear to be more difficult to discriminate from other activities when using accelerometer data acquired from the wrist [7,17,50,51]. A somewhat heavily MA-affected activity, ascending/descending stairs is mostly misclassified as walking by ACC-based classifiers. However, the CNN model trained on BVP, EDA, and ACC data (C3) demonstrated a significant improvement in separating stair use from walking (p < 0.05). On the other hand, cycling, which may be less affected by MA than ascending/descending stairs, is commonly misclassified as a sedentary activity such as sitting or standing. Results from our study show that the cycling class becomes more separable from lying and standing (p < 0.005) for CNN model C3. These results may point to a potential improvement in the classification of these challenging classes with the addition of bio-signal inputs when a suitable DL model is used.

Figure 1. Flowchart for HAR algorithm development; ACC, BVP, and EDA data were collected from a wristband device. RF algorithms were trained on features extracted from the signals, and DL algorithms were trained on spectrogram images.


2.3.2. Convolutional Neural Network (CNN)
Pre-trained DL image classification networks eliminate the laborious task of designing a DL network and provide a re-trainable model that has learned rich feature representations from more than a million images. MATLAB's (R2021b) Deep Learning Toolbox Model for ResNet-18 [49] was used in this study to perform the DL experiments (individual activity classification and group-based classification), analogous to the machine learning experiments described in Section 2.3.1.


Figure 2. Paired time and time-frequency plots (STFT) of each activity corresponding to their ACC, BVP, and EDA segments (top to bottom) from a 20-year-old female subject. Activities represented are lying, standing, walking, brisk walking, jogging, running, stairs, and cycling (left to right). The colors in the spectrograms represent the amplitude of a specific frequency at a given time, ranging from dark blue for low amplitudes to dark red for high amplitudes.
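Spectrogram images like those in Figure 2 come from a short-time Fourier transform of each signal segment. A minimal sketch on a synthetic segment follows; the window and overlap values are illustrative, not the paper's actual spectrogram settings.

```python
import numpy as np
from scipy.signal import stft

fs = 64.0                          # Hz (E4 BVP rate; illustrative)
t = np.arange(0, 10, 1 / fs)
sig = np.sin(2 * np.pi * 1.2 * t)  # stand-in for one signal segment

# Short-time Fourier transform; each column of |Z| is one time frame,
# and its magnitudes map to the colors seen in Figure 2.
f, tt, Z = stft(sig, fs=fs, nperseg=128, noverlap=96)
spec = np.abs(Z)
print(spec.shape)                  # (freq bins, time frames)
```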

Table 1. Random forest and CNN classifier performance by input configuration for both individual and grouped classes. Results are the mean ± standard deviation of the F1-score and area under the receiver operating characteristic (ROC) curve for 23 subjects (leave-one-subject-out mode).

Table 2. Random forest classification performance for eight individual activities.

Table 3. CNN classification performance for eight individual activities.

Table 4. RF classification performance for grouped, stairs, and cycling activities.

Table 5. CNN classification performance for grouped, stairs, and cycling activities.