Physical Activity Recognition Based on Deep Learning Using Photoplethysmography and Wearable Inertial Sensors

Abstract: Human activity recognition (HAR) makes extensive use of wearable inertial sensors because they are the most informative data source among non-visual time-series datasets. HAR research has advanced significantly in recent years owing to the proliferation of sensor-equipped wearable devices. To improve recognition performance, HAR researchers have also investigated other biosignal sources, such as photoplethysmography (PPG). PPG sensors measure the rate at which blood flows through the body, which is regulated by the continuous pumping action of the heart. Although PPG signals were not originally intended for detecting body movement and gestures, we propose an innovative method that extracts relevant features from the PPG signal and uses deep learning (DL) to predict physical activities. To this end, we developed a deep residual network, referred to as PPG-NeXt, that is built on convolutional operations, shortcut connections, and aggregated multi-branch transformations to efficiently identify different types of daily life activities from the raw PPG signal. In experiments on three benchmark datasets, the proposed model achieved an F1-score of more than 90% using only PPG data. Moreover, our results indicate that combining PPG and acceleration signals can enhance activity recognition. Both biosignals, electrocardiography (ECG) and PPG, can differentiate between stationary activities (such as sitting) and non-stationary activities (such as cycling and walking) with a sufficient level of success. Overall, our results suggest that combining features from the ECG signal can be helpful in situations where pure tri-axial acceleration (3D-ACC) models have trouble differentiating between activities with similar motion (e.g., walking, stair climbing) but significant differences in their heart rate signatures.


Introduction
The use of sensors built into intelligent wearable devices (e.g., mobile smartphones and smart home appliances) to track and identify human activities has found applications in a variety of fields, such as security and surveillance [1], smart homes [2], healthcare [3], and human-computer interaction [4]. Human Activity Recognition (HAR) collects raw signals characterizing the surrounding environment through various sensors on individuals, either independent or embedded. Based on the features retrieved from raw sensor traces, HAR traditionally uses machine learning (ML) models to recognize the underlying activity. The most commonly used sensors include wearable inertial sensors (which typically sense tri-axial acceleration (3D-ACC) and tri-axial angular velocity) [5], GPS [6], cameras, or
• Using biosignals from wearable sensors, a deep residual network known as PPG-NeXt was developed for HAR. The PPG-NeXt multi-kernel block applies multiple filter sizes to the input at the same level. As a result, it can capture information at various scales within the current data segment. In addition to the multi-size filters, the CNN block uses the 1 × 1 convolution operation to aggregate information across channels.
• The PPG-NeXt model was validated on three publicly available benchmark datasets. It achieved overall accuracies of 99.16%, 99.23%, and 99.17% on PPG-DaLiA, PPG-ACC, and Wrist PPG During Exercise, respectively. According to the results, the proposed model outperformed other innovative HAR approaches using PPG data from wearable sensors.
• Benchmark deep learning (DL) models from the literature, namely the Deep Convolutional Neural Network (CNN), Stacked Long Short-Term Memory (LSTM), CNN-LSTM, CNN-Gated Recurrent Unit (GRU), and the inception-based iSPLInception, were trained on the same standard public datasets to validate the proposed approach. The proposed PPG-NeXt model performed better than all the benchmark models when evaluated using conventional criteria (accuracy, recall, precision, and F1-score).
The remainder of this paper is organized as follows: Section 2 examines related literature on biosignal-based HAR, DL approaches for HAR, and existing problems. Section 3 presents the hybrid deep residual learning framework proposed in this paper for physical activity recognition. Section 4 describes the experimental setup and reports the experimental results. Section 5 discusses those results. Section 6 contains the summary and directions for future research.

Related Works
This section explores the HAR applications that use biosignals as biometric data to recognize human activities. In addition, we summarize related works on learning-based HAR approaches, including the current state-of-the-art.

Biosignals in HAR
Several examples show how biosensing has expanded its scope and made significant headway in medicine. Smartwatches and other wearable technologies are used to measure pulse rates, monitor irregularities using algorithms, and predict heart attacks. An echocardiography scan can detect an impending heart attack in a person up to two weeks before the attack occurs [13]. The ideas of pattern recognition and rule evaluation are used in an AI-based sleep detection algorithm that analyzes biosignals to determine a person's sleep stages [14]. Digital settings of various devices enable the monitoring of cardiovascular disease and other conditions related to the cardiovascular system. Biosignal values captured by wearable sensors, such as ECG and PPG biosignals, have been analyzed using DL techniques [15].
In HAR, biosignals are the only information source explaining how the human body functions in healthy and diseased states. Diverse biosignals have different origins and properties, and multiple biosignals may contain information about the same organ or system. Researchers and practitioners have used different biosignals to measure, process, analyze, and interpret human behaviors. Human activities can be grouped into broad categories, including mobility (e.g., sitting and walking), transportation mode (e.g., driving a car or riding a bicycle), phone use (e.g., talking on the phone), and daily activities (e.g., eating and watching television), among many others [16]. However, there is no clear boundary between these categories, and which activities are of interest depends almost entirely on the application. For example, physical inactivity (e.g., sitting still or being sedentary) could be targeted to support healthy lifestyles and prevent obesity [17]. In our ongoing research, we collect information using smartwatches to detect movement patterns such as sitting, walking, jogging, and jumping.
Different sensors have varying capacities for recognizing specific activities. As a result, the choice of sensors needs to be based on the kinds of activities that we want to recognize. The ability to accurately identify the targeted activities, together with cost, form factor, and intrusiveness, are all essential variables in the decision-making process. The accelerometer is the most widely studied sensor for ambulatory activity recognition because it effectively detects repetitive body movements, such as walking, jogging, cycling, and climbing stairs [18]. To improve recognition performance, a gyroscope and a magnetometer are usually used together with the accelerometer [19]. Others have added contextual features (e.g., the current location and the sound environment) to improve recognition performance, using sensors such as light, GPS, and audio sensors [6]. Several researchers have recently proposed using wireless signals such as Wi-Fi to recognize indoor ambulatory activities [7]. This is possible because human activity causes interference in the Wi-Fi signals transmitted and received by nearby Wi-Fi devices. The benefits of employing Wi-Fi signals for HAR include device flexibility, improved indoor coverage, and privacy protection.
In our recent study, we proposed a new sensing modality: recognizing ambulatory activities using a PPG sensor incorporated in wristbands and smartwatches. The PPG sensor is mainly employed for monitoring cardiac and respiratory functions. For that purpose, motion artifacts (MAs) are typically removed from the raw PPG signal before it is used to detect the heart's rhythmic beating and the breathing cycles correctly. We take the opposite approach when the PPG signal is used for HAR. Instead of discarding the MAs, we assume that they carry predictive information that can be used to recognize a variety of ambulatory activities. We use a signal pre-processing procedure to divide the raw PPG signal into three distinct signals: cardiac, respiration, and motion artifact. These three signals are then fed into end-to-end neural network models to predict five different ambulatory activities.

DL Networks for HAR
Once the sensors for recognizing specific targeted activities have been selected, the next logical step is to identify which analytical approach will be used for the recognition tasks. Many existing HAR works [20][21][22][23] have explored various ML and DL techniques. These approaches can be classified according to how they perform feature extraction (manually crafted versus automatic feature extraction). The first category comprises ML techniques (e.g., Decision Tree, Random Forest, and Hidden Markov Models). In contrast, the second category is composed of DL techniques (e.g., feed-forward neural networks, recurrent neural networks, CNNs, and various combinations of these types of networks).
Pierluigi et al. [24] proposed a wearable system that attached an accelerometer to a user's chest and used random forest models on 319 manually extracted features from the accelerometer signals to recognize activities such as walking, climbing stairs, conversing with a person, remaining still, and working on a computer. Accuracy levels of 90% or higher were achieved with this approach. Lu et al. [21] employed an unsupervised clustering method known as Molecular Complex Detection (MCODE) to recognize different types of physical and sports activities. For this purpose, they extracted 19 features from smartphones with built-in accelerometers that were attached to the waists of study volunteers. The MCODE clustering approach outperformed most other clustering methods, such as GMM and K-means, with an accuracy and F1-score of 88% and 85%, respectively. Attal et al. [5] explored various classification methods, including both supervised learning (e.g., K-Nearest Neighbors, Support Vector Machines, and Random Forest) and unsupervised learning (e.g., Gaussian Mixture Models, K-Means, Markov Chain, and Hidden Markov Models), for recognizing the activities of daily living of elderly people, using three inertial sensors placed at different locations on the upper and lower limbs and features in the time and frequency domains. In that study, K-Nearest Neighbors and the Hidden Markov Model were identified as the best algorithms among the supervised and unsupervised methods, respectively. All of these studies were carried out in quite varied settings in terms of sensor location and number, targeted activities, and equipment, which presents a significant obstacle to comparing them. As Jordao et al. [25] noted, it is therefore impossible to draw definitive conclusions about HAR optimization.
Ideally, a HAR system should not impose an additional burden on its users (e.g., the user should not have to carry a new device only for HAR), should not have varying degrees of functionality depending on its placement (e.g., be functional only in a particular fixed position), and should not require a significant amount of featurization, which severely limits its generalizability to different HAR tasks.
The traditional convolutional network suffers from information loss as information propagates through its layers, and it is prone to vanishing or exploding gradients, which prevent deep networks from being trained. This issue can be solved to a certain degree by using ResNet [26]. The main idea of ResNet is to add direct channels to the network that preserve data integrity by passing input information straight to the output. Compared with VGG-Net [27], the main difference of ResNet is the presence of several bypasses that direct the input to subsequent layers. This type of structure is also known as a shortcut.
With the addition of the identity mapping in the residual structure (shown in Figure 1), the original function H(x) that must be learned is rewritten as H(x) = F(x) + x, so the network only needs to learn the residual F(x), that is, the difference between the output and the input. This concept originated in the field of image processing with the use of residual vector encoding. The input and output of the module are added element-wise. This adds no new parameters or computation to the network, and it can significantly improve the training effect and speed of the model. Even though the residual structure provides a solution to the vanishing-gradient problem brought on by deepening the network, ResNet stacks modules with the same topology. As a result, each component of the whole network becomes more extensive, and the attributes of the branches contained in each element become more varied. Inspired by the Inception family of networks [28], the ResNeXt design [29] was created. It combines the residual form of ResNet with the multi-branch structure of Inception, making the Inception branch construction more straightforward and modular. By reducing the number of hyperparameters, accuracy can be increased without increasing parameter complexity. In particular, increasing cardinality in ResNeXt is a more efficient way to improve accuracy than increasing depth or width, because depth and width give diminishing returns in existing models. The reason for the greater effectiveness is that the aggregated transformations of ResNeXt are more robust representations than the residual connections of ResNet [29]. Cardinality is the size of the set of transformations, a specific, measurable dimension of central importance.
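The two ideas in this paragraph, the residual mapping H(x) = F(x) + x and the aggregation of several parallel transformations, can be illustrated with a minimal sketch. The function names and the toy scaling transformations below are hypothetical stand-ins for learned convolutional branches; they are not taken from the PPG-NeXt implementation.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Transform = Callable[[Sequence[float]], Vector]


def residual_block(x: Sequence[float], f: Transform) -> Vector:
    """ResNet-style block: H(x) = F(x) + x, added element-wise."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same length as x")
    return [a + b for a, b in zip(fx, x)]


def aggregated_block(x: Sequence[float], branches: Sequence[Transform]) -> Vector:
    """ResNeXt-style block: y = x + sum of T_i(x) over all branches.

    The number of branches is the cardinality.
    """
    out = list(x)
    for transform in branches:
        tx = transform(x)
        if len(tx) != len(x):
            raise ValueError("every branch must preserve the input length")
        out = [o + v for o, v in zip(out, tx)]
    return out


def scale(factor: float) -> Transform:
    """Toy transformation (assumption): multiply every element by a constant."""
    return lambda x: [factor * v for v in x]


x = [1.0, 2.0, 3.0]
h = residual_block(x, scale(0.5))                      # x + 0.5x
y = aggregated_block(x, [scale(0.5), scale(0.25)])     # x + 0.5x + 0.25x
print(h)  # [1.5, 3.0, 4.5]
print(y)  # [1.75, 3.5, 5.25]
```

In both cases the shortcut carries the input unchanged to the output, so the branches only have to model the residual.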

Available PPG Datasets
HAR datasets containing PPG signals are relatively rare, and the number of activities they represent is limited. However, PPG-based HAR is still an interesting topic because a PPG sensor is already integrated into most smartwatches and smart wristbands. When other HAR sensors are not accessible, it can be used alone. Alternatively, it can be paired with other HAR sensors to achieve better recognition results; moreover, it can monitor different physiological parameters in one device [30]. Reiss et al. [31] introduced the PPG-DaLiA dataset, which contains PPG, electrodermal activity (EDA), body temperature, and tri-axial accelerometer data for eight physical activities. PPG-ACC is a dataset developed by Biagetti et al. [32] to provide insights into the PPG signal for HAR. It contains PPG signals and the associated 3D-ACC signals measured over three different physical activities. Jarchi et al. [33] presented a PPG dataset called "Wrist PPG During Exercise". This database contains wrist-based PPG measurements for four activities, together with three-axis accelerometer and three-axis gyroscope data to monitor motion in all directions. A summary of the three public PPG datasets is shown in Table 1.

Overview of HAR Framework Used in This Study
This work studied DL-based HAR, which applies deep residual networks to extract abstract features from raw PPG data automatically. The studied HAR workflow consists of four main processes: data acquisition, pre-processing, data generation, and model training with classification, as shown in Figure 2. We selected three public HAR datasets from the literature that contain PPG data on human activities, namely PPG-DaLiA, PPG-ACC, and Wrist PPG During Exercise. A summary of the three public datasets is shown in Table 2. The sensor data include PPG and IMU data. The sensor data were then denoised, normalized, and segmented with a sliding window to produce samples for training the DL models and evaluating the results. The training and test samples were generated following a 10-fold cross-validation (CV) protocol. Finally, four standard HAR metrics were used to assess and compare the trained models. The following subsections explain the specifics of each step in the procedure.

PPG-DaLiA Dataset
The PPG-DaLiA dataset, presented by Reiss et al. [31], contains PPG signals collected for human activity recognition. The data were collected from 15 participants between the ages of 21 and 55. Two different instruments were used to record the desired signals. Each participant wore a RespiBAN device on the chest to record ECG signals, respiration, and 3D-ACC at a sampling rate of 700 Hz. In addition, participants wore an Empatica-E4 device on their non-dominant wrist. This device recorded 3D-ACC at a sampling rate of 32 Hz, the blood volume pulse (BVP) signal containing the PPG signal at 64 Hz, and electrodermal activity (EDA) and body temperature at 4 Hz each. After attaching these devices to the participants' chests and wrists, Reiss et al. had the participants perform daily activities. These included sitting, ascending and descending stairs, playing table soccer, cycling outside, driving a car, taking a lunch break, walking, and working. In addition to the listed activities, the authors also documented the transient periods between activities.

PPG-ACC Dataset
The second dataset, PPG-ACC [32], was collected by the Electronics Research Group of the Department of Information Technology, Polytechnic University of the Marche, in Ancona, Italy. It provides insights into the PPG signal acquired at the wrist in the presence of motion artifacts and into the acceleration signal recorded simultaneously at the same wrist. The dataset contains data collected from 7 participants (three males and four females aged 20 to 52) during three different activities. It comprises 105 PPG signals (15 for each participant) and the corresponding 105 3D-ACC signals, measured at a sampling frequency of 400 Hz.

Wrist PPG During Exercise Dataset
The Wrist PPG During Exercise dataset introduced by Jarchi et al. [33] is accessible online at PhysioNet and was utilized for the experiments in this study. During exercise, data were collected from 8 healthy subjects (five males and three females) using a sampling frequency of 256 Hz. Data were collected using a wrist-worn PPG sensor on-board the Shimmer 3 GSR+ for an average recording duration of 5 min and a maximum duration of 10 min. Four exercises were selected and performed: two on a stationary exercise bike and two on a treadmill. The exercises were as follows: treadmill walking, treadmill running, high-resistance exercise bike, and low-resistance exercise bike.

Data Denoising
The valuable information contained in a signal is distorted by noise. Generally, sensor-based HAR is achieved by collecting data from wearable sensors and analyzing them using classification techniques. However, during data collection, the raw sensor data often include noise (missing, incorrect, or aberrant values) caused by environmental influences [34]. Because HAR models rely on long-term dependencies in the sequence, the accumulated effect of such noise can result in incorrect classification. The quality of the training data has a significant impact on the accuracy of models, and the many different acquisition conditions encountered in the real world introduce various kinds of noise that decrease data quality [35]. Therefore, it is essential to minimize the impact of noise to obtain useful information from the signal for subsequent processing. The most popular filtering techniques are the mean, low-pass, wavelet, and Gaussian filters. In our research, we employed a mean smoothing filter for signal denoising. The filter was applied to the PPG signal and to all three dimensions of the accelerometer and gyroscope signals.
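As an illustration of the mean smoothing step, the following sketch applies a centered moving average to a single channel. The window length and the edge handling (a window truncated at the signal boundaries) are assumptions made for this example, since the text does not specify them; the function name `mean_smooth` is hypothetical.

```python
from typing import List, Sequence


def mean_smooth(signal: Sequence[float], window: int = 5) -> List[float]:
    """Centered moving average; the window is truncated at the signal edges."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        segment = signal[lo:hi]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


# A single spike is spread over its neighbours and reduced in amplitude.
noisy = [0.0, 0.0, 9.0, 0.0, 0.0]
smoothed = mean_smooth(noisy, window=3)
print(smoothed)  # [0.0, 3.0, 3.0, 3.0, 0.0]
```

For multi-axis sensors, the same function would be applied to each axis independently.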

Data Normalization
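The raw sensor data were then normalized to the range between 0 and 1 using min-max scaling, as shown in Equation (1):

$$x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$

where x is a raw sensor value and x_min and x_max are the minimum and maximum values of the data. Bringing all data values into a common range helps the model learn and allows gradient descent to converge more quickly.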
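The scaling to the range between 0 and 1 can be sketched as follows. The helper name `min_max_normalize` and the handling of a constant signal are assumptions for this example; whether the minimum and maximum are computed per channel, per subject, or only on the training folds is not stated in the text.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant signal carries no range, so map it to zeros.
        return [0.0 for _ in values]
    span = hi - lo
    return [(v - lo) / span for v in values]


raw = [2.0, 4.0, 6.0, 10.0]
normalized = min_max_normalize(raw)
print(normalized)  # [0.0, 0.25, 0.5, 1.0]
```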

Data Segmentation
It is not practical to input all the data into the HAR model at once because wearable sensors capture a large amount of signal data. Therefore, the data should be segmented with a sliding window before being fed into the model. The sliding window technique is one of the most widely used data segmentation methods in HAR for recognizing periodic activities (e.g., walking and running) and static activities (e.g., standing, sitting, and lying down). The raw sensor signals are divided into fixed-length windows. Consecutive windows overlap by a given percentage so that more training samples are obtained and transitions from one activity to the next are not missed. Figure 3 shows a detailed explanation of the windowing process.
The sample data, which are divided into segments by a sliding window of size N, have a size of K × N. The sample W_t is represented as

$$W_t = \left[ a_t^1, a_t^2, \ldots, a_t^K \right]$$

where the column vector $a_t^k = (a_{t_1}^k, a_{t_2}^k, \ldots, a_{t_N}^k)^T$ contains the signal data of sensor k at window time t, T denotes the transpose operator, and K is the number of sensors. To exploit the correlations between windows and apply the training process, the window data are divided into sequences of windows:

$$S = \{(P_1, y_1), (P_2, y_2), \ldots, (P_Q, y_Q)\}$$

where Q is the size of the window sequence and y_q is the activity label of the corresponding window P_q. For windows containing multiple activity classes, the most common sample activity is chosen as the label of the window.
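The segmentation and majority-vote labeling described above can be sketched as follows. The function name `sliding_windows`, the 50% overlap, and the window length are hypothetical choices for illustration only; the values actually used for each dataset are not given in this section.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def sliding_windows(
    samples: Sequence,
    labels: Sequence[Hashable],
    window_size: int,
    overlap: float = 0.5,
) -> List[Tuple[list, Hashable]]:
    """Cut a recording into fixed-length, overlapping windows.

    Each window is labeled with its most frequent per-sample label.
    """
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    if window_size < 1 or not 0.0 <= overlap < 1.0:
        raise ValueError("need window_size >= 1 and 0 <= overlap < 1")
    step = max(1, int(window_size * (1.0 - overlap)))
    segments = []
    for start in range(0, len(samples) - window_size + 1, step):
        end = start + window_size
        window = list(samples[start:end])
        # On a tie, Counter returns the label that was encountered first.
        majority = Counter(labels[start:end]).most_common(1)[0][0]
        segments.append((window, majority))
    return segments


# Eight samples of one channel; the activity changes after the third sample.
samples = [0, 1, 2, 3, 4, 5, 6, 7]
labels = ["sit"] * 3 + ["walk"] * 5
segments = sliding_windows(samples, labels, window_size=4, overlap=0.5)
for window, label in segments:
    print(window, label)
# [0, 1, 2, 3] sit
# [2, 3, 4, 5] walk
# [4, 5, 6, 7] walk
```

In a multi-sensor setting, each element of `samples` would be a tuple holding the K sensor values recorded at that time step.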

The Proposed PPG-NeXt Network Architecture
The proposed PPG-NeXt network is an end-to-end DL network based on a deep residual architecture. This architecture is composed of convolutional blocks and multi-kernel residual blocks. The overall design of the presented model can be seen in Figure 4. The convolutional block (ConvB) extracts low-level features from the raw sensor data. As shown in Figure 4, ConvB includes four layers: 1D-convolutional (Conv1D), batch normalization (BN), exponential linear unit (ELU), and max-pooling (MP). In Conv1D, multiple trainable convolution kernels capture different features, and each kernel generates a feature map. The BN layer was chosen to accelerate and stabilize the training phase, and the ELU layer was used to increase the expressiveness of the model. The MP layer was used to compress the feature map while retaining its most important components.
The Multi-Kernel Blocks (MK) comprise three modules containing feature convolutional kernels of different sizes-specifically, 1 × 3, 1 × 5, and 1 × 7. To reduce the overall complexity and quantity of parameters in the proposed network, each module employs 1 × 1 convolutions before applying these kernels.
In each multi-kernel block, convolutional units with different kernel sizes are executed in parallel, and their outputs are combined. The three kernel sizes are 1 × 3, 1 × 5, and 1 × 7; the maximum kernel size is a hyperparameter that determines the kernel sizes of the Conv1D layers. Furthermore, 1 × 1 convolutions are applied before these kernels to reduce the model's complexity and the number of parameters needed. The 1 × 1 convolution is a low-cost operation that acts as a dimensionality reduction layer for the input features, and the subsequent convolutions are significantly cheaper once the channel dimension has been reduced, as seen in [28,36]. Details can be seen in Figure 4, which includes the feature set produced by each kind of kernel. Padding was used to maintain the spatial dimensions of all of these feature sets. In the proposed PPG-NeXt, we used "same" padding [37], which pads the input with zeros evenly on the left and right so that the output has the same dimension as the input. After these feature maps are aggregated, the resulting feature map is combined with the input feature map, and the module's output is passed to the subsequent unit.
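The role of zero padding in the multi-kernel block can be seen in a simplified single-channel sketch: because every branch preserves the input length, the branch outputs can be combined with each other and with the input. This example makes several assumptions that are not stated in the text: the branches are combined by element-wise summation, the 1 × 1 channel-reduction step is omitted, and the kernels are fixed all-ones filters rather than learned weights.

```python
from typing import List, Sequence

Signal = List[float]


def conv1d_same(channel: Sequence[float], kernel: Sequence[float]) -> Signal:
    """Single-channel 1D convolution (cross-correlation) with zero padding.

    For odd kernel sizes the input is padded evenly on both sides, so the
    output has the same length as the input.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("this sketch supports odd kernel sizes only")
    half = len(kernel) // 2
    padded = [0.0] * half + list(channel) + [0.0] * half
    return [
        sum(k * padded[t + j] for j, k in enumerate(kernel))
        for t in range(len(channel))
    ]


def multi_kernel_block(x: Sequence[float], kernels: Sequence[Sequence[float]]) -> Signal:
    """Run all kernel branches in parallel, sum them, and add the input back."""
    out = list(x)  # shortcut: start from the unchanged input
    for kernel in kernels:
        branch = conv1d_same(x, kernel)
        out = [o + b for o, b in zip(out, branch)]
    return out


x = [1.0, 2.0, 3.0, 4.0]
kernels = [[1.0] * 3, [1.0] * 5, [1.0] * 7]  # stand-ins for 1x3, 1x5, 1x7 filters
out = multi_kernel_block(x, kernels)
print(conv1d_same(x, kernels[0]))  # [3.0, 6.0, 9.0, 7.0]
print(out)                         # [20.0, 28.0, 32.0, 30.0]
```

Without padding, the three branches would produce outputs of different lengths and could not be added to the shortcut.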

Hyperparameters
The hyperparameter settings control the learning process in DL. The following hyperparameters are used in the proposed PPG-NeXt model: (1) learning rate (α), (2) number of epochs, (3) batch size, (4) optimizer, and (5) loss function. We initially set the learning rate α to 0.001. The number of epochs was set to 200, and the batch size was set to 128. An early-stopping callback terminated the training process if the validation loss had not improved after 30 epochs. In addition, the learning rate was reduced to 75% of its value whenever the validation accuracy of the proposed PPG-NeXt model had not improved for six epochs. For error reduction, we used the Adam optimizer [38] with settings β1 = 0.9, β2 = 0.999, and ε = 1 × 10^−8. The error is measured with the categorical cross-entropy loss function, which the optimizer minimizes. Cross-entropy [39] performs better than the classification error and the mean squared error.
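The interaction between early stopping and the learning-rate reduction can be traced with a small, framework-independent sketch. The function name `run_callbacks` is hypothetical, and one detail is an assumption: each reduction multiplies the current learning rate by 0.75, which is how plateau-based schedulers usually behave, whereas the text could also be read as a single reduction relative to the initial value.

```python
from typing import List, Sequence, Tuple


def run_callbacks(
    val_losses: Sequence[float],
    val_accuracies: Sequence[float],
    lr: float = 1e-3,
    stop_patience: int = 30,
    lr_patience: int = 6,
    lr_factor: float = 0.75,
) -> Tuple[int, List[float]]:
    """Return the epoch at which training stops and the learning rate per epoch."""
    best_loss, best_acc = float("inf"), float("-inf")
    loss_wait = acc_wait = 0
    lr_history: List[float] = []
    for epoch, (loss, acc) in enumerate(zip(val_losses, val_accuracies), start=1):
        if loss < best_loss:
            best_loss, loss_wait = loss, 0
        else:
            loss_wait += 1
        if acc > best_acc:
            best_acc, acc_wait = acc, 0
        else:
            acc_wait += 1
            if acc_wait >= lr_patience:
                lr *= lr_factor  # assumption: relative to the current rate
                acc_wait = 0
        lr_history.append(lr)
        if loss_wait >= stop_patience:
            return epoch, lr_history  # early stopping
    return len(lr_history), lr_history


# Short patiences and lr=1.0 are used here only to keep the trace readable.
val_losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
val_accs = [0.5, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6]
stopped_epoch, lr_history = run_callbacks(
    val_losses, val_accs, lr=1.0, stop_patience=3, lr_patience=2
)
print(stopped_epoch)  # 5
print(lr_history)     # [1.0, 1.0, 1.0, 0.75, 0.75]
```

The default arguments mirror the values reported in the text (initial learning rate 0.001, patience of 30 and 6 epochs, factor 0.75).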

Model Training and Performance Evaluation
The hybrid deep residual network was trained on the three PPG datasets after setting the model's hyperparameters as described in the previous section. Instead of a fixed split into training and test sets, we used 10-fold CV to measure the performance of the proposed PPG-NeXt model. The 10-fold CV separates the entire dataset into ten equal-sized folds that do not overlap. In each of the ten iterations, the model is fitted on nine folds, and the remaining held-out fold is used to measure performance.
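A minimal sketch of how the ten non-overlapping folds can be generated is shown below. The helper `k_fold_indices` and the shuffling with a fixed seed are assumptions for illustration; the text does not say whether samples were shuffled or stratified before splitting.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(100, k=10))
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    # A model would be fitted on train_idx and evaluated on test_idx here.
    print(fold_number, len(train_idx), len(test_idx))
```

Every sample appears in exactly one test fold, so the averaged metrics cover the whole dataset.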

Evaluation Metrics
For a given activity class, a prediction of the proposed PPG-NeXt model is counted as a true positive (TP) when the activity is correctly recognized, a false positive (FP) when the activity is incorrectly recognized, a true negative (TN) when the activity is correctly rejected, and a false negative (FN) when the activity is incorrectly rejected. The effectiveness of the proposed method was assessed with four standard measures (accuracy, precision, recall, and F1-score), computed using Equations (4)-(7):

$$\text{Accuracy}(\%) = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \tag{4}$$

$$\text{Precision}(\%) = \frac{TP}{TP + FP} \times 100\% \tag{5}$$

$$\text{Recall}(\%) = \frac{TP}{TP + FN} \times 100\% \tag{6}$$

$$\text{F1-score}(\%) = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{7}$$
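The four measures can be computed directly from the TP, TN, FP, and FN counts, as in the sketch below. It uses the standard definitions of precision, recall, and F1-score; the function name and the choice to return 0 when a denominator is zero are assumptions for this example.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1-score, all expressed in percent."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": 100.0 * accuracy,
        "precision": 100.0 * precision,
        "recall": 100.0 * recall,
        "f1": 100.0 * f1,
    }


# One-vs-rest counts for a single activity class (illustrative numbers).
metrics = classification_metrics(tp=2, tn=4, fp=2, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}%")
```

For a multi-class problem, the counts are obtained per class in a one-vs-rest fashion from the confusion matrix and the per-class scores are then averaged.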

Experimental Setup
All experiments in this work were performed on the Google Colab Pro+ platform, and the training and testing of the DL models were accelerated using a Tesla V100-SXM2-16GB GPU. The proposed PPG-NeXt and the other DL-based models were implemented in Python using the TensorFlow backend [41] and CUDA [40]. The following Python libraries were used for the experiments:
• Pandas and NumPy were used for reading, processing, and analyzing the sensor data.
• Seaborn and Matplotlib were used to visualize the data analysis and model evaluation results.
• Scikit-learn was used for sampling and data generation during the experiments.
• TensorFlow and Keras were used to implement and train the proposed PPG-NeXt model and the other DL models.

Experimental Results
This study investigated sensor-based HAR using DL models to recognize human activities. We used three public benchmark PPG datasets (PPG-DaLiA, PPG-ACC, and Wrist PPG During Exercise) that contain PPG and other wearable sensor data. The raw PPG data were preprocessed and then used to train and evaluate the DL models with the 10-fold CV technique. After the model hyperparameters had been defined, the hybrid deep residual network was trained on the three public benchmark datasets. The results of the experiments are presented in Tables 3, 4, and 5, respectively. The confusion matrices in Figure 5 show that the proposed PPG-NeXt models using PPG and acceleration data achieved F1-scores of at least 95% on the three datasets.
Figure 5. Confusion matrices of the proposed PPG-NeXt models using PPG and acceleration data: (a-c) PPG-DaLiA; (d-f) PPG-ACC; (g-i) Wrist PPG During Exercise.
To assess the performance of the proposed PPG-NeXt model, we compared it against state-of-the-art DL approaches for biosignal-based HAR. Table 6 lists the state-of-the-art works related to HAR using PPG and IMU sensors. The comparative results reveal that PPG-NeXt surpasses the other related models in overall accuracy. The proposed PPG-NeXt model obtained the highest performance on all three datasets, with 99.33%, 99.23%, and 99.68% on the PPG-DaLiA, PPG-ACC, and Wrist PPG During Exercise datasets, respectively.

Impact of Sampling Frequencies on Different Dataset
Based on the experimental findings in Tables 4-6, the average accuracies of PPG-NeXt using acceleration data with higher sampling frequencies are superior to those obtained using acceleration data with lower sampling frequencies. When the sampling frequency is increased, the sensor data comprise more data points per sensor, and this larger number of data points offers more insight into the motion [44]. The findings also show that the sampling rate of the PPG signal had little effect on the PPG-NeXt model's average accuracy. This means that the sampling rate of the PPG signal can be reduced to low-frequency levels without significantly affecting HAR performance. Table 7 provides the F1-scores of PPG-NeXt trained on the three datasets (PPG-DaLiA, PPG-ACC, and Wrist PPG During Exercise). Each dataset includes different human activities. The majority of the eight daily living activities in the PPG-DaLiA dataset are straightforward. The PPG-ACC dataset consists of three exercise-related activities: resting, squatting, and walking. The Wrist PPG During Exercise dataset includes four exercise activities: walking, running, and cycling with high and low resistance.

Impact of Sensor Types
We organize the PPG-DaLiA results into seven sensor-combination scenarios: a single input signal (scenarios 1, 2, and 3), a combination of two input signals (scenarios 4, 5, and 6), and scenario 7, which fuses PPG, 3D-ACC, and ECG data. Figure 6 illustrates our findings on the activity recognition performance of the HAR models. For identifying human behavior, the PPG signal surpasses the other two biosignals when just one signal source is considered. When the PPG and 3D-ACC signals are combined, the model's performance exceeds that of the model employing just one signal source. Our findings imply that including the PPG signal in HAR solutions based primarily on the 3D-ACC might enhance the model's effectiveness. Moreover, when all three signal sources are considered, HAR efficacy is identical to that obtained with the PPG and 3D-ACC signals alone. In other words, fusing ECG signals did not enhance the performance of the classifiers in our investigation.
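The seven scenarios correspond to all non-empty subsets of the three signal sources, which can be enumerated as in the short sketch below. The ordering of the signals within each group is an assumption; the text does not say which signal corresponds to which scenario number.

```python
from itertools import combinations

signals = ["PPG", "3D-ACC", "ECG"]

# Single signals first, then pairs, then the fusion of all three.
scenarios = [
    combo
    for size in range(1, len(signals) + 1)
    for combo in combinations(signals, size)
]

for number, combo in enumerate(scenarios, start=1):
    print(f"Scenario {number}: {' + '.join(combo)}")
```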

Limitations and Further Directions
This study has several limitations. First, the experiments were performed with a limited sample size in a semi-controlled setting using three publicly accessible datasets, which might restrict the generalizability of our results. Second, a drawback of the proposed PPG-NeXt model is the limited interpretability of the extracted features. The feature matrix signifies binary numbers representing the percentage of positive values, which makes it challenging to understand which regions of the signals the network concentrates on. Despite these limitations, this study provides new insights into how to assess human behavior using sensing modalities other than motion sensors. Future research will therefore require the collection of an additional PPG dataset to obtain better and more generalizable findings from the classification model. The dataset will include PPG data together with data from other low-energy sensors on a variety of immediate and complicated actions, sampling frequencies, and sensor locations.

Conclusions
This work introduced a deep residual network, PPG-NeXt, for physical activity recognition using PPG and wearable inertial sensor data. The proposed model was evaluated on three publicly available benchmark PPG datasets (PPG-DaLiA, PPG-ACC, and Wrist PPG During Exercise) and compared with other DL models. The results show that an F1-score of more than 90% is achieved using only PPG data.
We performed a comparative analysis to evaluate the contributions of the various signal sources in HAR systems. The experimental results show that the 3D-ACC is the most informative signal when the goal of the HAR system is to acquire and use a single signal source. Moreover, our results indicate that combining PPG and 3D-ACC signals improves activity recognition without significantly increasing hardware and processing costs. Both biosignals, ECG and PPG, can separate static and non-static activities with a sufficient level of success. Overall, our findings indicate that combining PPG and 3D-ACC signal features could improve the F1-score across all activity situations. Nonetheless, the ECG signal feature has difficulty distinguishing between activities with similar motions but significantly different heart rate signatures.