Using a Hybrid Neural Network and a Regularized Extreme Learning Machine for Human Activity Recognition with Smartphone and Smartwatch

Mobile health (mHealth) utilizes mobile devices, mobile communication techniques, and the Internet of Things (IoT) to improve not only traditional telemedicine and monitoring and alerting systems, but also fitness and medical information awareness in daily life. In the last decade, human activity recognition (HAR) has been extensively studied because of the strong correlation between people's activities and their physical and mental health. HAR can also be used to care for elderly people in their daily lives. This study proposes an HAR system for classifying 18 types of physical activity using data from sensors embedded in smartphones and smartwatches. The recognition process consists of two parts: feature extraction and HAR. To extract features, a hybrid structure consisting of a convolutional neural network (CNN) and a bidirectional gated recurrent unit (BiGRU) was used. For activity recognition, a single-hidden-layer feedforward neural network (SLFN) with a regularized extreme learning machine (RELM) algorithm was used. The experimental results show an average precision of 98.3%, a recall of 98.4%, an F1-score of 98.4%, and an accuracy of 98.3%, results that are superior to those of existing schemes.

using accelerometers and gyroscopes, are the most popular apps. These physical activities are used to calculate the number of calories spent. Over the past decade, recognition of physical activities has been applied to prevent falls among the elderly [17][18][19]. However, with the COVID-19 pandemic and an aging society, monitoring quarantined or elderly individuals has become a major issue in mHealth. Numerous studies have shown that people's activities have strong correlations with their physical and mental health [20,21]. Therefore, recognizing physical activities using accelerometers and gyroscopes embedded in smartphones and smartwatches is a critical challenge in mHealth.
In recent years, deep learning (DL) and machine learning (ML) have been widely applied in mHealth [22][23][24][25]. In these studies, DL and ML models are used not only for diagnosing, estimating, mining, and delivering physiological signals, but also for preventing chronic diseases. However, in mHealth, large volumes of data need to be delivered to servers, such as those of hospitals or health management centers. Therefore, telecommunications and navigation technologies, to which artificial intelligence has also been applied, are likewise important [26,27]. Stefanova-Pavlova et al. proposed the refined generalized net (GN) to track users' locations [28]. Silva et al. used Petri nets to model the reliability and availability of wireless sensor networks in a smart hospital [29]. Ruiz et al. proposed a tele-rehabilitation system to assist with physical rehabilitation during the COVID-19 pandemic [30].
Convolutional neural networks (CNNs) can extract features from signals, while long short-term memory (LSTM) can recognize time-sequential features. Therefore, some studies have proposed deep neural networks that combine CNNs and LSTM to recognize physical activities [31,32]. Li et al. utilized bidirectional LSTM (BiLSTM) for continuous human activity recognition (HAR) and fall detection with soft feature fusion between the signals measured by wearable sensors and radar [33]. The extreme learning machine (ELM) has shown excellent results in classification tasks with extremely fast learning speed [34]. Chen et al. proposed an ensemble ELM algorithm for HAR using smartphone sensors [35]. Their results showed that its performance was better than those of other methods, such as artificial neural networks (ANNs), support vector machines (SVMs), random forests (RFs), and deep LSTM. In order to improve the accuracy of HAR systems, more complex deep learning models have been proposed. Tan et al. used smartphone sensors for HAR. They proposed an ensemble learning algorithm (ELA) that combined a gated recurrent unit (GRU), a hybrid CNN+GRU, and a multilayer neural network, then fused them with three fully connected layers [36]. In 2020, the International Data Corporation (IDC) reported that wearable devices were being used more frequently to monitor health during the COVID-19 pandemic, resulting in a 35.1% increase in smartwatch sales [37]. Thus, more activities could be classified and higher accuracies could be achieved if smartphones and smartwatches were used synchronously for HAR. Weiss et al. used smartphone and smartwatch sensors for HAR with an RF algorithm [38]. Mekruksavanich et al. also used smartphone and smartwatch sensors for HAR with a hybrid deep learning model called CNN+LSTM [39]. Prior studies have shown that adding hand-movement signals measured by smartwatch sensors can enhance the accuracy of HAR.
To improve the accuracy of HAR systems, more complex deep learning models are necessary. Thus, this study focuses on recognizing 18 different physical activities, including body, hand, and eating movements, utilizing data from sensors embedded in smartphones and smartwatches. The recognition process involves two steps: feature extraction and HAR. To extract features, a hybrid structure was used that consisted of a CNN and a recurrent neural network (RNN), while a multilayer perceptron neural network (MPNN) was used for the recognition of activities. The RNN was implemented as LSTM, GRU, BiLSTM, or bidirectional GRU (BiGRU) to optimize the hybrid structure. The MPNN was trained separately using backpropagation (BP), the ELM, and the regularized ELM (RELM). The HAR dataset used in this study was obtained from the UCI Machine Learning Repository, specifically the WISDM smartphone and smartwatch activity and biometrics dataset [31]. According to the experimental results, the proposed HAR system demonstrated superior performance when compared to the systems developed in existing studies.

Materials and Methods
The proposed HAR system has three components: a data processing unit, a feature extraction unit, and a classification unit, as illustrated in Figure 1. Physical activity signals are captured by a smartphone and a smartwatch and are subsequently sampled, segmented, and reshaped for further processing. The sensor data features are extracted using a hybrid CNN+RNN model. Finally, an MPNN is employed to classify the 18 types of physical activities.
Figure 1. Structural diagram of the proposed HAR system, including the data processing unit, the feature extraction unit, and the classification unit.

UCI-WISDM Dataset
The UCI-WISDM dataset [40] comprises tri-axial accelerometer and gyroscope data obtained from 51 volunteer subjects. The subjects carried an Android phone (a Google Nexus 5/5x or a Samsung Galaxy S5) in a front pocket of their pants and wore an Android watch (an LG G Watch) on their wrist while performing eighteen activities. The activities were categorized as body movements (walking, jogging, walking up stairs, sitting, and standing), which are included in many previous studies; hand movements (kicking, dribbling, catching, typing, writing, clapping, brushing teeth, and folding clothes), which represent activities of daily life; and eating movements (eating pasta, drinking soup, eating a sandwich, eating chips, and drinking from a cup), which were included to investigate the feasibility of automatic food-tracking applications [38]. The data were sampled at a rate of 20 Hz, and the 12 signals were segmented into fixed-width sliding windows of 6.4 s with 50% overlap between them. Thus, each sample contained 12-channel signals, and each channel comprised 128 points. Samples containing two activities were removed. The numbers of training and testing samples were 34,316 and 14,707, respectively, and the sample numbers for each of the eighteen activities are presented in Table 1.
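The windowing described above can be sketched as follows (a minimal NumPy illustration; the function and variable names are ours, not from the dataset's tooling):

```python
import numpy as np

def segment(signal, window=128, overlap=0.5):
    """Slice a (channels, T) recording into fixed-width sliding windows.
    window=128 points corresponds to 6.4 s at 20 Hz; overlap=0.5 gives a
    64-point stride, i.e., 50% overlap between consecutive windows."""
    stride = int(window * (1 - overlap))
    n = (signal.shape[1] - window) // stride + 1
    return np.stack([signal[:, i * stride : i * stride + window] for i in range(n)])

# Toy usage: one minute of 12-channel data at 20 Hz -> 1200 points per channel.
recording = np.zeros((12, 1200))
samples = segment(recording)
print(samples.shape)   # (17, 12, 128)
```

Each resulting sample is a 12-channel, 128-point segment, matching the sample shape stated above.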

Feature-Extraction Model
Figure 2 illustrates the feature-extraction model, which employs a hybrid CNN and RNN to extract the features of the sensor signals. A fully connected network, consisting of three layers, is used to classify the 18 types of physical activities. After training, the RNN outputs for the training samples serve as the feature samples used to train the activation-classification models. Since the human movements in each activity occur in chronological order, the sensor signals are time-sequential data. To address this, a time-distributed layer comprising four 1D CNNs (i.e., four CNN blocks, each with three convolutional layers followed by a maximal pooling layer) is stacked on top of the RNN.
This separates a sample into four segments, with each segment containing 32 points. In the convolutional layers, the number of filters is 64; the kernel sizes are 3, 5, and 13; the stride is 1; and the padding is 4. In the pooling layer, the kernel size is 2, and the stride is 2. The activation function employed is ReLU. The RNN is replaced in turn with the LSTM, BiLSTM, GRU, or BiGRU, with the unit numbers of the LSTM and GRU set to 128 and those of the BiLSTM and BiGRU set to 256. The batch size is set to 32, with the reset and update gates controlled by a sigmoid function and the hidden state using a tanh function. The node numbers of the fully connected layers are 128, 64, and 18, respectively, with ReLU used as the activation function in the hidden layers and softmax in the output layer. The loss function is the categorical cross-entropy (CE) function, and the Adam optimizer is used [41], with the learning rate set to 0.0001. Equation (1) is the formula for the categorical CE:

CE = −log( e^{a_k} / ∑_{i=1}^{M} e^{a_i} ),   (1)

where M is 18, a_k is the score fed to the softmax for the positive class, and a_i is the score inferred by the net for each class.
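As an illustration, the hybrid structure described above can be sketched in PyTorch, the tool used in this study. This is a minimal sketch under our own assumptions, not the authors' code: the three kernel sizes are assigned to the three convolutional layers in order, length-preserving padding is used in place of the stated padding of 4, the four segments share one CNN (time-distributed weights), and the last BiGRU output is taken as the 256-dimensional feature vector.

```python
import torch
import torch.nn as nn

class CNNBiGRU(nn.Module):
    """Sketch of the hybrid CNN+BiGRU feature extractor (hypothetical layout)."""
    def __init__(self, n_channels=12, n_classes=18):
        super().__init__()
        # Three conv layers (64 filters; kernel sizes 3, 5, 13) + max pooling,
        # applied identically to each of the 4 time segments.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_channels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=13, padding=6), nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=2),
        )
        # Bidirectional GRU: 128 units per direction -> 256-dim output.
        self.bigru = nn.GRU(input_size=64 * 16, hidden_size=128,
                            batch_first=True, bidirectional=True)
        # Fully connected layers with 128, 64, and 18 nodes.
        self.classifier = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):                            # x: (batch, 12, 128)
        b = x.shape[0]
        x = x.view(b, 12, 4, 32).permute(0, 2, 1, 3) # 4 segments of 32 points
        x = x.reshape(b * 4, 12, 32)
        x = self.cnn(x)                              # (b*4, 64, 16)
        x = x.reshape(b, 4, -1)                      # sequence of 4 feature vectors
        out, _ = self.bigru(x)                       # (b, 4, 256)
        feat = out[:, -1, :]                         # 256-dim activation features
        return self.classifier(feat), feat
```

After training with categorical cross-entropy, the 256-dimensional `feat` vectors would serve as inputs to the activation-classification model described next.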

Activation-Classification Model
The activation-classification model is a single-hidden-layer feedforward neural network (SLFN) trained with the ELM algorithm [42]. Its advantages are a convergence time shorter than that of the BP method and the fact that it does not become trapped in a local minimum. For an SLFN, let X_r denote the rth input vector and Y_r the rth target vector. The output O_r of an SLFN with l hidden neurons can be expressed as

O_r = ∑_{j=1}^{l} β_j f(W_j · X_r + b_j),  r = 1, …, N,   (2)

where f(x) is the activation function in the hidden layer, W_j = (w_{j1}, w_{j2}, …, w_{jn}) ∈ R^n is the weight vector from the input layer to the jth hidden node, b_j is the bias of the jth hidden node, β_j is the weight vector from the jth hidden node to the output layer, and l is the number of hidden nodes. In the ELM, the activation functions are nonlinear functions that provide nonlinear mapping for the system, and O_r is the rth output vector. The mean square error (MSE) is the objective function:

MSE = (1/N) ∑_{r=1}^{N} ‖O_r − Y_r‖²,   (3)

where N is the number of samples. The MSE approaches 0 as the number of hidden nodes approaches infinity, in which case the output of the SLFN equals the target output. Thus, Equation (2) can be written in matrix form as

Hβ = Y,   (4)

where Y is the output matrix, H is the matrix of the activation function in the hidden layer, and β is the weight matrix from the hidden nodes to the output layer. The ELM uses random parameters W_j and b_j in its hidden layer, and they are frozen during the whole training process, so β is obtained as the minimum-norm least-squares solution

β = H†Y,   (5)
where H† is the Moore-Penrose inverse. The residual of the ith sample is the difference between its target and output values:

ε_i = Y_i − O_i.   (6)

However, the ELM risks producing an over-fitted model because it is based on the empirical risk minimization principle [43]. Deng et al. proposed a regularized ELM (RELM) that weights the empirical risk, ε², by a factor γ [44]:

min_{β,ε} (1/2)‖β‖² + (γ/2) ∑_{i=1}^{N} ε_i²,  s.t. Y_i − h(X_i)β = ε_i,  i = 1, …, N.   (7)
In order to obtain a robust estimate that weakens outlier interference, ε_i can be weighted by a factor v_i, so Equation (7) becomes

min_{β,ε} (1/2)‖β‖² + (γ/2) ∑_{i=1}^{N} v_i ε_i²,  s.t. Y_i − h(X_i)β = ε_i,  i = 1, …, N.   (8)

The method of Lagrange multipliers is used to search for the optimal solution of Equation (8):

L(β, ε, α) = (1/2)‖β‖² + (γ/2) ∑_{i=1}^{N} v_i ε_i² − ∑_{i=1}^{N} α_i (h(X_i)β − Y_i + ε_i),   (9)

where α is the vector of Lagrange multipliers for the equality constraints of Equation (8). Setting the gradients of L(β, ε, α) equal to zero gives the following Karush-Kuhn-Tucker (KKT) optimality conditions [44,45]:

∂L/∂β = 0 → β = Hᵀα,   (10)
∂L/∂ε_i = 0 → α_i = γ v_i ε_i,  i = 1, …, N,   (11)
∂L/∂α_i = 0 → h(X_i)β − Y_i + ε_i = 0,  i = 1, …, N,   (12)

which together yield the closed-form solution β = (I/γ + HᵀVH)⁻¹HᵀVY, where V = diag(v_1, …, v_N).
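The two closed-form solutions above can be sketched in a few lines of NumPy. This is a minimal illustration with hypothetical shapes, not the authors' code; the unweighted case (v_i = 1, so V = I) is assumed for the RELM:

```python
import numpy as np

def hidden_layer(X, W, b):
    """Hidden-layer activation matrix H of an SLFN with frozen random weights."""
    return np.tanh(X @ W + b)

def elm_output_weights(H, Y):
    """ELM solution: beta = H^+ Y (Moore-Penrose pseudoinverse)."""
    return np.linalg.pinv(H) @ Y

def relm_output_weights(H, Y, gamma):
    """RELM solution with V = I: beta = (I/gamma + H^T H)^{-1} H^T Y."""
    l = H.shape[1]
    return np.linalg.solve(np.eye(l) / gamma + H.T @ H, H.T @ Y)

# Toy usage: 100 feature samples of dimension 256, 18 one-hot targets.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 256))
Y = np.eye(18)[rng.integers(0, 18, 100)]
W = rng.standard_normal((256, 600))   # random and frozen during training
b = rng.standard_normal(600)
H = hidden_layer(X, W, b)
beta_elm = elm_output_weights(H, Y)
beta_relm = relm_output_weights(H, Y, gamma=5e-4)
```

Only the output weights β are learned; the hidden layer is never updated, which is what makes ELM/RELM training so fast compared with BP.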

Experimental Protocol
The hardware used in this study comprised an Intel Core i7-8700 CPU and a GeForce GTX1080 GPU. The operating system used was Ubuntu 16.04 LTS, with development conducted in Anaconda 3 for Python 3.7. The deep learning tool used was PyTorch 1.10, and the development environment was Jupyter Notebook. To assess the proposed method's performance, we evaluated the optimal feature-extraction model and the activation-classification model for HAR separately.
In the feature-extraction model, the RNN was replaced with LSTM, BiLSTM, GRU, and BiGRU, separately. The training samples were used to adjust the parameters of the hybrid CNN+RNN, while the testing samples were used to evaluate the performances of these RNNs. The feature-extraction model that achieved the best performance was one in which the RNN outputs for all training and testing samples were used as the new training and testing samples to evaluate the activation-classification model.
In the activation-classification model, a multilayer perceptron neural network (MPNN) was used to classify the 18 physical activities. The output number of the MPNN was 18, and the input number depended on the number of RNN outputs. The training algorithms used were BP, ELM, and RELM. The number of hidden nodes (l) and the regularization parameter (γ) of the RELM were optimized using the grid-search method to find the optimal values.
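The grid search just mentioned can be expressed generically as follows. This is a sketch; `score_fn` stands in for training and validating an RELM at the given (l, γ), a step the text does not spell out, and the toy score surface below is made up for illustration:

```python
from itertools import product

def grid_search(score_fn, l_values, gamma_values):
    """Return the (l, gamma) pair maximizing a validation-score callable."""
    return max(product(l_values, gamma_values), key=lambda p: score_fn(*p))

# Toy usage with a fabricated score surface peaking at l=6000, gamma=5e-4.
best = grid_search(lambda l, g: -abs(l - 6000) - abs(g - 5e-4),
                   l_values=(256, 1000, 2000, 4000, 6000, 8000),
                   gamma_values=(5e-4, 5e-3, 5e-2, 0.5, 4.0))
print(best)   # (6000, 0.0005)
```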

Statistical Analysis
According to the proposed method, a sample was considered a true positive (TP) when the classification activity was correctly recognized, a false positive (FP) when the classification activity was incorrectly recognized, a true negative (TN) when the activity classification was correctly rejected, and a false negative (FN) when the activity classification was incorrectly rejected. In this work, the performance of the proposed method was evaluated using the measures given by Equations (13)-(16):

Precision = TP / (TP + FP),   (13)
Recall = TP / (TP + FN),   (14)
F1-score = 2 × Precision × Recall / (Precision + Recall),   (15)
Accuracy = (TP + TN) / (TP + TN + FP + FN).   (16)
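For a multi-class problem such as this one, these measures are computed per activity and then averaged. A minimal macro-averaged sketch (function and variable names are ours):

```python
import numpy as np

def macro_metrics(y_true, y_pred, n_classes=18):
    """Per-class precision/recall/F1, macro-averaged, plus overall accuracy."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    precisions, recalls, f1s = [], [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))  # correctly recognized
        fp = np.sum((y_pred == c) & (y_true != c))  # incorrectly recognized
        fn = np.sum((y_pred != c) & (y_true == c))  # incorrectly rejected
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p); recalls.append(r); f1s.append(f)
    accuracy = np.mean(y_true == y_pred)
    return np.mean(precisions), np.mean(recalls), np.mean(f1s), accuracy
```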

Results
In order to evaluate the effectiveness of the proposed method, we present three sets of results: those for the feature-extraction model, those for the activation-classification model, and the training times of the models.

Analysis of the Feature-Extraction Model
The learning curves for the hybrid CNN+LSTM model are depicted in Figure 3, where (a) and (b) represent the accuracy and loss curves, respectively. The blue line corresponds to the training data, while the orange line corresponds to the validation data. The optimal values for the accuracy and loss function are achieved at epoch 29. When applied to the testing data, the model achieved an average precision, recall, F1-score, and accuracy of 93.8%, 93.8%, 93.8%, and 94.1%, respectively. The total training time for the model was 130.26 s.
In Figure 4, the learning curves for the hybrid CNN+GRU model are presented, where (a) and (b) denote the accuracy and loss curves, respectively. The blue line represents the training data, while the orange line represents the validation data. The optimal values for the accuracy and loss function are attained at epoch 28. When evaluated on the testing data, the model achieved an average precision, recall, F1-score, and accuracy of 92.6%, 92.6%, 92.5%, and 92.2%, respectively. The total training time for the model was 98.67 s.
The learning curves for the hybrid CNN+BiLSTM structure are displayed in Figure 5, where (a) and (b) represent the accuracy and loss curves, respectively. The blue line corresponds to the training data, while the orange line corresponds to the validation data. The optimal values for the accuracy and loss function are achieved at epoch 30. When applied to the testing data, the model achieved an average precision, recall, F1-score, and accuracy of 95.3%, 95.3%, 95.3%, and 95.3%, respectively. The total training time for the model was 138.86 s.
In Figure 6, the learning curves for the hybrid CNN+BiGRU model are presented, where (a) and (b) denote the accuracy and loss curves, respectively. The blue line represents the training data, while the orange line represents the validation data. The optimal values for the accuracy and loss function are attained at epoch 29.
When evaluated on the testing data, the model achieved an average precision, recall, F1-score, and accuracy of 95.7%, 95.4%, 95.5%, and 95.2%, respectively. The total training time for the model was 108.69 s. Table 2 provides an overview of the performances of the four feature-extraction models. Although the hybrid structures with BiLSTM and BiGRU require more training time per epoch than LSTM and GRU (4.60 s vs. 4.49 s and 3.74 s vs. 3.52 s, respectively), their testing accuracies are superior to those of LSTM and GRU (95.3% vs. 94.1% and 95.2% vs. 92.2%). Given that the hybrid structure with BiGRU saves 19% of training time compared to BiLSTM and that their accuracies are very similar (95.25% vs. 95.3%), the feature-extraction model based on the hybrid CNN+BiGRU structure was chosen for building the HAR system.

Analysis of the Activation-Classification Model
To classify the 18 types of physical activities, an MPNN was utilized, where the numbers of input and output nodes were set to 256 and 18, respectively. The MPNN was trained using three activation-classification algorithms: BP, ELM, and RELM. The performances of the ELM and RELM were influenced by two parameters: the regularization index (γ) and the number of hidden nodes (l).

Performance of the MPNN with the BP Algorithm
The MPNN with the BP algorithm had two hidden layers with 128 and 64 nodes, respectively, where ReLU was used as the activation function in the hidden layers and softmax in the output layer. Table 3 shows the performances of the MPNN with the BP algorithm for the 18 physical activities on the testing data. The model achieved an average precision of 97.1%, an average recall of 97.2%, an average F1-score of 97.2%, and an accuracy of 97.2%. The total training time was 10.563 s. Among the 18 activities, the worst F1-scores were obtained for the eating pasta, catching a ball, and eating a sandwich activities, which all involve hand and eating movements.

The Optimal Parameters of the RELM
The SLFN utilized both the ELM and RELM algorithms, and the optimal parameters for the RELM were determined using a grid-search method. First, the regularization index (γ) was set to 5 × 10⁻⁴, and the number of hidden nodes (l) was gradually increased from 256 to 8000. Table 4 displays the testing accuracies and training times for various numbers of hidden nodes. The highest accuracy of 98.35%, with a training time of 3.80 s, was achieved with 6000 hidden nodes. Then, with l fixed at 6000, γ was gradually increased from 5 × 10⁻⁴ to 4. Table 5 shows the testing accuracies and training times for the different regularization indexes. The most accurate results and the longest training time were obtained when γ was set to 5 × 10⁻⁴. In Equation (7), the empirical risk, ε², is weighted by γ; thus, the performances of the ELM and RELM would be close in this study.

Performances of the SLFN with the ELM and RELM Algorithms
For the ELM algorithm, the SLFN had one hidden layer with 6000 nodes. Figure 7 shows the confusion matrix for the classification of the eighteen activities. The performances for the writing, clapping, brushing teeth, eating chips, and drinking from a cup activities were better than those obtained with the BP algorithm. Table 6 presents the performances of the SLFN with the ELM algorithm on the testing data. The model achieved an average precision of 97.9%, a recall of 97.9%, an F1-score of 97.9%, and an accuracy of 97.8%. The total training time was 7.52 s. The F1-scores for the eating pasta, catching a ball, and eating a sandwich activities rose to 98.0%, 96.4%, and 98.1%, respectively.
For the RELM algorithm, l was set to 6000 for the SLFN, and γ was set to 5 × 10⁻⁴. Figure 8 shows the confusion matrix for the classification of the eighteen activities. The eating pasta activity was easily confused with the drinking soup and drinking from a cup activities, and catching a ball was easily confused with kicking a ball. Table 7 shows the performances of the SLFN with the RELM algorithm on the testing data. The model achieved an average precision of 98.3%, a recall of 98.4%, an F1-score of 98.4%, and an accuracy of 98.3%. The total training time was 3.59 s. The F1-scores for the eating pasta, catching a ball, and eating a sandwich activities rose to 98.1%, 97.6%, and 99.2%, respectively.

Discussion
The proposed HAR system uses a hybrid CNN+RNN model to extract activation features from the accelerometers and gyroscopes in smartphones and smartwatches. This approach was originally proposed by Tan et al. [36]. Since the accelerometer and gyroscope signals for activities are time-sequential, the performance of different RNN models can vary for HAR. In this study, LSTM, GRU, BiLSTM, and BiGRU were explored, and the classification performances of BiLSTM and BiGRU were found to be very similar. However, BiGRU had a shorter training time than BiLSTM (108.69 s vs. 138.86 s) and was therefore used to extract the activation features. To enhance the performance of the classifier, the SLFN with the RELM algorithm was used. The ELM algorithm, which utilizes an SLFN with random hidden weights and biases, was proposed by Huang et al. [46,47]. The ELM has an extremely fast training time and good generalization performance. Deng et al. proposed the RELM, which is based on the structural risk minimization principle of statistical learning theory and overcomes the drawbacks of the ELM [44]. Table 8 summarizes the overall performances of the activation-classification models: the MPNN with BP and the SLFN with the ELM and RELM. The classification performances of the ELM and RELM were very similar (97.8% vs. 98.2% accuracy) because of the very small regularization weight, γ, and the training time of the ELM was shorter than that of BP (7.52 s vs. 10.56 s). Nevertheless, the RELM exhibited the best performance for HAR despite its longer total testing time (feature extraction plus classification) compared to the ELM (0.038 s vs. 0.025 s). Table 9 presents a comparative analysis of our proposed method against other studies that utilized the UCI-WISDM smartphone and/or smartwatch activity and biometrics dataset for six/eighteen activities. Previous studies [36,48,49,50,51,52] classified only six activities, while studies [38,39] classified eighteen activities.
As shown, the proposed HAR system using the hybrid CNN+BiGRU model and the SLFN with the RELM achieved an F1-score of 98.4% and an accuracy of 98.2%, which are among the best results reported in the literature. In the open HAR datasets, the sensors, all accelerometers and gyroscopes, are embedded in smartphones or smartwatches or are body-worn [38,52]. The greater the number of sensors, the higher the accuracy of HAR. Table 10 displays the F1-scores for the 18 physical activities when subsets of the accelerometers and gyroscopes embedded in the smartphones and smartwatches were used. We explored the performance of our proposed method when only some of these sensors were available. When HAR used only the smartphone sensors or only the smartwatch sensors, the average F1-scores were 90.7% and 89.1%, respectively. When only the accelerometers or only the gyroscopes of the smartphones and smartwatches were used, the average F1-scores were 94.1% and 76.9%, respectively. These results suggest that the accelerometers provide more information than the gyroscopes for HAR.

Conclusions
The proposed deep learning model utilizes the hybrid CNN+BiGRU for feature extraction from the signals of sensors embedded in smartphones and smartwatches, and the SLFN with the RELM algorithm for the classification of 18 physical activities, including body, hand, and eating movements. The experimental results demonstrate that the proposed model outperforms existing schemes that utilize deep learning or machine learning methods in terms of F1-score and accuracy. Notably, the worst F1-score was found in the classification of brushing teeth. Our investigation shows that using different models for feature extraction and classification during the training phase can effectively increase recognition accuracy and reduce training time. Moreover, since the data are recorded by smartphones and smartwatches, our proposed method has the potential to be used for real-time mHealth in environments without embedded wireless sensor networks. A weakness of this study is that it ignores the signals recorded during transitions between two activities; we will explore this problem in future work.