Enhanced Hand-Oriented Activity Recognition Based on Smartwatch Sensor Data Using LSTMs

The creation of the Internet of Things (IoT), along with the latest developments in wearable technology, has provided new opportunities in human activity recognition (HAR). The modern smartwatch offers the potential for sensor data to be relayed to novel IoT platforms, which allow the constant tracking and monitoring of human movement and behavior. Traditional activity recognition research has advanced through conventional machine learning methods such as artificial neural networks, decision trees, support vector machines, and naive Bayes. Nonetheless, these conventional machine learning techniques inevitably depend on heuristically handcrafted feature extraction, a practice limited by human domain knowledge. This work proposes a hybrid deep learning model, called CNN-LSTM, that combines Long Short-Term Memory (LSTM) networks with a Convolutional Neural Network (CNN) for activity recognition. The study uses smartwatch-based HAR to categorize hand movements. Using the Wireless Sensor Data Mining (WISDM) public benchmark dataset, the recognition abilities of the deep learning model are assessed. Accuracy, precision, recall, and F-measure are employed as evaluation metrics to assess the recognition abilities of the proposed LSTM models. The findings indicate that this hybrid deep learning model offers better performance than its rivals, achieving an accuracy of 96.2% and an F-measure of 96.3%. The results show that the proposed CNN-LSTM can improve the performance of activity recognition.


Introduction
Miniature sensors have driven rapid growth in the wearable technology sector, since they permit computerized monitoring at any time [1]. A wide range of affordable wearable products are now available, including smartwatches, smartphones, and smart glasses. Through their internal gyroscopes and accelerometers, these devices are able to record significant quantities of data which can then undergo analysis to determine the various activities the wearer has performed [2]. It is then possible to classify those activities and use the information for various purposes, such as monitoring the elderly [3], tracking exercise [4], or protecting people from office workers' syndrome [5].
To date, mobile computing and human activity recognition have become popular and complex [6,7]. The data obtained from the various wearable sensors can be treated as time-series data, and a majority of studies have focused on recognizing activity through the use of smartphones [8][9][10][11]. Almost everyone now carries a smartphone, which increases the potential of the applications they use. However, one drawback is the fact that phones are usually carried in bags or pockets, and are therefore not consistently positioned on the body. Furthermore, a phone carried in a trouser pocket is poorly positioned to record hand movements, and cannot be relied upon to maintain a consistent orientation as it is carried around. For female users, the smartphone is especially inconvenient, as many do not carry it in their pockets at all, preferring instead to use a handbag. Smartwatches, however, can solve many of these issues, and since they are worn in a fixed position on the wrist, they are the best choice for capturing hand movements [12].
There are a number of benefits offered by smartwatches which other forms of inertial sensor technology cannot match. Among these is the ability to merge the features of a smartphone with constant data monitoring. Through its small screen, the user can interact with the smartwatch wherever they are. The watch can be worn at all times, is typically waterproof, and usually has a much longer useful battery lifespan than a smartphone [13]. Moreover, smartphones cannot always be carried during periods of strenuous exercise due to their size and shape. In contrast, smartwatches can be worn while exercising and can, therefore, generate data from the compass, GPS, accelerometer, and gyroscope, as well as monitoring heart rate [14,15].
Today's greater availability of wearable technology has brought about increased interest in HAR in order to bring benefits for people's health and well-being [16]. During the past five years there has been a notable increase in the number of research papers published in this particular field [17], with a majority of these studies focusing on applying conventional machine learning (ML) models to HAR. Such models typically employ ML algorithms such as support vector machines, naive Bayes, decision trees, hidden Markov models, and k-nearest neighbor models. The drawback of such approaches, however, is the reliance upon hand-crafted shallow feature extraction, a practice that depends heavily upon the knowledge of the humans performing the work [18]. Furthermore, these ML approaches apply independent statistical formulas to segment each time step within the time-series data, which causes the time and space relationships within the data to be lost when the models are trained. In more recent years, however, published studies have begun to show the accomplishments of deep learning (DL) models in addressing complicated HAR issues [19,20]. These techniques differ from the traditional ML approaches in that they can learn high-level features directly from raw sensor data.
In this study, a smartwatch-based approach to HAR is examined in the context of hand movement recognition, using the Wireless Sensor Data Mining (WISDM) public smartwatch dataset, which is composed of IMU sensor data recording hand activity, including eating-related activity, from 51 participants. This study proposes a hybrid deep learning model that employs LSTM networks to improve recognition performance.
The remainder of this paper is structured as follows. Section 2 presents the preliminary background and related works in human activity recognition. In Section 3, an overview of the proposed smartwatch-based HAR framework is presented, and the details of each process in the framework are described. Section 4 reports experiments conducted under various scenarios and shows the improvements achieved by the proposed framework compared with the state of the art on standard evaluation metrics. Finally, Section 5 presents the conclusion and challenging future points.

Human Activity Recognition
The definition of human activity covers all physical movements involved in human life, whether necessary or otherwise. Such activity requires energy to be spent in moving the skeletal muscles, with examples including walking, eating, sitting, standing, and so forth. When conducting research involving time-series classification, human activity recognition (HAR) faces the challenges of classifying the sequences of data obtained from wearable sensors in time-series format, to determine clearly defined human movements [21][22][23][24].
Suppose a human performs activities according to a pre-defined activity set A = {a_1, a_2, a_3, ..., a_m}, where m is the number of physical activity categories. Given a time-series sequence s = {d_1, d_2, ..., d_t, ..., d_n} of sensor data readings captured from the human activity, where d_t denotes the sensor reading at time t, n denotes the length of the sequence, and n ≥ m, the HAR task aims to determine a function F that forecasts the physical activity sequence based on the sensor readings s. The function F can be defined as F(s) = {â_1, â_2, ..., â_n}, â_n ∈ A, while the real activity sequence is indicated as a = {a_1, a_2, ..., a_n}, a_n ∈ A. The general framework encompassing HAR using wearable sensor data is shown in Figure 1. This framework comprises four key processes which can be employed to categorize the various forms of physical activity based on raw data from the activity sensors. In the first step, the raw data are gathered from the IMU sensors carried in wearable technology. Then in step two, the data are cleaned and prepared through the preprocessing phase [25]. This involves filtering, and addressing the issue of missing data from sensor errors by including estimations based on statistical formulas. The data must then be segmented and balanced. Because the sensor data appear in time-series form, characteristic features must then be extracted from the high-dimensional data. In the final stage, these features are used to train the model with various ML algorithms, including naive Bayes, support vector machines, or decision trees. Once trained, the ML model is capable of recognizing and categorizing human activity.

Deep Learning with HAR
Important recent studies of HAR [17] have revealed certain problems associated with conventional machine learning techniques which ultimately limit the ability to recognize human activity. This limitation concerns the choice of hand-crafted features, since the selection depends upon the skills and knowledge of the person making the decisions [26]. Within the span of just a few years, however, deep learning has emerged as a suitable alternative approach which can address these limitations effectively. Figure 2 shows how the deep learning process can eliminate the need for explicit feature extraction in the general HAR framework. Recent studies have put forward several deep learning models which can address the time-series classification issues of HAR problems, examining these models to assess their recognition abilities on various benchmark activity datasets. Two of the main models are Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs); they are effective for HAR problems using smartphones, as they achieve strong results on the usual evaluation metrics. These two models are thus investigated in this study to determine their effectiveness in recognizing hand movements using data from smartwatches.

Proposed Framework
In this section, the proposed framework for hand-oriented activity recognition based on smartwatch sensor data is presented. The overall process of the framework is depicted in Figure 3. The framework consists of four main processes (data collection, data preprocessing and segmentation, deep model training, and model evaluation).

Data Collection
The smartwatch data used in this research work is a public benchmark dataset called WISDM from the UCI Repository [27]. This dataset provides tri-axial accelerometer data and tri-axial gyroscope data collected at a rate of 20 Hz (i.e., one reading every 50 ms) from Android smartphones and an Android smartwatch. The smartphone and smartwatch are a Samsung Galaxy S5 running the Android 6.0 operating system and an LG G Watch running Android Wear 1.5, respectively. The raw sensor data are recorded from 51 subjects aged between 18 and 25 years, each performing 18 pre-defined physical activities. All subjects wear the smartwatch on their dominant hand while performing the activities [28,29]. These physical activities can be categorized into three main categories, i.e., non-hand-oriented activities, hand-oriented eating activities, and hand-oriented general activities. Some samples of accelerometer data and gyroscope data are illustrated in Figures 4 and 5.

Data Preprocessing and Segmentation
Exploratory data analysis shows that the activities in the WISDM dataset were recorded separately. This means that each person performed one activity for approximately 3 min in the raw data, followed by the next activity, and so on. Looking at the timestamps, it can be seen that the transitions from one activity to the next are not continuous; the recordings happened in isolation. The smartphone and smartwatch data are also not synchronized, i.e., they were not collected in parallel.
This research analyzes the sensor data through exploratory data analysis and finds that the activity data of 7 subjects do not contain all pre-defined activities. These seven subjects (numbers 1616, 1618, 1637, 1638, 1639, 1640, and 1642) are therefore removed, leaving 44 of the original 51 subjects. To handle the time-series data in the HAR problem, the smartwatch data are segmented by a sliding window of 10 s with a 75% overlapping proportion, as shown in Figure 6. The number of raw sensor data is shown in Table 1. Given the sampling rate of 20 Hz and the window size of 10 s, each sample contains 200 raw sensor readings. These sensor samples are then split into 70% as the training set and the remaining 30% as the test set for evaluating the proposed model. The number of smartwatch sensor samples used in this work is presented in Table 2.
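As an illustration of the segmentation step, the sketch below derives windows of 200 readings with a step of 50 from the stated parameters (20 Hz sampling, 10 s windows, 75% overlap). The function name and the toy input are, of course, not from the original work:

```python
import numpy as np

def sliding_windows(data, window=200, overlap=0.75):
    """Segment a (n_samples, n_channels) sensor stream into
    fixed-length windows with the given overlapping proportion."""
    step = int(window * (1 - overlap))  # 200 * 0.25 = 50 readings
    n = (len(data) - window) // step + 1
    return np.stack([data[i * step : i * step + window] for i in range(n)])

# 60 s of 6-channel (accelerometer + gyroscope) data at 20 Hz
stream = np.zeros((1200, 6))
windows = sliding_windows(stream)
print(windows.shape)  # (21, 200, 6)
```

With a 75% overlap, each new window advances by only a quarter of a window (2.5 s), which multiplies the number of training samples available from the same recordings.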

Deep Model Training
LSTM is an extension of the recurrent neural network (RNN) deep learning model [30], proposed to tackle the vanishing and exploding gradient problems [31]. The computational process of the LSTM network can be detailed as follows: (1) any input data X = {x_0, x_1, x_2, ..., x_t, x_{t+1}, ...} is combined with the hidden-layer data of the previous step and transformed into the hidden layer H = {h_0, h_1, h_2, ..., h_t, h_{t+1}, ...} by matrix transformation, and (2) the hidden-layer output passes through an activation function to produce the output layer Y = {y_0, y_1, y_2, ..., y_t, y_{t+1}, ...}, as illustrated in Figure 7. The LSTM network has a characteristic structure for memorizing information over a long stretch of time. The input and forget gates control how new information replaces the internal memory by comparing the memory contents with the incoming data. This process allows gradients to circulate effectively through time.
As shown in Figure 8, the input gate (i), forget gate (f), output gate (o), and memory cell (C) of the LSTM are designed to control what information should be stored, updated, and deleted. Gating is the technique of selectively passing the needed information [32]. This technique is composed of the sigmoid function and the Hadamard product. The derived output value, within the range [0, 1], determines how much of the information the multiplication lets through. An effective practice is to initialize these gates to a value of 1, or close to 1, so as not to debilitate training at the start. Each parameter at time t can thereby be defined within the LSTM node.
From Figure 8, each gate that works inside an LSTM cell is mathematically defined by the following expressions [33]:

f_t = σ(W_f x_t + U_f h_{t-1} + b_f)   (3)
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)   (4)
g_t = tanh(W_g x_t + U_g h_{t-1} + b_g)   (5)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)   (6)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t   (7)
h_t = o_t ⊙ tanh(c_t)   (8)

where:
• Forget gate (f_t) controls which unnecessary data are deleted.
• Input gate (i_t) controls the input activation of new data into the memory cell.
• Input modulation gate (g_t) controls the principal input to the memory cell.
• Output gate (o_t) controls the flow of output.
• Internal state (c_t) controls the constitutional recurrence of the cell.
• Hidden state (h_t) controls the data from the preceding data case within the context window.
Equations (3)-(8) describe the detailed process of the LSTM network as follows: (1) the forget gate determines which unnecessary data to delete; (2) the input gate determines which new data to store in memory, after which the old node state C_{t-1} is updated to the new node state C_t; and (3) finally, the output gate determines which data should be output to the layer above. To enhance recognition performance, this research proposes a hybrid LSTM network called CNN-LSTM, which consists of two convolutional layers and one LSTM layer. The structure of CNN-LSTM is illustrated in Figure 9. The proposed model begins with two 1D-CNN layers with 3 × 1 kernels for extracting deep spatial features from the smartwatch sensor data. The number of filters is set to 128 and 64 for the first and second 1D-CNN layers, respectively. The rectified linear unit is used as the activation function in the proposed model. After the second 1D-CNN layer, the dropout technique is applied. An LSTM layer with 128 units, followed by another dropout layer, then extracts temporal features. Finally, a fully connected dense layer with a softmax function is applied to the LSTM output at the final time step.
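To make the layer dimensions concrete, the following sketch traces one 10 s window through two 1D convolutions with 3 × 1 kernels and 128/64 filters, as described above. This is a simplified NumPy illustration rather than the authors' TensorFlow implementation; in particular, the "valid" padding and the weight scale are assumptions:

```python
import numpy as np

def conv1d(x, kernels):
    """Valid 1D convolution with ReLU: x is (steps, in_ch),
    kernels is (k, in_ch, out_ch)."""
    k, _, out_ch = kernels.shape
    steps = x.shape[0] - k + 1
    out = np.zeros((steps, out_ch))
    for t in range(steps):
        # contract the kernel window over time and input channels
        out[t] = np.tensordot(x[t:t + k], kernels, axes=([0, 1], [0, 1]))
    return np.maximum(out, 0.0)  # rectified linear activation

rng = np.random.default_rng(1)
window = rng.standard_normal((200, 6))                       # one 10 s window at 20 Hz
h1 = conv1d(window, 0.1 * rng.standard_normal((3, 6, 128)))  # 1st 1D-CNN, 128 filters
h2 = conv1d(h1, 0.1 * rng.standard_normal((3, 128, 64)))     # 2nd 1D-CNN, 64 filters
print(h1.shape, h2.shape)  # (198, 128) (196, 64)
# h2 would then feed the 128-unit LSTM layer and the softmax dense layer.
```

Each 3 × 1 kernel shortens the window by two time steps under valid padding, leaving a 196-step feature sequence for the LSTM to model temporally.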

Model Evaluation
Four evaluation metrics for multi-class classification are generally employed to verify the performance of the proposed CNN-LSTM model: accuracy, precision, recall, and F-measure are considered as the performance evaluation metrics in this research work. A correctly recognized activity is counted as a True Positive (TP) or True Negative (TN); an incorrectly classified activity is counted as a False Positive (FP) or False Negative (FN). The other performance metrics are derived from these counts. Let TP = ∑_{c=1}^{N} TP_c denote the total true positives of a classifier over all classes, TN = ∑_{c=1}^{N} TN_c the total true negatives, FP = ∑_{c=1}^{N} FP_c the total false positives, and FN = ∑_{c=1}^{N} FN_c the total false negatives. These metrics are detailed as follows.

Accuracy
Accuracy is the ratio of correctly classified instances, i.e., the number of correct classifications divided by the total number of classifications: Accuracy = (TP + TN) / (TP + TN + FP + FN).

Precision
Precision measures the fraction of instances classified as positive that are truly positive, Precision_c = TP_c / (TP_c + FP_c). The total precision (Precision) is the average of the per-class precision values Precision_c.

Recall
Recall shows the effectiveness at correctly forecasting positive instances, i.e., the fraction of actual positives that are recognized, Recall_c = TP_c / (TP_c + FN_c). The total recall (Recall) is the average of the per-class recall values Recall_c.

F-Measure
The F-measure balances precision and recall, which is especially useful when there is an uneven class distribution (a larger number of actual negatives). The F-measure for each class is denoted by F-measure_c = 2 · Precision_c · Recall_c / (Precision_c + Recall_c).
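The four metrics above can be computed from a multi-class confusion matrix. The sketch below is a generic illustration, not the paper's evaluation code; the 2 × 2 matrix is an invented toy example:

```python
import numpy as np

def per_class_metrics(cm):
    """Accuracy plus per-class precision, recall, and F-measure from a
    confusion matrix cm, where cm[i, j] counts true class i predicted as j."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp          # predicted as class c but actually not c
    fn = cm.sum(axis=1) - tp          # actually class c but predicted otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision, recall, f_measure

cm = np.array([[8, 2],
               [1, 9]])
acc, p, r, f = per_class_metrics(cm)
print(round(acc, 3))  # 0.85
```

Averaging `p`, `r`, and `f` over the classes yields the macro-averaged total precision, recall, and F-measure defined above.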

Experiments
The experimental implementation uses Python 3.6.9 and TensorFlow 2. Three scenarios are evaluated:
• Scenario 1: only raw accelerometer data;
• Scenario 2: only raw gyroscope data;
• Scenario 3: raw accelerometer data and raw gyroscope data.
Table 3 shows the layer types of the baseline deep learning models (CNN and LSTM) and the proposed CNN-LSTM model. Table 3. Layer types of the baseline deep learning models and the proposed CNN-LSTM model.

Experimental Results
For performance evaluation, the smartwatch sensor data from the WISDM dataset are split into 70% for training the models and the remaining 30% for testing the trained models. The experimental results of all three scenarios are shown in Table 4. From the results in Table 4, the proposed CNN-LSTM model outperforms the other DL models in every activity. Although recognition performance is analyzed with accuracy, precision, recall, and F-measure, only the F-measure results are presented per activity, because this metric incorporates both precision and recall. The confusion matrix for the proposed CNN-LSTM network is shown in Table 5. Because this research focuses on comparing different activities, the F-measure is chosen as the performance metric for this objective, and the effectiveness of the CNN-LSTM can be compared using the F-measure of each category of activities. The F-measure results of the different DL models trained on smartwatch sensor data are shown in Figure 11. Figure 11. F-measure of the different deep learning (DL) models (trained from smartwatch sensor data on the WISDM dataset) for (a) Non-Hand-Oriented Activities, (b) Hand-Oriented Activities (General), and (c) Hand-Oriented Activities (Eating).
To avoid the results being biased by a particular choice of training and test data, a second experiment is conducted using different combinations. The training, validation, and test sets are obtained using stratified ten-fold cross-validation. Of the subsamples, 80% are used as training data; the remaining 20% are split between the validation and test sets. In each fold, we ensure that the subsamples from each trial are not assigned to more than one set. The experimental results of stratified ten-fold cross-validation, with standard deviation values, are shown in Table 6. Table 6. Experimental results of stratified ten-fold cross validation.
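A stratified fold assignment of the kind described above can be sketched as follows. This is a simplified illustration under assumed details (the round-robin scheme and toy class counts are not from the original work), showing only how stratification keeps class proportions equal across folds:

```python
import numpy as np

def stratified_folds(labels, k=10, seed=0):
    """Assign each sample to one of k folds so that every fold keeps
    roughly the same class proportions (a sketch of stratified k-fold)."""
    rng = np.random.default_rng(seed)
    folds = np.empty(len(labels), dtype=int)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        folds[idx] = np.arange(len(idx)) % k  # deal class c round-robin
    return folds

y = np.repeat([0, 1, 2], [50, 30, 20])  # imbalanced toy activity labels
folds = stratified_folds(y)
# each fold holds ~10% of every class: 5 of class 0, 3 of class 1, 2 of class 2
print(np.bincount(y[folds == 0]))  # [5 3 2]
```

Holding out one fold at a time for evaluation then guarantees that rare activity classes appear in every test split in the same proportion as in the full dataset.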


Comparison of Proposed CNN-LSTM Network and Other Conventional Deep Learning Networks
This section explores the effect of different deep learning network structures on model performance. As detailed in Table 4, three model compositions (CNN, LSTM, and the proposed CNN-LSTM) are implemented for comparison of the experimental results. Activity recognition results are evaluated with accuracy, precision, recall, and F-measure on the test set. Moreover, the computation speed per training iteration is reported. The experiments are conducted on the WISDM dataset.
The classical convolutional neural network (CNN) structure is commonly followed by a fully connected layer to combine the features extracted by the previous layers. The F-measure of the CNN reaches 93.10% when the trained model is evaluated on the test set comprising both accelerometer and gyroscope sensor data. Although the classical CNN can enhance the recognition performance of the model, a massive number of parameters, more than 4 million, must unavoidably be carried, and the training time is still more than 9.21 s per epoch. The LSTM structure is the vanilla long short-term memory neural network, which is capable of perceiving temporal information in sequential data. The F-measure of this model comes to 89.60% when both accelerometer and gyroscope sensor data are used for training. Unlike the CNN structure, the LSTM structure requires a much smaller number of parameters, only 52 thousand, but its training time exceeds 212.77 s per epoch.
In the proposed CNN-LSTM structure, the sensor data captured from the smartwatch are first fed into the two CNN layers and then passed to the LSTM layer to complete the feature extraction process. The proposed model surpasses the other deep learning networks in F-measure performance with 96.30%. Moreover, the proposed CNN-LSTM keeps the parameter count moderate, at 1.6 million model parameters, and the model training time is only 20.46 s per epoch. Thus, the proposed model not only achieves a high F-measure recognition rate but also greatly simplifies the model structure with a smaller number of parameters.
Considering model training speed, Figure 10 displays a training speed comparison (in terms of model loss and model accuracy). The training loss of the proposed CNN-LSTM decreases faster than that of the baseline CNN and LSTM models. Moreover, the CNN-LSTM is more suitable than the other conventional deep learning models in terms of training time per epoch.
The previous work [27] used hand-crafted features with conventional machine learning methods (decision tree and k-nearest neighbor) or ensemble learning (random forest), and its results indicated that the random forest (RF) outperformed the others. The proposed CNN-LSTM model is therefore compared with these existing models trained on the same smartwatch dataset. The comparative results are summarized in Table 7. The proposed CNN-LSTM improves the accuracy in every scenario and combination of sensor data. Overall, the proposed CNN-LSTM model outperforms the others in the conducted experiments, because the CNN-LSTM is efficient at dealing with complicated time-series problems. A comparison of the performance results for the eating and general activities between the previous work and the proposed CNN-LSTM model is shown in Figure 12.

Conclusions and Future Works
This work studied hand-oriented activity recognition using sensor data captured from a smartwatch, as collected in the WISDM dataset. A framework for smartwatch-based human activity recognition is proposed to enhance recognition performance with a hybrid Long Short-Term Memory network. Three scenarios were employed to evaluate each deep learning model using different smartwatch sensors, and evaluation metrics such as accuracy, precision, recall, and F-measure were applied. The experimental results show that the proposed CNN-LSTM model outperforms the other baseline deep learning models with the highest performance metrics in every scenario. Moreover, the proposed model was compared with previous works; the comparative results show that the recognition performance of every scenario can be enhanced by the proposed model, especially for hand-oriented activities.
Research on conventional deep learning techniques for HAR generally depends on CNNs and LSTMs. Deep CNNs have two advantages over other models, local dependency and scale invariance, but they can only extract spatially deep features from raw sensor data. LSTMs, the other conventional deep learning model, can exploit the temporal dependencies in raw sensor data and have become the common alternative for modelling human movement from captured sensor data. The CNN-LSTM proposed in this research therefore has the advantage of extracting both the spatial and the temporal features of human activity signal data. For future work, developing other kinds of deep learning models and improving recognition performance through data augmentation techniques are considered for personalized smartwatch-based HAR.