Recognition of Fine-Grained Walking Patterns Using a Smartwatch with Deep Attentive Neural Networks

Generally, people do various things while walking. For example, people frequently walk while looking at their smartphones. Sometimes we walk differently than usual; for example, when walking on ice or snow, we tend to waddle. Understanding walking patterns could provide users with contextual information tailored to the current situation. To formulate this as a machine-learning problem, we defined 18 different everyday walking styles. Noting that walking strategies significantly affect the spatiotemporal features of hand motions, e.g., the speed and intensity of the swinging arm, we propose a smartwatch-based wearable system that can recognize these predefined walking styles. We developed a wearable system, suitable for use with a commercial smartwatch, that can capture hand motions in the form of multivariate timeseries (MTS) signals. Then, we employed a set of machine learning algorithms, including feature-based and recent deep learning algorithms, to learn the MTS data in a supervised fashion. Experimental results demonstrated that, with recent deep learning algorithms, the proposed approach successfully recognized a variety of walking patterns, using the smartwatch measurements. We analyzed the results with recent attention-based recurrent neural networks to understand the relative contributions of the MTS signals in the classification process.


Introduction
As wearable devices are gaining popularity, wearable-based human activity recognition (HAR) has attracted increasing attention. Some fundamental functionalities have been adopted by many consumer smartwatches. For example, the device may encourage us to stand up if we sit still for a long time or request an SOS if we fall while alone. With the recent advances in sensors and wearable technologies, many studies have investigated using smartwatches as data-collection equipment [1][2][3][4]. To date, many HAR studies have focused on the coarse-grained classification of human movements, such as walking, running, sitting, and lying, each of which is a distinct activity.
However, in various situations, it is often necessary to recognize fine-grained movements. In some cases, fine-grained classification would make computational experiences contextually aware [5]. For example, differentiating regular walking from inclined walking (walking on steps or on a uniform slope) may be required for the precise calculation of human energy expenditure [6]. In addition, recognition of a slight tremor when walking would make screening processes, such as for Parkinson's disease [7], more precise.
In a similar context, we focus on the fine-grained classification of walking, which is a fundamental movement that comprises the largest proportion of humans' daily movements, and propose a system that can recognize predefined walking styles in a supervised manner. To that end, we defined 18 different walking styles, such as regular walking and walking while carrying an object (e.g., an umbrella, a smartphone, or a dumbbell).
Many applications require fine-grained activity recognition; however, achieving a high recognition rate is challenging, because similar movements produce similar signals. A recent work reported that errors occurred when its system classified similar movements that involved walking patterns, e.g., differentiating regular walking from vacuum cleaning [2]. Weiss et al. [4] proposed a system that classifies various everyday activities using a consumer smartwatch. They reported that recognizing similar hand-oriented eating activities, such as eating pasta and soup, was challenging. Kwapisz et al. [8] also proposed a system to classify similar walking activities, including regular walking and ascending and descending stairs. However, ascending and descending stairs were frequently evaluated as identical movements.
Extensive feature-engineering work may mitigate such recognition issues; however, finding the ideal set of features for a classification process is time-consuming [9,10]. Moreover, classification with manually defined features may not be able to capture subtle differences between similar but distinct complex temporal patterns. To address the challenges in recognizing fine-grained activities, we adopted recent deep neural networks, such as one-dimensional convolutional neural networks (Conv1D) and gated recurrent neural networks (RNNs), namely long short-term memory (LSTM) and gated recurrent units (GRU).
Although deep learning algorithms can learn complex and hierarchical features automatically from raw multivariate timeseries (MTS) data, the learning process is normally not designed to explain how its internal model works. To learn an interpretable representation and visualize the indicators of the raw data that seem influential in the model's evaluations, we further utilized attention-based neural networks.
The primary contributions of this paper are as follows:
1. We defined a set of fine-grained walking styles that appear every day and proposed a wearable system that can recognize these predefined patterns in a supervised fashion.
2. We conducted an experiment to validate the feasibility of an intelligent wearable system with feature-based machine learning and recent deep learning algorithms, including attention-based deep neural networks.
3. We visualized and analyzed the parameters in the attention layer, which indicate the extent to which the classification result would depend on input signals from different time steps.

Fine-Grained Recognition of Walking Activity
Although quality of walking is used as a measure of the healthiness of a person [7,11,12], few studies have undertaken detailed classification of walking motion, as summarized in Table 2.
In an earlier pioneering work, Bao and Intille [13] proposed a system that classifies daily movements, including activities related to walking (e.g., regular walking, walking while carrying items, and ascending stairs), using measurements from multiple on-body accelerometers. They found that overall recognition accuracy was highest when a decision-tree classifier was used for the task. They also envisioned that machine learning algorithms could be used to recognize different types of walking styles, such as walking slowly and walking briskly. In another study, a smart band-based wearable system was proposed to recognize five different walking styles, such as walking while texting or calling, with hands in pockets, while carrying a suitcase, and regular walking; it achieved high and robust classification performance with a support vector machine (SVM)-based classification model [14]. Another previous work proposed a wearable system that utilized gait phase information [15]. Based on a walking distance-estimation algorithm and a decision-tree model, their system successfully recognized three different walking strategies: regular walking, walking upstairs, and walking downstairs. Interestingly, another previous work demonstrated that acceleration information could be used to recognize differently inclined surfaces in a supervised fashion [16]. The authors proposed using customized time-frequency domain features to recognize walking on different inclines based on a Gaussian-mixture-model classifier, and their experimental results demonstrated remarkable classification accuracy. They also emphasized that the normalization process for features is crucial to minimize individual variation. A HAR system based on a body-worn smartphone was proposed in another recent study [17]. The proposed deep neural network learned the features successfully in an end-to-end fashion, after turning raw input signals into a multi-channel image using Fourier and wavelet transformations, resulting in high classification performance. Table 2 summarizes previous studies on walking-related activity recognition.

Smartwatch-Based Activity Recognition
With the advances in sensor and wearable technologies, studies using smartwatches to recognize human activities have been increasing. In real-life situations, using a smartwatch to capture human activity is advantageous compared to using a smartphone, in that a smartwatch is normally worn on a specific body part (e.g., the wrist) and does not interfere with body movements.
One crucial but implicit assumption of using a smartwatch to recognize various human activities is that different types of activities would result in different hand movements; thus, types of the whole-body activities could be recognized (or observed) differently using measurements from the smartwatch. Based on this assumption, there have been numerous studies on HAR using recent smartwatches, particularly during the last decade.
For example, an earlier study investigated the possibility of using a smartwatch to recognize 18 different everyday activities [20]. Remarkably, the authors achieved high accuracy and $F_m$ by proposing a stacked architecture comprising a convolutional neural network (CNN) and an LSTM.
In another previous work, Mekruksavanich et al. [21] proposed a smartwatch-based system that can recognize six different human activities, i.e., sitting, standing, lying, walking, walking upstairs and walking downstairs, in the context of preventing office workers syndrome. With nine different selected features and an ensemble model, they achieved 93.5% classification accuracy. In a follow-up study, they used an LSTM-based deep neural network and achieved 96.3% classification accuracy [22].
A recent work explored and validated the feasibility of sensing hand-oriented activities using consumer smartwatches [5]. Based on an analysis of the spatiotemporal aspect of inertial hand movements using a recent deep CNN model, they achieved 95.2% accuracy across 25 fine-grained everyday hand activities.
Although we have summarized relevant recent studies, it is important to note that research into smartwatch-based activity recognition systems is in an early stage.
In this paper, we assumed that different types of walking activities generally involve different dynamic hand motions, as shown in Figure 1. Note that different walking strategies would result in different arm-swing patterns. From this perspective, we hypothesized that differences in MTS motion signals from different walking patterns could be learned by machine learning algorithms. To validate our hypothesis, we first developed an intelligent wearable system that leverages recent advances in artificial neural networks. Then, we conducted an experiment in which participants were asked to walk as instructed with the device on their wrist. We will describe the experiment and the results in the following section.

Experiment
In this section, we first describe the wearable system developed for the proposed fine-grained activity recognition task. As described in the previous section, we focused on a wrist-worn smartwatch because walking patterns affect hand motions while walking; this differs from previous studies that recognized walking patterns using sensors attached to the leg [15]. We then describe the activities defined in this study and the experimental procedure conducted to validate the proposed approach's feasibility.

Equipment
In the data collection process, we used a consumer smartwatch (DW9F1 by Fossil Group, Inc., Texas, USA) as the sensing device and a smartphone (Galaxy Note 20 by Samsung Electronics Co. Ltd., Korea) as the host device. For the smartwatch, we developed a custom application using Wear OS by Google to capture the inertial movements of the hand in the form of MTS data. Here, sensor values from the built-in motion sensors (e.g., triaxial accelerometer and gyroscope) were captured every 20 ms (i.e., at 50 Hz). For the smartphone, we developed a custom host application to manage the smartwatch application remotely over a Bluetooth low-energy (BLE) connection. With the host application, the experimenter can assign a label to the motion, take notes for the experiment, and control the start and end of the capture process remotely. Figure 2 shows the smartwatch device used in this study (left) and an example of the custom application running on the smartwatch (right).

Activity Definition
We defined a set of 18 different walking styles (Table 1 and Figure 3) that are used frequently in daily life. For motion classes C4 and C5, we asked the participants to read arbitrary content displayed on the smartphone while walking. For motion classes C6, C7, and C8, the participants walked while holding a 2-kg dumbbell (approximately 4.4 pounds) in the left, right, and both hands, respectively, to simulate holding a heavy load (e.g., groceries).
Figure 3. Walking styles listed in Table 1. The class index is displayed at the bottom-right corner of each picture. Class C17 (i.e., doing something while sitting and standing) was added as the reference class.

Problem Definition
Given the MTS input data $x = (x_1, x_2, \ldots, x_T) \in \mathbb{R}^{T \times D}$, the machine learning systems for activity recognition attempt to estimate $y \in \mathbb{R}^{M}$, i.e., a type of activity from a predefined set of activities. Here, $x_t \in \mathbb{R}^{D}$ represents the $t$-th measurement, $T$ and $D$ (= 6 in our case) represent the length of the signal and the dimension of the sensor data, respectively, and $M$ denotes the number of activity types. Figure 4 shows the pipeline of the machine learning process used in this study.
Figure 4. Pipeline of the machine learning process. A feature-based approach, in which the machine learning step is preceded by a feature engineering process, is used as the baseline for the deep learning-based approach.

Data Collection
Thirty-six subjects (20 to 62 years old; average age: 27.91; standard deviation: 11.57 years) participated in this experiment. Note that all participants self-reported being right-handed. In this experiment, the participants wore the smartwatch on their non-dominant hand (i.e., the left wrist).
The participants were asked to walk according to the instructed walking styles. For class C0, we instructed participants to walk at a self-paced speed but not at higher intensities exceeding moderate levels. Most participants walked at least one lap around the 400-m campus track.
For classes C15 and C16, the participants were moved to stairs, and for classes C13 and C14, the participants walked up and down ramps (inclined approximately 10 degrees), respectively, on the university campus.
Although the experiment was conducted in different seasons (winter to summer), the amount of data obtained for classes C2 and C3 (walking with an umbrella) and class C1 (walking on thick snow) was relatively small compared to the other cases because specific weather conditions were required for data collection. In addition, a relatively small amount of data was collected for class C12 (jogging) because this task was performed in a shorter time over the same distance. Note that we instructed the participants to stop the trial whenever they felt uncomfortable, to avoid becoming tired after the experiment.
The total time taken for each class is shown in Table 3. Cumulatively, we collected a total of 45.18 h (std: 0.72) of data from the 36 participants.

Data Segmentation
As described in Section 3.1, labelled information was assigned by the host device during the experiment. The collected MTS data were normalized by removing the mean and scaling to unit variance on each axis. The preprocessed data were then segmented using two different partitioning windows (T = 100 and 150 samples, accounting for 2 and 3 s of movement, respectively) without overlaps between adjacent segments. Here, we selected motion segment lengths of T = 100 and 150 because common walking activities have a cycle of less than 2-3 s. Note that we did not align the signals according to the walking phase, so that the machine learning models could learn features from each activity regardless of the activity phase (Figure 5), a viable strategy according to a recent study [17].
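The normalization and segmentation steps described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the function names `standardize` and `segment` are hypothetical, and we assume that the statistics are computed per recording and that trailing samples that do not fill a window are discarded, neither of which is stated in the text.

```python
import numpy as np

def standardize(x):
    """Remove the per-axis mean and scale each axis to unit variance."""
    mean = x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, keepdims=True)
    std[std == 0] = 1.0  # guard against constant channels
    return (x - mean) / std

def segment(x, window):
    """Split an (N, D) recording into non-overlapping (window, D) segments.

    Trailing samples that do not fill a whole window are dropped (assumption).
    """
    n_segments = x.shape[0] // window
    return x[: n_segments * window].reshape(n_segments, window, x.shape[1])

# Synthetic 6-axis recording (accelerometer + gyroscope) sampled at 50 Hz.
rng = np.random.default_rng(0)
recording = rng.normal(loc=2.0, scale=3.0, size=(1030, 6))

normalized = standardize(recording)
segments_2s = segment(normalized, window=100)  # T = 100 samples, 2 s
segments_3s = segment(normalized, window=150)  # T = 150 samples, 3 s
print(segments_2s.shape, segments_3s.shape)
```

Because the windows are cut at fixed offsets, each segment starts at an arbitrary point of the walking cycle, which matches the decision not to align signals to the walking phase.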

Feature-Based Machine Learning
Rather than relying on time-consuming feature-selection tasks, we employed the tsfresh library [9] to extract statistically significant timeseries features. The tsfresh library provides highly parallel feature selection algorithms based on the Benjamini-Yekutieli procedure [24], which is a false-discovery-rate-controlling procedure.
In the feature-extraction process, a comprehensive set of features (4686 = 781 × 6 features in our case) was extracted from the segmented MTS signal $x \in \mathbb{R}^{T \times D}$. We then selected the 180 most significant features. Here, approximately 30 features could be extracted for each axis based on the significance hypothesis test. The entire feature extraction process is illustrated in Figure 5, and Table 4 shows the 12 most significant features based on the results of the feature significance hypothesis test.
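To illustrate the false-discovery-rate control mentioned above, the following is a minimal NumPy sketch of the Benjamini-Yekutieli step-up procedure applied to a vector of per-feature p-values. It is not the tsfresh implementation; the function name `benjamini_yekutieli`, the example p-values, and the FDR level of 0.05 are assumptions made for illustration.

```python
import numpy as np

def benjamini_yekutieli(p_values, fdr_level=0.05):
    """Return a boolean mask of the hypotheses rejected by the BY procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # ranks from smallest to largest p-value
    c_m = np.sum(1.0 / np.arange(1, m + 1))    # harmonic correction for arbitrary dependence
    thresholds = np.arange(1, m + 1) * fdr_level / (m * c_m)
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest rank that meets its threshold
        rejected[order[: k + 1]] = True        # reject every hypothesis up to that rank
    return rejected

# Hypothetical p-values, one per candidate feature.
p_values = [0.0001, 0.0004, 0.0019, 0.0095, 0.0201,
            0.0278, 0.0298, 0.0344, 0.0459, 0.3240]
mask = benjamini_yekutieli(p_values, fdr_level=0.05)
print(mask)
```

Features whose mask entry is true are regarded as relevant; ranking the surviving features by p-value then yields a fixed-size set such as the 180 features used here.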
As the baseline, we used a set of feature-based classifiers, including naïve Bayes (NB), support-vector-machine (SVM) [25], and random-forest (RF) [26] classifiers. The NB classifier is a probabilistic model based on Bayes' theorem [27]. It is applicable to many practical problems; however, its performance often degrades due to the naïve assumption that features are conditionally independent and contribute equally to the output. The RF classifier utilizes ensemble learning, a machine-learning technique that combines many decision-tree classifiers. It can handle high-dimensional data efficiently and can mitigate the overfitting issue [28]. The SVM classifier is a machine-learning tool that is effective at classifying high-dimensional data [25]. In this study, the radial basis function (RBF) was used as the kernel function.
Figure 5. Flow of the feature-extraction and feature-selection processes using the tsfresh library [9]. The final feature set is selected according to the p-values from the feature significance test.
We adopted Conv1D, LSTM, and GRU to learn features and classify the segmented MTS signal $x \in \mathbb{R}^{T \times D}$. In addition, we employed attention-based LSTMs and GRUs to learn an interpretable representation that describes which parts of the input sequence receive the model's attention during classification. We adopted the attention mechanism, initially devised for machine translation tasks, to densely visualize the model's attention and thereby explain and interpret how the models come to a decision.

Conv1D
A convolutional neural network (ConvNet) is a particular type of artificial neural network comprising multiple building blocks, e.g., alternating convolution and pooling layers to learn features, and fully-connected layers for classification and regression. A ConvNet extracts local features efficiently at a specific hidden layer by limiting the size of the receptive fields of filters (i.e., sparse connectivity). It also learns the spatial hierarchies of features using stacked deep-layer structures. Especially during the last few years, it has successfully demonstrated its capability to learn features from different types of information, such as regular images, spectral data [5,17,29], and 3D volumes [30]. In a one-dimensional convolutional neural network (Conv1D), convolutional kernels are convolved with the layer input over a single temporal/spatial dimension [31,32] to produce latent features. Conv1D can learn hierarchical features with low computational complexity, as its major operation is simply a weighted sum of two one-dimensional arrays [33]. It is therefore widely used in many practical sequence classification tasks, e.g., sentence classification [32], earthquake detection [34], surface recognition [35], context understanding [36], and real-time electrocardiogram monitoring [37]. Similar to a recent work [38], we set all the kernel sizes (i.e., the length of the 1D convolution window) to 3 and the stride length of the convolution to 1.
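As an illustration of the weighted-sum operation described above, the sketch below implements a single one-dimensional convolution layer in NumPy with a kernel size of 3 and a stride of 1. The function name `conv1d`, the number of filters (16), the absence of padding, and the ReLU activation are assumptions for illustration; the text does not specify them.

```python
import numpy as np

def conv1d(x, kernels, bias, stride=1):
    """Unpadded 1D convolution (cross-correlation) over the time axis.

    x:       (T, D_in) input sequence
    kernels: (K, D_in, D_out) filter bank
    bias:    (D_out,)
    returns: (T_out, D_out) with T_out = (T - K) // stride + 1
    """
    k = kernels.shape[0]
    t_out = (x.shape[0] - k) // stride + 1
    out = np.empty((t_out, kernels.shape[2]))
    for i in range(t_out):
        window = x[i * stride : i * stride + k]  # (K, D_in) slice of the input
        # Weighted sum of the window with every filter at once.
        out[i] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(150, 6))                    # one 3-s segment, D = 6
kernels = rng.normal(scale=0.1, size=(3, 6, 16)) # kernel size 3, 16 filters (assumed)
features = np.maximum(conv1d(x, kernels, np.zeros(16)), 0.0)  # ReLU (assumed)
print(features.shape)
```

With a kernel size of 3, a stride of 1, and no padding, each layer shortens the sequence by two time steps, so the 150-sample segment yields 148 feature vectors.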

LSTM
The standard RNN with the traditional tanh unit suffers from the vanishing and exploding gradient problem, which makes it difficult to learn long-term dependencies. LSTM was proposed to mitigate this issue. LSTM can learn long-term dependencies using memory-cell and gate units [39], and LSTM-based architectures have been employed in many sequence classification applications [35,36]. The memory cell stores information taken from the input and previous cells over the given period. This information is controlled by the gate units, i.e., the update, forget, and output gates.

GRU
Similar to LSTM, the GRU [40] performs better than the basic RNN in many sequence transduction tasks, e.g., language modelling [41] and torque generation [42], and in many sequence classification tasks [36,43]. For the GRU- and LSTM-based architectures, we stacked two recurrent layers (i.e., a stacked two-layer GRU/LSTM [44]) to retain more long-term dependence information. The dimensionality of the output space of the recurrent hidden states was set to $T$, identical to the length of the input signal $x \in \mathbb{R}^{T \times D}$.
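The stacked two-layer recurrent encoder described above can be sketched with a plain NumPy GRU forward pass. This is an illustrative, untrained model: the helper names `init_gru` and `gru_layer`, the random initialization, and the gating convention (the update gate weights the candidate state) are assumptions, and some references use the opposite convention for the update gate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_gru(input_dim, hidden_dim, rng):
    """Random parameters for one GRU layer: update (z), reset (r), candidate (n)."""
    def mat(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))
    return {gate: (mat(input_dim, hidden_dim), mat(hidden_dim, hidden_dim), np.zeros(hidden_dim))
            for gate in ("z", "r", "n")}

def gru_layer(x, params):
    """Run one GRU layer over x of shape (T, input_dim); return all states (T, hidden_dim)."""
    wz, uz, bz = params["z"]
    wr, ur, br = params["r"]
    wn, un, bn = params["n"]
    h = np.zeros(bz.shape[0])
    states = []
    for x_t in x:
        z = sigmoid(x_t @ wz + h @ uz + bz)        # update gate
        r = sigmoid(x_t @ wr + h @ ur + br)        # reset gate
        n = np.tanh(x_t @ wn + (r * h) @ un + bn)  # candidate state
        h = (1.0 - z) * h + z * n                  # blend previous state and candidate
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, D = 100, 6
x = rng.normal(size=(T, D))
layer1 = init_gru(D, T, rng)   # hidden size set to T, as in the text
layer2 = init_gru(T, T, rng)
a = gru_layer(gru_layer(x, layer1), layer2)  # stacked two-layer encoder
print(a.shape)
```

The second layer consumes the full sequence of hidden states from the first layer, so the encoder output keeps one hidden state per time step.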

GRU and LSTM with Attention Mechanism
Although gated RNNs (e.g., LSTM and GRU) and Conv1D have demonstrated their effectiveness in various sequence classification tasks, it remains difficult to explain and interpret how the models come to a decision. Thus, for the proposed classification task, we utilized attention-based RNNs, which are typically applied to a variety of sequence transduction tasks in which alignments between different modalities must be learned [45][46][47].
Here, we adopted a multiplicative attention mechanism, which reduces encoder/decoder states to an attention score via a simple matrix multiplication [46]. As shown in Figure 6, our network comprises an LSTM/GRU-based sequence encoder, an attention layer, and a classification layer.
Given the MTS input data $x = (x_1, x_2, \ldots, x_T)$, where $x_t \in \mathbb{R}^{D}$ represents the $t$-th measurement, the sequence encoder generates a sequence of hidden states $a = (a_1, a_2, \ldots, a_T)$, where $a_t \in \mathbb{R}^{h}$ represents the output of the $t$-th data point.
The context vector $c_T$, which is a weighted sum of $a$ and captures relevant source-side information to predict the label of the input signal, is calculated by multiplying the attention weights $\alpha$ with the encoder outputs $a$ as follows:
$$c_T = \sum_{t=1}^{T} \alpha_{T,t}\, a_t$$
Here, $\alpha_{T,t}$ describes the amount of attention that $\hat{y}_T$ should pay to the input feature at time $t$ (i.e., $a_t$). The alignment score is normalized with a softmax layer to produce the attention weights:
$$\alpha_{T,t} = \frac{\exp\left(\mathrm{score}(a_T, a_t)\right)}{\sum_{t'=1}^{T} \exp\left(\mathrm{score}(a_T, a_{t'})\right)}, \qquad \mathrm{score}(a_T, a_t) = a_T^{\top} W_a\, a_t$$
Here, $\mathrm{score}(\cdot)$ is a bilinear function, which compares the two hidden states, and $W_a$ is the trainable weight matrix of attention. The length of the attention-weight vector $\alpha$ is $T$. Differing from attentional encoder-decoder problems [45,46], in our classification problem, $a_T$ is the last hidden state of the encoder network because our problem does not involve any decoder structure. A similar approach was used in recent studies [48,49].
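The computation of the attention weights and the context vector can be sketched in NumPy as follows. This is a minimal illustration with random, untrained values; the function name `multiplicative_attention` and the hidden size of 8 are hypothetical choices.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def multiplicative_attention(a, w_a):
    """Attention over encoder states, queried by the last hidden state.

    a:   (T, h) encoder hidden states a_1 ... a_T
    w_a: (h, h) attention weight matrix W_a
    returns (alpha, context): weights of shape (T,) and context vector of shape (h,)
    """
    a_last = a[-1]
    # Bilinear alignment score a_T^T W_a a_t for every time step t.
    scores = np.array([a_last @ w_a @ a_t for a_t in a])
    alpha = softmax(scores)  # attention weights sum to one over time
    context = alpha @ a      # weighted sum of the encoder states
    return alpha, context

rng = np.random.default_rng(0)
T, h = 100, 8
a = rng.normal(size=(T, h))
w_a = rng.normal(scale=0.1, size=(h, h))
alpha, context = multiplicative_attention(a, w_a)
print(alpha.shape, context.shape)
```

Because `alpha` has one entry per time step, it can be plotted directly beneath the raw input signal to show which intervals the model attended to.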
The attentional hidden state $h_T$ is produced by concatenating the context vector $c_T$ and the last hidden state $a_T$. The attentional vector $h_T$ is then used to calculate the probability and label of the output $\hat{y}_T$ as follows:
$$\hat{y}_T = \arg\max_{y}\, p(y \mid x)$$
For the cost function of all deep learning-based approaches, we employed the cross entropy between the measured values, $y$, and the estimated values, $\hat{y}$, which is defined as follows:
$$L = -\frac{1}{m} \sum_{i=1}^{m} \sum_{c=1}^{M} y_{i,c} \log \hat{y}_{i,c}$$
where $m$ is the batch size and $\hat{y}_{i,c}$ is the estimated probability of class $c$ for the $i$-th sample. We also added a dropout layer to the hidden-layer output of all the deep networks to prevent overfitting. The Adam optimizer (with a learning rate of $lr = 10^{-3}$, $\beta_1 = 0.9$, and $\beta_2 = 0.999$) was used to train all of the deep learning-based models to minimize the cross-entropy loss [50].
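The classification head and the loss can be sketched as follows. The exact form of the attentional vector is not recoverable from the text, so this sketch assumes the form used in the multiplicative-attention work cited as [46], i.e., $h_T = \tanh(W_c [c_T; a_T])$ followed by a softmax layer with weights $W_s$. The names `classify`, `cross_entropy`, `w_c`, and `w_s`, and all sizes other than the 18 classes, are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classify(context, a_last, w_c, w_s):
    """Class probabilities from the context vector and the last hidden state."""
    h = np.tanh(w_c @ np.concatenate([context, a_last]))  # assumed: h_T = tanh(W_c [c_T; a_T])
    return softmax(w_s @ h)                                # assumed: p(y | x) = softmax(W_s h_T)

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean cross entropy over a batch; y_true is one-hot, both arrays are (m, M)."""
    return -np.mean(np.sum(y_true * np.log(y_prob + eps), axis=1))

rng = np.random.default_rng(0)
m, hidden, n_classes = 4, 8, 18
w_c = rng.normal(scale=0.1, size=(hidden, 2 * hidden))
w_s = rng.normal(scale=0.1, size=(n_classes, hidden))
contexts = rng.normal(size=(m, hidden))       # stand-ins for c_T of each sample
last_states = rng.normal(size=(m, hidden))    # stand-ins for a_T of each sample

probs = np.stack([classify(c, a_last, w_c, w_s)
                  for c, a_last in zip(contexts, last_states)])
labels = np.argmax(probs, axis=1)             # predicted class per sample
y_true = np.eye(n_classes)[rng.integers(0, n_classes, size=m)]
loss = cross_entropy(y_true, probs)
print(labels.shape, float(loss))
```

In training, this loss would be minimized with a gradient-based optimizer such as Adam; the sketch only covers the forward computation.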

Classification Results
We use the F1 score to evaluate each class, defined as the harmonic mean of precision (P) and recall (R), and the weighted F1 score as the primary performance metric:
$$F1 = \frac{2 \cdot P \cdot R}{P + R}$$
To compute the mean F1 score ($F_m$), we weight the per-class F1 scores by the number of instances in each class:
$$F_m = \sum_{c=1}^{C} \frac{N_c}{N_{tot}}\, F1_c$$
Here, $N_c$ is the number of samples that belong to class $c$, and $N_{tot}$ is the total number of samples from the $C$ different classes. Table 5 shows the classification accuracies and $F_m$ obtained from the experiments, and Table 6 shows the mean and standard deviation of the time required for inference on a single data sample $x \in \mathbb{R}^{T \times D}$. Confusion matrices of the results from the feature-based and deep-learning algorithms are shown in Figure 7. There was no significant performance difference with the addition of attention. The benefits of an attention mechanism will be discussed in Section 5. Detailed classification performances are listed in Table 5.
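As a worked example of the weighted F1 score defined above, the sketch below computes per-class precision, recall, and F1, and then weights each F1 score by the class frequency. The function name `weighted_f1` and the toy labels are illustrative only, and a class with no predictions or no samples is assigned an F1 score of zero by assumption.

```python
import numpy as np

def weighted_f1(y_true, y_pred, n_classes):
    """Per-class F1 scores weighted by the number of true instances per class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    total = 0.0
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += (np.sum(y_true == c) / y_true.size) * f1  # weight N_c / N_tot
    return total

# Toy example with three classes.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
score = weighted_f1(y_true, y_pred, n_classes=3)
print(round(score, 4))  # 0.7417
```

In this example the per-class F1 scores are 0.75, 0.8, and 2/3, with class weights of 4/8, 2/8, and 2/8, respectively.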
We also examined the high-dimensional internal features (D = 64 in our case) learned by our deep neural networks, such as Conv1D, LSTM, GRU, LSTM + Att, and GRU + Att, using t-distributed stochastic neighbor embedding (t-SNE) [51]. The two-dimensional embeddings projected from the last fully-connected layer are shown in Figure 8.

Blind Test
We collected an additional blind test dataset to further evaluate the robustness of the proposed system. The blind test data were collected from subjects who did not participate in the main experiment. We obtained the blind test dataset in an environment comparable but not identical to that of the training data, because the blind test dataset was collected assuming real-world conditions (e.g., flat walkways and field tracks on campus). Table 7 shows the total time spent on each class. Cumulatively, we collected a total of 35.90 min of data for the 18 classes, which is approximately 1.99 min (std: 0.87 min) for each class. The sampling rate was set to 50 Hz, the same as for the training dataset. Figure 9 shows the confusion matrices (left) and the corresponding t-SNE visualizations (right) of the blind test set using the Conv1D (upper) and LSTM (bottom) models as classifiers. Correctly classified data are marked with a filled circle, and incorrectly classified data are marked with a cross.

Classification Performance
In general, based on the overall classification results, the deep learning-based approaches successfully learned features from the different fine-grained walking styles defined in our study. During the test phase, the LSTM/GRU-based approaches demonstrated the highest accuracies and $F_m$, i.e., greater than 96%, in both segmentation conditions. In our study, the addition of an attention layer did not significantly affect classification performance. Conv1D also exhibited accuracies and $F_m$ over 96% when the length of the segmentation window was T = 150 (approx. 3 s). The most challenging activity to recognize was C8 (p: 88.027, r: 86.442) when LSTM was utilized.
In contrast, the feature-based approaches demonstrated lower classification performance on almost all the activities despite the extensive feature-engineering process. Therefore, except for the SVM, it is apparent that the feature-based machine-learning models adopted in our study do not have sufficient capacity for learning the features from the proposed fine-grained motion dataset.
Regarding the blind test described in Section 4.2, the accuracy ($F_m$) was significantly reduced by 9.686 (8.712) percent in the case of Conv1D compared to that on the test dataset. Although our approach validated the feasibility of the proposed learning scheme, robust recognition of some classes, such as C5, C8, C13, and C14, was found to be challenging, as shown in Figure 9. More specifically, we found that C5 (walking with a phone in the right hand) was misclassified as C0 (regular walking) when the Conv1D model was used. This may be because the participants differed in the degree to which they focused on their smartphones, although they were asked to read the arbitrary content displayed while walking. We also found that walking with a dumbbell in both hands (C8) was confused with walking with a dumbbell in the left hand (C6). In addition, walking uphill (C13) was confused with walking with a dumbbell in the right hand (C7), and walking downhill (C14) was confused with walking downstairs (C16). An earlier work [8] reported a similar misclassification issue: ascending and descending stairs were frequently evaluated as identical movements.
Noting that walking with something in the right hand and walking on inclined/stepped surfaces were successfully recognized in the training and test datasets but not in the blind test dataset, we plan to collect more data on these activities from diverse users to make our model more robust.
Except for these classes, the classification performance on the remaining classes was better than or similar to the test dataset results. However, the blind test was conducted with a modest amount of data; therefore, additional research with data from various distributions is required.

Attention Mechanism
Learning an interpretable representation is crucial in many machine-learning tasks. A deep learning algorithm has the advantage of extracting features from the raw data; however, understanding the relative contributions of the input data is typically challenging. To mitigate this issue, the concept of attention was introduced in earlier studies [45,46]. In this paper, we incorporate an attention mechanism, originally devised for neural-machine-translation tasks [46], into our classification model to learn an interpretable representation that describes which parts of the input data are receiving the model's attention. Unlike recent studies on attention-based HAR systems [52][53][54], we further focus on densely visualizing and analyzing the attention weights along with the raw sensor input signal, $x \in \mathbb{R}^{T \times D}$.
Figures 10 and 11 show examples of visualized attention vectors, $\alpha \in \mathbb{R}^{T}$, highlighted at the bottom of each figure. The darker the highlighted bar, the more attention the corresponding time step received from the model during the inference phase. Note that the attention values form continuous regions. In other words, the model attends to collections of adjacent parts of the input signals, rather than to isolated parts of the signals, during the training and inference phases. This may be because input signals from specific intervals contribute to the calculation of the context vector, which captures the relevant source-side information required to predict the label of the given MTS input signals.
Figure 10. Examples of MTS input signals from three different walking activities with temporally aligned attention vectors highlighted. The darker the highlighted bar, the more attention it received from the model, and thus the more it contributed during the inference phase. This example indicates that when the repetition cycle of the exercise was long, the attention peaks (shown in purple) formed at a correspondingly slow cycle.
Figure 11. Examples of input signals from walking with something in the right hand, with temporally aligned attention vectors highlighted.

Walking with Something in the Right Hand
Activities with something in the left hand are relatively easy to recognize, in that the sensor values are recorded by the smartwatch worn on the left wrist. There was little confusion between walking with an umbrella in the left and right hands (C2/C3) or between walking with a phone in the left and right hands (C4/C5).
Although we initially assumed that it would be challenging to recognize cases in which objects are held in the right hand, it turned out that the proposed system could successfully recognize these activities, i.e., walking with an umbrella in the right hand (C3), walking with a phone in the right hand (C5), and walking with a dumbbell in the right hand (C7). This may be because whole-body motion, including that of the left hand, is affected by the constraints imposed on the right hand. For example, holding an umbrella or a heavy load in either hand normally affects our dynamic walking patterns, such as spatiotemporal-stride and arm-swing parameters, significantly. Figure 11 shows examples of input signals from walking with something in the right hand (C3/C5/C7) with temporally aligned attention vectors highlighted. Note that the darker the highlighted bar, the more attention it received from the model, and thus the more it contributed during the inference phase. As shown in Figure 12, the two-dimensional feature embeddings from these activities (C3/C5/C7) are well clustered and separated from those of other types of activities, including regular walking (C0). However, as noted in Section 5.1, our system was unable to recognize all walking behaviors during the blind test. For example, C8 (p: 100.0/86.667, r: 26.316/22.807 when Conv1D/LSTM were used) was mistaken for C6 (p: 46.213/29.605, r: 100.0/100.0 when Conv1D/LSTM were used). This could be because typical motion aspects (for example, swinging the left arm slowly due to a heavy load in the left hand) are invariant to right-hand motion. Figure 13 shows example input signals with aligned attention weights for cases in which our system misidentified C8 as C6 during the blind test. In contrast to Figures 10 and 11, which show examples of correct recognition, the attention weights in Figure 13 are not regularly and densely formed.

Evaluation on Walking-Related Datasets
We compared the classification results to those obtained on other publicly available datasets. First, we used the PAMAP2 dataset (Physical Activity Monitoring for Aging People 2) [55], which includes 12 daily physical activities measured by on-body sensors attached to three different body parts: the hand, chest, and ankle. Notably, this dataset contains walking-related activities, such as walking, running, Nordic walking, ascending/descending stairs, and vacuum cleaning. To achieve a temporal resolution comparable to that of our dataset, we downsampled the PAMAP2 dataset from 100 Hz to 50 Hz. The data was segmented into 3 s fixed-width sliding windows with no overlap. We also created a hand-oriented subset (PAMAP2-hand) using measurements from the sensor attached to the hand. Second, we used the SBHAR dataset (Smartphone-Based HAR dataset with Postural Transitions), which contains multivariate time-series data from 30 participants ranging in age from 19 to 48 years [56]. This dataset includes six basic activities (walking, walking upstairs, walking downstairs, sitting, standing, and lying) and six postural transitions (standing-to-sitting, sitting-to-standing, sitting-to-lying, lying-to-sitting, standing-to-lying, and lying-to-standing). A smartphone mounted on the participant's waist served as an inertial motion-capture device, equipped with a triaxial accelerometer and a gyroscope operating at 50 Hz. To test our approach with the SBHAR dataset, we segmented the measurements using a sliding window of 3 s with 50% overlap.
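The preprocessing described above (resampling to 50 Hz, then cutting 3 s windows with or without overlap) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names resample_linear and sliding_windows are hypothetical, and the resampling uses plain linear interpolation without an anti-aliasing filter, which a production pipeline would normally add.

```python
import numpy as np

def resample_linear(x, fs_in, fs_out):
    """Resample a (time, channels) array from fs_in to fs_out in Hz.

    Assumption: linear interpolation only, with no low-pass filtering.
    """
    n_in = x.shape[0]
    n_out = int(round(n_in / fs_in * fs_out))
    t_in = np.arange(n_in) / fs_in    # original sample times in seconds
    t_out = np.arange(n_out) / fs_out  # target sample times in seconds
    cols = [np.interp(t_out, t_in, x[:, c]) for c in range(x.shape[1])]
    return np.stack(cols, axis=1)

def sliding_windows(x, fs, win_s=3.0, overlap=0.0):
    """Cut a (time, channels) array into fixed-width windows.

    overlap is the fraction shared by consecutive windows (0.0 or 0.5 here).
    Returns an array of shape (n_windows, window_length, channels).
    """
    win = int(round(win_s * fs))
    step = max(1, int(round(win * (1.0 - overlap))))
    starts = range(0, x.shape[0] - win + 1, step)
    windows = [x[s:s + win] for s in starts]
    if not windows:
        return np.empty((0, win, x.shape[1]))
    return np.stack(windows)

# Synthetic stand-in for 30 s of six-channel data recorded at 100 Hz.
rng = np.random.default_rng(0)
raw = rng.standard_normal((3000, 6))

signal_50hz = resample_linear(raw, fs_in=100, fs_out=50)
no_overlap = sliding_windows(signal_50hz, fs=50, win_s=3.0, overlap=0.0)
half_overlap = sliding_windows(signal_50hz, fs=50, win_s=3.0, overlap=0.5)
print(no_overlap.shape, half_overlap.shape)
```

At 50 Hz, a 3 s window holds 150 samples, so the step between windows is 150 samples without overlap and 75 samples with 50% overlap. Because resample_linear works on sample times rather than an integer decimation factor, the same sketch also covers rate ratios that are not whole numbers.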
Third, we used the Daphnet freezing of gait (DG) dataset [7], which consists of inertial measurements (i.e., acceleration) from 10 Parkinson's disease (PD) patients who experience freezing of gait (FoG), which manifests as a sudden and temporary inability to move. The DG dataset was collected while the PD patients were walking, using on-body sensors attached to three different body parts (ankle, knee, and trunk). We validated our approach by downsampling the DG dataset from 66 Hz to 50 Hz and segmenting it with a sliding window of 3 s without overlap. Table 8 contains detailed information used for the evaluation. For more information on each dataset, see previous studies [10,57], which extensively summarize the public datasets. Table 8 shows the performance in terms of weighted F1 scores (i.e., F m) on the different public datasets along with ours. These results demonstrate that it is feasible to learn features from walking-related activities, which inherently share similar temporal features, using recent deep-learning-based approaches. Although the addition of attention yields no significant performance improvement, it enhances the explainability of the classification process.
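For readers unfamiliar with the metric, a weighted F1 score can be sketched as the average of per-class F1 scores, each weighted by the number of true samples in that class. The sketch below assumes this common definition (the one used by the "weighted" averaging mode in popular machine-learning libraries); the function name weighted_f1 and the toy labels are hypothetical.

```python
import numpy as np

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores.

    Assumption: F1_c = 2*TP / (2*TP + FP + FN) for each class c, weighted by
    the number of true samples of c.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes, support = np.unique(y_true, return_counts=True)
    total = 0.0
    for c, n_c in zip(classes, support):
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n_c * f1
    return total / len(y_true)

# Toy labels for two classes; illustrative only.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]
score = weighted_f1(y_true, y_pred)
print(f"weighted F1 = {score:.4f}")
```

Weighting by class support keeps the score from being dominated by rare classes, which matters for datasets in which some activities have far fewer windows than others.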

Limitations
The proposed model demonstrated high accuracies and F m in recognizing activities on the test set. However, as discussed in Section 4.2, it is not guaranteed that similar recognition performance can be achieved in real-life scenarios because our data was collected in a controlled environment and from a limited number of participants. In fact, Bao and Intille [13] emphasized the importance of unsupervised and naturally collected data. They collected two different types of data. One type was collected in a semi-naturalistic environment, wherein the participants were asked to complete descriptive tasks. This setting allowed participants to move on their own to some extent. The other type was collected in a laboratory setting where the participants were instructed to execute several predefined activities. Since our data was also collected in a controlled setting, we plan to design experiments in which participants are allowed to move more naturally.

Applications
Fine-grained classification of walking styles would open a new avenue for promising applications in diverse fields, such as providing contextual information tailored to a user's current situation, measuring precise energy expenditure during exercise, and monitoring abnormal activities.
Assistance for Distracted Walkers. As smartphones become more common, people often look at their smartphone screens even while walking. Consequently, a distracted walker may get into an accident. A part of our study, i.e., differentiating walking while looking at a smartphone screen (C4/C5) from regular walking (C0), can be utilized to help such walkers. For example, wearable assistants based on our approach could warn distracted walkers when they enter a busy street. Identifying or recognizing cognitive loads while walking using wearable devices would be interesting future work.
Contextual Applications. Furthermore, recognizing the availability of the user's hands, e.g., walking with dumbbells in both hands (C8) and walking with an umbrella in either hand (C2/C3), would be useful for users whose hands are occupied and who therefore cannot manipulate smart devices. For example, wearable applications could read incoming messages aloud or open car doors automatically if the system recognized that a user was moving with luggage in both hands.
Encouraging Fitness. Recognition of fundamental activities, such as running and walking, is already embedded in modern consumer smartwatches. For example, the device may encourage us to stand up if we sit still for a long time, and it can recognize whether we are walking or running for fitness.
As we investigated throughout this study, our activities could be recognized in much finer detail. A wearable system may encourage users who are exercising to walk faster if they are walking slowly or with their hands in their pockets. More generally, the proposed approach can be extended to summarize the recorded activities into a set of fine-grained activities, enabling personalized fitness suggestions and encouragement.

Conclusions
Assuming that hand motions are an important part of human walking activities and thus have different spatiotemporal characteristics depending on the walking style, we proposed a wearable system that can recognize fine-grained walking patterns. To that end, we defined 18 different everyday walking styles and developed a wearable system that can capture a user's body motions from their hand motions in the form of MTS signals. Then, we employed a set of machine-learning algorithms, including feature-based algorithms and recent deep-learning algorithms, to learn the MTS data with the predefined walking patterns in a supervised fashion.
With our model, the LSTM-based approach demonstrated the best classification results, with an accuracy (F m) of 97.158 (97.156). In general, deep-learning-based approaches, including Conv1D, LSTM, GRU, LSTM + Att, and GRU + Att, exhibited higher classification performance, i.e., accuracy and F m greater than 95%. Despite our extensive feature engineering work, feature-based approaches demonstrated poor classification performance overall. One remarkable finding from the experimental results was that walking activities with something in the dominant hand can be recognized even when the smartwatch is worn on the non-dominant side. Regarding the blind test, the accuracy (F m) was 87.290 (88.259) when Conv1D was employed. According to the findings of the blind test, our model has trouble robustly recognizing specific walking patterns, such as walking with something in the right hand and walking on inclined/stepped surfaces. To resolve this generalization issue, we plan to collect more data on these activities from diverse users to make our pretrained model more robust. Using the attention-based neural networks, we further analyzed the classification results to understand the relative contributions of the MTS signals used in the classification process. In the application section, we explored a set of wearable applications that utilize the proposed fine-grained walking activity-recognition scheme. Future studies will focus on increasing the robustness of the model and extending the proposed approach to diverse healthcare applications.

Informed Consent Statement:
Written informed consent has been obtained from participants in the study.

Data Availability Statement:
We cited the details of each dataset in the document.

Acknowledgments:
The authors would like to thank the reviewers for all of their constructive and insightful comments in relation to this work.

Conflicts of Interest:
The authors declare no conflict of interest.