Sensing-HH: A Deep Hybrid Attention Model for Footwear Recognition

Abstract: The human gait pattern is an emerging biometric trait for user identification of smart devices. However, one of the challenges in this biometric domain is the gait pattern change caused by footwear, especially if the users are wearing high heels (HH). Wearing HH puts extra stress and pressure on various parts of the human body and alters the wearer's usual gait pattern, which may cause difficulties in gait recognition. In this paper, we propose Sensing-HH, a deep hybrid attention model for recognizing the subject's shoes, flat or different types of HH, using a smartphone's motion sensors. In this model, two streams of convolutional and bidirectional long short-term memory (LSTM) networks are designed as the backbone, which extract the hierarchical spatial and temporal representations of the accelerometer and gyroscope signals individually. We also introduce a spatial attention mechanism into the stacked convolutional layers to scan the crucial structure of the data. This mechanism enables the hybrid neural networks to capture extra information from the signal and thus significantly improves the discriminative power of the classifier for the footwear recognition task. To evaluate Sensing-HH, we built a dataset with 35 young females, each of whom walked for 4 min wearing shoes with varied heel heights. We conducted extensive experiments and the results demonstrated that Sensing-HH outperformed the baseline models on leave-one-subject-out cross-validation (LOSO-CV). Sensing-HH achieved the best mean F1 score ($F_m$), which was 0.827 when the smartphone was attached to the waist. This outperformed all the baseline methods by more than 14%. Meanwhile, the F1 score of the Ultra HH class was as high as 0.91. The results suggest that the proposed model makes footwear recognition more efficient and automated. We hope the findings from this study pave the way for more sophisticated applications using data from motion sensors, and lead to a more robust biometric system based on gait patterns.


Introduction
Recently, with wearable technology advancing at a fast pace, billions of smart devices have been equipped with built-in motion sensors such as accelerometers and gyroscopes. These sensors can be exploited to log the body motion of users, which makes them a very useful tool for the research communities studying motion sensing. More and more researchers have used the motion characteristics of the human body for various tasks, ranging from activity recognition [1][2][3][4][5][6], gesture categorization [7], clinical condition monitoring [8], and BMI prediction [9] to user gait recognition [10][11][12][13]. In particular, identity recognition using the dynamics of the walking pattern seems a promising technique for preventing the use of smart devices, and other systems linked with them, without the owner's permission. However, the quality of gait-based biometric systems is greatly influenced by the footwear the subject is wearing.
Previous works [14,15] studied the gait changes related to the different shoes worn by the subjects. Their experiments were carried out with four kinds of shoes of different weights. They found that heavy footwear reduces discrimination and that the sideways motion of the foot has more discriminating power than the up-down or forward-backward motion. Meanwhile, according to previous studies in exercise physiology, the height of the heels is also an important parameter related to the human gait. A recent survey [16] listed five main open problems for gait recognition, one of which is the variety of shoes. Walking requires ongoing, finely tuned interactions between muscular and tendinous tissues. Wearing HH puts extra stress and pressure on various parts of the human body, which affects the subject's natural gait [17]. Intuitively, an increase in heel height causes a decrease in the subject's walking speed and stride length.
Though footwear alters the gait, there is only a very limited number of studies in footwear recognition. The existing methods normally use the RGB camera [18], the specific motion capture system [19], the ground reaction force sensors [20], or Microsoft Kinect sensor [21], all of which are lab limited. In fact, there is no research on the footwear recognition in the daily life scenario, and none for the HH which about 37% to 69% of American women frequently wear [22]. Additionally, even if only considering the HH, they are categorized into many categories by the height of the heels, as shown in Table 1 and Figure 1. Therefore, motion sensor-based footwear recognition using the gait characteristic in daily life is still an open challenge. One of the major challenges is that the daily life walking environment is highly dynamic and it includes a variety of environmental factors that could directly or indirectly introduce variations into the gait patterns. For example, the clothes the individual is wearing, the different walking surfaces, slopes, and obstacles on the road, can all contribute to gait changes besides footwear.
In this section, we evaluate the difficulty of the task by visualizing the raw signals, as shown in Figure 2. A participant of medium build (average weight) was asked to walk back and forth three times on the same surface, wearing different shoes (flat, mid HH and ultra HH), each time with a smartphone placed on her waist. From the visualized data, we find that the gait with ultra HH (9.8 cm) is significantly different from that in the other two scenes. Its acceleration component has a sharper peak and, in particular, the angular velocity shows a lateral rotation lasting for one second. We believe that this is due to the reduced stability and the change in the center of gravity caused by the ultra HH.
Inspired by the success of deep neural networks, some very recent works apply them to motion sensor-based recognition. Examples include the Convolutional Neural Network (CNN) [3,5,23], which is competent in capturing the local characteristics of multi-channel signals, and the Recurrent Neural Network (RNN) [24] and its variant built on LSTM units [1,25], which are designed to extract temporal dependencies and incrementally learn information over time. Recently, the combination of CNNs and LSTM in a unified stacked framework has offered state-of-the-art results in sensor-based recognition [7]. In our previous study, we developed a hybrid deep neural network [9] for gait analysis using data captured from the built-in motion sensors of smartphones. The hybrid deep neural network overcomes the challenge of environmental factors. In this study, we extend our prior work by incorporating an attention mechanism into the previous model, and test its performance by investigating gait changes related to footwear. The major contributions of this paper are summarized in three points: (1) To the best of our knowledge, we are the first to recognize the subject's footwear by the dynamics of gait changes acquired from smartphone sensors in daily life. We categorize the shoes into 3 classes by the height of the heels (flat, mid HH and ultra HH). We propose Sensing-HH, a novel deep attention model, which can automatically learn a hierarchical feature representation and the infinite temporal contexts from raw signals through the hybrid net structures. By adding the attention mechanism, it can also implicitly learn to suppress irrelevant parts of the raw signals and to highlight salient features useful for this specific task. (2) We established a dataset with 35 young females wearing 3 kinds of shoes. All of them were asked to walk for 4 min on a flat surface with 3 smartphones as recording devices, which were simultaneously held in their hands, attached to their waists, and placed in their handbags, respectively. (3) We conducted comprehensive experiments on this dataset to evaluate the proposed Sensing-HH model. The results showed that our model achieved competitive performance under cross-validation, with a mean F1 score ($F_m$) over the classes of 0.827 when the smartphone was attached to the waist. Meanwhile, the F1 score of the Ultra HH class was as high as 0.91.
The remaining part of this paper is structured as follows: In Section 2, we give a brief overview of the state of some related literature. In Section 3, we present how the dataset was established. In Section 4, we illustrate the Sensing-HH, a deep attention model. Experimental results with the baseline methods are presented in Section 5. Section 6 gives the conclusions.

Related Work
In this section, we summarize the research most relevant to our proposed approach, grouped into three domains: footwear-related gait analysis using motion sensors, previous deep learning approaches for motion sensor-based recognition, and attention mechanisms for sensor data processing.

General Footwear Related Gait Analysis Using Motion Sensors
Wearable motion sensors make gait analysis [16,26] much easier. Research on general gait analysis using motion sensors has focused on model-based methodology [27,28], which first models gait based on a comprehensive understanding of the gait mechanism and then converts the sensor signal into gait-related physiological parameters [29][30][31], such as gait rhythm, step length, symmetry, inner foot distance, ankle shape, detection of gait phases [32], or kinematic parameters (joint angle measurement) [33]. A recent review [34] has shown wearable sensors to be very useful in monitoring and analyzing the stability of subjects.

Previous Deep Learning Approaches for Motion Sensors-Based Recognition
Over the past few years, deep neural networks have emerged as a family of learning models for automating feature design, and have achieved tremendous success in many application domains [35][36][37][38][39][40]. In particular, Yosinski et al. [41] demonstrated that learned features are not specific to a particular task and can be useful for multiple related tasks. Some studies have employed deep neural networks for motion sensor-based recognition tasks. It is common to use Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), and recently some researchers have turned to hybrid networks that combine CNN and RNN. Gadaleta et al. [12] presented IDNet, a user authentication framework based on smartphone-acquired motion signals. The stacked convolutional layers were used as a series of feature extractors, and a One-Class SVM (OSVM) was then used as the classifier for gait recognition. The experiments exploited an in-house dataset with data collected from 50 subjects over six months. Data were acquired using different smartphone models positioned in the right front pocket of the trousers. Subjects were asked to walk at their normal pace in different walking sessions of about 5 min. Both the accelerometer and the gyroscope recordings were used in the recognition process. Zou et al. [13] proposed a CNN-RNN structure for robust gait feature representation, with which features of the space and time domains were successively abstracted by the hybrid network. Two datasets were collected for identification and verification. In a previous work, we also proposed a hybrid deep neural network [9] to predict the BMI of smartphone users, which was also based on the characteristics of body movement captured by the smartphone's built-in motion sensors.

Attention Mechanism for Sensor Data Processing
The attention mechanism, originally a concept in biology and psychology that describes how we restrict our attention to something crucial for better cognitive results, is popular in deep learning [42]. It has been successfully applied to image recognition [43,44], natural language processing [45,46] and speech recognition [47]. Recently, some researchers have explored the potential of using attention models for processing sensor data, such as Electroencephalography (EEG) and wearable sensor data. Zhang et al. [48] presented a Convolutional Attention Model (CAM) for EEG-based human movement intention recognition in the subject-independent scenario. In that study, the integrated attention mechanism was utilized to focus on the most discriminative information of the EEG signals during the period of movement imagination while omitting other less relevant parts.
Zhang et al. [49] introduced a selective attention mechanism into the reinforcement learning scheme to focus on the crucial dimensions of the multimodal wearable sensor data. This mechanism helped to capture extra information from the signal and thus it was able to significantly improve the discriminative power of the classifier. Zeng et al. [50] proposed two attention models for human activity recognition: temporal attention and sensor attention. These two mechanisms adaptively focused on important signals and sensor modalities. Wang et al. [51] presented an attention-based convolutional neural network for human recognition from weakly labeled data.
Our proposed attention model is focused on a long sequence of sensor data, and it not only improves the performance of the model but also has better interpretability.

Dataset
To the best of our knowledge, there is no existing dataset that specifically addresses motion sensor-based gait recognition of HH wearing in a daily environment. In this section, we describe our strategy for collecting motion sensor data to build the dataset.

Participants Selection
We recruited female participants who wear HH at least 5 days a week, for an average of 12 h a day (including walking, sitting and standing). To prevent other factors such as age, height, and weight from affecting the results, we selected 35 subjects aged from 19 to 27 with similar builds. Participant details are as follows: age: 23 ± 4 years; height: 164.3 ± 12.4 cm; mass: 51.8 ± 7.6 kg. Before the experiment, each participant was informed of its aim and the measuring method, and all of them signed a consent form to participate in the study. Prior to the gait measurement, we conducted a short survey asking about their preferred types of footwear and how frequently they wear HH. Two-thirds of the participants answered that they preferred flat shoes in their day-to-day life. One-third of them preferred high-heeled shoes, even with heels more than 8 cm in height. All of them wore 3 kinds of shoes (flat, mid HH and ultra HH) for this study.

Data Collection
All of the motion sensor data were recorded by a logging application on 3 different Android smartphones (Samsung Galaxy S10, Samsung Galaxy Note8, and Smartisan Pro2). Table 2 summarizes the sensor specifications of the devices. The motion sensors built into these smartphones are a tri-axis accelerometer and a tri-axis gyroscope. The accelerometer measures the acceleration (including gravity) along the X, Y and Z directions of the smartphone. The gyroscope captures the angular velocity of the smartphone during its rotation in space. Both of them reflect the gait characteristics of smartphone users.
All of the participants were asked to walk for 4 min on flat ground, as shown in Figure 3. The recording devices, the 3 smartphones mentioned above, were held in their hands, attached to their waists, and placed in their handbags, respectively, as shown in Figure 4.

Methodology
In this section, we give an overview of the development of Sensing-HH. First, we define the notation used in this study. Second, we introduce the proposed Sensing-HH model in detail.

Notation and Definitions
To avoid ambiguity, we clarify the following terms used in this paper: Sequence, Sub-Sequence and Instance. The sequence $S$ is all recordings of one subject; it is an ordered list of multi-dimensional time series that are typically recorded in temporal order at fixed intervals. Given a dataset with $N$ subjects in total, the sequence of the $m$-th subject, $m \in [1, N]$, is $S_m$, and $T_m$ is its total number of intervals.
$d_m^i$ denotes the $m$-th subject's sensor recording (tri-axis accelerometer and tri-axis gyroscope) at the $i$-th sampling point, $i \in [1, T_m]$, so that $S_m = (d_m^1, d_m^2, \ldots, d_m^{T_m})$. In this paper, the sequence $S_m$ is segmented into a series of sub-sequences by a sliding window strategy.

Sub-Sequence
The de-facto standard workflow for processing sensor data in ubiquitous computing treats individual sub-sequences $x_m^k$ as statistically independent. $x_m^k$, $k \in [1, L]$, is the $k$-th sub-sequence of the sequence $S_m$, $w$ is the length of each sub-sequence, and $\theta$ is the step between the start intervals of two consecutive sub-sequences. Concretely, $x_m^k$ contains the sampling points from $d_m^{(k-1)\theta+1}$ to $d_m^{(k-1)\theta+w}$.
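The sliding window segmentation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `sliding_windows` is hypothetical, trailing samples that do not fill a whole window are dropped (an assumption), and the example values correspond to a 2 s window with 50% overlap at a 200 Hz sampling rate, which gives w = 400 and θ = 200 samples.

```python
def sliding_windows(sequence, w, theta):
    """Split a sequence into sub-sequences of length w whose start points
    are theta samples apart. Incomplete trailing windows are dropped
    (an assumption; the text does not say how the tail is handled)."""
    if w <= 0 or theta <= 0:
        raise ValueError("w and theta must be positive")
    if len(sequence) < w:
        return []
    n_windows = (len(sequence) - w) // theta + 1
    return [sequence[k * theta:k * theta + w] for k in range(n_windows)]


# Illustrative values: 4 min of walking sampled at 200 Hz, cut into
# 2 s windows with 50% overlap. Sample indices stand in for readings.
rate_hz = 200
sequence = list(range(4 * 60 * rate_hz))   # 48000 sampling points
w = 2 * rate_hz                            # window length: 400 samples
theta = w // 2                             # step: 200 samples
windows = sliding_windows(sequence, w, theta)
print(len(windows), len(windows[0]))
```

With these settings, one 4 min recording yields (48000 − 400) / 200 + 1 = 239 sub-sequences per sensor stream.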

Instance
In practice, the instance $i_m^k$ refers to the data fed into the recognition model, which is the sub-sequence $x_m^k$ transformed into a suitable format by the data preprocessing function $H(\cdot)$.
In this paper, the task is to learn a function $f: I \rightarrow Y$ from a given data set, where $I$ denotes the instance space and $Y$ is the set of class labels. Given the unified representation $f$, we optimize the network by minimizing a loss function $\mathcal{L}$, which shortens the distance between the predicted label and the ground truth.

Sensing-HH: A Deep Attention Model
This subsection introduces our proposed deep attention network, which consists of two streams that take acceleration and angular velocity as inputs, respectively. Each stream is composed of four modules: a data preprocessing module, a deep hybrid connection network module, an attention network module, and a fusion module. The architecture is illustrated in Figure 5, and the details are presented as follows:

Data Preprocessing Module
In practice, the data preprocessing module includes three main steps:
Step 1: Resampling and Interpolating. Unlike some specific sensors which are used under constrained experimental conditions, the sampling frequencies of most smartphones' built-in sensors are time-varying because their processing unit and operating system were designed for multitasking [11]. Additionally, different sensors have different sampling rates. To guarantee that the data from all types of motion sensors can be processed simultaneously, resampling and interpolating steps are required to transform the sequence of raw signals into equally spaced time series. In this paper, the motion sensor time series are interpolated using the cubic spline method [52] and resampled at f = 200 Hz.
Step 2: Gravity Filtering. Raw accelerometer data include gravity components, which make it difficult to use the motion sensors to reflect the change of velocity and position of the smartphone over time. In this paper, we applied a novel gravity filtering method based on the combination of EMD (Empirical Mode Decomposition) and the wavelet threshold, proposed by Lu et al. [53].
Step 3: Normalization. After filling in the missing values by resampling and interpolating, we normalize the training data to zero mean and unit standard deviation and, as usual, use the training data mean and standard deviation to normalize the test data.
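Steps 1 and 3 can be illustrated with the short dependency-free sketch below. It is a sketch under stated assumptions, not the pipeline used in the paper: linear interpolation stands in for the cubic spline method [52] so that the example needs only the standard library, the gravity filter of Step 2 is omitted, and all function names are hypothetical.

```python
import math


def resample_linear(timestamps, values, rate_hz):
    """Resample an irregularly sampled series onto a uniform grid.
    Linear interpolation is a stand-in for the cubic spline of the paper.
    Assumes at least two samples with strictly increasing timestamps."""
    step = 1.0 / rate_hz
    n_out = int((timestamps[-1] - timestamps[0]) / step + 1e-9) + 1
    out, j = [], 0
    for n in range(n_out):
        t = timestamps[0] + n * step
        # Advance to the last interval whose left edge is not after t.
        while j + 1 < len(timestamps) - 1 and timestamps[j + 1] <= t:
            j += 1
        t0, t1 = timestamps[j], timestamps[j + 1]
        frac = (t - t0) / (t1 - t0)
        out.append(values[j] + frac * (values[j + 1] - values[j]))
    return out


def fit_standardizer(train_values):
    """Mean and standard deviation estimated on the training data only."""
    mean = sum(train_values) / len(train_values)
    var = sum((v - mean) ** 2 for v in train_values) / len(train_values)
    std = math.sqrt(var) or 1.0   # guard against constant signals
    return mean, std


def standardize(values, mean, std):
    """Apply training statistics to any split (training or test)."""
    return [(v - mean) / std for v in values]


# Irregular timestamps (seconds) of one sensor axis, resampled at 200 Hz.
ts = [0.0, 0.004, 0.011, 0.015, 0.02]
raw = [10.0 * t for t in ts]
uniform = resample_linear(ts, raw, 200)

train = [1.0, 2.0, 3.0, 4.0]
mean, std = fit_standardizer(train)
train_z = standardize(train, mean, std)
test_z = standardize([2.5], mean, std)   # test data reuse training statistics
```

Estimating the mean and standard deviation on the training subjects alone keeps information about the held-out subject from leaking into the model.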

Deep Hybrid Connection Network Module
Current research on sensor-based recognition usually relies on multiple different sensors, such as the accelerometer and the gyroscope. Generally, using diverse sensing modalities yields better results than using only one particular sensor. As a result, learning the inter-modality correlations along with the intra-modality information is one of the major challenges in HH recognition from multi-modal signals. Our proposed deep hybrid connection neural network consists of two-stream CNN-BiLSTM (Bidirectional LSTM) networks, with stacked convolution layers and bidirectional Long Short-Term Memory layers that encode features from multiple perspectives.
There are two main components in the CNN-BiLSTM. The first is the stacked 2-dimensional CNNs, which are applied to extract spatial features from the processed sensory data, such as acceleration and angular velocity. The second is the BiLSTM, which is responsible for learning the bidirectional long-term dependencies of the salient features extracted by the CNNs.
In practice, the stacked CNNs are competent in capturing the local connections of sensory data at the spatial scale. In order to learn a rich representation of the input, the convolutional layers produce a set of multiple feature maps. Although the cells in adjacent convolutional layers are only locally connected, various significant patterns of the input signals at different levels can be obtained by stacking several convolutional layers to form a hierarchical structure of gradually more abstract features. The 2-dimensional convolution layer $l$ computes each entry $c^{l,M}_{i,j}$ of a feature map from the feature maps of layer $(l-1)$, where $X$ and $Y$ are the sizes of the convolution kernel running over space and time, respectively, $M$ is the number of feature maps in the convolutional layer $(l-1)$, $w^{l-1,m} \in \mathbb{R}^{X \times Y \times M}$ is a local filter weight tensor, $b^{l-1,k} \in \mathbb{R}$ is a bias, and $\phi(\cdot)$ is the Rectified Linear Unit (ReLU) nonlinear function.
One shortcoming of the conventional LSTM is that it is only able to make use of the previous context. In a Bi-LSTM, the same input data are fed into a forward LSTM and a backward LSTM. The two hidden states are then concatenated to compute the final output $y_t$ of the Bi-LSTM, where $\overrightarrow{h}_t$ is the forward LSTM hidden state and $\overleftarrow{h}_t$ is the backward LSTM hidden state at each time step $t$, $\mathrm{LSTM}(\cdot)$ denotes the LSTM operation, $W_{\overrightarrow{h}}$ and $W_{\overleftarrow{h}}$ represent the weights of the forward LSTM and the backward LSTM, respectively, and $b$ is the bias at the output layer.
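The two operations described in this paragraph are commonly written as follows. These are the standard textbook forms, given as a sketch that agrees with the symbol definitions above and not necessarily with the authors' exact notation. The filter index $k$ on the weights, the index offsets inside the sums, and the input symbol $x_t$ are assumptions introduced for illustration.

```latex
% 2-D convolution producing entry (i, j) of feature map k in layer l
c^{l,k}_{i,j} = \phi\Big( b^{l-1,k}
    + \sum_{m=1}^{M} \sum_{x=1}^{X} \sum_{y=1}^{Y}
      w^{l-1,k}_{x,y,m} \, c^{l-1,m}_{i+x-1,\, j+y-1} \Big)

% Bi-LSTM: forward and backward passes over the same input, then a joint output
\overrightarrow{h}_t = \mathrm{LSTM}\big(x_t, \overrightarrow{h}_{t-1}\big), \qquad
\overleftarrow{h}_t  = \mathrm{LSTM}\big(x_t, \overleftarrow{h}_{t+1}\big)

y_t = W_{\overrightarrow{h}} \, \overrightarrow{h}_t
    + W_{\overleftarrow{h}} \, \overleftarrow{h}_t + b
```

The last line is equivalent to applying a single linear layer to the concatenation of the two hidden states.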

Attention Network Module
As shown in Figure 5, the attention network is constructed on top of the deep hybrid connection network described above. We generate the class activation maps [54] using global average pooling (GAP) in the CNN part, where GAP outputs the spatial average of the feature map of each unit at the last convolutional layer. A weighted sum of these values is used to generate the final output. Similarly, we compute a weighted sum of the feature maps of the last convolutional layer to obtain our class activation maps. We describe this in detail below for the case of classification using softmax.
The weights of the softmax layer are propagated back to the convolution layers to decompose the multi-dimensional time series into salient and non-salient regions. The salient regions are considered to contain information on the discriminative gait patterns of wearing high heels, which provides important indications associated with the pre-defined shoe categories, whereas the non-salient regions are less relevant to the footwear categories.
For a given instance of signals, let $f_k(c, t)$ represent the activation of unit $k$ in the last convolutional layer at spatial location $(c, t)$, where $c$ is the channel of the signals and $t$ is the timestamp. For a certain category $m$, we denote the weight of unit $k$ in the softmax layer as $w_k^m$. The result of performing global average pooling on unit $k$, $F_k$, is obtained as
$$F_k = \sum_{c,t} f_k(c, t).$$
Thus, for a given class $m$, the input to the softmax, $S_m$, indicates the overall importance of the convolutional activations for category $m$:
$$S_m = \sum_k w_k^m F_k = \sum_{c,t} \sum_k w_k^m f_k(c, t).$$
We also define $Att_m$ as the class activation map for class $m$, which directly indicates the importance of the activation at spatial location $(c, t)$ for category $m$:
$$Att_m(c, t) = \sum_k w_k^m f_k(c, t).$$
Finally, the compatibility score for the output of class $m$ is given by the softmax function:
$$P_m = \frac{\exp(S_m)}{\sum_{m'} \exp(S_{m'})}.$$
In this way, we transfer the spatial attention into the deep hybrid connection network to emphasize the salient regions with discriminative information. This attention model is also able to revisit the previous information and focus on the more important parts to learn a better representation.
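The class activation map computation can be made concrete with a small pure-Python sketch. It follows the usual formulation of class activation maps [54]: each unit is pooled over all (channel, time) locations, the class score is the weighted sum of the pooled values, and the map is the same weighted sum taken location by location. The function name, the toy activations and the toy weights are hypothetical, and the pooling uses a plain sum (dividing by the number of locations would give the average and only rescales the scores).

```python
import math


def class_activation_maps(feature_maps, class_weights):
    """feature_maps[k][c][t]: activation of unit k at channel c, time t.
    class_weights[m][k]: softmax-layer weight of unit k for class m.
    Returns per-class scores, per-class activation maps, and probabilities."""
    n_units = len(feature_maps)
    n_channels = len(feature_maps[0])
    n_steps = len(feature_maps[0][0])
    # Pool every unit over all (c, t) locations.
    pooled = [sum(sum(row) for row in fmap) for fmap in feature_maps]
    scores, maps = [], []
    for weights in class_weights:
        scores.append(sum(weights[k] * pooled[k] for k in range(n_units)))
        maps.append([[sum(weights[k] * feature_maps[k][c][t]
                          for k in range(n_units))
                      for t in range(n_steps)]
                     for c in range(n_channels)])
    # Numerically stable softmax over the class scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return scores, maps, probs


# Toy example: 2 units, 2 channels, 3 time steps, 3 classes.
feature_maps = [
    [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]],
    [[0.0, 3.0, 0.0], [1.0, 0.0, 0.0]],
]
class_weights = [[0.5, -0.5], [1.0, 0.0], [0.0, 0.25]]
scores, maps, probs = class_activation_maps(feature_maps, class_weights)
```

Summing a class activation map over all locations recovers the class score, which is why the map can be read as a location-wise breakdown of the evidence for that class.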

Fusion Module
In the previous work [9], we used a fully connected (FC) layer on top of the two-stream CNN-LSTM to produce probability scores for the target labels. In this paper, to overcome the "one-stream-dominating-the-network" problem, the fusion module is combined with an attention-weighted learning strategy. On the one hand, directly concatenating the convolutional features and feeding them into FC layers may result in over-parameterization, which makes training difficult, especially for a high heels gait dataset of limited scale. On the other hand, the low accuracy of the previous model is due not only to over-fitting but also to the fact that one type of sensor dominates the network while the other source has only a small impact on the final prediction. In this paper, we modify the attention mechanism to take two sources as input and compute an attention weight for each source, from which the softmax layer produces a prediction for the current input. This assumption is also confirmed by the following stream selection approach. We take the outputs of the two-stream CNN-BiLSTM as input and compute a weight for each stream, where $W_1$, $W_2$ are the weight parameters of the different streams, and $x_1$, $x_2$ are the learned features from the accelerometer and the gyroscope, respectively. The attention weights are normalized by softmax to create the attention map $\alpha_i$ for each type of sensor.

Experiments
In this section, to evaluate the performance of the proposed Sensing-HH model in real-world application scenarios, we conducted an experimental evaluation on a real-world dataset that we collected ourselves and compared the results with several baseline methods. Additionally, we tested whether there were significant signal differences between using footwear as the measuring standard versus not using it.

Baselines
To illustrate the difficulty of the task, we also compared the approach proposed in this work with standard classification methods typically used for automated assessment systems in other sensor-based recognition tasks [9,55,56,57,58].
RF [55]. The random forest (RF) is an ensemble classifier which, besides classifying data, can be used for measuring attribute importance. RF builds many classification trees, where each tree votes for a class and the forest chooses the classification having the most votes from the trees.
SVM [56]. The recognition process starts with the acquisition of the sensor signals, which were subsequently pre-processed by applying noise filters and then sampled in fixed-width sliding windows. From each window, a vector of 17 features is obtained by calculating variables from the signals in the time and frequency domain. Finally, these patterns are used as input of the trained SVM Classifier for the recognition.
CNN [57]. The stacked 2-Dimensional CNNs were designed to introduce a degree of locality in the patterns matched in the input data and to enable translational invariance with respect to the precise location (i.e., time of occurrence) of each pattern within a frame of movement data.
BiLSTM [58]. The model was based on a bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN), which is designed to take contextual information into account. The network can process data gathered from different positions, which results in a system that is invariant to transformations and distortions of the input patterns.
CNN-LSTM [9]. The CNN was designed to capture the spatial relationship, and the LSTM can make use of the temporal relationship. Combining CNN and LSTM enhances the ability to recognize the varied time span and signal distributions.

Setup
The handcrafted feature-based methods use the WEKA toolkit [59] and the settings from previous papers [55,56]. Sensing-HH and the other deep learning benchmark models [9,57,58] are implemented with Keras 2.3.0 and TensorFlow 2.0. For such deep learning models, tuning hyper-parameters is a time-consuming and challenging task because numerous parameters need to be configured. In this paper, we applied the functional ANOVA framework proposed by Hoos et al. [60] to estimate the impact of each hyper-parameter on the performance observed across all experiments. Six common hyper-parameters, namely the optimizer, learning rate, number of epochs, batch size, dropout rate, and regularizer, are optimized; see Table 3.

Cross-Validation Strategies
In order to obtain an unbiased evaluation of the classification performance, leave-one-subject-out cross-validation (LOSO-CV) is adopted. Suppose a dataset with $N$ subjects. For each experiment, we used the sensor data of $N - 1$ subjects for training and the sensor data of the remaining subject for testing. At first, in LOSO-CV, the subjects $\{S_n\}_{n=1}^{N}$ are partitioned into $N$ groups. The samples are then partitioned by these groups into $N$ sub-samples $\{D_n\}_{n=1}^{N}$. Of the $N$ sub-samples, a single sub-sample $D_{test}$ is retained for testing the model, and the remaining $N - 1$ sub-samples $D_{train}$ are used as the training data. The cross-validation process is then repeated $N$ times, with each of the $N$ sub-samples used exactly once as the held-out data. Compared with k-fold cross-validation, LOSO-CV not only ensures that the testing procedure covers all the participants but also makes the evaluation closer to the real-world application.
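The LOSO-CV partitioning can be sketched as a small generator. This is an illustrative helper with a hypothetical name, assuming that every instance carries the identifier of the subject it was recorded from.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) once per subject.
    subject_ids[i] is the subject that instance i belongs to."""
    for held_out in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx


# Six instances recorded from three subjects.
subject_ids = [1, 1, 2, 3, 3, 3]
splits = list(loso_splits(subject_ids))
for held_out, train_idx, test_idx in splits:
    print(held_out, train_idx, test_idx)
```

Splitting by subject rather than by instance matters here because overlapping sliding windows from the same walk are highly correlated, so a random split would place near-duplicates of the test data in the training set.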

Evaluation Criteria
Since high precision and high recall are both desired in this application, and since the dataset utilized in this work is possibly biased because it is limited by the selection of the subjects, we used the mean F1 score ($F_m$) to estimate the overall performance of the different models. For each class, the F1 score is the harmonic mean of precision and recall:
$$F_m = \frac{1}{C} \sum_{i=1}^{C} \frac{2 \cdot precision_i \cdot recall_i}{precision_i + recall_i}, \qquad precision_i = \frac{TP_i}{TP_i + FP_i}, \qquad recall_i = \frac{TP_i}{TP_i + FN_i}.$$
Here, $i = 1, \ldots, C$ indexes the classes considered, $TP_i$ and $FP_i$ represent the numbers of true and false positives of class $i$, respectively, and $FN_i$ is the number of false negatives.
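As a worked illustration of the metric, the sketch below computes an unweighted mean of the per-class F1 scores from true and predicted labels. Whether the paper weights the classes is not stated in this copy, so the unweighted mean is an assumption, and the labels in the example are made up.

```python
def mean_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores (macro F1)."""
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up predictions for six windows, two per footwear class.
classes = ["flat", "mid", "ultra"]
y_true = ["flat", "flat", "mid", "mid", "ultra", "ultra"]
y_pred = ["flat", "mid", "mid", "mid", "ultra", "flat"]
score = mean_f1(y_true, y_pred, classes)
print(round(score, 3))
```

In this example the per-class F1 scores are 0.5 (flat), 0.8 (mid) and 2/3 (ultra), so the mean is about 0.656.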

Experimental Results and Analysis
Extensive experiments were conducted on footwear recognition tasks on the real-world dataset that we collected, as described in Section 3. We first compare our method with different state-of-the-art works under different device locations: held in the hand, attached to the waist, and placed in the handbag. Then, to demonstrate how well Sensing-HH works in real-world applications, an additional experiment was performed on a new fusion scene.

Comparison with Baselines
In this subsection, we extensively compare our model with a set of baseline methods under different scenes for footwear recognition. Table 4 presents the comparison between the proposed Sensing-HH and the state-of-the-art methods as well as the baselines, under three groups of sliding window parameter settings, to quantitatively show the performance differences between the models; the best performance is emphasized in bold. In general, the deep neural network-based models [9,57,58] indeed improve considerably thanks to the complex features captured from the raw signals. On the other hand, the handcrafted feature-based methods only have satisfactory results in the waist scene. Sensing-HH achieved the best $F_m$ score, which was 0.896, when the smartphone was attached to the waist. Meanwhile, we found that the suitable size of the sliding windows for this recognition task was 2 s. In this scene, the performance improvements of Sensing-HH over the RF [55], SVM [56], CNN [57], BiLSTM [58], and CNN-LSTM [9] models are 23.1%, 17.8%, 14.2%, 16.5% and 15.4%, respectively. Overall, Sensing-HH has robust performance in most of the scenes, regardless of the device location. The reason could be its attention-based two-stream deep hybrid networks. We discuss this further in the following subsection.

Ablation Study
In this subsection, to demonstrate the effectiveness of our framework design, we performed a careful ablation study to examine the contributions of the proposed components to the model's classification performance. Specifically, we removed the components of our Sensing-HH framework one at a time. We named the versions of Sensing-HH with different components removed as follows: (1) HHw/oATT: the Sensing-HH model without the attention component. (2) HHw/oLSTM: the Sensing-HH model without the BiLSTM component. For the different variants, we tuned the hidden dimension of the models so that they had numbers of model parameters similar to the complete Sensing-HH, to remove the performance gain induced by model complexity.
The experiment measures use W s : 2 s, O: 50% settings. The results are shown in Figure 6, with comparison to other deep learning models. Some observations from these results were worth noting: (1) The best recognition performance was obtained with the smartphone attached to the waist.
The Sensing-HH significantly outperformed other deep models in this scene. But the differences amongst all the deep models in other scenes, i.e., held by the hand or in the bag, were not significant. (2) Removing the attention component (in HHw/oATT) from the Sensing-HH caused the most significant performance drop in the waist scene, which dropped nearly 14.6%. This suggests the importance of the attention component in this mode.
(3) Removing the BiLSTM component (in HHw/oLSTM) from the Sensing-HH caused a performance drop of nearly 4-6% in most of the scenes.
The conclusion is that all of the components together lead to the robust performance of Sensing-HH in all of the scenes.

Failure Cases
To analyze the failure cases of our proposed Sensing-HH, we visualized the confusion matrix of the misclassification results. The details of the instances are shown in Table 5. From the confusion matrix in Figure 7, we found that the recognition accuracy of the Flat and Mid HH classes in the cross-view benchmark was relatively lower than that of the Ultra HH class, which had a precision of 0.92 and a recall of 0.89. Furthermore, we paid attention to the specific failure cases of the Mid HH and Ultra HH classes, as shown in Table 6. We found that the gait pattern changes related to the different shoes seemed to be affected by the subject's body height and weight.

Conclusions
In summary, we developed Sensing-HH for footwear recognition based on daily life gait data captured by the built-in motion sensors of smartphones. To the best of our knowledge, we are the first to recognize the subject's footwear by the dynamics of gait changes acquired from smartphone sensors in daily life. We categorize the shoes into 3 classes by the height of the heels (flat, mid HH and ultra HH). Sensing-HH is a novel deep attention model which can automatically learn a hierarchical feature representation and the infinite temporal contexts from raw signals through the hybrid net structures. By adding the attention mechanism, it can also implicitly learn to suppress irrelevant parts of the raw signals and to highlight salient features useful for this specific task. We used a daily life gait dataset to evaluate the performance of Sensing-HH and the baseline models. Compared with three existing deep neural networks and two shallow models, Sensing-HH performed significantly better in most scenarios.
The results show that the proposed model is able to make footwear recognition more efficient and automated. It can also be applied to a large population, as it only requires data from smartphones and can accurately recognize footwear using daily life gait data with no restriction on the location of the measuring devices. Sensing-HH has the potential to extend the use of motion sensor data, for example, to help build a robust biometric system that includes gait pattern analysis. Future studies will focus on how to accurately recognize footwear in a dataset with a wider range of subject heights and weights, so that the model can work under an even closer-to-reality scenario.