Multimodal Few-Shot Learning for Gait Recognition

A person's gait is a behavioral trait that is uniquely associated with each individual and can be used to recognize the person. Because information about the human gait can be captured by wearable devices, a few studies have proposed methods to process gait information for identification purposes. Despite recent advances in gait recognition, the open set gait recognition problem remains challenging for current approaches. To address the open set gait recognition problem, a system should be able to deal with unseen subjects who were not included in the training dataset. In this paper, we propose a system that learns a mapping from a multimodal time series collected using an insole to a latent (embedding vector) space to address the open set gait recognition problem. The distance between two embedding vectors in the latent space corresponds to the similarity between two multimodal time series. Using the characteristics of the human gait pattern, multimodal time series are sliced into unit steps. The system maps unit steps to embedding vectors using an ensemble consisting of a convolutional neural network and a recurrent neural network. To recognize each individual, the system learns a decision function using a one-class support vector machine from a few embedding vectors of the person in the latent space; the system then determines whether an unknown unit step is recognized as belonging to a known individual. Our experiments demonstrate that the proposed framework recognizes individuals with high accuracy regardless of whether they have been registered. In an environment in which all people wear the insole, the framework could be widely used for user verification.


Introduction
The human gait, a person's manner of walking, is sufficient for discriminating between individuals [1][2][3]. Information about a person's gait has been utilized for diagnosing diseases [4][5][6][7], and it can also be used for biometric authentication [8][9][10][11][12][13]. Gait recognition has three main advantages compared to other typical biometric authentication methods. First, it is robust against impersonation attacks. Second, it does not require physical contact between sensors and people. Lastly, it may not be necessary to use vision sensors to capture gait information [14,15].
A typical framework for gait recognition consists of two parts: capturing data that are a good representation of the gait, and using algorithms to classify the collected data to identify individuals. In that sense, we can categorize the gait recognition framework based on data acquisition devices and data analysis algorithms. Specifically, information about the gait can be collected using vision sensors, pressure sensors, and inertial measurement units (IMUs), then the collected data can be analyzed using linear discriminant analysis (LDA), k-nearest neighbor (k-NN), hidden Markov model (HMM), support vector machine (SVM), convolutional neural network (CNN), or combinations thereof [15].
In general, two types of recognition problems exist. The first one is closed set recognition, whereby all testing classes are known at the time of training, and the other one is open set recognition where incomplete knowledge is given at the time of training, and unknown classes can be classified by an algorithm during testing [16]. In gait recognition, the majority of frameworks are designed to solve the closed set recognition problem, and few approaches attempt to address the open set recognition problem [17].
In a recent study of the closed set recognition problem, the original data were divided into separate unit steps to use the data more efficiently and effectively [13]. In this study, we adapted that method to recognize individuals from their gait. The pressure sensors, 3D-axis accelerometer, and 3D-axis gyroscope installed in the insoles of shoes record the time series data of gait information [18]. Human walking cycles consist of a stance phase and a swing phase [19]. During the swing phase, since the entire foot is in the air, the values reported by the pressure sensors should be zero. Considering that, we might be able to divide the original time-series data into consecutive unit steps by detecting the time indices at which the pressure value is zero. However, due to interference between sensors or high temperatures in the insole, the reported pressure values are often non-zero during the swing phase [20]. To avoid potential errors, the authors of [13] determined the unit steps using Gaussian smoothing.
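The segmentation idea described above can be sketched as follows. This is a minimal illustration, not the procedure of Reference [13], whose details are not reproduced here: it assumes that the eight pressure channels of one foot have been summed into a single sequence, and the kernel width `sigma` and the stance threshold `threshold` are hypothetical values chosen only for the example.

```python
import math


def gaussian_smooth(signal, sigma=2.0):
    """Smooth a 1D sequence with a truncated Gaussian kernel (renormalized at the edges)."""
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    smoothed = []
    for t in range(len(signal)):
        weighted_sum = 0.0
        weight_total = 0.0
        for k, w in zip(offsets, kernel):
            if 0 <= t + k < len(signal):
                weighted_sum += w * signal[t + k]
                weight_total += w
        smoothed.append(weighted_sum / weight_total)
    return smoothed


def split_unit_steps(pressure_sum, sigma=2.0, threshold=0.5):
    """Return (start, end) index pairs between consecutive swing-to-stance transitions.

    Smoothing suppresses isolated non-zero readings during the swing phase,
    so they do not create spurious step boundaries.
    """
    smooth = gaussian_smooth(pressure_sum, sigma)
    in_stance = [value > threshold for value in smooth]
    starts = [
        t for t in range(1, len(in_stance))
        if in_stance[t] and not in_stance[t - 1]
    ]
    return list(zip(starts[:-1], starts[1:]))
```

With a periodic toy signal that contains one spurious reading in the swing phase, the smoothed signal yields evenly spaced boundaries, whereas thresholding the raw signal at zero would insert an extra boundary at the spurious sample.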
To recognize individuals from consecutive unit step data, we propose the use of an ensemble consisting of a CNN and a recurrent neural network (RNN). The ensemble network maps multimodal unit step data to embedding vectors in a latent space. To evaluate the system, we use three datasets: training, unknown known, and unknown unknown. In the training phase, the system is provided with all labeled samples in the training dataset and a few (3 ≤ k ≤ 10) labeled samples in the unknown known dataset. To train this network, we used the triplet loss [21], which forces the distances between the embedding vectors of homogeneous unit steps to be much smaller than the distances between the embedding vectors of heterogeneous unit steps in the training dataset. Once the ensemble is trained, we randomly select k unit steps for every person in the unknown known dataset, and we store the corresponding k embedding vectors and their centroid. Using the one-class support vector machine (OSVM) algorithm [22], we compute a decision function from the k embedding vectors of each individual who was included in the unknown known dataset.
In the test phase, a unit step from the unknown known dataset (excluding the k unit steps selected in the training phase) or from the unknown unknown dataset is given, and the system should indicate whether the given unit step belongs to someone who was included in the unknown known dataset. The system accomplishes this by mapping the unknown unit step to an embedding vector using the ensemble network and finding the nearest neighbor using the distances between the embedding vector of the unknown unit step and the centroids of the individuals in the unknown known dataset. Finally, we conclude that the unknown unit step belongs to the nearest neighbor if the embedding vector of the unknown unit step is inside the decision boundary of the nearest neighbor.
In summary, the contributions of this study are as follows: (1) We designed an ensemble network that uses CNN and RNN, which is applicable to open set gait recognition. (2) We developed a system that addresses the open set gait recognition problem using the OSVM algorithm. (3) The system requires only a few walking cycles of an individual to be able to recognize them.

Related Work
Studies on gait recognition began with the use of vision sensors [23]. These approaches were subsequently further studied and developed [24][25][26][27][28][29]. In general, vision-based gait recognition requires strict conditions while collecting the data. For example, a video sequence would have to contain only the individuals that need to be recognized. Apart from this, the recognition accuracy is not sufficiently high. Moreover, the viewpoint and orientation of the sensing devices also affect the accuracy. To recognize a subject from a video sequence that includes more than one person, each subject should be segmented and tracked individually [30,31]. To achieve a stable recognition accuracy regardless of the viewpoint and orientation of the sensing devices, a 3D construction model or a view transformation model can be utilized [32][33][34].
In recent studies, pressure sensors and IMUs are widely used to collect data. Typically, IMUs consist of an accelerometer, a gyroscope, and a magnetometer. For instance, gait information was collected from IMUs placed on the chest, lower back, right wrist, knee, and ankle of subjects [8], and then a CNN-based predictive model [35] identified individuals. Similarly, gait information was collected from a variety of IMUs attached to the user in multiple positions, and the user's activity was recognized by analyzing the time series patterns of the data [36]. Later, pressure sensors and IMUs were installed in wearable devices, such as, fitness trackers, smartphones, or shoe insoles [37]. For example, gait information was collected using IMUs installed in smartphones [38]. Subjects carried the smartphones in their front trouser pockets to gather data, and then a mixed model consisting of CNN and SVM [39] was used to recognize individuals. In another study, gait information was measured using pressure sensors and an accelerometer on the shoe insoles [40], and the collected data were classified using null space LDA [41]. However, these methods require the placement of different types of sensors on various parts of the body, take a long period of time to gather data, or need improvement in terms of identification accuracy. More recently, an ensemble network was used to identify individuals using gait information, but their framework is only effective for solving the closed set recognition problems [13].
The open set gait recognition problem has been partially addressed in the literature. For example, gait information was captured by 11 cameras and classified using a CNN with a softmax output layer [42]. To address the open set gait recognition problem, the softmax layer included one more class than the number of subjects in the training dataset. To train the network, samples of subjects who were not included in the training dataset were labeled as 'not recognized.' Because it relies on the softmax output layer, this approach is not scalable: the network must be trained again every time a new subject is added. In another example, gait information collected using IMUs installed in smartphones was recognized using a framework based on CNN and OSVM [38]. Unlike our study, that method required about a hundred unit steps to train the OSVM algorithm, and the system was evaluated using unit steps in the unknown known dataset only.

Method
In our work, subjects' gait information was measured using a shoe insole. The original data format is a vector of time series that consists of consecutive unit steps. We processed the time series vector into fixed size fragments (i.e., unit steps) to improve the recognition accuracy and reduce the computational complexity. These unit steps are then recognized using the proposed system.

Data Pre-Processing
We used a commercial shoe insole, FootLogger [43], to record subjects' gait information. The design of the insole is depicted in Figure 1. The insole for each foot has eight pressure sensors, a 3D-axis accelerometer, and a 3D-axis gyroscope. Each pressure sensor reports the pressure at one of three levels: 0, 1, or 2. The accelerometer and gyroscope measure acceleration and rotation in three dimensions as integers between -32,768 and 32,768. The sampling rate of the insole was 100 Hz, and we collected data from both of the subjects' feet. We followed the notation of a previous study [13]. We denote univariate and multivariate time series by $x(t)$ and $\mathbf{x}(t)$, respectively. Different sensing modalities are expressed using superscript letters, that is, $x^{p}(t)$ for pressure, $x^{a}(t)$ for acceleration, and $x^{r}(t)$ for rotation. Different subject identifications are expressed using subscript numbers, that is, $x_{i}(t)$ for id = i. We also adapt the method from Reference [13] to determine the unit steps from the original time series. The only difference is that, for brevity, we use the notation $s$ and $s(t)$ interchangeably. We repeat the notation here for the reader: the ith unit step of subject id = a for sensing modality m is denoted by $s^{m}_{i,a}$, where m ∈ {pre, acc, rot}, and every unit step has the same fixed length d. Examining the minimum length of the subjects' unit steps, we set d = 87 in the experiments. Using the timestamps of the unit steps, the original time series of both feet were converted into the standard format. The procedure of converting to the standard format also follows Reference [13]. We omit the detailed information to avoid repetition.

Network Architecture
We adapted and upgraded the networks that were used previously [13]. Figure 2 depicts the design of the network architecture. A number of different architectures (including shallower/deeper and narrower/wider) were evaluated and compared; however, their performance differences were negligible. The original datasets include time series data of pressure, acceleration, and rotation. Since the pressure sensors measured the pressure on the same foot during the same walking cycle, we assumed that their values are correlated. Similar assumptions were made for the three-dimensional acceleration and rotation values. Considering these correlations, we designed an encoding network model combining a CNN and an RNN. The proposed network model maps unit steps of pressure $s^{prs}$, acceleration $s^{acc}$, and rotation $s^{rot}$ in the standard format to embedding vectors $v$:
$$v = f(s^{prs}, s^{acc}, s^{rot}) \qquad (1)$$
We use the notation $f_{cnn}$ and $v_{cnn}$ when only the CNN is activated, $f_{rnn}$ and $v_{rnn}$ when only the RNN is activated, and $f_{ens}$ and $v_{ens}$ when both the CNN and RNN are activated.

Convolutional Neural Network
Our proposed CNN includes three identical networks, one per sensing modality, that function independently, and the outputs of these three networks are concatenated. Each network contains three one-dimensional (1D) convolutional layers with 32, 64, and 128 filters, and each convolutional layer is followed by a batch normalization layer. Each filter in the first convolutional layer has a size of 20 × (w · 2), while the sizes of the filters are 20 × 32 and 20 × 64 for the second and third convolutional layers, respectively. More specifically, for the first convolutional layer, the width of each filter is equal to that of the standard format (w · 2). We slide each filter across the height of the input and compute the dot product between the filter and the input, resulting in a series of scalar values. This convolution operation is repeated for the 32 filters, and the resulting series of scalar values are stacked horizontally, whereby the width of the output becomes equal to the number of filters. The stride of all convolutional layers is set to 1, and the padding size is set such that the height of the output is the same as that of the input. Similarly, for the second and third convolutional layers, the width of the filters equals the number of filters in the previous convolutional layer. Therefore, the shape of the feature map is 87 × 32, 87 × 64, and 87 × 128 after the first, second, and third convolutional layers, respectively. The last feature map is flattened, so the size of the feature vector is 87 · 128. The feature vectors of the three networks are concatenated to form one vector, followed by two fully connected layers. We use a rectified linear unit (ReLU) as the activation function for every convolutional layer and the first fully connected layer to avoid the vanishing gradient phenomenon [44].
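The shape arithmetic in this paragraph can be checked with a few lines of plain Python. The sketch below only tracks shapes and filter weight counts; it does not implement the convolution itself. The value w = 8 used in the example is an assumption based on the eight pressure sensors per foot, and bias terms are ignored.

```python
def same_padded_conv_shapes(height, filters):
    """Feature-map shapes after stride-1 1D convolutions whose padding preserves the height.

    The height stays constant and the width becomes the number of filters of each layer.
    """
    return [(height, n_filters) for n_filters in filters]


def filter_weight_counts(input_width, filters, kernel_height=20):
    """Number of filter weights per layer (biases excluded).

    Each filter spans `kernel_height` rows and the full width of its input.
    """
    counts = []
    width = input_width
    for n_filters in filters:
        counts.append(kernel_height * width * n_filters)
        width = n_filters  # the next layer sees one column per filter
    return counts


D = 87                    # standard length of a unit step
FILTERS = (32, 64, 128)   # filters of the three convolutional layers
ASSUMED_W = 8             # assumption: eight pressure channels per foot

shapes = same_padded_conv_shapes(D, FILTERS)
flattened = shapes[-1][0] * shapes[-1][1]   # 87 * 128 values per modality
concatenated = 3 * flattened                # three modality networks
weights = filter_weight_counts(2 * ASSUMED_W, FILTERS)
print(shapes, flattened, concatenated, weights)
```

Under these assumptions, the flattened feature vector of one modality has 11,136 entries, and the concatenated vector that enters the fully connected layers has 33,408 entries.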

Recurrent Neural Network
Similarly, our proposed RNN includes three identical networks that operate independently, given each sensing mode. The outputs of these three networks are ultimately concatenated. Each network contains two consecutive long short-term memory (LSTM) layers [45]. LSTM is a modified version of RNN with the capability of utilizing internal memory units to overcome the vanishing gradient problem of traditional RNN models. More specifically, we include 128 memory units in each LSTM layer and activate the input, output, and forget gate by the sigmoid function.
To prevent overfitting, the dropout rate was set at 0.2 [46]. Similar to the CNN, the input for the first LSTM layer has the shape 87 × (w · 2). For each row of input data, the LSTM layer creates a scalar value per memory unit; the resulting scalar values are concatenated to form an output vector of shape 87 × 128. The second LSTM layer returns the scalar value per memory unit, therefore, the size of the output vector is 128. The output vectors of the three networks are concatenated to form one vector, followed by two fully connected layers.
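The shapes produced by the two stacked LSTM layers can be traced in the same way. The sketch below is shape bookkeeping only, not an LSTM implementation. The weight count uses the textbook LSTM formulation with one bias vector per gate, which may differ from the count reported by a particular deep learning library, and w = 8 is again an assumption for the pressure modality.

```python
def lstm_output_shape(time_steps, units, return_sequences):
    """Output shape of one LSTM layer for a single unit step."""
    if return_sequences:
        return (time_steps, units)  # one value per memory unit at every time index
    return (units,)                 # only the values at the last time index


def lstm_weight_count(input_width, units):
    """Weights of a textbook LSTM layer.

    Four weight blocks (input, forget, and output gates plus the cell candidate),
    each with input weights, recurrent weights, and one bias vector.
    """
    return 4 * (units * (input_width + units) + units)


D = 87         # standard length of a unit step
UNITS = 128    # memory units per LSTM layer
ASSUMED_W = 8  # assumption: eight pressure channels per foot

first = lstm_output_shape(D, UNITS, return_sequences=True)
second = lstm_output_shape(D, UNITS, return_sequences=False)
concatenated = 3 * second[0]  # three modality networks
print(first, second, concatenated)
```

The first layer keeps the full sequence (87 × 128) so that the second layer can consume it, and the second layer reduces each modality to a 128-dimensional vector, giving 384 values after concatenation.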

Embedding Vector
We take the last fully connected layer with 128 units of the CNN and RNN as the output of each network model. Therefore, the embedding vectors of the CNN and RNN both have dimension 128, that is, $f_{cnn}(\cdot) \in \mathbb{R}^{128}$ and $f_{rnn}(\cdot) \in \mathbb{R}^{128}$. The embedding vector of the ensemble model is generated by concatenating the embedding vectors of the CNN and RNN; hence, its dimension is 256, that is, $f_{ens}(\cdot) \in \mathbb{R}^{256}$. All embedding vectors are normalized, that is, $\|f_{cnn}(\cdot)\|_2 = \|f_{rnn}(\cdot)\|_2 = \|f_{ens}(\cdot)\|_2 = 1$.

Loss Function
Let $s^{m}_{i,a}$ and $s^{m}_{j,a}$ be unit steps of subject id = a, and let $s^{m}_{k,b}$ be a unit step of subject id = b (b ≠ a) for a sensing modality m. The model takes three types of unit steps: pressure, acceleration, and rotation. For brevity, however, we use the simplified notation $f(s_{i,a})$ instead of $f(s^{p}_{i,a}, s^{a}_{i,a}, s^{r}_{i,a})$. Similar to the triplet loss [21], the multimodal triplet loss is defined as
$$L = \sum_{(i,j,k)} \max\left(0,\ \|v_{i,a} - v_{j,a}\|_2^2 - \|v_{i,a} - v_{k,b}\|_2^2 + \alpha\right)$$
where $v_{i,a} = f(s_{i,a})$, $v_{j,a} = f(s_{j,a})$, $v_{k,b} = f(s_{k,b})$, and α is a margin (we set α = 1.0). The multimodal triplet loss forces the distance between $v_{i,a}$ and $v_{j,a}$ to be smaller than the distance between $v_{i,a}$ and $v_{k,b}$ for all possible triplets in the training dataset. A conceptual diagram of the multimodal loss is illustrated in Figure 3.
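The margin-based loss on normalized embeddings can be illustrated with a small sketch. It assumes squared Euclidean distances, as in the triplet loss of Reference [21], and it assumes that the ensemble embedding is renormalized after concatenation so that it also has unit length. Both are assumptions of this example rather than details confirmed by the text, and the function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on (anchor-positive distance) - (anchor-negative distance) + margin.

    `anchor` and `positive` are embeddings of the same subject,
    `negative` is an embedding of a different subject.
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


def ensemble_embedding(v_cnn, v_rnn):
    """Concatenate the CNN and RNN embeddings and renormalize to unit length (assumed)."""
    return l2_normalize(list(v_cnn) + list(v_rnn))
```

The loss is zero once the heterogeneous pair is farther apart than the homogeneous pair by at least the margin, so only triplets that violate the margin contribute to training.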

Few-Shot Learning
We define the unknown known and unknown unknown datasets. In the unknown known dataset, the samples (unit steps) are not used for training the encoding function (i.e., the CNN, RNN, or ensemble networks); instead, only a few samples are utilized for training the decision boundaries of individuals using OSVM. In the unknown unknown dataset, the samples are used only for testing.
For a positive integer 3 ≤ n ≤ 10, let $\{s_{i,a} \mid 1 \le i \le n\}$ be the set of randomly selected unit steps of subject id = a in the unknown known dataset, and let $\{v_{i,a} = f(s_{i,a}) \mid 1 \le i \le n\}$ be the set of corresponding embedding vectors generated by the trained network model, which can be the CNN, RNN, or ensemble network. For each subject in the unknown known dataset, the system first computes the centroid of the n embedding vectors. The centroid of subject id = a is defined by $M_a = \frac{1}{n}\sum_{i=1}^{n} v_{i,a}$. In addition, the system learns decision functions in the latent space using the OSVM algorithm [22] for all subjects. The algorithm takes $\{v_{i,a} \mid 1 \le i \le n\}$ as input and solves the following optimization problem:
$$\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j K(v_{i,a}, v_{j,a}) \quad \text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu n},\quad \sum_{i=1}^{n} \alpha_i = 1,$$
where $K(v, v') = e^{-\gamma \|v - v'\|_2^2}$ is a radial basis kernel function, $\alpha_i$ are the Lagrange multipliers, and γ and ν are among the hyper-parameters of the system. Let $s_{*,u}$ be a unit step of an unknown subject u in either the unknown known or unknown unknown dataset. The symbol * denotes that the unit step can be any unit step of the subject u. For each subject a in the unknown known dataset, the decision function of $v_{*,u}$ for subject id = a is defined by
$$g_a(v_{*,u}) = \sum_{i=1}^{n} \alpha_i K(v_{i,a}, v_{*,u}) - \sum_{i=1}^{n} \alpha_i K(v_{i,a}, v_{h,a})$$
for any h that satisfies the condition $0 < \alpha_h < \frac{1}{\nu n}$ and 1 ≤ h ≤ n. An unknown subject could be one who was included in either the unknown known dataset or the unknown unknown dataset. Therefore, the system should be able to recognize a unit step if it belongs to a subject in the unknown known dataset. On the other hand, the system should be able to reject a unit step if it belongs to a subject in the unknown unknown dataset. The system determines the prediction of u as follows:
1. Find the nearest subject $a^{*} = \arg\min_{a} \|v_{*,u} - M_a\|_2$ among the subjects in the unknown known dataset.
2. If $g_{a^{*}}(v_{*,u}) \ge \tau$, "u is recognized as $a^{*}$." Otherwise, "u is not recognized,"
where τ is one of the hyper-parameters of the system. A conceptual diagram of the test phase is illustrated in Figure 4.
Figure 4. Illustration of gait recognition using the trained model. In the example, unit step $s_{*,u}$ is recognized as that of the "green" subject, whereas unit step $s_{*,w}$ is not recognized.
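The test-phase procedure (nearest centroid followed by a thresholded one-class decision value) can be sketched as below. The sketch does not solve the OSVM optimization problem: the multipliers are set to uniform placeholder values and the offset `rho` is passed in by hand, so it only illustrates how a trained decision function would be evaluated. The acceptance rule "decision value at least τ" and all function names are assumptions of this example.

```python
import math


def rbf_kernel(u, v, gamma):
    """Radial basis kernel exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def centroid(vectors):
    """Component-wise mean of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def enroll(vectors, rho):
    """Store the few-shot data of one subject.

    Placeholder: uniform multipliers and a hand-picked offset stand in for
    the values an OSVM solver would return.
    """
    n = len(vectors)
    return {
        "centroid": centroid(vectors),
        "support": list(vectors),
        "alphas": [1.0 / n] * n,
        "rho": rho,
    }


def decision_value(v, model, gamma):
    """Kernel expansion minus the offset; larger means deeper inside the boundary."""
    expansion = sum(
        alpha * rbf_kernel(s, v, gamma)
        for alpha, s in zip(model["alphas"], model["support"])
    )
    return expansion - model["rho"]


def recognize(v, enrolled, gamma, tau=0.0):
    """Return the nearest enrolled subject if v passes that subject's boundary, else None."""
    nearest = min(enrolled, key=lambda a: math.dist(v, enrolled[a]["centroid"]))
    if decision_value(v, enrolled[nearest], gamma) >= tau:
        return nearest
    return None
```

Lowering `tau` below zero enlarges every acceptance region, which trades rejections of enrolled subjects against acceptances of unknown ones.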

Experiment
Using empirical datasets, we demonstrate the recognition accuracy of our proposed method with distinct sensing modalities (single and triple) and different network architectures (CNN, RNN, and ensemble).

Datasets and Evaluation Metric
We gathered gait information data from 30 adults aged 20 to 30 years. The insole was used to collect the data while the subjects walked for approximately 3 minutes. The data that were gathered during this time included approximately 151 unit steps on average per subject, and the entire dataset consisted of 4544 unit steps. In the experiment, we set the standard length to be d = 87.
As shown in Figure 5, we split the data into three sets: training, unknown known, and unknown unknown. First, we randomly selected 16 out of the 30 subjects and allocated 100% of their unit steps to the training dataset, which was used to train the CNN, RNN, and ensemble models independently. Second, we arbitrarily selected 7 out of the remaining 14 subjects. For each of the selected subjects, n = 10 unit steps were utilized to train the OSVM algorithm, and the decision boundary of the subject was determined. Except for these n unit steps, all unit steps of the selected 7 subjects were allocated to the unknown known test dataset. Finally, all unit steps of the remaining 7 subjects were allocated to the unknown unknown test dataset [17]. The number of unit steps in the training dataset was approximately 2423, and the numbers of unit steps in the unknown known test and unknown unknown test datasets were approximately 990 and 1060, respectively. We repeated the generation of the datasets 20 times. For each dataset, we trained and tested the network independently and reported the averaged evaluation metrics. For a unit step in the unknown known test dataset, we define a true positive (TP) if the unit step is recognized correctly, and a false negative (FN) otherwise. In contrast, for a unit step in the unknown unknown test dataset, we define a true negative (TN) if the unit step is not recognized as any subject in the unknown known test dataset, and a false positive (FP) otherwise. We report the true positive rate $TPR = \frac{TP}{TP + FN}$, the true negative rate $TNR = \frac{TN}{TN + FP}$, and the accuracy $ACC = \frac{TP + TN}{TP + FN + TN + FP}$.
Figure 5. Illustration of the approach we used to split the data into training, unknown known test, and unknown unknown test datasets.
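The three metrics follow directly from the definitions of TP, FN, TN, and FP given above. The sketch below assumes that a prediction is either a subject identifier or `None` for "not recognized"; the function name and the input layout are hypothetical.

```python
def open_set_metrics(known_results, unknown_results):
    """Compute TPR, TNR, and ACC for open set recognition.

    known_results:   list of (true_id, predicted_id_or_None) pairs for unit
                     steps in the unknown known test dataset.
    unknown_results: list of predicted_id_or_None for unit steps in the
                     unknown unknown test dataset.
    """
    # A known unit step counts as TP only if it is assigned to its own subject.
    tp = sum(1 for true_id, predicted in known_results if predicted == true_id)
    fn = len(known_results) - tp
    # An unknown unit step counts as TN only if it is rejected.
    tn = sum(1 for predicted in unknown_results if predicted is None)
    fp = len(unknown_results) - tn
    return {
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "ACC": (tp + tn) / (tp + fn + tn + fp),
    }
```

Note that a known unit step assigned to the wrong enrolled subject is counted as a false negative here, the same as a rejected one, because it is not recognized correctly.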

Multi-Modal Sensing
The distributions of ACC as a function of γ and ν for the CNN, RNN, and ensemble models are shown in Figure 6a. Clearly, the choice of γ and ν is critical to the overall recognition accuracy of the models. A comparison of the areas in which the rates are greater than 90% (light green to yellow areas) indicates that the region of the ensemble model is broader than those of the CNN and RNN models. This means that the ensemble model depends only weakly on the choice of γ and ν, which makes its recognition result more robust. The distribution of the TPR is shown in Figure 6b. A comparison of the areas in which the rates are greater than 93% (yellow) shows that the region of the RNN model is slightly broader than that of the CNN model. The overall distribution of the ensemble model is similar to that of the RNN model. The distribution of the TNR is shown in Figure 6c. Contrary to the distributions of the TPR, the overall distribution of the ensemble model is almost identical to that of the CNN model. In particular, a comparison of the areas in which the rates are greater than 93% (yellow) reveals that the region of the CNN model is significantly broader than that of the RNN model. These distributions of the TNR explain why the ACC of the RNN model is significantly lower than the ACC of the CNN model. Utilization of the proposed system in a practical application would require the hyperparameters to be tuned by considering both the TPR and TNR at the same time. For example, if the system were to reject all unit steps, the TNR would be 100%, but the TPR would be 0%. For this reason, we set the hyperparameters to minimize the difference between the TPR and TNR.
To determine the effect of τ, we specified separate values of γ and ν for the different models in the following experiment. We used γ = 1.9 and ν = 0.06 for the ensemble model, γ = 1.8 and ν = 0.06 for the CNN model, and γ = 2.2 and ν = 0.08 for the RNN model.
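The selection criterion described here, minimizing the difference between the TPR and TNR over a grid of (γ, ν) values, can be sketched as follows. The input layout and the tie-breaking rule (prefer the larger TPR + TNR among equally balanced settings) are assumptions of this example; the text does not state how ties were handled.

```python
def select_hyperparameters(grid_results):
    """Pick the (gamma, nu) pair whose TPR and TNR are closest to each other.

    grid_results maps (gamma, nu) -> (tpr, tnr), as measured on held-out data.
    Ties in |TPR - TNR| are broken in favor of the larger TPR + TNR (assumed rule).
    """
    def balance_key(item):
        _, (tpr, tnr) = item
        return (abs(tpr - tnr), -(tpr + tnr))

    best_params, _ = min(grid_results.items(), key=balance_key)
    return best_params
```

Balancing the two rates avoids degenerate settings such as rejecting every unit step, which would maximize the TNR while driving the TPR to zero.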
In Figure 7, we see that choosing a τ value smaller than 0 significantly improves the TPR and ACC. Based thereupon, we propose alternative options for choosing τ instead of τ = 0.0 for the decision boundary in the latent space.

Uni-Modal Sensing
To determine the contribution of each sensing modality to the accuracy, we trained and tested the models using uni-modal sensing. Effectively, for each sensing modality, only the corresponding sub-network was activated, whereas the two other sub-networks were deactivated while the network was being trained and tested. The TPR, TNR, and ACC of the uni-modal ensemble model as a function of τ are compared in Figure 8. The overall performance of the ensemble model using pressure sensing was slightly lower than that of the others.
Figure 9 compares the accuracy of unimodal sensing and multimodal sensing. All the network models (ensemble, CNN, RNN) performed best with the acceleration sensing modality and worst with the pressure sensing modality. In particular, the difference between these modalities is noticeable when the RNN model is used. Detailed TPR, TNR, and ACC results obtained with all the network models for multimodal and unimodal sensing are summarized in Table 1.
The recognition accuracies reported in previous papers [13,38] ranged from 98.5% to 99.5%, which is higher than the result of this study. However, a direct comparison is inappropriate because of the different problem settings (for example, addressing the closed set problem [13]) or the different datasets and devices (for example, using only the unknown known test dataset collected by smartphones [38]).

Discussion
To verify that the system forms a discriminative cluster for each subject, we present the t-SNE [47] plots of the embedding vectors of the unit steps in the unknown known and the unknown unknown test datasets in Figure 10. Considering that the networks were trained with subjects' unit steps in the training set only, these plots show that the proposed system learns the general characteristics of unknown subjects' gait patterns satisfactorily. To analyze our results quantitatively, we devised a distance function between two unit steps, defined as the distance between their embedding vectors in the latent space. The distributions of the distances between homogeneous and heterogeneous unit steps are plotted in Figure 11. The blue line shows the distribution of the distances between homogeneous unit steps, which are two unit steps of the same subject, and the orange line shows the distribution of the distances between heterogeneous unit steps, which are two unit steps of different subjects. A clear separation between the two distribution curves would indicate outstanding recognition accuracy. Unfortunately, the two curves overlap to a certain extent, indicating that recognition errors may occur.
Figure 11. Distributions of distances between homogeneous unit steps and between heterogeneous unit steps in the latent space.
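The two distance distributions can be collected from labeled embedding vectors as sketched below. The sketch assumes the Euclidean distance between embedding vectors, and the overlap measure at the end is a simple illustrative summary introduced for this example, not a quantity reported in the study.

```python
import math
from itertools import combinations


def pairwise_distance_groups(embeddings, labels):
    """Split all pairwise distances into same-subject and different-subject groups."""
    homogeneous, heterogeneous = [], []
    for (u, a), (v, b) in combinations(zip(embeddings, labels), 2):
        distance = math.dist(u, v)  # Euclidean distance (assumed)
        if a == b:
            homogeneous.append(distance)
        else:
            heterogeneous.append(distance)
    return homogeneous, heterogeneous


def overlap_fraction(homogeneous, heterogeneous):
    """Fraction of heterogeneous distances not exceeding the largest homogeneous distance.

    A value of 0.0 means the two groups are perfectly separated by a single cutoff.
    """
    cutoff = max(homogeneous)
    return sum(1 for d in heterogeneous if d <= cutoff) / len(heterogeneous)
```

Histograms of the two returned lists correspond to the two curves of a plot such as Figure 11, and a smaller overlap fraction corresponds to better separated curves.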

Conclusions
We proposed a new framework to recognize people based on their gait information. The proposed framework is the first approach to address the complete open set gait recognition problem using data collected with wearable devices, namely insoles. Assuming an environment in which all people wear the insole, our proposed framework could be applied to a variety of functions, for example, user verification. To build a user verification system, the system administrator would need to collect gait information for only 10 cycles of walking for every user. This would enable the system to recognize a user by examining a single cycle of their walking with 93.6% accuracy. Because the system does not require the encoder networks to be retrained every time users are added, our proposed framework is highly scalable. In future work, we aim to improve the recognition accuracy by minimizing the overlap between the distributions of the distances of homogeneous and heterogeneous unit steps.