A Method of Ultrasonic Finger Gesture Recognition Based on the Micro-Doppler Effect

Abstract: With the popularity of small-screen smart mobile devices, gestures are in high demand as a new mode of human–computer interaction, and finger gestures in particular are familiar to people for controlling devices. In this paper, a new method for recognizing finger gestures is proposed. Ultrasound is actively emitted, and the micro-Doppler effect caused by finger motions is measured at high resolution. Micro-Doppler processing then generates micro-Doppler feature maps of the finger gestures. Since a feature map has the same structure as a single-channel image, a recognition model based on a convolutional neural network is constructed for classification. The optimized recognition model achieved an average accuracy of 96.51% in the experiments.


Introduction
Touchscreen control is now used in most mobile devices, such as mobile phones and tablets. However, touchscreens respond poorly to wet or gloved hands, and on small-screen devices such as smart watches they are inconvenient to operate. As a part of human communication, gestures can express a wide variety of emotions and thoughts; they are often considered the second most natural method of interaction between humans and the environment, as well as among humans [1]. Gestures are convenient, offer a vast interaction space and high flexibility, and provide an excellent interactive experience. Therefore, gestures in human–computer interaction have gained greater attention in recent years [2].
Wearable sensor-based: Many gesture recognition methods based on wearable sensors have been reported [3,4,11,12]. John Weissmann et al. [3] explored a data glove as the input device and recognized five kinds of predefined hand gestures. Renqiang Xie and Juncheng Cao [12] presented an accelerometer-based pen-type sensing device and a user-independent hand gesture recognition algorithm that achieved nearly perfect user-independent recognition accuracy. All of these methods require the user to wear additional sensors on the hand or arm. In contrast, ultrasonic-based gesture recognition methods directly sense the movements of the hand without any wearable device.
Optical vision-based: Optical vision sensors, including color cameras, depth cameras and infrared cameras, are the most widely used in gesture recognition, thanks to successful commercial products such as Microsoft Kinect and Leap Motion, which capture human body activities by cameras [5,6,13]. Guillaume Plouffe et al. [6] developed a natural gesture user interface that can track and recognize hand gestures based on depth data collected by a Kinect sensor. Aurelijus Vaitkevičius et al. [14] presented a system that learns gestures using data from the Leap Motion device and the Hidden Markov classification algorithm. Although optical vision-based methods have good recognition performance, they are susceptible to illumination conditions and ambient infrared radiation [15]. In addition, these methods entail high computational costs.
Electromagnetic sensor-based: Electromagnetic sensors have been widely used for human activity classification [7,8,16,17] and gesture recognition [18][19][20][21][22][23][24][25][26]. Most of the proposed gesture recognition methods based on electromagnetic sensors can only recognize hand gestures with large movements, but some methods for classifying finger gestures have also been reported, such as WiFinger [23] and project Soli [22,27]. WiFinger [23] presented a fine-grained finger gesture recognition system using a single commodity WiFi device and achieved a recognition accuracy of over 93%. However, the system is impractical outdoors, where a commercial wireless Access Point for connecting the WiFi device may not be available. Soli [27] is a new gesture sensing technology designed by Google to serve as an interaction sensor that uses radar for motion tracking of the human hand. The key part of Soli's design is a dedicated radar chip, incorporating the entire sensor and antenna array into an ultra-compact 8 mm × 10 mm package. In contrast, the method proposed in this paper requires only a pair of common commercial sensors, a simpler hardware setup, and low-cost computing resources.
Ultrasound-based: Ultrasound-based methods of gesture recognition can generally be classified into three categories according to their schemes. First, the Doppler effect has long been used to sense gestures [9,10,[28][29][30][31][32]. For example, with the speaker already embedded in the laptop, SoundWave [10] generates an inaudible tone and measures the frequency shift of the echo reflected from the moving hand. Dolphin [28], leveraging the loudspeaker and microphone in smartphones, extracts features from the Doppler shift and recognizes a rich set of predefined hand gestures by combining manual recognition and machine learning methods. The obvious problem with these methods is that they can only recognize hand gestures with large movements, and they are limited to simple gestures or combinations of simple gestures, such as push, pull, and slide left or right. Second, many methods use an ultrasonic sensor array to estimate direction of arrival (DOA) or range [33][34][35], requiring a complex hardware setup and multiple (at least three) sensors arranged in a specific geometry, such as a triangle, a cross or a line. Third, many works based on ultrasonic tracking have been proposed [36][37][38][39]. These methods can track a hand or finger with high accuracy, but they do not work well with multiple simultaneous targets, especially targets moving in different directions. Hence, the methods proposed in these works are inadequate for recognizing complex gestures.
In this paper, a method using only two ultrasonic sensors for recognizing finger gestures is proposed. One sensor is used to emit a single tone of 300 kHz; the other is used to receive the echo reflected by the moving fingers. The micro-Doppler information extracted from the echoes is then processed into feature maps. Based on these feature maps, a deep convolutional neural network (CNN) is built for finger gesture recognition, and a competitive accuracy in recognizing five finger gestures is achieved.

The Micro-Doppler Effect
The micro-Doppler effect was first reported in coherent laser systems [40]. When the target or any part of the target vibrates or rotates in addition to its bulk translation, it may cause additional frequency shifts in the returned signal. This phenomenon is referred to as the micro-Doppler effect [41]. In much early research, the micro-Doppler effect was used to recognize the moving state of the human body [42,43]. Figure 1 is a schematic diagram of the micro-Doppler effect. The distance between the target and the ultrasonic transducer is R(t), the displacement caused by the target's micro motion is x(t), and λ is the wavelength of the ultrasonic wave. The received signal r(t) can then be expressed as

r(t) = A exp{ j[ 2π f_c t − 4π( R(t) + x(t) )/λ + φ ] },

where A is the echo amplitude, f_c is the center frequency, and φ is the initial phase of the transmit signal. Assuming that the distance R(t) does not change over a short time, the phase change in the received signal r(t) is caused by the target's micro motion x(t). In this paper, the center frequency f_c is 300 kHz and the speed of ultrasound in air is 340 m/s, so the wavelength of the transmit signal is 1.13 mm. A micro motion of only 0.5 mm therefore induces a phase change of 4π × 0.5/1.13 ≈ 1.77π.
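The phase sensitivity claimed above can be checked numerically. The following sketch (plain Python, using only the 300 kHz carrier and 340 m/s sound speed stated in the text) computes the round-trip phase change 4πx/λ:

```python
import math

f_c = 300e3           # carrier frequency (Hz), from the text
c = 340.0             # speed of sound in air (m/s), from the text
wavelength = c / f_c  # ~1.13 mm

def phase_change(x):
    """Round-trip phase change (radians) caused by a micro motion x (meters)."""
    return 4 * math.pi * x / wavelength

# A 0.5 mm micro motion induces a phase change of about 1.77*pi radians.
print(phase_change(0.5e-3) / math.pi)
```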

Platform Design and Parameters Setting
Figure 2 shows a block diagram of the proposed method. It consists of two ultrasonic transducers, one signal conditioning circuit, one NI platform for signal sampling, and two post-processing modules: micro-Doppler processing and recognition. The ultrasonic transducers are MA300D1-1 devices by Murata Manufacturing Co. The MCU on the signal conditioning circuit generates a designed pulse signal, which is amplified and drives the ultrasonic transducer to emit ultrasonic waves. The echoes reflected from the finger gestures, together with interference, are received and amplified by the signal conditioning circuit and then sampled by the ADC in the NI platform [44]. The raw data is first saved on the NI platform and processed offline on a laptop. The center frequency of the MA300D1-1 is 300 kHz and its bandwidth is very small, so the transmit signal is designed as modulated pulses, as shown in Figure 3, where τ is the pulse width, T is the Pulse Repetition Interval (PRI), M is the total number of pulses for coherent processing, and T_CPI denotes the Coherent Processing Interval (CPI). Obtaining a high-quality micro-Doppler feature map requires high Doppler frequency resolution and high range resolution, so the parameters of the modulated pulses must be elaborately designed. The Doppler frequency resolution f_d_res, the speed resolution v_res and the range resolution R_res are given by [45]:

f_d_res = 1/T_CPI = 1/(M·T),  v_res = λ·f_d_res/2,  R_res = c·τ/2,

where c is the speed of the ultrasonic wave in air and λ is the wavelength of the ultrasonic wave. In order to accurately recognize finger gestures, the parameters were carefully designed as shown in the left part of Table 1, and the resulting performance indicators are given in the right half of Table 1.
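The performance indicators in Table 1 follow directly from these formulas. The sketch below assumes a PRI of 1 ms and a pulse width of 40 µs; these two values are not quoted from the paper but are the ones consistent with the stated indicators:

```python
c = 340.0             # speed of sound in air (m/s)
wavelength = c / 300e3
M = 64                # pulses per coherent processing interval
T = 1e-3              # assumed PRI (s)
tau = 40e-6           # assumed pulse width (s)

f_d_res = 1 / (M * T)             # Doppler frequency resolution (Hz)
v_res = wavelength * f_d_res / 2  # speed resolution (m/s)
R_res = c * tau / 2               # range resolution (m)

print(f_d_res, v_res, R_res)  # 15.625 Hz, ~8.9 mm/s, 6.8 mm
```

The range value lands near the 6.9 mm reported in Table 1; the small difference comes from rounding in the assumed pulse width.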
As can be seen from the table, the Doppler frequency resolution is 15.625 Hz, the speed resolution reaches 9 mm/s, and the range resolution reaches 6.9 mm. Such high speed and range resolution is sufficient to clearly distinguish different finger motions. The modulated pulses designed in Section 2.2 can be expressed as

s(t) = Σ_{n=0}^{N} u(t − nT) exp(j2π f_c t),

where T = PRI is the pulse repetition interval, f_c is the center frequency of the transmit signal, and u(·) is the complex envelope of the modulated pulses. After reflection from the moving fingers, the received signal can be described as

r(t) = Σ_{n=0}^{N} A u(t − t_n − 2R(t)/c) exp{ j[ 2π f_c t − 4π( R(t) + x(t) )/λ + φ ] },

where t_n = nT, n = 0, 1, 2, . . . , N. The received signal preprocessing flow, shown in Figure 4, includes sampling, bandpass filtering, IQ demodulation, and lowpass filtering. After preprocessing, a two-dimensional pulse-Doppler data matrix is obtained. Each cell of the matrix, y[l, n], holds the demodulated echo at range bin l for the nth pulse and retains the slow-time phase history −4π( R(t_n) + x(t_n) )/λ, where n = 0, 1, 2, . . . , N; the larger n is, the longer the collected data record.
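A minimal sketch of this preprocessing chain is given below. The sample rate, filter orders and cutoff frequencies are illustrative assumptions (the paper does not specify them); only the 300 kHz carrier comes from the text:

```python
import numpy as np
from scipy.signal import butter, lfilter

fs = 2e6     # assumed ADC sample rate (Hz)
f_c = 300e3  # carrier frequency (Hz)

def preprocess(raw):
    """Bandpass filter -> IQ demodulation -> lowpass filter."""
    t = np.arange(len(raw)) / fs
    # Bandpass around the 300 kHz carrier to reject out-of-band interference.
    b, a = butter(4, [250e3, 350e3], btype="bandpass", fs=fs)
    x = lfilter(b, a, raw)
    # IQ demodulation: mix down to complex baseband.
    iq = x * np.exp(-2j * np.pi * f_c * t)
    # Lowpass to keep only the baseband micro-Doppler component.
    b, a = butter(4, 20e3, btype="lowpass", fs=fs)
    return lfilter(b, a, iq)

echo = np.cos(2 * np.pi * (f_c + 100) * np.arange(4096) / fs)  # toy echo, 100 Hz shift
baseband = preprocess(echo)
```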

Micro-Doppler Processing
In order to improve the signal-to-noise ratio, the two-dimensional data matrix from the previous section must be coherently processed in the slow-time dimension. The total number of pulses for coherent processing is set at 64. Based on the designed parameters of the modulated pulses, the target operating range is set to R1 ∼ R2, that is, range bins l1 ∼ l2. Accumulating all range bins within the operating range, Equation (5) can be rewritten as

y[n] = Σ_{l=l1}^{l2} y[l, n].

Then y[n] is divided into K segments, each of length M, with an incremental step D between adjacent segments. Each segment y_k is processed by a fast Fourier transform (FFT) with a window size of NFFT = 256, and the result can be expressed as

ST_k = FFT(y_k, NFFT),  k = 1, 2, . . . , K.

Combining the ST_k in order yields the micro-Doppler feature map of the finger gesture. In this paper, K is set at 45 and D at 14, so the micro-Doppler feature map has shape 45 × 256, as in Figure 5, which shows an example for each of the five finger gestures. The whole flow of micro-Doppler processing is illustrated in Figure 6.
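The segmentation-and-FFT procedure above can be sketched as follows (NumPy; the Hanning window is an assumption, since the paper does not name its window function):

```python
import numpy as np

M, D, K, NFFT = 64, 14, 45, 256  # segment length, step, segment count, FFT size

def micro_doppler_map(y_ln, l1, l2):
    """y_ln: complex pulse-Doppler matrix of shape (range_bins, pulses)."""
    y = y_ln[l1:l2 + 1].sum(axis=0)          # accumulate range bins l1..l2
    win = np.hanning(M)                      # window choice is an assumption
    segs = [y[k * D : k * D + M] * win for k in range(K)]
    return np.abs(np.fft.fft(segs, n=NFFT))  # shape (K, NFFT) = (45, 256)

# Needs at least (K - 1) * D + M = 680 pulses of slow-time data.
y = np.random.randn(10, 700) + 1j * np.random.randn(10, 700)
fmap = micro_doppler_map(y, 2, 7)
```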

Recognition Model
Gesture recognition models mainly use machine learning algorithms, such as K-Nearest Neighbors (KNN) [28,32] and Support Vector Machines (SVM) [35], and deep learning algorithms, such as Recurrent Neural Networks (RNN) [31] and convolutional neural networks (CNN) [2]. The machine learning algorithms require manual feature extraction, whose quality has a critical impact on the results, whereas CNNs show excellent recognition performance in image recognition. The micro-Doppler feature map (after expanding it to a tensor of 45 × 256 × 1) has the same structure as a single-channel image. Hence, the method proposed in this paper adopts a deep convolutional neural network for robust finger gesture recognition.
The network architecture of our recognition model is shown in Figure 7. The architecture is composed of three CNN layers for feature extraction, followed by a fully connected (FC) layer and a Softmax layer for classifying finger gestures. Figure 8 illustrates the architecture of the deep convolutional neural network (DCNN) used in this paper. Convolution layers compute the dot product between a local region of the input image and the weight matrix of the filter, with the filter sliding over the entire map and repeating the dot product operation. The convolution kernel (filter) size of each convolution layer is 4 × 4, and the numbers of kernels in the three convolution layers are 64, 128 and 256, respectively. Batch normalization is cascaded after each convolution layer to prevent overfitting and accelerate deep network training, and is followed by a highly nonlinear activation function; this paper employs the Rectified Linear Unit (ReLU). The pooling layer after the ReLU, also known as downsampling, reduces the data volume while preserving useful information; max pooling with a pooling size of 2 × 2 is used throughout the network. The FC layer flattens the features extracted by the convolution layers and computes the probabilities of the different classes. As in traditional neural networks, all nodes in adjacent FC layers are connected by weights. In the proposed recognition model, one FC layer of size 512 follows the convolution layers, with ReLU activation, local response normalization, and a dropout rate of 0.5. Finally, a Softmax output layer produces the predicted label.
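As a sanity check on the layer dimensions, the following sketch walks the stated 45 × 256 × 1 input through the three conv/pool stages and the FC head, assuming stride-1 'same' convolutions so that only the 2 × 2 max pooling shrinks the maps (the paper does not state its padding scheme):

```python
# Feature-map sizes and weight counts through the described network:
# 45x256x1 input, three 4x4 conv layers with 64/128/256 kernels,
# each followed by 2x2 max pooling, then a 512-unit FC layer and
# a 5-way softmax output.
h, w, ch = 45, 256, 1
params = 0
for n_kernels in (64, 128, 256):
    params += (4 * 4 * ch + 1) * n_kernels  # conv weights + biases
    ch = n_kernels
    h, w = h // 2, w // 2                   # 2x2 max pooling halves each axis
flat = h * w * ch                           # flattened feature vector
params += (flat + 1) * 512                  # FC layer
params += (512 + 1) * 5                     # softmax layer
print(h, w, ch, flat, params)
```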
The proposed recognition model is implemented on the Keras platform and trained from scratch, because no pre-trained model is applicable to micro-Doppler feature maps. Adam [46] is used to optimize the parameters of the model with respect to the loss function. The batch size is initially set at 128, the learning rate at 0.0001, and the number of epochs at 500.

Gestures Set
The proposed method is evaluated on five finger gestures: finger close, finger open, finger clockwise circle, finger counterclockwise circle, and finger slide, as shown in Figure 9. The selected gestures are all finger movements, excluding large movements of the palms or arms, and are common gestures for controlling devices in daily life.

Data Acquisition
Training a deep neural network requires a large amount of training data containing sufficient variations of the gestures. Six subjects, four males and two females, were recruited to perform the five designated finger gestures. To obtain as many variations as possible, only minimal instruction on how to execute the gestures was given. Each subject was asked to perform each gesture 50 times, and each resulting feature map was marked with a class label, giving a total of 5 × 6 × 50 = 1500 samples. The dataset was expanded to 4500 samples by time folding and by superimposing Gaussian noise on the feature maps. This full dataset of 4500 samples was used as the raw input for the experiments.
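The threefold expansion of the dataset can be sketched as below. "Time folding" is interpreted here as reversing the slow-time axis of the map, and the noise level is an illustrative assumption:

```python
import numpy as np

def augment(feature_map, noise_std=0.01, rng=np.random.default_rng(0)):
    """Return the original 45x256 map plus two augmented copies."""
    flipped = feature_map[::-1, :]  # reverse the slow-time (segment) axis
    noisy = feature_map + rng.normal(0.0, noise_std, feature_map.shape)
    return [feature_map, flipped, noisy]

maps = augment(np.random.rand(45, 256))  # 1 sample -> 3 samples (1500 -> 4500)
```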

Discussion
Parameters of the Recognition Model
(1) Number of Convolution Layers: To explore the impact of the number of convolution layers on recognition accuracy, recognition models with different numbers of convolution layers are tested. The number of convolution layers is set at 2, 3, 4 and 5, respectively, with the other parameters kept at the initial settings. The recognition accuracies of the different models are shown in Table 2. The recognition rate is highest when the number of layers is three. When the number of layers increases to 4 or 5, the recognition rate falls because of over-fitting: the additional convolution layers over-parameterize the models, making them difficult to train well with the relatively small number of samples. The number of convolution layers is therefore fixed at 3.

(2) Size of the Convolution Kernel: In this experiment, the performance of the recognition model is tested under different convolution kernel sizes. The kernel size in all convolution layers is set to 3 × 3, 4 × 4 and 5 × 5, respectively. Table 3 shows the recognition accuracies for the different kernel sizes. It can be found that the recognition accuracy increases with the kernel size, but so does the number of parameters to be trained. Therefore, a kernel size of 4 × 4 is selected.

(3) Number of Convolution Kernels: The more convolution kernels there are, the more features are extracted by the convolution layer and the easier it is to overfit; conversely, fewer kernels extract fewer features and make the model prone to under-fitting. The aim of this experiment is to find a suitable number of convolution kernels. Table 4 lists the recognition accuracies of four different configurations. The results show that the first configuration achieves comparable accuracy with the fewest training parameters and the lowest training cost.

(4) Number of FC Layers: To optimize the architecture of the recognition model, the impact of the number of FC layers is examined. Four different configurations of FC layers are tested, and Table 5 shows the results. From the table, it can be concluded that one FC layer is enough.

(5) Size of the FC Layer: The impact of the FC layer size is also examined. Table 6 shows the recognition accuracies with FC layer sizes varying from 128 to 1024. The performance improves as the FC layer size increases.

(6) Number of Epochs: With too few epochs, the model is not trained to its optimum; with too many, training takes a long time. Based on the parameters chosen in the previous sections, the recognition model is trained with 300, 500, 800 and 1000 epochs, and Table 7 shows the results. The recognition accuracy increases with the number of epochs but barely improves beyond 800 epochs. Therefore, the number of epochs is set to 800 in subsequent experiments.

Performance Evaluation
For the gesture recognition training and validation experiments, the dataset is randomly divided into two parts, 70% for training and 30% for testing, and a standard k-fold leave-one-subject approach is used, where k is five in the experiments. Table 8 shows the results of the 5-fold cross validation; the average recognition accuracy is 97.11%. For classification error analysis, the confusion matrix over all 5 folds is presented in Figure 10.
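A minimal index split for 5-fold validation over the 4500 samples might look like the sketch below (the shuffling seed is arbitrary, and in the paper's subject-aware setup the folds would additionally respect subject identity):

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index arrays for k disjoint folds."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test

splits = list(k_fold_indices(4500))  # 5 folds of 900 test samples each
```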

Conclusions
In this paper, a new method of ultrasonic finger gesture recognition is proposed. First, a hardware structure for ultrasonic active sensing, using only two ultrasonic transducers, is constructed. Second, based on the micro-Doppler effect of moving fingers, the received echoes from finger gestures are processed in a preprocessing stage comprising sampling, bandpass filtering, IQ demodulation and lowpass filtering; in the micro-Doppler processing stage, coherent processing and the FFT are used to obtain high-resolution micro-Doppler feature maps of the finger motions. Third, a recognition model based on convolutional neural networks is built, and a series of experiments is designed to optimize its parameters. The results of 5-fold cross validation show that the proposed method can recognize five finger gestures with an average accuracy of 96.51%, and the accuracy for each finger gesture exceeds 96%.