From Signal to Image: Enabling Fine-Grained Gesture Recognition with Commercial Wi-Fi Devices

Gesture recognition acts as a key enabler for user-friendly human-computer interfaces (HCI). To bridge the human-computer barrier, numerous efforts have been devoted to designing accurate fine-grained gesture recognition systems. Recent advances in wireless sensing hold promise for a ubiquitous, non-invasive and low-cost system with existing Wi-Fi infrastructures. In this paper, we propose DeepNum, which enables fine-grained finger gesture recognition with only a pair of commercial Wi-Fi devices. The key insight of DeepNum is to incorporate the quintessence of deep learning-based image processing so as to better depict the influence induced by subtle finger movements. In particular, we make multiple efforts to transfer sensitive Channel State Information (CSI) into depth radio images, including antenna selection, gesture segmentation and image construction, followed by noisy image purification using high-dimensional relations. To fulfill the restrictive size requirements of deep learning model, we propose a novel region-selection method to constrain the image size and select qualified regions with dominant color and texture features. Finally, a 7-layer Convolutional Neural Network (CNN) and SoftMax function are adopted to achieve automatic feature extraction and accurate gesture classification. Experimental results demonstrate the excellent performance of DeepNum, which recognizes 10 finger gestures with overall accuracy of 98% in three typical indoor scenarios.


Introducti on
The interactions with smart sensors have advanced to an unprecedented extent in this day and age [1]. Body gesture, as a novel communication modality, is widely adopted in HCI for its natural and straightforward properties [2]. To realize practical gesture recognition, tremendous efforts have been made both in terms of accuracy and granularity. Most schemes achieve the goals based on dedicated motion sensors [3] or pre-installed depth cameras (e.g., RGB-D sensor [4], Kinect [5] and Leap motion [6]). Despite their high accuracy in tracking subtle finger motions, they have intrinsic limitations such as whole-day intrusive equipment, light and Line-of-Sight (LoS) condition, potential privacy issues and extra installation cost. We look forward to a ubiquitous, non-invasive, low-cost finger gesture recognition system, which could bring real advantages to a user-friendly HCI. For instance, if a housewife enjoys a bubble bath, it is preferable for her to lie in the bathtub and control intelligent appliances remotely (e.g., aquastat, light and audio clusters), rather than worry about the waterproofness of precise instruments or privacy leakage caused by cameras.
Recent advances in Wi-Fi-based in-air gesture recognition shed light on the potential of realizing fine-grained gesture recognition with almost-everywhere commercial Wi-Fi devices. Generally, these systems reuse pervasive Wi-Fi signals by capturing and analyzing informative signal characteristics induced by human motions. Early attempts were devoted to deriving motion-induced Doppler

•
First, we are the first to embed CSI segmentation in a CNN-based system. To achieve real-time gesture recognition, detecting transition points of human movements and segmenting CSI frame into slices are critical processes. Yet it is non-trivial to import varied-size segments into the CNN model directly, since the input layer requires fixed-size input. SignFi simplifies this problem and only collects 200 packets for each gesture, which cannot be practical in real scenarios. To address this problem, we present a novel method to limit the width and height of regions of interest, and design a region-selection indicator to evaluate the significance of regions. • Secondly, we construct depth radio images with sensitive CSI information. In SignFi, CSI measurements of both amplitude and phase are preprocessed and fed into CNN model. In contrast, we construct depth radio images with only amplitude from the most sensitive antenna and conjugate-multiplied phase from antenna pairs. In addition, spatial relations are fully considered to eliminate image noises. Therefore, we were able to preserve the most discriminative features of finger gestures while discarding polluted and insensitive signatures.
We implement DeepNum on commercial Wi-Fi infrastructures equipped with slightly modified NICs and evaluate the system performance in three typical indoor environments. Extensive experiments are conducted to show that DeepNum could achieve overall accuracy of 98%, slightly exceeding SignFi and obviously better than WiFinger in predefined scenarios. A variety of experimental studies are performed to test the influencing factors and system robustness. In the future, we envision DeepNum will promote the integration of wireless sensing applications with cutting-edge deep learning techniques.
Our contributions are summarized as follows.
• From macro-movements to micro finger gestures. DeepNum enables fine-grained finger gesture recognition using only a pair of commercial Wi-Fi devices. Extensive experiments are conducted to validate the system performance and show superior performance compared with the state of the art.

System Overview
To answer the above questions, we illustrate a processing procedure as is shown in Figure 1, which can be divided into three parts: Data Collection, Signal Processing and Image Processing. In the Data Collection step, we use a pair of laptops equipped with wireless cards to collect raw CSI measurements. One laptop, acting as the AP, transmits CSI packets with only one antenna. Another laptop, acting as the MP, receives CSI continuously with three antennas.
For the Signal Processing part, we first select a sensitive antenna as a reference and utilize the wavelet energy of only amplitude information to segment the continuous CSI sequence. Notice that we do not adopt the traditional signal de-noising methods here for the sake of simplicity and information integrity. Then, we construct a 3-dimensional tensor (called a 3-channel radio image) with two relative phase images and one amplitude image (for challenge 1).
For the Image Processing part, we make our contributions by designing novel processing methods for radio images. First of all, considering the informative features involved in channel correlations, we sanitize noisy radio images by leveraging Higher-Order Singular Value Decomposition (HOSVD) for high-dimensional principal components (for challenge 2). Then we borrow the idea of selective search in the image domain, extracting fixed-size image regions from varied-size radio images based on discriminative color and texture features with weighted values. Extracted image regions could fit the bill of the input layer in a deep learning architecture (for challenge 3). Finally, a 7-layer CNN and SoftMax function are applied to output classified labels of finger gestures.

Signal Processing
Raw CSI measurements provided by commercial Wi-Fi devices cannot be utilized directly for micro-movement recognition. This is because motion characteristics involved in signal distortions can be sheltered by irrelevant interferences. Therefore, we only utilize the amplitude from the most sensitive antenna and relative phase information from conjugated antenna pairs. Additionally, as APs continuously broadcast CSI signals into the air, it is necessary to detect and segment motioninduced CSI frames in a time that is computationally efficient, so we leverage robust wavelet energy to indicate state transitions. In addition, we argue that manipulating CSI amplitude and phase information jointly preserves more available signatures; not only time-frequency features, but also spatial relationships. Hence, we construct 3-dimensional depth radio images, instead of 2dimensional ones. In this section, we will elaborate the signal processing steps, including the antenna selection, action segmentation, and high-dimensional radio image construction.

Antenna Selection
Qian et al. [28] firstly demonstrated the aptness of tracking human path using two appropriately selected antennas for the dominant Doppler shift. As is shown in Figure 2a, the boxplot of 7-second CSI amplitude data across all 90 subcarriers reveals the distribution of CSI amplitude. We observe that the second antenna, with lower average amplitude values, is more likely to produce larger dynamic responses. This indicates that the second antenna has a weaker static component and is more likely to be distorted by micro-movements. Figure 2b depicts the amplitude variation of the No.1 subcarrier received in three different antennas. We can see the second antenna possesses the most significant amplitude fluctuations compared with the other two antennas. Inspired by the above two observations, we select the most sensitive antenna as the reference from the three receiving antennas and only leverage its amplitude information to detect motion transitions.

Signal Processing
Raw CSI measurements provided by commercial Wi-Fi devices cannot be utilized directly for micro-movement recognition. This is because motion characteristics involved in signal distortions can be sheltered by irrelevant interferences. Therefore, we only utilize the amplitude from the most sensitive antenna and relative phase information from conjugated antenna pairs. Additionally, as APs continuously broadcast CSI signals into the air, it is necessary to detect and segment motion-induced CSI frames in a time that is computationally efficient, so we leverage robust wavelet energy to indicate state transitions. In addition, we argue that manipulating CSI amplitude and phase information jointly preserves more available signatures; not only time-frequency features, but also spatial relationships. Hence, we construct 3-dimensional depth radio images, instead of 2-dimensional ones. In this section, we will elaborate the signal processing steps, including the antenna selection, action segmentation, and high-dimensional radio image construction.

Antenna Selection
Qian et al. [28] firstly demonstrated the aptness of tracking human path using two appropriately selected antennas for the dominant Doppler shift. As is shown in Figure 2a, the boxplot of 7-second CSI amplitude data across all 90 subcarriers reveals the distribution of CSI amplitude. We observe that the second antenna, with lower average amplitude values, is more likely to produce larger dynamic responses. This indicates that the second antenna has a weaker static component and is more likely to be distorted by micro-movements. Figure 2b depicts the amplitude variation of the No.1 subcarrier received in three different antennas. We can see the second antenna possesses the most significant amplitude fluctuations compared with the other two antennas. Inspired by the above two observations, we select the most sensitive antenna as the reference from the three receiving antennas and only leverage its amplitude information to detect motion transitions.

Gesture Segmentation
In reality, signal segmentation is an indispensable step for real-time gesture recognition, since it enables low computation cost and timely dynamics detection. For practical real-time gesture recognition, if we failed to detect the start points and end points, we would have to manually activate the system and deal with redundant information. Previous work [30] adopts moving variance or short time energy to indicate signal distortions because it accumulates the variations within a short time, as is shown in Figure 3a. Specifically, we request that one volunteer continuously performs three finger gestures. The participant performs each gesture using the following steps: raise the right hand, stretch fingers, retract fingers and put down the hand. Noting that movement variance could reflect the rough trend of signal variation caused by human part, however, the value is vulnerable due to unrelated motion interferences or background noises.
To detect subtle finger gesture transitions more efficiently, we develop a novel segmentation method based on an insight. Compared with the movement of other body parts, finger gestures produce slighter but relatively higher-frequency distortions. It is a natural choice to detect the transition points in specific frequency sub-bands. Therefore, we were inspired to adopt a wavelet energy-based segmentation method, which mainly consists of the following steps.
Firstly, we decompose raw CSI amplitude into approximation coefficients and detail coefficients. Generally, wavelet decomposition adopts a series of a low-pass filter l[k] and high-pass filter h[k] to process a raw signal with length I, producing both the approximation coefficients (AC) Am,l [I] and detail coefficients (DC) Am,h [I] of m-th level simultaneously as: where AC depicts the fluctuation shape of CSI amplitude on a coarse scale (i.e., dominant body movements), while DC captures the fine details of signal dynamics (i.e., subtle finger gestures). We rebuild multi-scale signals on six levels with Daubechies db6 wavelet and only utilize DC to avoid the disturbance of irrelevant motions. Then we resort to sliding windows to calculate wavelet power, since it strengthens the sensitivity by combing squared amplitude values in a short time. To strike a balance between time efficiency and accuracy, we set the window size w at 100, step width s at 50 and step number as Q = (l − w)/s + 1. We calculate the wavelet energy in the m-th level and the energy summation at the q-th step as: Finally, we normalize the value as pm[q] = Em[q]/Esum [q], which represents the relative wavelet power, detecting the signal distortions of interest without additional noise reduction. Specifically, when we set the transmitting rate at 500 pkt/s, the gesture-corresponding frequency components

Gesture Segmentation
In reality, signal segmentation is an indispensable step for real-time gesture recognition, since it enables low computation cost and timely dynamics detection. For practical real-time gesture recognition, if we failed to detect the start points and end points, we would have to manually activate the system and deal with redundant information. Previous work [30] adopts moving variance or short time energy to indicate signal distortions because it accumulates the variations within a short time, as is shown in Figure 3a. Specifically, we request that one volunteer continuously performs three finger gestures. The participant performs each gesture using the following steps: raise the right hand, stretch fingers, retract fingers and put down the hand. Noting that movement variance could reflect the rough trend of signal variation caused by human part, however, the value is vulnerable due to unrelated motion interferences or background noises.
To detect subtle finger gesture transitions more efficiently, we develop a novel segmentation method based on an insight. Compared with the movement of other body parts, finger gestures produce slighter but relatively higher-frequency distortions. It is a natural choice to detect the transition points in specific frequency sub-bands. Therefore, we were inspired to adopt a wavelet energy-based segmentation method, which mainly consists of the following steps.
Firstly, we decompose raw CSI amplitude into approximation coefficients and detail coefficients. Generally, wavelet decomposition adopts a series of a low-pass filter l[k] and high-pass filter h[k] to process a raw signal with length I, producing both the approximation coefficients (AC) A m,l [I] and detail coefficients (DC) A m,h [I] of m-th level simultaneously as: where AC depicts the fluctuation shape of CSI amplitude on a coarse scale (i.e., dominant body movements), while DC captures the fine details of signal dynamics (i.e., subtle finger gestures). We rebuild multi-scale signals on six levels with Daubechies db6 wavelet and only utilize DC to avoid the disturbance of irrelevant motions. Then we resort to sliding windows to calculate wavelet power, since it strengthens the sensitivity by combing squared amplitude values in a short time. To strike a balance between time efficiency and accuracy, we set the window size w at 100, step width s at 50 and step number as Q = (l − w)/s + 1. We calculate the wavelet energy in the m-th level and the energy summation at the q-th step as: , which represents the relative wavelet power, detecting the signal distortions of interest without additional noise reduction. Specifically, when we set the transmitting rate at 500 pkt/s, the gesture-corresponding frequency components mainly exist in the DCs of the 6th layer. In Figure 3b, we can see that relative wavelet power is more sensitive to subtle finger movements, since it efficiently excludes the disturbance of irrelevant body movements. Moreover, it is resistant to environmental noise, while movement variance can be more sensitive. Therefore, we only need to predefine a simple threshold by empirical study to indicate motion transitions. In Figure 3c, it is obvious that the most discriminative signal fluctuations induced by finger gestures are preserved, while the irrelevant body motions are discarded, which demonstrates the efficiency of our proposed segmentation method. mainly exist in the DCs of the 6th layer. In Figure 3b, we can see that relative wavelet power is more sensitive to subtle finger movements, since it efficiently excludes the disturbance of irrelevant body movements. Moreover, it is resistant to environmental noise, while movement variance can be more sensitive. Therefore, we only need to predefine a simple threshold by empirical study to indicate motion transitions. In Figure 3c, it is obvious that the most discriminative signal fluctuations induced by finger gestures are preserved, while the irrelevant body motions are discarded, which demonstrates the efficiency of our proposed segmentation method.

Image Construction
Recall that we aim to construct discriminative 3-dimensional radio images by fully exploiting the amplitude and phase information. On the one hand, we have extracted sensitive amplitude without the loss of motion information. On the other hand, we have to cope with time-variant random phase noises caused by unsynchronized transmitters. In this paper, we refer to a novel phase sanitation method proposed in [41]. In particular, we apply conjugate multiplication among antenna pairs to remove the same CSI phase offsets across subcarriers: where wavelength  is 12 cm) movement by removing static components, but also avoid the random phase offsets in phase measurements. We direct interested readers to Widance [41] for more detail. Figure 4a shows the extracted relative phase of all 30 subcarriers in the TX-RX_1 (red lines) and TX_RX_2 (blue lines) antenna pair when a volunteer performs finger gesture '1'. We observe that the

Image Construction
Recall that we aim to construct discriminative 3-dimensional radio images by fully exploiting the amplitude and phase information. On the one hand, we have extracted sensitive amplitude without the loss of motion information. On the other hand, we have to cope with time-variant random phase noises caused by unsynchronized transmitters. In this paper, we refer to a novel phase sanitation method proposed in [41]. In particular, we apply conjugate multiplication among antenna pairs to remove the same CSI phase offsets across subcarriers: where ∼ H n t , | ∼ H n t |, We note that relative phase information could not only capture the micro cm-scale (∆d = −λ ∼ θ n t /2π, wavelength λ is 12 cm) movement by removing static components, but also avoid the random phase offsets in phase measurements. We direct interested readers to Widance [41] for more detail. Figure 4a shows the extracted relative phase of all 30 subcarriers in the TX-RX_1 (red lines) and TX_RX_2 (blue lines) antenna pair when a volunteer performs finger gesture '1'. We observe that the relative phase could reveal obvious signal variations when finger gesture is performed, which indicates its sensitivity. Based on this observation, we are inspired to construct distinct 3-dimensional (channel) radio images based on one amplitude signal and two relative phase signals.
As is shown in Figure 4b, we design a novel sensitive CSI tensor consisting of two relative phase images and one amplitude image. Each image contains M × T pixels, where M denotes subcarrier indices from 1 to 30 (we interpolate the column value for finer granularity), T is the specific time slices. The higher the pixel value is, the darker the gray image. We mention that previous works only conduct the image processing method on a 2-dimensional matrix, which can be considered as a gray-scale map without color or spatial information. We creatively build 3-dimensional depth radio images for gestures 1, 2, and 3 with sensitive signal properties, and visualize them in the form of 3-channel RGB images in Figure 4c-e. As shown, the three radio images reveal distinctive color and texture characteristics. In addition, compared with existing gray-scale image processing, constructing CSI color images enables multi-scale features to be explored and exploited, which contributes to the detection of subtle finger gestures.
Sensors 2018, 18, x FOR PEER REVIEW 9 of 21 relative phase could reveal obvious signal variations when finger gesture is performed, which indicates its sensitivity. Based on this observation, we are inspired to construct distinct 3-dimensional (channel) radio images based on one amplitude signal and two relative phase signals.
As is shown in Figure 4b, we design a novel sensitive CSI tensor consisting of two relative phase images and one amplitude image. Each image contains M × T pixels, where M denotes subcarrier indices from 1 to 30 (we interpolate the column value for finer granularity), T is the specific time slices. The higher the pixel value is, the darker the gray image. We mention that previous works only conduct the image processing method on a 2-dimensional matrix, which can be considered as a grayscale map without color or spatial information. We creatively build 3-dimensional depth radio images for gestures 1, 2, and 3 with sensitive signal properties, and visualize them in the form of 3channel RGB images in Figure 4c-e. As shown, the three radio images reveal distinctive color and texture characteristics. In addition, compared with existing gray-scale image processing, constructing CSI color images enables multi-scale features to be explored and exploited, which contributes to the detection of subtle finger gestures.

Image Processing
In this section, we transfer the problem from signal pattern recognition to color image classification and design a series of novel image processing techniques accordingly. (1) To jointly reduce image noise, HOSVD-based de-noising is adopted to extract high-dimensional principal components; (2) To lower the computation cost and regularize the size of the radio image, we implement the concept of Selective Search and select a characteristic 500 × 500 region based on color and texture features. (3) To achieve efficient image classification, we resort to a 7-layer CNN architecture without extensive parameter study and SoftMax function to output classified labels.

HOSVD-Based Image De-Noising
Bearing in mind that a color radio image is constructed with multi-dimensional relations, we aim to reduce irrelevant noises left over by fully exploiting tensor properties. HOSVD is an efficient tool to extract high-dimensional core component via performing tensor factorization, which is an extension of matrix SVD. The basic idea of HOSVD is to select the dominant eigenvalues by nullification of individual values below a predefined threshold [42]. Let us denote radio image as T0

Image Processing
In this section, we transfer the problem from signal pattern recognition to color image classification and design a series of novel image processing techniques accordingly. (1) To jointly reduce image noise, HOSVD-based de-noising is adopted to extract high-dimensional principal components; (2) To lower the computation cost and regularize the size of the radio image, we implement the concept of Selective Search and select a characteristic 500 × 500 region based on color and texture features. (3) To achieve efficient image classification, we resort to a 7-layer CNN architecture without extensive parameter study and SoftMax function to output classified labels.

HOSVD-Based Image De-Noising
Bearing in mind that a color radio image is constructed with multi-dimensional relations, we aim to reduce irrelevant noises left over by fully exploiting tensor properties. HOSVD is an efficient tool to extract high-dimensional core component via performing tensor factorization, which is an extension of matrix SVD. The basic idea of HOSVD is to select the dominant eigenvalues by nullification of individual values below a predefined threshold [42]. Let us denote radio image as T 0 ∈R M×T×C , C is the channel number. The HOSVD process of tensor T 0 is depicted in Figure 5a, and its decomposition formula can be written as: where is the 1st, 2nd and 3rd model tensor product, respectively. Compared with applying SVD in each matrix separately, HOSVD takes advantage of relations between orders by reordering the elements of an N-mode array into a matrix (e.g., a 3 × 4 × 5 tensor can be arranged as a 12 × 5 matrix or a 3 × 20 matrix). Figure 5b plots the 3-dimensional eigenvalues of core tensor S when a volunteer performs gestures. X-axis and Y-axis denote the selected number of the eigenvalue matrix, Z-axis is spatial dimension, and eigenvalue magnitude is represented by color-map. We observe that significant singular values only concentrate on a few samples with small indices, which exponentially overwhelm other eigenvalues and almost capture the whole image information. Therefore, the task of noise removal can be achieved by applying a hard threshold δ to reconstruct core T via formula 8, with dominant eigenvalue coefficients reserved and slight singular value discarded. Though simple, it avoids the problem of 'out of memory' and multi-rank parameter estimation when we manipulate a large tensor with MATLAB tensor toolbox, thus realizing fast computation [43]. ∈R M×T×C , C is the channel number. The HOSVD process of tensor T0 is depicted in Figure 5a, and its decomposition formula can be written as: where ∈R C×C are orthogonal matrices, S∈R M×T×C is the 3-dimensional coefficient array, ×1,2,3 is the 1st, 2nd and 3rd model tensor product, respectively. Compared with applying SVD in each matrix separately, HOSVD takes advantage of relations between orders by reordering the elements of an N-mode array into a matrix (e.g., a 3 × 4 × 5 tensor can be arranged as a 12 × 5 matrix or a 3 × 20 matrix). Figure 5b plots the 3-dimensional eigenvalues of core tensor S when a volunteer performs gestures. X-axis and Y-axis denote the selected number of the eigenvalue matrix, Z-axis is spatial dimension, and eigenvalue magnitude is represented by color-map. We observe that significant singular values only concentrate on a few samples with small indices, which exponentially overwhelm other eigenvalues and almost capture the whole image information. Therefore, the task of noise removal can be achieved by applying a hard threshold  to reconstruct core tensor Though simple, it avoids the problem of 'out of memory' and multi-rank parameter estimation when we manipulate a large tensor with MATLAB tensor toolbox, thus realizing fast computation [43].

Image Region Selection
At the very beginning, we survey the related state of the art and illustrate the rationality of image region selection. As CNN-based deep learning methods have led to revolutionary change in the visual community and substantially improved classification performance, there is still a critical issue that needs to be considered: the fixed-size constraint of the CNN network. To be more specific, the fully connected layer at a deeper stage of existing CNN network requires a fixed length of input by definition [44]. Simple cropping or warping may lose the most discriminative features and induce unwanted geometric distortion. SPP-net [39] feeds images with slightly varying scales during training and generates fixed-length representations by adding a spatial pyramid pooling (SPP) layer, while the size of radio images can be more arbitrary because gesture duration varies. We borrow the idea from the Selective Search [45] and propose a novel region selection method based on weighted color and texture features. Instead of grouping initial regions with multiple similarity strategies until the whole image becomes a single region, we design a sliding region method by comparing the color features of fixed-size region with weighted values and choosing the most significant one.
Specifically, we then extract color moment mean, standard deviation, and skewness to depict the color distribution of each matrix, to capture gray-level texture features from gray-level pairs and weighted EMD-based values to show the similarity between matrices. Let pi,j k denote the [i,j]-th pixel

Image Region Selection
At the very beginning, we survey the related state of the art and illustrate the rationality of image region selection. As CNN-based deep learning methods have led to revolutionary change in the visual community and substantially improved classification performance, there is still a critical issue that needs to be considered: the fixed-size constraint of the CNN network. To be more specific, the fully connected layer at a deeper stage of existing CNN network requires a fixed length of input by definition [44]. Simple cropping or warping may lose the most discriminative features and induce unwanted geometric distortion. SPP-net [39] feeds images with slightly varying scales during training and generates fixed-length representations by adding a spatial pyramid pooling (SPP) layer, while the size of radio images can be more arbitrary because gesture duration varies. We borrow the idea from the Selective Search [45] and propose a novel region selection method based on weighted color and texture features. Instead of grouping initial regions with multiple similarity strategies until the whole image becomes a single region, we design a sliding region method by comparing the color features of fixed-size region with weighted values and choosing the most significant one. Specifically, we then extract color moment mean, standard deviation, and skewness to depict the color distribution of each matrix, to capture gray-level texture features from gray-level pairs and weighted EMD-based values to show the similarity between matrices. Let p i,j k denote the [i,j]-th pixel value of the k-th channel, the fixed size of the region is N × N, and color moment features can be calculated by the following formulas: Standard Deviation: Skewness: where E k , σ k , and s k represent the average signal intensity, magnitude variation level and how asymmetric the distribution is, respectively. To further calculate the spatial relations of similar gray tones, we adopt gray-level co-occurrence matrices, which are easy to compute, and extract four standard features from Ng distinct gray levels, which can be expressed as: Contrast: Correlation: Entropy: where ATW k illustrates the level of homogeneity of k-channel matrix, Con k measures the difference in luminance, Cor k indicates the linear level between pix pairs and Ent k quantifies the disorder, randomness and complexity of the proposed region.
To demonstrate the aptness of three color features and four texture features, we conduct a simple experimental study by applying a 500 × 500 sliding region with step size 250 through the whole 1000 × 1000 image and extract 7 × 3 features from a total of nine segmented regions. To eliminate the feature dimensions, we normalize the feature matrix and aggregate the same feature of the same three regions. In Figure 6, we notice that different features show varied performance in different regions, revealing the image properties from a different perspective, while the (1, 2) region contains the maximum valuable features which are marked with red frames. Inspired by the above observations, we design a novel indicator which combines all color and texture features: where a 1 , . . . , a 7 denotes the descending order number of all regional features. For example, the (1, 2) region gets the largest mean value in all 9 regions, so a 1 is allocated with the highest weighted value 9, a 2 receives the second-highest value 8, vice versa. In real experiments, we set region size at 500 × 500 and step size at 100 for computation efficiency. Additionally, we select the top ten image regions in the training step to improve the tolerance of image distortion and only choose the region with the highest weighted score. Therefore, we could relieve the constraints of radio image size.
where a1,…, a7 denotes the descending order number of all regional features. For example, the (1, 2) region gets the largest mean value in all 9 regions, so a1 is allocated with the highest weighted value 9, a2 receives the second-highest value 8, vice versa. In real experiments, we set region size at 500 × 500 and step size at 100 for computation efficiency. Additionally, we select the top ten image regions in the training step to improve the tolerance of image distortion and only choose the region with the highest weighted score. Therefore, we could relieve the constraints of radio image size.

CNN-Based Classification
We adopt a 7-layer CNN to recognize input image regions and output classified labels. The CNN model possesses several attractive qualities over the previous DNN model. Firstly, CNN has much fewer connections and parameters, so it could facilitate computing speed. Secondly, CNN enables large-scale high-dimension data training with current GPUs and local fields of perception. Thirdly, CNN could automatically capture abstract pixel-level characteristics through a hierarchical architecture with highly optimized implementation of convolution filters. Therefore, we believe CNN theoretically outperforms the DNN model in image classification. Figure 7 shows the architecture and related parameters of our CNN model, which consists of three core building blocks: Input Layer, Convolution (Conv.) Layer, Pooling (Po.) Layer, and Fully Connected (FC) Layer. We will introduce the function of each part and provide some key parameters. Input Layer. For each gesture, we input the top 10 labeled fixed-size images (i.e., 500 × 500 × 3) to train the CNN model and only use the top 1 region in the predicting step. Therefore, input layer will keep the size of an image with width 500, height 500 and depth 3.
Conv. Layer. Conv. Layer has natural advantages by applying a set of learnable spatial filters to achieve local connectivity and parameter sharing, because it connects each neuron to only a local region of the input volume (local in space, but all along the entire depth). Additionally, each filter slides through the whole input volume with the same kernel and generates 3-dimensional abstract feature maps that give the responses in every spatial position. For example, if we decide to adopt 64 3 × 3 × 3 convolutional kernels (or the receptive field) to filter a 500 × 500 × 3 raw image (stride length is 3 and zero padding size is 2), then we would obtain a volume of 168 × 168 × 64. Assuming that we use a total of 168 × 168 × 64 neurons, here, and only [(3 × 3 + 1) × 64] × 168 × 168 = 18,063,360 connections,

CNN-Based Classification
We adopt a 7-layer CNN to recognize input image regions and output classified labels. The CNN model possesses several attractive qualities over the previous DNN model. Firstly, CNN has much fewer connections and parameters, so it could facilitate computing speed. Secondly, CNN enables large-scale high-dimension data training with current GPUs and local fields of perception. Thirdly, CNN could automatically capture abstract pixel-level characteristics through a hierarchical architecture with highly optimized implementation of convolution filters. Therefore, we believe CNN theoretically outperforms the DNN model in image classification. Figure 7 shows the architecture and related parameters of our CNN model, which consists of three core building blocks: Input Layer, Convolution (Conv.) Layer, Pooling (Po.) Layer, and Fully Connected (FC) Layer. We will introduce the function of each part and provide some key parameters.

CNN-Based Classification
We adopt a 7-layer CNN to recognize input image regions and output classified labels. The CNN model possesses several attractive qualities over the previous DNN model. Firstly, CNN has much fewer connections and parameters, so it could facilitate computing speed. Secondly, CNN enables large-scale high-dimension data training with current GPUs and local fields of perception. Thirdly, CNN could automatically capture abstract pixel-level characteristics through a hierarchical architecture with highly optimized implementation of convolution filters. Therefore, we believe CNN theoretically outperforms the DNN model in image classification. Figure 7 shows the architecture and related parameters of our CNN model, which consists of three core building blocks: Input Layer, Convolution (Conv.) Layer, Pooling (Po.) Layer, and Fully Connected (FC) Layer. We will introduce the function of each part and provide some key parameters. Input Layer. For each gesture, we input the top 10 labeled fixed-size images (i.e., 500 × 500 × 3) to train the CNN model and only use the top 1 region in the predicting step. Therefore, input layer will keep the size of an image with width 500, height 500 and depth 3.
Conv. Layer. Conv. Layer has natural advantages by applying a set of learnable spatial filters to achieve local connectivity and parameter sharing, because it connects each neuron to only a local region of the input volume (local in space, but all along the entire depth). Additionally, each filter slides through the whole input volume with the same kernel and generates 3-dimensional abstract feature maps that give the responses in every spatial position. For example, if we decide to adopt 64 3 × 3 × 3 convolutional kernels (or the receptive field) to filter a 500 × 500 × 3 raw image (stride length is 3 and zero padding size is 2), then we would obtain a volume of 168 × 168 × 64. Assuming that we use a total of 168 × 168 × 64 neurons, here, and only [(3 × 3 + 1) × 64] × 168 × 168 = 18,063,360 connections, the calculated amount in the FC layer can be explosive, since the number of connection is (500 × 500) Input Layer. For each gesture, we input the top 10 labeled fixed-size images (i.e., 500 × 500 × 3) to train the CNN model and only use the top 1 region in the predicting step. Therefore, input layer will keep the size of an image with width 500, height 500 and depth 3.
Conv. Layer. Conv. Layer has natural advantages by applying a set of learnable spatial filters to achieve local connectivity and parameter sharing, because it connects each neuron to only a local region of the input volume (local in space, but all along the entire depth). Additionally, each filter slides through the whole input volume with the same kernel and generates 3-dimensional abstract feature maps that give the responses in every spatial position. For example, if we decide to adopt 64 3 × 3 × 3 convolutional kernels (or the receptive field) to filter a 500 × 500 × 3 raw image (stride length is 3 and zero padding size is 2), then we would obtain a volume of 168 × 168 × 64. Assuming that we use a total of 168 × 168 × 64 neurons, here, and only [(3 × 3 + 1) × 64] × 168 × 168 = 18,063,360 connections, the calculated amount in the FC layer can be explosive, since the number of connection is (500 × 500) × (168 × 168 × 64) ≈ 4.5 × 10 11 .
In our experiment, we summarize some tricks to select appropriate parameters. First, we set the kernel size as 3 × 3 × 3, since stacked small kernels provide finer characteristics [46]. Secondly, we use stride 3 in the first two layers to reduce the dimension and stride 1 in the last two layers for feature traversal. Thirdly, we put zeros on the image border to allow an exact division. Finally, we tune the number of kernels to learn sufficient characteristics and control the output number. We also add Rectified Linear Unit (RELU) layers between Conv. Layers and Po. Layers so as to activate the positive part of feature maps.
Po. Layer. The function of pooling layer is to perform a down-sampling operation vertically and horizontally through the whole feature map. We adopt max pooling function to reveal only the maximum value of each cluster in the prior layer. In this way, we gain computation efficiency by obtaining less spatial information and avoid over-fit by training fewer parameters. Since there is no weight to train, we set the kernel size and stride size as (2, 2) through the whole process.
FC Layer. Fully connected layer connects every neuron in the former Po. Layer to every neuron on the hidden layer, which aims to combines all deep features and train the output after all the Conv. Layer and Po. Layer. The output size of FC layer is designed to be 10 in order to align the corresponding gesture labels. A softmax regression layer is added on the bottom of the architecture, which serves to squash the 10-dimensional feature data to the probability p j of j clusters, which is given by we adopt cross entropy as a loss function to compare the difference between the predicted labels and true labels, and further utilize the Stochastic Gradient Descent with Momentum to train the parameters.

Experimental Study
In this section, we firstly introduce the experimental configuration, then describe the performance evaluation, and finally discuss the impacts of the experimental variables and limitations.

Experiment Configuration
Hardware setup. We implement DeepNum using a pair of ThinkPad X200 laptops (Lenovo, Beijing, China) equipped with modified Intel 5300 NICs (Intel, Santa Clara, USA) and installed CSITool [26]. Both laptops are set up to work in monitor mode on Channel 165 at 5.825 Hz, one acts as the transmitter with only one external antenna, the other one serves as the receiver, possessing three external antennas for extra spatial information. Three receiving antennas are linearly arranged with a spacing distance of 5.2 cm. The pack transmitting rate is set to 500 Hz for fast dynamics, and transmission power is set to 15 dBm by default. During the training process, we employ MATLAB 2017b (The MathWorks, Natick, USA) running on an Intel i7-5700HQ 2.70GHz CPU (Intel, Santa Clara, USA) to process the CSI matrix.
Environment and data collection. All experiments are conducted in typical indoor scenarios of an academic building, including a meeting room, a corridor and a student studio. As is shown in Figure 8a-c, there is a 6 × 6 m 2 meeting room with a conference table and twelve chairs, a 2 × 20 m 2 empty corridor surrounded by walls and a 6 × 12 m 2 complex student office with two worktables, and multiple blocks and electronic interferences. We set the distance between laptops as 2 m, 1 m and 3 m, respectively, and fix the height of the antennas at 1.3 m. Eight male and two female student volunteers perform controlled finger gestures as illustrated in Figure 8d. All volunteers major in computer engineering, and have ages between 22 and 35. When CSI collection begins, the participant is asked to follow the instructions given by a recorder, step into the wireless environment and arrive at appointed locations. Then the recorder randomly gives the command number from 0 to 9 and records the digits, while the participant continuously performs the finger gestures with the body stationary. During the experiments, every participant performs each gesture by following similar steps: raise the right hand, stretch fingers, retract fingers and put down the hand. We collect a total of 300 instances for all 10 volunteers in all 3 scenarios within a 3-week period. Each instance contains 20 action segments covering all 10 digit gestures with short breaks. In the training stage, we use 50% of the records for all 10 volunteers to train the CNN model and the remaining 50% of the records to test the performance. Evaluation metrics. To evaluate the system performance, we adopt two common metrics. (1) Confusion matrix. The x-axis and y-axis represent the classified finger gestures and actually performed finger gestures, respectively. The number in the (i, j)-th cell denotes the probability of the j-th gesture classified into the i-th label. (2) Accuracy. We conduct comparison studies and parameter studies to verify the system robustness.

Performance Evaluation
Overall performance. We first report the overall performance of DeepNum. As is shown in Figure 9a-b, DeepNum yields a considerable average accuracy of 98% and 99% in the empty meeting room and corridor. In particular, all gesture accuracy is improved in the corridor because of the shorter transmitting distance and the stronger reflected signals. The average accuracy is 2% lower in the student office (Figure 9c) due to the multiple blocks and electronic interferences. We also notice a similar distribution of prediction accuracy in all three environments. For instance, digit '5' dominates in the confusion matrix of all experimental scenarios because it requires a full stretch of all 5 fingers and induces larger signal fluctuations, while digit '6' is easily confused with '0' and '2', since they have similar movement patterns.
Compared to baselines. We survey the state-of-the-art Wi-Fi-based gesture recognition systems from the last 3 years and list the relevant works in Table 1. To fully demonstrate the high performance of DeepNum, we implement DeepNum alongside two typical schemes, SignFi [23] and WiFinger [13], for comparison. The reason is two-fold. On the one hand, as the emerging masterpiece in using a deep learning model to recognize finger gestures, SignFi imports full amplitude and phase information into a typical CNN model, and achieves high accuracy in specific scenes. However, it circumvents the selection of raw CSI signals and the signal segmentation step. We aim to show the efficiency of the image construction step and the feasibility of the region selection step by comparison. On the other hand, since most of works follow a similar methodology (i.e., handcrafted feature extraction and traditional classification algorithms), we choose WiFinger [13] as a benchmark in order to show the superiority of the deep learning model. In Figure 9d, we observe that WiFinger has an acceptable overall accuracy of over 93% in all three scenarios; however, it is lower than the two CNNbased schemes. Furthermore, WiFinger fails to consider both amplitude and phase information, which may miss informative signal features. SignFi achieves competitive results in both the meeting room and corridor by careful parameter tuning. However, the accuracy drops more obviously in the Evaluation metrics. To evaluate the system performance, we adopt two common metrics.
(1) Confusion matrix. The x-axis and y-axis represent the classified finger gestures and actually performed finger gestures, respectively. The number in the (i, j)-th cell denotes the probability of the j-th gesture classified into the i-th label. (2) Accuracy. We conduct comparison studies and parameter studies to verify the system robustness.

Performance Evaluation
Overall performance. We first report the overall performance of DeepNum. As is shown in Figure 9a-b, DeepNum yields a considerable average accuracy of 98% and 99% in the empty meeting room and corridor. In particular, all gesture accuracy is improved in the corridor because of the shorter transmitting distance and the stronger reflected signals. The average accuracy is 2% lower in the student office (Figure 9c) due to the multiple blocks and electronic interferences. We also notice a similar distribution of prediction accuracy in all three environments. For instance, digit '5' dominates in the confusion matrix of all experimental scenarios because it requires a full stretch of all 5 fingers and induces larger signal fluctuations, while digit '6' is easily confused with '0' and '2', since they have similar movement patterns.
Compared to baselines. We survey the state-of-the-art Wi-Fi-based gesture recognition systems from the last 3 years and list the relevant works in Table 1. To fully demonstrate the high performance of DeepNum, we implement DeepNum alongside two typical schemes, SignFi [23] and WiFinger [13], for comparison. The reason is two-fold. On the one hand, as the emerging masterpiece in using a deep learning model to recognize finger gestures, SignFi imports full amplitude and phase information into a typical CNN model, and achieves high accuracy in specific scenes. However, it circumvents the selection of raw CSI signals and the signal segmentation step. We aim to show the efficiency of the image construction step and the feasibility of the region selection step by comparison. On the other hand, since most of works follow a similar methodology (i.e., handcrafted feature extraction and traditional classification algorithms), we choose WiFinger [13] as a benchmark in order to show the superiority of the deep learning model. In Figure 9d, we observe that WiFinger has an acceptable overall accuracy of over 93% in all three scenarios; however, it is lower than the two CNN-based schemes. Furthermore, WiFinger fails to consider both amplitude and phase information, which may miss informative signal features. SignFi achieves competitive results in both the meeting room and corridor by careful parameter tuning. However, the accuracy drops more obviously in the student office. This is mainly because motion-induced distortions are overwhelmed by background noises. Thanks to sensitive image construction and sanitation, DeepNum outperforms SignFi in NLOS scenarios. Moreover, DeepNum avoids the size requirement, which renders it more practical.

Parameter Study
Impact of user diversity. To evaluate whether our system works consistently for different users, we employ 10 volunteers of different genders, heights, weights and body mass index (BMI). Table 2 shows the statistics of the participants; User ID 1 to 8 are male, and User ID 9 to 10 are female. We also present the recognition results of all participants in Figure 10. As shown, DeepNum is able to handle diversity and achieve robust recognition for all individuals with an accuracy higher than 95%. However, there is no direct correlation between user diversity and accuracy. We leave further study of this for our future work.

Parameter Study
Impact of user diversity. To evaluate whether our system works consistently for different users, we employ 10 volunteers of different genders, heights, weights and body mass index (BMI). Table 2 shows the statistics of the participants; User ID 1 to 8 are male, and User ID 9 to 10 are female. We also present the recognition results of all participants in Figure 10. As shown, DeepNum is able to handle diversity and achieve robust recognition for all individuals with an accuracy higher than 95%. However, there is no direct correlation between user diversity and accuracy. We leave further study of this for our future work.

Parameter Study
Impact of user diversity. To evaluate whether our system works consistently for different users, we employ 10 volunteers of different genders, heights, weights and body mass index (BMI). Table 2 shows the statistics of the participants; User ID 1 to 8 are male, and User ID 9 to 10 are female. We also present the recognition results of all participants in Figure 10. As shown, DeepNum is able to handle diversity and achieve robust recognition for all individuals with an accuracy higher than 95%. However, there is no direct correlation between user diversity and accuracy. We leave further study of this for our future work.   Impact of training strategy. In the experiments, we collected a total of 6000 gesture segments from all 10 volunteers. Specifically, we use 50% of the records of all 10 volunteers to train the CNN model, and the remaining 50% of the records to test the performance. To further discuss the impact of the training strategy, we train the CNN model with decreased numbers of users and numbers of training samples per user. In Figure 11, we firstly train the model with all records and test the performance using the data of one user, which obtains the highest accuracy. Our training method uses fewer training samples but achieves considerable results, with an accuracy of 98%. We also decrease the number of users from 10 to 5, and use the remainder as the test groups. A significant drop is observed in the accuracy from 98% to 65%, while the differences between diverse training sample sizes is not significant. So adding participant diversity really matters for a practical system.  Impact of training strategy. In the experiments, we collected a total of 6000 gesture segments from all 10 volunteers. Specifically, we use 50% of the records of all 10 volunteers to train the CNN model, and the remaining 50% of the records to test the performance. To further discuss the impact of the training strategy, we train the CNN model with decreased numbers of users and numbers of training samples per user. In Figure 11, we firstly train the model with all records and test the performance using the data of one user, which obtains the highest accuracy. Our training method uses fewer training samples but achieves considerable results, with an accuracy of 98%. We also decrease the number of users from 10 to 5, and use the remainder as the test groups. A significant drop is observed in the accuracy from 98% to 65%, while the differences between diverse training sample sizes is not significant. So adding participant diversity really matters for a practical system. Figure 11. Impact of training strategy.

Impact of raw material.
To evaluate the efficiency of image construction, we perform contrast experiment by applying various combinations of amplitude and relative phases. Please note that the depth of the radio image is fixed at 3, since that is a requirement of the current CNN model. We leave 'higher-dimensional tensor processing' for future work. Figure 12 shows that amplitude or relative phase information alone are not sufficient to detect slight finger gestures. Based on observation, we extract sensitive amplitudes from one antenna and two relative phases from all three antenna pairs. Impact of image de-noising. We also check the influence of HOSVD, which reduces the noise in the radio image. In Figure 13, without any de-noising step, the average accuracies are 83%, 85% 78% for the meeting room, corridor and student office, respectively. This implies that the radio image is built up from sensitive and robust information. Applying SVD in each matrix helps a little in overall accuracy, but if we perform HOSVD from a high-dimensional perspective, the accuracy can be   Figure 11. Impact of training strategy.

Impact of raw material.
To evaluate the efficiency of image construction, we perform contrast experiment by applying various combinations of amplitude and relative phases. Please note that the depth of the radio image is fixed at 3, since that is a requirement of the current CNN model. We leave 'higher-dimensional tensor processing' for future work. Figure 12 shows that amplitude or relative phase information alone are not sufficient to detect slight finger gestures. Based on observation, we extract sensitive amplitudes from one antenna and two relative phases from all three antenna pairs.  Impact of training strategy. In the experiments, we collected a total of 6000 gesture segments from all 10 volunteers. Specifically, we use 50% of the records of all 10 volunteers to train the CNN model, and the remaining 50% of the records to test the performance. To further discuss the impact of the training strategy, we train the CNN model with decreased numbers of users and numbers of training samples per user. In Figure 11, we firstly train the model with all records and test the performance using the data of one user, which obtains the highest accuracy. Our training method uses fewer training samples but achieves considerable results, with an accuracy of 98%. We also decrease the number of users from 10 to 5, and use the remainder as the test groups. A significant drop is observed in the accuracy from 98% to 65%, while the differences between diverse training sample sizes is not significant. So adding participant diversity really matters for a practical system. Figure 11. Impact of training strategy.

Impact of raw material.
To evaluate the efficiency of image construction, we perform contrast experiment by applying various combinations of amplitude and relative phases. Please note that the depth of the radio image is fixed at 3, since that is a requirement of the current CNN model. We leave 'higher-dimensional tensor processing' for future work. Figure 12 shows that amplitude or relative phase information alone are not sufficient to detect slight finger gestures. Based on observation, we extract sensitive amplitudes from one antenna and two relative phases from all three antenna pairs. Impact of image de-noising. We also check the influence of HOSVD, which reduces the noise in the radio image. In Figure 13, without any de-noising step, the average accuracies are 83%, 85% 78% for the meeting room, corridor and student office, respectively. This implies that the radio image is built up from sensitive and robust information. Applying SVD in each matrix helps a little in overall accuracy, but if we perform HOSVD from a high-dimensional perspective, the accuracy can be Impact of image de-noising. We also check the influence of HOSVD, which reduces the noise in the radio image. In Figure 13, without any de-noising step, the average accuracies are 83%, 85% 78% for the meeting room, corridor and student office, respectively. This implies that the radio image is built up from sensitive and robust information. Applying SVD in each matrix helps a little in overall accuracy, but if we perform HOSVD from a high-dimensional perspective, the accuracy can be improved significantly, especially in NLOS environments. The reason for this is that HOSVD is able to manipulate multi-channel correlation simultaneously, so slight motion-induced distortions can be captured even in a noisy environment. improved significantly, especially in NLOS environments. The reason for this is that HOSVD is able to manipulate multi-channel correlation simultaneously, so slight motion-induced distortions can be captured even in a noisy environment.  Figure 14 plots the performance of DeepNum. Under the initial condition, DeepNum has its lowest average accuracy of 81%, due to irrelevant body movements and severe multi-path effect. The accuracy climbs to a peak as the antenna height reaches 1.3 m, with a strong reflected signal. We also notice that DeepNum consistently yields high accuracy at a height of 1.7 m. Therefore, DeepNum could work well in real indoor environments, since most Wi-Fi devices are placed on the ceiling. Impact of sampling rate. Please note that we extract image regions with a fixed size of 500 × 500; the high sampling rate introduces fine-grained local movement information, but limited global gesture features. To strike a balance, we first initialize the sampling rate at 1000 Hz and gradually discard CSI data to obtain sampling rates of 500 Hz, 250 Hz and 125 Hz. In Figure 15, we note that 500 Hz achieves the best performance, since it includes 1 s gesture information with sufficiently detailed features. A sampling rate of 1000 Hz or higher may lower the performance, because the image region contains less temporal or gestural contextual information. Impact of feature extraction. We further conduct experiments and compare the accuracy of different feature extraction strategies in Table 3. As the size requirement really matters, we firstly select the fixed-size regions in the center of images, without feature extraction step. The results are  Figure 14 plots the performance of DeepNum. Under the initial condition, DeepNum has its lowest average accuracy of 81%, due to irrelevant body movements and severe multi-path effect. The accuracy climbs to a peak as the antenna height reaches 1.3 m, with a strong reflected signal. We also notice that DeepNum consistently yields high accuracy at a height of 1.7 m. Therefore, DeepNum could work well in real indoor environments, since most Wi-Fi devices are placed on the ceiling. improved significantly, especially in NLOS environments. The reason for this is that HOSVD is able to manipulate multi-channel correlation simultaneously, so slight motion-induced distortions can be captured even in a noisy environment.  Figure 14 plots the performance of DeepNum. Under the initial condition, DeepNum has its lowest average accuracy of 81%, due to irrelevant body movements and severe multi-path effect. The accuracy climbs to a peak as the antenna height reaches 1.3 m, with a strong reflected signal. We also notice that DeepNum consistently yields high accuracy at a height of 1.7 m. Therefore, DeepNum could work well in real indoor environments, since most Wi-Fi devices are placed on the ceiling. Impact of sampling rate. Please note that we extract image regions with a fixed size of 500 × 500; the high sampling rate introduces fine-grained local movement information, but limited global gesture features. To strike a balance, we first initialize the sampling rate at 1000 Hz and gradually discard CSI data to obtain sampling rates of 500 Hz, 250 Hz and 125 Hz. In Figure 15, we note that 500 Hz achieves the best performance, since it includes 1 s gesture information with sufficiently detailed features. A sampling rate of 1000 Hz or higher may lower the performance, because the image region contains less temporal or gestural contextual information. Impact of feature extraction. We further conduct experiments and compare the accuracy of different feature extraction strategies in Table 3. As the size requirement really matters, we firstly select the fixed-size regions in the center of images, without feature extraction step. The results are Impact of sampling rate. Please note that we extract image regions with a fixed size of 500 × 500; the high sampling rate introduces fine-grained local movement information, but limited global gesture features. To strike a balance, we first initialize the sampling rate at 1000 Hz and gradually discard CSI data to obtain sampling rates of 500 Hz, 250 Hz and 125 Hz. In Figure 15, we note that 500 Hz achieves the best performance, since it includes 1 s gesture information with sufficiently detailed features. A sampling rate of 1000 Hz or higher may lower the performance, because the image region contains less temporal or gestural contextual information. improved significantly, especially in NLOS environments. The reason for this is that HOSVD is able to manipulate multi-channel correlation simultaneously, so slight motion-induced distortions can be captured even in a noisy environment.  Figure 14 plots the performance of DeepNum. Under the initial condition, DeepNum has its lowest average accuracy of 81%, due to irrelevant body movements and severe multi-path effect. The accuracy climbs to a peak as the antenna height reaches 1.3 m, with a strong reflected signal. We also notice that DeepNum consistently yields high accuracy at a height of 1.7 m. Therefore, DeepNum could work well in real indoor environments, since most Wi-Fi devices are placed on the ceiling. Impact of sampling rate. Please note that we extract image regions with a fixed size of 500 × 500; the high sampling rate introduces fine-grained local movement information, but limited global gesture features. To strike a balance, we first initialize the sampling rate at 1000 Hz and gradually discard CSI data to obtain sampling rates of 500 Hz, 250 Hz and 125 Hz. In Figure 15, we note that 500 Hz achieves the best performance, since it includes 1 s gesture information with sufficiently detailed features. A sampling rate of 1000 Hz or higher may lower the performance, because the image region contains less temporal or gestural contextual information. Impact of feature extraction. We further conduct experiments and compare the accuracy of different feature extraction strategies in Table 3. As the size requirement really matters, we firstly select the fixed-size regions in the center of images, without feature extraction step. The results are Impact of feature extraction. We further conduct experiments and compare the accuracy of different feature extraction strategies in Table 3. As the size requirement really matters, we firstly select the fixed-size regions in the center of images, without feature extraction step. The results are unacceptable, since the regions only preserve a part of the motion-induced information. Using color features or texture features as a priori knowledge significantly improves the accuracy in the meeting room and corridor, while the performance could be affected by multiple interferences in the student office, at 78.9% and 83.4%, respectively. We utilize both color and texture features so as to enhance the robustness of region selection, and achieve the highest accuracy in the three scenes.

Limitations and Discussion
Multiple moving objects. Targeting single-hand finger gesture recognition of single user, DeepNum is vulnerable to multiple moving hands or bodies, which may induce super-imposed signal fluctuations at the receivers and dampen the promotion of gesture-based applications. Wang et al. [47] propose a multi-person breath monitoring method based on tensor decomposition with a pair of Wi-Fi devices. However, this requires prior knowledge of the number of users or needs to be tackled as a NP-hard problem. How to recognize finger gestures performed by multiple hands simultaneously will still remain an open and challenging problem for the future.
Increasing sample diversity. User diversity and training strategy are important factors for ubiquitous gesture recognition. In our experiments, we try to increase sample diversity by recruiting different kinds of volunteers; however, it is labor-intensive to collect such an amount of CSI data. WiAG [15] offers an alternative by applying translation functions on limited training samples, which inspires us to broaden sample diversity and increase training samples by slight parameter adjustment.
Designing proper architecture. In this paper, since our major concern is to break the constraints of size requirements, we simplify the CNN architecture by adopting a conventional 7-layer CNN model and adjusting its related parameters (i.e., kernel size, kernel number, stride size and zero padding) for considerable results. In the future, we will seek to promote system performance in NLOS environments by deepening the model and designing more appropriate architectures based on up-to-date deep learning models, like VGG-net [46], GoogleNet [48] and Deep Residual Learning [49].
Tracking long-term behaviors. Borrowing the idea from the image-based domain, DeepNum recognizes finger gestures from a high-dimensional perspective and extracts deeper features based on the CNN model, which can be applied to various applications. However, we argue that CNN is not a panacea for all scenarios, especially with respect to long-term behavior monitoring, such as habit modeling and health monitoring. As an analogy, these applications seem more like video-based problems rather than image-based issues, and require a whole different line of thinking and context awareness. Please note that CNN cannot capture the relations between temporal segments, and we leave the behavior modeling for our future work.
Computation cost. To evaluate the computation cost of DeepNum, we calculate the average running time on a laptop with an Intel i7-5700HQ 2.70GHz CPU (Intel, Santa Clara, USA). Table 4 illustrates the running time of each processing part. As shown, the major time cost comes from the region selection step. This is because we select the most significant region by extensive search. The calculation time can be reduced by using high-performance processors. Since the total running time for each gesture is only 57.85 ms, it is still acceptable for Wi-Fi-based systems, which will generally achieve finger gesture recognition on the scale of seconds.

Conclusions
In this paper, we present a novel finger gesture recognition system, named DeepNum, the core part of which is radio image construction and image manipulation. In the first part, we extract amplitude from the most sensitive antenna and segment action slices using wavelet energy. To fully exploit signal characteristics, we obtain two sensitive relative phase matrices from pairwise coupling antenna pairs and further construct a 3-dimensional tensor. Next, in the second part, we resort to HOSVD for image de-noising and region selection for the input requirement of CNN model, which contributes to extracting deep features automatically and outputs classified results accurately. Experimental results show that DeepNum achieves an overall recognition accuracy of 98% in three typical indoor environments. From micro-movement detection to micro-movement recognition, and from statistical properties calculation to deep feature extraction, DeepNum goes a step further by processing CSI data from a higher-dimensional perspective. As a pioneer, DeepNum is expected to promote the progress of wireless sensing applications.
Author Contributions: Q.Z. and J.X. conceived and designed the experiments; Q.Z. and W.C. performed the experiments; Q.Z. and X.Z. analyzed the data; Q.Z. and Q.Y. contributed analysis tools; Q.Z. wrote the paper.