WiFi-Based Gesture Recognition for Vehicular Infotainment System—An Integrated Approach

In the realm of intelligent vehicles, gestures can serve as automotive interfaces for controlling in-vehicle functions without diverting the driver's visual attention from the road. Driver gesture recognition has gained increasing attention in advanced vehicular technology because of its substantial safety benefits. This work demonstrates a novel WiFi-based device-free approach to driver gesture recognition for an automotive interface that controls secondary systems in a vehicle. Leveraging Channel State Information (CSI), our proposed wireless model recognizes human gestures accurately for in-vehicle infotainment applications. This computationally efficient framework is based on the properties of K Nearest Neighbors (KNN), induced in sparse representation coefficients to improve gesture classification. In this approach, we explore the mean of nearest neighbors to address the computational complexity of Sparse Representation based Classification (SRC). The presented scheme leads to an efficient integrated classification model with reduced execution time. KNN and SRC are complementary candidates for integration in the sense that KNN is simple yet effective, whereas SRC is computationally complex but performs well. More specifically, we exploit the mean-based nearest neighbor rule to further improve the efficiency of SRC. The ultimate goal of this framework is to provide a better feature extraction and classification model than the traditional algorithms already used for WiFi-based device-free gesture recognition. Our proposed method improves gesture recognition significantly across diverse application scenarios, with an average accuracy of 91.4%.


Introduction
Distracted driving is one of the main concerns that compromise road safety. A large number of road accidents are reported because of the driver's engagement in conventional secondary tasks using visual-manual interfaces. With the advancements in vehicular technology and the introduction of human computer interaction (HCI), gesture-based touchless automotive interfaces are being incorporated in vehicle designs to reduce driver visual distraction. Driver gesture recognition has therefore become an active research topic in recent years. Human gesture recognition has been widely explored in the literature for a variety of applications to reduce the complexity of human interaction with computers and other digital interfaces [1][2][3][4].
Existing research on in-vehicle gesture recognition is mainly focused on sensors, radars, or cameras [5][6][7], which have their own limitations in practical scenarios [8,9]. Haptified gestures with ultrasound are also being incorporated in the automotive domain [10]. Over the last decade, WiFi-based human gesture recognition systems have gained more attention, exploiting CSI.
The classification algorithm plays a vital role in the gesture recognition framework. Most of the existing CSI-based gesture recognition systems rely on a single classification algorithm, which cannot guarantee maximum performance in vehicular technology. Carefully integrating two or more classification algorithms may enhance the recognition performance of a classifier for automotive interfaces. In this context, both KNN and SRC classifiers have been used efficiently as stand-alone classifiers to solve various classification problems in WiFi-based device-free localization and activity or gesture recognition [13,16-18].
In our proposed scheme, the strength of SRC is applied to a nearest neighbor rule, to further improve the performance of SRC with reduced computational cost. The classification is therefore based on the integrated properties of SRC and a modified KNN. We utilize the Mean of Nearest Neighbors (MNN), a variant of KNN. The concept of the mean-based algorithm was originally proposed by Mitani and Hamamoto [19] to classify query patterns, using a local mean-based non-parametric classifier. The idea of local mean-based KNN has been successfully applied in various pattern recognition systems [20,21]. Our prototype is based on multiple variants of KNN and SRC algorithms in the literature [22][23][24][25][26]. Different from others, we incorporate MNN with SRC for significant improvement in device-free gesture recognition for automotive infotainment systems, exploiting CSI. We integrate MNN and SRC because the combination has a computational advantage over using SRC alone.
Our proposed system leverages off-the-shelf WiFi devices to collect channel information, which is readily available in the form of CSI measurements on commercial WiFi devices. Our algorithm utilizes the change in WiFi channel information caused by driver gestures. The driver needs to perform specific gestures in the WiFi coverage area. As far as we know, this is the first attempt towards CSI-based device-free driver gesture recognition using nearest neighbor induced sparse representation for automotive interfaces.
Our contributions can be summarized as follows:
• We present an innovative WiFi-based device-free framework to address the problem of driver gesture recognition for vehicle infotainment systems, leveraging CSI measurements.
• We demonstrate a novel classification model by integrating SRC and a variant of the KNN algorithm to overcome the problem of expensive computational cost.
• To evaluate the performance of our proposed framework, we perform comprehensive experiments in promising application scenarios.
• To validate the results, we compare our system performance with state-of-the-art methods.
The rest of the paper is organized as follows. Section 2 reviews the traditional techniques that are relevant to our work. In Section 3, we provide an overview of our proposed system. Section 4 presents the detailed system flow with the methodology of our proposed solution. Section 5 describes the experimental settings and validates our results by way of performance evaluation. In Section 6, we discuss important aspects and limitations of the proposed framework. Finally, we conclude our work with some suggestions for future work in Section 7.

Related Work
In this section, we briefly review the CSI-based activity and gesture recognition systems that exploit different classification methods, including KNN and SRC algorithms. Fine-grained physical layer CSI has attracted many researchers because of its potential for accurate indoor localization. With the advancement in wireless technology and the ubiquitous deployment of indoor wireless systems, indoor localization has entered a new era of modern life [27]. Recently, a WiFi-based training-free localization system has been presented with good performance [28].
WiSee [29] is a WiFi based system that can recognize nine different gestures in line-of-sight (LOS), non line-of-sight (NLOS) and through-the-wall scenarios via Doppler shifts. Through-wall motion detection was investigated by WiVi [30] using multi antenna techniques. However, traditional solutions have been prototyped with special software or hardware to capture OFDM (Orthogonal Frequency Division Multiplexing) signals. As compared to these conventional methods, our proposed solution leverages off-the-shelf WiFi devices without any change in infrastructure.
The emerging device-free localization and activity recognition relies on CSI for better characterization of WiFi signals influenced by the human activities [31][32][33][34]. In recent years, CSI based micro-activity recognition [35] and intrusion detection [36][37][38] systems have emerged with good recognition results.
WiG [11] focused on WiFi-based gesture recognition for both LOS and NLOS scenarios. CSI-based gesture detection was presented in Reference [12] leveraging packet transmission and the recognition of gestures was performed by distinguishing their strengths. WiGer [39] demonstrated a WiFi-based gesture recognition system by designing a fast dynamic time warping algorithm to classify hand gestures. Recently, Reference [40] presented the writing in air with WiFi signals for virtual reality devices with more complexity and increased diversity as compared to simple gestures.
Most existing CSI-based systems rely on a single classification algorithm. During the previous decade, KNN and SRC algorithms have been widely used as stand-alone classifiers in various CSI-based device-free activity recognition models. WiCatch [41] proposed a WiFi-based hand gesture recognition system that utilizes weak signals reflected from hands, with recognition based on support vector machines (SVMs). Wi-Key [16] recognized keystrokes via CSI waveforms using the KNN classifier. WiFinger [13] leverages WiFi signals for finger gesture recognition by examining the unique patterns of CSI using the KNN classifier. The effect of Doppler shifts on CSI for health-care applications using the SRC classification algorithm was described in Reference [17]. Human activity recognition using the SRC classifier has been successfully demonstrated in Reference [18] with high accuracy for both LOS and NLOS scenarios leveraging CSI. Different from previous work, our classification algorithm is based on the integration of modified KNN and SRC for an automotive infotainment interface.
Limited work has been reported in the literature on WiFi-based recognition of a driver's in-vehicle activities or gestures. In recent years, WiFi-based driver fatigue detection and driver activity recognition systems have been investigated with some constraints. In this context, WiFind [42] is suitable only for fatigue detection, while WiDriver [43] leverages a driver's hand movements for better characterization of driver action recognition using WiFi signals. SafeDrive-Fi [44] demonstrated CSI-based dangerous driving recognition through body movements and gestures, using the variance of CSI amplitude and phase measurements. Different from them, we focus specifically on driver hand, finger and head gesture recognition for in-vehicle infotainment systems.

System Overview
In this section, we will present the overview of our proposed classification algorithm and system architecture, along with the important facts about CSI relevant to our work.

Background of SRC and MNN Algorithms
Both KNN and SRC classifiers have been used efficiently in various wireless device-free localization and recognition systems [13,16-18]. Conventional SRC is time consuming in the sense that a testing sample is usually represented by all training samples. The decision rule of SRC follows that if the testing sample has a great similarity with a training sample, the sparse coefficients of that training sample will be larger in the representation of the testing sample. SRC is effective in the sense that all coefficients participate in decision making. However, the computational cost of SRC increases with the size of the training data.
KNN is a simple yet effective classifier. However, KNN is sensitive to the neighborhood size and relies on simple majority voting for classification, which can degrade its performance. The traditional KNN classifier chooses the K nearest neighbors from the training data and uses majority voting to decide the class. Mean Nearest Neighbor (MNN) is a variant of KNN that uses the mean of the neighbors as a prototype of the associated class.

Integration of SRC and MNN Algorithms
We start with traditional KNN and estimate the K nearest neighbors from all training samples of each class. We then calculate the mean of the K nearest neighbors within each class. We use the decision rule of SRC to supervise MNN, and sparse representation coefficients are computed from the mean vectors of nearest neighbors. Finally, the class of the testing sample is decided on the residual between the testing sample and the mean of nearest neighbors within each class. More details about this integrated classification algorithm are given in Section 4.3.

CSI Overview
Our prototype leverages WiFi ambient signal as information source to analyze the influence of a driver's gesture on a wireless channel. Existing WiFi devices that exploit IEEE 802.11n/ac protocols typically consist of multiple transmitting (Tx) and receiving (Rx) antennas and thus support the widely used Multiple-Input Multiple-Output (MIMO) technology. CSI refers to fine-grained signal containing physical layer information based on Orthogonal Frequency-Division Multiplexing (OFDM).
In our experiment, an IEEE 802.11n enabled Access Point (AP) was used as a transmitter and an Intel 5300 NIC was used as a receiver to collect CSI data from the physical layer of the WiFi system, which supports 30 subcarriers for each CSI Tx-Rx antenna pair. It records the channel properties of each Tx-Rx antenna pair in OFDM subcarriers. The channel variations are readily available in the form of CSI measurements on commercial WiFi devices [45]. A typical narrowband flat-fading channel for packet index $i$, exploiting MIMO and OFDM technology, can be modeled as:
$$Y_i = H_i X_i + N_i, \quad i \in [1, N],$$
where $H_i$ is the CSI channel matrix for packet index $i$, $N_i$ denotes the Gaussian noise vector, $Y_i$ and $X_i$ are the received and transmitted signals respectively, and $N$ refers to the total number of received packets. Let $N_{Tx}$ and $N_{Rx}$ be the number of transmitting and receiving antennas respectively; then the CSI matrix consists of $N_{Tx} \times N_{Rx} \times 30$ complex values for each CSI stream. The CSI matrix $H$ for each Tx-Rx antenna pair can be written as:
$$H = [h_1, h_2, \ldots, h_{30}].$$
Each $h$ is a complex value, which carries the information for both amplitude and phase response, estimated as:
$$h = |h| e^{j \angle h},$$
where $|h|$ denotes the amplitude and $\angle h$ indicates the phase information. CSI information on multiple channels is correlated, whereas all streams behave independently. We utilize both the amplitude and phase information to unleash the full potential of CSI measurements.
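To make the amplitude and phase decomposition concrete, the following minimal Python sketch splits complex CSI values into $|h|$ and $\angle h$. The function name and the three sample values are hypothetical and chosen only for illustration; they are not measurements from the experiments.

```python
import cmath

def amplitude_and_phase(csi_row):
    """Split complex CSI values into amplitude |h| and phase (radians)."""
    amplitudes = [abs(h) for h in csi_row]
    phases = [cmath.phase(h) for h in csi_row]
    return amplitudes, phases

# Three made-up subcarrier values for one Tx-Rx antenna pair.
csi_row = [3 + 4j, 1j, -2 + 0j]
amps, phases = amplitude_and_phase(csi_row)
```

In a real capture each Tx-Rx pair would provide 30 such values per packet.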

System Architecture
Our device-free driver gesture recognition system comprises the following three basic modules: (1) CSI pre-processing module, (2) feature extraction module and (3) classification module, as illustrated in Figure 1. The CSI pre-processing module collects and pre-processes CSI measurements using basic filtering techniques. The feature extraction module is responsible for gesture detection, dimension reduction and feature extraction. Recognition is performed in the classification module, which relies on the integrated classification method. In Section 4, we explain the function of each module in detail.

Methodology
In this section, we will explain complete flow of our system methodology.

CSI Pre-Processing
The received CSI signal combines useful information with undesirable noise. It is therefore essential to filter the CSI amplitude measurements and calibrate the phase measurements. Human gestures have a relatively low frequency compared to the noise. To remove high frequency noise, we apply a second order low pass Butterworth filter. We set the packet sampling rate ($F_s$) to 80 packets/s, which gives a normalized cutoff frequency of $w_n = 2\pi f / F_s = 0.025\pi$ rad/s, that is, a cutoff of $f = 1$ Hz. To extract the actual phase and eliminate the channel frequency offset, phase calibration and linear transformation are performed following [46].
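The filtering step can be sketched in plain Python as below. This is only an illustration under stated assumptions: the coefficients come from the standard bilinear-transform design of a second order Butterworth section, the filter is applied causally in a single pass (the source does not say whether zero-phase filtering was used), and the function names and test signals are hypothetical.

```python
import math

def butter2_lowpass_coeffs(cutoff_hz, fs_hz):
    """Second-order Butterworth low-pass coefficients via the bilinear transform."""
    k = math.tan(math.pi * cutoff_hz / fs_hz)
    norm = 1.0 / (1.0 + math.sqrt(2.0) * k + k * k)
    b = (k * k * norm, 2.0 * k * k * norm, k * k * norm)
    a = (1.0, 2.0 * (k * k - 1.0) * norm, (1.0 - math.sqrt(2.0) * k + k * k) * norm)
    return b, a

def lowpass(samples, cutoff_hz=1.0, fs_hz=80.0):
    """Filter one CSI stream with a causal second-order IIR section."""
    b, a = butter2_lowpass_coeffs(cutoff_hz, fs_hz)
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out

# Synthetic signals sampled at 80 packets/s: a slow 0.5 Hz "gesture"
# component and a fast 20 Hz "noise" component.
fs = 80.0
t = [n / fs for n in range(800)]
slow = lowpass([math.sin(2 * math.pi * 0.5 * ti) for ti in t])
fast = lowpass([math.sin(2 * math.pi * 20.0 * ti) for ti in t])
```

With a 1 Hz cutoff, the slow component passes almost unchanged while the fast component is strongly attenuated.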

Phase Calibration
CSI raw phase measurements behave extremely randomly because the clocks of the transmitter and receiver are not synchronized. The relation between the measured phase and the true phase can be written as:
$$\angle \hat{h}_j = \angle h_j + 2\pi \frac{n_j}{N} \Delta t + \beta + z,$$
where $\angle \hat{h}_j$ indicates the measured phase of the $j$th subcarrier and $\angle h_j$ denotes the actual phase, $\Delta t$ represents the time lag, $n_j$ denotes the subcarrier index, $N$ stands for the size of the FFT, $\beta$ indicates the unknown phase offset and $z$ is random noise. We cannot measure the exact values of $\beta$ and $\Delta t$ for each packet. However, we can obtain a consistent result for every packet by applying a simple transformation. The phase error $2\pi \frac{n_j}{N} \Delta t + \beta$ is a linear function of the subcarrier index $n_j$. We therefore define two calibration parameters, a slope $a$ and an offset $b$, and subtract $a n_j + b$ from the raw phase $\angle \hat{h}_j$ to get the sanitized phase $\angle \tilde{h}_j$ as:
$$\angle \tilde{h}_j = \angle \hat{h}_j - a n_j - b.$$
The phase sanitization is performed on all the subcarriers, and the sanitized phases are re-assembled with the corresponding amplitudes.
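The linear transformation can be illustrated with the short sketch below. It assumes one commonly used choice of the two parameters, namely the end-to-end slope across subcarriers for $a$ and the mean raw phase for $b$; the source does not spell these out, so treat both estimates, the function name and the synthetic indices as assumptions. The raw phases are also assumed to be unwrapped already.

```python
def sanitize_phase(raw_phase, subcarrier_index):
    """Remove a linear trend a*n_j + b from unwrapped raw phases."""
    n_first, n_last = subcarrier_index[0], subcarrier_index[-1]
    # Assumed slope estimate: phase change between first and last subcarrier.
    a = (raw_phase[-1] - raw_phase[0]) / (n_last - n_first)
    # Assumed offset estimate: mean of the raw phases.
    b = sum(raw_phase) / len(raw_phase)
    return [p - a * n - b for p, n in zip(raw_phase, subcarrier_index)]

# Synthetic check: a true phase of zero plus a purely linear error.
index = [n for n in range(-15, 16) if n != 0]   # 30 hypothetical symmetric indices
raw = [0.3 * n + 1.2 for n in index]            # made-up slope and offset
clean = sanitize_phase(raw, index)
```

Because the synthetic error is exactly linear and the indices are symmetric, the sanitized phases collapse to approximately zero.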

Amplitude Information Processing
CSI amplitude measurements are very noisy for gesture extraction because of environmental noise and signal interference. We propose applying the weighted moving average (WMA) over CSI amplitude streams to eliminate the outliers and avoid false anomaly by the following procedure [47].
Let $|h_t|$ denote the amplitude information at time interval $t$; then the expression for WMA can be written as:
$$|\bar{h}_t| = \frac{m\,|h_t| + (m-1)\,|h_{t-1}| + \cdots + 1 \cdot |h_{t-m+1}|}{m + (m-1) + \cdots + 1},$$
where $|\bar{h}_t|$ is the averaged amplitude at time $t$. The newest amplitude $|h_t|$ has the weight value $m$, which decides to what degree the current value relates to historical data. The value of $m$ can be selected empirically based on experiments. Figure 2 shows the comparison of the original signal and the signal processed with WMA. It is clear that WMA makes the subcarrier waveform much smoother by reducing the noise level.
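A minimal sketch of the weighted moving average follows, assuming linearly decreasing weights $m, m-1, \ldots, 1$ from the newest to the oldest sample and a shortened window at the start of the stream. The function name, the start-up handling and the sample values are illustrative assumptions.

```python
def weighted_moving_average(amplitudes, m):
    """Smooth a stream with weights m, m-1, ..., 1 (newest sample weighted most)."""
    smoothed = []
    for t in range(len(amplitudes)):
        window = amplitudes[max(0, t - m + 1): t + 1]     # up to m latest samples
        weights = range(m - len(window) + 1, m + 1)       # oldest -> newest
        total = sum(w * x for w, x in zip(weights, window))
        smoothed.append(total / sum(weights))
    return smoothed

stream = [1.0, 1.0, 10.0, 1.0, 1.0]     # made-up amplitudes with one outlier
smoothed = weighted_moving_average(stream, m=3)
```

For this toy stream the outlier of 10.0 is reduced to 5.5 and its influence fades over the next two samples.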

Gesture Detection
In the CSI data, both amplitude and phase information have the capability to be used for gesture detection [48]. Our gesture detection algorithm is based on the variance of amplitude and phase to characterize gestures from CSI filtered data. In this scheme, all subcarriers are aggregated to evaluate the variance. We normalized the variance and estimated the corresponding energy of CSI amplitude and phase over a specific time window.
Let $\nu_i$ indicate the variance of a single link over all the subcarriers belonging to packet index $i$. For $N$ received packets, the variance $\nu = \{\nu_1, \nu_2, \ldots, \nu_N\}$ obeys a Gaussian distribution with $\nu \sim \mathcal{N}(\mu, \sigma^2)$. We normalized $\nu$ to obtain the normalized amplitude and phase variances $V_A$ and $V_P$, respectively. Let $E_A$ and $E_P$ be the corresponding energies of the normalized amplitude ($V_A$) and phase ($V_P$), respectively. We estimate each energy over a time window of length $\tau$ and compare it with the threshold $\eta_{th}$. We set the threshold to $\eta_{th} = 100$ for gesture detection, based on our preliminary measurements under different conditions and scenarios. Corresponding to this threshold, a gesture is detected as follows: if the energy $E$ is greater than or equal to the threshold $\eta_{th}$, the system takes it as a gesture and records the maximum value of the normalized variance over this window; when the energy falls below the threshold, the window is considered a non-gesture.
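The thresholding rule can be sketched as follows. The exact energy expression is not recoverable from the text, so this example assumes the energy of a window is the sum of squared normalized variances; the window handling (non-overlapping windows), the function name and the toy values are likewise assumptions.

```python
def detect_gesture_windows(norm_variance, window, threshold):
    """Flag each non-overlapping window as gesture (True) or non-gesture (False).

    Energy is ASSUMED to be the sum of squared normalized variances in a window.
    Returns (is_gesture, peak) pairs, where peak is the window maximum.
    """
    results = []
    for start in range(0, len(norm_variance) - window + 1, window):
        segment = norm_variance[start:start + window]
        energy = sum(v * v for v in segment)
        results.append((energy >= threshold, max(segment)))
    return results

quiet = [0.1] * 10      # made-up low-variance segment
active = [4.0] * 10     # made-up high-variance segment
flags = detect_gesture_windows(quiet + active, window=10, threshold=100.0)
```

Only the second window exceeds the threshold, so it alone is marked as a gesture.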

Dimensionality Reduction
In CSI data, some subcarriers are very sensitive to noise but respond only weakly to gestures. From Figure 3, it is clear that the sensitivity of different subcarriers varies for the same gesture. Therefore, we use Principal Component Analysis (PCA) to reduce the dimensionality and eliminate such unpredictable subcarriers. PCA is commonly used to extract the most representative components and remove background noise [49]. In the PCA scheme, new variables, called principal components, are generated as linear combinations of the original data. These principal components organize the original variables in orthogonal form in order to eliminate redundant information.
In our experiment, PCA is applied to each packet received to extract $p$ principal components. Based on our observations, we select the second, third and fourth principal components, that is, $p = 3$, for both the amplitude and phase of the CSI signal. Finally, we obtain a matrix with $p \times N_a$ dimensions, where $N_a$ denotes the number of anomalous packets.
First, we normalized the CSI matrix $H$ and removed the static components. Let $H_n$ represent the normalized matrix after subtracting the average CSI value from each column of the CSI matrix $H$. We then calculated the corresponding correlation matrix as $H_n \times H_n^T$. After eigendecomposition of the correlation matrix, we obtain the eigenvectors $q = \{q_1, q_2, \ldots, q_i\}$ and, from them, the principal components $p = \{p_1, p_2, \ldots, p_i\}$. The $i$th principal component $p_i$ is obtained by projecting $H_n$ onto the $i$th eigenvector $q_i$.
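The PCA steps above can be sketched with NumPy as shown below. The sketch assumes the CSI matrix is stored with packets as rows and subcarriers as columns, so the subcarrier correlation matrix is formed as `centered.T @ centered`; with the transposed layout the product would be written the other way round. The function name and the random input are hypothetical stand-ins for real CSI data.

```python
import numpy as np

def pca_components(csi, keep=(1, 2, 3)):
    """Project CSI (packets x subcarriers) onto selected principal components.

    keep holds zero-based component ranks, so (1, 2, 3) means the second,
    third and fourth components.
    """
    centered = csi - csi.mean(axis=0)           # remove the static part per subcarrier
    corr = centered.T @ centered                # subcarrier-by-subcarrier matrix
    eigvals, eigvecs = np.linalg.eigh(corr)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # re-sort to descending variance
    q = eigvecs[:, order[list(keep)]]
    return (centered @ q).T                     # shape: len(keep) x packets

rng = np.random.default_rng(0)
csi = rng.normal(size=(200, 30))                # synthetic stand-in for CSI amplitudes
components = pca_components(csi)
```

The result has one row per retained component and one column per packet, matching the $p \times N_a$ layout described in the text.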

Feature Extraction
Let $H_c(s)$ be the first CSI packet detected by our algorithm and let $S_c$ denote the number of subcarriers on a single Tx-Rx antenna link. We can extract $N_c$ successive CSI packets for any activity profile.
Based on our preliminary experimental investigations and detailed analysis of the extracted CSI, we selected six statistical features, that is, mean, standard deviation, median absolute deviation, maximum value, 25th percentile and 75th percentile. Mathematically, these features are defined as [50]:
(1) Mean: The mean $\mu(j)$ is defined as the average CSI of all packets $H_c$ belonging to the $j$th subcarrier, written as:
$$\mu(j) = \frac{1}{N_c} \sum_{k=1}^{N_c} H_{cj}(k),$$
where $H_{cj}(k)$ is the CSI value of the $j$th subcarrier in the $k$th extracted packet, $N_c$ is the total number of successive CSI packets for the activity profile and $j \in [1, S_c]$.
(2) Standard Deviation: The standard deviation $\sigma(j)$ of the $j$th subcarrier is the square root of the variance. Assuming $j \in [1, S_c]$, $\sigma(j)$ can be expressed as:
$$\sigma(j) = \sqrt{\frac{1}{N_c} \sum_{k=1}^{N_c} \left( H_{cj}(k) - \mu(j) \right)^2}.$$
(3) Median Absolute Deviation: A robust way to quantify CSI variations for any activity segment is the Median Absolute Deviation (MAD). Mathematically, $\mathrm{MAD}(j)$ for the $j$th subcarrier is defined as:
$$\mathrm{MAD}(j) = \mathrm{median}\left( \left| H_{cj}(k) - \tilde{H}_{cj} \right| \right),$$
where $\tilde{H}_{cj}$ is the median of $H_{cj}$.
(4) Maximum Value: The maximum value is the highest of all values in the CSI data set. The maximum value $\mathrm{MAX}(j)$ of the $j$th subcarrier with $j \in [1, S_c]$ is calculated as:
$$\mathrm{MAX}(j) = \max_{k} H_{cj}(k).$$
(5) 25th Percentile: The 25th percentile $P_{25th}(j)$ for $j \in [1, S_c]$ is the value below which 25 percent of the distribution of $H_{cj}$ lies.
(6) 75th Percentile: Similarly, the 75th percentile $P_{75th}(j)$ is the value below which 75 percent of the distribution of $H_{cj}$ lies.
In order to differentiate between multiple gesture profiles, we integrate all features into a feature vector. We constitute a tuple $F$ of integrated features utilizing both magnitude and phase, defined as:
$$F = \{f_1, f_2, \ldots, f_{12}\},$$
where $f_i$ is a feature and $F$ stands for the dataset consisting of all gesture features. We get twelve features in total (six amplitude features and six phase features).
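The six features can be computed for one subcarrier with the Python standard library, as in the sketch below. Two details are assumptions rather than choices confirmed by the text: the standard deviation uses the population form (division by $N_c$), and the percentiles use the "inclusive" interpolation method of `statistics.quantiles`. The function name and input values are illustrative.

```python
import statistics

def subcarrier_features(values):
    """Six statistical features of one subcarrier's values across N_c packets."""
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),                    # population form (assumed)
        "mad": statistics.median([abs(v - med) for v in values]),
        "max": max(values),
        "p25": q1,
        "p75": q3,
    }

features = subcarrier_features([1.0, 2.0, 3.0, 4.0, 5.0])   # toy amplitudes
```

Applying the same function to the amplitude and to the phase of a subcarrier would give the twelve features per subcarrier described above.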

Classification Module
In the classification module, MNN and SRC algorithms were integrated. We started with traditional KNN, estimated the K nearest neighbors from all training samples of each class [26]. Thus, we calculated the mean of K nearest neighbors within each class.

K Nearest Neighbor
Assume our classification problem has $t$ classes. Let $F_i$ be the set of training samples for class index $i$, represented as:
$$F_i = \{f_{i1}, f_{i2}, \ldots, f_{iN_i}\},$$
where $N_i$ is the total number of training samples for class index $i$. Assume $f_{ir}$ is the prototype for the $i$th class. For any testing sample $y$, we find its nearest neighbor $f_{ir}$ among the training samples of each class. We adopt the squared Euclidean distance to measure the similarity between the nearest neighbor $f_{ir}$ and the testing sample $y$, defined as:
$$d(y, f_{ir}) = \| y - f_{ir} \|_2^2.$$
The Nearest Neighbor (NN) algorithm finds the nearest training sample based on this distance measure and assigns to the testing sample $y$ the class with the minimal distance. Mathematically,
$$\mathrm{class}(y) = \arg\min_{i} d(y, f_{ir}).$$
The KNN algorithm is the extension of 1-NN that takes the $K$ nearest neighbors from all the training data. In KNN, the class assignment of $y$ is based on the rule of majority voting. Assuming the $i$th class has $k_i$ samples among the $K$ nearest neighbors, the testing sample $y$ belongs to the class with the maximum number of nearest neighbors:
$$\mathrm{class}(y) = \arg\max_{i} k_i.$$
In general, the class of the testing sample is decided by majority voting among the $K$ nearest neighbors, that is, the training samples with the smallest distances to $y$.
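As an illustration of the majority-voting rule, the following sketch classifies a test vector with plain KNN and the squared Euclidean distance. The function names, the two-dimensional toy features and the labels "swipe" and "push" are made up for this example and do not correspond to the gestures in Table 1.

```python
from collections import Counter

def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def knn_classify(train, y, k):
    """train is a list of (feature_vector, label); returns the majority label."""
    nearest = sorted(train, key=lambda item: squared_distance(item[0], y))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [([0.0, 0.0], "swipe"), ([0.1, 0.2], "swipe"), ([0.2, 0.1], "swipe"),
         ([1.0, 1.0], "push"), ([0.9, 1.1], "push"), ([1.1, 0.9], "push")]
```

Calling `knn_classify(train, [0.15, 0.1], 3)` selects the three closest training samples and returns the label that most of them share.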

Mean of Nearest Neighbor (MNN)
We implemented the mean of nearest neighbors rule in our proposed algorithm because it may be a meaningful compromise between the nearest neighbor and minimum distance rules. Assume the $K_i$ nearest neighbors of the testing sample $y$ for class index $i$ are represented as $F_i = \{f_{i1}, f_{i2}, \ldots, f_{iK_i}\}$. The mean vector of nearest neighbors for the testing sample $y$ in class $i$ is defined as:
$$m_i = \frac{1}{K_i} \sum_{r=1}^{K_i} f_{ir}.$$
After estimating the mean vector per class, we can determine the class of $y$. For this purpose, we find the distance from the mean vector as:
$$d(y, m_i) = \| y - m_i \|_2^2.$$
Thus, the class is estimated as follows:
$$\mathrm{class}(y) = \arg\min_{i} d(y, m_i).$$
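The local-mean rule can be sketched in a few lines of Python. The sketch assumes the same number of neighbors $k$ is used for every class; the function names, class labels and toy data are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def local_mean(samples, y, k):
    """Mean vector of the k samples of one class that are closest to y."""
    nearest = sorted(samples, key=lambda f: squared_distance(f, y))[:k]
    return [sum(col) / len(nearest) for col in zip(*nearest)]

def mnn_classify(train_by_class, y, k):
    """train_by_class maps label -> list of feature vectors."""
    means = {label: local_mean(s, y, k) for label, s in train_by_class.items()}
    return min(means, key=lambda label: squared_distance(means[label], y))

train_by_class = {
    "class_a": [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0]],   # includes one far outlier
    "class_b": [[1.0, 1.0], [1.2, 1.0], [1.0, 1.2]],
}
```

Because only the nearest samples of each class enter the mean, the far outlier in `class_a` does not distort the prototype used for a test sample near the origin.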

Sparse Representation Based Classification (SRC)
In SRC, a testing sample is expressed as a sparse linear combination of all the training samples [22]. The sparse representation coefficients can be obtained by solving an $L_1$ optimization problem. We characterise the training samples for the $i$th class as $F_i = [f_{i1}, f_{i2}, \ldots, f_{iN_i}]$ and form the dictionary $F = \{F_1, F_2, \ldots, F_t\}$, in which the training samples of all activity classes are concatenated.
The testing sample $y$ is expressed as $y = Fv$, for which the sparsest solution is sought. Ideally, the coefficient vector $v$ has nonzero values only for the entries that are associated with class $i$. The sparse solution for the coefficient vector $v$ may be optimized with the $L_0$ norm constraint as:
$$\hat{v}_0 = \arg\min_{v} \| v \|_0 \quad \text{s.t.} \quad Fv = y.$$
Assuming the solution of the $L_0$ norm constraint is equivalent to that of the $L_1$ norm constraint, we obtain
$$\hat{v}_1 = \arg\min_{v} \| v \|_1 \quad \text{s.t.} \quad Fv = y.$$
After obtaining the optimal sparse solution $\hat{v}_1$, SRC computes a class-specific reconstruction residual.
Let $\delta_i(v)$ be the vector associated with the $i$th class that selects the non-zero entries corresponding to $v$. Based on the coefficients of the $i$th class, we can reconstruct the test sample $y$ as:
$$\hat{y}_i = F \, \delta_i(\hat{v}_1),$$
where $\hat{y}_i$ indicates the reconstructed testing sample. The residual of the reconstructed class can be obtained as:
$$r_i = \| y - \hat{y}_i \|_2.$$
The decision of class is based on the following principle:
$$\mathrm{class}(y) = \arg\min_{i} r_i.$$
In our proposed framework, we used the decision rule of SRC to supervise MNN, and sparse representation coefficients were computed from the mean vector of nearest neighbors.

MNN Induced SRC (MNN-SRC)
Our proposed MNN-SRC method is much faster than conventional SRC, because the number of mean vectors (one per class) is much smaller than the total number of training samples. In our proposed framework, the mean of the K nearest neighbors is computed within each class from the training samples. The testing sample is then represented in terms of these mean vectors: sparse representation based classification is applied on the mean vectors of nearest neighbors instead of all the training samples. The residual between the testing sample and the mean of nearest neighbors is estimated for each class, and the class of the testing sample is decided on this residual.
Suppose we have $K_i$ nearest neighbors $F_i = \{f_{i1}, f_{i2}, \ldots, f_{iK_i}\}$ belonging to the $i$th class, with $K = \sum_{i=1}^{t} K_i$. We find the nearest neighbors of the testing sample within each class and estimate their mean. The mean vector of nearest neighbors for the testing sample $y$ in class $i$ is defined as:
$$m_i = \frac{1}{K_i} \sum_{r=1}^{K_i} f_{ir}.$$
Assume $M = [m_1, m_2, \ldots, m_t]$ is a matrix comprising the mean vectors of each class. The sparse representation coefficients are calculated using $M$ to estimate $y$. Mathematically,
$$\hat{v} = \arg\min_{v} \| v \|_1 \quad \text{s.t.} \quad y = Mv.$$
In some situations, $y$ cannot be exactly equal to $Mv$ for any coefficient vector. To overcome this constraint, a Lagrange factor $\lambda$ is imposed as follows:
$$\hat{v} = \arg\min_{v} \| y - Mv \|_2^2 + \lambda \| v \|_1.$$
Let $\delta_i(v)$ be the vector associated with the $i$th class that selects the non-zero entries corresponding to $v$. Based on the coefficients of the $i$th class, we can reconstruct the test sample $y$ as:
$$\hat{y}_i = M \, \delta_i(\hat{v}),$$
where $\hat{y}_i$ indicates the reconstructed testing sample. The residual of the reconstructed class is obtained as:
$$r_i = \| y - \hat{y}_i \|_2.$$
The class $i$ has residual $r_i$, which determines the class of the testing sample. The decision for class designation is based on the following principle:
$$\mathrm{class}(y) = \arg\min_{i} r_i.$$
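The whole MNN-SRC pipeline can be sketched with NumPy as below. This is a minimal illustration, not the authors' implementation: the $L_1$-regularized problem is solved with ISTA (iterative soft thresholding), which is an assumed solver, the objective carries a conventional factor of 0.5 on the squared error (this only rescales $\lambda$), and the function names, the value of `lam`, the number of neighbors and the toy data are all hypothetical.

```python
import numpy as np

def local_means(train_by_class, y, k):
    """Columns are per-class means of the k training samples nearest to y."""
    cols = []
    for samples in train_by_class:                   # one (N_i x d) array per class
        dist = np.sum((samples - y) ** 2, axis=1)
        cols.append(samples[np.argsort(dist)[:k]].mean(axis=0))
    return np.stack(cols, axis=1)                    # shape: d x t

def sparse_code(M, y, lam=0.01, iters=500):
    """Minimise 0.5*||y - M v||_2^2 + lam*||v||_1 with ISTA (an assumed solver)."""
    step = 1.0 / np.linalg.norm(M, 2) ** 2           # 1 / Lipschitz constant
    v = np.zeros(M.shape[1])
    for _ in range(iters):
        z = v - step * (M.T @ (M @ v - y))           # gradient step on the squared error
        v = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)   # soft threshold
    return v

def mnn_src_classify(train_by_class, y, k=2, lam=0.01):
    """Return the index of the class with the smallest reconstruction residual."""
    M = local_means(train_by_class, y, k)
    v = sparse_code(M, y, lam)
    residuals = [np.linalg.norm(y - M[:, i] * v[i]) for i in range(M.shape[1])]
    return int(np.argmin(residuals))

# Two toy classes with three 3-dimensional training samples each.
class0 = np.array([[1.0, 0.0, 0.1], [0.9, 0.1, 0.0], [1.1, 0.0, 0.0]])
class1 = np.array([[0.0, 1.0, 0.1], [0.1, 0.9, 0.0], [0.0, 1.1, 0.0]])
```

Since $M$ has only one column per class, the sparse coding step works on a $d \times t$ dictionary instead of one with a column for every training sample, which is where the reduction in computation comes from.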

Experimentation and Evaluation
In this section, we will describe the experimental settings and evaluate the performance of our proposed framework.

Experimentation Settings
We conducted experiments using 802.11n enabled off-the-shelf WiFi devices. Specifically, we used a Lenovo laptop as the receiver, equipped with an Intel 5300 network interface card and running the Ubuntu 11.04 LTS operating system, to collect CSI data. The laptop connects to a commercial WiFi Access Point (AP), a TP-Link router operating at 2.4 GHz, which acts as the transmitter. The receiver pings the AP at a rate of 80 packets/s. The transmitter has a single antenna, whereas the receiver has three antennas, that is, $N_{Tx} = 1$ and $N_{Rx} = 3$ (1 × 3 MIMO system), generating 3 CSI streams of 30 subcarriers each. We ran the 802.11n CSI Tool [45] on the receiver to acquire and record CSI measurements on 30 subcarriers of a 20 MHz channel. The required signal processing was performed using MATLAB R2016a.
To evaluate the robustness of our proposed scheme, we chose the following three scenarios:
• Scenario-I (Indoor environment): In this scenario, all prescribed gestures are performed in an empty room of size 11 × 12 feet, while sitting on a chair between Tx and Rx, separated by a distance of 2 m.
For in-vehicle scenarios, we set up our testbed in a locally manufactured vehicle which was not equipped with pre-installed WiFi devices. Due to the unavailability of a WiFi access point in our test vehicle, we configured the commercial TP-Link router as the AP, placed on the dashboard in front of the driver. The receiver was placed at the co-pilot's seat to collect CSI data.
In each experiment, 16 possible human gestures, as shown in Table 1, were performed by five volunteers (two female and three male university students). Each volunteer repeated all gestures 20 times for each experiment, and a single gesture was performed within a window of 5 s. In total, the data set comprised 1600 samples (5 volunteers × 16 gestures × 20 repetitions) for each experiment, of which 50% were used for training and 50% for testing. In our experiments, the training data do not contain any samples from the testing data, and we keep the testing samples out for cross validation. Furthermore, we also tested the generalization of our model using an unbiased Leave-One-Participant-Out Cross-Validation (LOPO-CV) scheme.

Performance Evaluation
First, we tested the usefulness of our extracted features, that is, mean ($\mu$), standard deviation ($\sigma$), median absolute deviation (MAD), maximum value (MAX), 25th percentile ($P_{25th}$) and 75th percentile ($P_{75th}$). Table 2 presents some prominent values of each calculated feature extracted for different gestures. One can notice that all features are distinctively different, indicating that these features can support high recognition accuracy. The recognition performance of the proposed method was observed by conducting extensive experiments. For simplicity, we use abbreviated terms for our proposed prototype, that is, MNN integrated with SRC as MNN-SRC, and similarly KNN with SRC as KNN-SRC.
We particularly selected a confusion matrix and recognition accuracy as metrics for performance evaluation. The occurrence of the actual gesture performed is represented by the column of the confusion matrix, whereas the occurrence of the gesture classified was represented by the rows. The confusion matrix in Figure 4 reveals the fact that our proposed method can recognize sixteen different gestures very accurately with an average accuracy of 91.4%, 90.6% and 88.7% for scenarios I, II and III respectively.  To ensure the reliability and efficacy of proposed framework, we analyzed the results by adopting different evaluation metrics including precision, recall and F 1 -score. These evaluation metrics are presented as: 1. Precision is defined as positive predictive value, mathematically described as: where TP and FP are true positive and false positive respectively. True Positive (TP) is the probability that a model correctly predicts the positive class. Whereas, False Positive (FP) is the probability of that a model incorrectly predicts the positive class.
2. Recall is defined as the True Positive Rate (TPR) and measures the sensitivity of the system:

$$\mathrm{Recall} = TPR = \frac{TP}{TP + FN}$$

where a false negative (FN) is a case in which the model incorrectly predicts the negative class. We also evaluate our proposed method with the False Negative Rate (FNR), defined as:

$$FNR = \frac{FN}{TP + FN}$$

3. The F-measure or $F_1$-score is defined as the harmonic mean of precision and recall:

$$F_1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

Figure 5 shows the precision, recall and $F_1$-score for all three scenarios using our proposed scheme. Table 3 summarizes the performance of our MNN-SRC algorithm in terms of TPR and FNR for each gesture. In general, the MNN-SRC algorithm has an average TPR of over 88.9%, with an average FNR of less than 11.1%, for all three scenarios. To determine the efficacy of the integrated classification algorithm, we also performed experiments with each classifier separately; the results are presented in Figure 6. As can be seen, the average recognition accuracy of the stand-alone MNN (Mean Nearest Neighbor) or SRC (Sparse Representation based Classification) method is lower than that of the integrated MNN-SRC algorithm.
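The per-gesture precision, recall (TPR), FNR and $F_1$-score defined above can be derived directly from a confusion matrix. The sketch below follows the convention stated in the text, with columns for the gesture actually performed and rows for the predicted gesture. The function name `per_class_metrics` and the 3-class toy matrix are hypothetical and are not taken from the paper's results; the code also assumes that every class is predicted at least once, so that no denominator is zero.

```python
import numpy as np


def per_class_metrics(cm):
    """Per-class metrics from a confusion matrix.

    Convention (as in the text): cm[i, j] counts samples whose
    predicted class is i (row) and whose actual class is j (column).
    Hypothetical helper for illustration only.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=1) - tp  # predicted as class k, but actually another class
    fn = cm.sum(axis=0) - tp  # actually class k, but predicted as another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)   # true positive rate (TPR)
    fnr = fn / (tp + fn)      # false negative rate, equal to 1 - TPR
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    return precision, recall, fnr, f1, accuracy


# Toy 3-class matrix with 20 actual samples per class (not real data).
toy_cm = [
    [18, 1, 0],
    [2, 19, 1],
    [0, 0, 19],
]
precision, recall, fnr, f1, accuracy = per_class_metrics(toy_cm)
print("precision:", precision)
print("recall   :", recall)
print("FNR      :", fnr)
print("F1       :", f1)
print("accuracy :", accuracy)
```

Because FNR equals 1 − TPR for each class, an average TPR of 88.9% corresponds to an average FNR of 11.1%.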
We compared the performance of MNN-SRC with that of KNN-SRC, as shown in Figure 7. The average recognition accuracy of the MNN-SRC algorithm is higher than that of KNN-SRC. The overall performance comparison with selected state-of-the-art classifiers is given in Table 4. From the experimental results, it can be concluded that our proposed MNN-SRC method outperforms KNN-SRC and the stand-alone conventional classification methods, including MNN, KNN, SRC, SVM and NB. From Figure 9, we observe that the proposed system achieves reasonable performance even when using only the amplitude or only the phase information. However, the recognition accuracy is significantly better when the amplitude and phase information are combined. To study the impact of the number of nearest neighbors, we performed experiments with varying K values. Figure 10 shows that MNN-SRC outperforms KNN-SRC for almost all K values. The optimal values K = 20 and K = 5 were used for MNN and KNN, respectively, throughout our experiments. Detailed results of MNN-SRC with different K values are given in Table 5.
We further evaluated the computational cost by measuring the execution time, as illustrated in Figure 11. The computational cost of our proposed MNN-SRC algorithm is much lower, with an execution time of 141.5 ms, compared with 761.8 ms for SRC alone. Furthermore, the execution time of MNN-SRC is lower than that of KNN-SRC. This is because we use the mean of the nearest neighbors, for which the sparse coefficients take much less time to calculate than with traditional KNN. Although the execution time of MNN-SRC is slightly higher than that of MNN alone, the recognition accuracy of MNN-SRC is better in exchange. The execution time of KNN is 130.7 ms, which is again higher than that of MNN.

To assess the practical performance of our proposed framework, we performed a user independence test. We adopted the Leave-One-Participant-Out Cross-Validation (LOPO-CV) scheme, in which the training data contain no samples from the test user. LOPO-CV is an effective technique for evaluating how well results generalize to unseen data [51]. In this experiment, the data of one particular person are selected as the test set and all remaining data are used as the training set. This process is repeated for each person. Figure 12 shows that our proposed method retains acceptable performance under the LOPO-CV scheme, with a recognition accuracy of 84.7%, 82.8% and 82.1% for scenarios I, II and III respectively. It can be concluded that our presented model is capable of generalizing to new users. The unbiased LOPO-CV estimator is computationally expensive. Nevertheless, acquiring a larger amount of training data from a wider variety of users is suggested to obtain better results.
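The LOPO-CV procedure described above can be sketched as follows. All data of one participant form the test set, the remaining participants form the training set, and the process is repeated for every participant. The helper names, the synthetic data and the nearest-centroid classifier are hypothetical stand-ins used only to make the sketch runnable; the classifier is not the MNN-SRC algorithm, and the printed accuracies say nothing about the results in Figure 12. The array sizes mirror the data set described earlier (5 participants × 16 gestures × 20 repetitions), and the six feature dimensions are an assumption.

```python
import numpy as np


def lopo_splits(participant_ids):
    """Yield (participant, train_idx, test_idx) for each held-out participant."""
    ids = np.asarray(participant_ids)
    for pid in np.unique(ids):
        test_mask = ids == pid
        yield int(pid), np.where(~test_mask)[0], np.where(test_mask)[0]


def nearest_centroid_predict(X_train, y_train, X_test):
    """Stand-in classifier (NOT MNN-SRC): assign the class of the closest centroid."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]


def lopo_accuracy(X, y, participant_ids):
    """Return the recognition accuracy obtained for each held-out participant."""
    scores = {}
    for pid, train_idx, test_idx in lopo_splits(participant_ids):
        pred = nearest_centroid_predict(X[train_idx], y[train_idx], X[test_idx])
        scores[pid] = float(np.mean(pred == y[test_idx]))
    return scores


# Synthetic, well-separated data with the same layout as the data set:
# 5 participants x 16 gestures x 20 repetitions = 1600 samples.
n_participants, n_gestures, n_reps, n_features = 5, 16, 20, 6
rng = np.random.default_rng(0)
y = np.tile(np.repeat(np.arange(n_gestures), n_reps), n_participants)
participants = np.repeat(np.arange(n_participants), n_gestures * n_reps)
centers = 5.0 * np.arange(n_gestures)[:, None] * np.ones((1, n_features))
X = centers[y] + rng.normal(scale=0.1, size=(y.size, n_features))

scores = lopo_accuracy(X, y, participants)
for pid, acc in scores.items():
    print(f"held-out participant {pid}: accuracy = {acc:.3f}")
print("mean LOPO-CV accuracy:", np.mean(list(scores.values())))
```

Each fold here trains on 1280 samples and tests on the 320 samples of the held-out participant, so the classifier is retrained once per participant.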

Discussion
In this section, we discuss the experimental results and their limitations. All gestures are classified with very good recognition performance using our MNN-SRC classification algorithm; however, several factors may influence the accuracy. In particular, some gestures closely resemble each other. For example, flick closely resembles grab and push hand forward. Similarly, push hand backward closely resembles grab and push hand forward, which degrades the recognition accuracy. Although such similarities degrade the system performance, the overall performance of MNN-SRC is still better than that of the other algorithms.
Although CSI-based systems can achieve reasonable performance, some limitations remain. First, CSI measurements are highly sensitive to moving objects. As a result, the recognition accuracy may degrade when other vehicles move within the testing area. Moreover, the system is designed for a single person, that is, the driver. In real vehicle scenarios, however, more than one person may be present, which makes recognition more complex and can degrade system performance accordingly. In general, other vehicles on the road and people outside the vehicle may have only a very slight influence [42]. Additional signal processing may overcome these issues, which we will consider in future work.
Despite these limitations, our CSI-based device-free driver gesture recognition system is more scalable and easier to deploy than other models. It should be noted that our proposed classification method is a general solution that can be applied to other device-free localization and activity or gesture recognition problems. In this paper, we have applied this method to in-vehicle driver gesture recognition.

Conclusions
In this paper, we have presented a novel framework for robust device-free driver gesture recognition. It can be concluded that the recognition rate is significantly improved by leveraging an integrated classification algorithm. Experimental results show that the framework based on the mean of nearest neighbors and sparse representation coefficients achieves strong performance in terms of both gesture recognition accuracy and execution time. Our proposed integrated classifier is a promising algorithm for driver gesture recognition in vehicular infotainment systems.
This integrated classification approach opens a new direction for a diverse range of potential applications. Several aspects still need to be considered. In the future, we intend to study more complex driving scenarios based on the findings presented in this paper.

Conflicts of Interest:
The authors declare no conflict of interest.