Driver Distraction Recognition Using Wearable IMU Sensor Data

Distracted driving has become a major cause of road traffic accidents. There are generally four different types of distractions: manual, visual, auditory, and cognitive. Manual distractions are the most common. Previous studies have used physiological indicators, vehicle behavior parameters, or machine-visual features to support research. However, these technologies are not suitable for an in-vehicle environment. To address this need, this study examined a non-intrusive method for detecting in-transit manual distractions. Wrist kinematics data from 20 drivers were collected using wearable inertial measurement units (IMU) to detect four common gestures made while driving: dialing a hand-held cellular phone, adjusting the audio or climate controls, reaching for an object in the back seat, and maneuvering the steering wheel to stay in the lane. The study proposed a progressive classification model for gesture recognition, including two major time-based sequencing components and a Hidden Markov Model (HMM). Results show that the accuracy for detecting disturbances was 95.52%. The accuracy associated with recognizing manual distractions reached 96.63%, using the proposed model. The overall model has the advantages of being sensitive to perceptions of motion, effectively solving the problem of a fall-off in recognition performance due to excessive disturbances in motion samples.


Introduction
Driver distraction and inattention are common driving behaviors that have become a growing public safety hazard. The National Highway Traffic Safety Administration (NHTSA) (2020) [1] found that 2841 people in the United States were killed because of distracted driving in 2018. More significantly, results from the 100-Car Naturalistic Driving Study [2] reported that almost 78% of crashes, and 65% of near-crashes, involved driver inattention. Driver inattention is primarily caused by secondary tasks that compete for mental and physical resources with the primary driving task [3]. The secondary task can take different forms, including visual distractions (e.g., looking away from the roadway), auditory distractions (e.g., listening to music), manual distractions (e.g., adjusting the radio volume), and cognitive distractions (e.g., becoming lost in thought) [4][5][6]. In recent years, drivers have started to multitask more often, given the extensive availability of in-vehicle intelligent systems, such as smartphones and navigation devices. These tools have made driver distraction and inattention much more common in everyday driving. In most kinds of distractions, a driver lets go of the steering wheel or keeps only one hand on it to complete other tasks. This can lead the driver to look away from the roadway or to experience lapses in attention and judgment. These effects reduce the driver's ability to access and process the critical information required for safe driving, making the driver prone to accidents. Therefore, effectively detecting driver distractions is important for avoiding potentially dangerous situations and reducing the number of traffic accidents. Many studies have shown that traffic safety messages can help improve a driver's driving performance, traffic mobility, and transportation efficiency [7][8][9][10][11].
Many distracted driver detection methods have been proposed, by both academia and industry. The dominant information has come from different source categories, including physiological indicators, vehicle behavior parameters from controller area network (CAN), and machine-visual features.
Most driver distraction detection systems have used physiological data, such as skin conductance (Mehler et al. [12], Baldauf et al. [13]), heart rate variability (Miyaji et al. [14]), or an electromyogram (Lawrence et al. [15]). A few studies have also used machine-visual features to detect driver distractions. For example, Zhang et al. [16] proposed using the Hidden Conditional Random Fields (HCRF) model to detect cellular phone-induced driver distraction; the method assesses the driver's face, mouth, and hand from images captured by a dashboard-mounted color camera. Hoang et al. [17] used a Faster-RCNN (Region Convolutional Neural Networks) approach to identify distracted states by detecting the presence of the driver's hands on or off the steering wheel and detecting cellular phone usage. Huang et al. [18] used a Fisher linear classifier and Back Propagation (BP) neural network to identify distracted states based on the driver's mouth. Harbluk et al. [19] proposed a driver inattention monitoring method based on eye movements and visual scanning, to analyze the changes in visual behavior. The Saab Driver Attention Warning System [20] uses the driver's eyeball and head movement information to provide an alert when it detects a driver's inattentive state. Kutila et al. [21] described a machine vision-based detection system, using stereo vision and lane tracking data to monitor driver distraction through rule-based and support vector machine (SVM) classification methods.
Other studies have used vehicle behavior parameters to identify whether a driver is distracted. For example, Yang et al. [22] proposed a Gaussian Mixture Model (GMM) using vehicle information collected using GPS to detect driver distractions. Wollmer et al. [23] used lane departure data and vehicle parameters as inputs to the Long Short-Term Memory (LSTM) model for detecting driver distraction. Jin et al. [24] applied driving performance measures from a CAN as inputs to SVM classifiers to identify a distracted driver. Torkkola et al. [25] constructed a distraction classifier with a Random Forest (RF), based on steering angle, accelerator pedal position, lane boundaries, and the upcoming road curvature. Aksjonov et al. [26] developed a system to detect and evaluate driver distractions, using driver performance information and analyzing the data with an effective fuzzy logic algorithm and machine learning algorithm.
However, these monitoring methods have some inherent limitations and are not suitable for an in-vehicle environment. Physiological-based methods require attaching special devices (e.g., electrodes) to the user's skin or eye. These are intrusive and interfere with drivers. Machine vision-based methods and vehicle behaviors-based methods are non-intrusive, but also have some limitations. Machine vision-based methods are affected by lighting variations in real-world applications and are costly. Vehicle behavior-based methods also have constraints, including the vehicle type and status, driver experiences, and driving conditions. This makes it difficult to implement these monitoring methods in conventional vehicles. As a result, there remains an unmet need to implement distracted driver detection algorithms that can be used for real-time traffic monitoring.
To respond to this need, smart wearable technologies [27], wearable sensor systems (WSS), wearable accelerometers, and inertial measurement units (IMUs) have been developed and are growing in popularity in other domains. For example, Hu et al. [28] used Dynamic Time Warping (DTW) to analyze the body kinematic data obtained from wearable IMUs. The device was able to classify a series of basketball activities, including shooting, passing, dribbling, and lay-ups. Yan et al. [29] developed a warning system for detecting dangerous operational patterns; the approach automatically assesses risky postures using motion data captured by wearable IMUs. Nouredanesh et al. [30] used machine learning-based models to detect a risk of falling, based on data obtained from the IMU and surface electromyography (sEMG) features. Borowska-Terka et al. [31] developed an SVM model using an IMU around the forehead to distinguish different head gestures (e.g., yaw, pitch, roll, and immobility). Dalmazzo et al. [32] used inertial motion information from the right forearm to automatically classify different violin bow gestures using a Hierarchical Hidden Markov Model (HHMM).
Previous IMU applications indicate they can automatically capture gesture data in a precise and reliable way. A wearable IMU can help assess potential driving hazards using captured gesture data without disrupting driver manual operations because of their portability. Thus, IMU sensors have significant potential for use in the traffic safety domain. This makes them attractive for research about task-based distractions while driving, which is an understudied area [33].
One of the more commonly used algorithms for motion sensor-based gesture recognition is the Hidden Markov Model (HMM). The HMM provides a high level of recognition accuracy because it determines the recognition features through machine learning training. However, because the HMM involves a large number of state transition matrices and observation matrices, the computation is quite complex. As gesture complexity increases, the number of parameters grows exponentially. Further, a real-world driving environment produces many disturbance (interference) sequences, which would lead to serious algorithm lag and degraded recognition performance, increase the computational cost, and weaken the durability of the hardware. Consequently, it would be difficult to recognize gestures correctly if only a single HMM were used directly.
To address the issues above, a progressive classification model for gesture recognition was developed, including two major time-based sequencing components and a Hidden Markov Model. The two major time-based sequencing components can be called the combined two-class subsequence matching method. The two-class subsequence matching method combined a time-domain analysis with a Dynamic Time Warping (DTW) algorithm to reduce excessive disturbance samples. The most likely probabilistic state transition sequence was calculated using a Baum-Welch algorithm, designed to recognize the motion with the highest probability of occurring. Compared to the single HMM method, this method theoretically accounts for the time-dependent characteristics of IMU sensor data. The approach also reduces the sequence recognition scale of the HMM to maintain or improve the recognition accuracy.
Thus, this study proposes a task-specific detection and type identification method during driving, using the following elements: (1) wrist kinematics data generated by wearable IMUs, (2) a progressive classification model to detect excessive disturbance samples, and (3) establishing the HMM to recognize task-specific activities during effective driving.

Participants
Twenty experienced Chinese drivers (14 males, 6 females), ages 21-35 (mean = 25.1, SD = 3.51), were recruited to participate in the study. All the drivers held a valid driving license, with 2 to 15 years of driving experience (mean = 6.2, SD = 3.64). All participants reported having no sleep disorders, and their corrected visual acuity was normal. All were medically evaluated prior to the study, were in good physical condition, and were not on medication.
On the day of the experiment, the participants were asked to not drink alcohol and to restrict tea and caffeine consumption. Prior to participating in the study, each subject signed a written informed consent in accordance with Chinese laws on scientific research using healthy human volunteers. Each participant was paid to be in the study.

Apparatus
Sustainability 2021, 13, 1342
We designed the hardware platform based on a wearable inertial measurement unit (IMU) (BWT901CL Module, Wit Intelligence Technology Co., Shenzhen, CHN). The IMU includes a triaxial accelerometer, a triaxial gyroscope, and a triaxial magnetometer. Real-time sensor output was transferred to the attached application for data processing, and thirteen types of kinematic measures were generated from the raw data. The output frequency is adjustable from 0.1 Hz to 200 Hz. The platform's output frequency was set to 10 Hz. The platform has a size of 55 × 40 × 20 mm³ and can accurately measure the following dimensions: 3-dimensional acceleration (measuring range: ±16 g, precision: ±0.01 g), 3-dimensional angular velocity (measuring range: ±2000°/s, precision: ±0.05°/s), 3-dimensional angle (measuring range: ±180°), and four-dimensional quaternions. Quaternion algebra, developed by Hamilton, is a mathematical number system that can provide an alternative representation of three-dimensional space. In this paper, we also measured driver gesture data in quaternion format. The output information of the hardware platform is shown in Table 1.

Table 1. Output information of the hardware platform.
| Abbreviations | Definitions |
| --- | --- |
| $acc_x$, $acc_y$, $acc_z$ | x-, y-, z-axis accelerations (g) |
| $omg_x$, $omg_y$, $omg_z$ | x-, y-, z-axis angular velocities (°/s) |
| $ang_\alpha$, $ang_\beta$, $ang_\gamma$ | Euler angles (°) |
| $q_0$, $q_1$, $q_2$, $q_3$ | quaternion algebra |

The actual driving platform is the New Elantra EV, with a cab structure designed to have the driver's seat on the left. The driver's behavior and the surrounding environment were monitored and recorded in real-time using a CMOS camera (ZED2). Figure 1 shows the vehicle and experimental apparatus.

Driving Gesture Tasks
In this study, we instructed participants to engage in four common types of gestures that occur during driving. There were three types of manual distraction (dialing a hand-held cellular phone, adjusting the audio or climate controls, and reaching for an object in the back seat) and one regular driving motion (maneuvering the steering wheel to stay in the lane). All these activities are large amplitude wrist-movement tasks. The wrist-movement tasks were prescribed as follows:
Task 1-Dialing a hand-held cellular phone: The participants followed research assistant instructions to dial the hand-held cellular phone during the drive. The contents of these conversations were prepared in advance, consistent with real life.
Task 2-Adjusting audio or climate controls: All participants received information about the audio and climate controls and engaged in trials to become familiar with the controls before starting the actual driving experiments. During the experimental drive, participants followed research assistant instructions to perform a manual function, such as changing the song or adjusting the temperature.
Task 3-Reaching for an object in the back seat: The participants followed research assistant instructions to reach for an object in the back seat. The object was positioned to make it easy for the participant to perform this task only once.
Task 4-Steering wheel maneuvering to stay in the lane: Participants drove through the multi-curve road scenarios without any distractions, performing only steering wheel maneuvering related activities to stay in the lane.

Experimental Design
The driving experiments for this study were conducted at an unoccupied test site measuring 200 m × 300 m. Participants wore the IMU-based wearable platform on the right wrist to capture wrist movements. After training to become familiar with the car, each participant drove the same designated routes seven times: five for exploratory experiments and two for verification experiments. Each wrist-movement task was performed two or three times by each participant during each experiment, and the four movements were performed in random order. During the five exploratory experiments, research assistants tagged the sample types online and immediately saved the data after each wrist-movement task for later analysis and training. During the two verification experiments, data for the complete sequence were acquired in combination with the motion images captured by the CMOS camera for later validation.

Data Preprocessing
The IMU raw data were processed with a sensor-fusion algorithm using an Extended Kalman Filter (EKF) approach. Thirteen types of kinematic measures were obtained from the raw data, forming the following kinematic parameter matrix:
$$Y = \left[acc_x, acc_y, acc_z, omg_x, omg_y, omg_z, ang_\alpha, ang_\beta, ang_\gamma, q_0, q_1, q_2, q_3\right]^T$$
The preprocessing method for the kinematic parameter matrix above was divided into two steps: (1) zero-mean normalization, and (2) baseline subtraction. First, the data used to train the model were normalized to reduce the effect of individual differences. This reduced errors in similarity comparisons between samples. Data entered into the model for training or prediction were also normalized. The standardized value (z-score) of each variable for each participant was calculated using the mean and standard deviation for each variable, as shown in Equation (1):
$$Z_i = \frac{X_i - \mu}{\sigma} \tag{1}$$
where $Z_i$ is the standardized value; $X_i$ is the raw input variable; $\mu$ is the mean of the variable; and $\sigma$ is the standard deviation of the variable.
To eliminate the baseline offset variations in the gesture sequences, a baseline subtraction was performed using the analysis function in ORIGIN (Version 2018, OriginLab Co., Northampton, MA, USA). The line connecting the beginning and end of the gesture sequence was selected as the baseline. Figure 2 shows an example of the $q_0$ time series (adjusting audio or climate controls) before and after the baseline subtraction. This preprocessing method was then applied to all the other kinematic parameters from the same experiment.
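The two preprocessing steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `zscore` and `subtract_linear_baseline` and the sample values are hypothetical, the population form of the standard deviation is assumed (the text does not say which form was used), and the straight-line baseline stands in for the ORIGIN analysis function used in the study.

```python
import math

def zscore(values):
    """Zero-mean normalization of one kinematic variable (Equation (1))."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: population standard deviation (divide by n).
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0] * n
    return [(v - mean) / std for v in values]

def subtract_linear_baseline(values):
    """Subtract the straight line joining the first and last samples."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    slope = (values[-1] - values[0]) / (n - 1)
    return [v - (values[0] + slope * k) for k, v in enumerate(values)]

# Hypothetical q0 samples for one gesture, recorded at 10 Hz.
q0 = [0.90, 0.92, 0.70, 0.45, 0.60, 0.95, 1.00]
normalized = zscore(q0)
corrected = subtract_linear_baseline(normalized)
```

After both steps the first and last samples of `corrected` sit on the baseline (zero), so sequences recorded with different offsets become directly comparable.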

Quaternion-Based Gesture Subsequence Matching
To capture human motion and facilitate the analysis of motion trajectories, quaternions are often used to calculate the rotation angle of three-dimensional objects. Compared to matrices, quaternions are more efficient and take up less storage space. A quaternion consists of both real and imaginary parts, where the imaginary part is usually three-dimensional. As such, the quaternion is essentially an expression vector in a four-dimensional space. A quaternion is defined as:
$$q = q_0 + q_1 i + q_2 j + q_3 k \tag{2}$$
where $q_0$ is the real part of the quaternion; $q_1, q_2, q_3$ are the imaginary parts of the quaternion; $i, j, k$ are the standard orthogonal basis in three-dimensional space; and $q_1 i + q_2 j + q_3 k$ is a vector representing the axis of rotation in three-dimensional space. The quaternion can be constructed using the rotation axis of the carrier and the angle of rotation around that axis. This involves establishing a vector in the arbitrary reference system O-XYZ by rotating by an angle $\alpha$ around the rotation axis $n$. The rotation axis $n$ is:
$$n = \cos(\beta_x)\, i + \cos(\beta_y)\, j + \cos(\beta_z)\, k \tag{3}$$
The quaternion is shown in Equation (4):
$$q = \cos\frac{\alpha}{2} + \sin\frac{\alpha}{2}\left(\cos(\beta_x)\, i + \cos(\beta_y)\, j + \cos(\beta_z)\, k\right) \tag{4}$$
where $\alpha$ is the angle of rotation around the rotation axis; and $\cos(\beta_x)$, $\cos(\beta_y)$, $\cos(\beta_z)$ are the components (direction cosines) that locate the rotation axis in the x, y, z directions. Due to variability in driver performance, there are differences in the amplitude along the longitudinal axis of the time series. Equation (4) shows that $q_1, q_2, q_3$ change as $q_0$ changes. Therefore, $q_0$ was selected as the sequence matching parameter for the proposed progressive classification model.
Due to the nature of the quaternion, there are two different ways to represent an arbitrary angular displacement in three-dimensional space. Specifically, the quaternions $q$ and $-q$ represent the same angular displacement and have the same geometric significance. Therefore, the unit quaternions must be geometrically transformed during data processing to unify the quaternion signs of the same group of gestures. Figure 3 shows two measured sequences, $q_{01}$ and $q_{02}$, together with the transformed version of $q_{02}$.
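As an illustration of the axis-angle construction and of the sign ambiguity between $q$ and $-q$, the following Python sketch may help. The function names are hypothetical, and the sign rule used here (flip a quaternion when its dot product with a reference quaternion is negative) is an assumption made for the example; the text does not specify which transformation was applied in the study.

```python
import math

def quaternion_from_axis_angle(axis, alpha):
    """Unit quaternion (q0, q1, q2, q3) for a rotation of alpha radians about axis."""
    norm = math.sqrt(sum(c * c for c in axis))
    nx, ny, nz = (c / norm for c in axis)  # direction cosines of the axis
    s = math.sin(alpha / 2.0)
    return (math.cos(alpha / 2.0), s * nx, s * ny, s * nz)

def unify_signs(sequence, reference):
    """Flip every quaternion that lies in the opposite hemisphere to reference.

    Assumption: q and -q encode the same rotation, so the representative
    with a non-negative dot product against the reference is kept.
    """
    unified = []
    for q in sequence:
        dot = sum(a * b for a, b in zip(q, reference))
        unified.append(tuple(-c for c in q) if dot < 0 else q)
    return unified

# A 90-degree rotation about the z-axis and its sign-flipped twin.
q = quaternion_from_axis_angle((0.0, 0.0, 1.0), math.pi / 2)
q_flipped = tuple(-c for c in q)
unified = unify_signs([q, q_flipped], reference=q)
```

Both entries of `unified` are the same quaternion, which is the property needed before the $q_0$ component of different recordings can be compared.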

Template Creation
The purpose of a motion-based template is to increase the rate of recognition with respect to a specific action. We created different class templates for each action, with the data generated during the training stage. The class templates were constructed based on the selected kinematic features; that is, the values of the quaternion component (i.e., $q_0$) in each time series of the training samples for each action. Each template consisted of relevant driving gesture categories, representing the unique characteristics of each driving task. When creating the class template, the data from each experiment were different in duration and timing, due to the intra-individual and inter-individual differences.
To address this problem of individual differences, the kinematic data from the training set were corrected using peak detection and an alignment algorithm. The location index of each peak of the selected kinematic measures was automatically detected using the peak-detection toolbox from MATLAB (Version 2020A, MathWorks Co., Natick, MA, USA). The selected kinematic measures were aligned along their time axes using the detected peak.
Afterwards, the aligned kinematic sequences were trimmed, to ensure that all sequences were of the same data length. For this purpose, the aligned peak in the data was considered to be the center of the sequence. The trimming started at the point where a kinematic sequence had the least number of data points, from the left side of the sequence to its peak. The trimming ended at the point where a kinematic sequence had the least number of data points from the peak to the right side of sequences. Finally, the average length of the trimmed sequence was calculated and served as the class template. The final class template is presented in Figure 4.
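The alignment and trimming procedure can be sketched as follows. This is a simplified illustration, not the study's MATLAB implementation: `build_template` is a hypothetical name, the highest sample of each sequence is used as the alignment peak, and the template is assumed to be the point-by-point mean of the trimmed sequences.

```python
def build_template(sequences):
    """Align sequences on their highest peak, trim to a common window, average."""
    # Index of the highest peak in every training sequence.
    peaks = [s.index(max(s)) for s in sequences]
    # Shortest run of samples before the peak and after the peak.
    left = min(peaks)
    right = min(len(s) - p - 1 for s, p in zip(sequences, peaks))
    # Trim every sequence so that the peak sits at position `left`.
    trimmed = [s[p - left: p + right + 1] for s, p in zip(sequences, peaks)]
    length = left + right + 1
    # Assumption: the class template is the pointwise mean of the trimmed set.
    return [sum(t[k] for t in trimmed) / len(trimmed) for k in range(length)]

# Three hypothetical q0 training sequences of different lengths.
training = [
    [0.0, 1.0, 5.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 6.0, 1.0],
    [2.0, 4.0, 2.0, 0.0],
]
template = build_template(training)
```

With these inputs every sequence is cut to one sample on each side of its peak, so the resulting template has three points with the averaged peak in the middle.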

Subsequence Primary Matching Method Based on Time Domain Features
The combined two-class subsequence matching method was proposed as a pre-classifier in this paper. This included a subsequence primary matching method based on time domain features and a subsequence secondary matching method based on dynamic time warping. Extracting the features related to each driving gesture movement was crucial to implement the first step of the proposed progressive gesture classification model. The disturbance sequences represent sequences that cannot be defined due to unknown wrist movements. A gesture pre-classifier, which distinguishes between task-specific gesture sequences and disturbance sequences in the full sequence set, can effectively reduce the number of disturbance samples and improve the accuracy of gesture recognition. This is done by selecting distinguishing and representative features based on wrist-movement characteristics. The time domain features of each wrist movement can reflect the differences between gestures; as such, the subsequence primary matching method was proposed in this paper. We extracted three time-domain features (i.e., number of peaks, number of troughs, and the order of occurrence of peaks and troughs) to design a primary classification covering four classes of wrist movements. The location and index of each peak of the selected kinematic measures was automatically detected using the peak-detection toolbox from MATLAB (Version 2020A, MathWorks Co., Natick, MA, USA). We then identified a primary subsequence set $q_{0i} = [q_{01}, q_{02}, \cdots, q_{0n}]$ from the full sequence set $Y_i$ $(1 \le i \le N)$ that matched the time domain features of $q_0$.
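The three time-domain features can be illustrated with a short Python sketch. It is a simplification under stated assumptions: `extrema_features` and `matches_template` are hypothetical names, peaks and troughs are taken as strict local extrema (plateaus and noise handling, which a peak-detection toolbox would cover, are ignored), and a candidate passes the primary match only when all three features equal those of the template.

```python
def extrema_features(seq):
    """Return (number of peaks, number of troughs, order of occurrence).

    The order is a string such as 'PT' (a peak followed by a trough).
    """
    order = []
    for k in range(1, len(seq) - 1):
        if seq[k] > seq[k - 1] and seq[k] > seq[k + 1]:
            order.append("P")   # strict local maximum
        elif seq[k] < seq[k - 1] and seq[k] < seq[k + 1]:
            order.append("T")   # strict local minimum
    order = "".join(order)
    return order.count("P"), order.count("T"), order

def matches_template(candidate, template):
    """Primary match: identical peak count, trough count, and ordering."""
    return extrema_features(candidate) == extrema_features(template)

# Hypothetical q0 class template and two candidate subsequences.
template = [0.0, 0.4, 1.0, 0.3, -0.5, -0.1, 0.0]
candidate_a = [0.1, 0.9, 0.2, -0.6, 0.0]   # peak then trough
candidate_b = [0.0, -0.3, 0.5, 0.1]        # trough then peak
```

Here `candidate_a` would be kept for the secondary DTW stage, while `candidate_b` would be discarded as a disturbance because its peak and trough occur in the opposite order.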

Subsequence Secondary Matching Method Based on Dynamic Time Warping
The selected kinematic measures (i.e., $q_0$) from the testing data were compared with each class template using DTW, which generated a measurement known as an overall "distance". For each wrist-movement task, the lower bound threshold value of the DTW distance was statistically determined. The value was then used to determine the validity of the sequence for subsequent identification and to eliminate invalid data. Thus, the DTW was used to calculate the "distance" between the kinematic measures of the "unknown" driving task and the corresponding class templates of the different target driving gestures.
The time complexity of the DTW algorithm is $O(mn)$, where $m$ and $n$ are the lengths of the two sequences being compared, so matching large volumes of data simultaneously requires extensive computational resources and significant time. Therefore, this study statistically determined a DTW distance called $D_{best}$. $D_{best}$ was defined as the cumulative distance along the shortest warping path from the beginning of the sequence to the highest peak in the sequence using the DTW algorithm. While computing the lower bound threshold value, if the cumulative distance between each pair of corresponding data points exceeded $D_{best}$ before reaching the highest peak, then the calculation was stopped. Otherwise, the full DTW distance was calculated by adding the partial DTW accumulation from the highest peak in the sequence to the end of the sequence. This was then compared to the lower bound threshold value. After that, a secondary subsequence set was selected from the primary subsequence set $q_{0i} = [q_{01}, q_{02}, \cdots, q_{0n}]$ using the two lower bounds (i.e., $D_{best}$ and the lower bound threshold value).
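The idea of stopping a DTW computation early can be sketched as below. This is a generic early-abandoning variant written for illustration, not the peak-based rule described above: the function name `dtw_distance`, the `abandon_above` parameter, and the absolute-difference cost are all assumptions. The computation stops as soon as every cell in the current row exceeds the threshold, because no warping path through that row can finish below it.

```python
import math

def dtw_distance(a, b, abandon_above=math.inf):
    """DTW distance between a and b, or math.inf if abandoned early."""
    n, m = len(a), len(b)
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # empty prefix matched against empty prefix
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        if min(curr[1:]) > abandon_above:
            return math.inf  # every path already exceeds the threshold
        prev = curr
    return prev[m]

template = [0, 1, 2, 1, 0]
stretched = [0, 0, 1, 2, 2, 1, 0]   # same shape, slower execution
flat = [3, 3, 3, 3, 3]              # unrelated movement

same_shape = dtw_distance(template, stretched)
unrelated = dtw_distance(flat, template)
abandoned = dtw_distance(flat, template, abandon_above=5)
```

The stretched gesture matches the template with zero distance, whereas the flat sequence accumulates a large distance and, with a threshold of 5, is rejected after only three rows of the cost matrix have been filled.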

Gesture Recognition Systems Based on HMM
The Hidden Markov Model (HMM) is a popular algorithm for modeling human behavior because it is suitable for sequence data. In our proposed system, the wrist-movement kinematic metrics $\hat{Y}$, generated by the two matching algorithms above, were used to train classifiers to recognize gestures. In this study, the HMM-based classifier was used for gesture classification, while the Baum-Welch estimation algorithm was used to train the model $\lambda = (\pi, A, B)$. Assume that $Q = \{q_1, q_2, \cdots, q_N\}$ is the set of all possible states and $V = \{v_1, v_2, \cdots, v_M\}$ is the set of all possible observations for a certain class of gestures. Assume also that $N$ is the number of all possible states and $M$ is the number of possible observations. The expression $I = (i_1, i_2, \cdots, i_T)$ is the state sequence of length $T$; $O = (o_1, o_2, \cdots, o_T)$ is the corresponding observation sequence; and the parameters in model $\lambda = (\pi, A, B)$ are shown in Equation (5):
$$\pi = \{\pi_i\},\ i = 1, 2, \cdots, N; \quad A = [a_{ij}],\ i = 1, 2, \cdots, N,\ j = 1, 2, \cdots, N; \quad B = [b_{jk}],\ j = 1, 2, \cdots, N,\ k = 1, 2, \cdots, M \tag{5}$$
where $\pi_i$ denotes the initial state probability; $a_{ij}$ denotes the probability that the state at time $t$ is $q_i$ and shifts to state $q_j$ at time $t + 1$; and $b_{jk}$ denotes the probability of generating an observation $v_k$ while in state $q_j$.
The Baum-Welch algorithm process is divided into three steps (initialization, induction, and termination) to determine the best model $\lambda = (\pi, A, B)$. In other words, the parameters are updated iteratively until the probability $P(O|\lambda)$ of generating the observed sequence under the model converges to a (local) maximum. The Baum-Welch procedure is as follows: (1) Initialization: for n = 0, select initial values $a_{ij}^{(0)}$, $b_{jk}^{(0)}$, and $\pi_i^{(0)}$, and generate the model $\lambda^{(0)} = (\pi^{(0)}, A^{(0)}, B^{(0)})$. (2) Induction: Two sets of probability variables, $\xi_t(i, j)$ and $\gamma_t(i)$, are introduced. Given the model and the observation sequence, $\xi_t(i, j)$ is the probability that the state is $q_i$ at time t and $q_j$ at time t + 1, and $\gamma_t(i)$ is the probability of being in state $q_i$ at time t. For n = 1, 2, · · · , the parameters are re-estimated as shown in Equations (6)-(8), with the values on the right side of the equations calculated from the observation sequence O and the current model $\lambda^{(n)}$. (3) Termination: The model parameters $\lambda^{(n+1)} = (\pi^{(n+1)}, A^{(n+1)}, B^{(n+1)})$ were obtained once $P(O|\lambda)$ no longer increased significantly.
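For reference, the textbook Baum-Welch re-estimation step for a discrete HMM, written in the notation used above, takes the following form. Equations (6)-(8) presumably correspond to these updates, but the ordering and exact typesetting shown here are assumptions rather than a reproduction of the original equations.

```latex
a_{ij}^{(n+1)} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad
b_{jk}^{(n+1)} = \frac{\sum_{t=1,\; o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}, \qquad
\pi_i^{(n+1)} = \gamma_1(i)
```

Each update is a ratio of expected counts: expected transitions from $q_i$ to $q_j$ over expected visits to $q_i$, expected emissions of $v_k$ from $q_j$ over expected visits to $q_j$, and the expected occupancy of $q_i$ at the first time step.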
In this study, we defined an HMM for each gesture class and then trained each model only with the training sequences for its respective class. To apply the HMM for gesture recognition, the input observations used for training or classification needed to be discrete. Thus, the raw data were discretized using a K-Means algorithm, and the resulting discrete cluster labels replaced the raw data as the new training observation sequences.
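The discretization step can be illustrated with a small vector-quantization sketch: K-Means learns a codebook, and each raw sample is replaced by the index of its nearest centroid. The function names, the deterministic seeding, and the toy data are assumptions for illustration; the seeding also assumes there are at least k samples.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((x - y) ** 2 for x, y in zip(point, centroids[i])),
    )


def kmeans_codebook(points, k, iters=20):
    """Lloyd's algorithm on a list of equal-length tuples; returns k centroids."""
    step = max(1, len(points) // k)
    centroids = [points[i * step] for i in range(k)]  # evenly spaced seeds (assumption)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members))
            if members else centroids[i]  # keep an empty cluster's old centroid
            for i, members in enumerate(clusters)
        ]
    return centroids


def discretize(sequence, centroids):
    """Replace each raw sample with the label of its nearest centroid."""
    return [nearest(p, centroids) for p in sequence]


training = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
codebook = kmeans_codebook(training, k=2)
symbols = discretize([(0.05,), (4.9,), (0.3,)], codebook)
```

The number of clusters k plays the role of M, the number of distinct observation symbols available to the HMM.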
There were N gesture classes in this study; therefore, there was a set of N HMMs, modeled as $\lambda = \{\lambda_1, \lambda_2, \lambda_3, \cdots, \lambda_N\}$. When a new sequence of gesture observations $O = (O_1, O_2, O_3, \cdots, O_T)$ needed to be identified, it was necessary to evaluate the probability that this sequence was generated by each specific HMM $\lambda_i$, that is, $P(O|\lambda_i)$ for all HMMs in $\lambda$. Finally, we determined the HMM with the maximum likelihood of generating the observation sequence O, and used the gesture class represented by this HMM to label the new gesture, as shown in Equation (9):

\[ \text{gesture}^{*} = \arg\max_{1 \le i \le N} P(O|\lambda_i) \tag{9} \]

Thus, each HMM provides the probability that it generated the specific sequence, which can be used to determine the class label for the gesture performed.
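The evaluation of $P(O|\lambda_i)$ for each model and the selection of the most likely one can be sketched with the forward algorithm. The function names, the dictionary of models, and the toy one-state parameters below are hypothetical; a practical implementation would typically work with scaled or log probabilities to avoid underflow on long sequences.

```python
def forward_likelihood(obs, pi, A, B):
    """P(O | lambda) for a discrete HMM, computed with the forward algorithm.

    obs: list of symbol indices; pi[i]: initial probability of state i;
    A[i][j]: transition probability i -> j; B[j][k]: probability of symbol k in state j.
    """
    n_states = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][symbol]
            for j in range(n_states)
        ]
    return sum(alpha)


def classify(obs, models):
    """Return the label whose HMM gives the highest likelihood, plus all scores."""
    scores = {label: forward_likelihood(obs, *params) for label, params in models.items()}
    return max(scores, key=scores.get), scores


# Toy one-state models (hypothetical parameters, for illustration only).
models = {
    "gesture_a": ([1.0], [[1.0]], [[0.9, 0.1]]),
    "gesture_b": ([1.0], [[1.0]], [[0.1, 0.9]]),
}
label, scores = classify([0, 0, 1], models)
```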
Equations (10)-(12) were used to calculate the confidence level defined for each new gesture recognition:

\[ P_{max} = \max_{\text{gesture}} P(O|\lambda_{\text{gesture}}) \tag{10} \]

\[ P_{second} = \max_{\text{gesture except recognized gesture}} P(O|\lambda_{\text{gesture}}) \tag{11} \]

\[ \text{confidence} = \frac{1/P_{second} - 1/P_{max}}{1/P_{second}} \tag{12} \]

where $P_{max}$ is the maximum value of the probability that the new gesture sequence will be identified as a specific HMM $\lambda_i$; and $P_{second}$ is the second maximum value of the probability that the new gesture sequence will be identified as a specific HMM $\lambda_i$. The closer the confidence level was to 1, the more confidence there was that the data represented the gesture. When fitting the model, the initialization elements of an HMM include two key parameters. The parameter N is the number of hidden states in the model, and the parameter M is the number of distinct observation symbols for each state (here, the K-Means cluster labels served as the symbols).
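The confidence level defined above can be computed directly from the per-model likelihoods. In this minimal sketch the function name and the example likelihood values are hypothetical, and the second-best likelihood is assumed to be strictly positive.

```python
def recognition_confidence(likelihoods):
    """Confidence from the best and second-best model likelihoods.

    likelihoods: dict mapping gesture label -> P(O | lambda_label).
    Algebraically this equals 1 - P_second / P_max.
    """
    ranked = sorted(likelihoods.values(), reverse=True)
    p_max, p_second = ranked[0], ranked[1]
    return (1.0 / p_second - 1.0 / p_max) / (1.0 / p_second)


clear_case = recognition_confidence({"gesture_a": 0.081, "gesture_b": 0.009})
ambiguous_case = recognition_confidence({"gesture_a": 0.05, "gesture_b": 0.05})
```

A value near 1 means the winning model is far more likely than the runner-up, while a value near 0 means the two best models are almost indistinguishable for that sequence.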
In this study, we tested different ranges of N values, but found that the confidence level did not increase correspondingly with the number of states. We therefore focused mainly on determining the value of M. To mitigate under-generalization and over-generalization, 5-fold cross-validation was used to calculate how the M value affected recognition accuracy. Table 2 shows the relationship between the M value and the corresponding recognition accuracy. In Figure 5, the confidence curve first rises and then falls. Therefore, the value of M that generated the maximum mean confidence (i.e., M = 120) was selected. This determined parameters N and M in the training model.

Time-Domain Analysis for Driving Gestures
As described above, different class templates were created for each wrist movement. Each class template was constructed and tested based on selected kinematic measures (i.e., q 0 ). Each class template consisted of a different number of kinematic time series, and represented the unique features of each driving gesture. Figure 6 shows the analysis for the different class templates for each driving gesture task. The peaks and troughs are shown in red and green, respectively.

Segmentation
This system was applied in a non-controlled environment. As such, it was important to ensure that only the user interacted with the wearable IMU. To intercept the gesture waveform from the continuous waveform, we applied a threshold interception method based on triaxial acceleration. The acceleration signal was smoother in the absence of gestures and changed significantly when gestures began. Based on this property, we applied a specific threshold: when the amplitude of the selected kinematic measure exceeded the threshold, the index position was taken as the starting point of the gesture, and when the amplitude subsequently fell below the threshold, the index position was taken as the end point of the gesture.
This generated the starting and ending points of each gesture, and the gesture waveform was intercepted from the continuous waveform output by the accelerometer. The difference values of the triaxial accelerations were used to construct thresholds to obtain the starting and ending points of a task-specific gesture sequence in the full-sequence continuous waveform used for validation. The threshold was calculated as the cumulative total of the difference values of N = 10 points within a 1 s time window of the data amplitude. Equation (13) shows the calculation of the cumulative total of difference values, where $X_k^{acc}$, $Y_k^{acc}$, and $Z_k^{acc}$ denote the acceleration values on the x-axis, y-axis, and z-axis at sample k.
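A threshold-based interception of this kind can be sketched as follows. Because the exact form of Equation (13) is not reproduced here, the change measure below (the sum of absolute sample-to-sample differences over the three axes, accumulated over a sliding window) is an assumption, as are the function names, the window length, and the threshold value.

```python
def windowed_change(acc, window=10):
    """For each sample k, accumulate |dx| + |dy| + |dz| over the last `window` differences.

    acc: list of (x, y, z) acceleration tuples. The difference measure is an
    assumed stand-in for the cumulative total of difference values.
    """
    diffs = [0.0] + [
        sum(abs(c - p) for c, p in zip(cur, prev))
        for prev, cur in zip(acc, acc[1:])
    ]
    out, running = [], 0.0
    for k, d in enumerate(diffs):
        running += d
        if k >= window:
            running -= diffs[k - window]  # drop the difference leaving the window
        out.append(running)
    return out


def segment(acc, threshold, window=10):
    """Return (start, end) index pairs where the windowed change exceeds the threshold."""
    segments, start = [], None
    for k, value in enumerate(windowed_change(acc, window)):
        if value > threshold and start is None:
            start = k  # change rose above the threshold: gesture starts
        elif value <= threshold and start is not None:
            segments.append((start, k))  # change fell back: gesture ends
            start = None
    if start is not None:
        segments.append((start, len(acc) - 1))  # gesture still active at the end
    return segments


still = [(0, 0, 0)] * 5
burst = [(1, 0, 0), (0, 0, 0), (1, 0, 0), (0, 0, 0)]
found = segment(still + burst + still, threshold=0.5, window=2)
```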

Confusion Matrix
Validation tests confirmed the recognition performance and universality of the proposed algorithm. Specifically, the validation dataset was labelled using surveillance videos of the drivers' distracted motions. The validation dataset consisted of 647 intercepted sequences, of which 102 were gesture sequences and 545 were disturbance sequences.
A combined two-class subsequence matching method was used to divide the gesture sequence state into two categories: Positive and Negative (i.e., gesture sequences and disturbance sequences). The gesture recognition systems based on HMM were used to divide the gesture sequence states into four categories (i.e., dialing hand-held cellular phone, adjusting audio or climate controls, reaching for object in back seat, steering wheel maneuvering to stay in lane).
A confusion matrix contains information about the actual and predicted classifications produced by a classifier, and the performance of the classifier can be evaluated using the data in the matrix. The accuracy, precision, recall (also known as sensitivity), F-measure, and specificity were calculated for each identification stage, and the overall model performance was evaluated, using Equations (14)-(18):

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{14} \]

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{15} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{16} \]

\[ \text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{17} \]

\[ \text{Specificity} = \frac{TN}{TN + FP} \tag{18} \]

where TP (true positives) represents the number of gesture sequences correctly identified as gesture sequences; TN (true negatives) represents the number of disturbance sequences correctly identified as disturbance sequences; FP (false positives) represents the number of disturbance sequences incorrectly classified as gesture sequences; and FN (false negatives) represents the number of gesture sequences incorrectly classified as disturbance sequences. Table 3 shows the confusion matrix for the results of the screening for negatives, using the subsequence matching method. For the first subsequence matching method, the number of true positives (TP) was 88; the number of false positives (FP) was 92; the number of true negatives (TN) was 453; and the number of false negatives (FN) was 14. For the second subsequence matching method, there were 81 true positives (TP), 8 false positives (FP), 84 true negatives (TN), and 7 false negatives (FN). The negative screening of the first subsequence matching method was better than that of the second method. After the first screening, negative samples made up 51.11% and positive samples 48.89% of the set passed to the second subsequence matching method. The combined two-class subsequence matching method effectively screened the overall disturbance sequences, with the proportion of positive samples increasing from 15.76% to 91.01% and the proportion of interference samples decreasing from 84.23% to 8.98% of the total samples.
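The reported per-stage figures can be checked against the confusion-matrix counts. The sketch below assumes the standard definitions of the five metrics; the function name `confusion_metrics` is hypothetical, and the counts are those quoted above for the two subsequence matching methods.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F-measure, and specificity from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }


# Counts quoted in the text for each screening stage.
first = confusion_metrics(tp=88, fp=92, tn=453, fn=14)
second = confusion_metrics(tp=81, fp=8, tn=84, fn=7)

for name, metrics in (("first", first), ("second", second)):
    print(name, {key: round(100 * value, 2) for key, value in metrics.items()})
```

Note that the second stage is evaluated on 180 sequences (81 + 8 + 84 + 7), which equals the number of sequences the first stage labelled as positive (88 + 92).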
Table 4 shows the subsequence matching method accuracy, precision, recall (also known as sensitivity), F-measure, and specificity. For the first subsequence matching method, the accuracy was 83.62%; the precision was 48.89%; the recall was 86.27%; the F-measure was 62.41%; and the specificity was 83.12%. The second subsequence matching method had a higher accuracy (91.67%) than the first, a relative increase of 9.63%. The second method also had a higher precision (91.01%), a relative increase of 86.16%, and a higher recall (92.05%), a relative increase of 6.69%. The second method also had a higher F-measure (91.53%), a relative increase of approximately 46.7%, and a higher specificity (91.30%), a relative increase of 9.85%. The specificity of the combined two-class subsequence matching method increased significantly to 98.53%; the recall was 79.41%; and the precision was 91.01%. These results indicate that the overall model was more sensitive to positive case perception and could screen for disturbance sequences. The F-measure was 84.82%, indicating that the model was robust. Finally, the HMM algorithm in the proposed progressive classification model for gesture recognition had an excellent detection rate (96.63%).
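The combined figures follow from treating the two matching methods as a cascade. The sketch below assumes, consistent with the reported counts, that the second method only re-examines the sequences the first method labelled as positive; the function name and dictionary layout are hypothetical.

```python
def combine_cascade(stage1, stage2):
    """Merge the confusion counts of a two-stage screening cascade.

    Assumes stage 2 sees exactly the sequences stage 1 labelled positive, so
    negatives rejected at either stage accumulate in TN and FN.
    """
    assert stage2["tp"] + stage2["fn"] == stage1["tp"]
    assert stage2["fp"] + stage2["tn"] == stage1["fp"]
    return {
        "tp": stage2["tp"],
        "fp": stage2["fp"],
        "tn": stage1["tn"] + stage2["tn"],
        "fn": stage1["fn"] + stage2["fn"],
    }


stage1 = {"tp": 88, "fp": 92, "tn": 453, "fn": 14}
stage2 = {"tp": 81, "fp": 8, "tn": 84, "fn": 7}
combined = combine_cascade(stage1, stage2)

total = sum(combined.values())
accuracy = (combined["tp"] + combined["tn"]) / total
precision = combined["tp"] / (combined["tp"] + combined["fp"])
recall = combined["tp"] / (combined["tp"] + combined["fn"])
specificity = combined["tn"] / (combined["tn"] + combined["fp"])
f_measure = 2 * precision * recall / (precision + recall)
```

Under this assumption the cascade trades recall for specificity: gestures rejected at either stage count as false negatives, while a disturbance must pass both stages to become a false positive.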

Conclusions
This paper proposed a non-intrusive method for detecting in-transit manual distractions, using body kinematic measures collected from wearable inertial measurement units (IMU). The goal was to detect specific gesture tasks and types of gestures during effective driving. Due to the problem of excessive disturbances in real driving scenarios, this study developed a progressive classifier to recognize gestures, which is a key contribution of the study. This study focused on establishing a combined two-class subsequence matching method to screen disturbance sequences, by analyzing gesture kinematic measures. The study also constructed a Hidden Markov Model (HMM) database to recognize gestures using the Baum-Welch algorithm.
First, we annotated an online driving gesture dataset, generated by wearable IMU units around participants' wrists, to recognize gestures. Then, we extracted the time-domain features of the quaternions in the kinematic parameters and constructed a gesture template. Second, the unknown motion samples were compared with the selected kinematic signatures, based on time-domain analysis and an improved Dynamic Time Warping (DTW) algorithm. Finally, the raw data from the driving gesture dataset under each motion gesture were discretized using K-Means to train the models. By establishing a Hidden Markov Model, the corresponding parameters were defined using 5-fold cross-validation. The most likely probabilistic state transition sequence was calculated using the Baum-Welch algorithm to recognize the motion with the highest possibility.
The class templates of different driving gesture tasks were created based on selected kinematic measures. The features were easily extracted using clear peaks and drops, or patterns in the movement. The accumulated distance ($D_{best}$ and the lower bound threshold value) between the testing data and the template was considered a cost function, reflecting the similarity between them. Therefore, the model could record complete data about the body movement kinematics during driving, and subtle changes could be relied on to screen the disturbance samples.
The evaluation indicators (accuracy, precision, recall, F-measure, and specificity) from confusion matrices were used to evaluate the performance of the proposed model.
The results show that the specificity of the combined two-class subsequence matching method in the progressive classification model was 98.53%; the recall was 79.41%; and the precision was 91.01%. These results indicate that the model was highly efficient in screening disturbance sequences. Overall, the progressive classification model demonstrated a high level of accuracy (95.52%) in separating disturbance sequences from gesture sequences during driving. The model also showed good accuracy (96.63%) in distinguishing task-specific wrist-movement categories with HMM dynamic gesture recognition. The advantage of this method is that the overall model is sensitive to motion perception, and effectively addresses the problem of a fall-off in recognition performance due to excessive disturbance samples. Moreover, the proposed model is easy to implement and provides high recognition accuracy.
The results of this study can be popularized and implemented in the field of human-machine cooperative driving. Intelligent vehicles with assistive driving now represent a mainstream driving pattern and will continue to do so in the long term. The driver remains the main subject, and the real-time monitoring of driving behavior and states is indispensable in shared autonomy scenarios. Therefore, the most important application of this study may be its ability to provide useful guidance for future driving risk assessments. This supports improvements in the safety, efficiency, and sustainability of the transportation system. It may also support the successful development and implementation of distracted driver detection algorithms to advance the sustainability of road systems in a connected vehicle environment. Future research will focus on extracting data from different driving gesture tasks during real driving to validate the model and examine more gesture motions for better generalization. Combined with the future development of Intelligent Connected Vehicle technologies, data processing and transmission capacity will continue to be enhanced, based on real-time monitoring of driver behavior and driving status. This will continue to improve early warning information, and may support future applications, including planning the timing of a driver's takeover of an autonomous vehicle.