Deep Residual Network for Smartwatch-Based User Identification through Complex Hand Movements

Wearable technology has advanced significantly and is now used in various entertainment and business contexts. Authentication methods must be trustworthy, transparent, and non-intrusive to guarantee that users can engage in online communications without adverse consequences. An authentication system within a security framework starts with a process for identifying the user to ensure that the user is permitted access. Establishing and verifying an individual's identity usually requires considerable effort. Recent years have seen an increase in the use of activity-based user identification systems. Despite this, there has been little research into how complex hand movements can be used to determine an individual's identity. This research used a one-dimensional residual network with squeeze-and-excitation (SE) modules, called the 1D-ResNet-SE model, to investigate hand movements for user identification. According to the findings, the SE modules enhance the one-dimensional residual network's identification ability. As a deep learning model, the proposed methodology effectively identifies features from the input smartwatch sensor data and can be utilized as an end-to-end model, simplifying the modeling process. The 1D-ResNet-SE identification model outperformed the other models evaluated. Hand movement assessment based on deep learning is thus an effective technique for identifying smartwatch users.


Introduction
Annually, the quantity of information produced by wearable devices linked to the Internet increases [1]. Among these gadgets, smartphones play a significant part because of their increasing functionality and consumer acceptance. As a result, the safety and security of this equipment are a top priority throughout the design phase. Biometrics can be employed in several of the newest strategies for controlling illegal access to mobile devices: an individual's observable characteristics and behaviors are examined and measured to recognize or identify that person.
User authentication is an excellent way to protect personal information. The design of the authentication mechanism must reflect the fact that the aim of authentication is to validate the user's identity [2]. Many new solutions for preventing identity theft have been introduced in recent years. Identifying the user while providing a pleasant user experience is the primary goal of these techniques, but several obstacles remain. Digital identity is increasingly built on usernames and passwords [3], making it vulnerable to theft, hacking, and fraud. Digital signatures based on cryptographic algorithms are another common choice [4]. A highly capable computer system is needed to produce digital signatures; therefore, devices with fewer resources have difficulty establishing this form of identification. Hardware-based Physical Unclonable Functions (PUFs) have recently emerged as a means of identifying individuals, and several authentication methods have been built on this basis [5]. PUFs have certain drawbacks, however, including the need for additional equipment. Hardware-based identification is also implemented via tokens and access cards [6].
A biometrics-based identification approach is the next advance in identifying and verifying individuals [7]. For obvious reasons, biometrics identify individuals more reliably than the previously listed digital IDs. First, because they are a part of ourselves, biometrics are easy to use: compared to more conventional verification and identification methods, such as credentials, PINs, and tokens, biometrics are almost impossible to lose or steal [8]. Second, since each person's biometrics are distinctive, they are complicated to reproduce. It is also easy to verify the properties of biometric IDs [9]. Modern computer systems use several physiological biometric identifiers. Several products employ face detection to verify that individuals are who they claim to be. Among biometrics, the fingerprint is the most often utilized [10]. Other common biometric signatures include ECG/EEG characteristics [11], iris patterns [12], and palm vein variations [13]. All of these options need specialized technology to collect biometric data, which can be prohibitively costly, time-consuming, and intrusive to the participant. A further drawback of these physiologically based approaches is that they are vulnerable to emulation; voice impersonation, iris-copying lenses, and concealment are but a few examples of such fraudulent activity.
Many emerging biometric identification alternatives are low cost, better suited to mobile settings than classic biometrics, or can be used in conjunction with classic biometrics in multifactor authentication to enhance security and usability. On the other hand, some biometric authentication methods demand human engagement, which can be burdensome for the end-user; examples include entering a password, unlocking the smartphone via face recognition, or tapping the fingerprint reader. In continuous authentication, the user is required to authenticate many times [14], which is even more demanding. Since biometric characteristics are collected indirectly as the user interacts with the device, movement sensor-based identification approaches, such as wearable sensor-based gait recognition [15], touch gesture-based recognition, and keystroke-based recognition, can tackle this issue. Compared to standard vision-based movement identification, these methods are more private [16] and use less energy.
The advancement of science and technology has influenced the techniques of biometric identification. Fingerprinting, facial detection, retinal scanning, palm geometry, and voice recognition are some of the most well-known techniques, but there are many more. Meanwhile, less invasive biometric variants are coming into widespread use, such as identifying persons based on the features of their activity. This approach, which captures how individuals perform their daily routines, has both pros and cons. Since the person only needs to carry the equipment (generally a smartphone) or be captured on camera for recognition by computer vision, its key benefit is that it enables automated, regular, and non-intrusive recognition. Lower accuracy than fingerprint-based techniques might be seen as a drawback; continuous and periodic identification can help with this. For gait-based person recognition systems, performance can be improved if the data samples cover a broad range of user actions, ensuring that a valid classification is achieved in the least amount of time feasible.
Smartphones have become an integral part of daily life in many homes and workplaces and are routinely used to access cloud-based security services. Because a smartphone is easily stolen or compromised, smartwatches offer an interesting environment for authentic identity verification in cloud-based solutions such as Internet banking. When connecting to mission-critical Internet services through cloud-based or other data sources, it is vital to reliably identify the genuine user; automated and non-bypassable identification is required.
In the recent decade, machine learning (ML) techniques have been employed to achieve good outcomes in biometric-based user identification. Within controlled circumstances, machine learning techniques such as K-nearest neighbors, Support Vector Machine (SVM), Decision Tree (DT), and Random Forest (RF), among others, have been shown to deliver satisfactory results [17][18][19]. The accuracy of these standard machine learning models, however, is highly dependent on manually extracted and selected features.
Nowadays, deep learning (DL) algorithms have succeeded in user identification studies. One of the most significant strengths of deep learning is its ability to automatically identify and classify features with increased accuracy, which has influenced user identification studies [20][21][22]. Deep neural networks can learn discriminative characteristics from raw data automatically, have revealed tremendous promise for evaluating diverse data, and have a high capacity for generalization. Numerous basic and sophisticated deep learning models have been proposed to capitalize on deep learning approaches, compensating for the drawbacks of traditional machine learning while leveraging the multiple levels of characteristics available in different hierarchies. A hierarchy of layers handles low- and high-level features, with linear and nonlinear feature transformations at different levels contributing to learning and optimizing features. To this end, deep learning models such as Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), and Long Short-Term Memory (LSTM) are utilized to overcome the limitations of conventional machine learning algorithms that rely on manual feature selection, where an error in feature selection can have negative consequences for the application at hand. As a result, deep learning networks have found practical applications in identification tasks and are often employed in activity recognition studies for feature extraction. One disadvantage of the deep learning approach, particularly when complex DL architectures are used, is the higher cost of processing the massive number of accessible datasets. Nevertheless, the cost is justified, since an identification approach relies on the accuracy of the deep learning model's classification results.
While increasing the depth of a CNN can extract more abstract features and improve effectiveness, it may also result in degraded performance [23]. To address this issue, He et al. [24] proposed the residual network (ResNet) for image identification, which has since been applied to the study of human behavior. For instance, Li et al. [25] used a 1D-ResNet model to extract spatial features from multidimensional inertial sensor inputs together with a bidirectional LSTM; this approach achieves improved performance with fewer parameters. Moreover, Ronald et al. [26] established a link between wearable sensor vectors and individual motions using an improved deep learning model based on ResNet and Inception modules. Their improved model shows exceptional performance in human activity recognition (HAR) applications.
The squeeze-and-excitation (SE) module [27] is a channel attention technique that can be incorporated into existing CNNs to increase classification performance. The SE block operates as an embedded unit, improving the effectiveness of deep neural networks. The SE block is a popular enhancement of many CNN architectures because it can be added without changing the shape of the existing model. For example, [28][29][30] concatenated the SE block into CNNs with varying numbers of convolutional layers. The findings indicate that CNNs with the SE block outperform plain CNNs in accuracy. Additionally, the efficiency of the SE block has been shown in the assessment of pork freshness using near-infrared spectroscopy (NIRS) [31] and in ECG-signal classification [32].
Inspired by the works mentioned earlier, this research combines the SE block with a 1D-ResNet to evaluate the SE block's potential for user identification using sensor data from a smartwatch. The main difference between the proposed network and our previous works in [28][29][30] is that the proposed network uses a one-dimensional deep residual network with shortcut connections instead of a one-dimensional convolutional neural network. This work aims to investigate user identification using a smartwatch on the basis of complex hand movements. We use a residual network to extract more abstract spatial characteristics than a plain CNN, and squeeze-and-excitation modules are incorporated into the one-dimensional ResNet to improve recognition performance.
The following is an overview of the study's most significant contributions:
• This article investigates the potential of the 1D-ResNet-SE for sensor-based user identification by analyzing complex hand movement signals captured by smartwatch sensors. We compared standard CNN-based deep learning models and RNN-based LSTM networks for sensor-based user identification using smartwatch sensor data to examine the algorithm's effectiveness.
• We conducted comprehensive experiments using several smartwatch-based HAR datasets encompassing simple and complex hand movements to increase ecological validity. We observed the connection between hand motion patterns and the recognition of smartwatch owners using the 1D-ResNet-SE model. The SE blocks are combined with the residual network to increase sensitivity to relevant features. Compared to CNN- and LSTM-based deep learning models, the model demonstrated here showed superior performance in user identification in complex hand movement situations.
The rest of this article is organized as follows: Section 2 examines the latest research on sensor-based deep learning algorithms for user identification. The proposed techniques are introduced in Section 3. Experiments are described and findings are presented in Section 4. The deep learning algorithms used in the research and their identification effectiveness are discussed in detail in Section 5. Finally, Section 6 addresses the study's limitations and suggests new opportunities for future research.

Related Works
Identifying users based on their activities has proved to be a challenging problem to solve. We have compiled a collection of resources connected to our study in this area.

Sensor-Based User Identification
Wearable sensors have been considered in recent years as part of sensor-based identification systems. For example, [33,34] proposed mechanisms for explicitly and continuously identifying the individual. However, in most circumstances there is no compelling evidence that the individual's behavior has changed enough to warrant a distinct classification. Luca et al. [35] present a technique for explicitly determining the distance between pattern traces using dynamic time warping. Most of the 22 unusual touch patterns studied by Sae-Bae et al. [36] involve using all five fingers concurrently. They used k-nearest neighbors and support vector machines to categorize the 22 analytical characteristics from the touch traces analyzed in the study [37].
According to the behavior-based model's principle, each action is correlated with two essential characteristics: time and space. For example, a user could be identified from activities such as those described by [38]. Multi-modal continuous user identification was suggested in [39]. Another distinctive architecture for ongoing user identification was proposed in [40] by leveraging historical smartphone records and positions.
To a certain degree, all the tasks mentioned above need additional details as a source of user identification. Casale et al. [41] provided gait-based user identification using an inconspicuous biometric pattern to address these concerns; their four-layered structure made use of the geometric principle of a convex hull. However, it only functions in specific locations, which is a severe disadvantage. Wearable devices based on gait signals recorded with a three-dimensional accelerometer were employed in the studies of [8,42], where the accelerometer was simply attached to the rear of the individual's waist. A threefold technique for user identification based on data distribution statistics, correlation, and time-frequency characteristics was developed by [43], where the participants were deliberately asked to walk at different speeds, such as slow, regular, or quick. The fundamental disadvantage of Mantyjarvi's work [43] is that only one person can walk at a time, and with relatively restricted variations.
Many existing approaches use gait-based systems; these are summarized above and in Table 1. There is still scope for improvement with respect to physical adjustments, carried objects, orientation, placement, movement surface, psychosocial factors of a participant, stimulants, and other considerations, since these factors significantly hamper a gait-based system's performance in real-world situations. In recent years, several research studies on time series classification (TSC) have focused on deep learning and obtained notable results. CNNs have been a prominent deep learning technique in TSC because of their capability to extract the relationships between local structures in array-form data. Yang et al. [46] presented one of the first applications of CNNs in TSC. According to the researchers, a higher-level description of raw sensor data can be derived using a CNN's deep architecture; additionally, combining feature learning and classification in a single model makes the learned features more discriminative. According to Ronao and Cho [47], a deep CNN with 1D convolutions outperforms conventional pattern recognition algorithms for movement categorization using smartphone sensors. Jiang and colleagues [48] fed the sensor data into a two-dimensional neural network instead of utilizing one-dimensional convolutions, capturing both temporal and spatial characteristics of the action patterns for the classification test.
The two-stage CNN model [49] increases the classification performance of actions with complicated structures and limited training data.
TSC has subsequently benefited from numerous cutting-edge CNN architectures introduced in the machine vision area. A TSC model based on U-Net [50] was presented in [51] to conduct sampling point-level forecasting, thereby overcoming the multi-class issue. Mahmud et al. [52] use a residual block-based CNN to extract features and categorize behaviors from 1D time-series sensor data. Tang et al. [53] constructed a compact deep convolutional neural network for TSC using Lego filters [54].

Recurrent Neural Networks
Time series sensor data are commonly processed using recurrent neural networks (RNNs) because they store information about previous items in a sequence. Zeng et al. [55] presented a long short-term memory (LSTM) model based on continuous attention that emphasizes relevant sensor modalities and significant sections of the sensor data during TSC analysis. Barut et al. [56] constructed a multitask framework employing stacked LSTM layers to classify and estimate activity intensity from raw sensor data. Rather than utilizing raw data, the bidirectional LSTM recurrent neural network in [57] is applied to feature data generated from principal component analysis (PCA) and the discrete wavelet transform (DWT). Where labeled data are scarce, [58] recommends extracting features using spectrograms, after which identification is carried out using an extended support vector machine (SVM). It was shown in [59] that fusing LSTM-RNNs with handcrafted features can improve system performance. The local feature-based LSTM networks suggested by Chen et al. [60] can encode temporal dependence and learn features from acceleration data with a high sampling rate.

Hybrid Neural Networks
In recent years, considerable research has demonstrated that substantial TSC effectiveness can be achieved by combining hybrid models derived from several kinds of deep learning approaches. Gated recurrent units (GRUs) were introduced in [61] to uncover sequential temporal relationships in complex activity recognition, using an inception module-based CNN [62]. In a sleep-wake detection system, Chen et al. [63] employed a 1D-CNN-LSTM model to capture feature information from lengthy acceleration sequences, then combined an attention mechanism with handcrafted heart rate variability features. In [64], a recurrent convolutional attention model was presented to cope with the imbalance of labeled data in a semi-supervised manner; small segments of window data are supplied to an LSTM layer for motion identification after a CNN is applied to the data. For the first time, an LSTM-CNN model was suggested by Xia et al. [65], in which a two-layer LSTM is applied directly to the raw sensor data before 2D convolutional layers are employed. Both deep learning and traditional pattern recognition approaches are successful in this research; however, further examination exposes several gaps and flaws. Rather than evaluating the connection between neighboring windows, most research has looked only at the data from individual windows to make predictions about behavioral aspects. In multi-class classification applications, including face recognition, this technique can deliver great accuracy, but it fails to capture long-term dependencies in sensor data. A method named MFAP, developed by Chen et al. [59], addresses this weakness by considering both past and present a priori data: the activity sequence is treated as a first-order Markov chain, so each observation depends only on the immediately preceding one. This strategy, unfortunately, necessitates an additional manual post-processing step on the output of the deep neural network's Softmax layer.
Some research has also applied and evaluated these principles in real-life scenarios; however, most past research employs clean datasets. The data for each action are gathered, interpreted, and preserved independently, without taking transitions between movements into consideration. In reality, activities are performed sequentially, and some, such as lying down and jogging, cannot follow each other without a transition. A hierarchical hybrid approach, known as HiHAR, has been proposed to overcome these issues; its hierarchical design can determine local and global temporal dependencies in window sequences.

Simple and Complex Human Activities
Human activities can be classified into two categories according to [66][67][68][69]: Simple human activities (SHA) and complex human activities (CHA). As Shoaib et al. [70] observed, simple human activities are repeated, common movements that can be largely determined using an accelerometer, such as strolling, running, sitting, and standing. In contrast, behaviors that are not repetitive, such as smoking, eating, delivering a speech, or sipping coffee, cannot be clearly detected with smaller segmentation windows, unlike repetitive activities such as walking, running, or cycling. Complex human behaviors are less repetitive than simple ones and often involve the use of the hands, as in smoking, eating, and drinking. Additional sensors, such as a gyroscope, can be utilized to determine whether a CHA is present. Due to the difficulty of characterizing such actions with a single accelerometer, this research categorized stair-related movements as CHA.
Alo et al. [66] distinguished two types of human activities: Simple and complex. Walking, running, sitting, standing, and jogging are simple human activities composed of quick human movements. On the other hand, complex human activities, such as smoking, eating, taking medicine, cooking, and writing, are composed of longer-duration operations. Peng et al. [67] defined simple activities (e.g., walking, jogging, or sitting) as those based on repetitive movements or a single body position, which do not genuinely describe everyday activities. Complex activities, on the other hand, are more challenging and are composed of many simple operations. Complex actions, such as eating breakfast, office work, or shopping, usually extend over a long period of time and have broad meanings; they are more accurate components of people's everyday lives. According to Liu et al. [68], human activity is complicated: a complex activity is a collection of chronologically related atomic actions, whereas an atomic action is a single unit-level action that cannot be further decomposed under practical comprehension. Rather than performing a single atomic operation, individuals frequently perform several activities in various ways, both sequentially and simultaneously. Chen et al. [69] likewise distinguished simple and complex activities: SHA can be considered a single repeated motion that a single accelerometer can recognize, whereas CHA rarely occur in repeatable form and usually include many simultaneous or overlapping actions that can be observed only via multimodal sensor data.

Available Sensor-Based Activity Datasets
Many sensor-based activity datasets are publicly accessible and can be used to develop deep learning models.
All 51 individuals in the WISDM-HARB dataset [17] were recorded while participating in 18 activities of daily life, with each participant wearing a smartwatch on their dominant wrist while completing the tasks. The research goal was to identify which combination of accelerometer and gyroscope sensors achieved the best performance on smartphones and smartwatches.
The UT-Smoke dataset [71,72] includes smartwatch and smartphone data loggers that collect various sensor data simultaneously. Over three months, the eleven volunteer participants in this study smoked for a total of 17 hours while strolling, standing, sitting, or speaking with others. To our knowledge, this is the largest dataset among comparable studies of this type.
Annotated data from complex hand-based movements recorded by smartwatches are utilized as a baseline for complex hand movement studies in the two datasets above [73][74][75]. UT-Complex [70], PAMAP2 [76], and OPPORTUNITY [77] are further sensor-based activity datasets; however, they do not include annotated smartwatch sensor data for complex hand-based tasks, so this research did not use them.

Proposed Methodology
This section emphasizes the methods used for training the deep learning model and identifying individuals through smartwatches with built-in wearable sensors. Figure 1 shows the proposed methodology for the CHM-UserIden framework, which comprises data acquisition, pre-processing, model training, and user identification. Each stage is explained in further detail as follows.


Data Acquisition
This section describes the benchmark datasets utilized to evaluate this research. The assessment employed two public datasets (the UT-Smoke and WISDM-HARB datasets), both of which include inertial data from smartwatch sensors. The data in each dataset were gathered while a group of individuals engaged in everyday tasks, such as dining, having a drink, and smoking. These datasets were produced from accelerometer-, gyroscope-, and magnetometer-equipped smartwatch sensors.
To investigate user identification through a smartwatch, we classified human activities using the SC² representational taxonomy [68]. This division of human activities into simple and complex ones is based on their chronological interconnections.
• A simple activity cannot be subdivided further at the atomic scale. For instance, walking, running, and ascending are all considered simple activities owing to their inability to be decomposed into other activities.
• A complex activity is a high-level activity formed via the sequencing or overlapping of atomic-level activities. For example, "smoking while strolling" incorporates the two atomic actions of "strolling" and "smoking".
Characteristics of both activity-based datasets are described in Table 2.

UT-Smoke Dataset
The UT-Smoke dataset, previously presented in [71,72], is used in this study as a public complex hand-based activity dataset. Over three months, 11 volunteers (two female and nine male) aged 20-45 were tracked using a smartwatch application. The application records timestamped data from the triaxial accelerometer and gyroscope of both a smartwatch and a smartphone at a sampling rate of 50 Hz.

WISDM-HARB Dataset
Fifty-one people aged 19 to 48 were recruited to participate in the WISDM-HARB dataset [17] and complete various everyday tasks, including both simple and complex activities, using smartphone and smartwatch sensors. The subjects performed each task for three minutes, while the accelerometer and gyroscope sensors recorded data at 20 Hz.

Data Pre-Processing
Due to the participants' lively motions throughout the data collection, the raw sensor data contained measurement noise and other unanticipated noise. Signals with a lot of noise distort the information they convey, so it was critical to limit the impact of noise so that useful information could be retrieved from the signal [42,78]. Mean, low-pass, and wavelet filtering are among the most frequently used filtering techniques. Using a 3rd-order Butterworth filter with a 20 Hz cutoff frequency, we de-noised all three dimensions of the accelerometer, gyroscope, and magnetometer data. Since 99.9% of the energy of body movement lies below this frequency, it is well suited to motion recording [79].
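As a minimal sketch of this denoising step (assuming SciPy's `butter`/`filtfilt` and a 50 Hz sampling rate, as in the UT-Smoke data; the function name is illustrative, not from the original implementation), the 3rd-order low-pass Butterworth filter can be applied per axis as follows:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def denoise(signal, fs=50.0, cutoff=20.0, order=3):
    """Zero-phase 3rd-order Butterworth low-pass filter.

    signal: array of shape (samples, axes), e.g., triaxial accelerometer data.
    cutoff is in Hz and must lie below the Nyquist frequency fs / 2.
    """
    nyquist = 0.5 * fs
    b, a = butter(order, cutoff / nyquist, btype="low")
    # filtfilt runs the filter forward and backward, avoiding phase distortion
    return filtfilt(b, a, signal, axis=0)
```

Note that for the 20 Hz WISDM-HARB recordings a 20 Hz cutoff would sit at the Nyquist frequency, so the cutoff would have to be lowered there.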
Once cleansed of unwanted noise, the sensor data needed to be transformed. Each data point was scaled using Min-Max normalization, which projects its values into the range [0, 1]; balancing the impact of different dimensions in this way can benefit the learning process. In the data segmentation stage, the normalized data from all sensors are split into equal-sized sections for model training using fixed-size sliding windows. To construct sensory data streams of fixed length, we employed a sliding window with a duration of 10 s, as suggested by [17]. The 10-s window is utilized for user identification because it is long enough to record crucial features of a person's activities, including numerous repetitions of fundamental motions such as walking and stair ascending, while still enabling fast biometric identification. Additionally, prior activity recognition investigations revealed that a 10-s window size surpasses others [80].
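The normalization and segmentation steps can be sketched as below (a simplified illustration assuming NumPy arrays of shape (samples, channels); at 50 Hz a 10-s window corresponds to 500 samples, and the helper names are hypothetical):

```python
import numpy as np

def min_max_normalize(x):
    """Project each channel of x (shape: samples x channels) into [0, 1]."""
    mn, mx = x.min(axis=0), x.max(axis=0)
    return (x - mn) / (mx - mn + 1e-8)  # epsilon guards constant channels

def sliding_windows(x, window, step):
    """Split a signal into fixed-size segments along the time axis."""
    starts = range(0, len(x) - window + 1, step)
    return np.stack([x[i:i + window] for i in starts])

# e.g., 20 s of 6-channel data at 50 Hz cut into two non-overlapping 10-s windows
segments = sliding_windows(min_max_normalize(np.random.rand(1000, 6)), 500, 500)
```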

Data Generation
Data samples are separated into training and test data in this phase, while temporal windows from the signals are utilized to create a model, and test data are used to assess the learned model. Cross-validation is the standard approach for separating data into training and test sets [81]. Numerous strategies, such as k-fold cross-validation [7], could be used to separate the data for training and testing. This stage estimates the learning algorithm's capacity to generalize to new data. This stage takes advantage of stratified ten-fold cross-validation inside the framework for smartwatch-based user identification. The entire dataset is partitioned into 10 equal folds or subsets for this validation approach. Nine of these folds are utilized for training and one for testing in each cycle. This procedure is performed 10 times, utilizing all data for both training and testing. Stratified data imply that each fold has about the same amount of data from each participant.  The input sensors were processed using convolutional blocks and SE-ResNet blocks. An ELU layer, a convolutional layer, a batch normalization layer, and a max-pooling layer were all included in the convolution component. Each of the trainable convolutional kernels in the convolutional layer creates a feature map, which is then used in the convolutional layer. One-dimensional kernels are just like the input spectrum. Because of this, BN was used to stabilize and speed up the learning process. The model's expression capability was improved with the help of ELU, a nonlinear function. Preserving key characteristics was achieved by using the MP layer to minimize map size. The following section goes into further detail about the SE-ResNet module. Flattened layers were utilized to turn the averages of each feature map into a 1D vector using the GAP. Using a Softmax function, the result of the fully linked layer was transformed into probabilistic reasoning. 
The cross-entropy loss function, commonly used in classification applications, was applied to compute the network's loss.

SE-ResNet Block
As the number of network layers increases, a degradation problem occurs: accuracy rapidly saturates and then declines [82]. Adding a bypass (shortcut) connection to ResNet's residual block effectively addresses this degradation issue [24]. Figure 3 depicts the architecture of a residual block, which comprises convolutional layers, BN, ELU, and a bypass connection; the residual block differs from the convolutional block only in this bypass connection. Let H(x) denote the desired underlying mapping. The stacked layers fit the residual function F(x) := H(x) − x, so the original mapping is recast as F(x) + x. Residual learning is simpler to optimize than fitting H(x) directly with stacked layers and avoids the degradation issue.
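A minimal Keras sketch of such a 1D residual block is shown below. Filter counts, kernel size, and the 1x1 shortcut projection are assumptions for illustration, not the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, kernel_size=3):
    """1D residual block: Conv-BN-ELU stacks plus a bypass (shortcut) connection.

    The stacked layers fit F(x) = H(x) - x, and the block outputs F(x) + x.
    A 1x1 convolution aligns the shortcut when the channel counts differ.
    """
    shortcut = x
    y = layers.Conv1D(filters, kernel_size, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("elu")(y)
    y = layers.Conv1D(filters, kernel_size, padding="same")(y)
    y = layers.BatchNormalization()(y)
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv1D(filters, 1, padding="same")(shortcut)
    y = layers.Add()([y, shortcut])      # F(x) + x
    return layers.Activation("elu")(y)

inputs = tf.keras.Input(shape=(200, 3))  # assumed window length and channel count
outputs = residual_block(inputs, filters=64)
model = tf.keras.Model(inputs, outputs)
```

The `Add` layer is the bypass connection; removing it would reduce the block to a plain convolutional block, matching the distinction drawn above.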

Squeeze-and-Excitation Module
Convolutional neural networks extract features by combining spatial and channel-wise information [83]. The SE module aims to improve the representational capacity of the model's channel associations. The convolution procedure yields many feature maps, a few of which may carry largely redundant information. The SE block performs feature recalibration to strengthen the valuable features while suppressing the less valuable ones. First, each feature map is squeezed into a single descriptor, producing a weight vector. In the excitation step, fully connected layers followed by a sigmoid activation redistribute the feature weights; gradient descent guides this redistribution. The resulting weights then rescale the feature maps. In this investigation, the SE block was placed after BN in each residual block to recalibrate the feature maps obtained from the stacked layers. Figure 4 depicts the SE-ResNet component's overall structure and functionality.
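The squeeze-and-excitation steps can be sketched in Keras as follows. The reduction ratio and the ELU in the bottleneck layer are assumptions (the original SE paper uses ReLU there); the paper does not specify these details.

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_module(x, reduction=8):
    """Squeeze-and-excitation for a (time, channels) feature map.

    Squeeze: global average pooling yields one descriptor per channel.
    Excitation: two fully connected layers ending in a sigmoid produce
    per-channel weights in (0, 1), which rescale the feature maps.
    """
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling1D()(x)                   # squeeze
    s = layers.Dense(channels // reduction, activation="elu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)      # excitation
    s = layers.Reshape((1, channels))(s)
    return layers.Multiply()([x, s])                         # recalibration

inputs = tf.keras.Input(shape=(200, 64))
outputs = se_module(inputs)
model = tf.keras.Model(inputs, outputs)
```

Because the sigmoid output multiplies each channel, uninformative (redundant) channels can be driven toward zero while useful channels pass through nearly unchanged.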

Activation Function
The activation function is an essential component that introduces nonlinearity into the model. Without activation functions, a network struggles to fit nonlinearly distributed data; the activation function thus greatly improves a network's capacity to fit the data. The activation functions utilized in this study are as follows.
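The definitions themselves appear to have been lost in extraction. Since ELU is the activation used throughout the model, here is its standard definition as a sketch (the α = 1 default is an assumption; the paper does not state its value):

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (exp(x) - 1) otherwise.

    Unlike ReLU, ELU has a nonzero gradient for negative inputs,
    which helps avoid dead units and can speed up learning.
    """
    return np.where(x > 0, x, alpha * np.expm1(x))

values = elu(np.array([-2.0, -1.0, 0.0, 1.0, 2.0]))
```

For positive inputs ELU is the identity, so it behaves like ReLU there; the difference is the smooth, saturating negative branch.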

Evaluation Metrics
User identification can be viewed as a multiclass classification problem. Accuracy, F1-score, and Equal Error Rate are commonly used performance indicators for evaluating and comparing identification systems. These indicators are derived from a confusion matrix that summarizes the model's predictions.
Consider a multiclass classification problem with a set A of n class labels {C_1, C_2, C_3, ..., C_n}. The confusion matrix for this problem is an n × n matrix, presented in Figure 5. Each row of the matrix represents the instances of an actual class, while each column represents the instances of a predicted class; an element C_ij at row i and column j gives the number of instances whose actual class is i and predicted class is j. The true positives TP(C_i), false positives FP(C_i), false negatives FN(C_i), and true negatives TN(C_i) can all be derived from the confusion matrix and used to produce the performance metrics for each class label C_i.
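The per-class counts follow directly from the row/column convention above (rows = actual, columns = predicted); the toy 3 × 3 matrix is illustrative only:

```python
import numpy as np

def per_class_counts(cm, i):
    """TP/FP/FN/TN for class i from an n x n confusion matrix
    (rows = actual class, columns = predicted class)."""
    tp = cm[i, i]
    fn = cm[i, :].sum() - tp   # actual i, predicted as something else
    fp = cm[:, i].sum() - tp   # predicted i, actually something else
    tn = cm.sum() - tp - fn - fp
    return tp, fp, fn, tn

cm = np.array([[5, 1, 0],
               [2, 7, 1],
               [0, 0, 4]])
tp, fp, fn, tn = per_class_counts(cm, 0)  # tp=5, fp=2, fn=1, tn=12
```

Accuracy and per-class F1 scores are then computed from these four counts in the usual way.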
A biometric-based user identification approach makes an error when it accepts an invalid individual or fails to recognize a genuine one. The False Acceptance Rate (FAR) and False Rejection Rate (FRR) are the measures most commonly exploited to quantify these errors. The Equal Error Rate (EER) represents the rate at which FAR and FRR become equal; hence, a lower EER indicates higher accuracy.
The typical technique for assessing FAR and FRR for multiclass classifiers is to convert the multiclass classification problem into multiple binary classifications, so that each class has its own FAR and FRR error values. The EER is calculated as EER = (FAR + FRR)/2 at the operating point where |FAR − FRR| is smallest.
Table 3. Performance metrics for a multiclass confusion matrix.
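That EER computation can be sketched as follows; the error curves here are synthetic placeholders, not values from the paper.

```python
import numpy as np

def equal_error_rate(thresholds, far, frr):
    """EER: (FAR + FRR) / 2 at the threshold where |FAR - FRR| is smallest."""
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]

# Toy monotone error curves (illustrative only).
thresholds = np.linspace(0.0, 1.0, 11)
far = np.linspace(1.0, 0.0, 11)   # FAR falls as the threshold tightens
frr = np.linspace(0.0, 1.0, 11)   # FRR rises as the threshold tightens
eer, t = equal_error_rate(thresholds, far, frr)
```

For these symmetric curves the crossing point is at the middle threshold, where FAR = FRR = 0.5.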


Experimental Results
This section presents the results of all experiments we conducted to find effective deep learning models for sensor-based user identification. The UT-Smoke and WISDM-HARB datasets were used as the two benchmark datasets for person identification from smartwatch sensing data. The deep learning models were evaluated using accuracy, the F1-score, and confusion matrices.

Software Configuration
Google Colab Pro+ [84] was utilized in this investigation, with a Tesla V100-SXM2-16GB graphics processing unit to accelerate the training of the deep learning models. The deep learning models, including 1D-ResNet-SE, were implemented in Python (version 3.9.1) with a TensorFlow backend [85] and CUDA (8.0.6). The explorations relied on the following Python libraries:
• Numpy and Pandas were used for data management when reading, manipulating, and interpreting the sensor data.
• Matplotlib and Seaborn were used for plotting and displaying the outcomes of data exploration and model assessment.
• Scikit-learn (Sklearn) was used as a library for sampling and data generation.
• TensorFlow, Keras, and TensorBoard were used to implement and train the deep learning models.

Experimental Findings
The UT-Smoke and WISDM-HARB datasets were used to validate the developed approach against baseline deep learning methods. The following subsections present the experimental findings for these deep learning approaches trained on smartwatch sensing data from the benchmark datasets. The hyperparameters of all models used in this study are summarized in Appendix A.

UT-Smoke
The UT-Smoke dataset was used to collect smartwatch sensor data from 11 participants.
Table 4 lists the four categories of physical activities: Smoking, Eating, Drinking, and Inactive. The deep learning models were evaluated using classification performance indicators (accuracy and F1 measurements).
Several combinations of smartwatch sensor data and deep learning techniques, including CNN and the proposed 1D-ResNet-SE method, were investigated on the UT-Smoke dataset. Among the classification results for the DL models listed in Table 4, our 1D-ResNet-SE model achieves F1 scores of 97.24% for smoking, 98.13% for eating, and 96.44% for drinking when employing accelerometer, gyroscope, and magnetometer data together. The proposed 1D-ResNet-SE attains greater accuracy and F1 scores than the other smartwatch sensor variations. We can therefore infer that the proposed approach recognizes smartwatch users effectively from complex hand movements.

WISDM-HARB
As a second dataset, we used the WISDM-HARB dataset. Smartwatch sensor readings from 44 persons performing 18 physical activities are included in this dataset. This dataset's classification effectiveness is summarized in Tables 5-7.
With the WISDM-HARB dataset, we conducted an extensive analysis using two baseline DL models and the proposed 1D-ResNet-SE model. Three separate sensor configurations of the smartwatch data were used. As shown in Tables 5-7, the proposed approach obtained the highest F1 scores (>95%) for "Clapping" and "Teeth" when utilizing both accelerometer and gyroscope data.
Table 5. Recognition effectiveness on classifier evaluation of deep learning models using WISDM-HARB dataset (Acc. and Gyro. sensors).
Table 6. Recognition effectiveness on classifier evaluation of deep learning models using WISDM-HARB dataset (Acc. sensor).

Research Discussion
This study aimed to present a deep learning-based framework for identifying users through complicated hand movements using a smartwatch. The proposed approach was evaluated against two distinct benchmark datasets comprising sensor data of various physical human activities acquired by smartwatch motion sensors (accelerometer, gyroscope, and magnetometer). The 1D-ResNet-SE model outperformed previous standard deep learning techniques for smartwatch-based user identification according to experimental outcomes. The 1D-ResNet-SE model uses shortcut connections to resolve the network's vanishing gradient issue. The proposed model includes SE-ResNet blocks consisting of Conv1D layers, BN layers, ELU layers, squeeze-and-excitation (SE) modules, and a shortcut connection. By combining spatial and channel-specific data, the SE-ResNet block improves identification performance and hierarchically extracts features.

Impact of Squeeze-and-Excitation Modules
It was hypothesized that the squeeze-and-excitation (SE) module might enhance a deep learning model's channel representational capability. The convolutional procedures produce numerous feature maps, a few of which may contain repetitive information. The SE module performs feature recalibration to strengthen the significant attributes while suppressing the less effective ones. To explore how the SE module affected the results, additional experiments compared the proposed 1D-ResNet-SE model against a modified model with the SE component removed.
To analyze the improvement, a statistical analysis was performed to determine whether there are significant differences in accuracy between the baseline 1D-ResNet model and the proposed 1D-ResNet-SE. As suggested in [86], we performed the Wilcoxon test [87], a non-parametric statistical test for pairwise comparison. The null hypothesis H_0 is: "There is no significant difference between the model performances." At a significance level of α = 0.05, a result is statistically significant, and H_0 is rejected, when the p-value < 0.05. Tables 8 and 9 report the statistical analysis performed via the Wilcoxon test on the UT-Smoke and WISDM-HARB datasets, respectively. On the UT-Smoke dataset, the test reveals that the SE module significantly improves the accuracy of smartwatch-based user identification using sensor data from the smoking and drinking activities. On the WISDM-HARB dataset, the results similarly reveal that the SE module improves user identification with statistical significance for the typing, writing, clapping, eating sandwiches, and drinking activities.
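The paired Wilcoxon comparison can be reproduced with SciPy. The per-fold accuracies below are hypothetical placeholders, not values from Tables 8 and 9.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-fold accuracies from 10-fold CV (not the paper's values).
resnet    = np.array([0.921, 0.934, 0.928, 0.930, 0.925,
                      0.933, 0.927, 0.929, 0.931, 0.926])
resnet_se = np.array([0.941, 0.948, 0.944, 0.946, 0.940,
                      0.949, 0.943, 0.945, 0.947, 0.942])

# Paired, non-parametric test; H0: no difference in model performance.
stat, p_value = wilcoxon(resnet, resnet_se)
significant = bool(p_value < 0.05)  # reject H0 at alpha = 0.05
```

Because the test is paired, the two accuracy arrays must come from the same cross-validation folds.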

Impact of Sensor Combinations
This task examines each sensor's contribution to smartwatch-based user identification. Using accelerometer and gyroscope data as independent inputs, we evaluated the efficiency of the proposed 1D-ResNet-SE model. For all hand-based activities, raw accelerometer data yielded a superior F1 score with the proposed model compared to gyroscope data. To analyze the impact of sensor combinations, we utilized the Friedman aligned ranks test [88], a non-parametric statistical test for comparing significant differences. In addition, we applied the Finner post-hoc test [89] with a significance level of α = 0.05 to examine whether the differences in model accuracy were statistically significant.
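The comparison across sensor configurations can be sketched with SciPy's Friedman test. Note this is the classical Friedman test, not the aligned-ranks variant used in the paper (which jointly ranks all observations and is not available in SciPy); the per-activity accuracies are hypothetical placeholders.

```python
from scipy.stats import friedmanchisquare

# Hypothetical per-activity accuracies for three sensor settings
# (illustrative values, not the paper's results).
acc_only  = [0.91, 0.88, 0.90, 0.87, 0.92, 0.89]
gyro_only = [0.85, 0.82, 0.86, 0.81, 0.87, 0.84]
acc_gyro  = [0.95, 0.93, 0.94, 0.92, 0.96, 0.93]

# H0: all sensor configurations perform equally across activities.
stat, p_value = friedmanchisquare(acc_only, gyro_only, acc_gyro)
significant = bool(p_value < 0.05)
```

When the omnibus test rejects H0, a post-hoc procedure (such as the Finner test mentioned above) is then applied to locate which pairs of configurations differ.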
Tables 10 and 11 present the statistical analyses, performed with non-parametric comparisons that relate to the accuracy metrics of the 1D-ResNet-SE using different sensor data for user identification. The statistical results indicate that the accuracy performance of the 1D-ResNet-SE can be improved significantly by using both accelerometer and gyroscope for smoking, drinking, and eating activities.

Comparison with Previous Works
The proposed 1D-ResNet-SE model is compared with models previously trained on the same dataset (WISDM-HARB). Previous research [17] revealed that a machine learning technique, Random Forest (RF), can achieve high-performance user identification using smartwatch sensors. That work used the stratified 10-fold cross-validation approach, which we also employed in our study. Table 12 outlines the comparative findings, which indicate that the proposed 1D-ResNet-SE model achieved better accuracy than the previous model for most of the activities.

Conclusions and Future Studies
Using complicated hand gestures and a smartwatch, this study proposes a heterogeneous framework for user identification. The system was evaluated on two independent benchmark datasets comprising sensor data from smartwatch motion sensors (accelerometer, gyroscope, and magnetometer) acquired during diverse physical activities. Three deep learning models were used to classify each dataset's sensor data: the standard CNN and LSTM, and our proposed 1D-ResNet-SE model.
Experimental outcomes were measured with metrics such as accuracy and the F-measure, and the classifiers were compared on these metrics. Across both datasets, the proposed 1D-ResNet-SE classifier outperformed every other classifier by a wide margin. On the UT-Smoke dataset, complex hand movements such as eating, smoking, and drinking delivered high user identification performance; using all three smartwatch sensors (accelerometer, gyroscope, and magnetometer) to classify eating behavior produced excellent results. Each DL classifier employed in this study performed well with accelerometer data when its identification capability was evaluated per smartwatch sensor; alternatively, the gyroscope and magnetometer can also be used to identify individuals. Similarly, on the WISDM-HARB dataset, the three DL classifiers were examined using smartwatch sensor data from 18 physical activities, and the 1D-ResNet-SE classifier surpassed the baseline DL classifiers. User identification also offered valuable insights into the nature of users' actions, and a smartwatch proved an effective device for this kind of identification.
Even though the existing smartwatch-based user identification method achieves good results, future studies might benefit from exploring alternatives to the proposed solution. Another option is to include a wider range of activities, such as more complicated and transitional tasks, within a systematic framework to improve user identification. In the future, a complete smartphone and smartwatch dataset covering numerous smartphone placements on the body could be evaluated, since the current study investigates smartwatch sensor data at only one position. Position-based user identification can then be used to enhance identification outcomes.