Investigation of the Impact of Damaged Smartphone Sensors’ Readings on the Quality of Behavioral Biometric Models

Cybersecurity companies from around the world use state-of-the-art technology to provide the best protection against malicious software. Recent times have seen behavioral biometry become one of the most popular and widely used components in MFA (Multi-Factor Authentication). Its effectiveness and lack of impact on UX (User Experience) have made it increasingly popular in sectors that handle confidential data, such as banking, insurance, government, and the military. Although behavioral biometric methods show a high degree of protection against fraudsters, they are sensitive to the quality of the input data. The selected behavioral biometrics are strongly dependent on mobile phone IMU sensors. This paper investigates the harmful effects of gaps in data on the behavioral biometric model's accuracy in order to propose suitable countermeasures for this issue.


Introduction
According to the FBI's (Federal Bureau of Investigation) 2021 report on Internet crime [1], the number of phishing attacks reported to the IC3 (Internet Crime Complaint Center) doubled in 2021 (241,342 incidents) compared to 2020 (114,702 incidents) and was almost ten times higher than in 2019 (26,379 incidents). This data clearly shows how popular phishing attacks [2,3] are and how the demand for phishing countermeasures is growing in the government, banking, and military sectors. Although a multitude of companies and governments organize staff training on cybersecurity, accounts are still being hacked as fraudster attacks become smarter and better targeted [4,5]. Fortunately, even if the user's login and password have been voluntarily provided to the fraudster, there are still a number of ways to protect accounts against being hijacked; one of them is behavioral biometrics. Smartphone sensors (for example, accelerometers, magnetometers, and gyroscopes) find many applications in mobile apps, from entertainment (mobile games) to monitoring a user's behavior (pedometers, sleep monitoring, and others). Recently, smartphone sensors have found applications in more advanced systems such as health monitoring [6] or cybersecurity. As the consumption of multimedia and mobile resource access rises day by day [7], the topic of behavioral biometry and its impact on cybersecurity has recently been present in many research papers covering both desktop [8] and mobile [9] device usage.
The latest studies [10,11] have shown that the use of behavioral biometrics has become an increasingly popular part of MFA (Multi-Factor Authentication) [12]. Over the last couple of years, institutions handling confidential data have been more prone to reach for users' behavioral patterns (e.g., keyboard strokes, mouse movements, or mobile device handling) when implementing identity theft countermeasures, as this sort of data does not require any additional user involvement that would harm the UX (User Experience) [13].
Moreover, behavioral biometric models show high resistance to fraudsters, as their behavior differs vastly from what legitimate users tend to do; as such, the combination of suspicious activity on a user's account (e.g., a transfer of a high amount) with unusual keyboard or smartphone readings may indicate an attack [14,15].
Although an individual user's behavioral biometric model is a powerful weapon against account hijacking, it is still vulnerable to low-quality data delivered from the devices in use. Swapping mobile phones or damaging certain sensors may reduce the quality of fraud detection. For the sake of maintaining the high quality of behavioral biometry authentication services, it is a must to implement precautionary rules. This paper compares how exemplary behavioral models based on aggregated accelerometer and gyroscope readings deal with incomplete anonymized user data. The first stage of applying quality-drop countermeasures would be labeling certain users' behavioral data readings as "low quality" to prevent data damage from reducing the model's classification accuracy below established thresholds. The next stages would be more complex, for example, providing user models resistant to sensor damage or applying additional safe-mode pipelines.
The most recent papers published on the matter of behavioral biometrics take an experimental approach to data and report results that do not cover the difficulties of real-life scenarios. The following analysis presents a real-life, industry-based case study on behavioral authentication for one of the leading national banks. This paper covers all the data processing pipeline issues and problems regarding the quality and quantity of certain users' data, as well as the struggle encountered with the use of multiple devices and OS versions.
At the beginning of the manuscript, the dataset is introduced; then, the mathematical background of the behavioral model (input vectors, structure, hyperparameters, etc.) is presented. Finally, the models' input vectors are artificially disturbed with commonly occurring data damage, and their quality is examined to verify whether the presented solution is susceptible to certain types of data absence.
The related papers the authors highly recommend getting familiarized with are [16], in which researchers propose a continuous authentication system for smartphone user classification based on interactions with the device, and [17], which provides BehavePassDB, a public database for benchmarking mobile behavioral biometric solutions. This database can be used as a sandbox for testing new features and classification algorithms before they are fed further data. The use of behavioral biometry in mobile devices provides reliable security at zero cost where UX is concerned, and numerous researchers emphasize the importance of maintaining UX of the highest quality [18]. It is the clients themselves (banks and other institutions covered by behavioral biometry cybersecurity solutions) that insist on keeping the system user-friendly. Additionally, as the global COVID-19 pandemic and its repercussions caused severe changes in the use of digital resources, behavioral biometry has proven to be a high-quality cyberattack countermeasure in fields where other security systems have failed [19].

Materials and Methods
The data used for this research was acquired from the accelerometer and gyroscope sensors of smartphone devices running banking applications on the Android operating system [20], coming from users of one of the leading national banks. Sensor readings were collected from the beginning until the end of use of the banking application. In this case, the data was sent to the upstream node and evaluated in real time. Whenever fraudulent behavior is detected at any stage of the session, an alert signal is sent to the mobile application provider (usually the bank's security department). The alerting system does not take any additional metadata apart from a historical behavioral profile: no information on age, gender, banking history, device type, OS, or other data is stored or analyzed. The following block diagram presents how the information exchange between the client and the provided security system is organized (Figure 1).
Figure 1. Data exchange pipeline for the behavioral biometry security system.


Dataset Structure
Users u perform a certain number SN_u of connections called sessions (1) with the server via the banking application:

S_u = {s_u^1, s_u^2, ..., s_u^{SN_u}} (1)

Each user session s_u^i, which lasts TS_{s_u^i} seconds, consists of feature vectors whose count is denoted by FC_{s_u^i} (2). Every single user feature vector D in the entire population consists of a fixed number (N = 6) of features d (3):

D = (d_1, d_2, ..., d_N) (3)

The feature vector used for the analysis was formed in the following manner: accelerometer x axis; accelerometer y axis; accelerometer z axis; gyroscope x axis; gyroscope y axis; gyroscope z axis.
For the sake of applying user classification, the sensor readings from a single session s_u^i are aggregated column-wise into windows W of interval W_I = 20 s (4):

W = (w_1, w_2, ..., w_M), where M = ⌊TS_{s_u^i} / W_I⌋ (4)

Each element of the window vector W denotes a single window in which all feature vectors from a certain interval are stored. The sampling frequency is not uniform and varies with the user's device and data pipeline processing issues (e.g., packet losses). The aggregates convert the data into smaller chunks and immunize it against sampling-frequency variation.
t_0 indicates the session starting time, and t_{D_x} indicates the time which passed from t_0 until the creation of a feature vector D_x (5). The number of all feature vectors in a single window is denoted by N_M (6).
The following Formulas (7)-(9) show how the aggregated feature vector F is created. The total length of the aggregated feature vector equals the length of the original feature vector times the number of all the aggregating methods (8): standard deviation, arithmetic mean, amplitude, and median, which is denoted by 4N. The formula on which the aggregates are based is shown below (10):
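The column-wise aggregation described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code; it reduces each of the N = 6 axis columns of a window with the four aggregating methods, yielding a 4N-element vector:

```python
import statistics

# Order of axes matches the feature vector defined in Dataset Structure.
AXES = ["acc_x", "acc_y", "acc_z", "gyr_x", "gyr_y", "gyr_z"]

def aggregate_window(window):
    """window: list of feature vectors, each holding N = 6 axis readings."""
    features = []
    for axis in range(len(AXES)):
        column = [vec[axis] for vec in window]
        features.extend([
            statistics.pstdev(column),   # standard deviation
            statistics.fmean(column),    # arithmetic mean
            max(column) - min(column),   # amplitude
            statistics.median(column),   # median
        ])
    return features

window = [[0.1, 0.2, 9.8, 0.0, 0.0, 0.1],
          [0.3, 0.1, 9.7, 0.1, 0.0, 0.2],
          [0.2, 0.3, 9.9, 0.0, 0.1, 0.1]]
agg = aggregate_window(window)
print(len(agg))  # 24, i.e. 4N aggregated features per window
```

Aggregating per window rather than per reading is what makes the pipeline insensitive to the non-uniform sampling frequency mentioned above.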

Data Preparation
In order to ensure reliable training and testing datasets, only those users with more than a certain number of unique sessions (SN_u > 12), and for whom a model could be built (undamaged data), were considered. Each viable session had to last for a fixed amount of time TS_min or longer (TS_{s_u^i} > TS_min = 100 s) to assure the occurrence of at least WC_min = 5 windows lasting for W_I. Each window generated a sub-score; further sub-score processing resulted in score generation. The user choice rule described above is presented in a flow chart (Figure 2).
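The user selection rule can be sketched as below. This is a minimal illustration under the stated thresholds (SN_u > 12, TS_min = 100 s, WC_min = 5); the function and constant names are ours, not the paper's:

```python
SN_MIN = 12    # minimum number of unique sessions per user
TS_MIN = 100   # minimum session duration in seconds
W_I = 20       # window interval in seconds
WC_MIN = 5     # minimum windows per session (TS_MIN / W_I)

def viable_sessions(session_durations):
    """Keep only sessions lasting longer than TS_MIN seconds."""
    return [ts for ts in session_durations if ts > TS_MIN]

def user_qualifies(session_durations):
    sessions = viable_sessions(session_durations)
    # every viable session is long enough for at least WC_MIN windows
    assert all(ts // W_I >= WC_MIN for ts in sessions)
    return len(sessions) > SN_MIN

print(user_qualifies([120] * 13))  # True: 13 viable sessions of 120 s
print(user_qualifies([120] * 10))  # False: too few sessions
```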

From a total of 264 users, only 127 fulfilled the requirements (118 users did not meet the data quantity needs and 19 users failed training, resulting in the generation of a low-quality model). The exemplary aggregates W of the sensor readings of 4 random users are presented in Figure 3.

Each session consisted of at least 5 sub-scores, varying from 0 to 1 (the similarity measure coming from the classifier's output), indicating how user-like the behavior was. In order to correctly evaluate the session, the final score, consisting of M sub-scores, was calculated as their average (11). Whenever the final score exceeded the user-defined threshold (set based on certain business requirements of the bank), the session was considered fraudulent (12).
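The scoring rules (11)-(12) can be sketched as follows. The threshold value here is illustrative; in the paper it is set per the bank's business requirements:

```python
def session_score(sub_scores):
    """Final score (11): the arithmetic mean of at least 5 window sub-scores."""
    assert len(sub_scores) >= 5, "a session must yield at least 5 sub-scores"
    return sum(sub_scores) / len(sub_scores)

def is_fraudulent(sub_scores, threshold=0.5):
    """Decision rule (12): flag the session when the score exceeds the threshold."""
    return session_score(sub_scores) > threshold

print(is_fraudulent([0.9, 0.8, 0.95, 0.7, 0.85]))  # True  (score 0.84)
print(is_fraudulent([0.1, 0.2, 0.05, 0.15, 0.1]))  # False (score 0.12)
```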
To provide in-depth data insight, the feature importance [21][22][23] for the 24 aggregates was calculated. The feature significance was estimated using the XGBoost (eXtreme Gradient Boosting) classification algorithm, with its hyperparameters heuristically optimized [24]. The boxplot in Figure 4 shows the feature importance distribution, sorted by descending average importance.
By analyzing the data presented in Figure 4, we can notice a relationship: the accelerometer data worked best with averages, while the gyroscope provided the best diagnostic value (higher average and median feature importance) with standard deviations. This is caused by the nature of the data provided by those sensors; a certain user distinguishability depending on the sensor type and aggregation method can be seen in Figure 5 (accelerometer) and Figure 6 (gyroscope). In order to provide a clearer visualization, 4 randomly chosen users were highlighted.

XGBoost Training
To train a model, the given user's data (sub-sessions) were labeled as authorized sessions, and the data of the remaining 126 users were labeled as fraudulent ones. Subsequently, the data set was passed to the XGBoost classifier, a boosting estimator based on decision trees [25][26][27][28][29][30], in which the trees are expanded into a forest where each estimator is built on the residual value of the previous classification. To reach the best possible model, the XGBoost hyperparameters were tuned for the whole population [31]. The method used for reaching the best possible model quality is presented in a flow chart in Figure 7. The method trains the users stored in a queue with a certain hyperparameter set. When all the users are trained, their mean model quality is calculated and treated as the fitness. This procedure was repeated a fixed number of times with a different set of hyperparameters (produced by the evolutionary algorithm). The best hyperparameter set was kept unchanged and treated as the target set.
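A minimal sketch (not the authors' code) of the one-vs-rest labeling described above; the user names and feature values are illustrative:

```python
def build_training_set(windows_by_user, target_user):
    """Label the target user's windows 0 (authorized), everyone else's 1 (fraudulent)."""
    X, y = [], []
    for user, windows in windows_by_user.items():
        for w in windows:
            X.append(w)
            y.append(0 if user == target_user else 1)
    return X, y

# With real data, one would then fit the booster, e.g.:
#   import xgboost as xgb
#   model = xgb.XGBClassifier(**tuned_hyperparameters).fit(X, y)

data = {"user_a": [[0.1, 0.2]], "user_b": [[0.5, 0.6], [0.7, 0.8]]}
X, y = build_training_set(data, target_user="user_a")
print(y)  # [0, 1, 1]
```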

The classifier hyperparameters were optimized using an evolutionary algorithm whose outcome is presented in Table 1 (non-listed parameters were set to default). The evolutionary algorithm started with a population of 10 candidate sets of 6 hyperparameters, each value taken randomly from the uniform distribution with upper and lower boundaries denoted as "min" and "max" in Table 1, and performed 10 steps of vector crossing (selecting 2 equal subparts of two different hyperparameter dictionaries and combining them) and one-value mutation (randomly reselecting a particular parameter's value from the specified domain), leaving only the top set of hyperparameter values at each step. Its fitness function was the same as the classifier's objective function, denoted by Formula (13). The vector that provided the best model quality over the entire population was kept for further consideration. The exemplary genotype division and mutation are presented in Figure 8.
The model validation data set consists of 30% of the total data. To reproduce real-life model usage, as well as to prevent data leakage, the evaluation was run only on the most recent readings.
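The search loop above can be sketched as follows. This is a simplified illustration: the bounds and the toy fitness function are ours (the paper's fitness is the mean model quality over all users), and a real `max_depth` would be rounded to an integer:

```python
import random

# Illustrative bounds; the paper's "min"/"max" values are given in Table 1.
BOUNDS = {"eta": (0.01, 0.5), "max_depth": (2, 10), "subsample": (0.5, 1.0)}

def random_candidate(rng):
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in BOUNDS.items()}

def crossover(a, b):
    keys = sorted(BOUNDS)
    half = len(keys) // 2                  # two equal sub-parts of the dictionaries
    return {k: (a if i < half else b)[k] for i, k in enumerate(keys)}

def mutate(c, rng):
    k = rng.choice(sorted(BOUNDS))         # reselect one parameter value
    lo, hi = BOUNDS[k]
    return {**c, k: rng.uniform(lo, hi)}

def evolve(fitness, steps=10, pop_size=10, seed=0):
    rng = random.Random(seed)
    pop = [random_candidate(rng) for _ in range(pop_size)]
    for _ in range(steps):
        best = max(pop, key=fitness)       # keep only the top set at each step
        pop = [mutate(crossover(best, random_candidate(rng)), rng)
               for _ in range(pop_size - 1)] + [best]
    return max(pop, key=fitness)

# toy fitness standing in for the mean model quality over all trained users
toy_fitness = lambda c: -abs(c["eta"] - 0.1) - abs(c["subsample"] - 0.9)
best = evolve(toy_fitness)
print(best)
```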
The objective function used for the model training is represented by Formula (13), where TP stands for "true positive" prediction, TN for "true negative", FP for "false positive", and FN for "false negative". To illustrate what an undamaged sensor's model quality looks like, let us examine Figure 9, showing the model quality distribution in terms of the objective function and the receiver operating characteristic's area under the curve (ROC-AUC) for the entire population. The higher the value of both the objective function and the ROC-AUC, the better the model quality. Any damage done to the models (for example, by training on data with missing values) will result in the decay of both metrics.
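The exact form of the objective is given in Formula (13); as a hedged sketch, the code below only shows its building blocks, together with sensitivity and specificity, which the Results section reports. Here "positive" denotes a fraudulent session:

```python
def confusion(y_true, y_pred):
    """Count TP, TN, FP, FN over 0/1 labels (1 = fraudulent)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def sensitivity(tp, fn):
    return tp / (tp + fn)   # true positive rate

def specificity(tn, fp):
    return tn / (tn + fp)   # true negative rate

tp, tn, fp, fn = confusion([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
print(sensitivity(tp, fn))  # 0.5
```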

Results
The experiment investigating the harmful effects of sensor damage on behavioral biometric model quality was run by replacing certain axis data with zero values. The idea behind conducting such an analysis derives from the necessity of knowing whether the model output is still valid, i.e., whether the authentication provided by the behavioral biometry can be trusted. For minor damage (e.g., one gyroscope axis), the model could still provide valuable information, while deleting the data coming from all the axes may render the model utterly useless. To find those boundaries, the two most common data collection failures were considered: damaging the data from one axis of the sensor and damaging the data from all axes of the sensor. Table 2 provides information on how the average model quality decays due to the zeroing of certain vector values.
Table 2. The influence of sensor data damage on average model quality.
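The damage simulation can be sketched as below. This is an illustrative reconstruction (not the authors' code) of zeroing one axis or a whole sensor; the axis names mirror the feature vector order defined earlier:

```python
AXES = ["acc_x", "acc_y", "acc_z", "gyr_x", "gyr_y", "gyr_z"]

def zero_axes(vectors, damaged):
    """Return a copy of the feature vectors with the damaged axes zeroed."""
    idx = [AXES.index(a) for a in damaged]
    return [[0.0 if i in idx else v for i, v in enumerate(vec)]
            for vec in vectors]

session = [[0.1, 0.2, 9.8, 0.0, 0.1, 0.2]]
print(zero_axes(session, ["acc_z"]))                    # one damaged axis
print(zero_axes(session, ["gyr_x", "gyr_y", "gyr_z"]))  # whole sensor lost
```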

The presented table explicitly proves that the lack of even one axis may severely damage the quality of predictions. What is also worth noticing is that the true negative rate dropped heavily due to the data changes, while the true positive rate remained almost the same. Such behavior causes the models to produce an increased amount of false positive output, which leads to overwhelming the system with false fraudster alerts (correct user detection remains in most cases the same, though). The histograms presented in Figure 10 (sensitivity) and Figure 11 (specificity) clearly show this relationship: the higher the specificity, the less likely the model is to classify a fraudster as a user, and the higher the sensitivity, the more reliable user detection becomes.

Figure 10. User sensitivity histogram with undamaged data (a) and data without accelerometer axis z (b).
Figure 11. User specificity histogram with undamaged data (a) and data without accelerometer axis z (b).

Discussion
The presented paper investigates commonly occurring issues with the mobile sensor data used for behavioral biometry. To properly handle faulty data (e.g., after detecting zero values on a certain axis), it is necessary to know the model quality drop for damage to a particular axis. The results of the analysis showed that missing data on certain sensor axes may cause a major drop in the accuracy of the models used for user authentication. What is worth noticing, and what derives from both the feature importance plot and the damage influence table, is that the most harmful factor for user identification is the loss of accelerometer readings (especially the z axis: 34 p.p. compared to the original objective function). Such damage causes the same ROC-AUC drop as losing all of the gyroscope's readings (ROC-AUC lowered by 0.15). This is because the accelerometer's z axis provides user-distinctive data: it largely indicates at what position the smartphone or tablet is held by the user and how his/her grip changes over time. On the other hand, the gyroscope's z axis holds the lowest amount of user-distinctive data (the objective function dropping by only 9 p.p.). This may be because there is no substantial angular movement in this axis, or because the movements in it are repeatable over the entire population.

Conclusions
The true negative ratio is the most affected metric, which means that the model's ability to correctly distinguish a user from the rest of the population will not work well: the model will classify genuine user sessions as fraudulent ones. Such behavior will heavily deteriorate security systems by setting off false alerts. To prevent this, we can change the classification threshold level, increasing specificity at the cost of sensitivity; yet this approach will not improve the total model accuracy. Minor specificity drops should be compensated for by increasing the classification threshold by a predefined factor (which will increase the objective function), while higher drops should raise a flag indicating that the evaluation score is invalid. A different approach to dealing with lower-quality models rests in decreasing their weights in the authentication system. Usually, authorization via behavioral biometry uses several models that measure several types of activity; if we can assess a certain model's quality for certain data, we can lower the contribution of this model to the final score. Yet another way of evading the quality drop is using different classifiers: ones less susceptible to data damage, or ones using data preprocessing methods that immunize the models against data damage. We can also consider modifying the classification pipeline: whenever nothing but zero values is present on a specific axis, we can decide whether to evaluate the session or to skip the evaluation (for example, the authentication process is run only if sensitivity did not drop below 70% and specificity did not drop below 60% after introducing the certain-axis damage). The additional session evaluation block would take the model statistics, as well as the data structure, and assess the prediction reliability (Figure 12).
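The proposed safe-mode gate can be sketched as follows. This is our illustration of the rule above, with the 70%/60% floors taken from the text; the per-damage quality lookup table is a hypothetical structure that would be filled from an analysis like the one in Table 2:

```python
SENS_FLOOR = 0.70   # minimum acceptable sensitivity
SPEC_FLOOR = 0.60   # minimum acceptable specificity

def dead_axes(vectors):
    """Indices of axes whose readings are all zero in this session."""
    n = len(vectors[0])
    return [i for i in range(n) if all(vec[i] == 0.0 for vec in vectors)]

def should_evaluate(vectors, quality_after_damage):
    """quality_after_damage: lookup from damaged-axis tuple to (sens, spec)."""
    damaged = tuple(dead_axes(vectors))
    sens, spec = quality_after_damage.get(damaged, (1.0, 1.0))
    return sens >= SENS_FLOOR and spec >= SPEC_FLOOR

# e.g. accelerometer z (index 2) lost: specificity falls below the floor
quality = {(2,): (0.70, 0.45)}
session = [[0.1, 0.2, 0.0, 0.0, 0.1, 0.2],
           [0.3, 0.1, 0.0, 0.1, 0.0, 0.2]]
print(should_evaluate(session, quality))  # False: skip and flag the score invalid
```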

What may be worth noticing is that further examination of the model quality drop can be used to diagnose numerous device- or user-related issues, e.g., to identify device damage. If the user did not previously show any symptoms of classification problems and after some time generates numerous false alerts, we may conclude that the sensors no longer work correctly or that there are other issues (e.g., user illness or a malware attack). These assumptions, however, require further studies.
Future work will mostly be focused on building high-quality damaged-sensor handling pipelines. In case of damaged data occurrence, we must be able to quickly assess the quality of incoming information and to find an efficient way of detecting session hijacking, even if the most valuable information is lost due to data collection errors.
Additional study directions should focus on immunizing the system against erroneous data as well as on improving the system's overall quality by introducing novel noise-resistant classifiers and data processors. It should also be kept in mind that different devices (keyboard or mouse) and different features may not respond to missing data in the way this analysis has shown. Quality drops caused by missing data should be separately examined, and the countermeasures introduced for them may differ from the ones implemented for mobile devices.