A Framework for Continuous Authentication Based on Touch Dynamics Biometrics for Mobile Banking Applications

As smart devices have become commonly used to access internet banking applications, these devices constitute appealing targets for fraudsters. Impersonation attacks are an essential concern for internet banking providers. Therefore, user authentication countermeasures based on biometrics, whether physiological or behavioral, have been developed, including those based on touch dynamics biometrics. These measures take into account the unique behavior of a person when interacting with touchscreen devices, thus hindering identity fraud, because it is hard to impersonate natural user behaviors. Behavioral biometric measures also balance security and usability, since the measurement process can run transparently while the user interacts with the device. This paper proposes an improvement to Biotouch, a supervised Machine Learning-based framework for continuous user authentication. The contributions of the proposal comprise the utilization of multiple scopes to create more resilient reasoning models and their respective datasets for the improved Biotouch framework. Another contribution is the testing of these models to evaluate the impostor False Acceptance Rate (FAR). This proposal also improves the flow of data and computation within the improved framework. An evaluation of the proposed multiple-scope model provides results between 90.68% and 97.05% for the harmonic mean between recall and precision (F1 Score). The percentages of unduly authenticated impostors and errors of legitimate user rejection (Equal Error Rate (EER)) are between 1.88% and 9.85% for static verification (login) and user dynamics (post-login). These results indicate the feasibility of the proposed continuous multiple-scope authentication framework as an effective layer of security for banking applications, eventually operating jointly with conventional measures such as password-based authentication.


Introduction
Currently, at least 5 billion people use mobile telephones [1], including 3.2 billion people who use smartphones [2], and among them, 2 billion use their smartphones to access banking applications [3]. With the widespread adoption of these devices, there is a related growing number of malware specifically targeting mobile devices. As reported in a Kaspersky Lab report released in 2019, the number of attacks on mobile devices doubled in 2018, reaching more than 116.5 million [4]. This migration to mobile applications motivated the evolution of authentication methods over time, aiming to ensure fraud prevention, especially in the case of critical applications such as financial ones.
Commonly, the first interaction of a user with a mobile device and its applications is the authentication process. There are three traditional methods to authenticate a user: possession of something, knowledge of information, and biometrics, i.e., something that is part of the person's body or behavior [5]. Biometric authentication is an interesting means in this context. The main contributions of this work are the following:

1. A multiple-scope approach is used so that the authentication models are validated for different feature sets, with the best-performing scopes added to the framework.

2. Six different scopes were developed to improve the performance compared to the results previously presented in [11], which used only one scope.

3. With the selected scopes, the efficiency metric of the models presents a minimum F1 score of 90%.

4. Although the proposal validation uses templates of all users who participated in the data collection scenarios, only those models presenting a FAR below 10% were integrated into the new framework.
Considering this set of contributions, our results show that the new models created for the enhanced Biotouch framework provide higher resiliency when compared with the previous results achieved in [11].

Organization of This Work
This paper is organized as follows. Section 2 summarizes the related works and some considerations about their results. In Section 3, the improvements of the continuous authentication framework are discussed. Section 4 presents and discusses the validation results of the experiments with the new proposed framework. Lastly, Section 5 provides the conclusion and proposes future lines of work.

State-of-the-Art
In this section, a literature review describing the main concepts that support the development of the framework is presented. Additionally, the discussion of related work leads to a definition of the features used and a benchmark for the comparison of performance indicators for the proposed framework. It is also considered that a touch dynamics authentication framework must be composed of three phases: user enrollment, user authentication, and data retraining [12].

Security and Mobile Banking Applications
Online banking systems require efficient security models capable of identifying users and authorizing transactions, thus mitigating fraud [13]. As smart devices have become commonly used to access internet banking applications, these devices constitute appealing targets for fraudsters. Impersonation attacks are an essential concern for internet banking providers. In this context, the main challenge for electronic banking is ensuring the correct usage and verification of applications for banking security.
Typically, attacks against online banking systems exploit vulnerabilities inherent in people (social engineering and phishing), then gain control of the device (malware) and steal the credentials of a legitimate user (fake Web pages and malware) [13]. Therefore, user authentication countermeasures based on biometrics have been developed, including those based on touch dynamics biometrics. Biometric characteristics are specific to the user and difficult to copy, share, and distribute [14].

Continuous Authentication for Mobile Banking Applications Based on Touch Dynamics Biometrics
Touch dynamics biometrics refers to the process of measuring and assessing human touch actions on the touchscreen of mobile devices [12]. To characterize the biometric behavior when using a touchscreen device, an individual's biometric pattern is modeled based on information collected from the various sensors that make up modern smartphones, such as the accelerometer, ambient light sensor, compass, gyroscope, GPS, proximity sensor, touchscreen, and Wi-Fi [15].
In the context of mobile applications, behavioral biometrics emerged as a less intrusive biometrics model that can be captured implicitly. Additionally, it provides a greater balance between security and usability, especially because the touch behavior (information captured) is not part of the user's private information.
Active and continuous authentication can be defined as the continuous verification of a person's identity based on aspects of their behavior when interacting with a computing device [16]. The main characteristic of this authentication method is continuity: authentication is constant during the entire time that the user interacts with the device, by means of re-authentication tasks that occur periodically and transparently. The entire authentication process can be performed in the background without interrupting the user's activities [17].
Behavioral biometrics based on touch dynamics for authentication applications have been investigated in recent years. An experiment regarding data collection for authentication is presented in [7], aimed at continuous static verification, with a sample of 42 users. The devices used in the cited experiment were a Nexus 7 tablet and an LG Optimus L7 II smartphone, and the users needed to enter the same password (considered strong) 30 times in each session. In this work, the supervised machine learning algorithm that presented the best performance was Random Forest (RF), which resulted in 82.53% and 93.04% accuracies, respectively, for the sets of 41 and 71 selected characteristics.
In [8], an experiment for capturing gestures was proposed, where each volunteer interacted with an application by reading three documents about different subjects and by interacting with two images to find the differences. The purpose was to collect data from the interaction of users with a touchscreen, including horizontal and vertical sliding, through continuous dynamic verifications. Forty-one volunteers participated in this research, using five different smartphones, including the Droid Incr., Nexus One, Nexus S, and Galaxy S, all running Android 2.3.x. Using the supervised machine learning algorithm Support Vector Machine (SVM), the best performance results for EER were 0% for intra-sessions, 2-3% for inter-sessions, and 4% when the authentication occurred a week after enrollment.
More recent work specifically focused on banking applications in order to increase the spectrum of fraud identification. For instance, the authors of [18] developed a banking analogous application with continuous static verification by considering password typing and by evaluating it on a set with 95 participating volunteers and data captured from the touch interaction and sensors available on smartphones, obtaining a 96% accuracy with the RF algorithm. In [19], using a fuzzy-based classifier, a static and dynamic continuous authentication model was proposed with the data captured from touchscreen and accelerometer interactions in an application developed using the characteristics of a real mobile banking application and from use by 22 volunteers, giving an EER of 11.5%.

Location and Continuous Authentication
The inclusion of contextual information from a user in continuous authentication processes can contribute as an additional factor in verifying a user's usage pattern for a mobile application. Location is one type of contextual information that is used, for instance, in works such as [16,20] as one of the factors that make up so-called multimodal authentication schemes for mobile applications.
In [16], the user's location data were used in conjunction with the user's movement information and device usage. Two profiles were established for each user: one for weekdays and the other for weekends. The pattern was set based on each user's history, and the K-means algorithm was used to group the location data. In [20], the user's location pattern data were used in a proposal for a multimodal biometrics system that combined GPS information, stylometry, the use of the application, and the web search pattern.
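To make the clustering idea concrete, the sketch below groups a user's hypothetical location history with K-means and flags fixes far from every learned cluster center. This is only an illustration of the technique cited from [16], not that work's implementation; the data, tolerance, and function names are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical location history: GPS fixes clustered around two habitual
# places (e.g., home and work), in (latitude, longitude) degrees.
home = rng.normal([-23.55, -46.63], 0.01, size=(50, 2))
work = rng.normal([-23.60, -46.70], 0.01, size=(50, 2))
history = np.vstack([home, work])

# Group the history into location clusters, as in the profile-building step.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(history)

def location_is_familiar(fix, model, tol=0.05):
    """True when the new fix lies within `tol` degrees of a learned centre."""
    dists = np.linalg.norm(model.cluster_centers_ - fix, axis=1)
    return bool(dists.min() <= tol)

print(location_is_familiar(np.array([-23.551, -46.631]), kmeans))  # near a habitual place
print(location_is_familiar(np.array([-20.0, -40.0]), kmeans))      # anomalous location
```

In a multimodal scheme such as [16], this familiarity signal would be one local detector among several, with separate weekday and weekend profiles per user.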
As location constitutes information that generally remains consistent in a user's pattern of use of mobile applications, this contextual information contributes as additional data to validating a user's pattern in conjunction with other factors.

Data Fusion
In this paper, fusion is an approach used to combine data and information from multiple sources to improve the accuracy or performance of a biometric authentication method. The information from various sources can be combined in four different ways: image-level fusion, scoring-level fusion [12], decision-level fusion, and feature-level fusion [21].
In multimodal authentication frameworks, one of the issues that need to be resolved is how to merge the classification results obtained for each of the used modalities. To solve this issue, the work in [16] used the Fusion Decision Center technique, which collects decisions from a local detector and uses them to define whether the result is −1 or 1. The scoring approach can also be used for merging results, as noted in [22], where a decision center combines all of the scores of the modalities, generating an overall decision score.
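The two merging styles mentioned above can be sketched minimally as follows. This is our own illustration of the general idea, not the code of [16] or [22]; function names are assumptions.

```python
def decision_fusion(local_decisions):
    """Decision-level fusion: each local detector votes -1 (impostor) or
    +1 (legitimate); the sign of the sum gives the global decision."""
    return 1 if sum(local_decisions) > 0 else -1

def score_fusion(scores):
    """Score-level fusion: combine modality scores into an overall decision
    score (here, a simple arithmetic mean)."""
    return sum(scores) / len(scores)

print(decision_fusion([1, 1, -1]))   # majority of detectors accept
print(score_fusion([0.95, 0.91]))    # overall score for two modalities
```

Score-level fusion preserves more information than hard ±1 votes, which is why the framework proposed in this paper adopts a score average rather than a vote count.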

Proposed Model
As mentioned before, this work improves the Biotouch framework [11], seeking better performance and robustness by using a new combination of scopes and by adding a new step to the authentication process, the latter defined as the impostors' FAR (I_FAR) threshold. The new framework model aims to capture the features of a user's interaction with a mobile application via the touchscreen. The model considers two main verifications: one static and the other dynamic. Static verification (SV) occurs when a user types their password at login, while dynamic verification (DV) runs when the user interacts with the application after login.
The main objectives in the new version of the Biotouch authentication framework are to identify which of the proposed scopes, among six in the new version, perform better and to detect whether these scopes can be combined to generate better results than the previous ones found in [11]. The new scope results are checked based on the F1 score metric to validate which scopes are complementary to each other. Other goals include investigating if the inclusion of the new FAR test step for imposters makes the model more robust and verifying if the supervised machine learning algorithms that present the best results in [11] can be kept. These validations are important to understand the improvements brought to the quality of the authentication models in the new version of the Biotouch framework. Additionally, in order to investigate how the use of touch dynamics biometrics can reduce fraud in banking applications, we conducted an experiment with an application developed with characteristics similar to that of a real banking application. This application requires few touches with a short period of interaction with the user, but it serves the purposes of evaluation because, when such types of applications are attacked, it usually involves the loss of finances.

The New Framework Description
The proposed model involves checking the patterns of typing, sliding, and location. This model covers both the login time, named Moment 1, and the interaction with the application, named Moment 2. The framework is illustrated in Figure 1. The two moments of user authentication are characterized differently:
• Moment 1: typing the password, classified as SV;
• Moment 2: interacting with the application to carry out a transaction, classified as DV.
The rule for merging the classifier results is based on score, using the accuracy (AC) of the location and the F1 score for the password-typing pattern; these characteristics are considered SV. The pattern of interaction with the application is considered DV. Extending [11], one more step is included, i.e., the I_FAR check, and differently from that previous work, the evaluation metric is replaced by the F1 score, which is applied for both SV and DV to cover all of the new framework scopes. Thus, referencing Figure 1, the detailed steps of the new authentication framework are the following:

• Step 1: Capture the location and password-typing patterns;
• Step 2: Calculate the AC for the location pattern captured in Step 1 using the best model (which obtained AC ≥ 90% in the tests described hereafter), and calculate the F1 Score for the password-typing pattern captured in Step 1 using the best model (which obtained F1 ≥ 90% and I_FAR ≤ 10% in the tests);
• Step 3: Fuse the AC for the location pattern with the F1 Score for the password-typing pattern using the simple arithmetic mean;
• Step 4: If the result of Step 3 is a score below 90%, generate an alert indicating a possible impostor;
• Step 5: Capture the location and interaction patterns for the post-login activities;
• Step 6: Calculate the AC for the location pattern captured in Step 5 using the model that obtained AC ≥ 90%, and calculate the F1 Score for the post-login interaction pattern captured in Step 5 using the best model that obtained F1 ≥ 90% and I_FAR ≤ 10%;
• Step 7: Fuse the AC for the location pattern with the F1 Score for the post-login interaction pattern using the simple arithmetic mean;
• Step 8: If the result of Step 7 is a score below 90%, generate an alert indicating a possible impostor.
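The alert logic of Steps 3-4 and 7-8 can be sketched as follows. This is a minimal illustration under our own naming; the trained models that produce the AC and F1 inputs are described later in the text.

```python
THRESHOLD = 0.90  # alert threshold used in Steps 4 and 8

def fused_score(location_ac, pattern_f1):
    """Steps 3/7: simple arithmetic mean of location AC and pattern F1."""
    return (location_ac + pattern_f1) / 2

def check_moment(location_ac, pattern_f1):
    """Steps 4/8: raise an alert when the fused score falls below 90%."""
    score = fused_score(location_ac, pattern_f1)
    return {"score": score, "alert": score < THRESHOLD}

# Moment 1 (SV): location AC and password-typing F1 from Step 2.
print(check_moment(0.96, 0.92))   # score ~0.94 -> no alert
# Moment 2 (DV): location AC and post-login interaction F1 from Step 6.
print(check_moment(0.91, 0.85))   # score ~0.88 -> possible impostor
```

The same routine serves both moments; only the source of the F1 input differs (password typing for SV, post-login interaction for DV).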
The following subsections describe the methods used to create the models mentioned in these steps.

Data Collection
For data collection, an Android application named Biotouch was developed and published on the Play Store. The application consists of a registration screen, a login screen (L), a menu service screen (MS), and two more screens for each of the three available services: (a) the account screen (Cc), with an account menu screen (Cc1) and an account transaction screen (Cc2); (b) the transfer screen (T), with a transfer menu screen (T1) and a transfer transaction screen (T2); and (c) the payment screen (P), with a payment menu screen (P1) and a payment transaction screen (P2).
The number of volunteers who participated in the experiment was 51, and the collection period was two weeks. The number of generated templates ranged from 9 to 630 per user, depending on how many times the user interacted with the app screens, totaling 3443 templates used in the experiments, each composed of data vectors corresponding to each touch of the user on the screen. The templates were saved on the Firebase platform. The registration flow is detailed in Figure 2, and the service flow is detailed in Figure 3.
To start using the Biotouch application, the user needs to register a password with 6 to 8 numeric digits. The user identifier is transparently defined by the installation task, and no action from the user is necessary to inform an identifier at the time of registration since the Firebase platform provides an Instance ID that is used as a unique identifier for each instance of the application [23]. Thus, this Instance ID is used as the user's unique identifier.
During data collection, users were asked to interact with the application at least five times a day during the experiment period, no matter which of the flows were executed. The user had to interact with five different screens to complete each selected flow. Each usage template is then represented by the events generated by the user's touches during an interaction with each screen. Therefore, the number of data vectors used for each authentication can vary.
The smartphone models used in the experiment were the SM-G973F, SM-G9600, and LG-. Regarding the weakness or attack surface of the captured user templates, the data were sent to the Firebase platform using HTTPS with the latest version of TLS. During the collection period, the data were available in memory and could only be captured if an attacker had access to the device's memory.

Selection of Features
The extraction of touch dynamics biometric features can be performed in different ways: spatial, movement [12], temporal, dynamic, and geometric [17]. In the proposed framework, feature generation was performed with the data collected by various sensors: accelerometer, gyroscope, magnetometer, orientation, linear acceleration, and gravity. Additionally, information regarding the pattern of interaction with the touchscreen comprised different ways of extracting touch biometrics, as detailed in Table 1.
Overall, a total of 29 features were obtained in Moment 1 and 31 features were obtained in Moment 2. Two features, the X and Y coordinates, were added to the latter, along with the latitude and longitude used to define the location pattern. These features extend the features used in [24], considering all features generated by sensors declared in related works, and add features generated by two more sensors: rotation and acceleration. The feature-to-sensor mapping (Table 1) is the following:
• Acceleration force along the Y axis (including gravity) [25] — Accelerometer
• Acceleration force along the Z axis (including gravity) [25] — Accelerometer
• Rate of rotation around the X axis [25] — Gyroscope
• Rate of rotation around the Y axis [25] — Gyroscope
• Rate of rotation around the Z axis [25] — Gyroscope
• Geomagnetic field of the environment for the physical X axis in µT [26] — Magnetometer
• Geomagnetic field of the environment for the physical Y axis in µT [26] — Magnetometer
• Geomagnetic field of the environment for the physical Z axis in µT [26] — Magnetometer
• Rotation vector component along the X axis (X * sin(θ/2)) [25] — Rotation sensor (software or hardware)
• Rotation vector component along the Y axis (Y * sin(θ/2)) [25] — Rotation sensor (software or hardware)
• Estimated heading accuracy [27] — Rotation sensor (software or hardware)
• Acceleration force along the X axis (excluding gravity) [25] — Acceleration sensor (software or hardware)
• Acceleration force along the Y axis (excluding gravity) [25] — Acceleration sensor (software or hardware)
• Acceleration force along the Z axis (excluding gravity) [25] — Acceleration sensor (software or hardware)
• Force of gravity along the X axis [25] — Gravity sensor (software or hardware)
• Force of gravity along the Y axis [25] — Gravity sensor (software or hardware)
• Force of gravity along the Z axis [25] — Gravity sensor (software or hardware)
The representation of the X, Y, and Z axes on a smartphone is detailed in Figure 4 to facilitate the understanding of the per-axis feature collection detailed in Table 1. For the creation of a feature ranking for Moments 1 and 2, the Random Forest (RF) algorithm was used with an entropy-based impurity criterion and a multiclass model. All user templates considered the hyperparameters defined in Table 2, which were set empirically based on the best results obtained during training and considering the test time.
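The ranking step described above can be sketched with scikit-learn. The data below is synthetic (the real input is the per-touch feature vectors of Table 1, with one class per user), and the hyperparameters are placeholders for those of Table 2.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n_samples, n_features, n_users = 300, 8, 5
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, n_users, size=n_samples)   # one class per user (multiclass)
# Make feature 0 class-dependent so it climbs the ranking.
X[:, 0] += y

# Multiclass RF with the entropy impurity criterion, as in the text.
rf = RandomForestClassifier(criterion="entropy", n_estimators=100,
                            random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking[:3])  # indices of the most important features first
```

On the real data, this is the procedure that surfaces finger size and average finger size as the dominant features (Tables 7 and 8).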

Model Creation and Parameter Test
According to [28], it is not an easy task to find a machine learning classifier suitable for user authentication. This is the reason we evaluate several algorithms in this paper. The Random Forest (RF) algorithm [29][30][31] was chosen based on the good performance presented in [24]. Two more algorithms based on ensemble methods, Gradient Boost (GB) [32,33] and Extreme Gradient Boosting (XGB) [34], were also selected. These are considered more modern algorithms with high performance, despite requiring greater computational power for training. Conversely, two algorithms based on probabilities, i.e., Naive Bayes Bernoulli (NBB) and Naive Bayes Gaussian (NBG) [31,35,36], were also added because they are simpler, thus implying low processing costs while being rapid for prediction and training. One more algorithm, Support Vector Machine (SVM) [31,36,37], was considered because it achieved good results in related works.
Therefore, this work covered the analysis of a set of six different algorithms to identify which is the best for each user, based on the F1 Score (F1), accuracy (AC), and the complexity of the algorithm: (a) RF; (b) SVM; (c) XGB; (d) GB; (e) NBB; and (f) NBG, for continuous authentication, both static and dynamic. As an additional checkpoint, the One-Class SVM algorithm was included for the location pattern. The tools used for building the authentication models were the scikit-learn library and the XGBoost Python package.
To adjust the hyperparameters of the algorithms based on ensemble methods and SVM, the Grid Search technique was used, using a predefined parameter list for the RF, SVM, GB, and XGB algorithms, as shown in Table 3.
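A minimal sketch of this tuning step with scikit-learn's grid search follows. The data is synthetic and the grid values are placeholders; the actual per-algorithm grids are those of Table 3.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic binary target

# Placeholder grid; Table 3 lists the real candidate values per algorithm.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="f1", cv=3)
search.fit(X, y)
print(search.best_params_)           # best combination found
print(round(search.best_score_, 3))  # cross-validated F1 of that combination
```

Scoring by F1 here mirrors the framework's choice of F1 as the main evaluation metric.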

Evaluation Metrics
In this version of the framework, the main evaluation metrics are F1 Score (F1) and FAR. F1 [38][39][40] is the harmonic mean between Recall and Precision [38], and its definition and meaning can be found in [40,41]. This metric was chosen instead of accuracy because this new version of the framework has multiclass scopes: if accuracy were used, true negatives would carry a high weight, which could yield a good accuracy value without necessarily indicating good model performance.
Regarding FAR [42][43][44][45], it measures the performance of the model with respect to impostors who are accepted as legitimate users.
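The two metrics can be computed as follows; this is a generic illustration with toy labels (1 = legitimate user, 0 = impostor), and the `impostor_far` helper is our own naming.

```python
from sklearn.metrics import f1_score

def impostor_far(impostor_predictions):
    """I_FAR over impostor attempts only: the fraction of attempts
    predicted as the legitimate class (label 1)."""
    return sum(impostor_predictions) / len(impostor_predictions)

# Toy example: 3 legitimate and 5 impostor samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(round(f1_score(y_true, y_pred), 3))  # harmonic mean of precision and recall

# 50 impostor attempts with 3 wrongly accepted gives an I_FAR of 6%.
attempts = [0] * 47 + [1] * 3
print(impostor_far(attempts))
```

An I_FAR threshold of 10% over 50 impostors, as used later in the framework, therefore tolerates at most 5 wrongly accepted impostors per user model.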

Fusion of Scores
For the proposed model, the fusion rule for the classifier results is the score average of the accuracy for the location pattern and the F1 for Moments 1 and 2, as shown in Figure 5. Unlike [11], which proposed deducting the standard deviation from the mean value, this requirement has been removed in this new version of the framework because the new model also performs a test to distinguish an impostor from all other users of the system. Therefore, the penalty generated in the final score by deducting the standard deviation is unnecessary, since score values are only accepted if they are above 90% for both the location and Moments 1 and 2. Moments 1 and 2 use the F1 of the best classifier, which can be individualized for the user or shared with others, but always consider the F1 for the whole class.

Results and Discussion
This section presents the experimental scenarios developed for the validation of the new authentication framework and discusses the results of the validation process.

Experimental Scenarios
Three scenarios (S1, S2, and S3) were defined to specify the minimum number of interactions with the application required to create a supervised machine learning model that can obtain an F1 Score (F1) of at least 90% and a maximum FAR of 10%. Each user model is confronted with the templates from other users, as shown in Table 4, based on the number of templates that the user generated during the experimental period. A ratio of one login template to three interactions with the application is used in the test scenarios since, to complete a flow, the user must interact with three screens after login: (1) the Services Menu screen, (2) the Transaction Menu screen, and (3) the Transaction screen.
As detailed in Table 4, to participate in SV within scenario 1 (S1), a user needed at least 10 login interactions with the application; of these generated templates, at least 5 were used for the training phase, and all other user-generated templates were used for testing. In scenario 2 (S2), the user needed at least 15 templates, 10 of which were used for training and all others for testing. To participate in scenario 3 (S3), there had to be at least 20 interactions with the login screen, 15 of which were used for training and all others for testing. To participate in DV within scenario 1 (S1), a user needed at least 10 post-login interactions with the application, generating 30 templates, of which at least 15 were used for the training phase and all other user-generated templates for testing. In scenario 2 (S2), the user needed at least 45 templates, 30 of which were used for training and all others for testing. To participate in scenario 3 (S3), there had to be at least 60 templates, 45 of which were used for training and all others for testing.
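The scenario thresholds described above (Table 4) can be expressed as a simple lookup; the dictionary layout and function name below are our own representation, not the paper's code.

```python
# Minimum templates required per scenario and how many go to training;
# the remainder of a user's templates is used for testing.
SCENARIOS = {
    "SV": {"S1": {"min": 10, "train": 5},
           "S2": {"min": 15, "train": 10},
           "S3": {"min": 20, "train": 15}},
    "DV": {"S1": {"min": 30, "train": 15},
           "S2": {"min": 45, "train": 30},
           "S3": {"min": 60, "train": 45}},
}

def split_templates(kind, scenario, n_templates):
    """Return (n_train, n_test), or None if the user has too few templates."""
    rule = SCENARIOS[kind][scenario]
    if n_templates < rule["min"]:
        return None
    return rule["train"], n_templates - rule["train"]

print(split_templates("SV", "S1", 12))   # user qualifies: 5 train, 7 test
print(split_templates("DV", "S3", 50))   # below the 60-template minimum
```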

Model Implementation
To build the authentication model according to the framework rules, the following steps were carried out:

1. All six supervised machine learning algorithms are trained and tested with balanced data. The same number of vectors (lines contained in the templates) is used for legitimate and illegitimate users, based on the number of vectors contained in the legitimate user's templates;
2. The algorithms that obtained an F1 of at least 90% are identified;
3. Models that obtain 100% accuracy are discarded because this behavior may indicate overfitting or that the data are not yet sufficient to define the user pattern;
4. If, among the models with F1 of at least 90%, there is NBB or NBG, they are preferred, as they are simpler and faster algorithms for prediction. Otherwise, the model with the highest F1 value is selected;
5. The best authentication model selected in the previous step is then confronted with data from at least 50 other users. The model is only considered good if it obtains a maximum I_FAR of 10%, which in this case means that, among the 50 other impostor users, at most 5 have a behavior pattern identified as similar to that of the evaluated user, based on the features used in the experiment.
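The five selection steps above can be sketched as follows. Each candidate carries the metrics already computed for it; the dictionary structure and names are our assumptions, not the paper's code.

```python
def select_model(candidates, f1_min=0.90, ifar_max=0.10):
    # Step 2: keep models reaching the F1 threshold.
    good = [c for c in candidates if c["f1"] >= f1_min]
    # Step 3: discard suspiciously perfect models (possible overfitting).
    good = [c for c in good if c["accuracy"] < 1.0]
    if not good:
        return None
    # Step 4: prefer the simpler Naive Bayes variants when they qualify.
    nb = [c for c in good if c["name"] in ("NBB", "NBG")]
    best = nb[0] if nb else max(good, key=lambda c: c["f1"])
    # Step 5: accept only if the impostor FAR is at most 10%.
    return best if best["ifar"] <= ifar_max else None

candidates = [
    {"name": "RF",  "f1": 0.97, "accuracy": 1.00, "ifar": 0.02},  # dropped at Step 3
    {"name": "XGB", "f1": 0.94, "accuracy": 0.96, "ifar": 0.04},
    {"name": "NBG", "f1": 0.91, "accuracy": 0.93, "ifar": 0.06},
]
print(select_model(candidates)["name"])  # NBG preferred over XGB at Step 4
```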

Features Ranking
To understand the representativeness of each feature for the creation of the models and how they could influence the creation of the scopes, a ranking of the features was created using the RF algorithm with multiple classes, which is detailed in Table 7 (top 10 features for SV) and in Table 8 (top 10 features for DV). From these rankings, it can be observed that the features with the greatest importance for both SV and DV are finger size and average finger size. These two features have a weight close to or greater than 40% in the definition of the models when using the multiclass scope. This behavior served as a basis for the creation of the experiment scopes, seeking to understand how the models behave without these features. The different training and test scopes were created based on the feature ranking, so that the best-performing scopes could be selected for the proposed framework while capturing the relevant features with less training and testing time.

Scopes
For each scenario, S1 to S3, the framework was trained and tested on six different scopes in order to find the scopes with the best performance, respecting the number of templates for training and testing in each scenario, as detailed in Section 4.1. Each scope was defined by its feature set and by whether the model is multiclass or binary. SA is the same scope used in [11], while the others were created for this new framework. The idea was to find the best performance of the authentication framework for the collected data and the creation of the models in the experimental scenarios.
Given the results captured in the static and dynamic verification scopes from SA to SF, as can be seen in Sections 4.6.1 and 4.6.2 for SV and in Sections 4.7.1 and 4.7.2 for DV, it was defined that the framework will incorporate scopes SD, SA, and SB, in this order.
SD takes precedence over the other scopes because, in SD, the model has already been tested with data from all of the other users that generated the model. This approach also reduces the number of possible overfittings that may happen in individual models trained with only two classes, the legitimate user (1) and the impostors (0), since in a shared model the internal classes can reach an accuracy of up to 100% without necessarily overfitting the model. Additionally, it was noticed that SA could be complementary to SD.
Scopes SA and SB are only used if it is not possible to find an F1 of at least 90% for the user in SD. If an F1 of at least 90% is not found for the class in SD, then training in SA is carried out. If it is still not possible to reach the defined F1, training in SB is carried out. Failure to reach the expected F1 in any of the scenarios may indicate that the user's usage pattern cannot be captured with the proposed system and scenarios. This procedure is described in Algorithm 1.
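The SD → SA → SB fallback just described can be sketched as follows; `train_in_scope` is a hypothetical stand-in for the training routine, and the function name is ours, not that of Algorithm 1.

```python
def find_scope(user, train_in_scope, f1_min=0.90):
    """Try scopes in priority order; return (scope, f1) on success, or
    None when the user's pattern cannot be captured in any scope."""
    for scope in ("SD", "SA", "SB"):
        f1 = train_in_scope(user, scope)
        if f1 >= f1_min:
            return scope, f1
    return None

# Hypothetical user whose pattern only reaches the threshold in SA:
fake_results = {"SD": 0.84, "SA": 0.93, "SB": 0.95}
print(find_scope("user-1", lambda u, s: fake_results[s]))  # SA is chosen before SB
```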

Experimental Results for SV
In this subsection, the SV results are analyzed across the proposed scopes and scenarios. In the following tables, the ALG field indicates the best algorithm, the ALG(S) field indicates the algorithm and scope together, the QTD field indicates the total number of templates for the user, and the I_FAR field indicates the FAR of the model with respect to the authentication of the templates of all 50 other impostor users. The light-gray lines indicate that the model also met the requirement of Step 5, i.e., a FAR below 10%, meaning that a model meeting all of the framework requirements was found in one or more of the scenarios.

Results for SV between Scenarios
With the proposed approach for SV, uniting the SA, SB, and SD scopes in the framework, it was possible to find an algorithm reaching the intended F1 for 92% of users in one or more scenarios; no such algorithm was found for only two users. Scope SD was responsible for 52% of users in S1, 66% in S2, and 62.23% in S3; Scope SA for 32% of users in S1, 16.66% in S2, and 15.38% in S3; and Scope SB for 4% of users in S1 and 0% in S2 and S3, thus indicating that Scope SD is comprehensive in finding the best algorithm for the data collected and analyzed in this experiment for static verification.
The results for Step 5 of the proposed framework, the imposter FAR test, are detailed in Table 9. This is the last screening that the model must pass to be considered suitable for user authentication.
Based on the steps defined for the framework, it was possible to find a model with an F1 of at least 90% and an I_FAR of up to 10% for 80% of users in SV, of whom five out of five users offered enough samples to participate in S1, indicating that the proposed SV framework, if used in conjunction with conventional methods such as passwords, can offer an additional line of security with good performance. In this experiment, in a test environment with different devices and different scopes, a quality model was obtained for 20 of the 25 users, the majority of the users who participated in the experiment. The average results presented in Tables 10-12 were calculated using the simple average of the values found for each user. All comparisons are made so that the proposed framework can have a minimum benchmark for performance comparison, even though the features and metrics used in each literature-reviewed work are not the same as those used in the proposed framework. The general results of the framework and the scopes that compose it, SA, SB, and SD, are marked in bold.
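The Step 5 screening referenced above can be illustrated with a short sketch. The names here are assumptions for illustration only, assuming the binary models label the legitimate user as 1 and that the candidate model is evaluated against the templates of all other participants.

```python
# Hedged sketch of the Step 5 imposter screening: the candidate model is run
# on templates from all other users, and I_FAR is the fraction it wrongly
# accepts as the legitimate class (label 1). Function names are illustrative.
def imposter_far(model, imposter_templates, accept_label=1):
    """Fraction of imposter templates the model accepts as legitimate."""
    accepted = sum(1 for t in imposter_templates
                   if model.predict(t) == accept_label)
    return accepted / len(imposter_templates)

def passes_step5(model, imposter_templates, max_far=0.10):
    """True if the model meets the framework's I_FAR <= 10% requirement."""
    return imposter_far(model, imposter_templates) <= max_far
```

A model is only retained when `passes_step5` holds in addition to the F1 threshold of Step 4.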
Regarding the average results for SV, according to Table 10, it was possible to obtain an accuracy of up to 98.25% with the model proposed in the framework in scenario 3, even though accuracy is not the metric considered in the proposed model. For comparison purposes, this accuracy exceeds those described in the literature review, where the highest reported values were 96% for static verification in a mobile banking application [18] and 93.04% for static verification in [7]. Based on Table 11, the average EER varied between 1.88% and 4.57%, and the F1 detailed in Table 12 ranged between 95.32% and 97.05%, with the average results above the defined threshold of 90%, indicating that the proposed model obtained good performance for most users who participated in the SV experiment.
Table legend: ALG(S) S1: Algorithm (Scope), Scenario 1; F1 S1: F1 Score, Scenario 1; I_FAR S1: Imposters' FAR, Scenario 1; and correspondingly for Scenarios 2 (S2) and 3 (S3). Note: The light-gray lines indicate that the model also met the requirement of Step 5, a FAR below 10%; that is, a model meeting all of the framework's requirements was found in one or more of the scenarios.
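For reference, the EER reported in Tables 11 and 15 is the operating point at which FAR and FRR coincide. A minimal, generic sketch of estimating it from match scores (not the authors' implementation, and assuming higher scores mean a closer match) could look like this:

```python
# Illustrative EER estimation: sweep the acceptance threshold over the
# observed match scores and take the point where FAR (imposters accepted)
# and FRR (genuine users rejected) are closest, averaging the two rates.
def estimate_eer(genuine_scores, imposter_scores):
    thresholds = sorted(set(genuine_scores) | set(imposter_scores))
    best = None
    for th in thresholds:
        far = sum(s >= th for s in imposter_scores) / len(imposter_scores)
        frr = sum(s < th for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]
```

With perfectly separable score distributions this estimate is 0; overlapping distributions raise it, mirroring the balance between FAR and FRR discussed later in the outlier-detection subsection.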

Experimental Results for DV
In this subsection, the results for DV are analyzed across the proposed scopes and scenarios. In the following tables, the ALG field indicates the best algorithm, the ALG(S) field indicates the algorithm and scope together, the QTD field indicates the total number of templates for the user, and the I_FAR field indicates the FAR of the model when authenticating the templates of all 50 other users acting as imposters. The light-gray lines indicate that the model also met the requirement of Step 5, a FAR below 10%; that is, a model meeting all of the framework's requirements was found in one or more of the scenarios.

Results for DV
The proposed approach for DV, with the SA, SB, and SD scopes in the framework, made it possible to find an algorithm with the intended F1 for 86.95% of users in one or more scenarios, leaving only four users without a model reaching the intended F1 for dynamic verification, of whom three did not offer enough samples to participate in scenario 3. Scope D resolved the search for 17% of users in S1, 27% in S2, and 81.81% in S3; Scope A for 13.04% of users in S1, 38.88% in S2, and 9.09% in S3; and Scope B for 30.43% of users in S1, 16.66% in S2, and 0% in S3. Unlike static verification, where the resolution of the intended F1 search was concentrated in SD, dynamic verification showed a greater distribution among the scopes. This indicates that finding a model with the intended F1 for DV is more complex, as it involves a set of screens, some with only one touch for interaction, rather than a single screen as in SV.
The result for Step 5 of the proposed framework, the imposter FAR test, is detailed in Table 13. This is the last screening that the model must pass to be considered suitable for user authentication.
Based on the steps defined in the framework, it was possible to find a model with an F1 of at least 90% and an I_FAR of up to 10% for 69.56% of users in DV (16 users); for 7 users, it was not possible to obtain a model that met the requirements, as these users provided only enough samples to participate in up to S2. Of the users who provided enough samples to participate in S3, only one did not have a good model created, indicating that, for DV, more training data are needed to build a model with good performance. This suggests that the use of post-login interaction data for user authentication is promising.

Average Results for DV
In this subsection, the results of the proposed framework are analyzed against the results observed in the works listed in the literature review. The average results presented in Tables 14-16 were calculated using the simple average of the values found for each user. All comparisons were made so that the proposed framework can have a minimum benchmark for performance comparison, even though the features and metrics used in each literature-reviewed work are not the same as those used in the proposed framework. The general results of the framework and the scopes that compose it, SA, SB, and SD, are marked in bold.
Table legend: ALG(S) S1: Algorithm (Scope), Scenario 1; F1 S1: F1 Score, Scenario 1; I_FAR S1: Imposters' FAR, Scenario 1; and correspondingly for Scenarios 2 (S2) and 3 (S3). Note: The light-gray lines indicate that the model also met the requirement of Step 5, a FAR below 10%; that is, a model meeting all of the framework's requirements was found in one or more of the scenarios.
As detailed in Table 14, for DV, the accuracy varied between 90.1% and 98%. In Table 15, an EER as low as 3.07% was observed in scenario 3, lower than the 4% between weeks reported for dynamic authentication in [8]. As shown in Table 16, the F1 score varied between 90.68% and 95.72%, an average value above the defined threshold of 90%, indicating that the proposed model performed well for most users.

Algorithm Frequency
During the experiments, as explained before, six different algorithms were used to create models across the scenarios that would meet all of the requirements defined for the framework. The algorithms that met all of the proposed framework's requirements varied, as shown in Table 17. In the case of SV, the best algorithm was RF, with the highest frequency, followed by NBG and NBB; for DV, the algorithm with the highest frequency was NBG, followed by RF. The results in Table 17 show that the best algorithms were based on ensembles and on Naive Bayes, with the best results obtained by RF in SV and by NBG in DV. It was also verified that the SVM algorithm, despite being referenced in the literature as yielding good results in [7], did not perform well in our experiments, and neither did GB or XGB in the creation of models that met all of the requirements of the framework. This suggests that the problem addressed in the experiment, with the features used, can be solved using simpler algorithms, such as NBB, NBG, and RF.
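The per-user search over candidate algorithms can be sketched with scikit-learn estimators standing in for the original setup; this is an illustrative assumption, not the authors' code, and XGB is omitted here because it requires the external xgboost package.

```python
# Hedged sketch of the per-user algorithm search: fit each candidate family
# named in the text and keep the one with the best F1 on held-out templates.
# scikit-learn class names are stand-ins for the original implementation.
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.svm import SVC
from sklearn.metrics import f1_score

CANDIDATES = {
    "RF": RandomForestClassifier(random_state=0),
    "NBG": GaussianNB(),
    "NBB": BernoulliNB(),
    "SVM": SVC(),
    "GB": GradientBoostingClassifier(random_state=0),
}

def best_algorithm(X_train, y_train, X_test, y_test):
    """Fit each candidate and return (name, f1) of the best F1 scorer."""
    scores = {}
    for name, clf in CANDIDATES.items():
        clf.fit(X_train, y_train)
        scores[name] = f1_score(y_test, clf.predict(X_test))
    return max(scores.items(), key=lambda kv: kv[1])
```

In the framework, the winner would still have to pass the Step 5 imposter screening before being accepted.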

Outlier Detection
A model to be used in behavioral biometrics must classify a legitimate user well while keeping the FAR and the False Rejection Rate (FRR) balanced and at low rates, indicating that the model is effective in identifying both legitimate users and imposters.
As detailed in Section 4.7.2, the EER varied between 1.88% and 4.57% for SV and between 3.07% and 9.85% for DV, remaining balanced and low, especially when considering only S3, with 1.88% for SV and 3.07% for DV, over the two-week duration of the test.
Besides this balance and the good values found, the framework seeks to make the models more resilient to imposters. In this regard, only models that obtained an F1 greater than 90% and a FAR lower than 10% were considered and used. These models were then confronted with the templates of all of the other 50 users who participated in the experiment.
Among the models that passed all of the steps defined by the framework, even those with I_FARs greater than 1% did not authenticate any single imposter as a legitimate user with 100% of their templates in any of the cases evaluated; an I_FAR greater than 1% instead resulted from the acceptance, within the threshold, of a few data vectors from different imposters.
Therefore, the proposed framework can be highly effective in identifying imposters (reaching at least 90% F1) based on the proposed features and scopes when used in a mobile banking application.
Even with all of the precautions defined in the model creation requirements for the framework, an imposter may still access the system without being detected, as there are margins of error and, as shown in [46], continuous authentication based on behavioral biometrics can suffer imitation attacks.
In [46], data from the various sensors available on smartphones were not taken into account, unlike in the experiment proposed in this work. Regarding attacks against touch behavioral biometrics, the authors in [47] suggested that incorporating sensor data can make biometric authentication robust against recently proposed practical attacks.

Fused Results
The last step in generating the final score in the proposed framework is to combine the location accuracy with the F1 Score values generated for Moments 1 and 2 by calculating the arithmetic mean of these values. To illustrate the fusion of scores according to the proposed model, the first result among the scenarios for SV and DV was selected, considering the users who had a model created for Moments 1 and 2 according to the requirements of the framework. The location accuracy value taken into account is always that of the scenario with the highest number according to the SV and DV results. This is just an example, as the SV and DV results were obtained by authenticating several templates from the same user; in a real scenario, the fusion of scores derives from the results for a single template at run time in SV, DV, and location. The results for the fusion of scores are shown in Table 18. According to these results, the proposed framework was able to find satisfactory models for both SV and DV for 14 of the 25 users who participated in the experiment. For SV, the majority were satisfactory in scenario 1; for DV, there was a greater distribution among the scenarios, indicating that, in general, more data are needed in DV than in SV to create a quality model.
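The fusion step described above reduces to a simple arithmetic mean; a minimal sketch, with illustrative variable names not taken from the original framework, is:

```python
# Minimal sketch of the final score fusion described in the text: the
# arithmetic mean of the location accuracy and the F1 scores obtained for
# Moment 1 (SV) and Moment 2 (DV). All inputs are assumed to lie in [0, 1].
def fused_score(location_accuracy, f1_sv, f1_dv):
    """Arithmetic mean of the three per-layer scores."""
    return (location_accuracy + f1_sv + f1_dv) / 3
```

At run time, the inputs would come from the authentication of a single template in each layer rather than from the aggregated experimental results.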

Comparison with Previous Work
In this subsection, we compare the results observed in the literature review with our continuous authentication framework in Table 19.
Based on the observations made in the literature review, summarized in Table 19, the continuous authentication framework proposed in this work, developed for touch dynamics biometrics and focused on mobile banking applications, presents better results than that in [19], which also proposed models using continuous authentication based on both static and dynamic verification during the entire interaction with an application. Figure 6 illustrates the performance results achieved by the proposed continuous authentication framework in comparison with the best results from the reviewed literature, as detailed in Table 19, for accuracy and F1. As detailed in Figure 6, our accuracy results were better than those reported in [7,8,18]. The proposed framework took the F1 score into account to avoid the strong influence that true negatives can have on accuracy, while also considering the accuracy metric, which was likewise better than in the other related works. Figure 7 illustrates the performance results achieved by the proposed framework in comparison with the best results from the reviewed literature, as detailed in Table 19, for EER.
Compared with [19], our model achieved an EER of 1.88%, much better than the 11.5% found in [19]. Our proposed approach showed superior results with respect to the work presented in the literature review focusing on mobile banking applications and addresses both static and dynamic verification.
Therefore, in addition to the capture layers for password typing and interactions with the application, the proposed model also captures information from different sensors present in mobile devices, such as the rotation and acceleration sensors, and from the user's location pattern, as used in [16,20], respectively. Besides the search for an algorithm with an F1 of at least 90%, the framework proposes an additional step for a model to be accepted within the defined requirements: it must be confronted with data generated by all other users who participated in the experiment and cannot have an I_FAR greater than 10%. This step was incorporated with the objective of finding models balanced regarding FAR and FRR when using balanced training and test data, and of making the models more resilient and robust when tested with imposter data.
Regarding the identification of impostors, the approach of using this extra verification step makes it more difficult for a fraudster to be able to impersonate a legitimate user when using the application since, in addition to obtaining their password, the impostor must have a pattern of interaction with the application quite similar to that of the legitimate user in the face of a model that has already been tested against the data of at least 50 imposters.

Conclusions and Future Work
The experimental results obtained during this research reinforce our thesis on the complexity of finding a machine learning algorithm that is suitable for several different users since the pattern of interaction with an application is unique to each individual. The use of six different algorithms by the proposed framework appears to be a promising approach to overcoming this difficulty.
To validate the framework as an additional method used against identity-related fraud, three different scenarios were created. The scenarios were based on the number of templates generated for testing and training in order to understand how the models would perform in each case evaluated. Six different scopes were also created, based on removing some features to understand how they influence model performance and on different model types (multiclass or binary).
For the construction of the models in the experiments carried out with the six different machine learning algorithms, the F1 score was used. Because the scopes include both multiclass and binary models, a general metric was needed in which true negatives are not weighted as heavily as in accuracy.
A relevant characteristic of the model creation steps was the verification of the FAR for imposters, making the models more resilient, since they were tested once more with the data of all other users who participated in the experiment.
Regarding the scopes studied in this work, the ones that presented the best results were SA, SB, and SD. These scopes were incorporated into the new framework, leading to good results. In SV, it was possible to find a satisfactory model for 20 of the 25 users who participated in the experiment. In DV, a suitable model for 16 users out of 23 was found. It was also observed that SA could be complementary to SD.
Regarding the investigation into creating more robust models through the inclusion of imposter FAR verification, it was observed that this is an important step, as the models need to be balanced between identifying imposters and legitimate users, with low FRR and FAR rates.
To be considered good, all models, when trained with balanced data, had to pass the test with at most a 10% FAR for imposters when confronted with data from all other users in the experiment. The inclusion of the I_FAR step therefore enriched the framework and made the models more robust.
The algorithms that showed the best performance varied between the scopes. For SV, the best result was observed for RF, an algorithm based on the ensemble method, followed by the Naive Bayes-based algorithms NBG and NBB. For DV, NBG was the best-performing algorithm, with the highest frequency, followed by RF.
For the average F1 and EER, the results varied between 90.68% and 97.05% and between 1.88% and 9.85%, respectively, across the scenarios of the proposed scopes. This validates the promising perspective of touch dynamics biometrics as a new layer of security when used in conjunction with traditional methods such as passwords. Such layers can evolve in combination to mitigate authentication fraud in mobile banking applications.

Future Work
Future work related to this research includes field studies to capture data in a real online banking application with a larger number of users and a longer duration to validate the proposed framework in a setup closer to the final application.
Additionally, an important question to consider is the relationship between cell phone models and the quality of the data coming from the capture sensors, yielding an interesting analysis of how device quality is reflected in the quality of the model generated for the device's user.
Regarding the performance of the proposed framework, promising perspectives in this investigation include considering methods based on adaptive selection by weighting features of interest and using advanced feature engineering techniques. Furthermore, applying filters to the data captured from the sensors may improve the performance of the models created.
Since continuous authentication is an evolving field, further studies are also needed on appropriate and better-performing machine learning algorithms, on a better understanding of the impacts of their parameters, and on related approaches to data collection and preservation, model training, deployment, and maintenance.
A related study will consider analyzing and comparing the performance and cost of continuous authentication methods based on physiological and behavioral biometrics.
Finally, as the proposed framework needs to show resilience against attacks, future studies must conduct further exploration of adversarial models to implement the necessary countermeasures.

Conflicts of Interest:
The authors declare no conflicts of interest. The sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: