4.1. Experimental Configuration and Characteristics
To facilitate experimental reproducibility and ensure the ecological validity of the behavioral dataset, the acquisition was standardized as shown in
Table 3. Data collection was performed using an OPPO A3 Pro smartphone equipped with a MediaTek Dimensity 7050 processor running Android 14. This device served as the edge computing node for capturing high-fidelity telemetry. The study involved 15 participants (10 males, 5 females) to ensure gender diversity in behavioral patterns. The data collection pipeline covered three synchronized modalities: touch (keystroke dynamics), sliding (geometric trajectory analysis), and sensor signals (IMU sensor fusion). Each data collection session comprised three distinct phases: (1) a numeric keypad interaction requiring discrete taps on specific digits, (2) a trajectory tracing task in which users were prompted to follow a geometric template, and (3) a device kinematic task involving physical orientation shifts (tilting the device left or right). This sequence was performed over four consecutive rounds: the first and second rounds used a circular template, while the third and fourth rounds employed a square template to capture variations in motor control across different geometric constraints.
To ensure rigorous evaluation, we implemented a user-dependent, round-based 4-fold cross-validation strategy. Data were partitioned by acquisition round, using three rounds for training and one for testing, rather than by random shuffling. This approach enforces strict temporal isolation, preventing intra-session data leakage while assessing resilience to behavioral drift. Additionally, limitations regarding the cohort size (N = 15) and its impact on generalizability are explicitly acknowledged.
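For concreteness, the following is a minimal sketch of this round-based partitioning, assuming each windowed sample carries a round identifier in {1, 2, 3, 4}; the function name and data layout are illustrative rather than taken from our implementation.

```python
# Round-based 4-fold cross-validation: hold out one acquisition round per fold.
import numpy as np

def round_based_folds(round_ids: np.ndarray):
    """Yield (train_idx, test_idx) pairs, one fold per held-out round."""
    for held_out in np.unique(round_ids):
        test_mask = round_ids == held_out
        yield np.where(~test_mask)[0], np.where(test_mask)[0]

# Usage sketch: train on three rounds, test on the held-out one.
# for train_idx, test_idx in round_based_folds(round_ids):
#     fit(X[train_idx], y[train_idx]); score(X[test_idx], y[test_idx])
```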
The total dataset comprises 35,519 samples. To simulate realistic zero-trust attack scenarios, in which unauthorized access attempts significantly outnumber authorized operations, the dataset deliberately preserves a pronounced class imbalance: it contains 1984 positive samples (Target User) and 33,535 negative samples (Unauthorized/Others). A stratified sampling method was employed to divide the dataset into training and testing sets at an 80:20 ratio, ensuring that the temporal integrity of the behavioral sessions was preserved. The temporal fluctuations and signal densities of the three behavioral modalities are illustrated in
Figure 2, providing critical insights into the data acquisition layer’s performance. The visualization reveals a fundamental distinction between the continuous nature of inertial sensors and the event-driven sparsity of interactive gestures. The persistent availability of sensor signals provides a foundational data stream for continuous authentication at the edge computing node, ensuring that the authentication framework maintains an uninterrupted security posture across diverse mobile usage scenarios.
4.2. Performance Evaluation of the Optimized Heterogeneous Ensemble
To validate the hypothesis that distinct behavioral modalities require specialized neural topologies, we evaluated the performance of six permutations of CNNs, GRUs, and LSTMs across three data streams: touch, sliding, and sensor signals. The performance metrics of the heterogeneous classifier architectures are illustrated in
Figure 3. The configuration employing CNNs for sensor data, GRUs for touch input, and LSTMs for sliding achieves the highest accuracy and F1-score, confirming the necessity of modality-specific architectural alignment. This architecture attains a peak accuracy of 99.23% and an F1-score of 82.14%. The performance hierarchy can be attributed to the intrinsic properties of each neural architecture: CNNs demonstrate superior capability in modeling high-frequency inertial signals, providing a stable baseline; GRUs are computationally efficient and effectively capture the short-term, sparse characteristics of touch tap events; and LSTMs excel at modeling the longer-term temporal dependencies inherent in continuous sliding. In contrast, architectures that misalign models with modalities (e.g., using LSTMs for sensor data, CNNs for touch, and GRUs for sliding) suffer significant performance degradation, with F1-scores dropping to as low as 22.92%. These results confirm that an arbitrary model-to-modality assignment is insufficient for multimodal behavioral biometrics; each modality must be paired with the architecture whose inductive bias matches its signal structure.
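To make the modality-to-architecture mapping concrete, the following PyTorch sketch outlines one plausible instantiation of the three base classifiers. Layer widths, kernel sizes, and per-modality feature counts are illustrative assumptions, not the exact topologies evaluated here.

```python
# Modality-specific base classifiers: CNN for IMU streams, GRU for touch taps,
# LSTM for sliding trajectories. Each head emits a single genuine-user logit.
import torch.nn as nn

class SensorCNN(nn.Module):
    def __init__(self, channels: int = 6):  # e.g., 3-axis gyroscope + 3-axis accelerometer
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1))

    def forward(self, x):  # x: (batch, channels, timesteps)
        return self.net(x)

class TouchGRU(nn.Module):
    def __init__(self, features: int = 4):  # e.g., x, y, pressure, dwell time
        super().__init__()
        self.rnn = nn.GRU(features, 32, batch_first=True)
        self.head = nn.Linear(32, 1)

    def forward(self, x):  # x: (batch, timesteps, features)
        _, h = self.rnn(x)
        return self.head(h[-1])

class SlideLSTM(nn.Module):
    def __init__(self, features: int = 3):  # e.g., x, y, velocity
        super().__init__()
        self.rnn = nn.LSTM(features, 32, batch_first=True)
        self.head = nn.Linear(32, 1)

    def forward(self, x):  # x: (batch, timesteps, features)
        _, (h, _) = self.rnn(x)
        return self.head(h[-1])
```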
A critical challenge in continuous authentication systems is the inherent class imbalance between authorized users and impostor attempts. In this study, the dataset exhibits a significant imbalance ratio of approximately 1:17. Under such conditions, standard voting mechanisms using default decision thresholds (e.g., 0.5) often yield misleadingly high accuracy by biasing predictions toward the majority negative class. This results in a low True Positive Rate (Recall), rendering the system ineffective for genuine user verification.
To mitigate this, we implemented a systematic GSO strategy to optimize the ensemble hyperparameters, specifically the weight vectors and decision thresholds, across hard, soft, and weighted voting architectures.
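The following sketch illustrates the essence of this GSO step, assuming each base model emits a per-window genuine-user probability; the grid resolutions (0.1 for weights, 0.05 for the threshold) are illustrative assumptions, not the exact search space used here.

```python
# Grid search over voting weights (summing to 1) and the decision threshold,
# scored on F1 to counteract the 1:17 class imbalance.
import itertools
import numpy as np
from sklearn.metrics import f1_score

def grid_search_weighted_vote(p_sensor, p_touch, p_slide, y_true):
    best = {"f1": -1.0}
    for w1, w2 in itertools.product(np.arange(0.0, 1.01, 0.1), repeat=2):
        w3 = 1.0 - w1 - w2
        if w3 < -1e-9:
            continue                     # enforce w1 + w2 + w3 = 1, all >= 0
        fused = w1 * p_sensor + w2 * p_touch + max(w3, 0.0) * p_slide
        for tau in np.arange(0.05, 1.0, 0.05):
            f1 = f1_score(y_true, (fused >= tau).astype(int), zero_division=0)
            if f1 > best["f1"]:
                best = {"f1": f1, "weights": (w1, w2, max(w3, 0.0)), "tau": tau}
    return best
```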
Table 4 details the performance metrics for each mechanism before and after optimization. In the baseline configuration, all three voting methods exhibited poor sensitivity to the authorized user class. As shown in
Table 4, while the baseline hard voting achieved an accuracy of 95.87%, its F1-score was limited to 15.48% due to a critically low recall of only 9.17%. Similarly, soft voting and weighted voting produced identical baseline results with an F1-score of 22.62%, indicating that, without calibration, the ensemble classifiers failed to distinguish the minority class effectively from the noise of the majority class.
Figure 4 compares the voting mechanisms before and after optimization, highlighting the substantial impact of the GSO strategy, particularly on the weighted voting scheme, which demonstrates the most robust balance between precision and recall. These results confirm that an uncalibrated voting strategy is insufficient for behavioral biometrics. The optimized weighted voting mechanism successfully establishes a precise decision boundary that maximizes the detection of authorized users while maintaining zero false acceptances, thereby satisfying the stringent security requirements of the Zero Trust framework.
Table 5 provides a comprehensive performance benchmark, contrasting the standalone CNN, GRU, and LSTM architectures against the proposed method, which integrates these heterogeneous topologies via a Grid Search-optimized weighted voting scheme. The data indicate that the standalone CNN exhibits a respectable accuracy of 92.85% by effectively capturing spatial patterns in high-frequency sensor data, yet it remains hindered by a relatively low precision of 55.26%, suggesting a propensity for false positives in complex usage scenarios. Similarly, although the GRU architecture achieves a baseline accuracy of 97.29%, its recall remains capped at 58.33%, demonstrating an inability to resolve the subtle temporal dependencies required for robust identity verification across diverse interaction types.
Empirical results indicate that GSO yielded divergent performance enhancements across the evaluated voting architectures. Hard voting saw a substantial improvement, with its F1-score increasing to 52.60%. Conversely, soft voting demonstrated limited plasticity: despite achieving perfect precision (100.00%), its recall remained low at 18.33%, resulting in a modest F1-score of 30.95%. This suggests that soft voting, which relies on averaging probability distributions, struggles to resolve confidence ambiguities in highly imbalanced data. The most significant performance gain was observed in the weighted voting mechanism, which assigned the greatest weights to the most reliable modalities, particularly sensor data. After fine-tuning the decision threshold, it achieved a peak accuracy of 99.23%. More importantly, it attained a precision of 100.00% and a recall of 79.17%, pushing the F1-score to 82.14%.
To ensure rigorous validation, training and testing sliding windows were kept strictly disjoint, originating from distinct temporal sessions. This design effectively evaluates robustness against intra-subject behavioral drift, mirroring the operational demands of continuous authentication. Regarding the optimization objective, the empirical results in
Table 4 validate the necessity of prioritizing the F1-score over accuracy. Given the severe class imbalance, standard voting mechanisms succumbed to the “Accuracy Paradox,” yielding F1-scores below 23%. In contrast, by shifting the GSO objective to the F1-score, the weighted voting mechanism raised its F1-score substantially, to 82.14%. This confirms that decision-level optimization targeting the F1-score is critical for correcting majority-class bias.
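A small worked example makes the accuracy paradox explicit. Using the class counts from Section 4.1, a degenerate classifier that rejects every window already looks highly accurate while being useless for verification:

```python
# Why the GSO objective must be F1 rather than accuracy under 1:17 imbalance.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1] * 1984 + [0] * 33535   # positives (Target User) vs. negatives
y_pred = [0] * len(y_true)          # "always reject" baseline
print(accuracy_score(y_true, y_pred))             # ~0.944, misleadingly high
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0, no genuine user accepted
```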
The proposed method fundamentally overcomes these constraints by aligning modality-specific data structures with the most appropriate neural inductive biases: assigning CNNs to sensor streams, GRUs to discrete touch events, and LSTMs to curvilinear sliding gestures. This architectural synergy, further refined by the systematic optimization of decision thresholds and voting weights, yields a peak accuracy of 99.23% and a precision of 100.00%. As visualized in the performance comparison chart, the proposed framework achieves a statistically significant increase in the F1-score, rising from the 50–55% range seen in standalone models to 82.14%. Furthermore, the reduction in the Equal Error Rate (EER) to 0.0016 ± 0.0032 underscores the framework’s efficacy in maintaining a precise decision boundary within the Zero Trust paradigm, ensuring that non-intrusive authentication does not compromise system integrity.
To further evaluate the diagnostic capability and robustness of the proposed framework,
Figure 5 presents the Receiver Operating Characteristic (ROC) curves for the individual base classifiers (CNN, LSTM, and GRU) compared against the optimized weighted ensemble. The ROC analysis provides a comprehensive visualization of the performance trade-off between the True Positive Rate (TPR) and False Positive Rate (FPR) across varying decision thresholds. As illustrated in
Figure 5, the proposed method achieves a near-ideal Area Under the Curve (AUC) of 0.999, which significantly exceeds the performance of the standalone CNN (AUC = 0.974), LSTM (AUC = 0.961), and GRU (AUC = 0.913) architectures. This superior discriminative power demonstrates that the GSO strategy effectively aligns the multimodal feature spaces to establish a precise decision boundary. Specifically, the proposed ensemble maintains a high TPR even at extremely low FPR levels—a critical requirement for maintaining a rigorous security posture within ZTA environments without compromising user experience.
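For reference, the following sketch shows one standard way to compute the AUC and EER reported above from fused genuine-user scores; the function and variable names are illustrative.

```python
# ROC analysis: AUC and Equal Error Rate (the operating point where FPR ~= FNR).
import numpy as np
from sklearn.metrics import roc_curve, auc

def roc_auc_and_eer(y_true, scores):
    fpr, tpr, _ = roc_curve(y_true, scores)
    roc_auc = auc(fpr, tpr)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))   # threshold index closest to FPR = FNR
    eer = (fpr[i] + fnr[i]) / 2.0
    return roc_auc, eer
```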
4.3. Sensitivity Analysis of Temporal Granularity
In continuous authentication systems processing time-series data, the selection of temporal parameters, specifically the window size and stride size, is critical. The window size dictates the amount of historical context captured in each input sequence, while the stride size determines the temporal overlap between consecutive sequences. These parameters directly influence the model’s ability to capture latent behavioral patterns as well as its computational efficiency. In this study, to maintain temporal consistency and avoid data leakage between the training and testing sets, the window size was set equal to the stride size, yielding non-overlapping windows. A series of experiments was conducted to evaluate the impact of varying these temporal parameters on system performance.
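The sketch below shows the non-overlapping segmentation implied by setting the stride equal to the window; the array layout is an illustrative assumption.

```python
# Non-overlapping windowing: with stride == window, consecutive windows never
# share samples, so no raw data point can leak across the train/test boundary.
import numpy as np

def segment(stream: np.ndarray, window: int = 70) -> np.ndarray:
    """Split a (timesteps, features) stream into (n_windows, window, features)."""
    n = (len(stream) // window) * window   # drop the incomplete tail
    return stream[:n].reshape(-1, window, stream.shape[1])
```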
Table 6 presents the performance metrics obtained for window and stride sizes of 30, 50, 70, and 90 data points.
As detailed in
Table 6 and visualized in
Figure 6, an optimal balance between security and stability was identified at a window and stride size of 70. At this setting, the system achieved the highest accuracy (99.23%) and perfect precision (100.00%), resulting in a robust F1-score of 82.14%. At smaller granularities (W, S = 30 or 50), the models exhibited significantly lower recall (21.96% and 17.37%, respectively) and F1-scores (34.86% and 23.43%). The data suggests that lower temporal granularities fail to capture sufficient latent behavioral context, resulting in elevated false rejection rates. Conversely, increasing the granularity to 90 achieved perfect recall (100.00%) but at the cost of reduced precision (81.25%) and accuracy (97.60%). The decline in precision indicates that excessively large windows may incorporate redundant or noisy data points that degrade the distinctiveness of the user’s behavioral patterns. Consequently, size 70 was established as the optimal operating point for the proposed framework.
While the proposed framework achieves near-perfect precision (>99%), these results should be interpreted within the context of a user-dependent authentication model. Unlike user-independent systems that must generalize across a broad population, our model learns distinctive behavioral signatures for each individual. Furthermore, the multimodal fusion of touch and inertial features significantly reduces ambiguity compared to unimodal approaches. The consistent performance observed across the strict temporal separation of the 4-fold cross-validation further supports the reliability of these results, minimizing the likelihood of overfitting.
4.5. Interpretability of Kinematic Features and Trajectory Analysis
To interpret the model’s decision logic, we conducted a feature importance analysis using a Random Forest classifier. As illustrated in
Figure 8, the angular velocity features derived from the gyroscope dominate the decision boundary. Specifically, GyroY, GyroX, and GyroZ occupy the top rankings with importance scores exceeding 0.29. This distribution indicates that dynamic rotation and orientation changes in the device during usage constitute the primary biometric signature, surpassing the linear acceleration features (AccY, AccX, AccZ). While sliding and touch features exhibit lower individual importance scores, they provide essential context-aware cues that complement the continuous sensor stream.
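The ranking in Figure 8 can be reproduced with a standard Random Forest importance computation, sketched below; the column names follow the text, while the hyperparameters are illustrative assumptions.

```python
# Impurity-based feature importances from a Random Forest classifier.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def rank_features(X: pd.DataFrame, y) -> pd.Series:
    rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
    return pd.Series(rf.feature_importances_,
                     index=X.columns).sort_values(ascending=False)

# Per Figure 8, GyroY, GyroX, and GyroZ rank above AccY, AccX, and AccZ.
```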
To evaluate the discriminative power of complex gestural patterns, the experiment incorporated a multi-stage interaction sequence. The experimental results, summarized in
Table 9, reveal that circular trajectories achieved slightly higher classification accuracy (98.46%) than square trajectories (98.10%). From a kinematic perspective, circular motion is characterized by greater continuity and smoothness, which leads to more consistent feature representations across different sessions for the same user. In contrast, square trajectories induce significant behavioral variance: the requirement to navigate sharp corners and 90-degree directional changes typically results in momentary deceleration followed by a rapid change in angular velocity. These “corner-turning” kinematics are often executed with less consistency by the same individual across multiple attempts, introducing intra-class noise that complicates the classification boundary. Despite this increased complexity, square gestures achieved a marginally higher F1-score (66.67%) than circular gestures (64.29%), suggesting that the sharp kinematic transitions provide unique anchors for identity verification even though they are more difficult to model precisely.
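One simple way to quantify these corner-turning kinematics is the per-step heading change along the traced trajectory, sketched below under the assumption that sliding is logged as (x, y) coordinates; circles yield a near-constant turn rate, whereas squares produce sharp spikes of roughly 90 degrees at the corners.

```python
# Per-step heading change of a sliding trajectory, a proxy for corner-turning.
import numpy as np

def turn_angles(xy: np.ndarray) -> np.ndarray:
    """xy: (n_points, 2) trajectory; returns per-step heading change in radians."""
    steps = np.diff(xy, axis=0)
    headings = np.arctan2(steps[:, 1], steps[:, 0])
    d = np.diff(headings)
    return (d + np.pi) % (2.0 * np.pi) - np.pi   # wrap to [-pi, pi)

# Circular templates: low variance in turn_angles; square templates: ~pi/2 spikes.
```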
Despite the framework’s efficacy, three limitations persist. First, the cohort size (N = 15) may constrain generalizability across diverse demographics. Second, controlled data acquisition likely establishes performance upper bounds by omitting “in-the-wild” variables such as locomotion artifacts. Finally, reliance on a single device model introduces hardware dependencies, necessitating future validation across heterogeneous IMU sensitivities to address potential domain shifts.