Client-Side Continuous Authentication Using Keystroke Dynamics: A Lightweight Pipeline and Cross-Session Evaluation

Zhang, Zhanhe; Papaioannou, Maria; Choudhary, Gaurav; Dragoni, Nicola

doi:10.3390/electronics15112325

Section of Cybersecurity Engineering, Department of Applied Mathematics and Computer Science, Technical University of Denmark (DTU), 2800 Kongens Lyngby, Denmark

Electronics 2026, 15(11), 2325; https://doi.org/10.3390/electronics15112325

Submission received: 20 April 2026 / Revised: 18 May 2026 / Accepted: 25 May 2026 / Published: 27 May 2026

(This article belongs to the Special Issue Improving IoT Security and Efficiency Through Advanced Data Analysis Method)

Abstract

Post-login threats such as device sharing and session takeover motivate continuous authentication with behavioral signals. This paper studies a lightweight keystroke-dynamics pipeline designed for strict cross-session evaluation and browser-side scoring. Using the fixed-text and free-text tracks of the public KeyRecs dataset, we extract compact repetition-level and sliding-window digraph-timing features and train per-user one-vs-rest Logistic Regression verifiers on Session 1 (S1). Thresholds are selected only on S1 and transferred unchanged to Session 2 (S2), preventing test-set tuning and exposing operating-point instability under session drift. Fixed-text achieves S2 AUC mean/median 0.895/0.918 with a half total error rate (HTER) around 0.19, while free-text reaches AUC mean/median 0.884/0.899 with a similar transferred-threshold HTER. Personal thresholds and a pooled-S1 global threshold perform similarly on average, suggesting that global thresholding can simplify deployment without replacing per-user scoring models. A scaler-only warm-up update yields limited and inconsistent gains, showing that mean/variance adaptation alone is insufficient. Finally, compact JSON artifacts and replay-based browser benchmarks demonstrate deterministic client-side scoring with very small per-sample latency. Overall, the results show that useful threshold-free separability does not by itself guarantee stable operating-point transfer under cross-session drift.

Keywords:

keystroke dynamics; continuous authentication; behavioral biometrics; cross-session evaluation; threshold transfer; client-side inference; logistic regression; browser-based deployment

1. Introduction

Authentication verifies that an entity (e.g., a user) is who it claims to be [1]. In most deployed systems, authentication is performed at the point of entry (e.g., login), and the system then assumes that the same user remains in control for the rest of the session [2]. This assumption creates a security gap for post-login threats such as session hijacking, where an attacker can operate under an already authenticated context after stealing a session credential [3]. Similar exposure exists for bearer-token-based authorization, where token leakage can undermine session-level security guarantees [4]. Continuous authentication (CA) addresses this gap by repeatedly re-validating the current operator during interaction, serving as a complementary layer after initial login [5].

Behavioral biometrics are attractive for continuous authentication because they can be collected during normal interaction with limited explicit user effort [6,7,8]. Biometric recognition is a broad class of approaches for verifying identity from physiological or behavioral traits [9]. Among these modalities, keystroke dynamics is a long-studied signal that captures characteristic timing patterns in typing behavior [10,11]. Surveys further document common feature families, classifiers, and experimental designs in keystroke-based authentication [12,13,14].

A central barrier to real-world adoption is that behavioral signals drift across time and context, which can degrade performance when models and operating thresholds are transferred from enrollment to later sessions. Template update and adaptation mechanisms have, therefore, been studied for keystroke dynamics [15,16]. Moreover, CA systems must commit to an operating point (i.e., a decision threshold), so cross-session score shift can translate into substantial changes in false accepts and false rejects. This motivates evaluation protocols that strictly separate threshold selection from final testing, and report both ranking-oriented and operating metrics (e.g., ROC/AUC and thresholded error rates) [17,18].

In addition to accuracy, client-side CA faces deployment constraints. Behavioral traces may leak sensitive information, motivating privacy-preserving designs [19]. Relying on remote services can add latency and reduce responsiveness; edge-computing perspectives motivate bringing computation closer to the data source [20,21]. Finally, user experience can degrade under frequent prompts or false rejections, reinforcing the need for lightweight, low-intrusive mechanisms [8].

This paper studies a lightweight keystroke-based CA pipeline under a strict cross-session setting. Using KeyRecs fixed-text repetitions and free-text digraph streams, we construct fixed-dimensional representations (per repetition for fixed-text; sliding-window aggregation for free-text) and train compact per-user one-vs-rest Logistic Regression models using Session 1 only. Operating thresholds are selected on Session 1 and transferred to Session 2 for evaluation without test-set tuning. We further analyze (i) personalized per-user thresholds versus a pooled Session 1 global threshold (shared across users while keeping per-user models unchanged), and (ii) a minimal client-feasible adaptation that updates only feature standardization statistics (scaler-only warm-up) without retraining models or retuning thresholds.

Contributions

We emphasize that the novelty of this work does not lie in the individual use of StandardScaler normalization, Logistic Regression, threshold selection, or conventional keystroke timing features, which are deliberately lightweight and well-established components. Instead, the contribution lies in the strict, reproducible, and deployment-oriented evaluation of these components for client-side continuous authentication under cross-session drift.

A lightweight client-side keystroke-based CA pipeline covering both fixed-text and free-text KeyRecs conditions, with repetition-level fixed-text features and sliding-window digraph-timing aggregation for free-text.
A strict cross-session evaluation protocol in which models and thresholds are trained/selected only on Session 1 and evaluated unchanged on Session 2, reporting both ranking metrics and operating metrics under transferred thresholds [17].
An operating-point transfer study comparing per-user personal thresholds with a pooled Session 1 global threshold, while keeping the scoring models user-specific.
A conservative cross-session adaptation analysis via scaler-only warm-up, which updates normalization statistics without retraining classifiers or retuning thresholds.
Deployment-oriented evidence through compact client-side JSON exports, browser-side JavaScript scoring benchmarks, and reproducibility materials, showing how the trained verifier can be executed deterministically outside the offline notebook environment.

2. Literature Review

2.1. Continuous Authentication and Behavioral Biometrics

CA aims to repeatedly verify whether the current operator remains the legitimate user after initial login, and must balance security with usability, privacy, and deployment cost [22,23]. Behavioral biometrics are widely studied in CA because they can be captured during normal interaction with limited explicit user effort, but they are also sensitive to context and behavioral drift [22,23,24]. These constraints motivate lightweight pipelines and evaluation designs that separate development-time calibration from deployment-time operation [25].

2.2. Keystroke Dynamics in Fixed-Text and Free-Text Settings

Keystroke dynamics characterizes users by timing patterns (e.g., dwell time and inter-key latencies), and has a long history in user verification [26,27,28]. Prior work commonly distinguishes fixed-text and free-text settings [29,30]. Fixed-text (e.g., password retyping) offers consistent lexical content and facilitates feature alignment and controlled evaluation [29]. Free-text is more natural for continuous authentication but introduces variable content, sparsity, and non-stationarity, motivating robust aggregation and modeling strategies [31,32,33,34,35]. Recent work also explores scaling free-text keystroke biometrics to larger populations and richer representations [36].

2.3. Representation and Lightweight Modeling

Keystroke-based systems span distance/statistics-based detectors, supervised learners, and deep models [28]. Deep architectures (e.g., CNN/RNN ensembles) have been explored for continuous authentication [37,38], but may impose higher computational and engineering costs in constrained clients. For deployment-oriented settings, classical models with predictable inference cost and small footprints remain attractive; linear classifiers are particularly practical due to mature tooling and efficient implementations [39], while compact tree models provide lightweight non-linear baselines [40]. Recent optimized SVM-based work in IoT network security also reflects the broader interest in efficient machine-learning models for secure and resource-constrained environments [41]. Although that work addresses IoT network security rather than behavioral biometrics, it supports the general motivation for predictable and lightweight inference in deployment-oriented security systems. Our study applies this motivation to keystroke-dynamics continuous authentication, focusing on compact exported verifiers that can be scored in the browser. Beyond model choice, statistical testing and careful protocol design are needed to avoid over-claiming improvements under heterogeneous user behavior [42].

2.4. Cross-Session Evaluation, Thresholding, and Adaptation

A persistent challenge in behavioral biometrics is cross-session shift: even when the user identity is unchanged, score distributions can drift due to device/context changes and behavioral variability [24,25]. Because deployed decisions are threshold-based, evaluation should report operating-point metrics (e.g., FAR/FRR, HTER) under threshold transfer rather than relying solely on threshold-independent separability measures [18,43]. Threshold selection can be personalized or shared globally; this trade-off interacts with user heterogeneity (e.g., “goat/wolf” effects) and motivates calibration/normalization strategies [44]. Score normalization and non-parametric alternatives have been studied in biometric verification [45,46], including adaptive settings [47] and client-oriented threshold/score adjustment approaches [48]. From a learning perspective, cross-session variability can be viewed as a form of dataset shift [49], and covariate-shift corrections (e.g., reweighting) have been studied as principled mitigations [50]. Recent surveys further highlight that adaptation benefits can be limited or unstable under realistic behavioral drift, especially when only lightweight calibration is feasible [23].

2.5. Dataset Landscape and Rationale for KeyRecs

Public keystroke datasets vary in acquisition protocol, population size, text type, device constraints, and the availability of multiple sessions; these factors can strongly affect reported performance and comparability [51]. We, therefore, use KeyRecs [52] because it (i) is openly accessible and well documented, (ii) includes both fixed-text (password retyping) and free-text (transcription) settings, and (iii) provides explicit session identifiers that support strict cross-session evaluation with development on Session 1 and testing on Session 2.

2.6. Positioning and Comparison Strategy

Because related works often differ in datasets, splits, and operating-point selection, direct numerical comparison across papers can be misleading [25,51]. Accordingly, we position this work by providing a strict, reproducible cross-session protocol on a public dataset, and by quantifying lightweight baselines under an identical S1-select/S2-evaluate evaluation pipeline.

To contextualize our design choices and highlight comparability factors, Table 1 summarizes representative quantitative results from the literature together with two key flags: (i) whether results are obtained under strict cross-session threshold transfer (S1-select/S2-evaluate), and (ii) whether client-side deployment evidence is provided. Because studies differ in datasets, decision units, and evaluation protocols, the reported numbers should be interpreted as indicative rather than directly comparable across papers [25,51].

2.7. Critical Interpretation of Table 1

Table 1 should be interpreted as a protocol-aware comparison rather than as a direct ranking of error rates. The reported numbers differ across studies because prior work varies substantially in dataset, task formulation, decision unit, and evaluation protocol. For example, some studies evaluate fixed-text password attempts, while others evaluate free-text streams, fixed-length sequences, one-minute windows, or aggregated digraph/trial units. These choices affect the amount of behavioral evidence available for each decision: longer windows or multiple trials can reduce score variance, but they also imply a different latency–stability trade-off from the shorter free-text windows used in this paper.

Evaluation protocol is another major source of difference. Many reported EER, FAR, or FRR values are obtained under cross-validation, random splits, within-session evaluation, or task-specific protocols, whereas this work uses a strict S1→S2 setting. In our protocol, both the model and the operating threshold are selected only on Session 1 and then applied unchanged to Session 2. The resulting HTER and FRR@FAR = 5%, therefore, include the effect of cross-session score-distribution shift and threshold-transfer instability, rather than measuring only threshold-free separability or within-session discrimination.

For this reason, the higher transferred-threshold HTER and FRR@FAR values observed in our experiments should not be read simply as poorer biometric separability than prior work. Instead, they show that useful AUC can coexist with unstable deployment-time operating points when thresholds are transferred across sessions without test-set tuning. This is the main difference between our study and many prior comparisons: we focus on cross-session operating-point stability and client-side deployability of a lightweight verifier, rather than only reporting discriminative performance under a single evaluation protocol.

3. Proposed Method

3.1. Overview

We propose a lightweight client-side CA pipeline based on keystroke dynamics. Typing events are converted into fixed-dimensional feature vectors, standardized, scored by a compact per-user verifier, and then thresholded to produce accept/reject decisions. Figure 1 summarizes the end-to-end flow, including an optional scaler-only warm-up that updates standardization statistics without retraining the classifier or changing thresholds. The instantiated evaluation protocol (S1→S2), dataset filters, and metrics are described in Section 4.

3.2. Feature Extraction and Representation

3.2.1. Timing Primitives

For free-text digraph timing, we use four standard keystroke primitives derived from key down/up timestamps: dwell time (DU), down-down latency (DD), up-down latency (UD), and up-up latency (UU), measured in seconds. Note that $UD_{i} = t_{\mathrm{down}}^{(i)} - t_{\mathrm{up}}^{(i-1)}$ can be negative under key overlap, i.e., when key i is pressed before key i − 1 is released. Figure 2 illustrates these definitions.
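As an illustration, the four primitives can be derived from per-key press/release timestamps as in the minimal sketch below. The function name `digraph_primitives` and the `(t_down, t_up)` tuple layout are assumptions made for this example and are not the KeyRecs schema. Only the UD formula is taken from the text; the index conventions for DU, DD, and UU are assumed by analogy.

```python
from typing import Dict, List, Tuple


def digraph_primitives(events: List[Tuple[float, float]]) -> List[Dict[str, float]]:
    """Compute DU/DD/UD/UU (seconds) for each consecutive key pair.

    `events` holds one (t_down, t_up) tuple per key, in typing order.
    Index conventions other than UD are assumptions of this sketch.
    """
    out = []
    for i in range(1, len(events)):
        prev_down, prev_up = events[i - 1]
        cur_down, cur_up = events[i]
        out.append({
            "DU": cur_up - cur_down,     # dwell (hold) time of key i
            "DD": cur_down - prev_down,  # down-down latency
            "UD": cur_down - prev_up,    # up-down latency; negative under overlap
            "UU": cur_up - prev_up,      # up-up latency
        })
    return out
```

In this sketch, a pair such as `[(0.00, 0.12), (0.10, 0.20)]` produces a negative UD, because the second key goes down before the first is released.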

3.2.2. Fixed-Text Repetition-Level Features (47D)

For the fixed-text track, the decision unit is one complete repetition of the prompted fixed-text input. Each row in the KeyRecs fixed-text table corresponds to one repetition typed by participant p in session s with repetition index r. Following the strict cross-session protocol, these rows are not merged across sessions or repetitions: each repetition is treated as one independent verification sample.

The fixed-text representation is constructed directly from the repetition-level feature table. The columns participant, session, and repetition are retained only for label construction, S1/S2 splitting, and bookkeeping, but are excluded from the model input because they would leak identity or protocol information. All remaining numeric columns are then used, in their dataset order, as the fixed-text feature vector. No additional feature selection, dimensionality reduction, or sliding-window aggregation is applied. This deterministic preprocessing rule yields a 47-dimensional feature vector for each fixed-text repetition:

$$x_{p,s,r}^{\mathrm{fixed}} = [x_{1}, x_{2}, \dots, x_{47}]^{\top} \in \mathbb{R}^{47}, \tag{1}$$

where p denotes the participant, s denotes the session, and r denotes the repetition index. The 47 coordinates, therefore, correspond to the numeric repetition-level timing descriptors available in the KeyRecs fixed-text table after removing the three non-feature columns. In contrast to the free-text representation, where a stream of digraph timing primitives is summarized over sliding windows, the fixed-text track already provides one structured repetition-level record per authentication attempt. Thus, the fixed-text pipeline operates at the repetition level, while the free-text pipeline operates at the window level.

The feature order is fixed once after preprocessing and is used consistently throughout the pipeline: pooled-S1 StandardScaler fitting, per-user one-vs-rest Logistic Regression training, S2 scoring, statistical comparison, and the exported client-side artifacts. This ensures that the same 47-dimensional contract is used in both offline evaluation and replay-based client-side scoring. Table 2 summarizes the fixed-text representation.

3.2.3. Free-Text Representation via Sliding-Window Aggregation

For free-text, we aggregate a digraph stream using an overlapping sliding window of W digraphs with step size S, producing one feature vector per window. Unless stated otherwise, $W = 30$ and $S = 10$. To ensure sufficient samples per (participant, session), we require at least four windows under the chosen $(W, S)$ (equivalently, at least $W + 3S$ digraph records).
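The windowing rule and the four-window inclusion criterion can be sketched as follows. The helper names are hypothetical, and the sketch assumes that trailing records which do not fill a complete window are dropped, which the text does not state explicitly.

```python
from typing import List


def window_starts(n_records: int, W: int = 30, S: int = 10) -> List[int]:
    """Start indices of all complete windows of length W with step S."""
    if n_records < W:
        return []
    return list(range(0, n_records - W + 1, S))


def has_enough_windows(n_records: int, W: int = 30, S: int = 10,
                       min_windows: int = 4) -> bool:
    """Inclusion check: at least `min_windows` complete windows."""
    return len(window_starts(n_records, W, S)) >= min_windows
```

With the defaults, a segment needs W + 3S = 60 digraph records to yield four windows (starting at indices 0, 10, 20, and 30); a segment with 59 records yields only three.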

3.2.4. Free-Text Window-Level Features (19-Dimensional)

Each window is mapped to a compact 19-dimensional vector designed for efficient client-side execution. As summarized in Table 3, the feature set consists of (1) 16 basic/robust statistics (mean, std, median, MAD) computed for each primitive in $\{DU, DD, UD, UU\}$, and (2) three derived summaries: the coefficient of variation of DD, a long-pause ratio, and a typing-rate proxy.

The 19-dimensional representation is designed as a compact, deployment-oriented summary rather than an exhaustive description of all possible keystroke patterns. It is appropriate for this study because it covers the four standard digraph timing primitives, thereby representing both key-hold behavior and inter-key transition behavior. For each primitive, mean and standard deviation capture central tendency and variability, while median and MAD provide robust summaries that are less sensitive to occasional pauses or noisy timing events. The three derived features add complementary information about relative timing variability ($cv_{DD}$), hesitation or interruption behavior (long-pause ratio), and overall typing speed (kps). Because the proposed verifiers are trained on S1 and evaluated on S2 without test-set tuning, a much higher-dimensional representation could increase overfitting and make transferred thresholds less stable. Thus, the 19-dimensional feature vector provides a controlled trade-off between behavioral expressiveness, cross-session robustness analysis, and efficient browser-side scoring. We do not claim that this feature set is globally optimal; rather, it is a deliberately lightweight representation aligned with the paper’s evaluation and deployment goals.
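A rough sketch of how such a 19-dimensional window vector could be assembled is given below. Several details are assumptions of this example rather than the paper's definitions: the feature order, the use of the population standard deviation, the unscaled MAD, the 0.5 s long-pause cutoff (`LONG_PAUSE_S`), and the typing-rate proxy computed as digraphs per second of summed DD.

```python
import statistics
from typing import Dict, List

PRIMITIVES = ("DU", "DD", "UD", "UU")
LONG_PAUSE_S = 0.5  # hypothetical cutoff; the paper's value is not given here


def mad(values: List[float]) -> float:
    """Median absolute deviation (unscaled)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def window_features(window: List[Dict[str, float]]) -> List[float]:
    """Map a non-empty window of digraph records to 16 + 3 = 19 features."""
    feats: List[float] = []
    for name in PRIMITIVES:
        vals = [d[name] for d in window]
        feats += [statistics.fmean(vals), statistics.pstdev(vals),
                  statistics.median(vals), mad(vals)]
    dd = [d["DD"] for d in window]
    dd_mean, dd_std, dd_total = statistics.fmean(dd), statistics.pstdev(dd), sum(dd)
    cv_dd = dd_std / dd_mean if dd_mean > 0 else 0.0          # relative variability
    long_pause_ratio = sum(v > LONG_PAUSE_S for v in dd) / len(dd)
    kps = len(dd) / dd_total if dd_total > 0 else 0.0         # typing-rate proxy
    feats += [cv_dd, long_pause_ratio, kps]
    return feats
```

The guards against a zero mean or zero total time are a design choice of the sketch, so that a degenerate window still produces a finite vector.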

3.3. Pooling Terminology

We use pooled to denote aggregation across users within S1 for estimating shared quantities (e.g., a track-level StandardScaler or a global operating threshold). Pooling does not mean training a global classifier: the verifier remains per-user (one-vs-rest) in all settings; only the scaler and/or threshold may be shared.

3.4. Per-User Verification Model and Scoring

We formulate CA as user-dependent verification. For each enrolled user u, we train a binary (one-vs-rest) Logistic Regression (LR) model where samples from u are labeled genuine ($y = 1$) and samples from all other users are labeled impostor ($y = 0$).

3.4.1. Standardization

Let $\mu \in \mathbb{R}^{d}$ and $\sigma \in \mathbb{R}^{d}$ denote per-feature mean and scale. A StandardScaler transforms $x \in \mathbb{R}^{d}$ into

$$z = (x - \mu) \oslash \sigma, \tag{2}$$

where $\oslash$ is element-wise division. The baseline uses a pooled S1 scaler shared across users within each track.

3.4.2. LR Score

Given standardized features $z$, LR outputs a genuineness score

$$s(z) = \sigma(w^{\top} z + b) \in (0, 1), \tag{3}$$

where $\sigma(\cdot)$ is the sigmoid function (not to be confused with the scale vector $\sigma$ in Equation (2)); higher scores indicate higher confidence of genuine usage for user u.
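Equations (2) and (3) together define the whole scoring path, which can be sketched in a few lines. The function names are hypothetical. Replacing a zero scale by 1 is an assumption of this sketch (it mirrors how scikit-learn's StandardScaler treats constant features), and the branch on the sign of the logit is only there to avoid floating-point overflow.

```python
import math
from typing import List, Sequence


def standardize(x: Sequence[float], mu: Sequence[float],
                sigma: Sequence[float]) -> List[float]:
    """Element-wise z = (x - mu) / sigma, as in Equation (2)."""
    return [(xi - m) / (s if s != 0 else 1.0) for xi, m, s in zip(x, mu, sigma)]


def lr_score(z: Sequence[float], w: Sequence[float], b: float) -> float:
    """Sigmoid of the linear logit w^T z + b, as in Equation (3)."""
    logit = sum(wi * zi for wi, zi in zip(w, z)) + b
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)  # safe: logit < 0, so exp cannot overflow
    return e / (1.0 + e)
```

The per-sample cost is one pass over the d features for standardization and one dot product, which is consistent with the small inference footprint the pipeline targets.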

3.5. Thresholding Strategies

Decisions are made by thresholding the score: a sample is accepted as genuine when $s_{u} \geq \tau$. We consider: (i) personal thresholds $\tau_{u}$ selected per user, and (ii) a pooled-S1 global threshold $\tau_{g}$ derived by pooling S1 scores across users. We also study a FAR-constrained operating point (e.g., FAR ≤ 5%) for deployment-relevant reporting. Threshold selection details and the strict threshold-transfer evaluation are specified in Section 4.

3.6. Scaler-Only Warm-Up (Minimal Adaptation)

To emulate a minimal, client-feasible cross-session adaptation, we implement a scaler-only warm-up. For a user u, the first K decision units in S2 are assumed to be trusted genuine and are used to refit the scaler on pooled S1 feature vectors augmented with these warm-up units, while keeping the LR parameters $(w, b)$ and the S1-selected thresholds fixed. Warm-up units are excluded from evaluation to avoid optimistic bias. The warm-up protocol and K settings are detailed in Section 4.

4. Experimental Setup

This section specifies how the method in Section 3 is instantiated for evaluation, including data inclusion criteria, the strict S1→S2 protocol, threshold selection procedures, the scaler-only warm-up evaluation, and metric definitions. Background on the KeyRecs dataset and the rationale for selecting it are provided in Section 2.

4.1. Data Preparation and Inclusion Criteria

We use KeyRecs [52] with two tracks: fixed-text (repetitions) and free-text (digraph timing). For each track, we include only users who have data in both Session 1 (S1) and Session 2 (S2), since our focus is cross-session generalization.

For free-text, we construct window-level samples using the windowing configuration defined in Section 3.2.3 (default $W = 30$, step $S = 10$). A (participant, session) segment is included only if it yields at least four windows under $(W, S)$, i.e., it contains at least $W + 3S$ digraph records after validity filtering (60 records at the default setting). This prevents degenerate evaluation with too few S2 decisions.

4.2. Strict S1→S2 Protocol (Instantiated)

We follow a strict two-session protocol:

1. S1 enrollment (offline): Fit a pooled StandardScaler on pooled S1 feature vectors (track-level). Then, for each enrolled user u, train a user-specific one-vs-rest Logistic Regression (LR) verifier on S1 only.
2. S1 operating-point selection: Using S1 scores produced by the trained verifier(s), select personal thresholds $\tau_{u}$ and derive a pooled-S1 global threshold $\tau_{g}$ (details in Section 4.4).
3. S2 evaluation (inference only): Apply the S1-fitted scaler, or the warm-up updated scaler if enabled, to S2 feature vectors, compute S2 scores, and evaluate performance under transferred thresholds selected on S1. No model retraining or threshold re-tuning is performed on S2.

This design is intended to emulate an enrollment-to-deployment setting. In practice, labeled enrollment data can be used to fit the verifier and choose an operating threshold, but the labels of a later session are not available before deployment. Therefore, thresholds are selected only on S1 and then kept fixed on S2. S2 labels are used only after scoring to compute evaluation metrics. This avoids test-set threshold tuning and makes HTER, FAR, FRR, and FRR@FAR = 5% reflect the difficulty of transferring an operating point across sessions.

4.3. Genuine/Impostor Construction

For each target user u, we construct a binary verification task:

Genuine samples: feature vectors originating from user u.
Impostor samples: feature vectors from all other users ( $\neq u$ ) within the same track.

This construction is applied consistently in both sessions: S1 data are used for training and threshold selection, while S2 data are used only for testing under threshold transfer.

4.4. Threshold Selection Procedures

Let $s_{u}(\cdot)$ denote the score function of user u’s verifier, where higher scores indicate higher confidence of “genuine”. All thresholds are selected on S1 and then transferred unchanged to S2.

4.4.1. Per-User EER Threshold (Personal)

For each user u, we compute S1 scores for genuine and impostor samples and evaluate a finite set of candidate thresholds $T_{u}$. We select the personal EER threshold $\tau_{u}^{\mathrm{EER}}$ as the candidate that minimizes the absolute gap between false accept rate and false reject rate on S1:

$$\tau_{u}^{\mathrm{EER}} = \arg\min_{\tau \in T_{u}} \left| \mathrm{FAR}_{u}^{S1}(\tau) - \mathrm{FRR}_{u}^{S1}(\tau) \right|. \tag{4}$$
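Equation (4) can be illustrated with the minimal sketch below. The candidate set is not specified in the text, so this example assumes it consists of the distinct observed S1 scores; ties in the gap are broken toward the lowest threshold, which is also an assumption. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def far_frr(genuine: Sequence[float], impostor: Sequence[float],
            tau: float) -> Tuple[float, float]:
    """FAR and FRR at threshold tau, with acceptance defined as score >= tau."""
    far = sum(s >= tau for s in impostor) / len(impostor)
    frr = sum(s < tau for s in genuine) / len(genuine)
    return far, frr


def eer_threshold(genuine: Sequence[float], impostor: Sequence[float]) -> float:
    """Candidate threshold minimizing |FAR - FRR| on the given (S1) scores."""
    candidates: List[float] = sorted(set(genuine) | set(impostor))

    def gap(tau: float) -> float:
        far, frr = far_frr(genuine, impostor, tau)
        return abs(far - frr)

    return min(candidates, key=gap)
```

For genuine scores `[0.9, 0.8, 0.7, 0.4]` and impostor scores `[0.1, 0.2, 0.3, 0.6]`, the selected threshold is 0.6, where FAR and FRR are both 0.25.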

4.4.2. Per-User FAR-Constrained Threshold (Personal, $\mathrm{FAR} \leq 5\%$)

To report a deployment-relevant operating point, we also select a personal threshold $\tau_{u}^{5\%}$ on S1 under the constraint $\mathrm{FAR}_{u}^{S1}(\tau) \leq 0.05$. Among candidates satisfying the constraint, we choose the one minimizing FRR (i.e., maximizing genuine acceptance) on S1:

$$\tau_{u}^{5\%} = \arg\min_{\tau \in T_{u} :\, \mathrm{FAR}_{u}^{S1}(\tau) \leq 0.05} \mathrm{FRR}_{u}^{S1}(\tau). \tag{5}$$

If no candidate meets the constraint (rare under discrete score thresholds), we fall back to the candidate that achieves the smallest $\mathrm{FAR}_{u}^{S1}(\tau)$.
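The constrained selection in Equation (5), including the fallback, can be sketched as follows. As before, the candidate set is assumed to be the distinct observed S1 scores, and the function name is hypothetical.

```python
from typing import Sequence


def far_constrained_threshold(genuine: Sequence[float],
                              impostor: Sequence[float],
                              max_far: float = 0.05) -> float:
    """Pick the threshold with the lowest FRR among those with FAR <= max_far."""
    candidates = sorted(set(genuine) | set(impostor))

    def far(tau: float) -> float:
        return sum(s >= tau for s in impostor) / len(impostor)

    def frr(tau: float) -> float:
        return sum(s < tau for s in genuine) / len(genuine)

    feasible = [t for t in candidates if far(t) <= max_far]
    if feasible:
        return min(feasible, key=frr)  # ties resolve to the lowest threshold
    # Fallback: no candidate satisfies the constraint, so minimize FAR instead.
    return min(candidates, key=far)
```

The fallback can trigger in this sketch when the highest observed score belongs to an impostor and the impostor set is small, because FAR at that candidate is then still above the limit.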

4.4.3. Pooled-S1 Global EER Threshold

To study operational simplicity, we derive a single global EER threshold $\tau_{g}^{\mathrm{EER}}$ by pooling S1 scores and labels across users (while keeping the classifiers per-user). Concretely, we aggregate all users’ S1 genuine/impostor scores into a single set, evaluate a pooled candidate set $T_{g}$, and select:

$$\tau_{g}^{\mathrm{EER}} = \arg\min_{\tau \in T_{g}} \left| \mathrm{FAR}_{g}^{S1}(\tau) - \mathrm{FRR}_{g}^{S1}(\tau) \right|. \tag{6}$$

This isolates the effect of sharing an operating threshold without changing the per-user modeling assumption.
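A minimal sketch of the pooling step in Equation (6) is shown below. It assumes that each user's S1 scores come from that user's own verifier and are stored in dictionaries keyed by user id; the dictionary layout, the function name, and the use of distinct pooled scores as candidates are assumptions of this example.

```python
from typing import Dict, List


def pooled_eer_threshold(genuine_by_user: Dict[str, List[float]],
                         impostor_by_user: Dict[str, List[float]]) -> float:
    """Single global threshold minimizing |FAR - FRR| on pooled S1 scores."""
    genuine = [s for scores in genuine_by_user.values() for s in scores]
    impostor = [s for scores in impostor_by_user.values() for s in scores]
    candidates = sorted(set(genuine) | set(impostor))

    def gap(tau: float) -> float:
        far = sum(s >= tau for s in impostor) / len(impostor)
        frr = sum(s < tau for s in genuine) / len(genuine)
        return abs(far - frr)

    return min(candidates, key=gap)
```

Only the threshold is shared here; every score that enters the pool is still produced by a per-user model.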

4.5. Scaler-Only Warm-Up Evaluation

We evaluate a minimal cross-session adaptation that is feasible for client-side deployment: scaler-only warm-up. The warm-up assumes a trusted genuine prefix in S2 of length K decision units (fixed-text: K repetitions; free-text: K windows). For each user u:

1. Take the first K genuine S2 units as warm-up samples.
2. Refit the StandardScaler using pooled S1 feature vectors augmented with these warm-up samples, producing updated $(\mu, \sigma)$.
3. Keep the LR parameters $(w, b)$ and the S1-selected thresholds fixed (no retraining, no threshold shift).
4. Evaluate on the remaining S2 units excluding the warm-up prefix to avoid optimistic bias.
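The four steps above can be sketched as follows. The helper names are hypothetical, and the sketch assumes a population standard deviation with zero scales replaced by 1, mirroring scikit-learn's StandardScaler behavior. It returns the updated statistics together with the genuine S2 units that remain for evaluation; the LR parameters and thresholds are deliberately not touched.

```python
import math
from typing import List, Tuple

Row = List[float]


def fit_scaler(rows: List[Row]) -> Tuple[List[float], List[float]]:
    """Per-feature mean and population standard deviation."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    sigma = [math.sqrt(sum((r[j] - mu[j]) ** 2 for r in rows) / n) for j in range(d)]
    return mu, [s if s > 0 else 1.0 for s in sigma]


def warm_up_scaler(s1_rows: List[Row], s2_genuine_rows: List[Row],
                   K: int) -> Tuple[List[float], List[float], List[Row]]:
    """Refit the scaler on pooled S1 rows plus the first K trusted S2 rows."""
    warm_up = s2_genuine_rows[:K]     # step 1: trusted genuine prefix
    remaining = s2_genuine_rows[K:]   # step 4: evaluated later, prefix excluded
    mu, sigma = fit_scaler(s1_rows + warm_up)  # step 2: updated (mu, sigma)
    return mu, sigma, remaining       # step 3: (w, b) and thresholds unchanged
```

Because the pooled S1 set is typically much larger than K, the warm-up rows shift the statistics only slightly in this formulation.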

Practical Threat Model for the Trusted Prefix

The trusted-prefix assumption is a deployment assumption used to evaluate conservative adaptation, not information that would be available from an arbitrary unauthenticated continuous stream. In practice, such a prefix should only be treated as genuine when it is anchored by a recent high-assurance authentication event, such as a fresh password/MFA login, WebAuthn-based re-authentication, device unlock on a known device, or an explicit short calibration period immediately after login. The warm-up procedure should, therefore, not be triggered automatically from an already suspicious or low-confidence continuous-authentication stream.

This restriction is important because a compromised prefix could act as a scaler-poisoning channel. If an attacker controls the samples used for warm-up, the updated mean and scale parameters may shift the normalization applied to later samples and thereby alter the score distribution. For this reason, the evaluated warm-up mechanism is intentionally limited: it updates only the StandardScaler statistics, keeps the LR weights, intercept, and S1-selected thresholds fixed, bounds the number of warm-up units by K, and excludes the warm-up samples from S2 evaluation. A practical deployment should additionally log or quarantine warm-up updates and allow rollback to the original S1 scaler when the authentication context becomes suspicious.

We report results for several warm-up lengths $K \in \{5, 10, 15, 20\}$ to assess sensitivity to the amount of trusted adaptation data.

4.6. Metrics and Aggregation

We report both threshold-free and operating-point metrics on S2:

4.6.1. AUC (Threshold-Free)

For each user u, we compute the area under the ROC curve (AUC) on S2 scores using genuine vs impostor labels. AUC summarizes rank-based separability and is independent of any chosen threshold [17].
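For reference, the rank-based interpretation of AUC can be written directly as the fraction of (genuine, impostor) score pairs that are ordered correctly, with ties counted as one half. The pairwise loop below is a sketch chosen for clarity rather than speed, and the function name is hypothetical.

```python
from typing import Sequence


def auc(genuine: Sequence[float], impostor: Sequence[float]) -> float:
    """Probability that a random genuine score exceeds a random impostor score."""
    wins = 0.0
    for g in genuine:
        for i in impostor:
            if g > i:
                wins += 1.0
            elif g == i:
                wins += 0.5  # ties contribute one half
    return wins / (len(genuine) * len(impostor))
```

No threshold appears anywhere in this computation, which is why a high AUC says nothing by itself about where a transferred threshold lands on S2.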

4.6.2. Operating-Point Metrics Under Transferred Thresholds

Given a threshold $\tau$, we compute on S2:

$$\mathrm{FAR}(\tau) = \frac{\#\{\text{impostor scores} \geq \tau\}}{\#\{\text{impostor scores}\}}, \tag{7}$$

$$\mathrm{FRR}(\tau) = \frac{\#\{\text{genuine scores} < \tau\}}{\#\{\text{genuine scores}\}}, \tag{8}$$

$$\mathrm{HTER}(\tau) = \frac{\mathrm{FAR}(\tau) + \mathrm{FRR}(\tau)}{2}. \tag{9}$$

We additionally report FRR@FAR = 5% on S2 by transferring the S1-selected FAR-constrained threshold $\tau_{u}^{5\%}$.
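Equations (7)–(9) applied to S2 scores under a transferred threshold can be sketched as below. The function name and the returned dictionary keys are hypothetical; the threshold argument is meant to be a value selected on S1 and passed in unchanged.

```python
from typing import Dict, Sequence


def operating_metrics(genuine_s2: Sequence[float], impostor_s2: Sequence[float],
                      tau: float) -> Dict[str, float]:
    """FAR, FRR, and HTER on S2 scores at a threshold chosen on S1."""
    far = sum(s >= tau for s in impostor_s2) / len(impostor_s2)
    frr = sum(s < tau for s in genuine_s2) / len(genuine_s2)
    return {"FAR": far, "FRR": frr, "HTER": (far + frr) / 2.0}
```

As a small worked case, a threshold of 0.6 applied to S2 genuine scores `[0.9, 0.5, 0.7, 0.3]` and impostor scores `[0.1, 0.65, 0.2, 0.3]` gives FAR = 0.25, FRR = 0.5, and HTER = 0.375, illustrating how a threshold that was balanced on S1 can become unbalanced after score shift.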

4.6.3. User-Level Reporting

Unless stated otherwise, all metrics are computed per user and then aggregated across users (e.g., mean and distributional summaries). This avoids dominance by users with more samples and reflects typical verification reporting practice.

4.6.4. Statistical Uncertainty and Paired Testing

To avoid over-interpreting small average differences, we additionally report user-level paired uncertainty analyses for the main comparative claims. For each paired comparison, we compute the per-user difference in the relevant metric, e.g., $\Delta\mathrm{HTER} = \mathrm{HTER}_{\mathrm{global}} - \mathrm{HTER}_{\mathrm{personal}}$ for the threshold comparison, or $\Delta\mathrm{HTER} = \mathrm{HTER}_{\mathrm{model}} - \mathrm{HTER}_{\mathrm{LR}}$ for lightweight model comparisons. We summarize the paired differences using the mean, median, a non-parametric 95% bootstrap confidence interval, and a two-sided Wilcoxon signed-rank test. All statistical tests are performed at the user level, so that each user contributes one paired difference to the comparison.
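The bootstrap part of this analysis can be sketched with the standard library alone. The text does not state which statistic is bootstrapped, how many resamples are drawn, or which interval construction is used, so the percentile interval of the mean with 2000 resamples below is an assumption, as is the function name. The Wilcoxon signed-rank test is not reimplemented here; in a Python workflow it would typically come from a statistics library such as SciPy.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci_mean(diffs: Sequence[float], n_boot: int = 2000,
                      alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-user paired differences."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(diffs)
    # Resample users with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Resampling is done over users, not over individual samples, which matches the user-level design in which each user contributes exactly one paired difference.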

4.7. Implementation Notes

All experiments follow the same preprocessing contract as deployment: feature vectors are cast to floating point and missing values are handled consistently (zero-imputation) prior to standardization and scoring. The client-side inference path consists only of feature extraction, StandardScaler transform, LR scoring, and threshold comparison, matching the exported model/scaler artifacts.
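To make the inference contract concrete, the sketch below scores one feature vector from a JSON artifact using only the four operations listed above, preceded by zero-imputation. The paper's client runs in browser-side JavaScript; this Python version only mirrors the arithmetic. The artifact field names (`mu`, `sigma`, `w`, `b`, `tau`) are hypothetical and are not the paper's export schema.

```python
import json
import math
from typing import List, Optional, Tuple


def score_from_artifact(artifact_json: str,
                        x: List[Optional[float]]) -> Tuple[float, bool]:
    """Zero-impute, standardize, apply LR, and compare against the threshold."""
    art = json.loads(artifact_json)
    # Zero-imputation of missing values before standardization.
    clean = [0.0 if (v is None or math.isnan(v)) else float(v) for v in x]
    z = [(xi - m) / s for xi, m, s in zip(clean, art["mu"], art["sigma"])]
    logit = sum(wi * zi for wi, zi in zip(art["w"], z)) + art["b"]
    if logit >= 0:
        score = 1.0 / (1.0 + math.exp(-logit))
    else:
        e = math.exp(logit)
        score = e / (1.0 + e)
    return score, score >= art["tau"]
```

Keeping the artifact to plain arrays and scalars means the same file can be consumed by any runtime with a JSON parser, and the feature order in `x` must match the order fixed during preprocessing.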

5. Results

This section reports cross-session performance on KeyRecs under the strict S1→S2 evaluation and threshold-transfer setting defined in Section 4. We present results for (i) fixed-text repetitions and (ii) free-text windowed digraph features, focusing on: cross-session separability (AUC), transferred-threshold operating behavior (HTER/FAR/FRR), threshold strategy (personal $\tau_{u}$ vs. pooled-S1 global $\tau_{g}$), scaler-only warm-up, and lightweight baseline comparisons.

5.1. Fixed-Text Track

5.1.1. Cross-Session Baseline and Threshold Strategy

Across 99 users, fixed-text yields strong separability on S2 (AUC mean/median

\approx 0.895 / 0.918

; as shown in Figure 3), while operating-point error under threshold transfer remains noticeably higher (HTER ≈ 0.19), consistent with cross-session score shift affecting calibration around the decision boundary. Table 4 compares transferred-threshold HTER on S2 using personal thresholds versus a single pooled-S1 global threshold (models remain per-user).

Global thresholding is only marginally better on average, and the two HTER distributions largely overlap (Figure 4), indicating that most users are insensitive to whether the operating point is personalized or shared under this setting. Figure 5 further shows that

Δ HTER = {HTER}_{global} - {HTER}_{personal}

is sharply concentrated near zero, with small tails reflecting user-level heterogeneity.

5.1.2. Lightweight Classifier Comparison

We compare LR against several compact classical baselines under the same S1-select/S2-evaluate protocol and transferred-threshold reporting (Section 4). Table 5 summarizes mean/median AUC (S2), HTER at the transferred EER threshold, and

FRR@FAR = 5%

at the transferred FAR-constrained threshold.

LR and LinearSVM behave similarly overall. Figure 6 shows that RandomForest achieves the highest AUC on S2, while LR and LinearSVM are close and DecisionTree is substantially lower. However, Figure 7 indicates that RandomForest degrades markedly in transferred-threshold HTER, whereas LR/LinearSVM (and GaussianNB) remain lower and more stable across users. At the FAR-constrained operating point, Figure 8 reports

FRR@FAR = 5%

on S2 with thresholds selected on S1: LR/LinearSVM yield the lowest FRR on average, while GaussianNB and RandomForest exhibit higher FRR and larger dispersion. Together, these results illustrate that strong ranking performance (AUC) does not necessarily translate into stable operating behavior across sessions when score calibration shifts.

5.1.3. Scaler-Only Warm-Up (Fixed-Text)

Warm-up refits only standardization parameters using the first K genuine S2 units (here

K = 10

repetitions), while keeping per-user classifiers and S1-selected thresholds fixed; warm-up units are excluded from evaluation (Section 4). Figure 9 shows near-identical HTER distributions with and without scaler-only warm-up, and Figure 10 shows

Δ HTER = {HTER}_{no-warm} - {HTER}_{warm}

concentrated near zero with a small negative tail, indicating negligible average effect and occasional degradation.
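The warm-up procedure can be sketched in a few lines. This is a minimal illustration that assumes the scaler is refitted from the K warm-up units alone (the exact refit rule is defined in Section 4) and that the standard deviation follows the StandardScaler convention of a population estimate with zero variance mapped to a unit scale; the function names and toy rows are hypothetical.

```python
import statistics
from typing import List, Tuple

Vector = List[float]


def fit_scaler(rows: List[Vector]) -> Tuple[Vector, Vector]:
    """Per-feature mean and population std; a zero std is replaced by 1.0."""
    columns = list(zip(*rows))
    mu = [statistics.fmean(col) for col in columns]
    sigma = [statistics.pstdev(col) or 1.0 for col in columns]
    return mu, sigma


def warm_up_split(genuine_s2: List[Vector], k: int = 10):
    """Refit (mu, sigma) on the first k genuine S2 units and return the remaining units.

    Classifier weights and S1-selected thresholds are left untouched; the warm-up
    units are excluded from the returned evaluation set.
    """
    warm, evaluation = genuine_s2[:k], genuine_s2[k:]
    return fit_scaler(warm), evaluation


# Toy genuine S2 stream with two features (illustrative only).
stream = [[float(i), 2.0] for i in range(12)]
(mu_warm, sigma_warm), eval_units = warm_up_split(stream, k=10)
```

Only the normalization statistics change in this sketch, which is why any shift that is not a per-feature re-centering or rescaling remains uncorrected.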

5.2. Free-Text Track

5.2.1. Cross-Session Baseline and Threshold Strategy (Default W = 30, S = 10)

Across 98 users, free-text window features achieve strong separability on S2 (AUC mean/median

= 0.884 / 0.899

, as shown in Figure 11), but transferred-threshold operating error remains non-negligible (HTER median

\approx 0.167

). Table 6 summarizes baseline results and shows that pooled-S1 global thresholding is again very close to personal thresholding on average.

To contextualize HTER, Figure 12 compares per-user HTER, FAR, and FRR on S2 under personal vs. pooled-S1 global thresholding. Consistent with Table 6, differences are small overall; the global threshold slightly reduces FRR for some users while FAR remains comparable, suggesting that session shift is primarily expressed as modest score calibration changes around the decision boundary rather than a collapse of separability.

5.2.2. Lightweight Classifier Comparison (Free-Text)

Table 7 reports model comparison results on free-text under default windowing, and Figure 13 visualizes per-user distributions. Linear models (LR and LinearSVM) provide among the lowest HTER under transferred thresholds, while RandomForest achieves the highest AUC but increases HTER at the transferred EER operating point. Notably, RandomForest can reduce FRR at the FAR-constrained operating point, underscoring that cross-session behavior depends on both ranking (AUC) and score calibration under the target operating regime.

5.2.3. Scaler-Only Warm-Up (Free-Text) and Sensitivity to K

Under the default setting, warm-up effects are small at the population level: Figure 14 shows strong overlap between no-warm and warm-up HTER distributions for

K = 10

. We further vary

K \in \{5, 10, 15, 20\}

(Section 4) and compare medians (Figure 15) and per-user

Δ HTER

distributions (Figure 16). Across

K \leq 15

, changes are marginal; at

K = 20

, warm-up degrades the median HTER in this run, indicating that using more warm-up data does not guarantee improved cross-session operating behavior.

5.3. Operating Metrics Under Threshold Transfer

A recurring pattern across the fixed-text and free-text experiments is that threshold-free separability and threshold-based operating behavior tell different stories. The S2 AUC values indicate that the learned one-vs-rest scorers often preserve useful ranking ability across sessions: genuine samples tend to receive higher scores than impostor samples even under S1→S2 transfer. However, AUC does not specify where the operating threshold should be placed, nor does it guarantee that a threshold selected on S1 will induce the same FAR/FRR trade-off on S2.

This distinction is important for continuous authentication deployment. In a deployed CA system, the user experience and security posture are governed by operating-point metrics such as FAR, FRR, HTER, and FRR at a constrained FAR, rather than by AUC alone. A high AUC can coexist with non-negligible HTER or high FRR if the genuine and impostor score distributions shift between sessions. Therefore, the transferred-threshold results should be interpreted as evidence that cross-session operating-point calibration is a central challenge for client-side continuous authentication.

5.4. Client-Side Browser Timing and Feasibility

To strengthen the deployment-oriented evaluation, we complemented the artifact-footprint analysis with a browser-side JavaScript timing benchmark. The benchmark measures the runtime inference path used by the replay demo: feature alignment, StandardScaler normalization, linear Logistic Regression scoring, optional sigmoid conversion, and threshold comparison. Training, threshold selection, and feature extraction from raw keystroke logs are not included in this timing benchmark because they are performed offline in the current demo setting.

For each track, we loaded the exported model.json and scaler.json artifacts in the browser and replayed feature vectors in the same order used by the offline evaluation. Timing was measured with performance.now() after a JavaScript benchmark warm-up pass, and we report the median and 95th-percentile per-sample latency. The artifact footprint is reported as the combined size of the inference-required JSON files, namely model.json and scaler.json.

Table 8 reports browser-side JavaScript scoring latency on both desktop and mobile environments. Across all tested settings, the scoring path remains below 0.3 μs/sample at both the median and the 95th percentile. The fixed-text track is consistently slower than the free-text track because it uses a 47-dimensional repetition-level representation, whereas the free-text scorer uses a 19-dimensional window-level representation. Mobile execution is slower than desktop execution, and Firefox on Android is slower than Chrome on Android in this benchmark. Nevertheless, all measured latencies remain far below interactive timescales, supporting the feasibility of client-side scoring from compact exported artifacts.

5.5. Statistical Uncertainty of Main Comparisons

To support the main comparative claims, we performed paired user-level uncertainty analyses for two comparisons: threshold strategy and lightweight model choice. This paired design matches the experimental setup because each user is evaluated under both conditions. Therefore, the relevant quantity is the per-user difference, not only the difference between population means. For each comparison, we report the mean and median paired difference, a non-parametric 95% bootstrap confidence interval, and a two-sided Wilcoxon signed-rank test.

Table 9 shows that the average differences between personal and pooled-S1 global thresholds are very small in both tracks. For fixed-text, the pooled-S1 global threshold yields a slightly lower mean HTER than the personal threshold (

Δ = - 0.0031

, 95% CI

[- 0.0060, - 0.0003]

,

p = 0.0378

). Although this difference is statistically distinguishable, its magnitude is only about 0.3 percentage points in HTER. For free-text, the mean difference is similarly small (

Δ = - 0.0020

, 95% CI

[- 0.0044, 0.0004]

,

p = 0.2276

), and the paired test does not indicate a statistically distinguishable difference. Overall, these results support the practical conclusion that personal and pooled-S1 global thresholds behave similarly under the S1→S2 transfer protocol.

Table 10 complements the mean/median model comparison in Table 5 and Table 7. For fixed-text, LinearSVM and GaussianNB do not show statistically distinguishable transferred-threshold HTER differences from LR. In contrast, DecisionTree and RandomForest produce substantially higher HTER than LR, with mean increases of 0.0814 and 0.1818, respectively. For free-text, LinearSVM yields a statistically distinguishable but practically small HTER reduction relative to LR (

Δ = - 0.0047

, 95% CI

[- 0.0074, - 0.0019]

,

p = 0.0020

), while GaussianNB and DecisionTree do not show statistically distinguishable HTER differences from LR. RandomForest again produces higher transferred-threshold HTER than LR in free-text (

Δ = 0.0449

, 95% CI

[0.0344, 0.0557]

,

p < 0.001

). Together, these paired results support the conclusion that stronger rank separability, as measured by AUC, does not necessarily imply more stable operating-point behavior under S1→S2 threshold transfer.

5.6. Summary of Findings

Across both tracks, per-user models achieve high cross-session separability (AUC), while transferred-threshold operating metrics are more sensitive to cross-session score shift. Personal and pooled-S1 global thresholds behave similarly on average, and scaler-only warm-up yields limited and sometimes unstable gains. Under client-side constraints, compact linear models provide the most stable operating behavior under threshold transfer, whereas more complex models can improve AUC yet degrade transferred-threshold performance.

6. Client-Side Deployment and Replay Demonstration

This section documents the exported client-side scoring artifacts and the browser replay demo. The goal is to demonstrate deterministic client-side scoring from compact exported artifacts without transmitting raw keystroke logs. The demo is intended for reproducibility and illustration rather than production hardening.

As illustrated in Figure 17, our workflow separates offline enrollment from client-side scoring. Offline notebooks operate on S1 to fit a pooled StandardScaler, train per-user LR models, and select operating thresholds, then export compact JSON artifacts for deployment. Table 11 lists the exported artifacts used by the browser demo and clarifies which files are required for inference versus demo-only metadata. The browser-side demo loads these artifacts and replays an S2 feature stream from a CSV to produce a deterministic score stream and thresholded decisions on-device. Importantly, the optional prompting policy in the demo consumes only the score stream and is not part of the strict S1→S2 evaluation protocol reported in Section 5.

6.1. Deployment Goal and Scope

Training and threshold selection are performed offline on S1, producing compact artifacts. At runtime, the browser loads the exported parameters and computes a score stream from feature vectors (replayed from CSV in the demo). The runtime path performs standardization, linear scoring, optional sigmoid conversion, and threshold comparison, with no server-side processing.

6.2. Exported Artifacts

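For each track, the offline pipeline exports a minimal set of files sufficient to reproduce inference deterministically on the client (for a selected target user in the demo; in full deployment the same format can be stored per enrolled user). The inference-required files are model.json, which holds the ordered feature list, the LR parameters (w, b), and the S1-selected threshold hint, and scaler.json, which holds the standardization parameters (μ, σ). Table 11 lists the full set of exported artifacts, including demo-only metadata.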

6.3. Replay Feature CSV (Demo Data Source)

To demonstrate client-side scoring without implementing raw keystroke parsing in JavaScript, the demo replays a feature stream exported as a CSV. Only the feature columns listed in model.json (in the exported order) are consumed by the inference pipeline. Any additional columns (e.g., a label column or an index-like column such as repetition/window id) are ignored by inference and used only for visualization.

Each row corresponds to one authentication unit: (i) fixed-text: one repetition-level sample, or (ii) free-text: one window-level feature vector (windowing and feature extraction are performed offline for the demo).
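The column-alignment step can be illustrated with a short Python sketch. The JSON key "features", the column names, and the CSV content are hypothetical placeholders, since the artifact schema is not reproduced here; empty cells are zero-imputed, following the preprocessing contract in Section 4.7.

```python
import csv
import io
import json
from typing import Iterator, List

# Hypothetical model.json fragment: only the ordered feature list is used here.
model = json.loads('{"features": ["f_b", "f_a"]}')

# Hypothetical replay CSV with an index-like column and a label column.
csv_text = "window_id,f_a,f_b,label\n0,1.5,2.5,1\n1,,3.0,0\n"


def aligned_rows(csv_file, feature_order: List[str]) -> Iterator[List[float]]:
    """Yield one feature vector per CSV row, in the exported feature order.

    Columns not listed in feature_order are ignored; missing or empty
    values are replaced by 0.0 (zero-imputation).
    """
    for row in csv.DictReader(csv_file):
        yield [
            float(row[name]) if row.get(name) not in (None, "") else 0.0
            for name in feature_order
        ]


vectors = list(aligned_rows(io.StringIO(csv_text), model["features"]))
```

Reading columns by name rather than by position means that a reordered or extended CSV still produces vectors in the order expected by the exported weights.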

6.4. Browser-Side Inference Contract

At runtime, the client aligns CSV columns to the exported feature list in model.json to ensure consistent ordering. Given a feature vector

x \in \mathbb{R}^{d}

, the client computes:

\begin{matrix} z & = (x - μ) ⊘ σ, \end{matrix}

(10)

\begin{matrix} s & = sigmoid (w^{⊤} z + b), \end{matrix}

(11)

where

(μ, σ)

are loaded from scaler.json and

(w, b)

are loaded from model.json. A fixed decision threshold

τ

is then applied: accept if

s \geq τ

, reject otherwise. By default in the demo,

τ

is taken from the S1-selected threshold hint stored in model.json; alternative thresholds (e.g., pooled-S1 global thresholds) are analyzed offline and are not required for demonstrating deterministic scoring.
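Equations (10) and (11) together with the threshold rule amount to a few arithmetic operations per authentication unit. The following Python sketch mirrors that contract; it is not the demo's JavaScript code, and the JSON keys ("mean", "scale", "coef", "intercept", "threshold") and numeric values are hypothetical stand-ins for the exported artifact fields.

```python
import json
import math
from typing import List

# Hypothetical artifact contents (illustrative values and key names).
scaler = json.loads('{"mean": [1.0, 1.0], "scale": [2.0, 2.0]}')
model = json.loads('{"coef": [0.5, -0.5], "intercept": 0.5, "threshold": 0.5}')


def sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def score(x: List[float]) -> float:
    """Standardize x element-wise (Equation (10)), then apply the LR scorer (Equation (11))."""
    z = [(xi - m) / s for xi, m, s in zip(x, scaler["mean"], scaler["scale"])]
    logit = sum(w * zi for w, zi in zip(model["coef"], z)) + model["intercept"]
    return sigmoid(logit)


def accept(x: List[float]) -> bool:
    """Accept when the score reaches the fixed S1-selected threshold."""
    return score(x) >= model["threshold"]


s = score([1.0, 3.0])          # z = [0, 1], logit = 0
decision = accept([1.0, 3.0])
```

Because the sigmoid is monotone, comparing the logit against a correspondingly transformed threshold yields the same decisions, which is consistent with the sigmoid conversion being described as optional in the runtime path.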

6.5. Client-Side Implementation Workflow

The proposed pipeline is intended to separate offline enrollment from lightweight client-side inference. During offline enrollment, the system constructs the fixed-text or free-text feature representation, fits the pooled S1 StandardScaler, trains the per-user one-vs-rest Logistic Regression verifier, selects the operating threshold on S1, and exports the resulting verifier artifacts. At deployment time, the client does not retrain the model or reselect thresholds. Instead, it loads the exported model.json and scaler.json files, constructs the current authentication unit, standardizes the feature vector using the exported scaler statistics, computes the linear Logistic Regression score, applies sigmoid conversion, and compares the score with the S1-selected threshold.

This implementation path requires only basic numerical operations at runtime: feature extraction, vector standardization, a dot product, a sigmoid function, and a threshold comparison. The exported artifacts are compact JSON files and can be inspected or replayed without specialized biometric infrastructure. The browser replay demo and browser timing benchmark implement this same inference contract, showing how the verifier can be executed in a standard browser environment from exported artifacts. Thus, the practical implementation burden is concentrated in offline preprocessing and enrollment, while the deployed client-side scoring path remains simple and deterministic.

6.6. Browser Timing Benchmark Protocol

To evaluate whether the exported artifacts can be scored efficiently in a browser client environment, we implemented a JavaScript timing benchmark that follows the same inference contract as the replay demo. The benchmark loads the exported model.json and scaler.json files together with replayed feature vectors, aligns the feature vectors to the exported feature order, and then measures the runtime scoring path of the exported verifier. The measured path consists of StandardScaler normalization, the Logistic Regression dot product, sigmoid conversion, and comparison against the S1-selected threshold.

Timing is measured with the browser performance.now() API. Before measurement, the benchmark executes a JavaScript benchmark warm-up pass to reduce the influence of just-in-time compilation and cache effects. To avoid timer-resolution artifacts caused by the extremely small per-sample scoring cost, each timing repeat performs multiple full passes over the replayed feature vectors, and the elapsed time is normalized by the total number of scored samples. The reported latency is, therefore, expressed per authentication unit: one repetition for fixed-text and one window for free-text. The benchmark measures the runtime scoring cost of the exported verifier, while excluding offline model training, S1 threshold selection, raw keystroke-event capture, feature-vector construction, UI rendering, background scheduling, and battery consumption.
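The structure of this measurement (a warm-up pass, several timing repeats, multiple full passes per repeat, and normalization by the number of scored samples) can be sketched in Python. This is an analogue only: time.perf_counter stands in for the browser performance.now() API, the repeat and pass counts are hypothetical, and the resulting numbers are not comparable to the browser measurements in Table 8.

```python
import math
import statistics
import time
from typing import Callable, List, Sequence, Tuple


def benchmark(score_fn: Callable[[Sequence[float]], float],
              vectors: List[Sequence[float]],
              repeats: int = 30, passes: int = 50) -> Tuple[float, float]:
    """Return (median, 95th percentile) per-sample latency in microseconds."""
    for v in vectors:              # warm-up pass, excluded from timing
        score_fn(v)

    per_sample_us = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(passes):    # several full passes reduce timer-resolution artifacts
            for v in vectors:
                score_fn(v)
        elapsed = time.perf_counter() - start
        per_sample_us.append(elapsed * 1e6 / (passes * len(vectors)))

    per_sample_us.sort()
    median = statistics.median(per_sample_us)
    # Nearest-rank 95th percentile over the timing repeats.
    p95 = per_sample_us[math.ceil(0.95 * len(per_sample_us)) - 1]
    return median, p95


# Toy scorer and vectors (illustrative only).
toy_vectors = [[0.1 * i, 0.2 * i, 0.3 * i] for i in range(100)]
median_us, p95_us = benchmark(lambda v: sum(v), toy_vectors, repeats=10, passes=5)
```

Normalizing each repeat by passes × number of vectors yields a per-unit latency even when a single scoring call is far below the timer resolution.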

The corresponding desktop and mobile browser measurements are reported in Table 8. The results show very small per-sample scoring latencies in all tested settings, but they also illustrate that latency varies across browser and device combinations. These differences are expected because devices and browsers may differ in CPU capability, JavaScript engine implementation, just-in-time compilation behavior, caching, timer resolution, and scheduling policy. Mobile devices may also introduce additional constraints such as thermal throttling, background-execution limits, and battery sensitivity.

The browser-side benchmark shows that the exported verifier is computationally lightweight in the tested environments, but computational feasibility alone does not imply reliable continuous-authentication operation. The more deployment-critical issue is whether a threshold chosen during enrollment or an earlier session remains appropriate in a later session. Our S1→S2 protocol intentionally keeps thresholds fixed after S1 selection, which exposes this operating-point transfer problem. Thus, the client-side pipeline is efficient to execute, but real deployments should monitor calibration drift, choose operating thresholds according to the desired FAR/FRR trade-off, and avoid interpreting high AUC as sufficient evidence of stable real-world authentication behavior.

6.7. Optional Prompting Policy (Demo-Only)

While offline experiments evaluate verification performance under fixed thresholds transferred from S1 to S2, an interactive system may use the score stream to decide when to request revalidation. To illustrate this separation, the demo optionally implements a simple score-consumption policy: a prompt is triggered when the score stays below

τ

for M consecutive units, followed by a cooldown of C units during which no additional prompts are issued. This mechanism is demo-only: it is not part of the offline evaluation protocol and does not affect the reported experimental results.
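A minimal sketch of such a policy is shown below. It assumes that low-score counting is suspended and reset during the cooldown; the text does not specify this detail, and the function name and toy score stream are hypothetical.

```python
from typing import List


def prompt_indices(scores: List[float], tau: float, m: int, c: int) -> List[int]:
    """Return the unit indices at which a revalidation prompt is triggered.

    A prompt fires after m consecutive scores below tau and is followed by a
    cooldown of c units during which no prompt is issued (assumption: the
    low-score run is not counted during cooldown).
    """
    prompts = []
    low_run = 0
    cooldown = 0
    for i, s in enumerate(scores):
        if cooldown > 0:
            cooldown -= 1
            low_run = 0
            continue
        low_run = low_run + 1 if s < tau else 0
        if low_run >= m:
            prompts.append(i)
            low_run = 0
            cooldown = c
    return prompts


# Toy score stream (illustrative only).
stream = [0.9, 0.2, 0.1, 0.3, 0.1, 0.1, 0.1, 0.1, 0.8]
prompts = prompt_indices(stream, tau=0.5, m=2, c=3)
```

With M = 2 and C = 3, the toy stream triggers prompts at units 2 and 7: the low scores at units 3 to 5 fall inside the cooldown and do not produce additional prompts.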

6.8. Score-Stream Visualization (Demo Output)

Figure 18 illustrates how the exported verifier is consumed in the browser replay demo. The horizontal axis represents the sequential authentication units replayed by the demo: fixed-text repetitions in the upper panel and free-text windows in the lower panel. The vertical axis shows the Logistic Regression score produced after applying the exported StandardScaler parameters and model coefficients. Higher scores indicate stronger similarity to the enrolled target user, whereas scores below the transferred S1-selected threshold are treated as non-genuine decisions under the fixed operating point.

The figure is intended to clarify the runtime behavior of the client-side verifier rather than to introduce an additional evaluation protocol. It shows that both fixed-text and free-text tracks can be represented as a continuous score stream, and that the same threshold-transfer principle used in the offline S1→S2 evaluation can be visualized in the replay setting. Prompt markers correspond to the optional demo-only policy described in Section 6.7, where repeated low scores can trigger re-authentication after a configurable consecutive-failure rule. These prompt markers are not used to compute AUC, HTER, FAR, FRR, or FRR@FAR = 5%; they only illustrate how a deployed interface might consume the verifier output.

7. Discussion

This section interprets the results in Section 5 under the strict S1→S2 threshold-transfer setting (Section 4), and discusses deployment implications, limitations, and future directions.

7.1. Separability vs. Transferred Operating Behavior

Across both tracks, we observe high cross-session separability (AUC) alongside noticeably higher operating error under threshold transfer (HTER; Table 4 and Table 6). This gap is consistent with a common verification phenomenon: scores remain reasonably rank-separable across sessions, yet calibration around the decision boundary shifts, degrading performance at a fixed operating point transferred from enrollment. In practical terms, AUC answers “can we separate genuine from impostor in principle?”, whereas HTER/FAR/FRR under transferred thresholds answer “will a fixed deployment threshold remain reliable under session shift?”. The latter is more sensitive to cross-session score drift and thus is the appropriate lens for client-side continuous operation.

7.2. Personal vs. Pooled-S1 Global Thresholds

Both fixed-text and free-text show only small average differences between personal thresholds and a single pooled-S1 global threshold (Table 4 and Table 6; Figure 4 and Figure 12). This suggests that, under the current per-user modeling assumption, the dominant factor is the user-specific score distribution learned in S1, while the choice between a personalized or shared operating point mainly acts as a mild score-calibration adjustment. Operationally, this finding supports a pragmatic option: using a single global threshold may simplify deployment and configuration while remaining close to per-user thresholding for most users in this setting. However, user-level heterogeneity remains visible in the tails (e.g., ΔHTER histograms), implying that threshold sharing may still be suboptimal for specific users or risk profiles.

7.3. Why Scaler-Only Warm-Up Yields Limited (and Sometimes Unstable) Gains

Scaler-only warm-up updates only standardization parameters

(μ, σ)

while keeping the classifier and S1-selected thresholds fixed. Empirically, warm-up effects are small in fixed-text and remain limited and unstable in free-text across warm-up lengths K (Section 5.1 and Section 5.2). A plausible interpretation is that cross-session shift is not dominated by a simple affine feature shift that can be fully corrected by re-centering/rescaling. Instead, the shift may involve changes in higher-order structure (e.g., variance patterns across primitives, conditional dependencies, or user state variations), for which updating only first/second moments is insufficient. Additionally, if the warm-up prefix is short or unrepresentative, refitting the scaler may amplify noise rather than reduce drift, which can explain occasional degradations at larger K.

From a deployment perspective, this is an important negative result: minimal “cheap” adaptation (scaler-only) may not reliably improve transferred-threshold performance, and should be considered optional rather than assumed beneficial.

The limited empirical benefit also affects the security interpretation of warm-up. Because warm-up requires a trusted genuine prefix, it introduces an additional operational assumption that is absent from the no-warm-up baseline. If this prefix is collected immediately after a strong authentication event, the assumption can be reasonable for short calibration or session-start adaptation. However, if the prefix is collected after the session is already compromised, the same mechanism may become a poisoning surface for the scaler. Therefore, the results should not be interpreted as evidence for unrestricted online adaptation. Instead, scaler-only warm-up is best viewed as an optional, guarded normalization update whose benefit is modest in this study and whose use should depend on the availability of a high-confidence authentication anchor.

7.4. Model Comparison: Strong AUC Does Not Guarantee Stable Threshold Transfer

In both tracks, compact linear models (LR and LinearSVM) exhibit competitive AUC and comparatively stable operating behavior under threshold transfer, while more complex models can increase AUC yet degrade HTER at the transferred EER operating point (e.g., RandomForest in fixed-text; Table 5). This supports a practical lesson for continuous authentication: under session shift, models with higher capacity can produce score distributions that are less stable across sessions, making a fixed enrollment-time threshold more fragile even if ranking performance improves. Therefore, for client-side CA where threshold transfer is required, model selection should prioritize robustness of score calibration under shift, not only separability.

7.5. Implications for Client-Side Continuous Authentication

Our pipeline is explicitly designed to be deployable on the client: inference consists of feature extraction, standardization, a linear score, and a threshold comparison. The browser replay demo requires only a minimal set of exported JSON artifacts for client-side scoring, namely model.json and scaler.json (Table 11). The browser-side timing results in Table 8 further show that the exported scoring path is lightweight in the tested desktop and mobile browser environments. After feature extraction, scoring cost scales linearly with feature dimension

O (d)

, while free-text window aggregation adds an

O (W)

per-window feature-construction cost. These properties align with environments where always-on heavyweight inference is undesirable (e.g., browser contexts).

The threshold study also shows a deployment trade-off. A pooled-S1 global threshold can reduce per-user configuration overhead with only minor average performance differences in this setting. Personal thresholds remain safer when user-level risk tolerance or heterogeneity matters.

The observed HTER and FRR values further show that the verifier should not be treated as a silent standalone access-control mechanism. A false rejection may create unnecessary warnings, re-authentication prompts, or temporary friction for the legitimate user. A false acceptance may delay intervention against an impostor session. The high FRR at the S1-selected FAR-constrained operating point illustrates this security–usability cost: stricter false-acceptance control can increase legitimate-user interruptions under cross-session drift. Therefore, the score stream is better used as a client-side risk signal. It should be combined with policy logic, such as multiple consecutive low scores, cooldowns, step-up authentication, and application-specific thresholds, rather than causing immediate lockout after a single window or repetition.

Finally, the limited warm-up benefit indicates that robust cross-session performance likely requires either stronger adaptation mechanisms or operational strategies that do not assume stable calibration from a single enrollment session.

7.6. Limitations and Assumptions

This study has several limitations and assumptions that constrain generalization beyond the evaluated setting:

Single-dataset scope and generalization. The empirical evaluation is conducted only on the public KeyRecs dataset using a fixed S1→S2 split. KeyRecs is suitable for this study because it provides both fixed-text and free-text keystroke data with explicit session identifiers, enabling strict cross-session threshold transfer. However, relying on a single dataset limits the generalizability of the results. The experiments do not cover different keyboard types, devices, keyboard layouts, user groups, language structures, longer-term drift, or fully live real-time usage scenarios. Therefore, the reported results should be interpreted as evidence for the proposed lightweight pipeline under the KeyRecs cross-session setting, rather than as a universal estimate of deployment performance. Broader validation on additional datasets and real-world multi-device settings remains future work.
Model scope. The model comparison is limited to lightweight classifiers that are plausible for client-side deployment, including Logistic Regression, Linear SVM, Random Forest, Decision Tree, and Gaussian Naive Bayes. Therefore, references to model behavior in this paper should be interpreted within this lightweight baseline set rather than as claims about all possible biometric classifiers. More complex temporal models, deep sequence models, or personalized calibration models may behave differently, but they may also introduce larger computational, storage, privacy, and deployment costs.
Trusted-prefix assumption for warm-up. The scaler-only warm-up analysis assumes that a short early prefix of the new session is genuine. This assumption is realistic only when the prefix is collected immediately after a high-assurance authentication event, such as MFA, WebAuthn, device unlock, or explicit re-authentication. It is not appropriate when the session may already be compromised or when the system only has low-confidence continuous-authentication evidence. A compromised prefix could poison the updated scaler statistics and alter later score distributions. Therefore, practical systems should treat warm-up updates as guarded and reversible. The number of warm-up samples should be bounded. Updates should be anchored to explicit authentication events. Classifier weights and thresholds should remain fixed unless stronger validation is available. Rollback to the original S1 scaler should also be possible. In our experiments, scaler-only warm-up produced limited and inconsistent improvements, so the deployment claim does not depend on warm-up as a necessary adaptation mechanism.
Adversarial and behavioral threat coverage. We evaluate standard genuine-versus-pooled-impostor discrimination using recorded keystroke samples, but we do not fully model adaptive adversaries. In practice, attackers may attempt targeted mimicry, shoulder surfing, replay of captured timing patterns, malware-based event injection, or gradual manipulation of adaptation mechanisms. Keystroke behavior may also change for benign reasons, including fatigue, stress, injury, keyboard replacement, device changes, posture changes, or long-term typing-habit drift. These adversarial and behavioral factors can shift score distributions and may degrade the stability of transferred operating thresholds. Future work should evaluate such threats under more realistic longitudinal and adversarial settings.
Browser timing scope. We added browser-side timing measurements for the runtime scoring path, but the benchmark remains narrower than a full production deployment study. The benchmark loads the exported artifacts and replayed feature vectors in the browser. The reported latency focuses on per-sample scoring after feature vectors are available. This includes StandardScaler normalization, Logistic Regression scoring, sigmoid conversion, and threshold comparison. It does not measure end-to-end browser integration with raw keystroke-event capture, real-time feature-vector construction, UI rendering, background scheduling, battery consumption, or long-running operation under changing device conditions. Although the benchmark covers both desktop and mobile browsers, broader evaluations across additional devices, operating systems, browser versions, and energy profiles remain future work.
Operating-point stability. Although the lightweight classifiers evaluated in this study show useful threshold-free separability, stable real-world CA operation depends on transferred operating points. A threshold selected on S1 may not preserve the same FAR/FRR balance on S2 because score distributions can shift across sessions. Future deployments should investigate calibration monitoring, adaptive but guarded threshold updates, and user-specific security–usability policies.
Privacy and deployment policy. The proposed client-side design reduces the need to transmit raw keystroke data to a server, but local behavioral biometric signals remain sensitive. Practical deployments should minimize retained raw events, protect exported artifacts and score logs, define retention policies, and provide clear user consent and fallback authentication mechanisms. These policy and privacy issues are outside the experimental scope of this study.

8. Conclusions and Future Work

This paper investigated lightweight continuous authentication using keystroke dynamics, with an emphasis on strict cross-session evaluation and client-side feasibility. Using the public KeyRecs dataset, we implemented and evaluated a per-user one-vs-rest pipeline with compact features for both fixed-text and free-text conditions.

A central finding is that threshold-free separability and transferred operating behavior must be interpreted separately. The S2 AUC results indicate that the lightweight per-user classifiers retain useful discriminative ranking across sessions: genuine samples often receive higher scores than impostor samples even under S1→S2 transfer. In the Logistic Regression baseline, fixed-text authentication achieved S2 AUC mean/median of 0.895/0.918, while free-text authentication achieved 0.884/0.899 under the default $W = 30$, $S = 10$ setting. However, the transferred-threshold metrics show that this ranking ability does not automatically translate into stable deployed operation. For example, mean HTER remained around 0.19 for fixed-text and around 0.186–0.188 for free-text, and FRR at the S1-selected FAR = 5% operating point remained substantial. A threshold selected on S1 can induce different FAR/FRR behavior on S2 because of cross-session score-distribution shift. This is why HTER, FAR, FRR, and FRR at a constrained FAR are central to the evaluation: they describe the user-facing and security-facing consequences of applying an enrollment-session threshold in a later session.
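The gap between ranking quality and transferred operating behavior can be illustrated with a minimal sketch. It assumes that scores at or above the threshold are accepted and that the EER threshold is taken as the observed S1 score minimizing |FAR − FRR|; the function names, the tie-breaking rule, and the toy scores are illustrative assumptions, not the exact selection procedure or data used in the experiments.

```python
from typing import Sequence, Tuple


def far_frr_hter(genuine: Sequence[float], impostor: Sequence[float],
                 threshold: float) -> Tuple[float, float, float]:
    """Return (FAR, FRR, HTER) when scores >= threshold are accepted."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr, (far + frr) / 2.0


def eer_threshold(genuine: Sequence[float], impostor: Sequence[float]) -> float:
    """Pick the observed score where |FAR - FRR| is smallest.

    Assumption: candidates are the observed scores and ties go to the
    lowest candidate; other tie-breaking rules are equally plausible.
    """
    candidates = sorted(set(genuine) | set(impostor))

    def gap(t: float) -> float:
        far, frr, _ = far_frr_hter(genuine, impostor, t)
        return abs(far - frr)

    return min(candidates, key=gap)


# Toy scores invented for illustration (not taken from KeyRecs).
s1_genuine = [0.9, 0.8, 0.7, 0.6]
s1_impostor = [0.5, 0.4, 0.3, 0.2]
s2_genuine = [0.7, 0.55, 0.5, 0.45]  # genuine scores drift downward on S2
s2_impostor = [0.5, 0.4, 0.3, 0.2]

thr = eer_threshold(s1_genuine, s1_impostor)              # selected on S1 only
s1_metrics = far_frr_hter(s1_genuine, s1_impostor, thr)
s2_metrics = far_frr_hter(s2_genuine, s2_impostor, thr)   # threshold unchanged
```

In this toy case most S2 genuine scores still rank above the S2 impostor scores, yet the S1-selected threshold rejects three of the four genuine samples. This is the kind of operating-point shift that AUC alone does not reveal.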

The main contribution of this work is, therefore, not a new complex classifier, but a deployment-oriented evaluation pipeline for lightweight keystroke-based continuous authentication. The pipeline combines fixed-text and free-text evaluation on the same public dataset, strict S1 training and threshold selection followed by S2 testing, personal versus pooled-S1 global threshold comparison, scaler-only warm-up analysis, and compact JSON artifacts for deterministic browser-side scoring. Together, these components make the gap between discriminative ranking, threshold transfer, and client-side deployability more explicit.

The deployment-oriented results further show that the exported verifier is computationally lightweight and can be executed efficiently in browser environments using compact JSON artifacts. Nevertheless, the main deployment challenge is not only inference cost, but also operating-point robustness under natural behavioral variation. Scaler-only warm-up provides a conservative adaptation mechanism, but its empirical benefits are limited and inconsistent in our experiments; therefore, it should be treated as an optional guarded update rather than a necessary component of the deployment claim. Importantly, warm-up also relies on a trusted-prefix assumption: the early samples used for adaptation should be collected after a high-assurance authentication event. If the session is already compromised, updating normalization statistics from this prefix could poison the scaler and affect later score distributions. For this reason, practical systems should bound the number of warm-up samples, keep classifier weights and thresholds fixed unless stronger validation is available, and allow rollback to the original S1 scaler.
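The guarded update described above can be sketched as follows. The sketch assumes that warm-up re-estimates the per-feature mean and population standard deviation from the first K trusted samples; the `Scaler` class, the `warm_up` function, the `min_scale` floor, and the toy numbers are hypothetical, and the update rule evaluated in the experiments may differ in detail.

```python
import statistics
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Scaler:
    mean: List[float]
    scale: List[float]

    def transform(self, x: Sequence[float]) -> List[float]:
        return [(v - m) / s for v, m, s in zip(x, self.mean, self.scale)]


def warm_up(s1_scaler: Scaler, trusted_prefix: Sequence[Sequence[float]],
            max_samples: int = 10, min_scale: float = 1e-8) -> Scaler:
    """Re-estimate mean/scale from at most `max_samples` trusted samples.

    The S1 scaler is returned unchanged when fewer than two samples are
    available, so the caller never receives an undefined scale.
    """
    prefix = list(trusted_prefix)[:max_samples]  # hard bound on the prefix
    if len(prefix) < 2:
        return s1_scaler
    columns = list(zip(*prefix))
    mean = [statistics.fmean(col) for col in columns]
    # Population std (ddof = 0); the floor keeps the division well defined.
    scale = [max(statistics.pstdev(col), min_scale) for col in columns]
    return Scaler(mean, scale)


s1 = Scaler(mean=[100.0, 50.0], scale=[10.0, 5.0])
prefix = [[110.0, 52.0], [130.0, 48.0], [120.0, 50.0]]

warmed = warm_up(s1, prefix, max_samples=2)  # only the first two samples count
active = warmed
# Rollback is a reference swap: classifier weights and thresholds are untouched.
active = s1
```

Keeping the S1 scaler immutable and swapping references makes rollback trivial, which fits the recommendation to treat warm-up as an optional guarded update.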

Several directions could strengthen cross-session robustness while preserving client-side feasibility: (i) lightweight score-calibration strategies, such as per-user post-hoc calibration on trusted data or threshold adjustment rules that account for score-distribution shift; (ii) adaptation beyond mean/variance, such as incremental model updates with strict anti-leakage controls or domain-adaptation methods constrained to small footprints; (iii) evaluation across more sessions, devices, keyboard types, operating systems, and keyboard layouts, including longer timescales and real-time usage conditions; and (iv) broader datasets and typing tasks to test generality across user populations, language structures, and application contexts.

Author Contributions

Conceptualization, Z.Z., M.P., G.C. and N.D.; methodology, Z.Z., M.P., G.C. and N.D.; software, Z.Z. and M.P.; validation, Z.Z., M.P., G.C. and N.D.; formal analysis, Z.Z. and M.P.; investigation, Z.Z., M.P., G.C. and N.D.; resources, Z.Z. and M.P.; data curation, Z.Z. and M.P.; writing—original draft preparation, Z.Z., M.P. and G.C.; writing—review and editing, Z.Z., M.P., G.C. and N.D.; visualization, Z.Z., M.P., G.C. and N.D.; supervision, M.P., G.C. and N.D.; project administration, M.P., G.C. and N.D.; funding acquisition, N.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The KeyRecs dataset analyzed in this study is a public dataset and is not redistributed in our repository. Reproducibility materials, including the experimental notebooks, exported client-side artifacts, browser benchmark files, selected result tables, and figures, are available at: https://github.com/Zhang-0124/client-side-keystroke-ca (accessed on 19 April 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Bonneau, J.; Herley, C.; Van Oorschot, P.C.; Stajano, F. The quest to replace passwords: A framework for comparative evaluation of web authentication schemes. In Proceedings of the 2012 IEEE Symposium on Security and Privacy; IEEE: New York, NY, USA, 2012; pp. 553–567.
2. Calzavara, S.; Jonker, H.; Krumnow, B.; Rabitti, A. Measuring web session security at scale. Comput. Secur. 2021, 111, 102472.
3. Dacosta, I.; Chakradeo, S.; Ahamad, M.; Traynor, P. One-time cookies: Preventing session hijacking attacks with stateless authentication tokens. ACM Trans. Internet Technol. TOIT 2012, 12, 1–24.
4. Fett, D.; Küsters, R.; Schmitz, G. A comprehensive formal security analysis of OAuth 2.0. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security; Association for Computing Machinery: New York, NY, USA, 2016; pp. 1204–1215.
5. Fridman, L.; Weber, S.; Greenstadt, R.; Kam, M. Active authentication on mobile devices via stylometry, application usage, web browsing, and GPS location. IEEE Syst. J. 2017, 11, 513–521.
6. Ayeswarya, S.; Singh, K.J. A comprehensive review on secure biometric-based continuous authentication and user profiling. IEEE Access 2024, 12, 82996–83021.
7. Stylios, I.; Kokolakis, S.; Thanou, O.; Chatzis, S. Behavioral biometrics & continuous user authentication on mobile devices: A survey. Inf. Fusion 2021, 66, 76–99.
8. Baig, A.F.; Eskeland, S. Security, privacy, and usability in continuous authentication: A survey. Sensors 2021, 21, 5967.
9. Jain, A.K.; Ross, A.; Prabhakar, S. An introduction to biometric recognition. IEEE Trans. Circuits Syst. Video Technol. 2004, 14, 4–20.
10. Joyce, R.; Gupta, G. Identity authentication based on keystroke latencies. Commun. ACM 1990, 33, 168–176.
11. Monrose, F.; Rubin, A.D. Keystroke dynamics as a biometric for authentication. Future Gener. Comput. Syst. 2000, 16, 351–359.
12. Teh, P.S.; Teoh, A.B.J.; Yue, S. A survey of keystroke dynamics biometrics. Sci. World J. 2013, 2013, 408280.
13. Banerjee, S.P.; Woodard, D.L. Biometric authentication and identification using keystroke dynamics: A survey. J. Pattern Recognit. Res. 2012, 7, 116–139.
14. Pisani, P.H.; Lorena, A.C. A systematic review on keystroke dynamics. J. Braz. Comput. Soc. 2013, 19, 573–587.
15. Pisani, P.H.; Giot, R.; de Carvalho, A.C.P.L.F.; Lorena, A.C. Enhanced template update: Application to keystroke dynamics. Comput. Secur. 2016, 60, 134–153.
16. Mhenni, A.; Cherrier, E.; Rosenberger, C.; Ben Amara, N.E. Double serial adaptation mechanism for keystroke dynamics authentication based on a single password. Comput. Secur. 2019, 83, 151–166.
17. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
18. Killourhy, K.S.; Maxion, R.A. Comparing anomaly-detection algorithms for keystroke dynamics. In Proceedings of the 2009 IEEE/IFIP International Conference on Dependable Systems & Networks; IEEE: New York, NY, USA, 2009; pp. 125–134.
19. Baig, A.F.; Eskeland, S.; Yang, B. Privacy-preserving continuous authentication using behavioral biometrics. Int. J. Inf. Secur. 2023, 22, 1833–1847.
20. Shi, W.; Cao, J.; Zhang, Q.; Li, Y.; Xu, L. Edge computing: Vision and challenges. IEEE Internet Things J. 2016, 3, 637–646.
21. Satyanarayanan, M. The emergence of edge computing. Computer 2017, 50, 30–39.
22. Abuhamad, M.; Abusnaina, A.; Nyang, D.; Mohaisen, D. Sensor-based continuous authentication of smartphones’ users using behavioral biometrics: A contemporary survey. IEEE Internet Things J. 2021, 8, 65–84.
23. Subash, A.; Song, I.; Lee, I.; Lee, K. Adaptability of current keystroke and mouse behavioral biometric systems: A survey. Comput. Secur. 2026, 160, 104731.
24. Bours, P. Continuous keystroke dynamics: A different perspective towards biometric evaluation. Inf. Secur. Tech. Rep. 2012, 17, 36–43.
25. Eberz, S.; Rasmussen, K.B.; Lenders, V.; Martinovic, I. Evaluating behavioral biometrics for continuous authentication: Challenges and metrics. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security; ACM: New York, NY, USA, 2017; pp. 386–399.
26. Bleha, S.; Slivinsky, C.; Hussien, B. Computer-access security systems using keystroke dynamics. IEEE Trans. Pattern Anal. Mach. Intell. 1990, 12, 1217–1222.
27. Obaidat, M.S.; Sadoun, B. Verification of computer users using keystroke dynamics. IEEE Trans. Syst. Man Cybern. Part B Cybern. 1997, 27, 261–269.
28. Zhong, Y.; Deng, Y.; Jain, A.K. Keystroke dynamics for user authentication. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops; IEEE: New York, NY, USA, 2012; pp. 117–123.
29. Bergadano, F.; Gunetti, D.; Picardi, C. User authentication through keystroke dynamics. ACM Trans. Inf. Syst. Secur. TISSEC 2002, 5, 367–397.
30. Araújo, L.C.F.; Sucupira, L.H.R.; Lizarraga, M.G.; Ling, L.L.; Yabu-Uti, J.B.T. User authentication through typing biometrics features. IEEE Trans. Signal Process. 2005, 53, 851–855.
31. Gunetti, D.; Picardi, C. Keystroke analysis of free text. ACM Trans. Inf. Syst. Secur. TISSEC 2005, 8, 312–347.
32. Messerman, A.; Mustafić, T.; Camtepe, S.A.; Albayrak, S. Continuous and non-intrusive identity verification in real-time environments based on free-text keystroke dynamics. In Proceedings of the 2011 International Joint Conference on Biometrics (IJCB); IEEE: New York, NY, USA, 2011; pp. 1–8.
33. Huang, J.; Hou, D.; Schuckers, S. A practical evaluation of free-text keystroke dynamics. In Proceedings of the 2017 IEEE International Conference on Identity, Security and Behavior Analysis (ISBA); IEEE: New York, NY, USA, 2017; pp. 1–8.
34. Çeker, H.; Upadhyaya, S. User authentication with keystroke dynamics in long-text data. In Proceedings of the 2016 IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS); IEEE: New York, NY, USA, 2016; pp. 1–6.
35. Ayotte, B.; Banavar, M.; Hou, D.; Schuckers, S. Fast free-text authentication via instance-based keystroke dynamics. IEEE Trans. Biom. Behav. Identity Sci. 2020, 2, 377–387.
36. Acien, A.; Morales, A.; Vera-Rodriguez, R.; Fierrez, J.; Monaco, J.V. TypeNet: Scaling up keystroke biometrics. In Proceedings of the 2020 IEEE International Joint Conference on Biometrics (IJCB); IEEE: New York, NY, USA, 2020; pp. 1–7.
37. Lu, X.; Zhang, S.; Hui, P.; Lio, P. Continuous authentication by free-text keystroke based on CNN and RNN. Comput. Secur. 2020, 96, 101861.
38. Aversano, L.; Bernardi, M.L.; Cimitile, M.; Pecori, R. Continuous authentication using deep neural networks ensemble on keystroke dynamics. PeerJ Comput. Sci. 2021, 7, e525.
39. Fan, R.E.; Chang, K.W.; Hsieh, C.J.; Wang, X.R.; Lin, C.J. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res. 2008, 9, 1871–1874.
40. Cutler, A.; Cutler, D.R.; Stevens, J.R. Random forests. In Ensemble Machine Learning; Springer: Berlin/Heidelberg, Germany, 2012; pp. 157–175.
41. Kumar, D.; Pawar, P.P.; Ananthan, B.; Rajasekaran, S.; Prabhakaran, T.V. Optimized Support Vector Machine Based Fused IoT Network Security Management. In Proceedings of the 2024 3rd International Conference on Artificial Intelligence For Internet of Things (AIIoT); IEEE: New York, NY, USA, 2024; pp. 1–5.
42. Bengio, S.; Mariéthoz, J. A statistical significance test for person authentication. In Proceedings of the Odyssey 2004: The Speaker and Language Recognition Workshop, Toledo, Spain, 31 May–3 June 2004.
43. Martin, A.; Doddington, G.; Kamm, T.; Ordowski, M.; Przybocki, M. The DET curve in assessment of detection task performance. In Proceedings of the 5th European Conference on Speech Communication and Technology (Eurospeech 1997), Rhodes, Greece, 22–25 September 1997; pp. 1895–1898.
44. Ross, A.; Rattani, A.; Tistarelli, M. Exploiting the “Doddington Zoo” effect in biometric fusion. In Proceedings of the 2009 IEEE 3rd International Conference on Biometrics: Theory, Applications, and Systems; IEEE: New York, NY, USA, 2009; pp. 1–7.
45. Jain, A.; Nandakumar, K.; Ross, A. Score normalization in multimodal biometric systems. Pattern Recognit. 2005, 38, 2270–2285.
46. Štruc, V.; Žganec-Gros, J.; Vesnicer, B.; Pavešić, N. Beyond parametric score normalisation in biometric verification systems. IET Biom. 2014, 3, 62–74.
47. Pisani, P.H.; Poh, N.; de Carvalho, A.C.P.L.F.; Lorena, A.C. Score normalization applied to adaptive biometric systems. Comput. Secur. 2017, 70, 565–580.
48. Vivaracho-Pascual, C.; Simon-Hurtado, A.; Manso-Martinez, E.; Pascual-Gaspar, J.M. Client threshold prediction in biometric signature recognition by means of Multiple Linear Regression and its use for score normalization. Pattern Recognit. 2016, 55, 1–13.
49. Moreno-Torres, J.G.; Raeder, T.; Alaiz-Rodríguez, R.; Chawla, N.V.; Herrera, F. A unifying view on dataset shift in classification. Pattern Recognit. 2012, 45, 521–530.
50. Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Stat. Plan. Inference 2000, 90, 227–244.
51. Giot, R.; El-Abed, M.; Rosenberger, C. Web-based benchmark for keystroke dynamics biometric systems: A statistical analysis. In Proceedings of the 2012 Eighth International Conference on Intelligent Information Hiding and Multimedia Signal Processing; IEEE: New York, NY, USA, 2012; pp. 11–15.
52. Dias, T.; Vitorino, J.; Maia, E.; Sousa, O.; Praça, I. KeyRecs: A keystroke dynamics and typing pattern recognition dataset. Data Brief 2023, 50, 109509.

Figure 1. End-to-end pipeline for client-side continuous authentication.

Figure 2. Timing primitives for consecutive keystrokes: $DU_{i}$, $DD_{i}$, $UD_{i}$, and $UU_{i}$. $UD_{i}$ can be negative under key overlap.

Figure 3. Fixed-text: distribution of per-user AUC on S2.

Figure 4. Fixed-text: HTER on S2 under transferred personal vs. pooled-S1 global threshold.

Figure 5. Fixed-text: distribution of $\Delta\mathrm{HTER} = \mathrm{HTER}_{\mathrm{global}} - \mathrm{HTER}_{\mathrm{personal}}$ on S2.

Figure 6. Fixed-text: per-user AUC on S2 across models.

Figure 7. Fixed-text: per-user HTER on S2 at each model’s transferred EER threshold.

Figure 8. Fixed-text: per-user FRR@FAR = 5% on S2 (threshold selected on S1).

Figure 9. Fixed-text: scaler-only warm-up effect on S2 (HTER, $K = 10$).

Figure 10. Fixed-text: distribution of scaler-only warm-up impact on S2, $\Delta\mathrm{HTER} = \mathrm{HTER}_{\text{no-warm}} - \mathrm{HTER}_{\text{warm}}$.

Figure 11. Free-text: distribution of per-user AUC on S2 under the default windowing setting ($W = 30$, $S = 10$).

Figure 12. Free-text thresholding comparison on S2 under the default windowing setting ($W = 30$, $S = 10$). The panels compare personal thresholds with the pooled-S1 global threshold for HTER, FAR, and FRR. All thresholds are selected on S1 and transferred unchanged to S2.

Figure 13. Free-text model comparison under the default windowing setting ($W = 30$, $S = 10$). The panels show the per-user S2 distributions of AUC, HTER at the transferred S1-selected EER threshold, and FRR at the S1-selected FAR = 5% operating point. LR = Logistic Regression, GNB = GaussianNB, DT = DecisionTree, and RF = RandomForest.

Figure 14. Free-text: scaler-only warm-up effect on S2 (HTER, $K = 10$).

Figure 15. Free-text: population median HTER on S2 vs. warm-up length K.

Figure 16. Free-text: per-user $\Delta\mathrm{HTER}$ distributions across K on S2.

Figure 17. Offline-to-client workflow for client-side scoring and the browser replay demo.

Figure 18. Browser replay score-stream visualization for fixed-text repetitions (top) and free-text windows (bottom). Each point is scored using the exported model.json and scaler.json artifacts under the same client-side inference contract used by the demo. The horizontal threshold line denotes the fixed S1-selected operating point transferred to the replay stream. Scores above the threshold are interpreted as target-like under this operating point, while scores below the threshold are treated as non-genuine decisions. Prompt markers illustrate the optional demo-only consecutive-low-score policy and are not part of the offline S1→S2 evaluation metrics.
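The consecutive-low-score policy mentioned in the Figure 18 caption can be sketched as a small state machine. The function name `prompt_indices`, the `patience` parameter, its value of 3, and the reset-after-prompt behavior are hypothetical choices for illustration; the demo's actual settings are not specified here.

```python
from typing import Iterable, List


def prompt_indices(scores: Iterable[float], threshold: float,
                   patience: int = 3) -> List[int]:
    """Indices at which a re-authentication prompt would fire.

    A prompt fires when `patience` consecutive scores fall below the
    threshold; the run counter then resets (an assumption of this sketch).
    """
    prompts: List[int] = []
    run = 0
    for i, score in enumerate(scores):
        run = run + 1 if score < threshold else 0
        if run >= patience:
            prompts.append(i)
            run = 0
    return prompts


# Toy score stream invented for illustration.
stream = [0.9, 0.4, 0.3, 0.8, 0.2, 0.1, 0.3, 0.7]
prompts = prompt_indices(stream, threshold=0.5, patience=3)
```

Requiring several consecutive low scores trades detection delay for fewer prompts caused by isolated low-scoring genuine samples. Like the prompt markers in the figure, such a policy sits outside the offline S1→S2 metrics.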

Table 1. Quantitative highlights with comparability and deployment flags.

Work	Dataset/Setting	Text	Decision Unit	Strict S1 → S2 Thr?	Client Evidence?	Reported Metric(s) and Representative Number(s)
Comparing anomaly detectors (2009) [18]	51 subjects, password benchmark	Fixed	password attempt	No	N/A	Top detectors: EER 9.6–10.2%.
Real-time free-text verification (2011) [32]	55 users, real-time verification	Free	$m$ trials	Partial	N/A	$m = 1$: eFAR 0.73%, iFAR 0.66%, FRR 9.48%; $m = 3$: eFAR 2.61%, iFAR 2.02%, FRR 1.84%.
Practical free-text evaluation (2017) [33]	Sliding-window free-text evaluation	Free	1-min window	Partial	N/A	FAR 1%, FRR 11.5% (1-min window); also reports CV: FAR 0.98%, FRR 11.85%.
Fast free-text authentication (2020) [35]	Clarkson II/Buffalo (public)	Free	100/200 DD digraphs	No	N/A	EER: Clarkson II 9.7% (100), 7.8% (200); Buffalo 5.3% (100), 3.0% (200).
CNN+RNN free-text CA (2020) [37]	Two public datasets	Free	fixed-length sequence	No	N/A	Best FRR (2.07%, 6.61%); best FAR (3.26%, 5.31%); best EER (2.67%, 5.97%).
TypeNet (2020) [36]	Aalto, large-scale	Free	50 keystrokes/seq	No	N/A	With 1K test users: EER 4.8%.
DNN ensemble identification (2021) [38]	Integrated dataset; identification setting	Mixed	identification	No	N/A	Accuracy up to 0.997 in identification.
CA methodology audit (2017) [25]	Methodology audit	N/A	N/A	N/A	N/A	Shows common choices can underestimate error rates by 63% and 81%.
KeyRecs dataset release (2023) [52]	Dataset release	Both	dataset specification	N/A	N/A	fixed-text.csv: 19,773 × 50; free-text.csv: 562,584 × 9; 99 participants.
This paper	KeyRecs, strict protocol	Both	Fixed: repetition; Free: 30 digraphs	Yes	Yes	Fixed: HTER 19.4%, FRR@FAR = 5% 43.8% (S2, threshold from S1); Free: HTER 18.8%, FRR@FAR = 5% 54.8% (S2, operating point from S1).

Note. “Strict S1 → S2 thr?” indicates whether the work evaluates threshold selection on one session and transfers the fixed operating point to a later session. “Partial” denotes temporal or streaming evaluation without the same strict S1 → S2 threshold-transfer protocol. “Client evidence” refers to reported client-side artifacts, demo, browser-side scoring, or similar deployment-oriented evidence. “N/A” indicates that the corresponding item is not applicable or was not reported in the cited work.

Table 2. Fixed-text repetition-level feature representation.

Component	Definition	Count
Decision unit	One complete prompted fixed-text repetition; each repetition is one verification sample.	1 sample
Raw source	KeyRecs fixed-text repetition-level table, with one row identified by participant, session, and repetition.	–
Excluded columns	`participant`, `session`, and `repetition`; these are used only for labels, S1/S2 splitting, and bookkeeping, not as model inputs.	3 columns
Retained features	All remaining numeric repetition-level timing descriptors, kept in a fixed dataset order.	47 features
Aggregation	None. Fixed-text samples are already complete repetitions, so no sliding-window aggregation is used.	–
Standardization	A pooled-S1 StandardScaler is fitted on the 47-dimensional S1 feature vectors and reused for S1/S2 scoring.	47 parameters per mean/scale vector
Deployment contract	The same ordered 47-dimensional vector is used by the LR verifier and the exported `model.json` and `scaler.json` artifacts.	47 inputs
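The deployment contract in Table 2 amounts to freezing one feature order and reusing it everywhere. The following minimal sketch assumes rows are available as dictionaries keyed by column name; the helper names and the three placeholder feature names are hypothetical stand-ins for the 47 real KeyRecs columns.

```python
from typing import Dict, List, Sequence

# Bookkeeping columns named in Table 2; they never enter the model input.
EXCLUDED = ("participant", "session", "repetition")


def feature_order(columns: Sequence[str]) -> List[str]:
    """Keep the dataset column order, dropping the bookkeeping columns."""
    return [c for c in columns if c not in EXCLUDED]


def to_vector(row: Dict[str, object], order: Sequence[str]) -> List[float]:
    """Build the model input in the frozen order; fail loudly on gaps."""
    missing = [name for name in order if name not in row]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(row[name]) for name in order]


# Hypothetical three-feature header standing in for the 47 real columns.
columns = ["participant", "session", "repetition", "f_a", "f_b", "f_c"]
order = feature_order(columns)

row = {"participant": "p001", "session": 1, "repetition": 3,
       "f_c": 0.3, "f_a": 0.1, "f_b": 0.2}
vector = to_vector(row, order)  # dictionary order does not matter
```

Deriving the order once and storing it with the exported artifacts guards against silent column permutations between the offline pipeline and the client.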

Table 3. Free-text window-level feature set (19 dimensions).

Group	Definition	Count
Basic/robust stats	For each primitive in $\{DU, DD, UD, UU\}$: mean, std, median, MAD	$4 \times 4 = 16$
Derived summaries	$cv_{DD} = \sigma(DD) / (\mu(DD) + \epsilon)$; long-pause ratio ($UD > 0.8$ s); typing-rate proxy (kps)	3
Total		19
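To make the 19-dimensional layout concrete, the sketch below computes the Table 3 features for one window and enumerates sliding windows with width $W$ and step $S$. Several details are assumptions rather than facts from the pipeline: times are taken to be in seconds, the standard deviation is the population form, MAD is the unscaled median absolute deviation, the typing-rate proxy is digraphs per second of down-down time, and partial tail windows are dropped. All function names and the value of `eps` are hypothetical.

```python
import statistics
from typing import Dict, Iterator, List, Sequence

PRIMITIVES = ("DU", "DD", "UD", "UU")


def sliding_windows(n_items: int, width: int = 30, step: int = 10) -> Iterator[range]:
    """Yield index ranges of full windows only (no partial tail window)."""
    for start in range(0, n_items - width + 1, step):
        yield range(start, start + width)


def mad(values: Sequence[float]) -> float:
    """Median absolute deviation from the median (unscaled)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def window_features(window: Dict[str, Sequence[float]],
                    long_pause_s: float = 0.8, eps: float = 1e-9) -> List[float]:
    """Return 19 values: 4 statistics x 4 primitives, then 3 summaries."""
    feats: List[float] = []
    for name in PRIMITIVES:
        v = list(window[name])
        feats += [statistics.fmean(v), statistics.pstdev(v),
                  statistics.median(v), mad(v)]
    dd, ud = list(window["DD"]), list(window["UD"])
    cv_dd = statistics.pstdev(dd) / (statistics.fmean(dd) + eps)
    long_pause_ratio = sum(t > long_pause_s for t in ud) / len(ud)
    # Typing-rate proxy (assumption): digraphs per second of down-down time.
    kps = len(dd) / (sum(dd) + eps)
    return feats + [cv_dd, long_pause_ratio, kps]


# Toy window invented for illustration: 30 digraphs, one long pause.
toy = {"DU": [0.1] * 30, "DD": [0.2] * 30,
       "UD": [0.1] * 29 + [1.0], "UU": [0.2] * 30}
features = window_features(toy)
starts = [w.start for w in sliding_windows(50, width=30, step=10)]
```

With 50 digraphs and the default $W = 30$, $S = 10$, the generator yields windows starting at indices 0, 10, and 20.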

Table 4. Fixed-text cross-session baseline (S2) and threshold strategy comparison.

Metric (S1-Select/S2-Evaluate)	Mean	Median
AUC	0.894679	0.918053
HTER (personal threshold)	0.194290	0.179172
HTER (pooled-S1 global threshold)	0.191233	0.173217

Table 5. Fixed-text: lightweight model comparison under cross-session evaluation.

Model	AUC (Mean)	AUC (Med)	HTER (Mean)	HTER (Med)	FRR@FAR = 5% (Mean)	FRR@FAR = 5% (Med)
DecisionTree (depth = 8)	0.7243	0.7221	0.2758	0.2779	0.5443	0.5400
GaussianNB	0.8495	0.8768	0.1957	0.1799	0.6345	0.7400
LR (baseline)	0.8947	0.9181	0.1943	0.1792	0.4382	0.4100
LinearSVM	0.8933	0.9213	0.1924	0.1767	0.4351	0.4100
RandomForest (200, depth = 10)	0.9343	0.9680	0.3764	0.3964	0.6125	0.7400

Table 6. Free-text: cross-session baseline under default windowing ($W = 30$, $S = 10$).

Metric (S1-Select/S2-Evaluate)	Mean	Median
AUC (S2)	0.884	0.899
HTER (personal EER threshold)	0.188	0.167
HTER (pooled-S1 global threshold)	0.186	0.164
FRR@FAR = 5% (S1 operating point)	0.548	0.604
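The FRR@FAR = 5% row can be read as a two-step procedure: choose a threshold on S1 so that the S1 FAR does not exceed 5%, then measure FRR on S2 at that fixed threshold. The sketch below assumes scores at or above the threshold are accepted and restricts candidate thresholds to observed impostor scores; the function names, the fallback rule, and the toy scores are illustrative assumptions.

```python
from typing import Sequence


def far_constrained_threshold(impostor: Sequence[float],
                              max_far: float = 0.05) -> float:
    """Smallest observed impostor score whose S1 FAR is <= max_far.

    Scores >= threshold are accepted. If no observed score satisfies the
    constraint, fall back to a value just above the maximum impostor score.
    """
    n = len(impostor)
    for t in sorted(set(impostor)):
        far = sum(s >= t for s in impostor) / n
        if far <= max_far:
            return t
    return max(impostor) + 1e-12


def frr_at(genuine: Sequence[float], threshold: float) -> float:
    """Fraction of genuine scores rejected at the given threshold."""
    return sum(s < threshold for s in genuine) / len(genuine)


# Toy scores invented for illustration (not taken from KeyRecs).
s1_impostor = [i / 100 for i in range(1, 21)]  # 0.01, 0.02, ..., 0.20
s2_genuine = [0.25, 0.22, 0.19, 0.10]

thr = far_constrained_threshold(s1_impostor, max_far=0.05)
s2_frr = frr_at(s2_genuine, thr)
```

With 20 impostor scores, a 5% FAR budget admits exactly one accepted impostor on S1, so the threshold lands on the highest impostor score. The S2 FRR then depends entirely on how far the genuine scores have drifted relative to that fixed point.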

Table 7. Free-text: lightweight model comparison under cross-session evaluation ($W = 30$, $S = 10$).

Model	AUC (Mean/Median)	HTER@Transferred EER Threshold (Mean/Median)	FRR@FAR = 5% (Mean/Median)
LR	0.884/0.899	0.188/0.167	0.548/0.604
LinearSVM	0.885/0.905	0.183/0.165	0.523/0.542
GaussianNB	0.870/0.897	0.187/0.166	0.529/0.486
DecisionTree (depth = 8)	0.835/0.859	0.186/0.159	0.556/0.480
RandomForest (200, depth = 10)	0.926/0.950	0.233/0.212	0.381/0.355

Table 8. Browser-side JavaScript scoring latency for the exported client-side verifier. The measured path includes StandardScaler normalization, Logistic Regression scoring, sigmoid conversion, and threshold comparison. Timing starts after replayed feature vectors are available, and excludes offline training, S1 threshold selection, raw-event capture, UI rendering, and battery measurements.

Device	Browser	Track	d	Median μs/Sample	p95 μs/Sample
Windows laptop	Chrome 147	Fixed-text	47	0.083	0.087
Windows laptop	Chrome 147	Free-text	19	0.057	0.063
Windows laptop	Edge 148	Fixed-text	47	0.101	0.114
Windows laptop	Edge 148	Free-text	19	0.085	0.114
Android phone	Chrome 148	Fixed-text	47	0.146	0.154
Android phone	Chrome 148	Free-text	19	0.088	0.098
Android phone	Firefox 150	Fixed-text	47	0.286	0.298
Android phone	Firefox 150	Free-text	19	0.170	0.178

Table 9. Paired user-level uncertainty analysis for personal versus pooled-S1 global thresholding. Here $\Delta = \mathrm{HTER}_{\mathrm{global}} - \mathrm{HTER}_{\mathrm{personal}}$ on S2; values close to zero indicate that the two thresholding strategies behave similarly under S1→S2 transfer. Negative values indicate lower HTER under the pooled-S1 global threshold. The confidence interval is a non-parametric bootstrap 95% interval, and p is from a two-sided Wilcoxon signed-rank test.

Track	Comparison	n	Metric	Mean Δ	Median Δ	95% CI/p
Fixed-text	Global–personal threshold	99	HTER	$-0.0031$	$-0.0007$	$[-0.0060, -0.0003]$/0.0378
Free-text	Global–personal threshold	98	HTER	$-0.0020$	$-0.0003$	$[-0.0044, 0.0004]$/0.2276
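The interval in Table 9 is a non-parametric bootstrap over paired per-user differences. A minimal percentile-bootstrap sketch is given below. It assumes the bootstrapped statistic is the mean Δ and uses 2000 resamples with a fixed seed; both choices, the function name, and the toy HTER values are assumptions for illustration. The Wilcoxon signed-rank p-value is not reproduced here; SciPy's `scipy.stats.wilcoxon` is one common implementation of that test.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def bootstrap_ci_mean(deltas: Sequence[float], n_boot: int = 2000,
                      alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(deltas)
    means: List[float] = []
    for _ in range(n_boot):
        # Resample users with replacement, keeping each pair intact.
        resample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    means.sort()
    lo_idx = round(n_boot * alpha / 2)
    hi_idx = n_boot - 1 - lo_idx
    return means[lo_idx], means[hi_idx]


# Toy per-user HTER values invented for illustration.
hter_personal = [0.20, 0.15, 0.30, 0.10, 0.25]
hter_global = [0.19, 0.15, 0.28, 0.11, 0.24]
deltas = [g - p for g, p in zip(hter_global, hter_personal)]

ci_low, ci_high = bootstrap_ci_mean(deltas)
```

Resampling the differences rather than the two HTER columns separately preserves the pairing by user, which is what makes the analysis a paired one.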

Table 10. Paired user-level uncertainty analysis for lightweight model comparisons against Logistic Regression. Here $\Delta = \mathrm{HTER}_{\mathrm{model}} - \mathrm{HTER}_{\mathrm{LR}}$ at the transferred S1-selected EER threshold on S2; positive values indicate higher transferred-threshold error than LR. The confidence interval is a non-parametric bootstrap 95% interval, and p is from a two-sided Wilcoxon signed-rank test. DecisionTree and RandomForest correspond to the lightweight tree baselines described in the experimental setup.

Track	Model vs. LR	n	Metric	Mean Δ	Median Δ	95% CI/p
Fixed-text	LinearSVM–LR	99	HTER	$-0.0019$	$-0.0014$	$[-0.0043, 0.0005]$/0.0863
Fixed-text	GaussianNB–LR	99	HTER	$0.0014$	$0.0116$	$[-0.0190, 0.0223]$/0.7376
Fixed-text	DecisionTree–LR	99	HTER	$0.0814$	$0.0858$	$[0.0620, 0.1010]$/$< 0.001$
Fixed-text	RandomForest–LR	99	HTER	$0.1818$	$0.1930$	$[0.1622, 0.2013]$/$< 0.001$
Free-text	LinearSVM–LR	98	HTER	$-0.0047$	$-0.0016$	$[-0.0074, -0.0019]$/0.0020
Free-text	GaussianNB–LR	98	HTER	$-0.0015$	$0.0013$	$[-0.0140, 0.0104]$/0.8220
Free-text	DecisionTree–LR	98	HTER	$-0.0018$	$0.0013$	$[-0.0134, 0.0099]$/0.7458
Free-text	RandomForest–LR	98	HTER	$0.0449$	$0.0429$	$[0.0344, 0.0557]$/$< 0.001$

Table 11. Minimal artifacts used by the browser replay demo. Only model.json and scaler.json are required for client-side scoring.

File	Inference	Format	Purpose
`model.json`	Yes	JSON	Feature order; LR coefficients/intercept; optional S1-selected threshold hint.
`scaler.json`	Yes	JSON	StandardScaler parameters (`mean_`, `scale_`) for runtime standardization.
`demo_metrics_target_user.json`	No	JSON	Demo-only reference metrics (sanity checks; integration metadata).
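The scoring path implied by Table 11 (standardize, apply the linear model, convert with a sigmoid, compare with the threshold) can be sketched in a few lines. The browser implementation is JavaScript; this Python sketch only mirrors the arithmetic. The keys `mean_` and `scale_` follow Table 11, whereas `feature_order`, `coef`, `intercept`, `threshold`, the two feature names, and all numeric values are hypothetical placeholders, since the exact artifact schema is not reproduced here.

```python
import json
import math
from typing import Dict

# Inline stand-ins for the two artifacts (hypothetical schema and values).
SCALER_JSON = '{"mean_": [0.2, 0.1], "scale_": [0.1, 0.05]}'
MODEL_JSON = ('{"feature_order": ["dd_mean", "du_mean"], '
              '"coef": [1.0, -2.0], "intercept": 0.5, "threshold": 0.5}')


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def score(features: Dict[str, float], model: dict, scaler: dict) -> float:
    """Standardize, apply the linear model, and map the logit to (0, 1)."""
    x = [features[name] for name in model["feature_order"]]
    z = [(v - m) / s for v, m, s in zip(x, scaler["mean_"], scaler["scale_"])]
    logit = model["intercept"] + sum(w * zi for w, zi in zip(model["coef"], z))
    return sigmoid(logit)


model = json.loads(MODEL_JSON)
scaler = json.loads(SCALER_JSON)

p = score({"dd_mean": 0.2, "du_mean": 0.1}, model, scaler)
accept = p >= model["threshold"]  # scores at or above the threshold are accepted
```

Because the path is a fixed sequence of arithmetic operations on two small JSON files, the same inputs yield the same score on any client, which is the sense in which scoring is deterministic.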

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, Z.; Papaioannou, M.; Choudhary, G.; Dragoni, N. Client-Side Continuous Authentication Using Keystroke Dynamics: A Lightweight Pipeline and Cross-Session Evaluation. Electronics 2026, 15, 2325. https://doi.org/10.3390/electronics15112325


