Article

LightLiveAuth: A Lightweight Continuous Authentication Model for Virtual Reality

School of Information Technology, Deakin University, Burwood, VIC 3125, Australia
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Submission received: 29 May 2025 / Revised: 28 August 2025 / Accepted: 28 August 2025 / Published: 2 September 2025

Abstract

As network infrastructure and Internet of Things (IoT) technologies continue to evolve, immersive systems such as virtual reality (VR) are becoming increasingly integrated into interconnected environments. These advancements allow real-time processing of multi-modal data, improving user experiences with rich visual and three-dimensional interactions. However, ensuring continuous user authentication in VR environments remains a significant challenge. To address this issue, an effective user monitoring system is required to track VR users in real time and trigger re-authentication when necessary. Based on this premise, we propose a multi-modal authentication framework, named MobileNetV3pro, that uses eye-tracking data for authentication. The framework applies a transfer learning approach by adapting the MobileNetV3Large architecture (pretrained on ImageNet) as a feature extractor. Its pre-trained convolutional layers are used to obtain high-level image representations, while a custom fully connected classification head is added to perform binary classification. Authentication performance is evaluated using Equal Error Rate (EER), accuracy, F1-score, model size, and inference time. Experimental results show that eye-based authentication with MobileNetV3pro achieves a lower EER (3.00%) than baseline models, demonstrating its effectiveness in VR environments.

1. Introduction

Virtual reality (VR) authentication is the process of verifying a user’s identity in a virtual environment to prevent unauthorized access and protect user privacy and data security [1]. As VR technologies become increasingly integrated into IoT ecosystems, spanning sectors such as gaming, education, healthcare, and remote industrial training, the need for secure and seamless authentication mechanisms has become more pressing. A robust authentication system not only safeguards sensitive user data but also protects against identity spoofing and enhances user experience across distributed, networked devices [2].
Traditional authentication methods rely on passwords, security tokens, or biometric identification [3]. However, in fully immersive VR environments, these approaches are impractical. Typing passwords, entering codes, or using physical tokens disrupts immersion and may introduce usability challenges. To address these limitations, researchers have explored biometric authentication methods such as voice recognition, facial recognition, gesture tracking, and even eye movement analysis [4,5,6,7]. Among these, eye-tracking-based authentication has emerged as a promising solution due to its unobtrusiveness and compatibility with modern VR headsets [8,9].
Beyond one-time biometric verification, continuous authentication (CA) further enhances security by monitoring the user’s identity throughout a VR session rather than relying on a single authentication step. This prevents unauthorized access after login and mitigates risks such as session hijacking and unauthorized takeovers [10]. However, experimental results from prior studies indicate that achieving real-time user authentication in VR environments demands both high accuracy and low latency. Motion prediction methods often suffer from reduced accuracy, as VR headsets may struggle to reliably capture fine-grained user movements. Additionally, authentication schemes based on 3D models tend to offer weaker security and require users to spend considerable time on interaction and accurate input, which may degrade the user experience. In contrast, biometric authentication has emerged as a more secure and efficient alternative. However, most existing eye-tracking-based methods rely on one-time verification and often report relatively high EERs, increasing the likelihood of authentication errors and user frustration. In this study, the robustness of the proposed method was further validated against adversarial threats through Projected Gradient Descent (PGD) attack experiments, confirming its resilience.
This paper explores continuous authentication in VR, focusing on eye-region features and behavioral patterns for real-time identity verification. We review existing VR authentication methods, propose a multi-modal authentication framework, and evaluate its effectiveness using MobileNetV3 for eye-based authentication. The remainder of the paper is organized as follows: Section 2 reviews related work on VR authentication, Section 3 introduces the proposed continuous authentication methods, Section 4 describes the experimental design and evaluation, Section 5 discusses challenges and future directions, and Section 6 concludes the study.

2. Related Work

VR authentication refers to the process of verifying a user’s identity before granting access to a VR system or application. As VR technology continues to evolve and gain popularity across various sectors such as gaming, education, healthcare, and professional training, the need for secure and effective authentication methods has become increasingly vital. VR authentication helps protect users’ privacy, secure sensitive data, and ensure that only authorized individuals can access VR environments [1]. It plays a critical role in safeguarding the VR space, enhancing user experience, and building trust in these emerging technologies [2,3].
Early explorations of VR authentication mechanisms include Yu’s [11] investigation of the feasibility of implementing various authentication systems in VR. Building on these insights, George [12] adapted PINs and Android unlock patterns for use in VR, evaluating them in terms of usability and security. Their work focused on the distinct nature of VR’s private visual channel—where users are immersed in the virtual world—to make it more difficult for unauthorized observers to view the authentication process.
Researchers have also taken a more novel approach to VR authentication, proposing unique methods that exploit VR’s immersive capabilities [13,14]. Mann [15] introduced RubikAuth, an innovative VR-based authentication scheme that uses 3D models for user verification. Traditional password-based logins in VR remain vulnerable to theft, prompting a shift toward biometric authentication methods. Miller [16] applied Siamese neural networks to learn a distance function characterizing systematic differences between data provided across pairs of dissimilar VR systems. Their method achieved average EER values of 1.38–3.86% on a benchmark dataset of 41 users performing a ball-throwing task across three VR systems (Oculus Quest, HTC Vive, and HTC Vive Cosmos), which was also used for comparison with previous VR biometric methods. They also reported the average accuracy of the recognition task using distance matching and fully convolutional networks separately on the registration dataset. Liebers [17] proposed a biometric system based on head orientation and gaze behavior while tracking moving stimuli. Their hybrid post hoc analysis achieved recognition accuracies of up to 75% with classical machine learning algorithms and up to 100% using deep learning. Li [18] found that unique head-movement patterns in response to external audio stimuli could authenticate users with a true acceptance rate of 95.57%, a false acceptance rate of 4.43%, and a processing latency of around 1.9 s on Google Glass. Similarly, Mustafa [19] showed that head, hand, and body movement patterns provide user-specific information that can be exploited for user authentication.
Voice-based authentication represents another prominent area of study. By leveraging unique voice characteristics, this method enables hands-free, seamless verification ideal for immersive VR experiences. Key advantages include increased security and user convenience, as users can authenticate without removing VR headsets [20,21]. Shang [22] explored internal body voice, which captures vibrations transmitted through the user’s body, making it difficult for attackers to replicate. This method provides a higher level of security by ensuring that the voice input is genuinely from the user wearing the headset. Their system successfully defended against various attacks with an accuracy of at least 98%, significantly enhancing the security of voice-based authentication in AR environments. Duezguen [23] examined various input systems for VR authentication, including voice recognition, highlighting the challenges and solutions in developing authentication methods that are both secure and user-friendly in the context of VR and AR head-mounted displays (HMDs). Their research focused on usability and security, evaluating different interaction methods like voice control, head movement, and touch controls for entering responses to authentication challenges. Bekkanti [24] assessed the reliability of voice biometrics and concluded that voice authentication is most effective when combined with other methods such as facial recognition or passwords for enhanced security.
Although most VR headsets lack eye-tracking sensors, recent models increasingly incorporate them to improve performance [25,26,27,28,29]. Sluganovic [30] noted that gaze-based authentication systems either suffer from high error rates or require long authentication times. Using a gaze-tracking device, they developed a prototype system and performed a series of systematic user experiments with 30 participants from the general public. They investigated performance and security under several different attack scenarios and showed that their system surpasses existing gaze-based authentication methods, both in achieving equal error rates (6.3%) and significantly lower authentication times (5 s). Rigas [25] further enhanced gaze-based biometrics by integrating dynamic saccadic features. They tested on a large database of 322 subjects, and the biometric accuracy demonstrated a relative improvement in the range of 31.6–33.5% for the verification scenario, and in the range of 22.3–53.1% for the identification scenario. More importantly, this improvement was demonstrated across different types of visual stimuli (random dot, text, video), indicating the enhanced robustness offered by the incorporation of saccadic vigor and acceleration cues. Khamis [8] confirmed that pursuits are robust against varying virtual 3D target sizes, with performance improving as trajectory sizes (e.g., radius) increase, particularly during walking interactions.
Other works have implemented real-time eye-movement-based systems. Lohr [9] designed a movement-driven authentication system for VR devices using the FOVE head-mounted display. Olade [13] demonstrated that users’ data can act as high-confidence biometric discriminators using machine learning classifiers such as k-Nearest Neighbors (kNN), thus adding a layer of security in identification, or dynamically adjusting the VR environment to user preferences. They also performed white-box penetration tests with 12 attackers, some of whom were physically similar to the participants. After the preliminary study, they obtained an average recognition confidence of 0.98 from the test data of actual participants, and a classification accuracy of 98.6%. Penetration tests showed that the confidence scores of all attackers remained below 50%, although physically similar attackers achieved higher confidence levels. These findings are helpful for the design and development of secure VR systems. Liebers [6] explored and analyzed the potential of gaze-based authentication as a secure method for user identification in VR settings.
Miller [31] presented a real-time system that continually authenticates users in VR by monitoring their motion trajectories during interactions. The system captures data from VR controllers and headsets, analyzing movement patterns to ensure user identity remains consistent throughout the session. This continuous authentication approach enhances security by promptly detecting any unauthorized user attempting to take over the VR session. Miller [32] analyzed the effectiveness of behavior-based biometric authentication across different VR systems, focusing on how user behavior, such as hand movements and interaction patterns, can be used to authenticate users within a single VR system and across multiple systems. Their paper evaluated various machine learning algorithms to classify and authenticate users based on their unique interaction behaviors. Li [33] developed a motion-forecasting system leveraging users’ motion trajectories (e.g., movements of the VR headset and controllers) to predict future actions and use these predictions for authentication. This approach improves security by relying on unique behavioral patterns that are difficult to replicate. Their approach employed deep learning techniques, such as convolutional neural networks (CNNs) and Siamese networks, to enhance the accuracy and reliability of the authentication process. Cheng [34] incorporated federated learning for motion prediction, avoiding centralized data collection to enhance privacy protection.
Most existing research focuses on traditional password systems, body movement analysis, and eye-tracking technology as standalone modalities for user identity verification. However, a more secure and efficient solution could emerge from combining these approaches. Kim [35] proposed a decentralized identifier for the metaverse, ensuring identity verification while protecting sensitive information. Wang [36] introduced a multi-attribute authentication framework to counter Man-in-the-Room (MITR) attacks in VR environments. MITR attacks refer to scenarios where an attacker physically intrudes on the VR environment, attempting to access a user’s sensitive data by observing or manipulating their authentication process. Noah [37] systematically evaluated knowledge-based AR/VR authentication schemes, comparing security and usability of PINs, gestures, and other methods.
Table 1 shows the summary of different VR authentication methods. In the field of VR authentication, many researchers have already proposed various methods, including 3D login systems, eye tracking, and gaze tracking. While these methods offer various advantages, significant challenges remain in ensuring fast, secure, and seamless authentication. Existing methods typically suffer from relatively low authentication accuracy, with error rates still posing challenges for real-world deployment. In addition, authentication latency is often high, leading to a degraded user experience in immersive environments where seamless interaction is crucial. Furthermore, most prior approaches, including VRBiom, lack mechanisms for continuous or periodic re-authentication, thereby failing to ensure session integrity over time—an aspect particularly critical in shared or public VR settings.
To address these gaps, this study proposes an improved authentication framework that builds upon the foundation of VRBiom [38], introducing a lightweight and real-time model for eye-region-based user verification. Our approach achieves 3.46% higher accuracy and reduced latency through optimized convolutional architectures and streamlined preprocessing. Moreover, continuous authentication capabilities are incorporated, enabling dynamic user verification throughout the VR session without interrupting the immersive experience.

3. Continuous Authentication Methods in VR

This study conducts a simulation of VR user authentication, using the VRBiom dataset to predict user identity. It enables real-time verification based on biometric data, providing secure and seamless authentication and triggering re-authentication only when a user change is detected. Although the approach has yet to be deployed on real VR hardware, the simulation enables evaluation of its feasibility and performance.
Figure 1 illustrates the overall workflow of the proposed experiment for VR user authentication. The process begins with the collection of the VRBiom dataset, including user eye-region images under various conditions. The collected images are then preprocessed through cropping, normalization, and augmentation to ensure consistency and enhance model training. Subsequently, several deep learning architectures, including MobileNetV3-Large, are employed for feature extraction and prediction. The trained models are applied to predict user identity and monitor eye-region information in real time. Finally, model parameters are optimized to improve accuracy, efficiency, and robustness against noise or adversarial conditions. This workflow provides a systematic approach for real-time VR user authentication while enabling comprehensive performance evaluation and model refinement. Following these preprocessing and evaluation steps, the workflow proceeds to the design and training of the final model, as detailed in the next subsection.

3.1. Model Design and Training

In this study, MobileNetV3-Large is adopted as the backbone model due to its efficiency and strong feature extraction capabilities. MobileNetV3 is a lightweight CNN designed for mobile and edge applications, characterized by its small model size and low parameter count. In the implementation, model pruning was further applied within the CNN to reduce computational complexity, making it more suitable for resource-constrained VR devices. To enhance performance in Presentation Attack Detection (PAD), the model is modified in several ways:
  • Additional CNN Layers for feature refinement: To capture more fine-grained spatial details, an extra convolutional block is integrated before the classification head. This block consists of two 3 × 3 convolutional layers, each followed by Batch Normalization and ReLU activation. This addition helps in refining discriminative features crucial for distinguishing bona fide samples from attacks.
  • Attention mechanism: To further enhance feature representation, Channel Attention (SE module) is incorporated in the additional CNN layers. This mechanism selectively emphasizes the most relevant feature channels, improving the model’s robustness to variations in illumination and occlusion (e.g., glasses, masks).
  • Modified classification head: Instead of using the default fully connected layer, it is replaced with a global average pooling (GAP) layer, followed by a dropout layer (p = 0.3) to mitigate overfitting. The final fully connected layer outputs a single logit, which is passed through a sigmoid activation function for binary classification.
The proposed model is based on the MobileNetV3-Large architecture tailored for VR eye-region authentication. The input is a 224 × 224 Red Green Blue (RGB) image fed into an initial 3 × 3 convolutional layer with 16 filters and stride 2, followed by a series of nine MobileNetV3 bottleneck blocks that incorporate expansion layers, depthwise convolutions, squeeze-and-excitation (SE) modules, and non-linear activations (Rectified Linear Unit (ReLU) or h-swish). An additional convolutional block with a 3 × 3 depthwise convolution of 128 channels, combined with an SE module and h-swish activation, is introduced to enhance fine-grained spatial feature extraction specific to eye movements. The final feature maps undergo global average pooling and pass through fully connected layers with dropout to generate a compact embedding, followed by a classification layer with softmax activation. The network contains approximately 2.6 million parameters and is optimized for low computational cost, making it suitable for embedded deployment in VR devices. This architectural design balances accuracy and efficiency, with all SE modules following the standard squeeze-and-excitation mechanism to improve channel-wise feature recalibration.
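To make the preceding description concrete, the following is a minimal Keras sketch of a MobileNetV3pro-style network: a pretrained MobileNetV3-Large backbone as feature extractor, an additional convolutional block with a squeeze-and-excitation (SE) module, and a GAP/dropout/single-logit head. It is an illustrative reconstruction rather than the authors' exact implementation; the `se_block` helper, channel counts, and layer ordering are assumptions based on the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def se_block(x, reduction=4):
    """Squeeze-and-excitation: channel-wise recalibration of the feature maps."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)
    s = layers.Dense(channels // reduction, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])

# Pretrained MobileNetV3-Large backbone used as a frozen feature extractor
# (the Keras version rescales raw RGB input internally).
backbone = tf.keras.applications.MobileNetV3Large(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
backbone.trainable = False

inputs = layers.Input(shape=(224, 224, 3))
x = backbone(inputs)

# Additional convolutional block for fine-grained eye-region features,
# followed by channel attention (SE), as described above.
x = layers.Conv2D(128, 3, padding="same", use_bias=False)(x)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
x = layers.Conv2D(128, 3, padding="same", use_bias=False)(x)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
x = se_block(x)

# Modified classification head: GAP + dropout + single-logit output
# (the sigmoid is applied through the loss, see the training procedure below).
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(1)(x)

model = models.Model(inputs, outputs, name="mobilenetv3pro_sketch")
```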

3.1.1. Training Procedure

The model is trained using binary cross-entropy loss with logits, which is well-suited for binary classification tasks. The Adam optimizer is employed with an initial learning rate of 0.001, which is reduced using a cosine annealing scheduler to fine-tune performance.
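A minimal sketch of this training configuration, assuming TensorFlow/Keras and the single-logit model defined above (`steps_per_epoch` is a placeholder that depends on the dataset and batch size):

```python
import tensorflow as tf

# Binary cross-entropy computed on raw logits (the head outputs a single logit).
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)

# Adam with an initial learning rate of 1e-3, decayed by a cosine annealing schedule.
steps_per_epoch = 1000   # placeholder; depends on dataset size and batch size
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3, decay_steps=30 * steps_per_epoch)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

model.compile(optimizer=optimizer, loss=loss_fn,
              metrics=[tf.keras.metrics.AUC(from_logits=True, name="auc")])
```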

3.1.2. Data Augmentation

To improve generalization, the following data augmentation techniques are applied:
  • Random horizontal flipping (p = 0.5)
  • Random rotation (±15 degrees)
  • Gaussian blur ($\sigma = 0.1$–$2.0$) for occlusion simulation
  • Color jittering (brightness, contrast, saturation adjustments)
Training Strategy
  • The model is trained for 30 epochs with a batch size of 32, using early stopping to prevent overfitting.
  • Each batch consists of a balanced mix of bona fide and attack samples to ensure fair learning.
  • The training set is used for backpropagation, while the validation set monitors performance improvements.
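The sketch below illustrates this training strategy under stated assumptions: `train_ds` and `val_ds` are unbatched tf.data pipelines of (image, label) pairs, and equal-weight resampling of the bona fide and attack streams is one way to approximate balanced batches; it is not the authors' exact procedure.

```python
import tensorflow as tf

# Early stopping on validation loss, restoring the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

# train_ds / val_ds are assumed to be unbatched tf.data pipelines of
# (image, label) pairs with label 1 = bona fide and 0 = attack.
# Resampling both streams with equal weights approximates balanced batches.
bona_fide_ds = train_ds.filter(lambda img, lbl: tf.equal(lbl, 1))
attack_ds = train_ds.filter(lambda img, lbl: tf.equal(lbl, 0))
balanced_ds = tf.data.Dataset.sample_from_datasets(
    [bona_fide_ds, attack_ds], weights=[0.5, 0.5]).batch(32)

model.fit(balanced_ds,
          validation_data=val_ds.batch(32),
          epochs=30,
          callbacks=[early_stop])
```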

3.2. Optimization

Through the analysis of training and testing data, a noticeable decline in the model’s predictive performance was observed when dealing with blurred images and highly illuminated images. The model struggles to extract discriminative features under these challenging conditions, leading to increased misclassification rates. This performance degradation suggests that variations in image clarity and lighting conditions significantly impact the model’s ability to generalize effectively.
Figure 2 shows an example of how highlights (overexposed regions) affect prediction accuracy. During testing, we observed that the MobileNetV3 model’s prediction accuracy degrades when exposed to certain visual distortions in the input images. Specifically, images affected by motion blur or defocus blur tend to reduce the model’s ability to accurately extract discriminative features from the eye region. Furthermore, the presence of strong light sources or specular highlights (e.g., light spots or glare) in the image often leads to inaccurate feature extraction or incorrect classification, likely due to overexposed regions masking critical visual cues. These factors introduce challenges in real-world environments where lighting and motion cannot be consistently controlled, particularly in mobile or wearable VR systems. To improve the accuracy of our MobileNetV3-based classification model, several optimization strategies were implemented, including network architecture refinement, training adjustments, and data handling techniques.
1. Optimizing the Fully Connected Layers: The original model used a Flatten layer followed by fully connected layers with dropout. To enhance generalization and reduce overfitting, Flatten was replaced with GlobalAveragePooling2D, which reduces the number of trainable parameters while preserving spatial information. Additionally, we introduced BatchNormalization after each dense layer to accelerate convergence and stabilize training. The dropout rate was also adjusted to 0.3 to prevent excessive feature loss.
2. Fine-Tuning More Layers: Initially, we only trained the last 10 layers of MobileNetV3. To allow the model to learn more task-specific features, the training scope was expanded to the last 30 layers (a minimal code sketch follows at the end of this item). This strategy enables the network to adapt deeper feature representations while still leveraging the pretrained weights from ImageNet. Fine-tuning modifies the trainable parameter set:
$$\theta = \{\theta_{\mathrm{frozen}}, \theta_{\mathrm{trainable}}\}$$
where:
  • $\theta$ denotes the set of all parameters in the network, comprising both frozen and trainable components during fine-tuning;
  • $\theta_{\mathrm{frozen}}$ represents parameters in layers frozen to retain pretrained ImageNet features;
  • $\theta_{\mathrm{trainable}}$ corresponds to parameters in fine-tuned layers.
Expanding $\theta_{\mathrm{trainable}}$ from the last $k = 10$ to the last $k = 30$ layers allows deeper feature adaptation while leveraging pretrained representations. The parameter update follows:
$$\theta_{\mathrm{trainable}}^{(t+1)} = \theta_{\mathrm{trainable}}^{(t)} - \eta \, \frac{\partial L}{\partial \theta_{\mathrm{trainable}}}$$
where:
  • $\theta_{\mathrm{trainable}}^{(t)}$ denotes the trainable parameters at iteration $t$;
  • $\eta$ is the learning rate that controls the update step size;
  • $L$ represents the loss function;
  • $\partial L / \partial \theta_{\mathrm{trainable}}$ is the gradient of the loss with respect to the trainable parameters.
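Continuing the model sketch from Section 3.1, the following snippet shows one way to expand the trainable scope to the last 30 backbone layers; the `backbone` and `model` variables refer to that earlier sketch and are assumptions, not the authors' exact code.

```python
import tensorflow as tf

# Unfreeze only the last 30 layers of the pretrained backbone, so that
# θ_trainable grows from the last k = 10 to the last k = 30 layers while
# the remaining layers (θ_frozen) keep their ImageNet weights.
backbone.trainable = True
for layer in backbone.layers[:-30]:
    layer.trainable = False

# Re-compile so the new trainable set takes effect; a smaller learning rate
# keeps fine-tuning of the deeper layers stable.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
```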
3. Learning Rate Scheduling: A fixed learning rate may cause suboptimal convergence, so we adopt the ReduceLROnPlateau scheduler to dynamically adjust the learning rate during training. This scheduler monitors the validation loss and reduces the learning rate by a factor of 0.5 if no improvement is observed for 5 consecutive epochs (patience = 5). The learning rate is not reduced below a minimum threshold of $1 \times 10^{-6}$ to prevent excessively small updates. This configuration balances convergence speed and stability, and the parameters were chosen based on preliminary tuning experiments; reporting these hyperparameters improves the reproducibility and clarity of our training process. Formally, the learning rate is reduced by a factor of $\gamma$ when the validation loss stagnates:
$$\eta_{t+1} = \begin{cases} \eta_t, & \text{if } L_{\mathrm{val}}(t) < L_{\mathrm{val}}(t-p) \\ \eta_t \cdot \gamma, & \text{if } L_{\mathrm{val}}(t) \ge L_{\mathrm{val}}(t-p) \end{cases}$$
where:
  • $\eta_t$ is the learning rate at epoch $t$;
  • $L_{\mathrm{val}}(t)$ is the validation loss at epoch $t$;
  • $p$ is the patience parameter, controlling the number of epochs before reducing the learning rate;
  • $\gamma$ is the reduction factor, typically set to 0.5.
This approach helps the model escape local minima and improves final accuracy.
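A minimal sketch of this scheduler configuration in Keras (used with a fixed initial learning rate rather than a schedule-based optimizer):

```python
import tensorflow as tf

# Halve the learning rate (gamma = 0.5) when the validation loss has not
# improved for 5 consecutive epochs (patience = 5), never going below 1e-6.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=5, min_lr=1e-6, verbose=1)

# Passed alongside the other callbacks during training, e.g.:
# model.fit(train_data, validation_data=val_data, callbacks=[reduce_lr, early_stop])
```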
MobileNetV3pro utilizes depthwise separable convolutions and hard swish activation, making it sensitive to gradient magnitudes. Weighted loss affects optimization by modifying gradients:
$$\frac{\partial L}{\partial \theta} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} w_c \, y_{i,c} \, \frac{1}{\hat{y}_{i,c}} \, \frac{\partial \hat{y}_{i,c}}{\partial \theta}$$
where:
  • $L$ is the loss function;
  • $\theta$ denotes the model parameters;
  • $N$ is the number of training samples in a batch;
  • $C$ is the total number of classes;
  • $w_c$ is the class-specific weight for class $c$, used to handle class imbalance;
  • $y_{i,c}$ is the ground-truth label for sample $i$ and class $c$, where $y_{i,c} = 1$ if sample $i$ belongs to class $c$ and 0 otherwise;
  • $\hat{y}_{i,c}$ is the predicted probability of sample $i$ belonging to class $c$;
  • $\partial \hat{y}_{i,c} / \partial \theta$ is the derivative of the predicted probability with respect to the model parameters.
This adjustment ensures a higher gradient contribution from minority classes, balanced parameter updates that reduce bias towards majority classes, and improved recall for underrepresented classes.
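One simple way to realize such class weighting in Keras is the `class_weight` argument of `model.fit`, as sketched below; the inverse-frequency weights and the `train_images`/`train_labels` placeholders are assumptions for illustration.

```python
import numpy as np

# Inverse-frequency class weights: errors on the minority class contribute
# proportionally larger gradients, mirroring the weighted loss above.
train_labels = np.asarray(train_labels)            # assumed array of 0/1 integer labels
counts = np.bincount(train_labels, minlength=2)
class_weight = {c: len(train_labels) / (2.0 * counts[c]) for c in range(2)}

# Keras multiplies each sample's loss by the weight of its class.
model.fit(train_images, train_labels,
          validation_data=(val_images, val_labels),
          epochs=30, batch_size=32,
          class_weight=class_weight)
```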
4. Data Augmentation for Robustness: To address issues caused by variations in lighting and reflections from eyeglasses, we applied extensive data augmentation. Using ImageDataGenerator, we introduced random brightness adjustments, rotation, width and height shifts, zooming, and horizontal flips. These transformations improve the model’s ability to generalize across diverse real-world conditions.
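A sketch of such an ImageDataGenerator pipeline is shown below; the specific parameter values and the `data/train` directory layout are illustrative assumptions rather than the exact configuration used.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation pipeline approximating the transformations listed above.
datagen = ImageDataGenerator(
    rescale=1.0 / 255,             # normalize pixel values
    brightness_range=(0.7, 1.3),   # random brightness (lighting / reflections)
    rotation_range=15,             # random rotation in degrees
    width_shift_range=0.1,         # horizontal shift
    height_shift_range=0.1,        # vertical shift
    zoom_range=0.1,                # random zoom
    horizontal_flip=True)          # random horizontal flip

# Directory layout (data/train/bona_fide, data/train/attack) is an assumption.
train_flow = datagen.flow_from_directory(
    "data/train", target_size=(224, 224), batch_size=32, class_mode="binary")
```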
5. Accuracy Loss: Instead of using standard binary cross-entropy or focal loss, an accuracy-based loss was implemented to directly optimize the model’s classification accuracy. This loss function penalizes incorrect predictions explicitly and provides intuitive feedback for training. It is particularly effective under challenging conditions such as extreme lighting, where misclassifications are more likely.
The accuracy loss is defined as:
$$L_{\mathrm{Acc}} = 1 - \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}(\hat{y}_i = y_i)$$
where:
  • $L_{\mathrm{Acc}}$ is the accuracy-based loss value;
  • $N$ is the total number of samples;
  • $\hat{y}_i$ is the predicted class label for the $i$-th sample;
  • $y_i$ is the ground-truth label for the $i$-th sample;
  • $\mathbb{I}(\hat{y}_i = y_i)$ is the indicator function, which returns 1 if $\hat{y}_i = y_i$ and 0 otherwise.
Since the indicator function is non-differentiable and cannot be used directly in gradient-based optimization, a differentiable surrogate is used in practice: a focal-loss-style weighting whose focusing parameter $\gamma$ reduces the impact of easy examples. By incorporating these improvements, the model’s robustness and accuracy in classifying eye-region images under challenging conditions were significantly enhanced. These strategies collectively addressed overfitting, improved feature extraction, and increased resilience to real-world variations.
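The following is a minimal sketch of such a differentiable, focal-style surrogate in TensorFlow; the `focal_loss` helper and its default values (γ = 2.0, α = 0.25) are assumptions for illustration, not the authors' exact loss.

```python
import tensorflow as tf

def focal_loss(gamma=2.0, alpha=0.25):
    """Differentiable surrogate for the accuracy loss: the focusing parameter
    gamma down-weights easy, well-classified examples."""
    def loss_fn(y_true, y_logits):
        y_true = tf.cast(tf.reshape(y_true, tf.shape(y_logits)), tf.float32)
        ce = tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true, logits=y_logits)
        p = tf.sigmoid(y_logits)
        p_t = y_true * p + (1.0 - y_true) * (1.0 - p)          # prob. of the true class
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        return tf.reduce_mean(alpha_t * tf.pow(1.0 - p_t, gamma) * ce)
    return loss_fn

model.compile(optimizer="adam", loss=focal_loss(gamma=2.0))
```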

3.3. Adversarial Training with PGD Attacks

To improve model robustness against adversarial perturbations and privacy-preserving noise, we incorporated adversarial training based on the PGD method. For each mini-batch, adversarial examples $x^{\mathrm{adv}}$ were generated by iteratively applying small perturbations to the input samples $x$, guided by the loss gradient. The PGD update rule is defined as:
$$x^{\mathrm{adv}}_{t+1} = \mathrm{Proj}_{B_\epsilon(x)}\left( x^{\mathrm{adv}}_{t} + \alpha \cdot \mathrm{sign}\left( \nabla_x L\left(f(x^{\mathrm{adv}}_t), y\right) \right) \right)$$
where:
  • $x$ is the original clean input sample;
  • $x^{\mathrm{adv}}_t$ is the adversarial example at iteration $t$;
  • $x^{\mathrm{adv}}_{t+1}$ is the updated adversarial example at iteration $t+1$;
  • $\alpha$ is the step size for each iteration;
  • $\nabla_x L(f(x^{\mathrm{adv}}_t), y)$ is the gradient of the loss function $L$ with respect to the input, evaluated at $x^{\mathrm{adv}}_t$;
  • $f(\cdot)$ is the model (e.g., a neural network) that outputs predictions;
  • $y$ is the ground-truth label of the input sample $x$;
  • $\mathrm{sign}(\cdot)$ denotes the element-wise sign function;
  • $B_\epsilon(x)$ is the $\ell_\infty$-norm ball of radius $\epsilon$ centered at $x$;
  • $\mathrm{Proj}_{B_\epsilon(x)}(\cdot)$ denotes the projection operation that ensures $x^{\mathrm{adv}}_{t+1}$ stays within the allowed perturbation bound around $x$.
During training, each batch consisted of both clean and adversarial samples, with a fixed ratio (e.g., 70% clean and 30% adversarial). Specifically, the perturbation bound was set to ϵ = 0.01, with a step size of α = 0.003, and each adversarial example was generated through 10 iterative updates. During the evaluation phase, only adversarial samples were used without mixing clean samples, resulting in a clean-to-adversarial ratio of 0:1. These hyperparameter settings follow standard practices in adversarial learning literature and ensure both the effectiveness of the attack and the reproducibility of the experiment. This strategy enabled the model to learn more generalized and robust feature representations, improving its performance under both Laplace Differential Privacy (LDP) noise and adversarial attacks.
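A minimal TensorFlow sketch of this PGD generation and batch-mixing procedure is given below, assuming the single-logit model above and inputs normalized to [0, 1]; the `pgd_attack` helper and the commented mixing snippet are illustrative assumptions, not the authors' exact code.

```python
import tensorflow as tf

loss_obj = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def pgd_attack(model, x, y, epsilon=0.01, alpha=0.003, steps=10):
    """Generate PGD adversarial examples inside an L-infinity ball of radius epsilon.
    x: image batch in [0, 1]; y: labels with shape (batch, 1)."""
    # Random start inside the allowed perturbation region.
    x_adv = x + tf.random.uniform(tf.shape(x), -epsilon, epsilon)
    x_adv = tf.clip_by_value(x_adv, 0.0, 1.0)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            tape.watch(x_adv)
            loss = loss_obj(y, model(x_adv, training=False))
        grad = tape.gradient(loss, x_adv)
        # Ascend the loss along the gradient sign, then project back onto the ball.
        x_adv = x_adv + alpha * tf.sign(grad)
        x_adv = tf.clip_by_value(x_adv, x - epsilon, x + epsilon)
        x_adv = tf.clip_by_value(x_adv, 0.0, 1.0)   # keep a valid pixel range
    return x_adv

# Mixing roughly 70% clean and 30% adversarial samples in a batch:
# n_adv = int(0.3 * x_batch.shape[0])
# x_adv = pgd_attack(model, x_batch[:n_adv], y_batch[:n_adv])
# x_mixed = tf.concat([x_adv, x_batch[n_adv:]], axis=0)
```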

4. Experiments

4.1. Experimental Set up

To evaluate the performance of our proposed model, MobileNetV3pro, we compare it with four widely used models: ResNet-50, ResNet-101, MobileNet, and MobileNetV3. The evaluation focuses on two key metrics: EER and accuracy, assessing each model’s effectiveness in VR user authentication. Additionally, the Receiver Operating Characteristic (ROC) curve is plotted for each model to visualize performance in distinguishing between genuine users and impostors. The ROC curve helps analyze the trade-off between the true positive rate (TPR) and false positive rate (FPR) across different threshold values.
All experiments were implemented in Python 3.10 and executed within the Jupyter Lab environment. To accelerate model training and inference, we utilized a system equipped with an NVIDIA RTX 3080 GPU (NVIDIA Corporation, Santa Clara, CA, USA), an Intel Core i7 CPU, and 32 GB of RAM. The deep learning models were developed using Python 3.10 and TensorFlow 2.11, and run on Windows 10. This hardware and software configuration provided sufficient computational resources to efficiently conduct the VR user authentication experiments.

4.2. VRBiom Data

To evaluate our proposed continuous authentication system in VR, we utilize the VRBiom dataset [38], which, to the best of our knowledge, is the first periocular PAD dataset collected using HMD devices such as the Meta Quest Pro. This dataset provides high-quality Near-Infrared (NIR) periocular images, making it well-suited for biometric authentication and security studies in VR environments. These datasets have been anonymized and collected following the original authors’ institutional ethical approvals and comply with relevant privacy regulations. Our research does not involve collecting new biometric data or processing identifiable personal information beyond the scope of the original datasets. Figure 3 shows a sample of the eye region images.
Since the HMD closely fits around the user’s head, the dataset was collected under controlled conditions. Each identity, whether bona fide (genuine users) or presentation attacks (PAIs), was recorded within a single session. Each recorded sample includes two sub-samples, corresponding to left and right eye images captured by the NIR cameras of the Meta Quest Pro. During bona fide data collection, participants were first informed about the study and signed a consent form. Each subject was recorded under two conditions: with and without glasses, as shown in Figure 3. For each condition, the subject maintained three gaze states: steady gaze, moving gaze, and partially closed eyes. Each video recording lasted approximately 10 s at 72 FPS, resulting in sequences of about 650 frames after discarding overexposed initial frames. The spatial resolution of each frame is 224 × 224 pixels.
To simulate presentation attacks, PAIs were used to create a diverse attack dataset. The attacks included rigid masks with real eyes, rigid masks with fake 3D eyeballs, generic flexible masks with printed synthetic eyes, and custom flexible masks with fake 3D eyeballs. Additionally, print-based attacks using bona fide images, as well as auxiliary elements such as fake eyeballs, eyelashes, and glasses, were incorporated to enhance realism and variability. Table 2 summarizes the different types of PAIs used, while Figure 3 presents sample images from both bona fide and attack sessions. This dataset enables us to evaluate the robustness of our continuous authentication model against both genuine and spoofed identities, particularly in VR-based biometric authentication scenarios. All data used in this study were collected using Meta Quest Pro devices. As a result, the findings may be limited by hardware-specific characteristics of this headset, such as its eye-tracking precision and sensor configuration. Cross-device generalization, particularly to other VR platforms such as HTC Vive or Oculus Rift, was not evaluated in this study and remains an important direction for future work to ensure wider applicability in real-world deployments.

4.3. Data Pre-Processing and Feature Extraction

The dataset comprises video recordings of 25 bona fide subjects, captured under controlled conditions. Each subject participated in 36 recording sessions, which include variations across three gaze scenarios, two conditions (with and without glasses), and three repetitions from both the left and right cameras. In total, 900 bona fide videos were collected.
Attack samples were generated by selecting a near-frontal frame from both the with-glasses and without-glasses recordings of each subject. This frame was printed using a high-resolution laser printer (visible in the near-infrared spectrum) to create print attacks. For each eye (left and right), three attack attempts were recorded, both with and without glasses, resulting in 300 attack videos.
PAIs were obtained using different instruments, with each attack type recorded three times under both glasses and no-glasses conditions. Specifically:
  • Seven mannequin identities were used, leading to (7 × 3 × 2 =) 42 videos.
  • Two types of rigid masks (one with real eyes, one with fake 3D eyeballs) contributed 120 and 168 videos, respectively.
  • Flexible masks with either printed eyes or 3D eyeballs resulted in 240 and 192 videos, respectively.
An experimental protocol was established to partition the dataset into training, validation, and test sets, ensuring a balanced dataset for PAD assessment. These partitions are identity-disjoint, ensuring that subjects in one set do not appear in another. Each partition contains approximately one-third of the total dataset.
To ensure a fair evaluation and prevent any data leakage, we adopted an identity-disjoint partitioning strategy, where all 36 sessions recorded from a single participant were grouped together and assigned exclusively to one of the train, validation, or test sets. This prevents temporal or behavioral correlation between the partitions. In total, 50% of subjects were used for training, 25% for validation, and 25% for testing. No subject had data in more than one partition. The dataset was uniformly sampled by extracting every 10th frame. This process ensures that each frame is treated independently, without considering temporal correlations between consecutive frames. Video-based PAD detection, which utilizes temporal patterns, is beyond the scope of this study. At the frame level, the dataset is divided as follows:
  • Training partition: 68,394 frames from 1002 videos.
  • Validation partition: 40,080 frames from 448 videos (with identities distinct from training).
  • Test partition: 34,478 frames from 432 videos (including both bona fide and attack samples).
A manual inspection was conducted to remove erroneous samples caused by recording artifacts or technical glitches. While uniform sampling of every 10th frame was adopted to reduce data redundancy and computational load, we acknowledge that this approach may not fully capture fine-grained temporal transitions important for continuous authentication. Future work will investigate sliding-window and denser frame-sampling strategies to enhance the utilization of temporal behavioral dynamics.
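The partitioning and sampling strategy can be illustrated with the following sketch; the `video_records` structure, its field names, and the split helper are hypothetical and shown only to make the 50/25/25 identity-disjoint protocol and every-10th-frame sampling explicit.

```python
import random
from collections import defaultdict

def identity_disjoint_split(video_records, seed=42):
    """Assign all sessions of each subject to exactly one partition
    (50% of subjects for training, 25% for validation, 25% for testing),
    so that no identity appears in more than one partition."""
    by_subject = defaultdict(list)
    for rec in video_records:              # rec: dict with 'subject_id' and 'frames'
        by_subject[rec["subject_id"]].append(rec)
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)
    n = len(subjects)
    train_ids = set(subjects[: n // 2])
    val_ids = set(subjects[n // 2 : 3 * n // 4])
    split = {"train": [], "val": [], "test": []}
    for sid, recs in by_subject.items():
        part = "train" if sid in train_ids else "val" if sid in val_ids else "test"
        split[part].extend(recs)
    return split

def sample_every_nth_frame(frames, n=10):
    """Uniform temporal sampling: keep every n-th frame of a video (n = 10 here)."""
    return frames[::n]
```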

4.4. Evaluation Metrics

To compare the performance of our improved MobileNetV3 model against ResNet-50, ResNet-101, MobileNet, and standard MobileNetV3, several evaluation metrics are used. In this study, the positive class corresponds to bona fide (genuine) users, and the negative class corresponds to attackers. The outcomes of the classifier are defined as follows:
  • True Positive (TP): correctly accepting a bona fide user;
  • True Negative (TN): correctly rejecting an attacker;
  • False Positive (FP): incorrectly accepting an attacker;
  • False Negative (FN): incorrectly rejecting a bona fide user.
Let $\hat{y}$ denote the predicted label obtained by applying a decision threshold $\tau$ to the model probability $s \in [0, 1]$:
$$\hat{y} = \begin{cases} 1 & \text{if } s \ge \tau \\ 0 & \text{if } s < \tau \end{cases}$$
Different values of the threshold τ affect the counts of TP, TN, FP, and FN, thus impacting Accuracy, F1-score, and EER.
  • Accuracy: Measures the overall correctness of predictions:
    $$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
  • EER: Commonly used in biometric authentication, it is the point where the false acceptance rate (FAR) and false rejection rate (FRR) are equal:
    $$EER = FAR = FRR$$
    The EER is achieved at the threshold $\tau_{EER}$ where the false acceptance rate (FAR) equals the false rejection rate (FRR), balancing security and usability. Lower EER values indicate better biometric authentication performance.
  • Area Under the Curve (AUC): Evaluates the trade-off between true positive rate and false positive rate:
    $$AUC = \int_0^1 TPR(FPR)\, d(FPR)$$
    where:
    • $TPR$ is the True Positive Rate, also known as sensitivity or recall, defined as:
      $$TPR = \frac{TP}{TP + FN}$$
    • $FPR$ is the False Positive Rate, defined as:
      $$FPR = \frac{FP}{FP + TN}$$
    • $TPR(FPR)$ means that the true positive rate is a function of the false positive rate, as both vary depending on the classification threshold;
    • The integral computes the area under the ROC curve, which plots $TPR$ versus $FPR$ as the threshold varies.
  • F1-score: The harmonic mean of precision and recall, providing a balanced measure of a model’s performance:
    $$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
    where:
    $$Precision = \frac{TP}{TP + FP}$$
    A higher F1-score indicates a better balance between precision and recall.
  • Model size: Refers to the total storage required for the trained model, measured in megabytes (MB). Smaller models are more efficient for deployment on edge devices like VR headsets.
  • Inference time: Measures the average time required for a model to make a single prediction:
    $$\mathrm{Inference\ Time} = \frac{\mathrm{Total\ Processing\ Time}}{\mathrm{Number\ of\ Samples}}$$
    Inference time was measured using Python’s time.time() function, recording the total time for 100 randomly generated input samples and computing the average per sample. All measurements were performed on an otherwise idle system to avoid interference from background processes.
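As a minimal sketch of how these metrics can be computed (assuming scikit-learn and NumPy; the `compute_metrics` and `average_inference_time` helpers, and the choice to report F1 at the EER threshold, are illustrative assumptions):

```python
import time
import numpy as np
from sklearn.metrics import roc_curve, auc, f1_score

def compute_metrics(y_true, y_score):
    """EER, AUC, and F1 from ground-truth labels (1 = bona fide) and match scores."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    fnr = 1.0 - tpr                                   # FRR for the bona fide class
    idx = int(np.nanargmin(np.abs(fpr - fnr)))        # operating point where FAR ≈ FRR
    eer = (fpr[idx] + fnr[idx]) / 2.0
    y_pred = (np.asarray(y_score) >= thresholds[idx]).astype(int)
    return {"EER": eer, "AUC": auc(fpr, tpr), "F1": f1_score(y_true, y_pred)}

def average_inference_time(model, input_shape=(1, 224, 224, 3), n_samples=100):
    """Average per-sample latency over randomly generated inputs."""
    samples = [np.random.rand(*input_shape).astype("float32") for _ in range(n_samples)]
    start = time.time()
    for x in samples:
        model.predict(x, verbose=0)
    return (time.time() - start) / n_samples
```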
These metrics comprehensively assess our model’s effectiveness in classification tasks, ensuring a fair comparison with other architectures. In addition to standard biometric evaluation metrics such as EER, AUC, and F1-score, it is important to consider metrics that are more aligned with the requirements of security-critical applications. Future evaluations will incorporate additional indicators such as attack detection rate, false alarm cost, and security-usability trade-off analysis. These metrics can offer deeper insights into the practical effectiveness and robustness of the authentication system under adversarial conditions and during real-time VR usage. By extending our metric set, we aim to better capture both the security guarantees and the usability demands critical to deployment in real-world VR environments.

4.5. Adversarial Attack

In this study, the PGD attack was employed to evaluate the robustness of our biometric verification model against adversarial examples. PGD is one of the most powerful and widely used first-order adversarial attack methods. It generates perturbations by iteratively applying small gradient-based updates within a defined perturbation bound, thereby crafting adversarial inputs that remain imperceptible to human vision but can deceive neural networks.
The primary motivation for using PGD is its ability to simulate strong and realistic adversarial threats. Unlike simpler attacks such as Fast Gradient Sign Method (FGSM), PGD performs multi-step optimization, making it a more rigorous and comprehensive benchmark for model robustness. This allows us to assess how well the model can maintain its verification performance in the presence of adversarial noise, which is crucial for real-world applications where system security and reliability are essential.
Moreover, integrating PGD attacks during training (i.e., adversarial training) can help improve model generalization and resilience by exposing the model to challenging examples, effectively reducing its vulnerability to unseen attacks. Therefore, PGD serves both as a robustness evaluation tool and a regularization mechanism for enhancing model security.
The PGD attack generates an adversarial example $x^{\mathrm{adv}}$ as follows:
$$x^{\mathrm{adv}}_0 = x + \delta, \quad \delta \sim U(-\epsilon, \epsilon)$$
$$x^{\mathrm{adv}}_{t+1} = \Pi_{B_\epsilon(x)}\left( x^{\mathrm{adv}}_{t} + \alpha \cdot \mathrm{sign}\left( \nabla_x L\left(f(x^{\mathrm{adv}}_t), y\right) \right) \right)$$
where:
  • $\Pi_{B_\epsilon(x)}(\cdot)$ denotes the projection operator onto the $\ell_\infty$-ball of radius $\epsilon$ centered at $x$;
  • $\mathrm{sign}(\cdot)$ is the element-wise sign function;
  • $\nabla_x L(f(x), y)$ is the gradient of the loss with respect to the input.
This formulation illustrates the iterative nature of the PGD attack, where the adversarial example is gradually refined over multiple steps. The process begins with a small random initialization within the allowed perturbation region to avoid gradient masking. In each step, the adversarial example is updated in the direction that maximally increases the model’s loss using the sign of the gradient. The projection operator ensures that the updated input remains within the $\ell_\infty$-ball of radius $\epsilon$, maintaining perceptual similarity to the original input. By repeating this process $T$ times, PGD creates strong adversarial examples that are more effective than one-step attacks like FGSM. This makes PGD a powerful tool for evaluating model robustness and training models to resist adversarial perturbations through adversarial training.

4.6. Results and Evaluation

4.6.1. Quantitative Results

To evaluate the performance of our proposed model, we conducted experiments on VRBiom using five different architectures: ResNet-50, ResNet-101, MobileNet, MobileNetV3, and our proposed model. The evaluation was performed using EER and AUC as key performance metrics, which are widely used for biometric authentication systems.
Table 3 presents a comprehensive performance comparison of various deep learning models in terms of EER, AUC, F1-score, model size, and inference time. The results clearly highlight the superior performance of our proposed model, MobileNetV3pro, which achieves the best overall trade-off between verification accuracy and computational efficiency. Among all models, MobileNetV3pro achieves the lowest EER (3.00%), highest AUC (95.17%), and highest F1-score (94.79%), demonstrating its strong discriminative power and robustness. In contrast, deeper models such as ResNet-50 and ResNet-101 suffer from higher EERs (16.08% and 22.45%, respectively) and lower AUCs (83.89% and 80.30%), indicating their limitations in handling fine-grained biometric data, particularly in data-constrained scenarios.
Lightweight baselines like MobileNet and MobileNetV3 show competitive performance, with EERs of 3.68% and 15.15%, respectively. However, MobileNetV3pro outperforms both across all evaluation metrics. Compared to MobileNetV3, it reduces EER by 12.15 percentage points and increases AUC by 7.34 points. Although MobileNet achieves a comparable AUC (95.90%), its slightly higher EER (3.68%) and larger model size (16.75 MB) make it less efficient overall. In terms of model complexity, MobileNetV3pro has the smallest size (13.82 MB), significantly smaller than ResNet-50 (91.98 MB) and ResNet-101 (164.32 MB), and even lighter than MobileNetV3 (15.35 MB). It also achieves the fastest inference time (159.43 ms), making it especially well-suited for real-time biometric verification in resource-constrained environments such as VR headsets or mobile devices.
Figure 4 illustrates the ROC curves of the evaluated models, highlighting their TPR against the FPR at various decision thresholds. These visualizations offer an intuitive perspective on each model’s discriminative power and corroborate the quantitative findings presented in Table 3.
As shown in the figure, the ROC curve for the proposed model consistently lies above those of the baseline models, reflecting its superior verification performance. The area under the ROC curve for the proposed model is the highest among all, indicating its strong capability in distinguishing between genuine and impostor samples across a range of thresholds.
In comparison, ResNet-50 and ResNet-101 exhibit shallower curves, with noticeably lower AUCs and higher EERs, suggesting limited robustness and a tendency to misclassify samples, particularly under varying biometric inputs. MobileNet and MobileNetV3 perform better than the ResNet architectures, with more favorable ROC curves. However, the MobileNetV3 curve shows a broader variance, which aligns with its wider EER range reported in the table.
Overall, the ROC curve analysis reinforces the quantitative evaluation results and confirms that the proposed model not only maintains a low false positive rate but also achieves high recall, making it highly effective and reliable for biometric verification tasks in resource-constrained environments.

Several factors explain the gap between the ResNet and MobileNetV3 families. ResNet-101 is a deeper and more complex network than MobileNetV3, requiring a larger dataset to generalize well; if the dataset is relatively small, as with VRBiom, ResNet models may overfit, leading to higher EER. MobileNetV3, being a lightweight network designed for efficiency, can generalize better with limited data due to built-in architectural optimizations. ResNet-101 has significantly more parameters than MobileNetV3, and with smaller datasets the model might learn noise instead of meaningful biometric features, leading to suboptimal performance. MobileNetV3, with depthwise separable convolutions and a reduced number of parameters, is more efficient at learning essential patterns while reducing overfitting risks. ResNet models extract high-level abstract features effectively in tasks like image classification but may not be as efficient for biometric verification, which requires fine-grained, subtle features from eye-region data. MobileNetV3 uses a combination of squeeze-and-excitation (SE) blocks, inverted residual connections, and lightweight convolutions, making it better suited for capturing discriminative details in small regions like the eye area. ResNet-101 also requires more computations per forward pass, making it harder to train with limited resources, whereas MobileNetV3 is optimized for efficiency with techniques such as the h-swish non-linear activation and automated architecture search, and it benefits from better training stability due to its optimized structure, improving generalization performance. In summary, MobileNetV3’s architectural efficiency, better feature selection mechanisms, and lightweight design likely contributed to its superior performance over the ResNet models in our biometric verification task, whereas ResNet-101, while powerful, may not be well-suited to our dataset due to overfitting, computational inefficiency, and suboptimal feature extraction for fine-grained eye-region analysis.

4.6.2. Ablation Study

To analyze the contributions of different components in our proposed model, an ablation study was conducted by systematically adding or removing key modifications, including weighted loss, fine-tuning more layers, and learning rate scheduling. Table 4 summarizes the results of these experiments in terms of EER and AUC.
Table 4 presents an ablation study illustrating the effect of various optimization techniques applied to the baseline MobileNetV3 model. The evaluation considers EER, AUC, F1-score, model size, and inference time. The results clearly show that the performance of the baseline model can be significantly improved through targeted optimizations in both architecture and loss function design.
The unmodified MobileNetV3 model achieves an EER of 15.15%, an AUC of 87.83%, and an F1-score of 80.31%. These baseline values reflect moderate performance, leaving substantial room for improvement, particularly in classification robustness and discriminative power. Adding a weighted loss helps address class imbalance by penalizing minority class misclassifications more heavily. This adjustment slightly improves the F1-score to 84.42%, though the EER remains relatively unchanged at 15.27%, and AUC even drops slightly to 85.40%. Nevertheless, inference time is reduced to 802.30 ms, and model size becomes more compact (13.82 MB), suggesting efficiency gains.
When fine-tuning was applied to the last 30 layers, the model exhibited mixed effects. While inference time improved to 621.43 ms, demonstrating speed benefits from deeper tuning, the performance dropped slightly with an EER of 17.54% and AUC of 84.22%, possibly due to overfitting or suboptimal learning without complementary strategies like loss function modification. In contrast, the use of focal loss yielded a substantial leap in verification accuracy. The EER dropped sharply to 8.32%, AUC increased to 91.48%, and F1-score increased to 90.64%. This suggests focal loss is particularly effective in emphasizing hard-to-classify examples, thereby enhancing the model’s robustness and reducing false positives.
Finally, by integrating all three techniques—weighted loss, fine-tuning, and focal loss—the model achieved an EER of just 3.00%, an AUC of 95.17%, and an F1-score of 94.79%. Moreover, this configuration also delivered the fastest inference time (545.22 ms) and smallest model size (13.82 MB), demonstrating the efficiency and effectiveness of a well-rounded optimization strategy. These results underline the complementary nature of architectural and loss-based enhancements, while focal loss individually contributes most to improving discriminative ability. Its combination with weighted loss and deeper fine-tuning unlocked the full potential of MobileNetV3.

4.6.3. PGD Attack Results

Table 5 summarizes the model performance under clean and adversarial conditions. Without any defense mechanisms, the model exhibited a substantial performance degradation under PGD attack. Specifically, the EER increased markedly from 3.00% to 40.78%, and the AUC decreased from 95.17% to 58.37%. Furthermore, the F1-score declined significantly, highlighting the model’s vulnerability to adversarial perturbations.
Figure 5 shows the prediction accuracy of the evaluated models under PGD attack. MobileNetV3pro demonstrates strong robustness, whereas the EER of the other models (ResNet-101 and MobileNet) approaches nearly 100%, with the best-performing baseline, ResNet-50, still reaching 84.54%, and their accuracy drops to nearly zero.
A key limitation of our study is the reliance on a single dataset (VRBiom), which includes data from only 25 subjects. Although this dataset offers valuable insights into eye-based VR user authentication, the relatively small sample size and lack of demographic diversity limit the generalizability of our findings. Future work will involve extending the evaluation to include additional publicly available or newly collected VR authentication datasets, as well as performing cross-dataset validation to assess model robustness and adaptability across varied scenarios. This step is essential before the proposed system can be considered for practical deployment in diverse real-world settings.

5. Discussion

Continuous authentication for VR headsets has been actively studied in recent years, with many papers reporting improvements in key factors such as EER and attack success rate. Building on this prior work, our study implemented a model with a lower EER and higher robustness for VR authentication. However, several limitations remain. First, the evaluation was based on simulation with a limited user dataset (25 participants); although the model was trained sufficiently, its robustness remains an issue. Second, the model is based on MobileNetV3-Large, a pretrained feature extractor; despite its effectiveness, the PGD attack success rate remains quite high, and users may still be vulnerable to such attacks. The approach also introduces potential challenges that impact user experience, such as interruptions caused by frequent authentication checks and false alarms that may frustrate users. These issues can disrupt VR immersion, diminishing the overall usability and acceptance of the system.
The current evaluation includes seven types of presentation attacks, with a limited number of samples in some categories (e.g., only seven mannequin identities). This limited diversity may not adequately represent the full range of real-world spoofing attempts. While these attacks were chosen based on feasibility and relevance, we acknowledge that the dataset’s diversity and scale can be further improved. Future work will aim to expand the attack dataset with more varied and realistic spoofing strategies (e.g., 3D mask attacks, AI-generated synthetic faces) and larger sample sizes to enable a more comprehensive evaluation of system robustness. The measured inference time of 545 ms for MobileNetV3pro exceeds typical VR real-time requirements (8–14 ms per frame). This result was obtained on a general CPU without hardware acceleration or model optimization; factors such as input resolution, model size, and security-related computations contribute to latency, and techniques like pruning, quantization, and GPU acceleration can substantially reduce inference time. Although our current study does not include direct measurements of power usage or thermal impact, the computational complexity of the proposed MobileNetV3pro model suggests a need for optimization when deploying on mobile VR platforms.
Future work will address these limitations through practical implementation on real VR headsets. This includes conducting experiments involving actual user participation to rigorously assess the effectiveness, security, and stability of the authentication model in real-world usage conditions. Conducting user-involved studies will allow us to capture natural biometric and behavioral variations, offering a more realistic validation of the system’s robustness. Moreover, we aim to explore additional biometric modalities, such as iris dynamics, facial micro-expressions, or voice features, as complementary inputs to enhance the accuracy and security of user verification. These multi-modal approaches could significantly improve resistance against spoofing or adversarial attempts. Since blurred images may influence predictions, we will also analyze the effect of image quality on prediction accuracy. We plan to perform detailed ablation experiments in future work to empirically justify the choice of expanding the trainable layers from 10 to 30 and will include theoretical reasoning to support this strategy. We also plan to conduct a comprehensive theoretical analysis and empirical evaluation of various optimization strategies, including a systematic comparison of different optimization algorithms and fine-tuning configurations, to identify the most effective approach for training our VR authentication model. For PGD attacks, we will evaluate different ratios of clean to adversarial data to determine the configuration that yields the most informative attack results.
We also intend to evaluate the system across multiple commercial VR platforms—such as HTC Vive, Oculus Quest 3, and Apple Vision Pro—to verify its hardware-agnostic compatibility and adaptability. By conducting comparative experiments across these devices, we will be able to analyze metrics such as EER and the success rate of PGD adversarial attacks under diverse hardware constraints and sensor configurations. Ultimately, our goal is to advance toward a deployable, user-friendly, and secure authentication solution for VR environments that balances performance with privacy and usability.
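Since EER will be the headline metric in these cross-device comparisons, the short snippet below sketches one common way to estimate it from verification scores using scikit-learn’s ROC utilities; the label convention (1 = genuine, 0 = impostor) and the dummy scores are illustrative assumptions.

```python
# Hedged sketch: estimating the Equal Error Rate (EER) from verification scores,
# i.e., the operating point where false acceptance and false rejection rates meet.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """labels: 1 = genuine, 0 = impostor; scores: higher means more likely genuine."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))      # threshold where FAR and FRR are closest
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy example with dummy scores (placeholders, not experimental data).
labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.92, 0.85, 0.40, 0.35, 0.20, 0.10])
print(f"EER = {equal_error_rate(labels, scores) * 100:.2f}%")
```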

6. Conclusions

The goal of this research was to develop an efficient and reliable continuous authentication system for VR environments, addressing the critical need for continuous user monitoring in shared or multi-user VR scenarios. This study contributes to the growing field of VR security by offering a real-time, dynamic solution to user authentication. The system enhances data protection in sensitive domains such as healthcare, finance, and education, while maintaining immersion and usability.
However, the current approach has notable limitations. The model exhibits vulnerability to adversarial attacks, which could undermine its real-world security. Furthermore, the evaluation was conducted on a single dataset, limiting generalizability across diverse VR environments and user populations. Hardware dependencies and performance variations under different lighting conditions also present practical challenges.
In conclusion, this study demonstrates the feasibility of continuous VR authentication and establishes a solid foundation for developing more secure and user-friendly VR systems. Future work should focus on enhancing adversarial robustness, conducting evaluations across diverse datasets and hardware setups, and addressing real-world deployment constraints to improve practical readiness.

Author Contributions

Conceptualization, L.Y.; methodology, P.L.; software, P.L.; validation, P.L. and L.Y.; formal analysis, P.L.; investigation, P.L.; resources, P.L.; data curation, P.L.; writing—original draft preparation, P.L.; writing—review and editing, P.L., F.C., L.P., T.H. and Y.Z.; visualization, P.L.; supervision, L.Y.; project administration, L.Y.; funding acquisition, L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset used in this study is the publicly available VRBiom dataset, which can be accessed at https://www.idiap.ch/en/scientific-research/data/vrbiom (accessed on 21 January 2025). No new data were created during this study.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
VR	Virtual Reality
AR	Augmented Reality
TP	True Positive
TN	True Negative
FP	False Positive
FN	False Negative
EER	Equal Error Rate
AUC	Area Under the Curve
PGD	Projected Gradient Descent
FGSM	Fast Gradient Sign Method
ROC	Receiver Operating Characteristic
SE	Squeeze and Excitation
ReLU	Rectified Linear Unit
IoT	Internet of Things
PIN	Personal Identification Number
HMDs	Head Mounted Displays
MITR	Man In The Room
GAP	Global Average Pooling
CA	Continuous Authentication
CNN	Convolutional Neural Networks
LSTM	Long Short-Term Memory
KNN	k-Nearest Neighbors
RGB	Red Green Blue
PAD	Presentation Attack Detection
NIR	Near-Infrared

References

  1. Jones, J.M.; Duezguen, R.; Mayer, P.; Volkamer, M.; Das, S. A literature review on virtual reality authentication. In Proceedings of the Human Aspects of Information Security and Assurance: 15th IFIP WG 11.12 International Symposium, HAISA 2021, Virtual, 7–9 July 2021; Proceedings. Springer: Berlin/Heidelberg, Germany, 2021; Volume 15, pp. 189–198. [Google Scholar]
  2. Kürtünlüoğlu, P.; Akdik, B.; Karaarslan, E. Security of virtual reality authentication methods in metaverse: An overview. arXiv 2022, arXiv:2209.06447. [Google Scholar] [CrossRef]
  3. Abdelrahman, Y.; Mathis, F.; Knierim, P.; Kettler, A.; Alt, F.; Khamis, M. CueVR: Studying the usability of cue-based authentication for virtual reality. In Proceedings of the 2022 International Conference on Advanced Visual Interfaces, Rome, Italy, 6–10 June 2022; pp. 1–9. [Google Scholar]
  4. Mathis, F.; Fawaz, H.I.; Khamis, M. Knowledge-driven biometric authentication in virtual reality. In Proceedings of the Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 25–30 April 2020; pp. 1–10. [Google Scholar]
  5. Kupin, A.; Moeller, B.; Jiang, Y.; Banerjee, N.K.; Banerjee, S. Task-driven biometric authentication of users in virtual reality environments. In Proceedings of the International Conference on Multimedia Modeling, Thessaloniki, Greece, 8–11 January 2019; Volume 25, pp. 55–67. [Google Scholar]
  6. Liebers, J.; Schneegass, S. Gaze-based authentication in virtual reality. In Proceedings of the ACM Symposium on Eye Tracking Research and Applications, Stuttgart, Germany, 2–5 June 2020; pp. 1–2. [Google Scholar]
  7. Pfeuffer, K.; Geiger, M.J.; Prange, S.; Mecke, L.; Buschek, D.; Alt, F. Behavioral biometrics in VR: Identifying people from body motion and relations in virtual reality. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, Glasgow, UK, 4–9 May 2019; pp. 1–12. [Google Scholar]
  8. Khamis, M.; Oechsner, C.; Alt, F.; Bulling, A. VRpursuits: Interaction in virtual reality using smooth pursuit eye movements. In Proceedings of the 2018 International Conference on Advanced Visual Interfaces, Castiglione della Pescaia, Grosseto, Italy, 29 May–1 June 2018; pp. 1–8. [Google Scholar]
  9. Lohr, D.; Berndt, S.-H.; Komogortsev, O. An implementation of eye movement-driven biometrics in virtual reality. In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications, Warsaw, Poland, 14–17 June 2018; pp. 1–3. [Google Scholar]
  10. Andam, A.; Bentahar, J.; Hedabou, M. Multi-modal deep reinforcement learning for visual security of virtual reality applications. IEEE Internet Things J. 2024, 11, 39890–39900. [Google Scholar] [CrossRef]
  11. Yu, Z.; Liang, H.N.; Fleming, C.; Man, K.L. An exploration of usable authentication mechanisms for virtual reality systems. In Proceedings of the IEEE Asia Pacific Conf. on Circuits and Systems (APCCAS), Jeju, Republic of Korea, 25–28 October 2016; pp. 458–460. [Google Scholar]
  12. George, C.; Khamis, M.; von Zezschwitz, E.; Burger, M.; Schmidt, H.; Alt, F.; Hussmann, H. Seamless and secure VR: Adapting and evaluating established authentication systems for virtual reality. In Proceedings of the NDSS, San Diego, CA, USA, 26 February–1 March 2017. [Google Scholar]
  13. Olade, I.; Liang, H.N.; Fleming, C.; Champion, C. Exploring the vulnerabilities and advantages of swipe or pattern authentication in VR. In Proceedings of the 2020 4th International Conference on Virtual and Augmented Reality Simulations, Sydney, NSW, Australia, 14–16 February 2020; pp. 45–52. [Google Scholar]
  14. Funk, M.; Marky, K.; Mizutani, I.; Kritzler, M.; Mayer, S.; Michahelles, F. Lookunlock: Using spatial-targets for user-authentication on HMDs. In Proceedings of the Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, Glasgow, UK, 4–9 May 2019; pp. 1–6. [Google Scholar]
  15. Mann, A. Optimization problems in fog and edge computing. In Fog and Edge Computing: Principles and Paradigms; John Wiley & Sons: Hoboken, NJ, USA, 2019; pp. 103–121. [Google Scholar]
  16. Miller, R.; Banerjee, N.K.; Banerjee, S. Using Siamese neural networks to perform cross-system behavioral authentication in virtual reality. In Proceedings of the IEEE Virtual Reality and 3D User Interfaces (VR), Lisboa, Portugal, 27 March–1 April 2021; pp. 140–149. [Google Scholar]
  17. Liebers, J.; Horn, P.; Burschik, C.; Gruenefeld, U.; Schneegass, S. Using gaze behavior and head orientation for implicit identification in virtual reality. In Proceedings of the 27th ACM Symposium on Virtual Reality Software and Technology, Osaka, Japan, 8–10 December 2021; pp. 1–9. [Google Scholar]
  18. Li, S.; Ashok, A.; Zhang, Y.; Xu, C.; Lindqvist, J.; Gruteser, M. Whose move is it anyway? Authenticating smart wearable devices using unique head movement patterns. In Proceedings of the 2016 IEEE International Conference on Pervasive Computing and Communications, Sydney, NSW, Australia, 14–19 March 2016; pp. 1–9. [Google Scholar]
  19. Mustafa, T.; Matovu, R.; Serwadda, A.; Muirhead, N. Unsure how to authenticate on your VR headset? Come on, use your head! In Proceedings of the Fourth ACM International Workshop on Security and Privacy Analytics, Tempe, AZ, USA, 21 March 2018; pp. 23–30. [Google Scholar]
  20. Sivasamy, M.; Sastry, V.N.; Gopalan, N.P. VRCAuth: Continuous authentication of users in VR environment using head-movement. In Proceedings of the 2020 5th International Conference on Communication and Electronics Systems, Coimbatore, India, 10–12 June 2020; pp. 518–523. [Google Scholar]
  21. Li, M.; Banerjee, N.K.; Banerjee, S. Using motion forecasting for behavior-based VR authentication. In Proceedings of the 2024 IEEE International Conference on Artificial Intelligence and eXtended and Virtual Reality, Los Angeles, CA, USA, 17–19 January 2024; pp. 31–40. [Google Scholar]
  22. Shang, J.; Wu, J. Enabling secure voice input on AR headsets using internal body voice. In Proceedings of the 2019 16th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON), Boston, MA, USA, 10–13 June 2019; pp. 1–9. [Google Scholar]
  23. Duezguen, R.; Mayer, P.; Das, S.; Volkamer, M. Towards secure and usable authentication for AR and VR head-mounted displays. arXiv 2020, arXiv:2007.11663. [Google Scholar]
  24. Bekkanti, N.; Busch, L.; Amman, S. Evaluation of Voice Biometrics for Identification and Authentication; SAE International: Warrendale, PA, USA, 2021. [Google Scholar]
  25. Rigas, I.; Komogortsev, O.; Shadmehr, R. Biometric recognition via eye movements: Saccadic vigor and acceleration cues. ACM Trans. Appl. Percept. 2016, 13, 6. [Google Scholar] [CrossRef]
  26. Olade, I.; Fleming, C.; Liang, H.N. BioMove: Biometric user identification from kinesiological movements for VR systems. Sensors 2020, 20, 2944. [Google Scholar] [CrossRef] [PubMed]
  27. Qian, K.; Arichi, T.; Price, A.; Dall’Orso, S.; Eden, J.; Noh, Y.; Rhode, K.; Burdet, E.; Neil, M.; Edwards, A.D.; et al. An eye tracking-based VR system for use inside MRI systems. Sci. Rep. 2021, 11, 16301. [Google Scholar] [CrossRef] [PubMed]
  28. Asish, S.M.; Kulshreshth, A.K.; Borst, C.W. User identification utilizing minimal eye-gaze features in VR applications. Virtual Worlds 2022, 1, 42–61. [Google Scholar] [CrossRef]
  29. Peng, S.; Al Madi, N. An eye opener on the use of machine learning in eye movement based authentication. In Proceedings of the 2022 Symposium on Eye Tracking Research and Applications, Seattle, WA, USA, 8–11 June 2022; pp. 1–2. [Google Scholar]
  30. Sluganovic, I.; Roeschlin, M.; Rasmussen, K.B.; Martinovic, I. Using reflexive eye movements for fast challenge-response authentication. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; pp. 1056–1067. [Google Scholar]
  31. Miller, R.; Ajit, A.; Banerjee, N.K.; Banerjee, S. Realtime behavior-based continual authentication in VR environments. In Proceedings of the 2019 IEEE International Conference on Artificial Intelligence and Virtual Reality, San Diego, CA, USA, 9–11 December 2019; pp. 253–2531. [Google Scholar]
  32. Miller, R.; Banerjee, N.K.; Banerjee, S. Within- and cross-system behavior-based biometric authentication in VR. In Proceedings of the IEEE VR Workshops, Atlanta, GA, USA, 22–26 March 2020; pp. 311–316. [Google Scholar]
  33. Li, L.; Chen, C.; Pan, L.; Zhang, L.Y.; Zhang, J.; Xiang, Y. SIGA: RPPG-based authentication for VR head-mounted display. In Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses, Hong Kong, China, 16–18 October 2023; pp. 686–699. [Google Scholar]
  34. Cheng, R.; Wu, Y.; Kundu, A.; Latapie, H.; Lee, M.; Chen, S.; Han, B. MetaFL: Privacy-preserving user authentication in VR with federated learning. In Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems, Hangzhou, China, 4–7 November 2024; pp. 54–67. [Google Scholar]
  35. Kim, M.; Oh, J.; Son, S.; Park, Y.; Kim, J.; Park, Y. Secure and privacy-preserving authentication using decentralized identifier in metaverse. Electronics 2023, 12, 4073. [Google Scholar] [CrossRef]
  36. Wang, J.; Gao, B. Multi-attribute user authentication against man-in-the-room attack in VR. In Proceedings of the International Conference on Human–Computer Interaction, Bucharest, Romania, 16–17 September 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 455–461. [Google Scholar]
  37. Noah, N.; Das, S. From PINs to gestures: Analyzing knowledge-based authentication schemes for AR and VR. IEEE Trans. Vis. Comput. Graph. 2025, 31, 3172–3182. [Google Scholar] [CrossRef] [PubMed]
  38. Kotwal, K.; Özbulak, G.; Marcel, S. Assessing the reliability of biometric authentication on VR devices. In Proceedings of the 2024 IEEE International Joint Conference on Biometrics, Buffalo, NY, USA, 15–18 September 2024; pp. 1–10. [Google Scholar]
Figure 1. Data processing and model prediction pipeline.
Figure 2. Example eye region images with significant overexposure or highlight artifacts due to headset light reflection and user glasses. These visual distortions pose challenges for accurate biometric feature extraction, particularly affecting iris and eyelid boundary visibility.
Figure 3. Example sample images. (o) Original genuine users, with and without glasses; (a) rigid masks with own eyes; (b) rigid masks with fake eyeballs; (c) flexible masks with print attacks; (d) flexible masks with print attacks; (e) flexible masks with fake eyeballs; (f) auxiliary instruments (fake eyeballs, prints with synthetic eyes, eyelashes, glasses).
Figure 4. Different model prediction results: (a) ResNet-50 model prediction ROC curve. (b) ResNet-101 model prediction ROC curve. (c) MobileNetV1 model prediction ROC curve. (d) MobileNetV3 model prediction ROC curve. (e) Improved MobileNetV3 model prediction ROC curve. (f) MobileNetV3pro model prediction ROC curve.
Figure 5. Model prediction accuracy under PGD attack.
Table 1. Comparative Analysis of VR Authentication Methods.
Category | Paper | EER (%) | Attack Success Rate (%)
Knowledge & Cue-Based Authentication | [3,4,5,11,13,37] | N/A | 17–70
Gaze Eye-Tracking-Based Biometrics | [8,9,25,27,28,29,30] | 1.67–3.2 | 5–40
Head-Movement-Based Authentication | [18,19,20] | 4–6 | 10–50
Voice-Based Authentication | [22,24,34] | 2.1–5.4 | 20–60
Multimodal | [14,15,17,33,36] | 0.29–5.8 | 1–15
Table 2. Summary of attack types and bona fide samples [38].
Type | Subtype | Identities | Videos | Attack Types
Bona fide | Steady gaze, moving gaze, glass, no glass | 25 | 900 | –
Attacks | Mannequins | 2 | 7 | Own eyes
Attacks | Custom rigid mask | 3 | 10 | Own eyes
Attacks | Custom rigid mask | 4 | 14 | Fake 3D eyeballs
Attacks | Generic flexible masks | 5 | 20 | Print attacks
Attacks | Custom silicone masks | 6 | 16 | Fake 3D eyeballs
Attacks | Print attacks | 7 | 25 | Print attacks
Table 3. Performance Comparison of Different Models.
Model | EER (%) | AUC (%) | F1-Score (%) | Model Size (MB) | Inference Time (ms)
ResNet-50 | 16.08 | 83.89 | 81.82 | 91.98 | 847.25
ResNet-101 | 22.45 | 80.30 | 81.45 | 164.32 | 1079.26
MobileNet | 3.68 | 95.90 | 95.03 | 16.75 | 784.56
MobileNetV3 | 15.15 | 87.83 | 80.31 | 15.35 | 984.34
MobileNetV3pro | 3.00 | 95.17 | 94.79 | 13.82 | 545.22
Table 4. Performance Comparison between Different Optimization Methods.
Model | EER (%) | AUC (%) | F1-Score (%) | Model Size (MB) | Inference Time (ms)
MobileNetV3 | 15.15 | 87.83 | 80.31 | 15.35 | 984.34
+ Weighted Loss | 15.27 | 85.40 | 84.42 | 13.82 | 802.30
+ Fine-tuning 30 layers | 17.54 | 84.22 | 83.02 | 13.82 | 621.43
+ Focal Loss | 8.32 | 91.48 | 90.64 | 13.82 | 650.53
+ All together | 3.00 | 95.17 | 94.79 | 13.82 | 545.22
Table 5. Model performance under clean and PGD adversarial settings.
Condition | EER (%) | AUC (%) | F1-Score (%)
Clean Test Data | 3.00 | 95.17 | 94.79
Under PGD Attack | 40.78 | 58.37 | 68.01