Article

Pupil Detection Algorithm Based on ViM

School of Optoelectronic Engineering, Xi’an Technological University, Xi’an 710000, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(13), 3978; https://doi.org/10.3390/s25133978
Submission received: 27 April 2025 / Revised: 22 June 2025 / Accepted: 24 June 2025 / Published: 26 June 2025
(This article belongs to the Section Intelligent Sensors)

Abstract

Pupil detection is a key technology in fields such as human–computer interaction, fatigue driving detection, and medical diagnosis. Existing pupil detection algorithms still struggle to maintain robustness under variable lighting conditions and occlusion scenarios. In this paper, we propose a novel pupil detection algorithm, ViMSA, based on the ViM model. The algorithm introduces weighted feature fusion, enabling the model to adaptively learn the contribution of different feature patches to the pupil detection result; combines ViM with the MSA (multi-head self-attention) mechanism to integrate global features and improve the accuracy and robustness of pupil detection; and uses the FFT (Fast Fourier Transform) to convert the time-domain vector outer products in MSA into frequency-domain dot products, reducing the model's computational complexity and improving its detection efficiency. ViMSA was trained and tested on nearly 135,000 pupil images from 30 different datasets, demonstrating strong generalization capability. The experimental results show that the proposed ViMSA achieves 99.6% detection accuracy within five pixels, an RMSE of 1.67 pixels, and a processing speed exceeding 100 FPS, meeting real-time monitoring requirements for various applications, including operation under variable and uneven lighting conditions, assistive technology (enabling communication with patients with neuro-motor disorders through pupil recognition), computer gaming, and automotive applications (enhancing traffic safety by monitoring drivers' cognitive states).

1. Introduction

The human eye can reflect a wealth of individual information, including gaze direction and fixation points, fatigue levels, drowsiness states, health conditions, and attention concentration levels [1]. Based on this characteristic, eye movement data have been widely applied in various assistive technologies and physical–mental health monitoring systems [2]. Precisely because of this, pupil localization technology has become a key interdisciplinary research subject.
Pupil localization technology holds extensive application value across multiple critical domains: VFOA (Visual Focus of Attention) tracking [3], driver fatigue monitoring [4], psychological and neuroscientific research [2], consumer behavior analysis [5], virtual reality systems [6], iris recognition technology [7], assistive devices for individuals with disabilities [8], and learning engagement assessment [1]. In medical diagnostics, VFOA tracking technology provides crucial methodology for the early identification of ASD (autism spectrum disorder). As a neurodevelopmental condition typically manifesting during infancy, ASD is characterized by core symptoms including impaired social interaction and persistent attention deficits. Through eye movement trajectory analysis and VFOA monitoring, characteristic behavioral patterns of ASD can be effectively identified. Furthermore, this technology demonstrates significant diagnostic value for neurological disorders, such as aiding Parkinson’s disease diagnosis [9].
The advancement of imaging technology and computational methods has significantly improved pupil detection technology. Traditional manual measurement and basic image processing algorithms are gradually being replaced by methods utilizing machine learning and computer vision. These new technologies provide higher accuracy and robustness, making real-time pupil tracking possible in various environments.
Since 2024, neural networks built on the Mamba architecture [10] have been widely used in the field of NLP. Compared to the Transformer architecture [11], Mamba offers higher training and inference efficiency. The Mamba model is based on the SSM (State Space Model) [12]; it can not only be transformed into a Transformer-like form but can also run in a structure similar to RNNs [13], behaving as a linear-complexity Transformer rather than relying on traditional Q–K matrix multiplication. This linear complexity makes Mamba attractive and allows for parallel training and high throughput.
In recent years, a large number of studies have introduced the Mamba model into the field of computer vision and achieved remarkable results, such as the ViM model [14]. We therefore wanted to know whether applying Mamba to pupil localization would produce better results.
Through an analysis of outstanding pupil detection models in recent years, it can be observed that the primary challenge in pupil detection tasks remains the high sensitivity to varying lighting conditions and partial occlusions (such as glasses, eyelashes, etc.) in input images, leading to unstable detection accuracy across different datasets. Existing models typically address this issue from the following three perspectives: The first approach involves incorporating channel or spatial attention mechanisms [15,16], enabling the model to adaptively focus on the pupil region. However, the additional attention mechanisms increase the computational complexity of the network itself. The second solution employs Transformer-based architectures or deep convolutional neural networks (e.g., YOLOv8) as backbone networks, leveraging the MSA mechanism’s ability to capture global image information and the expanded receptive fields in intermediate layers of deep convolutional networks [17,18,19,20] to precisely locate the pupil center. Nonetheless, this method significantly raises the computational complexity of the model. The third strategy designs the detection process as a two-stage pipeline [21,22]: the first stage coarsely localizes candidate pupil regions, and the second stage refines the regression of these candidate areas based on the initial coarse results to obtain accurate pupil center coordinates. The drawback of this approach is that if the initial coarse localization deviates slightly, the final regression results will be suboptimal.
To address the aforementioned issues, this paper adopts a learnable weighted feature fusion method, introducing learnable weights corresponding to the number of patches. Without significantly increasing model complexity, this approach enables the network to adaptively focus on potential pupil regions, maintaining high-precision pupil detection under varying lighting conditions and partial occlusions (essentially, the learnable weights achieve an effect similar to attention mechanisms, since both channel and spatial attention modules output weights assigned to intermediate features; however, the computational cost of attention mechanisms far exceeds that of simply concatenating learnable weights to the patch inputs). Furthermore, this paper is the first to introduce the ViM model into the field of pupil detection. First, ViM's long-sequence modeling capability effectively overcomes the performance bottlenecks of existing methods under varying lighting and partial occlusion conditions. Second, compared to ViT [23], the SSM-based ViM architecture significantly reduces computational complexity, enabling real-time detection while maintaining high accuracy. Finally, if computational efficiency and resource consumption are disregarded, the pairwise interaction property of ViT's self-attention mechanism grants it superior global modeling capability compared to ViM (it is precisely this pairwise matrix multiplication in MSA that makes ViT computationally expensive). To exploit this, we integrate MSA, which excels at fine-grained global context modeling, into ViM and employ the FFT to reduce the computational overhead of MSA's matrix operations. This fusion yields the proposed ViMSA model, offering a novel solution for pupil detection and advancing the application and development of the Mamba architecture in computer vision.
The main contributions of this paper are as follows:
  • Enhanced and extended ViM, proposing the ViMSA model.
  • Weighted Feature Fusion, a learnable weighting mechanism, is introduced to enhance the model’s adaptive learning capability for effective information extraction, thereby reducing its sensitivity to varying illumination conditions and partial occlusions.
  • Integration of FFT-accelerated MSA enables ViM to maintain low computational complexity while acquiring ViT’s more refined global modeling capability.
  • Extensive experiments on multiple pupil datasets demonstrate the superior performance of our proposed method.
The remainder of this paper is organized as follows: Section 2 presents the background and related work on pupil center localization and the ViM model. Section 3 elaborates on the construction of the proposed ViMSA model. Section 4 provides detailed experimental results of the proposed method. Section 5 discusses this work, highlighting the advantages and applications of the ViMSA model while discussing potential limitations. Finally, Section 6 concludes this research.

2. Related Works

2.1. Pupil Detection

We reviewed major research advances in pupil detection methodologies. Existing pupil localization methods can be primarily categorized into learning-based and non-learning-based approaches [24].
Non-learning-based pupil detection methods refer to traditional algorithms applied in early-stage research, which primarily include edge detection, threshold segmentation, feature extraction, and template matching. Edge-detection-based algorithms identify potential pupil boundaries by analyzing brightness gradients in pupil images and then filter the final boundaries based on predefined conditions. The Wolfgang Fuhl team proposed two edge-filtering algorithms based on ellipse estimation, ExCuSe [25] and ElSe [26], both of which combine edge detection, ellipse fitting, and geometric constraints to locate the pupil; however, they rely on manually designed features and have poor robustness. Thiago Santini et al. [27] proposed PuReST, building on ExCuSe and ElSe and adding Kalman filtering to improve stability across consecutive frames; it nevertheless remains susceptible to different lighting conditions and lacks robustness. Threshold-segmentation-based methods binarize images using thresholds to enable pupil detection. Professor Changyuan Wang's team [28] first binarized pupil images and then applied the Hough transform to detect circles and ellipses for pupil center localization; while this method has relatively low complexity, its accuracy heavily depends on threshold selection, with improper thresholds leading to performance degradation. Feature-extraction-based approaches detect pupils by analyzing specific characteristics such as size, gradient, and grayscale. Timm F. et al. [29] determined pupil centers by computing gradient vector fields, demonstrating some robustness to illumination variations, though computational complexity increases significantly with higher image resolutions. Template-matching-based techniques use predefined pupil models to match pupil regions in images. Cerrolaza et al. [30] constructed pupil models using point distribution methods and searched for actual pupil positions through template matching; this method exhibits high computational complexity and requires substantial prior knowledge of pupil characteristics. In summary, the challenges that traditional algorithms must still address are high sensitivity to lighting and environment, complex computation, and the need for extensive prior knowledge of pupil models, resulting in high false detection rates.
Unlike non-learning-based methods, learning-based approaches leverage popular machine learning models in recent years to extract effective discriminative features from pupil images for precise pupil localization. These models excel at extracting deep semantic features and their interrelationships from images. Krafka et al. [31] proposed the iTracker model for mobile pupil detection, which employs a lightweight CNN architecture to extract pupil features and perform recognition. However, this model inherits CNN’s tendency to over-focus on local features, limiting its detection accuracy. Larumbe-Bergera et al. [32] introduced a ResNet50-based pupil detection network that incorporates global pooling layers to compensate for CNN’s neglect of global features, but the excessive use of fully connected layers increases the network’s computational complexity. Yiu et al. [33] proposed a fully convolutional network model capable of segmenting the entire pupil region and locating the pupil. However, their model is based on the prior assumption that the pupil is circular, which imposes certain limitations. Vera-Olmos et al. [34] utilized deep convolutional networks for pupil tracking, integrating dilated convolutions to expand the receptive fields between convolutional layers and adopting a feature pyramid structure to combine shallow and deep image features for pupil localization. Nevertheless, this method suffers from substantial model parameters and high complexity. Wolfgang Fuhl et al. [35] proposed a two-stage detection model, PupilNet v2.0, where the first stage performs coarse localization and the second stage refines the regression. However, the final detection accuracy is affected by the initial coarse localization results.
Gorkem Can Ates et al. [15] employed ResNet integrated with SE (Squeeze-and-Excitation) attention modules, achieving superior detection performance on low-resolution images while experiencing decreased efficiency and accuracy in high-resolution scenarios. Kejuan Xue et al. [16] based their work on YOLOv8n (the least complex model in the YOLOv8 series), incorporating a dual attention mechanism to enhance the model’s ability to capture pupil region information under varying lighting conditions and introducing acyclic convolutions in the encoding part to reduce model complexity. However, the matrix multiplication operations in the spatial and channel attention mechanisms still increase the overall computational load. Chugh, S. et al. [17] enhanced the UNet architecture by incorporating CSA (Contrastive Self-Attention Mechanism) to capture global features in pupil images, though the matrix operations involved in this mechanism, similar to standard self-attention, increase computational complexity. H.M. Zhang et al. [18] integrated FPN (Feature Pyramid Network) with ViT to mitigate invalid signal interference caused by occlusions, but the introduction of ViT significantly elevates the model’s computational demands. Zhuohao Guo et al. [19] selected YOLOv8 as the network backbone and introduced deformable convolutions to adapt the model’s learning direction to the pupil region, improving detection accuracy but increasing model complexity. Wang Li et al. [20] introduced a hybrid architecture combining CNN-based ResNeSt modules [36] with Transformers, merging CNN’s local feature extraction capability with Transformer’s global feature capture ability to improve pupil detection accuracy, though the model complexity remains high. Jian Xun Mi et al. [21] employed a two-stage voting mechanism with a fully convolutional network to regress the pupil center point offset in a top-down manner for pupil center localization. B. Zhu et al. [22] proposed a hybrid approach integrating deep learning with traditional algorithms, where the first stage employs the YOLOv5s deep neural network for coarse pupil localization, followed by the second stage utilizing the Canny edge detector for pupil contour extraction, ultimately achieving accurate pupil positioning. However, as a two-stage model similar to PupilNet v2.0, its prediction accuracy is susceptible to the initial coarse localization results. Gabriel Bonteanu et al. [37] used two structurally identical fully convolutional networks to predict the pupil’s x and y coordinates separately, resulting in lower computational costs. However, explicitly separating the x and y coordinates of the pupil leads to temporal inconsistency in the results, as there is an implicit relationship between the x and y coordinates at any given moment. Genjian Yang et al. [38] adopted a ResNet-based architecture with dilated convolutions to address potential inconsistencies in pupil size within images, but the model remains vulnerable to environmental lighting conditions.

2.2. ViM

Deep learning models have demonstrated remarkable performance across various artificial intelligence tasks. To date, diverse deep learning architectures have emerged, with model structures fundamentally determining their functionalities. The earliest, the MLP (Multilayer Perceptron) [39], also known as the fully connected network, can learn deep representations of data. The subsequently developed CNN primarily serves computer vision tasks by extracting local image features. GANs (Generative Adversarial Networks) [40], as generative architectures, find applications in both sequential and image domains; comprising a generator and a discriminator that compete adversarially, they achieve mutual improvement through this minimax game during training. GNNs (Graph Neural Networks) [41] specialize in processing graph-structured data. RNNs (Recurrent Neural Networks) are primarily used for natural language processing tasks due to their sequential data processing capabilities; however, they suffer from vanishing/exploding gradients on long sequences and fail to capture relationships between distant tokens [42]. Although subsequent improvements such as GRUs (Gated Recurrent Units) [43] and LSTM (Long Short-Term Memory) networks [44] mitigate these gradient issues, they do not provide a fundamental solution. In 2017, the Transformer architecture was introduced; it uses self-attention mechanisms to process data, effectively addressing the vanishing/exploding gradient problems inherent in RNNs when handling long sequences. The Transformer has subsequently been adapted to computer vision tasks, as exemplified by ViT [23]. Compared to CNNs, ViT demonstrates superior capability in capturing global image information, albeit at the cost of significantly increased computational overhead: the self-attention mechanism makes the computational complexity of Transformer-based models grow quadratically with image resolution. SSMs also serve as alternative approaches for sequence processing, though they similarly suffer from high computational demands. The S4 model [45] introduced convolutional operations to enable parallel training, but its fixed parameterization limits generalization across diverse input data types. Building on S4, the Mamba model was developed, which employs hardware-aware parallel scan algorithms to substantially accelerate training while implementing an input-dependent parameter selection mechanism. Given Mamba's success in sequential tasks, researchers have extended it to computer vision, yielding ViM models analogous to ViT.
The ViM model divides input images into uniformly sized patches, which are then processed as sequential inputs during training. However, for pupil detection tasks, the contribution of each segmented pupil patch to the final detection result varies significantly. Treating all patches equally limits further improvements in detection accuracy. To address this, we introduce a weighted feature fusion method that assigns learnable weights to patch sequences, enabling ViM to adaptively adjust each patch’s contribution to the detection outcome. Moreover, while mutual information between different regions of pupil images is crucial for accurate detection, the 1D convolutional modules in ViM primarily capture local feature relationships, neglecting long-range dependencies among patches. We replace these 1D convolutions with the MSA module from Transformers to enhance ViM’s ability to model long-range interactions. Additionally, we employ FFT to reduce the computational complexity of MSA, thereby improving the model’s operational efficiency.

3. Methodology

3.1. ViM Model

The standard Mamba model is designed for one-dimensional sequential data. To process 2D image data $I \in \mathbb{R}^{h \times w \times c}$ (where $h$ and $w$ denote height and width and $c$ represents channel depth), the image is first partitioned into flattened 2D patches $I_p \in \mathbb{R}^{J \times (p^2 \cdot c)}$ (with $J$ being the patch count and $p$ the patch side length). As shown in Figure 1, each white bounding box in the image corresponds to one patch.
These patches are then linearly projected via a fully connected layer into D-dimensional vectors $I_{pd} \in \mathbb{R}^{J \times D}$. Finally, positional embeddings $E_{pos} \in \mathbb{R}^{J \times D}$ are added to $I_{pd}$, forming the input tensor for the Vision Mamba module. The ViM architecture is illustrated in Figure 2, where the symbol $\oplus$ represents element-wise vector addition.
The output dimensionality for each processing step is indicated at corresponding positions on the right side of Figure 1, where B represents the batch size. As the core component of ViM, the Vision Mamba Encoder’s internal architecture is detailed in Figure 3.
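To make the patchification step concrete, the following is a minimal PyTorch sketch of the patch embedding described above; the class name, the 320 × 240 single-channel input size, and the embedding dimension are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into p x p patches and project each to a D-dimensional token (illustrative)."""
    def __init__(self, img_size=(240, 320), patch=16, in_ch=1, dim=192):
        super().__init__()
        self.num_patches = (img_size[0] // patch) * (img_size[1] // patch)  # J
        # A strided convolution is equivalent to flattening each patch and applying a fully connected projection.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))  # E_pos

    def forward(self, x):                                  # x: (B, c, h, w)
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, J, D)
        return tokens + self.pos_embed                     # element-wise addition of positional embeddings

tokens = PatchEmbed()(torch.randn(4, 1, 240, 320))
print(tokens.shape)  # torch.Size([4, 300, 192])
```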
In the Vision Mamba Encoder, tokens first undergo normalization (Norm) to prevent gradient vanishing or explosion during training. The normalized tokens then pass through two distinct linear layers (represented by the blue trapezoidal modules on the left in Figure 3) for dimension expansion. Here, H denotes the internal hidden dimension of the linear layer. The expanded x vectors are processed separately through forward and backward pathways, yielding output feature vectors yf and yb, respectively. Simultaneously, the expanded z vectors are activated and then undergo a dot product with both yf and yb. The results are summed and projected back to the original input dimension through another linear layer (the blue trapezoidal module on the right in Figure 3). The final output is obtained via a skip connection. The forward and backward pathways share an identical architecture, with the forward path processing the x sequence in its original order and the backward path operating on the reversed x sequence. The activation module employs the SiLU (Sigmoid Linear Unit) activation function [46], defined by the following equation:
$$f(x) = x \cdot \sigma(x), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$
Additionally, Figure 3 displays the State Space Model (SSM) module adopted in the ViM architecture. The computational process of this module is formally expressed by the following equation:
$$h_{t+1} = A_d h_t + B_d x_t, \qquad y_t = C_d h_t$$
In the equation, $h_t$ represents the hidden state of the SSM system at the current time step, while $h_{t+1}$ denotes the hidden state at the next time step. The term $x_t$ corresponds to the system input, and $y_t$ to the system output. The matrix $A_d$ (discrete state transition matrix) characterizes the inter-state transition relationships, $B_d$ (discrete input matrix) governs how input signals affect the state, and $C_d$ (discrete output matrix) defines the mapping from state vectors to output vectors. For practical computation, $A_d$, $B_d$, and $C_d$ are dynamically determined by the input vectors and serve as trainable parameters in ViM. The SSM module employs matrix operations to significantly accelerate both training and inference.
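As an illustration of Equation (2), the minimal NumPy sketch below runs the discrete state-space recurrence sequentially; in ViM the matrices are input-dependent and the scan is parallelized, whereas here they are fixed matrices purely for demonstration.

```python
import numpy as np

def ssm_scan(x, A_d, B_d, C_d):
    """Sequential form of Eq. (2): h_{t+1} = A_d h_t + B_d x_t, y_t = C_d h_t.
    x: (T, D_in); A_d: (N, N); B_d: (N, D_in); C_d: (D_out, N). Shapes are illustrative."""
    h = np.zeros(A_d.shape[0])
    ys = []
    for t in range(x.shape[0]):
        ys.append(C_d @ h)           # y_t = C_d h_t
        h = A_d @ h + B_d @ x[t]     # h_{t+1} = A_d h_t + B_d x_t
    return np.stack(ys)

rng = np.random.default_rng(0)
y = ssm_scan(rng.standard_normal((8, 4)),      # 8 time steps of 4-dim inputs
             0.9 * np.eye(3),                  # A_d (stable toy dynamics)
             rng.standard_normal((3, 4)),      # B_d
             rng.standard_normal((2, 3)))      # C_d
print(y.shape)  # (8, 2)
```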

3.2. Weighted Feature Fusion

Pupil detection is different from other computer vision tasks because the distribution of effective information in pupil images is uneven, especially under variable lighting and partial occlusion conditions, and the contribution values of each part of the image to pupil detection are different (from the pupil image, it can be roughly judged that the effective information decreases layer by layer in an elliptical shape from the center of the pupil outward [21]). If the patch sequence obtained after uniform segmentation is treated equally, it will limit the improvement of the model’s detection accuracy under different lighting and occlusion conditions.
To solve the above problems, this paper adopts a weighted feature fusion method. After the pupil image is cut into a patch sequence, a learnable patch weight is concatenated to it, giving the concatenated patch sequence dimensions of B×J×(D + 1). This weight scales each patch when features are mapped to pupil coordinates, making the ViM model focus more on learning features related to pupil regions. Figure 4 is a schematic diagram of the weight concatenation.
In Figure 4, the blue square represents the feature patch vector with dimensions of B×J×D, and the yellow square represents the concatenated weight vector.
Considering computational efficiency, the number of neurons in the fully connected layer should not be too large. In this paper, a global average pooling module [47] is therefore added before the output layer of the model. After the Vision Mamba Encoder layer, a weight-removal operation extracts the weight column separately, and the remaining feature vectors undergo global average pooling. Finally, the pooled feature vectors are weighted by the extracted weights and fed into an MLP for pupil center regression.
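A minimal PyTorch sketch of the weighted feature fusion and pooling head described above is given below; the module name, pooling axis, and MLP sizes are assumptions for illustration, since the paper specifies only the concatenated B×J×(D + 1) tokens, the weight-removal step, global average pooling, and the final MLP.

```python
import torch
import torch.nn as nn

class WeightedFusionHead(nn.Module):
    """Sketch of weighted feature fusion: a learnable per-patch weight is concatenated to the
    tokens (B, J, D) -> (B, J, D+1); after the encoder, the weight column is split off and used
    to weight the pooled patch features before the MLP regresses the pupil center."""
    def __init__(self, num_patches, dim):
        super().__init__()
        # Gaussian initialization of the per-patch weights, as selected in Section 4.4.
        self.patch_w = nn.Parameter(torch.randn(1, num_patches, 1))
        self.mlp = nn.Sequential(nn.Linear(num_patches, 128), nn.SiLU(), nn.Linear(128, 2))

    def attach(self, tokens):                              # tokens: (B, J, D)
        w = self.patch_w.expand(tokens.size(0), -1, -1)
        return torch.cat([tokens, w], dim=-1)              # (B, J, D+1)

    def head(self, encoded):                               # encoded: (B, J, D+1) from the encoder
        feats, w = encoded[..., :-1], encoded[..., -1]     # weight-removal step
        pooled = feats.mean(dim=-1)                        # global average pooling -> (B, J)
        return self.mlp(pooled * w)                        # weighted features -> pupil (x, y)

m = WeightedFusionHead(num_patches=300, dim=192)
xy = m.head(m.attach(torch.randn(4, 300, 192)))            # encoder omitted in this sketch
print(xy.shape)  # torch.Size([4, 2])
```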

3.3. ViMSA

In pupil images, various regions contribute to pupil detection and localization to some extent, such as the iris, sclera, and other ocular structures, as well as external features like the Purkinje images (corneal reflections) generated under infrared camera illumination. Therefore, the relationships between different regions in pupil images are crucial for the task of pupil detection and localization. These inter-regional relationships are commonly referred to as global feature representations.
Compared to ViT, ViM employs SSM to capture global dependencies through implicit state transitions and input-dependent recursive mechanisms, demonstrating superior computational efficiency and training effectiveness over ViT. However, since ViM does not explicitly compute pairwise interactions between all patches, its global dependency modeling capability remains slightly inferior to ViT. This raises a critical research question: How can we endow ViM with ViT-level global relational modeling capacity while preserving its efficient detection characteristics?
Through observation, it can be seen that the 1D convolutional module used in the ViM model tends to extract short-range dependencies in sequential data, focusing on local patch relationships, while its ability to capture global dependencies is relatively limited. This becomes a constraint that hinders the ViM model from achieving better performance in pupil detection tasks. In comparison, the MSA mechanism in Transformer can effectively capture global information by modeling relationships across all patches. Its structural diagram is shown in Figure 5.
Here, Q , K , and V are all similar to the feature vector x in ViM, with dimensions of B×J×H. After being projected through linear layers, the Q , K , and V vectors are fed into the Scaled Dot-Product Attention module, whose computation is defined by the following formula
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q \cdot K^{T}}{\sqrt{d_k}}\right) \cdot V$$
In the formula, $d_k$ denotes the dimensionality of the key vectors in each attention head, and $\cdot$ represents matrix multiplication. This matrix operation is fundamentally why the MSA module can effectively capture global relationships. Building on this principle, we replace the 1D convolution in ViM with MSA, thereby enhancing the model's ability to incorporate global feature representations into its regression output.
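The short PyTorch sketch below implements Equation (3) with the conventional scaling by the square root of the key dimension; the tensor shapes (batch, heads, patches, key dimension) are illustrative.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Eq. (3): Softmax(Q K^T / sqrt(d_k)) V, with Q, K, V of shape (B, heads, J, d_k)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5    # (B, heads, J, J): pairwise patch interactions
    return F.softmax(scores, dim=-1) @ V             # (B, heads, J, d_k)

B, heads, J, d_k = 2, 8, 300, 24
Q, K, V = (torch.randn(B, heads, J, d_k) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # torch.Size([2, 8, 300, 24])
```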
While MSA effectively addresses the global dependency problem, it simultaneously introduces a new challenge: high computational complexity. The primary factor contributing to MSA’s computational overhead lies in its high-dimensional matrix multiplication operations.
To address the high computational complexity introduced by MSA, this paper first decomposes the matrix multiplication into a sum of multiple outer products. Given matrix A with dimensions m × n and matrix B with dimensions n × p , the resulting matrix C from their multiplication can be expressed by Equation (4).
$$C = A \times B = \sum_{k=1}^{n} a_k \otimes b_k$$
Here, $a_k$ represents the k-th column vector of matrix $A$, and $b_k$ denotes the k-th row vector of matrix $B$, where $\otimes$ indicates the vector outer product operation. The outer product is mathematically defined by Equation (5).
$$(a \otimes b)_{ij} = a_i \cdot b_j$$
The obtained $a_k$ and $b_k$ satisfy Equation (6).
$$\psi(a_k \otimes b_k) = \psi(a_k) \odot \psi(b_k)$$
Here, $\psi$ denotes the FFT and $\odot$ represents the dot product in the frequency domain. Given a sequence $X$ of length $N$ (with elements $x_n$), its DFT (Discrete Fourier Transform) is defined by Equation (7).
$$X_k = \sum_{n=0}^{N-1} x_n e^{-i\frac{2\pi}{N}kn}, \qquad k = 0, 1, \ldots, N-1$$
The FFT improves computational efficiency by leveraging the recursive nature of the DFT. The FFT computation can be decomposed as follows.
$$X_k = X_{\mathrm{even}}(k) + e^{-i\frac{2\pi}{N}k} X_{\mathrm{odd}}(k), \qquad X_{k+\frac{N}{2}} = X_{\mathrm{even}}(k) - e^{-i\frac{2\pi}{N}k} X_{\mathrm{odd}}(k)$$
Here, $X_{\mathrm{even}}(k)$ represents the DFT of the even-indexed elements of the input sequence, while $X_{\mathrm{odd}}(k)$ denotes the DFT of the odd-indexed elements.
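Equation (8) corresponds to the classic radix-2 (Cooley–Tukey) recursion; a minimal NumPy sketch, checked against np.fft.fft, is shown below for sequences whose length is a power of two.

```python
import numpy as np

def fft_radix2(x):
    """Recursive radix-2 FFT implementing the even/odd split of Eq. (8); len(x) must be a power of 2."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x
    X_even = fft_radix2(x[0::2])                      # DFT of even-indexed samples
    X_odd = fft_radix2(x[1::2])                       # DFT of odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    return np.concatenate([X_even + twiddle * X_odd,  # X_k
                           X_even - twiddle * X_odd]) # X_{k + N/2}

x = np.random.default_rng(0).standard_normal(8)
print(np.allclose(fft_radix2(x), np.fft.fft(x)))      # True
```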
Qualitatively, for an $N \times N$ matrix, standard matrix multiplication requires $N^3$ multiplication operations and $N^3 - N^2$ addition operations. By decomposing the matrix multiplication into vector outer products and converting them to frequency-domain dot products via the FFT, the computational cost reduces to only $N^2 + N\log_2 N$ multiplications and $N\log_2 N$ additions. This approach effectively achieves the goal of reducing the module's computational complexity.
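The outer-product decomposition of Equation (4) can be verified directly, as in the short NumPy sketch below; the subsequent frequency-domain evaluation of each outer product via Equations (6)–(8) follows the paper's Fourier MSA module and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((5, 4)), rng.standard_normal((4, 3))

# Eq. (4): A @ B equals the sum over k of the outer products of A's k-th column and B's k-th row.
C = sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))
print(np.allclose(C, A @ B))  # True
```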
For convenience, we refer to this model as ViMSA. The overall architecture of ViMSA is illustrated in Figure 6.
According to Equation (4), the QK matrix product in MSA is decomposed into an outer-product form of row and column vectors. Following Equation (6), the FFT is applied separately to the decomposed row and column vectors. The resulting vectors are then multiplied, and the obtained matrices are converted back to the time domain through IFFT operations and summed to yield the original matrix multiplication result (denoted as the Temp Matrix). The operation between the Temp Matrix and the V matrix follows the same process as for QK, as illustrated in the Fourier MSA module in Figure 6, where the differently colored curves connected to the FFT blocks represent the FFTs applied to the row and column vectors of each decomposed matrix.

4. Experiments

4.1. Data Acquisition

The experiment employed 30 distinct datasets comprising approximately 135,000 pupil images to evaluate the performance of the proposed ViMSA model. Among these, 29 datasets were provided by Wolfgang Fuhl’s research team and were previously cited in ExCuSe [25], ELSe [26], and PupilNet 2.0 [27]. Following the naming convention established in prior research, these 29 datasets are designated with Roman numerals as DI, DII, DIII, DIV, DV, DVI, DVII, DVIII, DIX, DX, DXI, DXII, DXIII, DXIV, DXV, DXVI, DXVII, DXVIII, DXIX, DXX, DXXI, DXXII, DXXIII, DXXIV, newDI, newDII, newDIII, newDIV, and newDV [48].
The additional dataset used is the publicly available iris dataset CASIA-Iris-Thousand, which contains 20,000 iris images collected from 1000 participants. For convenience in subsequent experiments, this dataset will be referred to as CIT.
Figure 7 presents representative pupil images sampled from these datasets.
In the experiments, all pupil images were single-channel grayscale. The CIT dataset contained images sized 640 × 480, while all other datasets featured 384 × 288 images. For standardization, all images were resized to 320 × 240. Specifically: (1) CIT images were proportionally scaled down, and (2) other datasets underwent center cropping to preserve the 320 × 240 central region. The data were split into 85% for training and 15% for testing.
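For illustration, a preprocessing sketch consistent with this description is shown below, assuming torchvision-style datasets; the concrete dataset classes and the corresponding rescaling of ground-truth pupil coordinates are omitted and would need to be handled alongside the image transforms.

```python
import torch
from torch.utils.data import random_split
from torchvision import transforms

# CIT images (640 x 480) are scaled down to 320 x 240; the remaining datasets (384 x 288)
# are center-cropped to their 320 x 240 central region. torchvision uses (H, W) ordering.
cit_tf = transforms.Compose([transforms.Grayscale(1),
                             transforms.Resize((240, 320)),
                             transforms.ToTensor()])
other_tf = transforms.Compose([transforms.Grayscale(1),
                               transforms.CenterCrop((240, 320)),
                               transforms.ToTensor()])

def split_85_15(dataset, seed=0):
    """85% / 15% train/test split as used in the paper (reproducible via a fixed seed)."""
    n_train = int(0.85 * len(dataset))
    return random_split(dataset, [n_train, len(dataset) - n_train],
                        generator=torch.Generator().manual_seed(seed))
```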

4.2. Experimental Environment

The hardware configuration and runtime parameters used in our experiments are specified in Table 1.

4.3. Evaluation Metrics

This paper adopts two classic regression-based evaluation metrics: RMSE (Root Mean Squared Error) and DR (Detection Rate) within 5-pixel error range, which serve as primary performance measures for pupil detection tasks. The RMSE formula is presented below.
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left[\left(x_i - \hat{x}_i\right)^2 + \left(y_i - \hat{y}_i\right)^2\right]}$$
Here, $(\hat{x}_i, \hat{y}_i)$ denote the pupil center coordinates predicted by the model, while $(x_i, y_i)$ represent the ground-truth pupil center coordinates, with N being the total number of pupil images in the test set. A smaller RMSE value indicates higher pupil detection accuracy. The loss function adopted in our model is also the RMSE.
The formula for calculating the DR is as follows:
$$\mathrm{DR} = \frac{C}{N} \times 100\%$$
Here, N represents the total number of samples, and C denotes the count of correct predictions within the specified pixel tolerance.
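Both metrics are straightforward to compute; the sketch below, assuming NumPy arrays of predicted and ground-truth pupil centers, mirrors Equations (9) and (10).

```python
import numpy as np

def rmse(pred, gt):
    """Eq. (9): root of the mean squared Euclidean error between predicted and true centers.
    pred, gt: (N, 2) arrays of (x, y) pupil coordinates in pixels."""
    return np.sqrt(np.mean(np.sum((pred - gt) ** 2, axis=1)))

def detection_rate(pred, gt, tol=5.0):
    """Eq. (10): percentage of samples whose Euclidean error is within `tol` pixels (DR5 for tol=5)."""
    err = np.linalg.norm(pred - gt, axis=1)
    return np.mean(err <= tol) * 100.0

pred = np.array([[100.0, 120.0], [60.0, 70.0]])
gt = np.array([[101.0, 122.0], [66.0, 70.0]])
print(rmse(pred, gt), detection_rate(pred, gt))  # ~4.53  50.0
```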

4.4. Ablation Experiment

For the experimental configuration regarding batch size, learning rate, and optimizer selection, we first fixed the learning rate at 1 × 10−3 and chose Adam as the optimizer. Figure 8 demonstrates the convergence behavior of the model under different batch sizes across various datasets, while Table 2 presents the corresponding DR5 (detection rate within 5-pixel error) in % (with fixed model parameters: 8 MSA heads, 10 × 10 patch size, and Gaussian initialization for the weights in weighted fusion). In Table 2, the first column lists the dataset names, the second column shows the corresponding data volumes, and the last column provides brief descriptions of each dataset.
Figure 8 reveals that when the batch size is relatively small (bs = 1 and bs = 2), although the model can converge to favorable values, the training process exhibits excessive fluctuations due to the gradient noise introduced by smaller batches, which helps the model escape poor local optima. Conversely, with larger batch sizes (bs = 16 and bs = 32), while the training process remains stable, the model fails to converge to better solutions, indicating it becomes trapped in inferior local optima—essentially overfitting to the training data—as the reduced inter-batch information variation diminishes necessary noise, preventing the model from escaping suboptimal solutions. Overall, when bs = 4, the model demonstrates relatively stable training while achieving superior convergence. Notably, across all batch sizes, the model converges significantly faster on the CIT dataset compared to others, likely because CIT, unlike the remaining 29 datasets, does not intentionally incorporate extreme environmental conditions (e.g., strong reflections, contact lenses, or mascara occlusion), making it inherently less challenging for detection. As shown in Table 2, a batch size of 4 yields superior detection accuracy across most datasets. Therefore, bs = 4 is selected for subsequent experiments.
With the batch size fixed at 4 and Adam selected as the optimizer, Figure 8 illustrates the model’s convergence behavior under different learning rate strategies, while Table 3 presents the corresponding DR5. The evaluated learning rate configurations include: (1) fixed rates (1 × 10−1, 1 × 10−2, 1 × 10−3, 1 × 10−4, 1 × 10−5); (2) Step Decay (1 × 10−3 for first 40 epochs, 1 × 10−4 for next 40 epochs, 1 × 10−5 thereafter); (3) Exponential Decay (γ = 0.95, initial rate = 1 × 10−3); and (4) Cosine Decay (cycle = 100, initial rate = 1 × 10−3, minimum rate = 1 × 10−5).
Figure 8 demonstrates that with a fixed learning rate of 1 × 10−5, the model’s loss barely decreases due to insufficient parameter updates from excessively small gradients. Fixed rates of 1 × 10−1 and 1 × 10−2 initially converge rapidly but ultimately fail to train properly, as large learning rates cause overshooting and oscillate or even diverge near the loss minimum. While fixed rates of 1 × 10−3 and 1 × 10−4 enable stable training, they converge to suboptimal solutions, with 1 × 10−3 showing faster early-stage convergence but 1 × 10−4 reaching marginally better local optima later. Unlike fixed rates, adaptive learning rate strategies are more prevalent in deep learning. The Step Decay schedule in Figure 8 exhibits distinct plateaus and abrupt drops (particularly around epoch 40), whereas Exponential Decay and Cosine Decay display smoother trajectories. Notably, Cosine Decay achieves superior convergence by maintaining: (1) smoother, continuously differentiable rate transitions than Step Decay, and (2) more gradual decay than Exponential Decay, thereby prolonging effective optimization before premature stagnation. As quantified in Table 3, Cosine Decay performs better in most datasets, justifying its selection for subsequent experiments.
With the batch size fixed at 4 and Cosine Decay as the learning rate schedule, Figure 8 shows the convergence of the model under different optimizers, and Table 4 shows the corresponding DR5.
Figure 8 demonstrates that SGD exhibits the slowest convergence due to its lack of adaptive learning rate adjustment. While AdaGrad converges relatively quickly, its training progress nearly stagnates after epoch 40 because it uses the accumulated sum of historical gradient squares as the denominator, causing the learning rate to decrease monotonically until it becomes infinitesimally small in later stages, halting parameter updates. Adam combines momentum with adaptive learning rates for faster convergence, whereas AdamW further improves upon Adam by decoupling weight decay to resolve conflicts between adaptive learning rates and L2 regularization, resulting in more stable training. As shown in Figure 8, AdamW slightly outperforms Adam in both convergence speed and stability. Table 4 confirms that AdamW achieves superior results on most datasets. Therefore, AdamW is selected as the optimizer for subsequent experiments.
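For reference, a minimal PyTorch training configuration reflecting the selected ablation settings (batch size 4, AdamW, cosine decay from 1 × 10−3 to 1 × 10−5 over 100 epochs) might look as follows; the placeholder model, weight-decay value, and loop body are assumptions.

```python
import torch

model = torch.nn.Linear(10, 2)   # placeholder standing in for the ViMSA network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-5)

for epoch in range(100):
    # ... one training epoch over DataLoader(train_set, batch_size=4, shuffle=True) ...
    scheduler.step()             # cosine decay of the learning rate once per epoch
```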
After completing the ablation experiment of training hyperparameters, it is necessary to experimentally verify and select the hyperparameters of the model.
This paper experimentally compares two adjustable model parameters, the patch size and the number of MSA heads, as well as the weight initialization scheme used in the weighted feature fusion method.
The weighted feature fusion method’s performance is significantly affected by weight initialization, as it determines the learning direction of weight vectors. This paper systematically evaluates three distinct initialization approaches: fixed initialization where weights are set to a constant value of 1, random initialization with values uniformly distributed between 1 × 10−5 and 1, and Gaussian initialization where weights are randomly generated following a normal distribution centered on a 30 × 30-pixel region of the image with a standard deviation of 1. All experiments were performed using the default parameters of ViT and ViM models, maintaining a consistent patch size of 16 × 16 pixels and employing 8 MSA modules, with DR5 serving as the evaluation metric. The comprehensive comparison results are presented in Table 5. Figure 9 shows the convergence of the model under different weight initialization schemes.
As evidenced in Figure 9, the fixed initialization scheme yields the slowest convergence and even fails to converge normally on datasets like DI (with reflections), DII (poor illumination), and DV (contact lenses). This likely occurs because: (1) complex data distributions require differentiated feature learning, but fixed initialization forces identical neuron updates during backpropagation, reducing the network to single-neuron functionality; (2) while random initialization breaks symmetry and enables diversified feature learning, improperly configured random values may cause gradient instability and neuron saturation. In contrast, Gaussian initialization achieves the fastest convergence since: (a) its zero-centered symmetric weight distribution aligns with pupil image characteristics; (b) appropriate numerical ranges prevent activation saturation (e.g., ReLU outputs) during forward propagation, avoiding vanishing gradients; (c) uniform gradient distribution during backpropagation; and (d) random gaussian noise ensures differential initial weights for accelerated specialized learning. From Table 5, it can be seen that the Gaussian initialization scheme achieved better results on all datasets. Therefore, in the subsequent experiments, Gaussian initialization was chosen as the initialization scheme for the weights in the weighted feature fusion module.
Regarding patch size and the number of MSA heads, the original ViM model uses a 16 × 16 patch size, while the original ViT model employs 8 MSA heads. With the patch size set to 8 × 8, Figure 9 shows the loss curves of the model on different datasets for different numbers of MSA heads, and Table 6 shows the corresponding DR5.
Figure 9 demonstrates that when the number of MSA heads is too small (2 or 4 heads), convergence is slower and settles at suboptimal solutions due to insufficient capacity to capture the diversity and complexity in input data, limiting feature learning. Conversely, excessive heads (16 or 24) accelerate convergence but fail to escape poor local optima, as redundant heads may overfit to noise rather than meaningful patterns. Comparative analysis reveals that 12 heads strike an optimal balance, delivering moderate convergence speed and superior performance across most datasets. Thus, 12 MSA heads are selected for subsequent experiments.
Keeping the number of MSA heads fixed at 12, Figure 9 shows the loss curves of the model on different datasets for different patch sizes, and Table 7 shows the corresponding DR5.
Figure 9 demonstrates that when the patch size is too small (e.g., 4 × 4 or 8 × 8), the model converges faster but with relatively unstable training dynamics, as smaller patches preserve finer local details but are also more susceptible to local noise, resulting in larger gradient fluctuations. Conversely, excessively large patch sizes (e.g., 20 × 20 or 40 × 40) lead to more stable training but tend to converge to poorer local optima due to the loss of discriminative fine-grained features. The analysis reveals that a 16 × 16 patch size achieves an optimal balance, delivering faster convergence, moderate stability, and superior final performance. As evidenced in Table 7, the 16 × 16 configuration outperforms other settings across most datasets. Therefore, a patch size of 16 × 16 is selected for subsequent experiments.

4.5. Horizontal Experiment

This paper conducts a comparative analysis of both classical and state-of-the-art pupil detection models. The evaluated models include the classical approaches ExCuSe [25] (abbreviated as Ex), ELSe [26] (EL), PuReST [27] (PR), and PupilNet v2.0 [35] (Pu, which contains three sub-models: the direct approach SK8P8 (SK), and coarse-to-fine positioning implemented with FCKxPy (FC) and FSKxPy (FS)), as well as recent models from the literature: [15] (GC), [16] (KX), [17] (CS), [18] (ZHM), [19] (ZG), [20] (WL), [21] (JXM), [22] (ZB), [37] (GB), and [38] (GY), along with ViT, ViM, ViM+MSA (VM), and ViT+FMSA (the proposed Fourier-accelerated MSA, denoted as VF). Table 8 and Table 9 summarize the RMSE values of the classical and emerging models across all datasets, respectively, while Table 10 and Table 11 present the corresponding DR5 metrics.
The accuracy figures for the Ex, EL, Pu, and GB models can be reused across some shared datasets. As shown in Table 8 and Table 9 (RMSE) versus Table 10 and Table 11 (DR5), classical pupil detection algorithms (Ex, EL, Pu) fail to deliver satisfactory results under varying illumination conditions. In contrast, the proposed ViMSA model maintains robust detection rates even under extreme conditions including poor lighting, reflections, contact lenses, and mascara application, outperforming both classical image-processing-based methods and recently proposed detection algorithms in DR5 and RMSE across most datasets. Notably, the SK sub-model of Pu achieves 99% accuracy and 1.7 px RMSE on dataset DXVII, surpassing ViMSA's performance. This is likely because the dataset contains only 268 images, whereas ViMSA (like other deep learning models) requires larger training sets to achieve optimal generalization, while the simpler SK architecture may perform better on smaller datasets. This suggests that expanding the volume of pupil images could further improve ViMSA's performance on such datasets. Table 9 and Table 11 show identical results for VM and ViMSA, as well as for VF and ViT, since their core computational workflows remain fundamentally unchanged, with the FFT serving primarily to accelerate matrix multiplication.
Figure 10 shows the loss function curves of the latest pupil detection models on different datasets.
From Figure 10, it can be seen that the convergence stability and efficiency of ViMSA are slightly better than those of the other recent models.
Table 12 shows the GFLOPs and running time for each model.
As evidenced in Table 12, the proposed ViMSA model demonstrates superior computational efficiency compared to most state-of-the-art approaches, with a notably low complexity of 0.102 GFLOPs, surpassed only by the GB model and the original ViM architecture. While GB achieves faster execution through its simplified dual-convolutional design (processing the x and y coordinates separately), this decoupled approach fails to capture the spatial correlation between the coordinates, resulting in lower accuracy than ViMSA. The baseline ViM, though efficient, exhibits weaker global modeling capability than ViT. ViMSA addresses this by integrating ViT's MSA mechanism for enhanced global representation, achieving accuracy improvements with an overhead of merely 0.008 GFLOPs. Experimental results confirm that replacing standard MSA with FFT-accelerated MSA in ViT (yielding VF) improves speed and reduces complexity. Ultimately, ViMSA meets real-time pupil detection requirements (>100 FPS) while maintaining robust performance under challenging conditions (poor illumination, occlusions, etc.), as it combines: (1) ViM's efficient SSM backbone, (2) MSA's global attention, and (3) FFT optimization, collectively solving critical issues in extreme scenarios.

5. Discussion

The excellent detection performance of the proposed ViMSA model on multiple datasets indicates that the model can be used independently in any real-time eye tracking or gaze tracking system built on head-mounted pupil acquisition devices. Specifically, it can be applied to scenarios where the ambient light intensity changes rapidly, such as real-time gaze tracking for pilots, as well as to pupil localization under strong light (frequent blinking) and in cases of pathological eyelid droop.
The conventional non-learning-based image processing algorithms (ELSe, ExCuSe, and PuReST) generally deliver satisfactory detection results, but exhibit severe performance degradation under specific illumination conditions as demonstrated in Table 9. Notably, the proposed ViMSA model achieves superior detection accuracy across all datasets, demonstrating significantly enhanced robustness to varying lighting conditions and partial occlusions compared to edge-detection and ellipse-fitting-based approaches like ELSe.
Compared with the CS, ZHM, WL, and ZG models: CS integrates UNet with CSA, enhancing UNet's capability to capture global features; however, the matrix operations involved in CSA are computationally analogous to MSA, which inevitably increases the model's computational complexity. WL employs a ResNeSt and ViT cascade as the backbone network, while ZHM adopts FPN combined with ViT in series as the backbone; both architectures exhibit relatively high computational complexity, and their serialized network designs fail to fully leverage ViT's global mutual-information modeling capability. The ZG model uses the YOLOv8 object detection network as the backbone and enlarges the inter-layer receptive field through the depth of the convolutional network to capture global information; however, the model has high computational complexity. ViMSA has a simple structure and combines the global modeling capability of MSA with the efficiency of ViM, enabling efficient real-time pupil detection.
Compared with the ZB, JXM, and GB models: JXM adopts a two-stage detection mode; although its model complexity is lower, the two detection steps are not decoupled, which limits further improvement of detection accuracy. ZB similarly adopts a two-stage paradigm, where the computational complexity remains non-negligible due to the use of YOLOv5s for initial coarse localization in the first stage, while inheriting the step-dependency limitations characteristic of two-stage architectures. GB decomposes pupil localization into separate x- and y-direction localization, ignoring the potential spatial correlation of the pupil coordinates, which limits the improvement of model accuracy. ViMSA directly learns the relevant pupil information from the pupil image and simultaneously regresses the x and y coordinates without relying on other operations.
Compared with the GY model: GY uses ResNet as the backbone network and combines it with dilated convolutions to adapt to images with different pupil sizes, but it cannot escape the poor global modeling ability of CNNs. ViMSA combines the global modeling capability of ViT with the efficiency of ViM, enabling real-time and efficient pupil detection.
Compared with the GC and KX models: the GC model incorporates the channel-attention SE module to perform weighted calibration on the channel dimension of feature maps, while the KX model introduces a dual attention mechanism combining both spatial and channel attention, enabling adaptive learning of pupil features. However, their essence remains similar to the weighted feature fusion in ViMSA, and in terms of computational complexity the weighted feature fusion method is significantly better than the attention mechanisms.
Table 13 summarizes the aforementioned pupil detection approaches, listing their techniques, ability to capture global dependencies in pupil images, adaptive learning capacity, computational complexity, and inherent limitations.
In Table 13, the first column lists model names, the second column indicates the technical approach of each model, and columns three to five present the three major advantages a model may have in pupil detection tasks (√ denotes that the model possesses the corresponding advantage, while a blank space indicates its absence). The sixth column describes inherent limitations caused by certain models' training patterns or architectural designs.
One potential issue with the proposed model is that all datasets involved in the experiment provide pupil-image/coordinate label pairs, which causes the model to assume by default that every input image contains a pupil. For input images without pupils, a pseudo pupil coordinate will still be produced at test time based on the internal information distribution of the image. A possible solution is to add a large number of pupil-free images to the training data, such as fully closed-eye images or environment images without human eyes, so that the model learns pupil-free patterns and outputs the expected empty coordinates.
In future research, this ViM-based pupil detection model can be used as an auxiliary technology for various real-time applications. In the medical industry, by tracking the movement trajectory of the subject’s pupils, it is possible to identify whether the subject has vestibular dysfunction. In the automotive industry, detecting driver distraction and identifying driver drowsiness based on eye tracking technology can improve driving safety. In addition, some educational platforms choose online education mode, allowing students to freely arrange their learning location and time, which can lead to lecturers being unable to track students’ attention in a timely manner. In this case, it is necessary to establish a VFOA tracking system to evaluate the effectiveness of the online learning environment.

6. Conclusions

This paper improves and extends ViM, proposing a new pupil center detection model ViMSA that can achieve high-precision and real-time pupil center localization under variable lighting conditions and partial occlusion.
To enable the model to correctly detect the pupil center in extreme situations, the proposed ViMSA model adopts two schemes in series. The first is to introduce learnable weight parameters matching the number of patches, so that the model can learn the likely pupil regions in the image in a targeted manner; this is similar to what attention mechanisms do, but with much lower computational complexity than an attention module. The second is to use ViM as the network backbone and integrate MSA, giving the network ViT's finer-grained ability to capture global features so that it can accurately detect pupil centers under different lighting and occlusion conditions. On this basis, the FFT is used to further reduce the computational complexity of the model, forming the ViMSA model. Compared to other networks, the proposed ViMSA achieves higher detection accuracy with lower computational complexity.
The proposed ViMSA model was evaluated on approximately 135,000 pupil images from 30 distinct datasets, comprising diverse samples captured under variable and non-uniform lighting conditions. These images include various reflections, occlusion by eyelashes and eyebrows, and the use of mascara or contact lenses.
Experimental results demonstrate that the proposed ViMSA achieves 99.6% DR5, with an RMSE of 1.67 pixels, while maintaining a processing speed exceeding 100 FPS, fulfilling real-time monitoring requirements for diverse applications.
Like other deep learning-based models, the performance of ViMSA depends on the quantity and diversity of the pupil image dataset used to train the network.
Based on these experimental results, the proposed ViMSA model is suitable for processing pupil images under various lighting and occlusion conditions in real-time systems, owing to its high accuracy and real-time performance. As expected, thanks to its strong immunity to variable lighting conditions and partial occlusions, its detection accuracy is significantly better than that of non-learning-based classical detection algorithms and existing learning-based network models.

Author Contributions

Methodology, Y.Z.; Software, Y.Z. and P.X.; Formal analysis, P.W.; Writing—original draft, Y.Z.; Writing—review & editing, C.W.; Supervision, C.W., P.W. and P.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in the article can be found on the website https://es-cloud.cs.uni-tuebingen.de/d/8e2ab8c3fdd444e1a135/?p=/datasets-head-mounted&mode=list, listed in reference [48].

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Susac, A.; Bubic, A.; Planinic, M.; Movre, M.; Palmovic, M. Role of diagrams in problem solving: An evaluation of eye-tracking parameters as a measure of visual attention. Phys. Rev. Phys. Educ. Res. 2019, 15, 013101. [Google Scholar] [CrossRef]
  2. Cutumisu, M.; Turgeon, K.-L.; Saiyera, T.; Chuong, S.; Esparza, L.M.G.; MacDonald, R.; Kokhan, V. Eye tracking the feedback assigned to undergraduate students in a digital assessment game. Front. Psychol. 2019, 10, 1931. Available online: https://www.frontiersin.org/article/10.3389/fpsyg.2019.01931/full (accessed on 18 September 2019). [CrossRef] [PubMed]
  3. Masse, B.; Ba, S.; Horaud, R. Tracking gaze and visual focus of attention of people involved in social interaction. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2711–2724. Available online: https://ieeexplore.ieee.org/document/8194910/ (accessed on 28 December 2018). [CrossRef]
  4. Zhuang, Q.; Kehua, Z.; Wang, J.; Chen, Q. Driver fatigue detection method based on eye states with pupil and iris segmentation. IEEE Access 2020, 8, 173440–173449. [Google Scholar] [CrossRef]
  5. Yu, H.; Xie, Y.; Chen, X. Eye-tracking studies in visual marketing: Review and prospects. Foreign Econ. Manag. 2018, 40, 98–108. Available online: http://qks.sufe.edu.cn/J/WJGL/Article/Details/A030fbb749-95b9-4bee-b8c1-681a5d8d6976 (accessed on 1 December 2018).
  6. Clay, V.; König, P.; König, S.U. Eye tracking in virtual reality. J. Eye Mov. Res. 2019, 12, 3. Available online: https://bop.unibe.ch/JEMR/article/view/4332-Clay-final-sub (accessed on 27 May 2019). [CrossRef]
  7. Bowyer, K.W.; Hollingsworth, K.; Flynn, P.J. Image understanding for iris biometrics: A survey. Comput. Vis. Image Underst. 2008, 110, 281–307. [Google Scholar] [CrossRef]
  8. Skodras, E.; Kanas, V.G.; Fakotakis, N. On visual gaze tracking based on a single low cost camera. Signal Process. Image Commun. 2015, 36, 29–42. Available online: https://linkinghub.elsevier.com/retrieve/pii/S0923596515000909 (accessed on 30 September 2015). [CrossRef]
  9. Lu, C.; Chakravarthula, P.; Liu, K.; Liu, X.; Li, S.; Fuchs, H. Neural 3D gaze: 3D pupil localization and gaze tracking based on anatomical eye model and neural refraction correction. In Proceedings of the 2022 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Singapore, 17–21 October 2022; pp. 375–383. Available online: https://ieeexplore.ieee.org/document/9995692/ (accessed on 26 April 2023).
  10. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
  11. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar] [CrossRef]
  12. Gu, A.; Goel, K.; Ré, C. Efficiently Modeling long sequences with structured state spaces. arXiv 2021, arXiv:2111.00396. [Google Scholar]
  13. Elman, J.L. Finding Structure in Time. Cogn. Sci. 1990, 14, 179–211. [Google Scholar] [CrossRef]
  14. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar]
  15. Ates, G.C.; Coskunpinar, C.; Tse, D.; Pelaez, D.; Celik, E. Robust residual convolutional neural network based pupil tracking for low-computational power applications. Eng. Appl. Artif. Intell. 2024, 133 Pt A, 133. [Google Scholar] [CrossRef]
  16. Xue, K.; Wang, J.; Wang, H. Research on Pupil Center Localization Detection Algorithm with Improved YOLOv8. Appl. Sci. 2024, 14, 2076–3417. [Google Scholar] [CrossRef]
  17. Chugh, S.; Ye, J.; Fu, Y.; Eizenman, M. CSA-CNN: A Contrastive Self-Attention Neural Network for Pupil Segmentation in Eye Gaze Tracking. In Proceedings of the 2024 Symposium on Eye Tracking Research and Applications, Glasgow, UK, 4–7 June 2024; pp. 1–7. [Google Scholar]
  18. Zhang, H.; Wang, C. Integrated neural network-based pupil tracking technology for wearable gaze tracking devices in flight training. IEEE Access 2024, 12, 133234–133244. [Google Scholar] [CrossRef]
  19. Guo, Z.; Su, M.; Li, Y.; Liu, T.; Guan, Y.; Zhu, H. Fast and Accurate Pupil Localization in Natural Scenes. J. Bionic Eng. 2024, 21, 2646–2657. [Google Scholar] [CrossRef]
  20. Li, W.; Wang, C. Vision Transformer-Based Pupil Localization Method. J. Xi’an Technol. Univ. 2023, 43, 561–567. [Google Scholar]
  21. Mi, J.X.; Gao, Y.; Yuan, S.; Li, W. Accurate and Robust Eye Center Localization by Deep Voting. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4070–4082. [Google Scholar] [CrossRef]
  22. Zhu, B.; Wang, K.; Tong, W.; Chai, M.; Zhang, X. Pupil center localization in NIR based on machine learning. In Proceedings of the 4th International Conference on Internet of Things and Smart City (IoTSC 2024), Hangzhou, China, 22–24 March 2024; SPIE: Bellingham, WA, USA, 2024; Volume 13224, pp. 428–433. [Google Scholar]
  23. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. arXiv 2020, arXiv:2005.12872. [Google Scholar]
  24. Rathnayake, R.; Madhushan, N.; Jeeva, A.; Darshani, D.; Subasinghe, A.; Silva, B.N.; Wijesinghe, L.P.; Wijenayake, U. Current trends in human pupil localization: A review. IEEE Access 2023, 11, 115836–115853. [Google Scholar] [CrossRef]
  25. Fuhl, W.; Kübler, T.; Sippel, K.; Rosenstiel, W.; Kasneci, E. ExCuSe: Robust Pupil Detection in Real-World Scenarios. In Computer Analysis of Images and Patterns, Proceedings of the 16th International Conference, Valletta, Malta, 2–4 September 2015; Springer: Berlin/Heidelberg, Germany, 2015. [Google Scholar]
  26. Fuhl, W.; Santini, T.C.; Kuebler, T.; Kasneci, E. Else: Ellipse selection for robust pupil detection in real-world environments. arXiv 2016, arXiv:1511.06575. [Google Scholar]
  27. Santini, T.; Fuhl, W.; Kasneci, E. PuReST: Robust pupil tracking for real-time pervasive eye tracking. In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications, Warsaw, Poland, 14–17 June 2018; p. 61. [Google Scholar]
  28. Wang, C.; Zhang, J.; Li, J.J. Research on Fast Localization Method of Pupil Center. Comput. Eng. Appl. 2011, 47, 196–198+201. [Google Scholar]
  29. Timm, F.; Barth, E. Accurate eye centre localisation by means of gradients. In Proceedings of the Sixth International Conference on Computer Vision Theory and Applications, Vilamoura, Algarve, Portugal, 5–7 March 2011. [Google Scholar]
  30. Cerrolaza, J.J.; Villanueva, A.; Cabeza, R. Taxonomic study of polynomial regressions applied to the calibration of video-oculographic systems. In Proceedings of the 2008 Symposium on Eye Tracking Research & Applications, Savannah, GA, USA, 26–28 March 2008; pp. 259–266. [Google Scholar]
  31. Krafka, K.; Khosla, A.; Kellnhofer, P.; Kannan, H.; Bhandarkar, S.; Matusik, W.; Torralba, A. Eye tracking for everyone. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2176–2184. [Google Scholar]
  32. Larumbe-Bergera, A.; Garde, G.; Porta, S.; Cabeza, R.; Villanueva, A. Accurate pupil center detection in off-the-shelf eye tracking systems using convolutional neural networks. Sensors 2021, 21, 6847. [Google Scholar] [CrossRef] [PubMed]
  33. Yiu, Y.H.; Aboulatta, M.; Raiser, T.; Ophey, L.; Flanagin, V.L.; Zu Eulenburg, P.; Ahmadi, S.A. DeepVOG: open-source pupil segmentation and gaze estimation in neuroscience using deep learning. J. Neurosci. Methods 2019, 324, 108307. [Google Scholar] [CrossRef]
  34. Vera-Olmos, F.J.; Pardo, E.; Melero, H.; Malpica, N. DeepEye: Deep convolutional network for pupil detection in real environments. Integr. Comput. Aided Eng. 2019, 26, 85–95. [Google Scholar] [CrossRef]
  35. Fuhl, W.; Santini, T.; Kasneci, G.; Rosenstiel, W.; Kasneci, E. PupilNet v2.0: Convolutional Neural Networks for CPU based real time Robust Pupil Detection. arXiv 2017. [Google Scholar] [CrossRef]
  36. Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Lin, H.; Zhang, Z.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R.; et al. ResNeSt: Split-Attention Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  37. Bonteanu, G.; Bonteanu, P.; Cracan, A.; Bozomitu, R.G. Implementation of a High-Accuracy Neural Network-Based Pupil Detection System for Real-Time and Real-World Applications. Sensors 2024, 24, 22. [Google Scholar] [CrossRef]
  38. Yang, G.; Chen, W.; Wu, P.; Gou, J.; Meng, X. An Irregular Pupil Localization Network Driven by ResNet Architecture. Mathematics 2024, 12, 2703. [Google Scholar] [CrossRef]
  39. Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev. 1958, 65, 386–408. [Google Scholar] [CrossRef]
  40. Chakraborty, T.; KS, U.R.; Naik, S.M.; Panja, M.; Manvitha, B. Ten years of generative adversarial nets (GANs): A survey of the state-of-the-art. Mach. Learn. Sci. Technol. 2024, 5, 011001. [Google Scholar] [CrossRef]
  41. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The Graph Neural Network Model. IEEE Trans. Neural Netw. 2009, 20, 61. [Google Scholar] [CrossRef] [PubMed]
  42. Lipton, Z.C.; Berkowitz, J.; Elkan, C. A Critical Review of Recurrent Neural Networks for Sequence Learning. arXiv 2015. [Google Scholar] [CrossRef]
  43. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
  44. Gers, F.A.; Schmidhuber, J.; Cummins, F. Learning to Forget: Continual Prediction with LSTM. Neural Comput. 2000, 12, 2451–2471. [Google Scholar] [CrossRef]
  45. Gu, A.; Dao, T.; Ermon, S.; Rudra, A.; Ré, C. Hippo: Recurrent memory with optimal polynomial projections. Adv. Neural Inf. Process. Syst. 2020, 33, 1474–1487. [Google Scholar]
  46. Shazeer, N. GLU Variants Improve Transformer. arXiv 2020. [Google Scholar] [CrossRef]
  47. Lin, M.; Chen, Q.; Yan, S. Network In Network. arXiv 2013, arXiv:1312.4400. [Google Scholar]
  48. Available online: https://es-cloud.cs.uni-tuebingen.de/d/8e2ab8c3fdd444e1a135/?p=/datasets-head-mounted&mode=list (accessed on 26 April 2023).
Figure 1. Example of dividing an image into patches.
Figure 2. ViM model.
Figure 3. Vision Mamba Encoder structure.
Figure 4. Diagram of Weight Concatenation.
Figure 5. MSA structural diagram.
Figure 6. The architecture of ViMSA.
Figure 7. Some image examples from different datasets used for experiments: (a) DI—Reflection; (b) DV—Contact lenses; (c) DII—Bad illumination; (d) DVI—Mascara; (e) CIT.
Figure 8. Loss curves for ViMSA on a subset of the datasets under different training hyperparameters.
Figure 9. Loss curves for ViMSA on a subset of the datasets under different model hyperparameters.
Figure 10. Comparison of loss curves of recent pupil detection models on different datasets.
Table 1. Experimental hardware and operating environment parameters.
Hardware | Environmental Parameters
CPU | Intel(R) Core(TM) i7-14700HX CPU @ 2.10 GHz
GPU | 8 GB NVIDIA GeForce GTX 4060
RAM | 32 GB
System | Windows 11
Language | Python 3.11
DL Library | PyTorch 2.6.0
Table 2. DR5 of ViMSA on different datasets under different batch sizes.
Dataset | Image Number | 1 | 2 | 4 | 8 | 16 | 32 | Description of the Dataset
DI655475.0475.9780.5881.8979.3981.78Reflection
DII50573.8478.0582.7980.1180.779.8Bad illumination
DIII979983.9976.9384.5379.8179.5481.33Reflection
DIV265573.4979.2784.0882.879.6979.74Contact lenses
DV213583.1779.0180.6782.9179.8381.59Shifted contact lenses
DVI440083.0176.1483.4980.4575.2780.18Mascara
DVII489073.5883.4881.8479.377.5780.48Mascara, eyeshadow
DVIII63077.080.4281.1381.9180.9279.43Eyelashes
DIX283176.3280.4780.9180.2181.0581.01Additional black dot
DX84083.9682.2684.9782.2477.3881.21Bad illumination
DXI65572.9481.382.3280.5380.581.25Additional black dot
DXII52477.2881.0482.6182.8377.0481.54Bad illumination
DXIII49182.8275.6583.5880.8376.7780.73Bad illumination
DXIV46971.5879.4882.5180.0578.9981.08Bad illumination
DXV36374.1579.5782.1981.1576.380.73Shifted contact lenses
DXVI39283.7581.5682.8880.8179.2979.44Mascara, eyeshadow
DXVII26880.0176.7680.8382.4275.3780.96Mascara
DXVIII10,79471.4583.7381.6179.7882.079.24Reflection, Mascara
DXIX13,47473.1975.1384.6780.7277.6579.87Reflection
DXX10,34480.478.2882.3281.6976.480.82Reflection
DXXI913372.1876.2984.5980.8577.380.74Reflection, Bad illumination
DXXII10,37076.0783.0180.1680.2576.7481.24Reflection, Mascara
DXXIII63679.4277.9480.481.1675.9479.99Bad illumination
DXXIV96176.675.2180.8780.6677.6879.13Reflection
newDI12,17079.7481.2382.2880.8878.5679.83Reflection
newDII703283.5179.5183.1482.8778.7679.14Reflection
newDIII878071.175.2381.9580.5378.180.57Reflection, Bad illumination
newDIV869172.583.384.4179.5576.0181.22Reflection
newDV454474.978.1483.3580.9676.2180.24Bad illumination
CIT20,00071.0880.9784.9182.6776.4880.25CASIA Iris Thousand
Table 3. DR5 of ViMSA on different datasets under different learning rates.
Dataset | 1 × 10⁻¹ | 1 × 10⁻² | 1 × 10⁻³ | 1 × 10⁻⁴ | 1 × 10⁻⁵ | Step Decay | Exponential Decay | Cosine Decay
DI56.9469.780.5883.6740.3788.4889.3891.55
DII28.7370.4582.7986.1229.8487.089.3193.17
DIII39.3650.2584.5385.1740.5787.8791.1290.55
DIV28.8843.2484.0886.6457.8189.6388.7590.47
DV39.2152.5580.6784.2556.5486.1791.8892.81
DVI19.8570.6283.4984.1340.9289.2889.1492.46
DVII35.0936.7281.8485.6550.4889.7891.6591.11
DVIII33.0663.9181.1382.1651.2988.4190.9993.18
DIX20.5141.2380.9183.9247.6287.7391.8493.96
DX32.8150.6684.9783.8153.487.488.9193.4
DXI34.651.882.3284.5854.486.5590.3890.38
DXII32.0172.9782.6186.7338.9387.3688.392.59
DXIII33.4556.2183.5882.0727.9188.9788.9593.29
DXIV51.7643.6282.5184.6450.7789.4891.0892.26
DXV21.866.0882.1982.5941.1988.5490.7590.46
DXVI44.0437.1682.8883.054.889.2990.3392.47
DXVII42.2658.7680.8382.4445.0787.6790.9992.86
DXVIII34.3273.3481.6184.0831.1189.3888.9993.95
DXIX57.5753.7284.6784.8147.5187.7991.5891.61
DXX26.1562.782.3286.1653.1686.3588.490.75
DXXI45.3148.6284.5986.1630.0687.8390.6493.1
DXXII25.9761.0480.1684.6629.988.3790.3290.57
DXXIII35.2765.7380.484.7733.6789.9988.3492.38
DXXIV55.5760.7980.8785.9846.289.6989.3593.06
newDI49.2960.2982.2885.3758.1886.688.4392.3
newDII41.4251.1183.1484.2642.6289.6188.0293.26
newDIII49.6449.9781.9582.1857.5587.9688.7292.52
newDIV45.2742.984.4183.5747.1886.7989.493.34
newDV56.5853.7283.3586.3655.4989.7688.6391.88
CIT56.7459.5584.9185.9223.4288.2190.9590.47
Table 4. DR5 of ViMSA on different datasets under different optimizers.
Dataset | SGD | SGD with Momentum | AdaGrad | RMSProp | Adam | AdamW
DI86.1789.6182.0391.9591.5593.66
DII87.7889.8785.7589.3593.1792.6
DIII86.1488.8578.2888.1190.5592.26
DIV87.187.5779.6289.2490.4791.69
DV87.3487.1185.8788.3692.8194.59
DVI84.8489.7286.7491.092.4694.03
DVII84.7987.5286.3491.7791.1192.93
DVIII85.3387.5285.5591.8893.1893.79
DIX85.3189.6283.5991.6193.9691.62
DX84.7588.0687.1190.8993.493.0
DXI84.1388.8181.6390.2590.3895.0
DXII87.0587.0285.2690.0992.5993.93
DXIII87.8787.4584.391.7993.2994.54
DXIV85.0187.1181.6990.3292.2694.92
DXV84.8987.1578.9290.7190.4692.57
DXVI85.6888.2385.7391.5792.4793.45
DXVII86.4489.2382.2490.2392.8693.64
DXVIII86.4787.2679.3789.993.9593.34
DXIX84.5388.5580.9189.3191.6193.28
DXX84.4887.582.1991.8690.7593.0
DXXI85.3389.1782.7990.9893.194.87
DXXII87.4187.3379.6588.7690.5792.57
DXXIII86.3987.4880.6190.192.3893.4
DXXIV87.6587.5582.4391.7693.0692.11
newDI86.6189.981.8888.5792.394.72
newDII87.288.586.690.7293.2694.13
newDIII84.3588.985.9588.0392.5293.73
newDIV84.1788.7283.1791.5193.3491.18
newDV84.0287.186.088.1291.8892.86
CIT85.7689.9280.8188.390.4793.66
Table 5. Weight initialization scheme and corresponding DR5.
Dataset | Fixed | Random | Gaussian
DI54.3284.9693.66
DII75.5585.6992.6
DIII51.990.592.26
DIV67.6284.1891.69
DV59.7690.0594.59
DVI54.4184.194.03
DVII61.888.292.93
DVIII73.3188.1793.79
DIX69.3986.791.62
DX58.0886.9593.0
DXI74.3588.3395.0
DXII57.4887.9493.93
DXIII64.9489.3894.54
DXIV72.9289.6394.92
DXV52.3490.8592.57
DXVI42.0586.693.45
DXVII54.2990.2693.64
DXVIII45.5590.5593.34
DXIX73.7290.9893.28
DXX61.4689.5293.0
DXXI50.5887.6294.87
DXXII72.5288.0192.57
DXXIII58.1886.8593.4
DXXIV51.8487.0592.11
newDI56.7388.5894.72
newDII46.2590.394.13
newDIII46.8585.4293.73
newDIV70.2789.1191.18
newDV50.3787.3692.86
CIT82.4686.0293.66
Table 6. Different MSA heads and corresponding DR5.
Dataset | 2 | 4 | 8 | 12 | 16 | 24
DI84.1985.5687.1591.5689.0988.55
DII85.3486.7987.1890.6785.3682.92
DIII85.6786.2187.1491.0389.5888.92
DIV85.3284.4386.5586.4686.7187.77
DV85.686.9188.4686.9689.7581.41
DVI84.0284.5188.8389.6386.4686.31
DVII85.2385.9389.3390.7583.784.09
DVIII84.7184.188.9590.9585.1582.83
DIX85.9985.4188.1590.8289.9689.86
DX85.7785.9189.2689.5789.8585.21
DXI84.384.2486.4489.4486.9583.97
DXII84.6185.5587.1388.9587.7384.09
DXIII85.1285.2590.786.1288.4289.73
DXIV84.5585.7489.1290.0587.987.59
DXV84.3785.989.6491.6489.6581.36
DXVI85.0886.2187.4486.7585.3185.9
DXVII85.4286.2488.7891.0686.581.36
DXVIII85.986.5990.0489.5489.5983.61
DXIX85.484.8989.8291.5290.9183.35
DXX85.0584.7688.5488.7489.5684.56
DXXI85.5285.487.6187.7786.0889.76
DXXII85.8886.6989.3390.2388.5184.59
DXXIII84.0184.0889.0786.5983.4687.16
DXXIV85.4985.8790.7391.4890.3185.59
newDI85.6984.5489.9791.6984.6988.02
newDII85.5284.7887.9592.788.2585.87
newDIII85.9484.1188.9392.0188.5382.34
newDIV85.6486.1686.1691.4388.6586.57
newDV84.6785.9690.8786.2689.0484.27
CIT85.3885.990.6992.9390.3787.12
Table 7. Different patch sizes and corresponding DR5.
Dataset | 4 × 4 | 8 × 8 | 10 × 10 | 16 × 16 | 20 × 20 | 40 × 40
DI83.3291.5693.1896.6989.5886.14
DII85.4290.6790.1695.9689.4287.48
DIII83.0691.0396.9193.8887.7887.96
DIV85.8586.4691.6398.0788.4886.62
DV84.4786.9694.7798.6888.5686.87
DVI83.1789.6390.8398.589.1587.33
DVII85.1590.7595.6394.987.185.69
DVIII88.990.9596.7698.4689.7385.6
DIX86.290.8296.7398.0389.3985.48
DX89.0989.5790.1296.3887.0486.05
DXI88.2189.4494.7793.5988.8486.63
DXII91.3788.9595.5897.3388.6986.46
DXIII85.2986.1290.8395.6687.4387.17
DXIV84.8990.0594.4597.4888.0686.82
DXV83.4691.6491.4698.4288.2585.52
DXVI86.4886.7595.6596.8689.9286.52
DXVII90.9691.0690.8198.1487.9886.21
DXVIII85.2489.5491.3795.9889.8886.22
DXIX90.1591.5294.397.5787.2185.94
DXX91.5888.7490.4796.788.0187.25
DXXI83.9687.7796.6797.8688.9286.02
DXXII84.3990.2396.4693.488.1285.13
DXXIII89.9386.5994.5295.6387.0285.21
DXXIV85.2391.4892.2893.3787.0286.4
newDI89.5991.6991.1998.9487.485.95
newDII83.4492.795.0995.9487.9686.18
newDIII83.2592.0190.1293.0688.3586.65
newDIV89.7691.4394.0995.789.4885.72
newDV87.8686.2694.5896.5987.3686.92
CIT92.0292.9397.2199.690.3689.2
Table 8. RMSE of various classic models on different datasets.
DB | Ex | EL | PR | Pu (SK) | Pu (FC) | Pu (FS) | Ours
DI6.844.75.85.64.92.14
DII1387.35.35.45.42.28
DIII1387.78.697.92.67
DIV5.353.83.43.431.88
DV643.63.23.631.77
DVI966.26.65.65.41.8
DVII1198.36.65.36.62.47
DVIII9.976.64.54.75.11.81
DIX644.14.14.14.11.89
DX5.455.35.35.65.12.2
DXI9.464.54.36.43.22.72
DXII5.354.944.34.32.02
DXIII7.365.15.45.14.72.33
DXIV7.5443.22.62.51.99
DXV9.796.85.16.95.11.82
DXVI1497.35.36.85.32.11
DXVII5.432.81.74.02.11.87
DXVIII15109.29.9128.62.27
DXIX1614131416131.98
DXX9.456.45.46.65.42.14
DXXI101185.17.74.71.92
DXXII15101011109.42.76
DXXIII2.8334.14.03.42.34
DXXIV12109.9119.99.92.76
newDI1688.37.39.77.31.72
newDII1715141214122.28
newDIII1413131212112.82
newDIV11107.74.75.852.33
newDV9.265.45.6652.16
CIT4.743.632.62.61.67
Table 9. RMSE of emerging models on different datasets.
DB | CS | ZHM | ZB | GC | WL | JXM | GB | ZG | KX | GY | ViT | ViM | VM | VF | Ours
DI2.964.224.94.094.62.42.42.53.03.32.83.12.142.82.14
DII2.863.324.573.694.83.332.63.13.62.92.72.282.92.28
DIII2.263.773.153.863.23.31.93.12.33.33.12.62.673.12.67
DIV2.53.065.144.943.72.622.43.43.53.22.61.883.21.88
DV2.663.743.893.563.12.2222.42.92.62.81.772.61.77
DVI2.444.263.933.063.12.72.12.72.42.62.62.81.82.61.8
DVII2.134.332.792.914.52.62.82.72.43.82.42.72.472.42.47
DVIII2.344.193.573.593.53.12.63.12.833.531.813.51.81
DIX2.74.43.713.432.62.12.32.92.83.42.81.893.41.89
DX2.633.93.844.494.13.32.332.54.52.72.52.22.72.2
DXI3.02.675.214.24.22.51.81.833.13.22.82.723.22.72
DXII2.674.524.944.984.62.32.72.42.63.13.23.12.023.22.02
DXIII2.982.993.134.924.32.93.13.12.62.93.33.32.333.32.33
DXIV2.483.93.365.063.83.12.52.932.53.52.71.993.51.99
DXV2.393.734.853.9832.52.21.93.13.23.33.41.823.31.82
DXVI3.094.372.813.564.52.23.42.53.13.83.232.113.22.11
DXVII2.723.874.123.862.92.42.13.234.03.42.71.873.41.87
DXVIII3.382.84.044.224.133.22.934.53.12.82.273.12.27
DXIX3.144.343.764.733.43.32.42.22.43.832.81.9831.98
DXX2.174.593.615.034.42.92.52.23.12.82.83.22.142.82.14
DXXI2.782.723.213.873.22.82.12.12.54.133.41.9231.92
DXXII2.733.614.154.34.12.13.22.233.72.42.72.762.42.76
DXXIII2.843.153.213.073.13.12.62.63.53.632.52.3432.34
DXXIV2.083.944.043.623.43.22.42.53.63.22.432.762.42.76
newDI2.564.673.653.813.43.12.43.23.54.12.42.51.722.41.72
newDII2.943.984.174.69432.52.92.93.82.32.72.282.32.28
newDIII3.073.645.084.993.332.12.03.343.63.22.823.62.82
newDIV2.714.152.823.24.62.42.33.12.73.52.72.62.332.72.33
newDV2.583.714.674.44.12.82.533.533.42.52.163.42.16
CIT1.892.512.822.992.71.821.92.23.93.22.81.673.21.67
Table 10. DR5 of various classic models on different datasets.
DB | Ex | EL | PR | Pu (SK) | Pu (FC) | Pu (FS) | Ours
DI72868377788296.69
DII40656980797995.96
DIII38646762606693.88
DIV80838890909298.07
DV76858991899298.68
DVI60787573787998.5
DVII49606473807394.9
DVIII55687384838198.46
DIX76878686868698.03
DX79798080788196.38
DXI58758485749193.59
DXII80798287858597.33
DXIII69748179818395.66
DXIV68848791949597.48
DXV56577281718198.42
DXVI35606980728096.86
DXVII79909399879798.14
DXVIII24575955446295.98
DXIX23333634203797.57
DXX58787479737996.7
DXXI52476581678397.86
DXXII26535450525893.4
DXXIII93949286879095.63
DXXIV46535546555593.37
newDI22626469566998.94
newDII16263344354595.94
newDIII34393845444993.06
newDIV48546783778295.7
newDV59757978768196.59
CIT83858992949499.6
Table 11. DR5 of emerging models on different datasets.
DB | CS | ZHM | ZB | GC | WL | JXM | GB | ZG | KX | GY | ViT | ViM | VM | VF | Ours
DI92.3185.5581.9286.2583.7995.3395.5194.5192.2190.7593.0991.5796.6993.0996.69
DII92.8390.3683.7288.3982.6190.792.2794.4491.4688.7392.4293.6495.9692.4295.96
DIII96.0687.9691.2987.4791.0390.497.9791.3895.990.2991.6294.3393.8891.6293.88
DIV94.7491.7680.6581.7188.2594.3597.3695.2189.7589.2790.894.098.0790.898.07
DV93.9188.1687.3489.1291.7596.5797.5697.495.2392.5394.1692.9998.6894.1698.68
DVI95.0685.3487.1391.7891.4893.9296.8893.4695.3794.3994.4393.1298.594.4398.5
DVII96.7484.9793.2392.5684.0994.3793.2993.8795.2987.7495.3693.694.995.3694.9
DVIII95.685.7289.0688.9389.2691.4294.4491.8293.0692.3389.291.9298.4689.298.46
DIX93.6984.6388.3189.9492.094.1597.0395.8492.4893.4289.992.9598.0389.998.03
DX94.0887.387.6284.186.0690.495.8791.9194.584.293.8494.8796.3893.8496.38
DXI92.0993.8780.2785.6585.594.7398.4798.4692.1291.5491.2592.9793.5991.2593.59
DXII93.8683.9781.7281.583.2895.7893.8595.2294.1791.4190.8791.3797.3390.8797.33
DXIII92.2192.1591.3881.8185.1192.5591.6491.4993.9992.4890.2490.3495.6690.2495.66
DXIV94.8587.2890.1481.0788.0391.6994.6692.5392.294.7989.2993.4497.4889.2997.48
DXV95.3588.1782.1886.8792.1694.6596.4197.8891.6491.0990.2890.1398.4290.2898.42
DXVI91.5984.7493.189.184.0296.4390.0594.6991.787.9391.1992.0696.8691.1996.86
DXVII93.5887.4386.0987.4992.5795.1596.6490.7992.0986.6889.8593.7498.1489.8598.14
DXVIII90.0693.1586.5285.5886.2891.9791.2592.3992.2784.3291.792.9895.9891.795.98
DXIX91.3384.9488.0382.8489.7390.3295.2196.2795.2887.892.1993.297.5792.1997.57
DXX96.5383.5988.8481.2184.8592.4394.9396.2691.6993.2993.1691.1996.793.1696.7
DXXI93.2993.6190.9887.4291.1793.2296.9797.1494.8886.0892.1290.1697.8692.1297.86
DXXII93.5488.8585.9385.1386.1697.1690.8996.1291.988.4595.1793.693.495.1793.4
DXXIII92.9791.2890.9891.791.5391.5594.1194.1589.2488.7792.2294.8595.6392.2295.63
DXXIV97.0187.0886.5188.7790.0690.9295.3294.6688.9591.0395.3192.1593.3795.3193.37
newDI94.4683.1688.6187.7589.8191.595.2791.0289.2185.9995.3994.7998.9495.3998.94
newDII92.4186.8485.8483.0686.6291.9594.6592.6592.7587.9995.9493.8495.9495.9495.94
newDIII91.788.6780.9681.4490.5291.8696.7297.5590.486.5489.191.2593.0689.193.06
newDIV93.6285.9293.0691.083.6295.396.0291.8193.7589.6293.894.3295.793.895.7
newDV94.3188.2783.1884.6386.3892.9594.5891.8389.6591.9890.1894.6996.5990.1896.59
CIT98.0494.7193.0492.1593.4598.4497.6898.2496.587.5191.2293.1699.691.2299.6
Table 12. The GFLOPs and running time for each model.
Model | GFLOPs (G) | Running Time (ms)
ExCuSe-6
ELSe-7
PuReST-4
Pu (SK)-7
Pu (FC)-1200
Pu (FS)-850
CS14.6143
ZHM15.3555
WL16.8157
JXM0.14310
GB0.0938
GC4.417.6
ZG11.231
ZB6.424
KX4.218.37
GY4.117
ViT17.652
ViM0.09418
VM0.3249
VF15.3921
ViMSA0.1029
Table 13. Summary of various models.
Model | Technologies | Global Dependencies | Adaptive Learning | Low Complexity | Limitations
CS | Improve self-attention mechanism
ZHM | Integrating FPN and ViT
WL | Integrating ResNeSt and ViT
ZG | Introduce deformable convolution, deepen the model depth
ZB | YOLOv5s coarse localization, Canny edge detection and fine localization | Limitation: step dependency
JXM | CNN two-step voting mechanism | Limitation: step dependency
GB | Decoupling x, y coordinates | Limitation: ignores the intrinsic connection between x and y
GC | Introducing SE module
KX | Introducing a dual attention mechanism
GY | Introducing dilated convolution
Ours | Improve ViM, introducing weighted feature fusion, accelerate MSA
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
