1. Introduction
Post-COVID-19 condition (PCC) describes symptoms that persist after SARS-CoV-2 infection, typically emerging about three months after the acute illness and lasting for at least two months, without being explained by another medical condition [1]. Symptoms may represent a continuation of the acute episode or arise only after apparent recovery [1]. Recent studies suggest that PCC affects roughly 43% of individuals following infection [2], with higher rates observed in patients who experienced severe acute disease requiring hospitalization [2]. Many of these patients experience a broad range of complaints affecting multiple organ systems. Reported manifestations include fatigue, headache, loss or distortion of smell and taste, and dyspnea [3].
Subjective cognitive complaints are also common, with patients describing “brain fog”, a non-specific term referring to reduced mental clarity and slowed thinking, concentration difficulties, and episodes of confusion. Neuropsychological testing has revealed cognitive impairments (i.e., objective reductions in cognitive performance relative to normative expectations) in several domains, such as executive functioning, working memory, and processing speed, following PCC [
4,
5,
6,
7]. These complaints are strongly associated with fatigue and depressive symptoms, and show a weak relationship with objective cognitive performance measured with neuropsychological testing. This suggests that cognitive impairment may be present even in patients without cognitive complaints, regardless of hospitalization status [
4]. Cognitive impairment is often mild and heterogeneous, varying across individuals in the domains affected, severity, and the co-occurring symptoms contributing to the impairment. Altogether, distinguishing and predicting cognitive impairment remains challenging, particularly at the individual level.
Neuropsychological testing is the established reference for the assessment of cognitive impairment across neurological and psychiatric conditions. In PCC, as in other contexts, neuropsychological test batteries are used to quantify cognitive performance and identify domain-specific impairments. However, their application in large-scale or repeated assessments outside specialized neuropsychological settings is limited by practical factors, including long administration times, the need for trained professionals, and the susceptibility of scores to factors such as fatigue, education, and mood. Brief cognitive screening tools, such as the Montreal Cognitive Assessment (MoCA), were developed to address some of these limitations, offering a faster and more accessible alternative. While the MoCA is commonly adopted due to its sensitivity to mild cognitive impairment [
8,
9], its performance is influenced by demographic factors such as education level and age, which motivated the introduction of score adjustments [
10]. Subsequent studies and meta-analyses have shown that optimal MoCA cut-off scores vary across populations and clinical settings, and that cross-cultural adaptations exhibit variability in sensitivity and specificity [
10,
11,
12]. This MoCA variability is particularly relevant when the cognitive impairment assessed is subtle and heterogeneous, as observed in PCC. From a practical perspective, the MoCA must be administered by trained and certified professionals to ensure standardized delivery and scoring [
8]. Taken together, these limitations motivate interest in complementary, objective screening approaches.
One promising approach involves the use of objective, measurable motor and physiological responses as indirect predictors of cognitive performance. Eye-Tracking (ET) has been increasingly used as an objective, non-diagnostic, non-invasive, and time-efficient method for characterizing oculomotor behavior associated with cognitive impairment [
13,
14]. Demonstrating that ET metrics can reliably predict neuropsychological performance at the individual level would support their use as a complementary tool to instruments like the MoCA. Numerous studies have shown that specific ET metrics relate to distinct cognitive domains [
15]. This association between ET data and cognition is rooted in the fact that eye movements are controlled by a complex oculomotor system involving cortical and subcortical brain regions [
16,
17,
18].
Measures obtained during smooth-pursuit eye movements have been found to identify individuals with neurological or neurodegenerative conditions [
19,
20]. Performing smooth pursuit tasks activates multiple brain areas, including the fronto-insular cortex, anterior cingulate cortex, superior colliculi, supplementary motor area, and thalamus [
15]. The engagement of these regions suggests that smooth pursuit performance depends on neural systems overlapping with executive functions such as attention and response control [
15]. Consistent with this distributed network involvement, smooth pursuit deficits are observed across multiple disorders: Alzheimer’s and Parkinson’s diseases are characterized by reduced velocity gain and frequent saccadic intrusions that worsen with disease progression [
21,
22]; multiple sclerosis patients exhibit low-gain pursuit associated with lesions affecting frontal eye field, cerebellar, and vestibular pathways [
23]; and amyotrophic lateral sclerosis patients show impaired pursuit eye movements, which have been suggested as a sign of extrapyramidal or supratentorial pyramidal involvement [15]. Smooth pursuit impairment is described as a “trait characteristic” in schizophrenia and appears to have a genetic component [
24]. Finally, patients with PCC showed significantly higher amplitude of saccades during smooth pursuit compared to healthy controls [
25].
Fixation duration and stability reflect the brain’s ability to sustain attention and maintain visual engagement [
15,
26,
27]. Rather than being a passive absence of movement, fixation is an active process that requires continuous attention and suppression of reflexive saccades [
28]. Neuroimaging studies indicate that frontal and cingulate regions (bilateral dorsolateral prefrontal cortex, anterior cingulate cortex, and frontal eye fields) contribute to control mechanisms that stabilize gaze [
15,
28]. When these mechanisms are compromised, characteristic patterns emerge: mild cognitive impairment patients tend to show shorter average fixation durations, likely reflecting impaired sustained attention [
26], while Alzheimer’s disease is associated with large “intrusive saccades” during fixation, indicative of a failure in cortical suppression [
21,
28]. Similarly, increased fixation instability has been reported in PCC patients relative to healthy controls [
25].
Pupillary behavior is often assessed using metrics such as latency to pupil constriction, peak pupil constriction, and baseline pupil diameter [
17]. Beyond reflecting light reflexes, task-evoked pupillary responses are sensitive to changes in arousal, attentional control, and mental effort [
17]. Although pupillary responses are less commonly assessed than fixations, they offer accurate information about executive function and are considered potential biomarkers for evaluating conditions such as Alzheimer’s disease and Parkinson’s disease [
15,
16,
26]. Additionally, recent research found that PCC patients showed significantly reduced pupil responses to light stimulation (both constriction and dilation) compared to a control group [
25]. Together, these findings suggest that ET provides rich, multivariate behavioral data that may be sensitive to cognitive impairment not readily captured by conventional screening tests.
However, because eye movements and pupil response metrics reflect the output of multiple control systems, the cognitive information captured by ET is distributed across numerous interrelated parameters. No single metric is sufficient to characterize cognitive impairment; rather, it is the combination and interaction of multiple measures that best captures cognitive performance. This complexity makes ET data well-suited for machine learning approaches aimed at predicting cognitive performance, which can integrate multiple parameters simultaneously and capture nonlinear relationships. Machine learning methods integrated with ET metrics have been used before to differentiate individuals with PCC from healthy controls [
29]. More broadly, computational approaches to ET analysis have been applied across neurological and psychiatric populations supporting the feasibility of using machine learning to extract clinically meaningful information from oculomotor data [
16,
30,
31].
In previous work, we reported that individuals with PCC exhibit specific ET parameters (from fixation, smooth pursuit and pupil dynamics) associated with neuropsychological outcomes [
32]. Data-driven analyses, including principal component analysis and clustering, further indicated that ET features can differentiate patient subgroups with distinct cognitive profiles. While these findings provided evidence that ET can capture clinically relevant information at the group level, they did not address whether ET could support prediction of cognitive performance at the individual level.
Given the evidence linking eye movements and pupil responses with cognitive performance, this study aims to examine whether ET data combined with machine learning can predict performance on neuropsychological tests. The study has two objectives: first, to evaluate whether ET parameters can predict neuropsychological test scores; and second, to assess whether ET features can distinguish between altered and non-altered cognitive performance as determined by neuropsychological assessment.
2. Materials and Methods
2.1. Participants
This study included two samples: a primary cohort that completed both ET and neuropsychological assessments, and an independent normative reference sample used to establish baseline cognitive performance. Individuals with PCC and healthy participants who were enrolled in the Nautilus clinical study (ClinicalTrials.gov Identifier: NCT05307575) and in the Rehab COVID project (ClinicalTrials.gov Identifier: NCT05846126) were invited to take part in this study. Participants were recruited through the Consorci Sanitari de Terrassa (Terrassa, Barcelona, Spain) from November 2021 to October 2024. All procedures were conducted in accordance with the Declaration of Helsinki and received approval from the Drug Research Ethics Committee (CEIm) of the Consorci Sanitari de Terrassa (reference: 02-20-107-070).
Participants with PCC were eligible for inclusion if they met the following criteria: (1) A confirmed diagnosis of COVID-19 based on WHO guidelines, with signs and symptoms present during the acute phase; (2) a minimum of 12 weeks since the acute infection; (3) persistence of post-infection symptoms consistent with PCC; (4) age between 18 and 65 years.
Participants in the ET study were excluded if they presented any of the following: (1) a prior diagnosis of psychiatric, neurological, neurodevelopmental, or systemic conditions known to affect cognition; (2) motor or sensory impairments that could interfere with neuropsychological evaluation; (3) type 1 or type 2 diabetes; (4) history of intraocular or refractive surgery; (5) glaucoma or any retinal pathology; (6) diagnosed or suspected strabismus; (7) stereopsis poorer than 100 arcseconds; (8) binocular near visual acuity worse than 0.2 logMAR or binocular visual acuity worse than 0.15 logMAR (both corrected); (9) high myopia, defined as a refractive error greater than −6 diopters. A total of 191 participants completed the ET protocol. After applying exclusion criteria, 172 participants were retained for the primary ET-neuropsychological analyses (mean age = 49.69 ± 7.58 years; 79% female).
A normative reference sample was recruited to establish baseline cognitive performance for the neuropsychological assessments. It comprised 133 participants, who completed neuropsychological testing only. For this normative sample, exclusion criteria related to prior psychiatric, neurological, neurodevelopmental, or systemic conditions affecting cognition, as well as motor or sensory impairments (exclusion criteria 1 and 2 above), were applied. The normative reference sample was used to derive demographic-adjusted z-scores for the neuropsychological test outcomes.
2.2. Neuropsychological Tests
All neuropsychological assessments were administered individually by trained clinical neuropsychologists in a fixed order designed to minimize fatigue and interference between tasks. More cognitively demanding tasks, such as those involving executive function and working memory, were scheduled earlier in the session when mental load was lower. Short breaks were permitted when needed. Since no single task can fully capture a cognitive domain, multiple complementary tests were used together as indicators of each domain.
The MoCA is a brief cognitive screening tool developed to improve the detection of Mild Cognitive Impairment (MCI). It is designed to be administered in approximately 10 min and covers a broad set of cognitive domains. It comprises a short-term memory task (two learning trials of five words followed by delayed recall); visuospatial tasks (clock drawing and three-dimensional cube copying); executive functioning tasks (a Trail Making-type alternation task, phonemic verbal fluency, and verbal abstraction); attention and working memory tasks (forward and backward digit repetition, target-detection tapping, and serial subtraction); language tasks (confrontation naming of low-familiarity animals and repetition of complex sentences); and orientation to time and place. Its validation results indicated higher sensitivity for detecting MCI than other common screening tests while retaining good specificity [
8]. In the context of PCC, previous work from our group supports the usefulness of including the MoCA as part of brief cognitive screening protocols. In that study, a model including MoCA performance, Digit Symbol, and phonemic verbal fluency contributed to distinguishing individuals with PCC from healthy controls, showing good sensitivity and acceptable overall discrimination, although specificity was low [
4]. Nevertheless, the MoCA remains a screening tool. Its brief multidomain format limits domain-specific interpretation, and it should not be considered a standalone diagnostic instrument or a substitute for comprehensive neuropsychological assessment.
The Digit Symbol test from the Wechsler Adult Intelligence Scale-Third Edition (WAIS-III) was used to assess processing speed and sustained attention. Participants completed a timed task requiring rapid matching of digits to their corresponding symbols. The score reflects the total matches completed [
33].
The Digit Span was used to assess attention and working memory. The Digit Span Forward task evaluates attention, requiring participants to repeat sequences of digits in the same order as presented, with span length increasing until performance failed. In contrast, the Digit Span Backward assesses working memory and executive functioning; in this task, participants repeated sequences of digits in reverse order. For both tasks, the score corresponds to the total number of correctly reproduced sequences [
33].
The Trail Making Test (TMT) includes two parts. TMT-A evaluates processing speed and attention through a numbered sequencing task, while TMT-B adds an executive component by requiring alternating number-letter sequences. Performance on both parts is scored as the time required to complete the sequence, with longer times indicating poorer performance [
34,
35,
36].
The Stroop Color-Word Test (SCWT) comprises three conditions. In the Word and Color conditions, participants read color words or name color patches, providing measures of processing speed. In the Color-Word condition, they must name the ink color of incongruent color words, which places additional demands on executive-function inhibition. Responses are scored as the number of correct items produced within a fixed time interval [
37].
Verbal fluency was assessed through both phonemic and semantic tasks. Phonemic fluency was measured using the Controlled Oral Word Association Test (COWAT) with the letters P, M, and R, while semantic fluency was evaluated through an animal-naming task. In both tasks, performance was scored as the number of valid, non-repeated responses produced within one minute [
38,
39].
2.3. Eye Movement Recording
ET recording was conducted in a separate visit from the neuropsychological testing. Participants were screened to confirm that their near visual acuity and binocularity met the study’s inclusion criteria. Participants performed the visual tasks under binocular viewing using a desktop-mounted EyeLink 1000 Plus eye tracker (SR-Research Ltd., Ottawa, ON, Canada) sampling at 1000 Hz. The standard 9-point calibration was performed prior to data acquisition, achieving a mean accuracy of 0.88° ± 0.59°. Eye movements were recorded binocularly. All recordings were conducted under dim illumination to minimize visual distractions.
Participants were seated and their head was supported by a chin rest at 60 cm from an LCD computer screen (1280 × 1024 pixels, 60 Hz, 17″). When needed, participants wore their habitual glasses or contact lenses to correct refractive error. Stimulus presentation followed the configuration used in our previous study [
25] and included tasks designed to elicit smooth pursuit and steady fixation. Pupil miosis and mydriasis in response to light were also recorded. Smooth pursuit was assessed by asking participants to visually track a small moving dot (0.3°) along horizontal, vertical, and sinusoidal trajectories. During the fixation task, participants fixated a 0.5° central cross. After 10 s, brief peripheral stimuli (distractors) appeared for 30 s, while participants were instructed to maintain fixation on the central target. Pupil responses were elicited with an LED light stimulus. During this procedure, participants fixated a central white cross on a dark background; after a brief interval, the LED light source was illuminated for 5 s and then switched off. Pupil diameter was measured both before and after light onset to quantify pupillary constriction and dilation. All visual stimuli and task sequences were generated and presented using MATLAB v8.5.0.197613, R2015a (MathWorks, Natick, MA, USA) with the Psychophysics Toolbox extension [
40,
41].
2.4. Data Analysis
ET recordings were processed using a combination of the manufacturer’s EyeLink Host Software v5.04 and custom Python v3.8.10 scripts. For the analysis, data from the right eye were used across participants. The native software automatically marked blink periods, and signal segments spanning 200 ms before and after each blink were discarded. It also applied two built-in heuristic filters to attenuate measurement noise. Additional processing steps were performed offline with custom Python scripts. Velocity profiles were smoothed using a 21-ms Savitzky–Golay filter (second order), and saccades were detected through an adaptive velocity-based threshold that adjusted to the noise of each recording [
42]. Fixation periods were derived after removing all velocity-defined saccades.
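The blink-padding step described above can be illustrated with a minimal sketch. It assumes that blink periods are available as (start, end) sample indices and that the signal is sampled at 1000 Hz, so that 200 ms corresponds to 200 samples. The function name and data layout are hypothetical and do not reproduce the EyeLink software or the scripts used in the study.

```python
from typing import List, Tuple

SAMPLING_RATE_HZ = 1000  # sampling rate reported for the eye tracker
PAD_MS = 200             # padding discarded before and after each blink


def valid_sample_mask(n_samples: int,
                      blinks: List[Tuple[int, int]],
                      pad_ms: int = PAD_MS,
                      rate_hz: int = SAMPLING_RATE_HZ) -> List[bool]:
    """Return True for samples kept after removing blinks plus padding.

    `blinks` holds (start, end) sample indices, end inclusive (assumed layout).
    """
    pad = pad_ms * rate_hz // 1000  # padding expressed in samples
    keep = [True] * n_samples
    for start, end in blinks:
        first = max(0, start - pad)
        last = min(n_samples - 1, end + pad)
        for i in range(first, last + 1):
            keep[i] = False
    return keep


# Toy recording of 2 s with one blink between samples 500 and 600.
mask = valid_sample_mask(2000, blinks=[(500, 600)])
n_kept = sum(mask)
```

Downstream metrics would then be computed only on samples where the mask is true.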
A preliminary screening analysis was conducted to identify the ET parameters most strongly associated with neuropsychological performance. Partial Spearman correlations were computed between each ET metric and the neuropsychological test scores, controlling for age and years of education. The complete set of extracted eye-movement and pupil-response metrics tested for correlations is available in Supplementary Table S1. Only parameters showing statistically significant associations were retained for modeling. This procedure yielded four features: fixation Root Mean Square error (RMS) with distractors, smooth-pursuit RMS in sinusoidal and vertical trajectories, and pupil area reduction. This analysis showed correlation coefficients in the range of 0.210–0.290 (p < 0.05) [32].
For both fixation and smooth pursuit tasks, spatial error at each time sample $i$ was defined as the Euclidean distance between gaze position $(x_i, y_i)$ and target position $(x_i^{T}, y_i^{T})$. RMS was computed across the full duration of the trial as in Equation (1):
\[
\mathrm{RMS} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left[\left(x_i - x_i^{T}\right)^2 + \left(y_i - y_i^{T}\right)^2\right]} \tag{1}
\]
where $N$ denotes the total number of gaze samples in the trial.
In the fixation task, the target was static at the central fixation position (0,0), such that $x_i^{T} = 0$ and $y_i^{T} = 0$ for all $i$. In the smooth-pursuit task, target coordinates corresponded to the stimulus position at the gaze timestamp. Lower RMS values indicate reduced spatial error, reflecting greater fixation stability or tracking precision. For each participant, RMS was computed across each trial, yielding a single scalar value per condition (fixation with distractors, sinusoidal pursuit, and vertical pursuit).
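The RMS definition above (the square root of the mean squared Euclidean gaze-to-target distance) can be sketched in a few lines. This is an illustrative, dependency-free example; the function name and the toy coordinates are hypothetical and are not taken from the study's scripts.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def rms_error(gaze: Sequence[Point], target: Sequence[Point]) -> float:
    """Root mean square of the Euclidean gaze-to-target distance."""
    if len(gaze) != len(target) or not gaze:
        raise ValueError("gaze and target must be non-empty and equally long")
    squared = [(gx - tx) ** 2 + (gy - ty) ** 2
               for (gx, gy), (tx, ty) in zip(gaze, target)]
    return math.sqrt(sum(squared) / len(squared))


# Fixation task: the target stays at (0, 0) for every sample.
gaze = [(3.0, 4.0), (0.0, 0.0), (-3.0, -4.0), (0.0, 0.0)]
fixation_rms = rms_error(gaze, [(0.0, 0.0)] * len(gaze))
```

For a pursuit trial, the same function would be called with the stimulus position at each gaze timestamp as `target`.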
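Finally, pupil area reduction was computed relative to baseline. Baseline pupil area ($A_{\mathrm{base}}$) was defined as the mean pupil area during the 1 s interval preceding LED onset ($t_{\mathrm{on}}$):
\[
A_{\mathrm{base}} = \frac{1}{N_{\mathrm{base}}}\sum_{t = t_{\mathrm{on}} - 1\,\mathrm{s}}^{t_{\mathrm{on}}} A(t) \tag{2}
\]
where $A(t)$ denotes pupil area at time $t$ and $N_{\mathrm{base}}$ the number of gaze samples in the time interval.
The constriction level ($A_{\mathrm{min}}$) was defined as the mean pupil area during the 1 s interval beginning at the first detected local minimum $t_{\mathrm{min}}$ following LED onset. The detected local minimum had to represent at least a 10% decrease from baseline.
\[
A_{\mathrm{min}} = \frac{1}{N_{\mathrm{min}}}\sum_{t = t_{\mathrm{min}}}^{t_{\mathrm{min}} + 1\,\mathrm{s}} A(t) \tag{3}
\]
where $N_{\mathrm{min}}$ corresponds to the number of samples in the time interval.
Pupil reduction was expressed as the relative percentage decrease from baseline to minimum pupil area, and this value was used in the modeling analyses as presented in Equation (4):
\[
\mathrm{Pupil\ reduction}\ (\%) = \frac{A_{\mathrm{base}} - A_{\mathrm{min}}}{A_{\mathrm{base}}} \times 100 \tag{4}
\]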
The proportion of missing data varied across features, with fixation RMS (distractors) showing 2.9% missing values, smooth-pursuit RMS in the sinusoidal condition 4.65%, smooth-pursuit RMS in the vertical condition 4.65%, and pupil area reduction 14.53%.
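The pupil area reduction described above can be sketched as follows. The example assumes a pre-smoothed pupil-area trace, takes the first local minimum after LED onset that lies at least 10% below baseline, and returns `None` when no such minimum exists (one way a missing value could arise). The function name, the local-minimum rule, and the toy trace are assumptions made for illustration only.

```python
from typing import Optional, Sequence


def pupil_reduction_percent(area: Sequence[float],
                            onset_idx: int,
                            rate_hz: int = 1000,
                            window_s: float = 1.0,
                            min_drop: float = 0.10) -> Optional[float]:
    """Relative decrease (%) from baseline to the constricted pupil area."""
    n_win = int(window_s * rate_hz)
    if onset_idx < n_win:
        return None  # not enough samples for a 1 s baseline
    baseline = sum(area[onset_idx - n_win:onset_idx]) / n_win
    for i in range(onset_idx + 1, len(area) - 1):
        is_local_min = area[i] <= area[i - 1] and area[i] < area[i + 1]
        # Assumed criterion: minimum must be at least 10% below baseline.
        if is_local_min and area[i] <= (1.0 - min_drop) * baseline:
            window = area[i:i + n_win]
            constricted = sum(window) / len(window)
            return 100.0 * (baseline - constricted) / baseline
    return None


# Toy trace sampled at 10 Hz: 1 s baseline, constriction, partial recovery.
trace = [100.0] * 10 + [90.0, 70.0, 50.0] + [55.0] * 10
reduction = pupil_reduction_percent(trace, onset_idx=10, rate_hz=10)
```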
2.5. Machine Learning Models
The first step of the machine learning pipeline involved preparing the dataset, particularly addressing missing values. In our dataset, missing data arose mainly from technical limitations of the ET (e.g., loss of pupil signal due to eyelid closure or poor tracking quality). These cases were handled using multiple imputation with scikit-learn’s IterativeImputer [
43], which preserves relationships among variables. Following preprocessing, a range of supervised machine-learning models was evaluated to predict neuropsychological performance using ET features and age as inputs. Two families of models were considered: regression algorithms, used to estimate each neuropsychological raw test score, and classification algorithms, used to determine whether neuropsychological test performance was altered or not. All models shared a common input feature set comprising four scalar ET measures (fixation RMS error during the distractor condition, smooth-pursuit RMS error along the sinusoidal trajectory, smooth-pursuit RMS error along the vertical trajectory, and pupil area reduction), each computed as a single value per participant and visual task. Participant age was additionally included as a predictor in all models. The dataset was structured in table format, where each row represented one participant and each column corresponded to one predictor variable. Outcome variables, either continuous neuropsychological test scores (for regression) or binary impairment labels (for classification), were handled separately depending on the modeling task. The following subsections describe the specific models and their implementation.
2.5.1. Regression Approach
Five regression approaches were evaluated: (1) Ordinary Least Squares (OLS), (2) Stochastic Gradient Descent (SGD), (3) Extreme Gradient Boosting (XGBoost), (4) Random Forest, and (5) feedforward neural networks. Implementations used scikit-learn, XGBoost for gradient boosting, and TensorFlow/Keras v2.13.0 [
44] for neural networks.
Hyperparameter optimization was performed for all machine learning models using Bayesian optimization implemented in the Optuna framework [
45] by minimizing the average error across all neuropsychological outcomes. For each model, 80 configurations were evaluated. For SGD, the search covered the loss function, penalty type, and regularization strength. For the ensemble tree methods (XGBoost and Random Forest), it included tree depth, number of estimators, subsampling fractions, and regularization parameters. For the feedforward neural network, the search spanned network depth, hidden layer sizes, activation function (ReLU or ELU), dropout rate, L2 regularization strength, learning rate, batch size, and early stopping patience. The full set of optimized hyperparameter values is listed in Supplementary Table S2.
Model performance was evaluated using 5-fold cross-validation (KFold, shuffle = True). In each iteration, models were trained on four folds and evaluated on the held-out fold, ensuring that no participant’s data were used for both training and performance evaluation within a fold. Out-of-fold predictions (i.e., predictions obtained for each observation when it appears in the test fold) were aggregated across all folds. Predictive performance was assessed using the coefficient of determination (R²) and the Root Mean Squared Percentage Error (RMSPE), computed on the pooled out-of-fold predictions.
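The pooled out-of-fold evaluation can be illustrated with a dependency-free sketch. It uses a mean-only predictor as a stand-in for a trained model, which also shows why pooled R² values at or below 0 correspond to the baseline of predicting the average score. The RMSPE formula shown (root of the mean squared relative error, times 100) is the common definition and is assumed here; the helper names and toy scores are hypothetical.

```python
import math
import random
from typing import List, Sequence


def kfold_indices(n: int, k: int = 5, seed: int = 0) -> List[List[int]]:
    """Shuffle indices and split them into k disjoint test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmspe(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Assumed definition: 100 * sqrt(mean(((y - y_hat) / y) ** 2))."""
    rel = [((t - p) / t) ** 2 for t, p in zip(y_true, y_pred)]
    return 100.0 * math.sqrt(sum(rel) / len(rel))


# Toy test scores; the "model" predicts the training-fold mean.
y = [float(v) for v in range(20, 70, 2)]
oof = [0.0] * len(y)
for test_idx in kfold_indices(len(y)):
    held_out = set(test_idx)
    train = [y[i] for i in range(len(y)) if i not in held_out]
    fold_mean = sum(train) / len(train)
    for i in test_idx:
        oof[i] = fold_mean  # prediction made without seeing observation i

baseline_r2 = r2_score(y, oof)
baseline_rmspe = rmspe(y, oof)
```

Any real regressor would replace the fold-mean line; the pooling of predictions before computing the metrics stays the same.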
2.5.2. Classification Approach
The neuropsychological test battery defined the cognitive outcomes predicted by the models. For each test, outcomes were defined as a binary classification (“altered” vs. “not altered”). Machine learning models were trained using the ET features to predict these classifications. The MoCA was evaluated separately as a screening test and, in some models, was included as an additional predictor.
Because neuropsychological performance varies as a function of demographic factors, classification labels were derived from demographic-adjusted z-scores computed using the normative reference sample (n = 133). This sample was used to establish baseline performance by estimating expected scores using an OLS regression model with age, years of education, and gender as predictors. Standardized z-scores were then calculated as deviations from expected performance, as defined in Equation (5), and classified as altered when |z| > 1 (with direction reversed for time-based tests such as TMT) [46].
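As a hedged illustration of a regression-based z-score of the kind described here, one common form is given below. The symbols, and in particular the use of the residual standard deviation of the normative model ($\sigma_{\mathrm{res}}$) as the scaling term, are assumptions and are not stated in the text.

```latex
\hat{y}_i = \beta_0 + \beta_1\,\mathrm{age}_i + \beta_2\,\mathrm{education}_i + \beta_3\,\mathrm{gender}_i,
\qquad
z_i = \frac{y_i - \hat{y}_i}{\sigma_{\mathrm{res}}}
```

Here $y_i$ is the observed test score of participant $i$ and $\hat{y}_i$ the score expected from the normative OLS model.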
Thus, classification was based on demographically adjusted performance rather than raw test scores or direct group comparisons. For the MoCA, the altered label was defined using the conventional clinical cutoff (score < 26 points).
Although individual tests involve overlapping cognitive processes, neuropsychological performance is commonly interpreted according to the predominant cognitive function required by each task. Accordingly, tests were grouped a priori into cognitive domains reflecting their principal cognitive demand to enable domain-level interpretation and reduce test variability. Separate composites were created to distinguish between executive function tests that vary in their processing speed demands, and between processing speed tests that vary in their attentional demands. The following composites were examined: Attention (Digit Span Forward and TMT-A), Executive Function 1 (Digit Span Backward, TMT-B-A ratio, Stroop Color-Word, and phonological fluency), Executive Function 2 (Digit Span Backward, TMT-B, Stroop Color-Word, and phonological fluency), Processing Speed 1 (Digit Symbol, TMT-A, Stroop Word, and Stroop Color), and Processing Speed 2 (Digit Symbol, Stroop Word, and Stroop Color). A global composite was also derived. For each composite, the altered state was defined by the presence of two or more altered tests within the composite, while the global composite was defined by considering all neuropsychological tests together (excluding MoCA). The global composite should be interpreted as an aggregate indicator of overall cognitive performance within the study, rather than as a clinical diagnosis. It reflects the general principle of multi-domain impairment described in Petersen’s MCI framework [
47] without implying a clinical MCI classification.
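The labeling rules above can be sketched as follows. The example assumes one reading of the cut-off, namely that a test is altered when z < −1 for score-based tests and when z > 1 for time-based tests, and it assumes that the global composite applies the same two-or-more rule across all tests. Both readings, the function names, and the toy z-scores are assumptions made for illustration.

```python
from typing import Dict, Sequence

Z_CUTOFF = 1.0
TIME_BASED = {"TMT-A", "TMT-B"}  # longer completion time = poorer performance

# Two of the composites named in the text, used here as examples.
COMPOSITES = {
    "Attention": ["Digit Span Forward", "TMT-A"],
    "Processing Speed 2": ["Digit Symbol", "Stroop Word", "Stroop Color"],
}


def is_altered(test: str, z: float) -> bool:
    """Assumed reading: poor performance is z < -1, or z > 1 for timed tests."""
    return z > Z_CUTOFF if test in TIME_BASED else z < -Z_CUTOFF


def composite_altered(z_scores: Dict[str, float],
                      tests: Sequence[str],
                      min_altered: int = 2) -> bool:
    """A composite is altered when at least `min_altered` tests are altered."""
    n_altered = sum(is_altered(t, z_scores[t]) for t in tests)
    return n_altered >= min_altered


z = {"Digit Span Forward": -1.4, "TMT-A": 1.2, "Digit Symbol": -0.3,
     "Stroop Word": -1.1, "Stroop Color": 0.4}
labels = {name: composite_altered(z, tests) for name, tests in COMPOSITES.items()}
global_altered = composite_altered(z, list(z))
```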
Two families of classifiers were evaluated: Linear Discriminant Analysis (LDA) and Random Forest (RF) classification. Models were implemented using scikit-learn. Hyperparameter optimization was conducted independently for each neuropsychological outcome to maximize performance. This process used the Optuna framework [45] for Bayesian optimization, targeting the maximization of the Area Under the Receiver Operating Characteristic Curve (ROC-AUC). Final optimized hyperparameters are presented in Supplementary Table S3. The ROC-AUC was used as the primary metric for model optimization and comparison because it provides a threshold-independent evaluation of classifier discrimination. A 5-fold cross-validation procedure was applied following the same strategy as in the regression approach, ensuring that no participant’s data were used for both training and performance evaluation within a fold. Out-of-fold predictions were aggregated across folds, and performance metrics were computed on the pooled predictions. Final comparisons between LDA and Random Forest models were based on these out-of-fold predictions, with McNemar’s test used to assess statistical differences in classification performance.
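A paired comparison of two classifiers with McNemar’s test can be sketched as below. The example uses the exact binomial form of the test on the discordant pairs; whether the exact or the chi-square version was used in the study is not stated, so this choice, the function name, and the toy predictions are assumptions.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(y_true: Sequence[int],
                  pred_a: Sequence[int],
                  pred_b: Sequence[int]) -> float:
    """Two-sided exact McNemar p-value from the discordant pairs."""
    # b: model A correct and model B wrong; c: model A wrong and model B correct.
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        return 1.0  # the two models never disagree in correctness
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Toy out-of-fold predictions for ten participants, all truly "altered".
y_true = [1] * 10
pred_lda = [1] * 8 + [0] * 2  # correct on the first eight
pred_rf = [0] * 6 + [1] * 4   # correct on the last four
p_value = mcnemar_exact(y_true, pred_lda, pred_rf)
```

In this toy case there are six pairs where only the first model is correct and two where only the second is correct.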
MoCA was evaluated as a standalone classifier to serve as a clinical screening reference for comparison with ET-based models. The conventional cut-off score of <26 was applied. The MoCA was selected because it is a brief and widely used screening instrument that provides a global measure of cognitive functioning across multiple domains, with minimal administration time. Because lower MoCA scores indicate greater cognitive impairment, scores were inverted during ROC computation.
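The score inversion mentioned above can be illustrated with a rank-based ROC-AUC. Because lower MoCA scores indicate greater impairment, the negated score is used so that higher values point to the “altered” class. The MoCA scores and labels below are invented for illustration, and the helper is a plain pairwise implementation rather than the library routine used in the study.

```python
from typing import List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a positive case scores higher than a negative one."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical MoCA totals and reference labels (1 = altered).
moca = [22, 24, 25, 27, 28, 26, 29, 30]
altered = [1, 1, 1, 1, 0, 0, 0, 0]

inverted = [-s for s in moca]  # lower MoCA -> higher "impairment score"
auc = roc_auc(altered, inverted)

# Conventional cut-off: a score below 26 is flagged as altered.
moca_flag: List[int] = [int(s < 26) for s in moca]
```

Without the inversion, the same data would yield the complementary value, which would misleadingly suggest below-chance discrimination.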
Additionally, motivated by the potential of complementary classification combining ET and MoCA, a model was trained using the MoCA scores, in addition to the ET features and age, as inputs to predict the abovementioned binary classification. The weight of the MoCA feature was systematically varied by duplicating it within the feature pool, thereby increasing its probability of selection during tree construction [
48]. Multiple models with different MoCA weights were trained and evaluated using the same stratified 5-fold cross-validation procedure described above. Classification models’ performance was assessed as described in the previous section.
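The effect of duplicating a feature can be made concrete with a short calculation. The sketch assumes that, at each split, a random forest draws a fixed number of candidate features uniformly without replacement, and it computes the probability that at least one copy of the duplicated feature is among them. The function names, the choice of two candidates per split, and the toy row are illustrative assumptions.

```python
from math import comb
from typing import List


def duplicate_column(rows: List[List[float]], col: int, copies: int) -> List[List[float]]:
    """Return rows in which column `col` appears `copies` times in total."""
    return [row + [row[col]] * (copies - 1) for row in rows]


def prob_feature_offered(n_other: int, copies: int, max_features: int) -> float:
    """P(at least one copy is among the candidates drawn at a split)."""
    total = n_other + copies
    m = min(max_features, total)
    # comb(n_other, m) is 0 when m > n_other, so the probability becomes 1.
    return 1.0 - comb(n_other, m) / comb(total, m)


# Five other predictors (four ET features plus age), two candidates per split.
p_single = prob_feature_offered(n_other=5, copies=1, max_features=2)
p_triple = prob_feature_offered(n_other=5, copies=3, max_features=2)

# One participant row: [feature_a, MoCA]; MoCA repeated three times in total.
weighted_rows = duplicate_column([[1.0, 2.0]], col=1, copies=3)
```

Under these assumptions the probability rises from one third with a single copy to roughly 0.64 with three copies.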
Given the relevance of the global composite as an indicator of multi-test cognitive alteration aligned with screening applications, a final comparison focused on the global composite outcome. Therefore, ROC curves were computed for the global composite to compare the classification performance of the evaluated models. For the RF-based models, confusion matrices were derived using thresholds selected to ensure a minimum sensitivity of 0.85, reflecting a screening-oriented criterion. Differences in model performance were assessed for statistical significance using McNemar’s test.
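A screening-oriented threshold of this kind can be sketched as follows. The example picks the highest probability threshold whose sensitivity is at least 0.85 and then derives the confusion matrix at that threshold. Choosing the highest qualifying threshold is an assumption about how the criterion was applied, and the function names and toy predictions are hypothetical.

```python
from typing import Dict, Sequence


def confusion_at(labels: Sequence[int], probs: Sequence[float],
                 threshold: float) -> Dict[str, int]:
    """Confusion-matrix counts when predicting 1 for probs >= threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(labels, probs):
        pred = int(p >= threshold)
        if pred == 1 and y == 1:
            counts["tp"] += 1
        elif pred == 1 and y == 0:
            counts["fp"] += 1
        elif pred == 0 and y == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def threshold_for_sensitivity(labels: Sequence[int], probs: Sequence[float],
                              min_sens: float = 0.85) -> float:
    """Highest observed threshold with sensitivity >= min_sens."""
    n_pos = sum(labels)
    for thr in sorted(set(probs), reverse=True):
        if confusion_at(labels, probs, thr)["tp"] / n_pos >= min_sens:
            return thr
    raise ValueError("no threshold reaches the requested sensitivity")


# Toy out-of-fold probabilities for the "altered" class.
labels = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.7, 0.65, 0.6, 0.55, 0.3, 0.5, 0.4, 0.2]
thr = threshold_for_sensitivity(labels, probs)
cm = confusion_at(labels, probs, thr)
```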
To evaluate the relative contribution of each predictor to the model’s decisions, a permutation feature importance analysis was performed on the ET and MoCA combined model. This technique quantifies importance by measuring the mean decrease in the Area Under the Receiver Operating Characteristic Curve (ROC-AUC) when the values of a specific feature are randomly permuted. A total of 50 permutations were performed for each feature.
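Permutation importance based on the decrease in ROC-AUC can be sketched without external libraries. In this illustration a fixed scoring function stands in for the trained combined model, each feature column is shuffled 50 times, and the mean drop in ROC-AUC is reported. The scoring function, the toy data, and the helper names are assumptions and do not reproduce the study's implementation.

```python
import random
from typing import Callable, List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(score_fn: Callable[[List[float]], float],
                           X: List[List[float]], y: Sequence[int],
                           n_repeats: int = 50, seed: int = 0) -> List[float]:
    """Mean decrease in ROC-AUC when each feature column is shuffled."""
    rng = random.Random(seed)
    base = roc_auc(y, [score_fn(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(base - roc_auc(y, [score_fn(r) for r in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: the stand-in model uses only feature 0, so feature 1 is irrelevant.
X = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [int(i >= 10) for i in range(20)]
imp = permutation_importance(lambda row: row[0], X, y)
```

A feature the model ignores yields an importance of exactly zero, while shuffling the informative feature pushes the ROC-AUC toward chance.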
4. Discussion
This study evaluated the ability of ET metrics to predict neuropsychological performance and to distinguish between altered and non-altered cognitive status in PCC, both independently and in combination with the MoCA. The assessment aimed to characterize cognitive performance across key domains rather than to establish a diagnostic determination. The main findings indicate that while ET-based models poorly predicted individual test scores, they provided complementary information that improved classification when combined with the MoCA, especially at the global level. These results suggest that ET metrics may provide additional, objective signals of cognitive impairment that are not fully captured by standard screening tools.
The regression analyses revealed that ET features could not accurately predict individual neuropsychological test raw scores. Across all tests and algorithms, the models performed poorly: R² values fluctuated around 0, meaning the models failed to outperform a baseline that predicts the average score for all participants. RMSPE values were correspondingly high, in some cases exceeding 100%, indicating large differences between predicted and actual scores. Non-linear models (XGBoost and RF) performed slightly better than linear models but remained close to the baseline. Neural networks produced the most negative R² scores. Across models of varying complexity, hyperparameter optimization favored highly regularized or shallow architectures, suggesting a weak signal between ET predictors and neuropsychological scores. Altogether, these findings indicate that the selected ET metrics and current task design may not be sufficient for estimating precise scores on individual cognitive tests, as the relationship between these specific predictors and neuropsychological performance appears to be weak and noisy. It is possible that more complex oculomotor parameters or tasks with higher cognitive loads could provide the additional signal required for robust prediction. Beyond these methodological considerations, this limited predictive power also likely reflects the clinical complexity of PCC. Cognitive impairment in this population is characteristically subtle and highly heterogeneous. This diffuse presentation, combined with high inter-individual variability in symptoms, makes precise score estimation challenging at the individual level [
4,
5,
6,
7].
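The interpretation of these metrics can be made concrete with a short worked example. R² is computed as one minus the ratio of residual to total sum of squares, so a model that always predicts the mean scores exactly 0 and anything worse becomes negative. RMSPE is assumed here to be the root mean squared percentage error expressed in percent; the definition used in this study may differ in detail, and the scores below are invented.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmspe(y_true, y_pred):
    """Root mean squared percentage error in percent (assumed definition).
    Requires every true score to be non-zero."""
    squared = [((t - p) / t) ** 2 for t, p in zip(y_true, y_pred)]
    return 100 * math.sqrt(sum(squared) / len(squared))


# Toy raw scores (hypothetical); their mean is 5.
y_true = [1, 4, 6, 9]
mean_baseline = [5, 5, 5, 5]   # always predict the average score
worse_model = [9, 6, 4, 1]     # predictions anti-correlated with the truth

print(r_squared(y_true, mean_baseline))  # 0.0: no better than the mean
print(r_squared(y_true, worse_model))    # -3.0: worse than the mean
print(rmspe(y_true, mean_baseline) > 100)
```

The last line shows how RMSPE can exceed 100% even for the mean baseline: a single small true score (here 1) turns a moderate absolute error into a very large relative error.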
To our knowledge, the use of ET data to predict neuropsychological test scores via regression has not been previously reported. Regression requires accurate estimation of scores and is highly sensitive to prediction error, particularly when cognitive impairments have a low signal-to-noise ratio, as is the case in PCC. In contrast, classification models reduce test-specific noise and may better capture subtle deviations from normal efficiency.
Based on this rationale, cognitive performance was evaluated using binary classification (altered or non-altered). Using only ET parameters and age as predictors, two classification algorithms were evaluated: LDA and RF. ET-only models showed comparable performance according to McNemar’s test, with AUC values of approximately 0.53–0.70, indicating modest but consistent discriminative ability. Given the comparable performance between models, RF was selected for subsequent analyses due to its greater capacity to capture non-linear relationships. However, RF has reduced interpretability compared with linear models like LDA because predictions are generated from the aggregation of many decision trees rather than a single equation.
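For readers less familiar with McNemar’s test, the following sketch shows how two classifiers evaluated on the same participants can be compared. It implements the exact (binomial) two-sided variant, which depends only on the discordant pairs. Whether the exact or the chi-square variant was used in this study is not specified here, so the choice is an assumption, as are the function name and the toy predictions.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    b = cases classified correctly by model A only,
    c = cases classified correctly by model B only.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    b = sum(1 for y, a, m in zip(y_true, pred_a, pred_b) if a == y and m != y)
    c = sum(1 for y, a, m in zip(y_true, pred_a, pred_b) if a != y and m == y)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the two models never disagree
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy predictions (hypothetical) for ten participants.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
pred_a = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]  # e.g. an LDA-like model
pred_b = [1, 1, 1, 0, 1, 0, 1, 0, 1, 0]  # e.g. an RF-like model
b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, p_value)
```

With one case correct only for the first model and five correct only for the second, the exact p-value is 14/64, or about 0.22, so the two toy models would be considered comparable at the 0.05 level.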
To contextualize the performance of ET classifiers, the MoCA was evaluated as a standalone classifier across the same individual tests and composite cognitive domains. Overall, MoCA outperformed ET-based models, achieving higher AUC values in most outcomes. However, its performance was not uniform across domains, showing limited or below-chance discrimination in tasks such as TMT-A and TMT-B, as well as comparable performance in semantic fluency. These results highlight both the strength of MoCA as a global screening tool and some domain-specific limitations in our population sample.
Given that MoCA and ET metrics capture distinct aspects of cognitive function, multimodal integration was evaluated to combine complementary sources of information. The integrated RF-ET + MoCA model achieved AUC values that exceeded those of the RF-ET classifier alone and were often higher than those of MoCA. Notably, the highest classification performance for the integrated model was observed in the global composite (AUC = 0.80), the Executive Function 2 composite (AUC = 0.81), and the Stroop Color-Word test (AUC = 0.80). These outcomes are dominated by executive control and task complexity, involving the integration of working memory, inhibition, and cognitive flexibility [
33,
34,
35,
36,
37,
38,
39]. While MoCA alone already demonstrated good discriminative performance in these domains, the addition of ET features further improved classification. This pattern suggests that the combination of ET and MoCA captures information that enhances the global assessment offered by standard screening measures. Permutation feature importance analysis (
Figure 3) showed that MoCA was the strongest individual contributor to the multimodal classifier. This finding is expected, given that MoCA shares conceptual and clinical overlap with the neuropsychological outcomes used for classification. Nevertheless, ET metrics, particularly vertical smooth-pursuit RMS and fixation RMS, also contributed positively to classification performance. Although the contribution of individual ET variables was modest, their combined contribution supported the complementary value of ET biomarkers. Furthermore, the weighted RF architecture was specifically designed to balance feature sampling between MoCA and ET variables during tree construction. Therefore, the predominance of MoCA in the feature importance analysis likely reflects both its strong predictive signal and its weight within this multimodal framework.
ET metrics reflect underlying cognitive control processes. Fixation stability reflects the ability to sustain attention and suppress reflexive saccades, while increased saccadic intrusions may indicate reduced inhibitory control. Similarly, smooth pursuit performance depends on predictive tracking and continuous error correction, processes that rely on the integration of attention and executive control [
18,
49]. Together, these measures provide indices of cognitive efficiency at the process level, particularly in terms of attention, inhibition, and executive regulation [
50]. In contrast, neuropsychological test scores reflect what level of performance was ultimately achieved rather than the underlying processes involved. While these two levels of measurement are related, they are not equivalent. This distinction becomes particularly relevant in PCC, where cognitive impairment is often subtle and may not manifest as clear deficits in neuropsychological test scores. Large cohort studies showed that cognitive impairment in PCC is typically modest in magnitude, heterogeneous across individuals, and diffuse across cognitive domains [
4,
7]. Further, individual neuropsychological test scores often remain within normative ranges even when individuals report significant subjective cognitive difficulties or demonstrate subtle inefficiencies across multiple domains. This pattern reflects a global reduction in cognitive efficiency rather than focal, severe impairment in a single specific function [
4,
7].
In this context, detecting a deviation from normative functioning, rather than estimating exact performance levels, becomes more appropriate for PCC. This aligns with the nature of ET metrics, which capture overall cognitive processing rather than domain-specific performance [
50]. Consistent with this perspective, the recent literature has framed cognitive assessment through ET as a classification problem. Prior work emphasized that ET metrics primarily reflect cognitive demands rather than precise performance outcomes [
51], while Kim et al. [
52] demonstrated this approach by developing a deep learning model that distinguished between normal and impaired executive function during visuospatial memory encoding with high accuracy. Rizzo et al. [
53] showed that ET features reliably distinguish cognitive interference states using machine learning classifiers. These studies support the use of ET for screening rather than fine-grained score prediction.
The global composite provides a summary measure of overall cognitive alteration by integrating performance across multiple cognitive domains. It is therefore more aligned with real-world screening objectives. This is relevant in PCC, where cognitive impairment tends to be subtle and distributed across domains. The ability of the ET classification models to identify global cognitive status positions them as a promising complementary screening-level tool. Notably, the ROC profile of the RF-ET model indicates that high sensitivity levels (e.g., ~0.80) can be achieved at moderate specificity (~0.50), an acceptable trade-off in screening contexts where minimizing false negatives is prioritized. This result motivated a direct comparison with the MoCA.
The MoCA was developed and validated for detecting MCI [
10] and has since become the most widely used cognitive screening instrument in clinical practice and research. Although PCC-related cognitive impairment differs conceptually from MCI, comparing MoCA performance across these conditions provides a useful framework for understanding its behavior in PCC populations. Consistent with this perspective, large PCC cohort studies have used the MoCA as a global descriptive index, highlighting its role in this population [
4,
7]. However, the MoCA should still be interpreted as a brief screening instrument rather than a comprehensive diagnostic tool. Its multidomain structure provides an overall estimate of cognitive status but offers limited domain-specific characterization and does not replace a full neuropsychological assessment. The classification performance of the MoCA observed in the present study regarding the global composite differs from that reported in the literature on MCI. A recent systematic review and meta-analysis by Islam et al. [
11], largely based on case–control studies applying Petersen criteria for MCI, reported high sensitivity and moderate specificity using the recommended cut-off score of 26. In contrast, we observed lower sensitivity and higher specificity when applying the MoCA within our heterogeneous PCC cohort with the same cut-off value. This discrepancy is likely attributable to differences in study design and population characteristics. Most existing MoCA validation studies compared individuals with cognitive impairment (typically recruited from memory clinics) against healthy controls. This case–control design samples individuals from opposite ends of the cognitive spectrum, potentially inflating sensitivity through spectrum bias. In contrast, subtle cognitive impairments often fail to produce score reductions, leaving many cognitively altered individuals at or near the screening threshold rather than clearly below it. This increases false negatives and reduces sensitivity. Conversely, the MoCA demonstrates high specificity in our PCC population, functioning conservatively by showing strong agreement with the non-altered label. In addition, optimal MoCA cut-off scores vary as a function of age, education, language of administration, and race or ethnicity, with no universally accepted thresholds, and are associated with variability in sensitivity and specificity with wide confidence intervals across populations and clinical settings [
8,
9,
10,
11,
12].
The MoCA combines multiple cognitive domains into a single composite score whereas ET measures reflect processes that may be sensitive to subtle inefficiencies in cognitive control. This pattern provides a clinical explanation for the complementary performance observed between ET and MoCA in the global composite analysis. While the improvement of the RF-ET + MoCA model over the standalone MoCA is modest in some domains, the confusion matrices showed that for the global composite, the integrated model enhances sensitivity relative to MoCA alone while improving specificity relative to the RF-ET model, integrating the strengths of both methods. Compared with the MoCA literature summarized by Islam et al. [
11], the integrated RF-ET + MoCA model achieved sensitivity within the upper range reported for MCI populations, while maintaining moderate specificity, despite being applied to a more heterogeneous PCC cohort characterized by subtle cognitive impairment. This sensitivity in PCC supports the added value of multimodal integration for screening-level cognitive assessment. From a practical perspective, this performance improvement comes with minimal additional burden. ET tasks are brief, objective, and non-invasive, while the MoCA remains a widely accepted screening instrument with modest administration time. The combined approach offers a balance between diagnostic accuracy and clinical efficiency. Concretely, these results suggest that such a tool could potentially serve as a first-line screening filter in clinical settings. This approach may help identify patients who require more comprehensive neuropsychological evaluation but might otherwise be missed by the MoCA alone.
Several limitations should be considered. First, this study’s cross-sectional design offers only a snapshot of the relationships between cognitive performance and eye movements. As a result, these findings cannot show how cognitive symptoms evolve or recover over time, nor can they establish the prognostic value of ET measures for future cognitive outcomes. Second, MoCA performance is known to be influenced by demographic factors such as age and education, introducing variability that may affect classification outcomes. Third, although ET provides distinct advantages, it exhibited lower specificity compared to multimodal approaches. Fourth, overall classification performance should be interpreted as moderate rather than definitive. Although the integrated ET and MoCA model improved discrimination relative to either modality alone, classification accuracy remains constrained by the subtle and heterogeneous nature of cognitive alterations in PCC. This heterogeneity may also be related to the frequent coexistence of fatigue, anxiety, and depression in PCC, conditions known to alter gaze behavior [
54]. Consequently, some ET-derived alterations identified by the models may capture broader affective-attentional processes rather than purely cognitive impairment.
Finally, several methodological factors may limit generalizability and may introduce bias. The relatively small sample size and class imbalance may affect the performance estimates, particularly for outcomes with low prevalence of cognitive alteration. Class imbalance can influence model performance by biasing the classifier toward the majority class. To partially mitigate this, class imbalance was addressed during model training by including class weighting in the Random Forest hyperparameter search (see
Table S3). When applied, class weighting increased the relative contribution of the underrepresented class, helping preserve sensitivity to minority cases. In addition, the transformation of continuous neuropsychological measures into binary labels may reduce information and introduce bias, especially near the selected threshold where small variations may result in different class assignments despite similar cognitive performance. To mitigate these issues, labels were derived from demographically adjusted z-scores and from composite outcomes spanning multiple domains.
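The effect of class weighting can be illustrated with a small sketch. It assumes the common "balanced" heuristic, in which each class receives a weight of n / (K × n_c) for n samples, K classes, and n_c samples in class c; this is also what scikit-learn applies for `class_weight="balanced"`. The weighting options actually searched in this study are those listed in Table S3, and the class counts below are invented.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count), so that
    rarer classes receive proportionally larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (hypothetical): 40 non-altered (0) and 10 altered (1) participants.
labels = [0] * 40 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)

# Total weight carried by each class after reweighting.
totals = {cls: weights[cls] * labels.count(cls) for cls in weights}
print(totals)
```

With 40 non-altered and 10 altered labels, the weights are 0.625 and 2.5, so each class carries the same total weight (25) during training and errors on the minority class are penalized more heavily.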
Future longitudinal data will be required to determine whether ET-based markers are sensitive to change over time and whether they can accurately capture individual recovery trajectories and progression in the PCC population. Future studies with larger and more balanced cohorts will be necessary to further validate the robustness of ET-based classification and to refine decision thresholds across cognitive domains. Lastly, testing more cognitively demanding ET tasks may further increase the discriminative power of these models.