1. Introduction
Phishing is a form of cyberattack in which attackers create fraudulent websites or messages to obtain a victim’s sensitive information, most commonly through fake emails and fake websites. In email phishing, the user receives a message that appears to come from a trusted source, such as a bank or a government agency, and is prompted to click a link designed to steal personal or sensitive information.
Notwithstanding the widespread dissemination of information about phishing and the large number of studies devoted to this phenomenon, phishing attacks continue to be highly effective and remain one of the most successful forms of cybercrime. According to the Anti-Phishing Working Group [1], phishing remained a high-volume global threat in 2025, with approximately 3.8 million phishing attacks observed over the course of the year, highlighting its persistent and widespread impact.
The success of phishing suggests that many deceptive messages contain cues that are either overlooked, given too little weight, or evaluated too late during the inspection process. Understanding how users visually scan malicious emails is therefore essential for identifying which elements attract attention, which are neglected, and how these patterns of inspection may contribute to correct or incorrect judgments.
Eye tracking has become a key method for studying how users visually inspect phishing emails because gaze metrics provide process-level evidence about ongoing response allocation during evaluation tasks. Authors typically interpret these data under the assumption that fixations are a proxy for attention and processing, and that gaze measures align closely in time with evaluation processes [2,3]. In phishing-email research, eye tracking is used to complement outcome measures such as accuracy, perceived trustworthiness, and response intentions by showing how users visually inspect an email while they emit selection responses (e.g., “phishing” vs. “legitimate”). This focus is also supported by the view that email phishing should be studied separately from web phishing, since the tactics and detectable cues differ across media, and eye-tracking research on phishing emails is still more limited than work on phishing websites [2,4].
1.1. AOI-Based Approaches and Their Limitations
Methodologically, the most common approach is to define Areas of Interest (AOIs) that correspond to meaningful parts of the email, such as the header or sender, the body, URLs or links, and attachments, and then measure inspection through fixation-based and visit-based metrics. A representative example is provided by Pietrantonio et al. [5], who divide emails into header, body, and URL AOIs and compute measures such as total visit duration, total fixation duration, fixation count, visit count, coverage, and scanpath order to describe inspection strategies during phishing-identification tasks. Consistent with this broader trend, a recent review of gaze-based and EEG-based phishing research highlights the frequent use of fixation metrics and AOI time measures across experimental paradigms [3,5,6]. However, the literature also notes that experimental choices can introduce interpretive constraints: phishing materials sourced online may conflate multiple techniques (e.g., threat and urgency), and screenshots can limit interaction compared to real email clients, potentially shaping fixation and dwell patterns. Relatedly, AOI classification itself can be a limitation: AOIs may sometimes support “correct” decisions for the wrong reasons, and very small AOIs can be sensitive to eye-tracker error margins, requiring careful AOI design and (where needed) adjustment procedures.
Empirical findings from email-focused studies suggest that users often spend most of their viewing time on the email body, followed by the header, while the footer and signature receive relatively little attention during discrimination tasks. In the study by Pfeffel et al. [7], participants often began by inspecting the body or the header and moved from top to bottom, but after this initial pass, they tended to jump between regions and revisit them. This pattern suggests a shift from sequential scanning to more targeted, cue-focused checking. These observations are often considered important for phishing detection because the header can contain useful cues for identifying phishing, even though performance may still remain imperfect when users are explicitly told to distinguish phishing from legitimate emails [7].
Within this framework, individual differences, especially expertise and knowledge about cybersecurity, are repeatedly associated with differences in gaze strategy and, in some datasets, with better performance. Pietrantonio et al. [5] report that experts classified emails more accurately than non-experts and present preliminary evidence that experts spent more fixation time on the header, which they interpret as greater awareness of phishing indicators in that area. Similarly, Pfeffel et al. [7] find that their expert group paid relatively more attention to the header and attachments and needed less overall time to process the emails. They argue that both knowledge and processing time were important for successful recognition [5,7].
At the same time, several studies caution that greater attention to suspicious regions does not necessarily lead to correct detection. This is why gaze data should be analyzed together with behavioral outcomes rather than treated as a direct proxy for understanding. In line with this caution, the literature notes that dwell time on an AOI does not necessarily reflect the level of understanding of a security cue, and short glances do not necessarily indicate that an element was missed, meaning that the same gaze pattern can sometimes support different interpretations. For example, Ribeiro et al. [3] found that phishing emails produced longer fixation times and more fixations in the sender area than legitimate emails, but they did not observe significant correlations between eye-tracking measures across AOIs and identification accuracy. McAlaney and Hills [4] similarly found that participants looked at phishing-indicator regions more often than expected by chance and revisited them more frequently, yet they spent less total dwell time on those indicators than expected by chance. They also report that scanning time did not show a simple or direct relationship with trustworthiness judgments. Their findings further suggest that some indicator types, such as misspellings and threatening language, are linked to lower perceived trustworthiness, and that urgency and threat cues tended to be viewed before misspellings. This indicates that different cue types may differentially control responding even when the relation between gaze and outcomes is not straightforward [3,4].
Recent research has also expanded the range of cues under study by examining visually and psychologically manipulative design strategies in phishing emails, described as dark patterns that may bypass user scrutiny. In particular, Oh et al. [8] report that brand imitation and urgent language attracted the least visual attention, with significantly lower fixation duration, fixation count, and revisit count than other cues. This pattern supports the idea that persuasive but visually familiar elements may be systematically overlooked [8].
Studies that combine gaze traces with participants’ reported reasons for their judgements likewise suggest that protective behaviors may be applied only superficially, meaning that users may check cues without reliably extracting the information that is actually diagnostic. Zhuo et al. [9] report that looking at the sender was associated with lower phishing susceptibility. At the same time, they found that participants rarely hovered over links to inspect the actual URLs. When participants did check the URLs, they often did so through the browser, but there was no evidence that this strategy reliably reduced susceptibility in that specific context. More broadly, the literature highlights that both the contextual relevance of materials and the presence of salient design elements can influence how attention is allocated. This implies that gaze strategies observed in controlled tasks may change when messages align closely with users’ everyday goals and routines. Zhuo et al. further argue that email-processing patterns and prior learning histories are important because early attention to salient cues and perceived relevance may shape both perceived trustworthiness and vulnerability during phishing encounters [9]. Taken together, these results emphasize a cue → attention → interpretation → action chain: users may perform “protective” checks yet remain vulnerable if they cannot interpret what they see as diagnostic of phishing.
Because spontaneous cue checking may not be sufficient, another line of research examines interface interventions designed to guide attention and better calibrate trust during email evaluation. Baltuttis and Teubner [2] study visual risk indicators in an eye-tracking experiment and report that these indicators affect both trust and response behavior, treating visual attention as central to the link between risk signaling, trust, and action. In this context, risk indicators refer to interface cues that explicitly signal a message’s suspected risk level (e.g., banners, labels, external-sender warnings, or risk scores), whereas trust indicators refer to cues that implicitly increase perceived legitimacy (e.g., brand logos, familiar layouts, or authoritative wording). At a more adaptive level, ADVERT (ADaptive Visual aids for Efficient Real-time security-assistive Technology) proposes an adaptive attention enhancement mechanism using attention signals/gaze-based features. This approach explicitly frames attention guidance as a phishing-prevention mechanism and connects attention measures to phishing-recognition performance [2,10,11].
Finally, the literature highlights several methodological threats, especially priming and limited ecological validity, that can affect both gaze patterns and performance in controlled phishing experiments. Pfeffel et al. [7] note as a limitation that participants were told in advance that the set included both phishing and legitimate emails, and they suggest more ecologically valid designs, such as embedding phishing messages within broader email-management tasks over time. Baltuttis and Teubner [2] similarly discuss sample bias and priming effects when participants are explicitly informed that the study concerns phishing detection, and they describe design choices intended to reduce these problems through broader recruitment and more general task framing. In line with this logic, Zhuo et al. [9] report that they did not disclose the true purpose of the study and did not tell participants that phishing emails would be present, in order to avoid biasing behavior in their scenario-based design [2,7,9].
1.2. Beyond AOIs: Global Organization of Visual Exploration
A specific contribution of the present study is to extend the analysis of phishing-email inspection beyond conventional AOI-based eye-tracking measures by introducing a complementary scene-based indicator of global visual exploration. AOI-based metrics capture where and for how long people look, though longer inspection does not necessarily reflect greater suspiciousness of the emails. By contrast, scene-based metrics capture the overall structure of the exploration strategy (how systematically fixations are distributed across the stimulus space). Consistent with this interpretation, Wang et al. [12] argue that it is not the total amount of time spent on an email that improves phishing detection, but how effectively that time is used, including which deception indicators the user attends to. A non-suspicious email could elicit longer viewing time because participants are trying to interpret it within their own contextual frame, whereas a suspicious email could be labeled quickly as suspicious after only a minimal amount of time. In such cases, fixation duration could reflect interpretative effort rather than the intrinsic suspiciousness of the stimulus. Prior research similarly suggests that attention to phishing-related cues is not linearly associated with judgment of suspicion [4]. A complementary scene-based measure is therefore useful because it captures the overall spatial organization of fixations across the entire stimulus, rather than focusing only on areas of interest defined by the researcher.
A well-established candidate for this purpose is the Nearest Neighbor Index (NNI), a spatial statistics measure originally proposed by Clark and Evans to quantify spatial relationships in point patterns [13]. It was later introduced into eye-movement research as an indicator of scanning strategy and has since been used across a range of Human Factors and HCI contexts, including aviation, maritime scenarios, automotive driving, gaming tasks, and website exploration, to assess changes in global visual exploration associated with mental workload (see [14,15] for the most recent findings). Conceptually, the NNI compares the average observed distance between the nearest neighbors among fixations with the distance expected under complete spatial randomness. Values around 1 are consistent with spatial randomness; values less than 1 indicate clustered fixations, while values greater than 1 indicate a more regular and dispersed pattern. Calculated across all fixations within a defined reference area, the NNI provides a compact, AOI-independent description indicating whether visual sampling is concentrated in a few locations or distributed across the entire screen. This is particularly relevant for phishing detection, where potentially diagnostic cues may be distributed across both the header and body of the email, and where analyses based solely on AOIs might underestimate significant differences in how users navigate the stimulus while forming a judgment. The explicit theoretical baseline (NNI = 1) also makes the index intrinsically normalized and easily comparable across conditions and stimuli, allowing for an immediate directional interpretation of deviations. Consequently, especially in phishing detection, often studied using AOI-centric approaches where certain cues may be systematically overlooked, a global indicator such as the NNI can capture differences in scanning strategies that might not entirely emerge from local measures.
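For reference, the comparison described above can be written compactly. The expression below is a sketch that assumes the standard Clark and Evans expectation for a random point pattern; the symbols N (number of fixations), A (reference area), and d_ij (Euclidean distance between fixations i and j) are introduced here for illustration and are not notation taken from the cited works.

```latex
\mathrm{NNI} = \frac{\bar{d}_{\mathrm{obs}}}{\bar{d}_{\mathrm{exp}}},
\qquad
\bar{d}_{\mathrm{obs}} = \frac{1}{N}\sum_{i=1}^{N}\min_{j \neq i} d_{ij},
\qquad
\bar{d}_{\mathrm{exp}} = \frac{1}{2}\sqrt{\frac{A}{N}}
```

Under this form, the expected distance grows with the square root of the reference area, so the way A is defined directly affects the value of the index.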
1.3. The Present Study
Accordingly, the present study focuses on the NNI as an index of visual exploration strategy during email classification, and explores the informational role of the NNI compared to other ocular and behavioral metrics (fixation duration and decision time). For the NNI, we test whether it differs between suspicious and non-suspicious emails, indicating that message type elicits different global exploration patterns, and between correct and incorrect trials, indicating that classification success is associated with a distinct spatial organization of fixations. In addition, we test whether the relationship between NNI and performance varies as a function of cybersecurity knowledge, consistent with the role of individual differences in phishing-related decision-making.
3. Data Analysis
Preliminary data processing: Response accuracy was computed by comparing the participant’s response to the true label of each email. Each trial was re-coded into SDT outcome categories (Hit, Miss, False Alarm, Correct Rejection) [18]. Trials with a very low number of fixations (below the 5th percentile of the fixation-count distribution) were excluded from subsequent analyses to reduce the influence of unreliable eye-tracking segments (e.g., signal loss or insufficient visual exploration) and to ensure that oculometric indices were computed from a minimum amount of gaze data. Fixation data were processed from the exported file in which each row corresponded to a fixation associated with a specific stimulus. The reference area required for the computation of the expected nearest-neighbor distance in the NNI was defined as the convex hull area of the cleaned fixation coordinates. The convex hull polygon was obtained via the monotone chain algorithm [21], and its area was computed using Gauss’s shoelace method. Trials with fewer than three distinct fixation points after cleaning were considered insufficient for a reliable convex-hull estimation; in these cases, the convex hull area and NNI for that stimulus were treated as missing. Spatial clustering was summarized via the NNI as the ratio between the observed mean nearest-neighbor distance and the expected distance estimated from the reference area and the number of fixations [14]. Because the reference area used to compute the expected nearest-neighbor distance is derived from the same fixation set used to compute observed distances, NNI values should be interpreted as reflecting both the spatial arrangement of fixations and the extent of the explored area (as captured by the convex hull). This choice was adopted to let the reference area adapt to trial-by-trial differences in exploration, consistent with prior NNI applications in which the convex hull is used as an envelope for the fixation pattern. Fixation coordinates (X, Y) were used to assign each fixation to an AOI by comparing coordinates with predefined rectangular boundaries (Sender, Subject, Body), and to provide average fixation counts for descriptive purposes. Data from one participant were excluded due to technical issues during recordings, and data from another participant were also excluded because eye fixations were not recorded for a large portion of the task. After excluding the two participants with substantial recording problems, 157 trials out of 3180 were removed based on the fixation-count threshold, corresponding to a trial-level data loss of approximately 4.9%. No additional participant- or trial-level quality thresholds were applied beyond the exclusion of participants with largely incomplete gaze recordings and the 5th-percentile fixation-count criterion.
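The per-trial NNI computation described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it assumes the standard Clark and Evans expected distance, 0.5 × sqrt(area / n), and the function names (`convex_hull`, `shoelace_area`, `nearest_neighbor_index`) are hypothetical. The cleaning step and the fixation-count exclusion are not reproduced.

```python
import math

def convex_hull(points):
    """Monotone chain convex hull; returns vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); > 0 means a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def shoelace_area(polygon):
    """Polygon area via the shoelace formula."""
    total = 0.0
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def nearest_neighbor_index(fixations):
    """Observed mean nearest-neighbor distance divided by the expected one.

    Returns None (missing) with fewer than three distinct points or a
    zero-area hull. Expected distance is ASSUMED to be 0.5 * sqrt(area / n).
    """
    if len(set(fixations)) < 3:
        return None
    area = shoelace_area(convex_hull(fixations))
    if area == 0:
        return None  # all points collinear
    n = len(fixations)
    nearest = [
        min(math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(fixations) if j != i)
        for i, (xi, yi) in enumerate(fixations)
    ]
    observed = sum(nearest) / n
    expected = 0.5 * math.sqrt(area / n)
    return observed / expected

# A regular 3 x 3 grid gives a value well above 1 (dispersed pattern);
# three tight pairs of points give a value well below 1 (clustered pattern).
grid = [(x, y) for x in range(3) for y in range(3)]
pairs = [(0, 0), (0.1, 0), (10, 0), (10.1, 0), (5, 10), (5.1, 10)]
nni_grid = nearest_neighbor_index(grid)
nni_pairs = nearest_neighbor_index(pairs)
```

Because the hull is built from the same fixations, a trial with widely spread but locally paired fixations yields a low index, which illustrates the dependence on explored area noted in the text.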
Descriptive analysis: Descriptive statistics were computed for categorical variables (i.e., prior technical assistance due to viruses/security vulnerabilities, use of two-factor authentication) and for device-use variables. Each device category (desktop, laptop, tablet, smartphone, smartwatch) was coded as a binary indicator (0 = not used regularly; 1 = used regularly). A composite index of the number of devices used regularly was obtained by summing device indicators. Distributions and missing values were inspected prior to inferential analyses.
Main analysis: All analyses were conducted with Jamovi software (version 2.6). Normality assumptions were checked by inspecting the skewness and kurtosis of the distribution of each variable. Mean fixation duration variables showed departures from normality and were therefore log-transformed prior to inferential analyses. All ANOVA/ANCOVA models involving fixation duration were conducted on the log-transformed values, whereas descriptive statistics and figures are reported in the original scale to facilitate interpretation (Tables S1 and S2). On data from the 30 analyzed participants, repeated-measures analyses of variance were conducted on the within-subject factors of interest. For the NNI, mean fixation duration, and decision time, 2 × 2 ANOVAs were performed with signal presence (suspicious versus non-suspicious emails) and outcome (right versus wrong answers) as within-subject factors. These models were planned to test whether these metrics are capable of explaining participants’ exploration strategies across the four SDT outcomes (hits, correct rejections, misses, and false alarms). To study the role of knowledge in the cybersecurity domain, the same analyses were replicated in the form of repeated-measures ANCOVAs by introducing the CAIN score as a covariate. The continuous covariate was mean-centered prior to the repeated-measures ANCOVA to scale the intercept to the sample mean and to ensure an accurate and interpretable estimation of the within-subjects marginal means [22]. For all analyses, a significance level of alpha = 0.05 was adopted, and effect size indices (ηp²) were reported. In case of statistical significance, comparisons between conditions were examined further through post hoc comparisons on estimated marginal means, with Bonferroni correction to control Type I error in multiple comparisons. Because all repeated-measures factors included only two levels, the sphericity assumption was automatically satisfied; therefore, Greenhouse-Geisser corrections were not required. To support the interpretation of NNI in relation to complete spatial randomness (NNI = 1), we additionally conducted exploratory post hoc two-tailed one-sample t-tests comparing the participant-level mean NNI in each condition against the theoretical value of 1 (Hₐ: μ ≠ 1).
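As a reading aid for the effect sizes, partial eta squared can be recovered from an F statistic and its degrees of freedom. The helper below is a minimal sketch with a hypothetical name (`partial_eta_squared`); it is not part of the Jamovi workflow described above, and values recomputed from rounded F statistics may differ from reported ones in the last decimal.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from F: (F * df_effect) / (F * df_effect + df_error)."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# Example with an F(1, 29) statistic of 31.01
eta_example = round(partial_eta_squared(31.01, 1, 29), 3)
```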
4. Results
Sample characteristics: In the analyzed sample (N = 30), participants reported using, on average, approximately 2.5 devices regularly (M = 2.50; SD = 0.73; 1–4). Overall, this suggests a predominantly multi-device usage profile. Specifically, device use was strongly dominated by smartphones and laptops. Overall, 96.7% of participants (n = 29) reported regular smartphone use, and 86.7% of them (n = 26) reported regular laptop use. By contrast, the use of desktop computers, tablets, and smartwatches was markedly less frequent. Regular desktop use was reported by 23.3% (n = 7), tablet use by 26.7% (n = 8), and smartwatch use by 16.7% (n = 5). With regard to prior technical assistance for problems related to viruses or security vulnerabilities, 66.7% of participants (n = 20) reported no prior request for assistance, whereas 33.3% (n = 10) reported that they had requested such assistance in the past. This indicates that roughly one-third of the sample had experienced security-related issues requiring external support. Concerning email access and two-factor authentication (2FA), the most frequent response was non-use of 2FA (53.3%, n = 16). A further 33.3% (n = 10) reported using 2FA, while 13.3% (n = 4) stated that they did not know what 2FA is. Taken together, these descriptive findings point to a relatively low uptake of 2FA and suggest the presence of a subgroup with limited awareness of this security measure. Concerning cybersecurity awareness, the average CAIN score was 26.47 (SD = 2.74).
Task performance: On average, the number of valid trials was 100.7 (SD = 7.75, 73–106). Descriptive statistics for the answers by outcome, across the 30 participants, showed the following distributions: Right Answers, M = 76.1 (SD = 10.55; 51–91), and Wrong Answers, M = 24.6 (SD = 7.52; 11–42). Descriptive statistics for the answers by outcome, using SDT levels [18], are reported in Table 1. Descriptive statistics for decision time showed the following distribution: M = 12.3 s (SD = 3.84; 5.77–20.5).
For trials excluded because fixation count was below the 5th percentile, the mean number of excluded trials per participant (N = 30) was: hits 1.80 (SD = 2.76; 0–13), misses 0.03 (SD = 0.18; 0–1), correct rejections 1.17 (SD = 1.98; 0–5), and false alarms 1.93 (SD = 4.45; 0–22).
The sensitivity index d’ was calculated as the difference between the z-transformed hit rate and the z-transformed false alarm rate, and response bias (β) was calculated as the ratio between the height of the standard normal distribution at the z-transformed hit rate and at the z-transformed false alarm rate. The distributions of d’ and β in the sample are reported in Figure 2 and Figure 3, respectively.
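The two indices defined above can be sketched in a few lines using only the Python standard library. The function name `sdt_indices` and the example counts are hypothetical. Hit or false-alarm rates of exactly 0 or 1 have no finite z-transform and would need a correction that the text does not describe, so they are not handled here.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def sdt_indices(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, beta) from raw outcome counts.

    d' = z(hit rate) - z(false-alarm rate)
    beta = pdf(z(hit rate)) / pdf(z(false-alarm rate))
    Raises statistics.StatisticsError if a rate is exactly 0 or 1.
    """
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    z_hit = _STD_NORMAL.inv_cdf(hit_rate)
    z_fa = _STD_NORMAL.inv_cdf(fa_rate)
    d_prime = z_hit - z_fa
    beta = _STD_NORMAL.pdf(z_hit) / _STD_NORMAL.pdf(z_fa)
    return d_prime, beta

# Hypothetical counts: 80% hit rate, 20% false-alarm rate
d_prime, beta = sdt_indices(hits=40, misses=10, false_alarms=10, correct_rejections=40)
```

With symmetric rates such as these, β is 1, which corresponds to an unbiased criterion under this definition.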
To examine the relationship between CAIN scores and phishing-attempt detection, Pearson’s correlations were computed between mean-centered CAIN scores and signal-detection indices. Higher CAIN scores were significantly associated with a lower number of Misses, r(28) = −0.56, p = 0.001, and with a higher hit rate, r(28) = 0.47, p = 0.008. In contrast, correlations with d’, false-alarm rate, Hits, False Alarms, and Correct Rejections were not statistically significant (all ps > 0.05).
Descriptive statistics of gaze allocation and duration: Across the 30 participants and all valid trials, the mean number of fixations per AOI was as follows: Sender (M = 6.75; SD = 2.70; 2.83–13.5), Subject (M = 6.01; SD = 3.08; 1.35–14.1), and Body (M = 41.11; SD = 12.03; 16.38–64.6). Overall fixation distributions (Figure 4) were comparable across email types, with a mean of 51.6 fixations for suspicious emails (SD = 14.0; 30.9–84.5) and 54.9 fixations for non-suspicious emails (SD = 14.0; 34.0–85.4). Across response accuracy, participants produced fewer fixations when providing right answers (M = 52.6; SD = 13.0; 36.4–82.8), compared to wrong ones (M = 58.5; SD = 16.2; 30.4–92.6). Across Signal Detection Theory outcomes, mean fixation counts were lowest for Hits (M = 49.3, SD = 12.9; 30.9–79.1), progressively higher for Correct Rejections (M = 55.0, SD = 14.2; 34.6–84.3) and False Alarms (M = 57.1, SD = 16.9; 30.6–94.0), and highest for Misses (M = 62.3, SD = 19.5; 29.9–103.7).
Main Results: After filtering, no missing values were detected for any of the variables. An examination of the descriptive statistics revealed, for the Decision Time variables, generally low values of skewness and a slight deviation from normality in the kurtosis of Decision Time in Misses (k = −1.057) alone; however, this value was retained as it was of modest magnitude. The variables related to mean fixation duration, on the other hand, exhibited higher indices of skewness and kurtosis; for this reason, they were subjected to logarithmic transformation in order to improve their adherence to normality. After logarithmic transformation, the mean fixation duration variables exhibited distributions that were generally more appropriate for the analyses, with low skewness and generally acceptable kurtosis; only mean fixation duration in correct rejection showed a slight increase in kurtosis (k = 1.193) (Table S2).
The repeated-measures ANOVA on NNI revealed a significant main effect of Outcome, F(1, 29) = 31.01, p < 0.001, ηp² = 0.517, and a significant main effect of Signal, F(1, 29) = 9.70, p = 0.004, ηp² = 0.250. Specifically, NNI was higher for right than for wrong answers and higher for suspicious than for non-suspicious emails. Estimated marginal means showed that NNI was highest for Hits (M = 1.032, 95% CI [1.008, 1.057]) and lowest for False Alarms (M = 0.951, 95% CI [0.924, 0.978]) (Figure 5). The Outcome × Signal interaction was not significant, F(1, 29) = 2.10, p = 0.158, ηp² = 0.067. When the CAIN score was entered as a covariate in the model, the main effects of Outcome, p < 0.001, and Signal, p = 0.005, remained significant, and their interaction, p = 0.159, remained non-significant. The effect of the CAIN score on NNI was not significant, F(1, 28) = 1.92, p = 0.177, ηp² = 0.064. Likewise, none of the interactions involving the CAIN score was significant (all ps > 0.05). In order to interpret NNI values relative to complete spatial randomness (NNI = 1), two-tailed one-sample t-tests (Hₐ: μ ≠ 1) were conducted within each email type and outcome. Given the four comparisons performed, statistical significance was evaluated using a Bonferroni-corrected alpha threshold of α = 0.0125 (0.05/4). NNI in Hits significantly differed from 1, t(29) = 2.72, p = 0.011, Cohen’s d = 0.496, 95% CI [0.113, 0.872], indicating a more dispersed/regular fixation pattern (NNI > 1), whereas NNI in Correct Rejections did not differ from 1, t(29) = −0.82, p = 0.421, Cohen’s d = −0.149, 95% CI [−0.508, 0.212], suggesting a pattern consistent with spatial randomness. NNI in False Alarms was significantly lower than 1, t(29) = −3.76, p < 0.001, Cohen’s d = −0.687, 95% CI [−1.081, −0.283], indicating clustering of fixations. By contrast, although NNI in Misses was lower than 1, this comparison did not survive Bonferroni correction, t(29) = −2.53, p = 0.017, Cohen’s d = −0.461, 95% CI [−0.835, −0.081].
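The Bonferroni decisions above can be checked from the reported t statistics alone. The sketch below uses only the standard library: the two-tailed p-value is approximated by numerically integrating the Student t density with Simpson's rule (a substitute for the exact routine a statistics package would use), and the helper names are hypothetical. Because the inputs are rounded t values, recomputed p-values and effect sizes are approximate.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t_value, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t_value)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * (h / 3) * total  # probability within [-|t|, |t|]
    return 1 - central_mass

def cohens_d_from_t(t_value, n):
    """One-sample Cohen's d recovered from t and sample size: d = t / sqrt(n)."""
    return t_value / math.sqrt(n)

# t statistics as reported in the text, each with 29 degrees of freedom (N = 30)
reported_t = {"Hit": 2.72, "Correct Rejection": -0.82,
              "False Alarm": -3.76, "Miss": -2.53}
alpha_corrected = 0.05 / len(reported_t)  # Bonferroni threshold, 0.0125

significant = {name: two_sided_p(t, 29) < alpha_corrected
               for name, t in reported_t.items()}
```

Under these assumptions, Hits and False Alarms fall below the corrected threshold while Correct Rejections and Misses do not, matching the pattern described in the text.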
The repeated-measures ANOVA on mean fixation duration showed a significant main effect of Outcome, F(1, 29) = 6.97, p = 0.013, ηp² = 0.194. Specifically, right answers showed a lower mean fixation duration than wrong answers. Estimated marginal means (on the log-transformed scale) were lower for Hits (M = −0.625, 95% CI [−0.648, −0.603]) and Correct Rejections (M = −0.630, 95% CI [−0.651, −0.609]) than for Misses (M = −0.622, 95% CI [−0.646, −0.597]) and False Alarms (M = −0.624, 95% CI [−0.645, −0.603]) (Figure 6). The main effect of Signal showed a trend toward significance, F(1, 29) = 3.03, p = 0.093, ηp² = 0.094; mean fixation duration tended to be higher when the signal was present. The Signal × Outcome interaction was not statistically significant, F(1, 29) = 0.260, p = 0.614, ηp² = 0.009. When the CAIN score was entered as a covariate, the main effect of Outcome remained significant, p = 0.011, and the main effect of Signal remained non-significant, p = 0.093, as did their interaction, p = 0.616. The effect of the CAIN score on mean fixation duration was not significant, F(1, 28) = 0.13, p = 0.72, ηp² = 0.005. Likewise, none of the interactions involving the CAIN score was significant (all ps > 0.05).
The repeated-measures ANOVA on decision time showed a significant main effect of Outcome, F(1, 29) = 44.43, p < 0.001, ηp² = 0.605. The main effect of Signal was not significant, F(1, 29) = 0.01, p = 0.924, ηp² = 0.000. However, the Signal × Outcome interaction was significant, F(1, 29) = 12.49, p = 0.001, ηp² = 0.301. Bonferroni-corrected post hoc comparisons on the estimated marginal means indicated that decision time was significantly shorter for correct responses when the phishing signal was present (Hit; M = 11.8 s, 95% CI [10.5, 13.2]) compared with incorrect responses, both when the signal was present (Miss; M = 15.0 s, 95% CI [13.2, 16.8]), p < 0.001, and when it was absent (False Alarm; M = 13.7 s, 95% CI [12.1, 15.4]), p = 0.001 (Figure 7). A trend toward faster responses was also observed for Hits relative to correct responses when the signal was absent (Correct Rejection; M = 13.0 s, 95% CI [11.6, 14.4]), though this difference did not reach significance, p = 0.071. In addition, decision time was significantly shorter for correct rejections than for misses, p = 0.006. When the CAIN score was entered as a covariate in the model, the main effect of Outcome remained significant, p < 0.001, the main effect of Signal remained non-significant, p = 0.922, and the Outcome × Signal interaction remained significant, p = 0.002. The effect of the CAIN score on decision time was not significant, F(1, 28) = 2.67, p = 0.113, ηp² = 0.087. Likewise, none of the interactions involving the CAIN score was significant (all ps > 0.05).
5. Discussion
This study extends eye-tracking research on phishing detection by introducing a scene-based measure of global visual exploration, the Nearest Neighbor Index (NNI). Unlike classical measures, NNI captures whether fixations are clustered or more evenly distributed across the stimulus, without depending on pre-defined regions of interest.
At the aggregate level, the main findings show that NNI was higher for suspicious than for non-suspicious emails, suggesting a shift in exploration strategy when the stimulus imposes greater task demands, and higher for correct than for incorrect classifications, suggesting that errors could be associated with visual anchoring on a limited subset of elements, possibly salient but not diagnostic. Indeed, NNI values in false alarms were significantly lower than 1, whereas misses showed a similar clustering tendency that did not survive Bonferroni correction. Crucially, hits but not correct rejections showed NNI values significantly higher than 1, indicating a more dispersed fixation pattern for correct responses only when the phishing signal is present. Accordingly, NNI may provide an indirect index of differences in visual exploration strategy. When a phishing signal is present, dispersed fixations may reflect a broader and more systematic cue-checking strategy, increasing the probability that diagnostic elements are sampled and correctly interpreted. Conversely, the more clustered pattern observed descriptively in misses may reflect a narrower exploration focus, reducing the likelihood that phishing-relevant cues are included within the inspected region. When no phishing signal is present, fixation patterns consistent with spatial randomness appear to be associated with correct rejections, whereas clustered fixations are associated with false alarms. In this case, a narrow focus may increase the likelihood of overinterpreting salient but non-diagnostic elements as suspicious. Thus, classification errors appear more likely when the stimulus is explored less broadly, although only correct detection in the presence of a phishing signal is associated with a clearly broader visual exploration strategy.
One possible interpretation of these findings draws on the literature linking the NNI to task demands and workload: as workload increases, the NNI tends to indicate more distributed gaze patterns that are less concentrated on a few points [14]. This higher dispersion is interpreted as an adaptation strategy: rather than keeping attention tightly focused, the observer broadens visual sampling to maximize readiness and quickly intercept potentially relevant incoming information. Concurrent bottom-up processes allow salient elements in the email to quickly capture attention and then trigger a broader exploration of the entire scene, which overall increases the dispersion captured by the NNI [14]. This interpretation could explain the high NNI for correct responses when a phishing cue is present (hits), even though the shortest decision time was observed in this condition.
Accordingly, suspicious emails were associated with higher mean fixation durations [23,24], although this difference was only marginally significant.
Importantly, the interpretability of these gaze dynamics is strengthened by the fact that NNI is anchored to an explicit theoretical benchmark (NNI = 1), corresponding to spatial randomness. This reference point allows changes in exploration to be described not only as relative differences between conditions, but as meaningful departures from (or adherence to) a common baseline: values below 1 reflect increasingly clustered, spatially anchored sampling, whereas values above 1 reflect an expansion toward more distributed coverage of the email. In the present data, this anchor makes it possible to frame error patterns as a contraction of exploration, particularly for false alarms, with misses showing a descriptively similar tendency, while correct performance, particularly when a phishing cue is present, reflects a broadening of sampling (hits above 1), and correct rejections remain closer to a randomness-consistent pattern.
Although decision time did not differ significantly between suspicious and non-suspicious emails overall, participants responded faster when they were correct than when they were incorrect, and hits were faster than both misses and false alarms. This pattern is consistent with the idea that longer response times in incorrect trials could reflect uncertainty and prolonged search that remains poorly organized, as suggested by the clustered scanning observed in misses and false alarms, rather than more effective processing that resolves ambiguity successfully. Accordingly, mean fixation duration was lower for correct than for incorrect answers, with the longer fixations in incorrect trials plausibly reflecting prolonged or inefficient stimulus processing [23,24].
Crucially, conventional intensity-based indices (e.g., decision time, number of fixations) do not necessarily map onto the informativeness of visual inspection. Fewer fixations or faster decisions can reflect superficial processing, but they can also reflect a more efficient strategy in which a limited number of fixations is strategically deployed to sample a wider portion of the email and intercept diagnostic cues. Because the NNI indexes the spatial distribution of fixations rather than their quantity, it can reveal “broad-but-efficient” exploration that would otherwise be misclassified as reduced or incomplete inspection. In this sense, NNI provides a complementary, and in some situations more sensitive, indicator of information acquisition, showing whether the available fixations were concentrated on a narrow subset of elements or distributed to maximize coverage of potentially relevant regions.
The distribution of signal detection indices revealed a dissociation between sensitivity and decision criterion. Specifically, d′ values were relatively compact and approximately unimodal, clustering around moderate to high levels of discrimination, suggesting that participants shared a broadly similar ability to distinguish between signal and noise. In contrast, β values showed a markedly skewed and potentially multimodal distribution, with a concentration around neutral to moderately liberal criteria and a long tail toward highly conservative responding. This pattern could indicate substantial variability in decision strategies despite comparable perceptual sensitivity. Moreover, this variability in β could reflect systematic individual differences in how participants regulate their responses under uncertainty. Future research, ideally with larger samples, should further investigate this aspect by identifying and characterizing subgroups of participants exhibiting distinct decision strategies.
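As a reference for how the two indices relate to raw response counts, the sketch below computes d′ and β under the standard equal-variance Gaussian signal detection model. The function name and the example counts are hypothetical, and the log-linear correction for extreme rates is an assumption introduced here for illustration; it is not necessarily the correction applied in this study.

```python
from math import exp
from statistics import NormalDist


def sdt_indices(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, beta) under the equal-variance Gaussian SDT model."""
    # Log-linear correction (an assumption): add 0.5 to each count and 1 to
    # each total so that rates of exactly 0 or 1 do not give infinite z-scores.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)

    z_hit = NormalDist().inv_cdf(hit_rate)
    z_fa = NormalDist().inv_cdf(fa_rate)

    # Sensitivity: separation between signal and noise distributions.
    d_prime = z_hit - z_fa
    # Likelihood-ratio criterion: 1 is neutral, values above 1 are conservative.
    beta = exp((z_fa ** 2 - z_hit ** 2) / 2)
    return d_prime, beta


# Hypothetical participants (illustration only).
neutral = sdt_indices(hits=8, misses=2, false_alarms=2, correct_rejections=8)
conservative = sdt_indices(hits=5, misses=5, false_alarms=0, correct_rejections=10)
```

The first hypothetical participant has mirrored hit and false-alarm rates and therefore a neutral criterion (β equal to 1), whereas the second rarely reports phishing and obtains a β above 1. This illustrates how two observers can differ in criterion independently of their sensitivity.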
Associations between CAIN scores and SDT outcomes suggest that higher CAIN scores were associated with a reduced tendency to miss phishing attempts and with a higher hit rate, although they were not significantly related to overall discrimination performance indexed by d′.
At the same time, CAIN did not emerge as a consistent predictor across models, suggesting that cybersecurity knowledge may not directly shape the moment-by-moment spatial organization of gaze during visual search. Instead, CAIN may be more closely associated with later-stage processes, such as the interpretation of suspicious cues once they have been encountered, than with how visual exploration is spatially organized during inspection, a dimension that NNI appears to capture better.
Taken together, these findings are consistent with the view that cybersecurity knowledge represents a potentially relevant individual-difference factor, though they do not support strong causal claims regarding its direct influence on online gaze behavior. From an applied standpoint, these results tentatively suggest that training interventions might benefit from integrating knowledge-based instruction with approaches aimed at supporting more systematic visual inspection of email content. These observations should, however, be interpreted with caution: the modest sample size (N = 30) limits statistical power to detect subtle covariate effects and higher-order interactions, and the generalizability of the findings to more ecologically valid phishing-detection contexts remains to be established. Further research with larger and more diverse samples is needed to clarify the role of individual differences in shaping gaze-based indicators of phishing detection.