3.1. Feature Selection and Theoretical Foundation
Through the preceding systematic review [2,5,19,22,49,50], we identified six intrinsic features as determinants of elderly visual attention: colour brightness (C–B, brightness of areas of interest in colour images), centre bias (CB, preferential fixation on central screen regions), foreground–background differentiation (F–B D, attention distribution between salient objects and context), depth detection (DD, processing of spatial depth cues), early attentional prior (EAP, temporal delays in fixation initiation), and sustained-attention spatial prior (SASP, duration and spatial clustering of fixations). Three criteria were applied for feature selection.
Features must exhibit empirically documented age-related alterations with replicated evidence across independent studies [51];
Features must be measurable through computational extraction from visual stimuli and eye-tracking data without manual annotation or subjective judgement;
Features must remain relevant across diverse visual contexts (natural scenes, interfaces, multimedia) rather than being task-specific or domain-limited.
These criteria ensure features capture fundamental attentional mechanisms while also enabling automated, generalisable modelling.
We focus on intrinsic characteristics, which are properties inherent to visual stimuli (colour, depth, spatial composition) or attentional behaviour (centre bias, temporal dynamics, fixation patterns), that remain consistent across viewing conditions. We excluded extrinsic parameters such as viewing distance, display resolution, and ambient lighting. Although these factors influence viewing conditions, they introduce experimental variability unrelated to fundamental attentional mechanisms and would limit model generalisation to diverse real-world applications requiring different viewing setups.
Furthermore, these six features are not assumed independent. For example, colour brightness may interact with depth detection via shading cues, whilst centre bias could correlate with the sustained-attention spatial prior if central fixations are systematically prolonged. Disentangling these interactions is fundamental to characterising elderly visual attention. It remains to be determined whether compensatory high-level strategies (such as centre bias and sustained-attention spatial prior) dominate attentional allocation, or whether sensory-level features (including colour brightness and depth) preserve their predictive utility despite age-related sensory decline. Our methodology addresses two questions. First, which features most strongly predict elderly attention allocation? Second, how do features interact to produce observed attention patterns?
To answer these questions, we developed computational algorithms that extract quantitative measurements for each feature from visual stimuli and corresponding eye-tracking data. These feature representations are integrated into a unified gradient boosting framework that enables systematic quantification of individual feature contributions and interactive effects on elderly visual attention prediction.
3.2. Computational Feature Extraction
We extract quantitative measurements of six intrinsic features from visual stimuli and eye-tracking data. Computational algorithms transform RGB images and fixation coordinates into numerical representations encoding elderly-specific attentional characteristics. Features progress from sensory processing (colour brightness) through spatial organisation (centre bias, foreground–background differentiation, depth detection) to temporal dynamics (early attentional prior, sustained-attention spatial prior).
- (1) Colour brightness
Colour constitutes a fundamental visual feature that influences both information extraction and the perception of other visual attributes [52]. Age-related physiological changes reduce colour brightness sensitivity in older adults [4]. We simulate this degradation by converting RGB images to greyscale using a luminance-weighted transformation (Equation (1)).
The coefficients reflect human photopic sensitivity [4,19], with peak response in the green spectrum. This isolates brightness from chromatic content, preserving spatial luminance patterns critical for attention allocation.
Standard photometric coefficients are derived from young-adult photopic sensitivity and do not explicitly account for age-related ocular changes, such as lens yellowing, which attenuates short-wavelength light and alters perceived colour distributions [19]. While one could theoretically introduce manual physiological filters, such rigid parameterisation often fails to accommodate the significant inter-individual variability inherent in visual ageing. Consequently, rather than imposing a pre-defined physiological transformation, we utilise Equation (1) to extract scene luminance as a stable physical baseline. Under our data-driven framework, age-specific perceptual differences are not treated as hand-crafted corrections but are implicitly learnt through supervised training on age-stratified gaze data. This approach ensures that the ‘age-specificity’ of the model is not a post hoc adjustment, but an emergent property of the learnt conditional mappings between physical stimuli and elderly fixation behaviour, resulting in a more robust and statistically grounded estimation of attentional deployment.
The greyscale image ($I_{\mathrm{grey}}$) eliminates chromatic information while preserving edge structures necessary for attention modelling (Figure 2). We apply this transformation to all images, which creates standardised luminance representations for subsequent extraction. In accordance with recent studies such as [53,54], a standard optical examination image was selected to demonstrate the computational results.
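As an illustration, the luminance-weighted conversion can be sketched as follows. The coefficients of Equation (1) are not reproduced in this section, so the widely used ITU-R BT.601 luma weights (0.299, 0.587, 0.114) are assumed here; the function name `to_luminance` is purely illustrative.

```python
import numpy as np

# ASSUMPTION: ITU-R BT.601 luma weights; the coefficients of Equation (1) may differ.
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])


def to_luminance(rgb):
    """Convert an H x W x 3 RGB array into an H x W greyscale luminance map."""
    rgb = np.asarray(rgb, dtype=np.float64)
    if rgb.ndim != 3 or rgb.shape[-1] != 3:
        raise ValueError("expected an H x W x 3 RGB array")
    # Weighted sum over the channel axis; the green channel contributes most.
    return rgb @ LUMA_WEIGHTS


# Small demonstration image: white, pure green, pure blue, and black pixels.
demo = np.array([[[255, 255, 255], [0, 255, 0]],
                 [[0, 0, 255], [0, 0, 0]]])
grey = to_luminance(demo)
```

With these assumed weights, a pure green pixel maps to a much brighter grey level than a pure blue pixel of the same intensity, which is the behaviour the text attributes to photopic sensitivity.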
- (2) Centre bias
Centre bias refers to systematic attention towards central regions, which intensifies with age [5]. Older adults exhibit stronger central preferences than younger populations, reflecting reduced peripheral acuity, narrowed useful field of view, and decreased exploratory scanning [49].
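We compute centre bias through adaptive localisation that integrates geometric layout and semantic content. Traditional methods assume uniform spatial weighting around the geometric centre $(x_c, y_c)$. Older adult observers, however, balance central fixation strategies with salient object locations. We identify the most salient object using pre-trained detection [51], compute its centroid $(x_s, y_s)$, and calculate the effective centre as their midpoint:
$$x_{\mathrm{eff}} = \frac{x_c + x_s}{2}, \qquad y_{\mathrm{eff}} = \frac{y_c + y_s}{2}$$
The centre-bias map is a 2D Gaussian around this point:
$$G(x, y) = \exp\left(-\frac{(x - x_{\mathrm{eff}})^2 + (y - y_{\mathrm{eff}})^2}{2\sigma^2}\right)$$
where $G(x, y)$ is the Gaussian map value at image pixel $(x, y)$, $(x_{\mathrm{eff}}, y_{\mathrm{eff}})$ is the centre point of centre bias, and $\sigma$ is the standard deviation of the 2D Gaussian function.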
This accommodates diverse compositions. With centrally positioned salient objects, $(x_{\mathrm{eff}}, y_{\mathrm{eff}})$ converges towards the geometric centre, reinforcing bias. With off-centre objects, $(x_{\mathrm{eff}}, y_{\mathrm{eff}})$ shifts moderately, capturing elderly observers’ compromise between central fixation and salient content.
We generate a 2D Gaussian centred at $(x_{\mathrm{eff}}, y_{\mathrm{eff}})$ with $\sigma = 0.3 \times$ […], creating a smooth attention weighting that decreases with distance from the centre (Figure 3). This spatial prior quantifies centre bias strength at each pixel.
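The midpoint rule and the Gaussian weighting can be sketched as below. This is a minimal illustration rather than the implementation used in the study: the unnormalised Gaussian form, the pixel-coordinate convention for the geometric centre, and the choice to scale $\sigma$ by the shorter image side are all assumptions, and the function names are hypothetical.

```python
import math


def effective_centre(width, height, salient_centroid):
    """Midpoint between the geometric centre and the salient-object centroid."""
    # ASSUMPTION: the geometric centre is taken in pixel coordinates.
    xc, yc = (width - 1) / 2.0, (height - 1) / 2.0
    xs, ys = salient_centroid
    return (xc + xs) / 2.0, (yc + ys) / 2.0


def centre_bias_map(width, height, salient_centroid, sigma_ratio=0.3):
    """Return a height x width Gaussian prior peaking at the effective centre."""
    x_eff, y_eff = effective_centre(width, height, salient_centroid)
    # ASSUMPTION: sigma is 0.3 times the shorter image side.
    sigma = sigma_ratio * min(width, height)
    return [
        [math.exp(-((x - x_eff) ** 2 + (y - y_eff) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


# A salient object on the right edge pulls the centre halfway towards it.
shifted = effective_centre(101, 101, (100, 50))
prior = centre_bias_map(101, 101, (50, 50))
```

When the salient centroid coincides with the geometric centre, the effective centre is unchanged and the map is symmetric about it; an off-centre object moves the peak only half the distance towards the object.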
- (3) Foreground–background differentiation
Younger adults prioritise foreground objects through efficient figure-ground segregation; older adults allocate increased attention to background regions [2]. This reflects reduced contrast sensitivity impairing foreground extraction and compensatory reliance on contextual information.
We employ hierarchical matting [55] to segment foreground and background, generating a continuous alpha matte ($\alpha$) where $\alpha = 1$ indicates foreground, $\alpha = 0$ indicates background, and intermediate values capture transitions. The algorithm fuses semantic segmentation with detail-preserving matting through adaptive weighting of two terms: a unified foreground probability from semantic encoding and a refined matting output. A weighting term balances their contributions. At uncertain boundaries (values near 0.5), the model emphasises the refined matting output for detail preservation; in confident regions (values near 0 or 1), the model prioritises the semantic foreground probability for semantic consistency.
Our elderly attention model inverts this representation, weighting background regions more heavily to reflect older adults’ disproportionate allocation to contextual content (Figure 4). This design choice does not imply an active cognitive priority for the background, but rather serves as a behavioural representation of the compensatory scanning strategies resulting from age-related declines in figure-ground segregation [2].
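The adaptive fusion and the subsequent inversion can be illustrated with a short sketch. The exact weighting function of the matting method is not given in this section, so the weight `1 - |2p - 1|` (equal to 1 at p = 0.5 and 0 at p = 0 or 1) is a hypothetical stand-in, as is the simple `1 - alpha` inversion; all names are illustrative.

```python
import numpy as np


def fuse_alpha(semantic_prob, matting_alpha):
    """Blend a semantic foreground probability with a refined matting output."""
    p = np.clip(np.asarray(semantic_prob, dtype=np.float64), 0.0, 1.0)
    m = np.clip(np.asarray(matting_alpha, dtype=np.float64), 0.0, 1.0)
    # HYPOTHETICAL weight: 1 at uncertain pixels (p = 0.5), 0 at confident ones.
    w = 1.0 - np.abs(2.0 * p - 1.0)
    # Uncertain pixels follow the matting output; confident pixels follow p.
    return w * m + (1.0 - w) * p


def background_weight(alpha):
    """ASSUMED inversion: background (alpha = 0) receives the highest weight."""
    return 1.0 - np.asarray(alpha, dtype=np.float64)


alpha = fuse_alpha([1.0, 0.0, 0.5], [0.2, 0.9, 0.8])
bg = background_weight(alpha)
```

In this toy case the first two pixels are confidently foreground and background, so the semantic value is kept, whereas the third pixel sits on an uncertain boundary and takes the matting value instead.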
- (4) Depth detection
Depth detection enables prioritisation of proximal objects and navigation [56]. Age-related degradation arises from reduced binocular disparity processing, diminished motion parallax sensitivity, and impaired monocular cue extraction [22]. These deficits alter attention allocation across depth planes, increasing reliance on foreground objects and reducing background exploration [21].
We employ transformer-based hierarchical depth estimation [57] comprising coarse estimation through multi-scale wavelet-decomposed feature extraction, followed by adaptive refinement via the AdaBins architecture (v1.0 weights release). AdaBins dynamically learns optimal depth bin boundaries rather than imposing fixed intervals, improving accuracy in scenes with non-uniform depth distributions.
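Final depth values are computed through probability-weighted summation over $N$ adaptive bins:
$$\tilde{d} = \sum_{k=1}^{N} c(b_k)\, p_k$$
where $p_k$ is the predicted probability that the pixel belongs to depth bin $k$, and $c(b_k)$ is the centre value of bin $k$. The bin centres are determined by adaptive bin widths $b_k$, computed as:
$$b_k = \frac{b'_k + \epsilon}{\sum_{j=1}^{N} \left(b'_j + \epsilon\right)}, \qquad c(b_k) = d_{\min} + (d_{\max} - d_{\min})\left(\frac{b_k}{2} + \sum_{j=1}^{k-1} b_j\right)$$
Here, $b'_k$ is the raw bin width output from a mini-Vision Transformer (ViT) encoder followed by an MLP head, normalised via the softmax-like operation to ensure $\sum_{k=1}^{N} b_k = 1$. The bin centre $c(b_k)$ is calculated as the midpoint of bin $k$ within the depth range $[d_{\min}, d_{\max}]$, with cumulative width $\sum_{j=1}^{k-1} b_j$ positioning the bin along the depth axis. The small constant $\epsilon$ prevents numerical instability during normalisation.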
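The bin-centre computation and the probability-weighted summation can be traced for a single pixel with the sketch below. It follows the description above (normalised widths, midpoints placed by cumulative width, expected value over bins); the default value of `eps`, the non-negative raw widths, and the function names are assumptions made for illustration.

```python
def bin_centres(raw_widths, d_min, d_max, eps=1e-3):
    """Normalise raw (non-negative) bin widths and return (widths, centres)."""
    # eps keeps the normalisation stable when raw widths are close to zero.
    total = sum(w + eps for w in raw_widths)
    widths = [(w + eps) / total for w in raw_widths]
    centres, cumulative = [], 0.0
    for b in widths:
        # Midpoint of the bin: preceding cumulative width plus half its own.
        centres.append(d_min + (d_max - d_min) * (cumulative + b / 2.0))
        cumulative += b
    return widths, centres


def expected_depth(probabilities, centres):
    """Probability-weighted sum of bin centres for one pixel."""
    return sum(p * c for p, c in zip(probabilities, centres))


# Three bins over a 0-10 depth range; eps is set to 0 to keep the numbers exact.
widths, centres = bin_centres([1.0, 1.0, 2.0], 0.0, 10.0, eps=0.0)
depth = expected_depth([0.2, 0.3, 0.5], centres)
```

Here the normalised widths are 0.25, 0.25, and 0.5, so the bin centres fall at 1.25, 3.75, and 7.5, and the pixel's depth is the average of these centres weighted by its bin probabilities.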
As illustrated in Figure 5, the resulting depth map encodes relative spatial distance for each pixel, with darker values indicating proximity and lighter values indicating distance. This depth representation serves as a feature input to our elderly visual attention model, where it interacts with other attention-guiding factors such as centre bias and foreground–background differentiation. Empirical analysis demonstrates that incorporating depth information significantly improves fixation prediction accuracy for elderly observers, particularly in scenes with pronounced depth structure where age-related depth processing deficits most strongly influence attention allocation.
- (5) Early attentional prior
The previous four features rely exclusively on bottom-up image properties. However, image features alone are insufficient to model elderly attention, as age-related sensory decline forces older adults to rely heavily on top-down, compensatory oculomotor strategies [58]. To address this shift, we introduce the early attentional prior. This feature captures the inherent spatial exploration habits that older adults apply during the initial phase of scene observation.
Early attentional prior refers to the temporal latency between stimulus onset and initial attention deployment, which systematically increases with age and reflects slowed visual pathway transmission, reduced parietal attention network efficiency, and diminished saccade programming control [51]. The temporal delay has spatial consequences: older adults exhibit more dispersed, less targeted initial fixations, in contrast to younger adults’ rapid convergence on salient regions [59].
We model this through a spatial prior encoding the statistical distribution of early-stage fixations from elderly observers, as elaborated in the following steps.
Training data aggregation. We isolate fixations occurring within the first two seconds post-stimulus onset from the training data. The initial ambient phase (typically the first 1.5 to 2 s) is characterised by rapid spatial orienting driven by global layout and top-down spatial priors [60,61]. Following this, visual behaviour shifts to focal processing for detailed semantic extraction. Because cognitive ageing slows visual processing and saccadic programming, this initial orienting phase is slightly extended in older adults [62]. Therefore, the two-second threshold provides a robust window to capture early attentional orienting while excluding sustained, semantic-driven attention phases.
Spatial discretisation. We partition each image into $N = 16$ uniform regions ($r_1, \dots, r_N$) in a 4 × 4 grid. This resolution captures coarse spatial preferences (upper-centre bias, peripheral avoidance) without overfitting to pixel-level noise, ensuring the prior generalises across diverse compositions.
Prior generation. For each region $r_i$, we accumulate total fixation duration $T_i^{\mathrm{EAP}}$ (milliseconds) from all initial fixations (0–2 s) across the training corpus. Duration-based weighting captures both fixation frequency and dwell time, which reflects attentional engagement strength. Regional weights are normalised to form a probability distribution:
$$P_{\mathrm{EAP}}(r_i) = \frac{T_i^{\mathrm{EAP}}}{\sum_{j=1}^{N} T_j^{\mathrm{EAP}}}$$
where $P_{\mathrm{EAP}}(r_i)$ quantifies the proportion of early attentional resources elderly observers allocate to spatial zone $r_i$. This spatial prior functions as a temporal feature, encoding where elderly attention deploys during delayed initial orienting (Figure 6).
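The three steps above (temporal filtering, 4 × 4 discretisation, duration-weighted normalisation) can be sketched as follows. The fixation record layout `(x, y, onset_ms, duration_ms)`, the rule of assigning a fixation to the window by its onset time, and the uniform fallback for empty data are assumptions made for this illustration; the function names are hypothetical.

```python
def region_index(x, y, width, height, grid=4):
    """Map a pixel coordinate to a row-major cell index in a grid x grid layout."""
    col = min(int(x * grid / width), grid - 1)
    row = min(int(y * grid / height), grid - 1)
    return row * grid + col


def early_prior(fixations, width, height, grid=4, window_ms=2000.0):
    """Duration-weighted regional prior from fixations starting inside the window.

    fixations: iterable of (x, y, onset_ms, duration_ms) tuples (assumed layout).
    """
    totals = [0.0] * (grid * grid)
    for x, y, onset_ms, duration_ms in fixations:
        in_window = 0.0 <= onset_ms < window_ms
        in_image = 0 <= x < width and 0 <= y < height
        if in_window and in_image:
            totals[region_index(x, y, width, height, grid)] += duration_ms
    grand_total = sum(totals)
    if grand_total == 0.0:
        # ASSUMPTION: fall back to a uniform prior when no fixation qualifies.
        return [1.0 / len(totals)] * len(totals)
    return [t / grand_total for t in totals]


# Two early fixations and one late fixation on a 400 x 300 image.
demo_fixations = [(10, 10, 100, 300), (390, 290, 500, 100), (200, 150, 2500, 900)]
eap = early_prior(demo_fixations, 400, 300)
```

In this example the late fixation (onset at 2.5 s) is excluded, so the top-left cell receives three quarters of the early dwell time and the bottom-right cell receives the remaining quarter.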
- (6) Sustained-attention spatial prior
Sustained-attention spatial prior characterises the spatial distribution of sustained attention beyond initial orienting, reflecting stable viewing patterns during extended exploration. Elderly observers exhibit reduced saccadic amplitude (constraining exploration), prolonged fixation durations (indicating slower information extraction), and heightened central clustering due to peripheral vision degradation [25,63]. These characteristics define an elderly-specific ‘attentional footprint’ differing from younger adults’ exploratory patterns.
We construct a global fixation prior that encodes the statistical spatial preferences exhibited by elderly observers during extended viewing. This sustained-attention spatial prior aggregates fixations across the entire viewing duration, reflecting stable, long-term exploration strategies. This temporal distinction is critical. Early orienting is dominated by bottom-up stimulus capture; sustained viewing integrates top-down factors such as scene comprehension goals and compensatory strategies [8].
This spatial prior is generated as follows.
Global data aggregation. We extract fixation data spanning complete viewing duration from all training images with no temporal truncation. This window captures cumulative attentional deployment, including exploratory scanning and revisitation patterns (repeated returns to central regions compensating for working memory limitations).
Regional weighting. Using the same 4 × 4 spatial discretisation ($N = 16$ regions $r_i$), we compute the total dwell time $T_i^{\mathrm{SASP}}$ of each region by summing fixation durations across training samples. Duration-based weighting ensures the prior reflects attentional engagement intensity: prolonged cumulative dwell indicates sustained processing resource allocation, consistent with characteristically longer elderly fixation durations.
Prior normalisation. Regional dwell times are normalised to yield the spatial density of sustained elderly attention:
$$P_{\mathrm{SASP}}(r_i) = \frac{T_i^{\mathrm{SASP}}}{\sum_{j=1}^{N} T_j^{\mathrm{SASP}}}$$
where $P_{\mathrm{SASP}}(r_i)$ quantifies the proportion of total attentional resources elderly observers allocate to spatial zone $r_i$ during sustained viewing. Higher $P_{\mathrm{SASP}}(r_i)$ values identify regions attracting prolonged attention across diverse content, revealing population-level spatial biases independent of scene semantics (Figure 7). The resulting map reveals the sustained attention distribution, which exhibits strong central concentration and reduced peripheral weighting. This pattern reflects the combined influence of reduced saccadic exploration and peripheral vision decline. The fixation prior functions as a spatial template modulating predictions. Regions with high $P_{\mathrm{SASP}}(r_i)$ receive elevated attention weights, reflecting that elderly observers systematically allocate disproportionate sustained attention to specific zones.
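One way to use such a regional prior as a spatial template is to expand the 16 regional values into a per-pixel map, so that every pixel inherits the weight of the grid cell containing it. The sketch below assumes a row-major list of regional values and a nearest-cell lookup without smoothing; both choices, and the function name, are illustrative assumptions rather than the procedure used in the study.

```python
def prior_to_pixel_map(prior, width, height, grid=4):
    """Expand a row-major grid x grid regional prior into a height x width map."""
    if len(prior) != grid * grid:
        raise ValueError("prior must contain grid * grid regional values")
    pixel_map = []
    for y in range(height):
        row = min(y * grid // height, grid - 1)
        # Each pixel takes the value of the grid cell it falls into.
        pixel_map.append([
            prior[row * grid + min(x * grid // width, grid - 1)]
            for x in range(width)
        ])
    return pixel_map


# Toy prior whose weight increases with the cell index (values sum to 1).
toy_prior = [i / 120 for i in range(16)]
template = prior_to_pixel_map(toy_prior, 8, 8)
```

Because the lookup is piecewise constant, the resulting map has visible cell boundaries; a smoothed or interpolated expansion would be an equally plausible design, and the text does not specify which is used.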