Article

Enhancing Alarm Localization in Multi-Window Map Interfaces with Spatialized Auditory Cues: An Eye-Tracking Study

College of Furnishings and Industrial Design, Nanjing Forestry University, Nanjing 210037, China
* Author to whom correspondence should be addressed.
ISPRS Int. J. Geo-Inf. 2026, 15(2), 69; https://doi.org/10.3390/ijgi15020069
Submission received: 25 November 2025 / Revised: 31 January 2026 / Accepted: 2 February 2026 / Published: 6 February 2026

Abstract

Modern geo-information platforms commonly adopt multi-window map interfaces that integrate heterogeneous data, such as dynamic maps and live camera feeds. These interfaces impose a high cognitive load and slow the detection of spatial events: operators must rapidly locate the source of visual alarms, a task prone to delays under heavy visual workload. To address this challenge, this study investigated whether spatialized auditory cues can improve alarm localization in such complex monitoring interfaces. A controlled experiment with 24 participants used a within-subjects design to test auditory spatial cueing (none, binaural, monaural), display dynamics (dynamic, static), and interface complexity (4, 8, or 12 panes). Behavioral and eye-tracking data measured detection accuracy, efficiency, and gaze patterns. Results showed that dynamic displays and high interface complexity impaired performance, indicating increased cognitive load. In contrast, monaural lateralized auditory alarms substantially improved detection efficiency and mitigated visual overload. Interaction analyses revealed that binaural cues reduced the performance costs of dynamic displays, whereas monaural cues compensated for high-density layouts. These findings demonstrate that spatialized auditory alarms effectively support spatiotemporal situational awareness and improve operator performance in high-load geo-surveillance systems. The study offers empirical and practical implications for designing cognitively ergonomic, multimodal interfaces that move beyond purely visual alarm designs.

1. Introduction

Modern geo-information systems—such as smart city control platforms, emergency response centers, intelligent transportation systems, and environmental monitoring dashboards—integrate diverse streams of spatiotemporal data to support real-time decision-making. A representative interface of such systems is shown in Figure 1. These systems commonly employ multi-view geovisualization interfaces that combine 2D maps, 3D navigation views, video-based sensor feeds, and statistical panels to provide spatial context and situational awareness. While such dashboards enhance the richness of spatial information available to operators, they also introduce substantial cognitive demands: high interface complexity, dynamic visual content, and heterogeneous spatial representations often compete for users’ limited attentional resources, thereby reducing search efficiency [1,2]. Cross-modal attentional cueing has been recognized as an effective approach to mitigating this issue. Specifically, auditory spatial cueing, functioning as an exogenous signal, can rapidly direct attention to task-relevant locations and reduce visual search costs. Previous studies have shown that spatialized sounds can bias both overt and covert attentional orienting [3,4], and subsequent research has further revealed that auditory cues enhance target detection efficiency by constraining gaze distribution and shortening fixation sequences [5,6]. Research on multisensory integration has also indicated that the effects of cross-modal signals are more pronounced under high task load conditions [7].
Despite these advances, notable limitations persist in the existing literature. Most studies have focused on simplified visual search paradigms, such as tasks involving single displays or low-complexity stimuli, while paying limited attention to the efficacy of auditory spatial cueing in complex multi-window map interfaces. In particular, there is a lack of systematic investigation into how display dynamics and interface complexity interact to influence cue effectiveness. Dynamic displays introduce continuous perceptual changes that may exacerbate competition for attentional resources [8,9,10], while highly complex interfaces are known to substantially increase cognitive load and impair the efficiency of attentional allocation [11,12]. Under such conditions, whether auditory spatial cueing can still effectively guide attention is insufficiently supported by empirical evidence. Furthermore, although behavioral studies have repeatedly demonstrated the benefits of auditory cues, research examining how they alter eye movement dynamics—such as fixation patterns and saccadic paths—in dynamic and highly complex human–computer interaction interfaces remains relatively scarce. While eye-tracking metrics can directly reflect attentional guidance mechanisms, only a limited number of empirical studies have explicitly explored the relationship between auditory spatial cueing and gaze behavior.
The present study addresses these gaps by examining the effects of auditory spatial cueing on visual search performance in complex geo-information interfaces. We systematically manipulated display dynamics (dynamic versus static) and interface complexity (number of panes: 4, 8, 12) to assess how these factors modulate cueing effectiveness. Importantly, we combined behavioral measures (accuracy, response time) with eye-tracking metrics (number of fixations, total fixation duration, scan paths, heatmaps) to provide converging evidence on attentional allocation. By examining auditory spatial cueing within high-load, dynamic, multi-window map interfaces, this study makes two key contributions. First, it extends the applicability of cross-modal attention and multisensory integration theories to more ecologically valid monitoring contexts that approximate key properties of real-world systems. Second, it introduces a methodological innovation by integrating behavioral and eye-tracking measures, thereby uncovering both performance-level benefits and the underlying attentional mechanisms. Collectively, these contributions underscore the potential of auditory spatial cueing to enhance user performance and mitigate visual overload in safety-critical monitoring contexts.

2. Related Work

2.1. Auditory Spatial Cueing and Cross-Modal Attention

Numerous studies have demonstrated that auditory spatial cueing can effectively facilitate visual attention allocation through both exogenous and endogenous orienting mechanisms [13,14]. Research on multisensory integration further indicates that cross-modal signals are particularly effective under high attentional load, compensating for limited visual resources [15]. While several studies have found that unilateral cues with explicit directional information typically yield the most substantial facilitation effects [16], other evidence suggests that bilateral cues may enhance overall alertness and, in some cases, produce comparable or even superior performance [17]. Furthermore, although audiovisual integration generally improves visual search efficiency [18], this advantage does not manifest consistently across all scenarios, as additional auditory input may increase processing costs and reduce potential benefits [19,20,21]. Collectively, these findings indicate that the effects of auditory spatial cueing are context-dependent rather than universally applicable. Notably, most existing studies have employed simplified visual search tasks, raising questions about the generalizability of their findings to complex multi-window map interfaces.

2.2. Display Dynamics

Dynamic visualization plays an essential role in modern GIS and spatial monitoring systems, which frequently present real-time data streams such as traffic flows, sensor feeds, environmental indicators, and moving objects. Substantial evidence indicates that the dynamic nature of visual content influences attention allocation by modulating perceptual load levels. Dynamic content, characterized by continuous visual changes, often exacerbates attentional competition and impairs search performance [22,23], whereas static content facilitates more systematic and efficient visual exploration strategies. However, some studies have demonstrated that dynamic changes may enhance stimulus salience and improve detection performance under certain conditions [24,25]. This discrepancy suggests that the effects of dynamic information on search efficiency remain controversial. Notably, existing research has predominantly utilized natural scenes or video materials, rather than investigating these effects within complex geo-information interfaces that contain both dynamic and static elements.

2.3. Interface Complexity

Multi-view dashboards are increasingly used in smart city management, disaster response, transportation monitoring, and other domains requiring simultaneous analysis of heterogeneous spatial data. Prior research indicates that as the number of views increases, users face greater cognitive and perceptual load, leading to degraded spatial decision-making performance [1,2,26]. However, this relationship is not strictly linear. Moderate complexity has been found to maintain vigilance and arousal levels, thereby enhancing monitoring accuracy under specific conditions [27]. The interaction effects between auditory spatial cueing and interface complexity remain contentious: some studies suggest auditory cues are most effective under high load conditions [28], while others indicate this effect diminishes under extremely high cognitive demands [29]. These inconsistent findings highlight the need to systematically investigate the role of auditory cues across multiple levels of interface complexity.

2.4. Eye-Tracking Evidence in Multimodal Attention

Eye-tracking techniques have become an important tool in geovisualization, cartography, and spatial cognition research, providing insight into gaze allocation, spatial strategy, and cognitive load during map-based tasks. To the best of our knowledge, few empirical studies have systematically integrated behavioral and eye-tracking measures to examine how auditory spatial cueing interacts with display dynamics and interface complexity in multi-window map interfaces. Existing eye-tracking research has only provided fragmented evidence across these dimensions. Studies have shown that auditory cues can directly influence eye movement behavior—unilateral cues, due to their explicit directional information, constrain fixation distribution more effectively than bilateral cues [30,31]. However, this effect tends to diminish under conditions of high visual competition [32]. In terms of display dynamics, dynamic displays generally lead to increased fixation counts and longer saccade paths, whereas static displays promote more focused and efficient fixation patterns [33,34]. Nonetheless, some studies suggest that dynamic transients may enhance detection efficiency by facilitating attentional shifts [35]. Regarding interface complexity, increased display load has been associated with greater fixation dispersion and elevated cognitive load [2,36], though moderate complexity may help sustain vigilance and stabilize gaze. In summary, while auditory spatial cueing, display dynamics, and interface complexity each independently influence gaze behavior, their interactive effects remain systematically unexamined. This study aims to investigate how auditory spatial cueing modulates visual search and gaze dynamics across varying levels of display dynamics and interface complexity, combining behavioral performance measures with eye-tracking metrics to reveal multimodal patterns of attentional allocation in complex monitoring interfaces.

3. Materials and Methods

This study employed a three-factor within-subjects experimental design to systematically investigate the independent and joint effects of auditory spatial cueing, display dynamics, and interface complexity on visual search efficiency within geo-information monitoring dashboards. By manipulating three auditory cue conditions (no sound, binaural sound, monaural sound), two levels of display dynamics (dynamic, static), and three levels of interface complexity (low, medium, high), the experiment comprehensively assessed the impact of these factors on behavioral metrics (accuracy, ACC; response time, RT) and eye-tracking metrics (number of fixations, NOF; total fixation duration, TFD). This multifactorial design follows rigorous human factors engineering methodology, enabling the isolation and quantification of how different interface display elements affect visual search performance during complex information monitoring tasks. Figure 2 presents the technical roadmap of this study.

3.1. Participants

Twenty-seven college students participated in the study, comprising 12 males and 15 females aged 20–25 years (M = 22.9, SD = 1.59). All participants reported normal or corrected-to-normal vision, had no color vision deficiencies or hearing impairments, and had no prior experience with similar experiments. Written informed consent was obtained from each participant prior to the experiment. Each participant received ¥50 in compensation for their participation. After data quality screening, 24 participants were retained for the final analysis.

3.2. Apparatus

The experiment was conducted in an eye-tracking laboratory using a standardized eye-tracking setup, as shown in Figure 3. Participants were seated in front of a 21.5-inch LCD monitor (HP Inc., Palo Alto, CA, USA; 1920 × 1080 pixels) positioned at eye level. The viewing distance was fixed at 60 cm.
Eye movements were recorded using a Tobii Pro Fusion 250 desktop eye tracker (Tobii AB, Stockholm, Sweden) at a sampling rate of 250 Hz. A standard nine-point calibration procedure was performed before the experiment and repeated at the beginning of each block to ensure data accuracy and stability.
Auditory stimuli were delivered through a pair of stereo earphones with a 3.5 mm plug (Apple EarPods, Model A1279, Apple Inc., Cupertino, CA, USA) connected to the experimental computer. The left and right audio channels were controlled separately. The headphone output was adjusted for each participant to ensure balanced intensity between channels.
Both behavioral and eye-tracking data were recorded using ErgoLAB 3.3.5 (Kingfar Technologies, Beijing, China).

3.3. Experimental Design

The experiment used a within-subjects design with three factors: auditory spatial cueing, display dynamics, and interface complexity. All experimental conditions were presented in a randomized order to each participant.
As shown in Figure 4, to ensure ecological validity and prevent the findings from being confined to a single interface form, the dashboard interface incorporated four representative functional pane types: (a) 2D map pane, (b) 3D navigation pane, (c) surveillance pane (road monitoring video), and (d) bar chart pane. These pane types were identified through a survey of 50 typical monitoring interfaces and validated by five expert designers, drawing on prior studies of multimodal displays, including 2D maps [37,38], 3D navigation interfaces [39,40], road surveillance systems [41,42], and data visualization panels [43].
Sample stimuli were sourced from established platforms: AntV’s WebGL-based geospatial visualization engine (2D maps), Baidu Smart Transportation (3D navigation), Shanghai public highway camera feeds (surveillance), and the China National Bureau of Statistics (bar charts). Each experimental display contained only one pane type, and all four categories were presented in equal numbers to maintain balance. Pane type functioned primarily as a contextual factor to improve ecological validity. However, because differences between pane types might influence task performance, it was included as a fixed factor in the statistical models, and its main effect was analyzed together with the manipulated variables.

3.3.1. Auditory Spatial Cueing

The auditory spatial cue conditions were designed based on the framework of auditory cueing functions proposed by Broeckelmann [44] and supported by empirical findings on lateralized and bilateral auditory stimuli [17,45]. As shown in Figure 5, three auditory spatial cue conditions were employed: no sound, binaural sound, and monaural sound.
In the no-sound condition, no auditory spatial cue was provided. In the binaural condition, a short pure tone (790 Hz, 100 ms duration, approximately 65 dB(A)) was presented simultaneously to both headphone channels, with the target alarm symbol appearing randomly on either side of the interface. In the monaural condition, the same tone was presented to either the left or right channel, corresponding to the spatial location of the target alarm symbol.
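For illustration, the following minimal Python sketch shows how such lateralized cue stimuli could be synthesized offline. The 44.1 kHz sampling rate, the 5 ms onset/offset ramps, and the output file names are assumptions made for the sketch; the paper reports only the tone frequency, duration, and approximate level.

```python
import numpy as np
from scipy.io import wavfile

FS = 44_100      # sampling rate in Hz (an assumption; not reported in the paper)
DUR = 0.100      # 100 ms cue duration, as reported
FREQ = 790       # pure-tone frequency in Hz, as reported

t = np.arange(int(FS * DUR)) / FS
tone = 0.8 * np.sin(2 * np.pi * FREQ * t)

# 5 ms linear onset/offset ramps to avoid audible clicks
# (an assumption; the paper does not report its windowing).
ramp = int(0.005 * FS)
env = np.ones_like(tone)
env[:ramp] = np.linspace(0.0, 1.0, ramp)
env[-ramp:] = np.linspace(1.0, 0.0, ramp)
tone *= env

silence = np.zeros_like(tone)
variants = {
    "binaural": np.column_stack([tone, tone]),       # both channels
    "mono_left": np.column_stack([tone, silence]),   # left channel only
    "mono_right": np.column_stack([silence, tone]),  # right channel only
}

# The ~65 dB(A) playback level would be calibrated at the headphones,
# not in the file; per the paper, channel balance was set per participant.
for name, sig in variants.items():
    wavfile.write(f"cue_{name}.wav", FS, (sig * 32767).astype(np.int16))
```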

3.3.2. Display Dynamics

Display dynamics was manipulated at two levels: dynamic and static. As shown in Figure 6, under dynamic conditions, the panes presented dynamic content corresponding to the four representative functional pane types (2D map, 3D navigation, surveillance, and bar chart). All dynamic content was presented as pre-recorded video clips to ensure consistency across trials. The central map element, featuring a dynamic or static city-light visualization, was implemented using Ant Group’s geospatial data visualization library AntV L7 (https://l7.antv.antgroup.com/ (Accessed on 15 May 2025)). It served as a consistent, non-target background that remained identical across all experimental interfaces.
In the static condition, a representative frame was extracted from the corresponding dynamic sequence. Each trial presented one stimulus for 14 s; if a participant failed to identify the target within this interval, the trial advanced automatically to the next. The 14-s limit was determined through pilot testing to provide sufficient detection time while minimizing fatigue. This manipulation was informed by prior studies on dynamic versus static maps [46,47] and on motion effects in graphical user interfaces [48].

3.3.3. Interface Complexity

As shown in Figure 7, to ensure precise experimental control over interface complexity, the experimental materials were adapted from representative multi-window map interfaces [49,50]. Following the multidimensional framework proposed by Zhang et al. [51,52,53], which defines interface complexity as a function of information density, structural organization, and spatial arrangement of core elements, interface complexity was manipulated at three levels based on the number of panes: 4 (low), 8 (medium), and 12 (high). A symmetric, center-placed layout was adopted, featuring a central 3D map pane surrounded by multiple functional panes. All panes were standardized in size, proportion, and spacing according to grid-based conventions derived from professional multimodal interfaces [54]. Post-processing was conducted in Adobe After Effects to control for semantic variation, color, and size, thereby ensuring that the manipulation targeted complexity exclusively without introducing additional confounds.

3.4. Procedure

Participants were seated comfortably in front of the display in a dimly lit, acoustically isolated room. Task instructions were presented on the screen, followed by eight practice trials (excluded from analysis) to familiarize participants with the task. After practice, a 30-s rest period was provided before the formal session began. The experiment adopted a blocked design with two fully counterbalanced blocks (Block 1 and Block 2). Each block comprised 72 trials, representing the factorial combination of three auditory spatial cue conditions, two levels of display dynamics (each containing four representative functional pane types), and three levels of interface complexity. The order of conditions was randomized within and across participants to minimize learning and order effects.
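As an illustration of how such a balanced block can be constructed, the sketch below enumerates the full 3 × 2 × 3 × 4 factorial (72 trials) and shuffles it independently per block. The condition labels and seeding scheme are hypothetical assumptions, not the authors’ actual stimulus-delivery configuration.

```python
import itertools
import random

CUES = ["none", "binaural", "monaural"]
DYNAMICS = ["dynamic", "static"]
COMPLEXITY = [4, 8, 12]                                   # number of panes
PANE_TYPES = ["2d_map", "3d_nav", "surveillance", "bar_chart"]

def make_block(seed: int) -> list[tuple]:
    """One 72-trial block: the full 3 x 2 x 3 x 4 factorial, shuffled."""
    trials = list(itertools.product(CUES, DYNAMICS, COMPLEXITY, PANE_TYPES))
    assert len(trials) == 72
    random.Random(seed).shuffle(trials)                   # randomized order
    return trials

# e.g., two blocks for one participant, each independently randomized
block1 = make_block(seed=101)
block2 = make_block(seed=102)
```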
Eye movements were continuously recorded during the visual search task, and participants were instructed to maintain full attention throughout. As shown in Figure 8, each trial began with a central fixation point, followed by a stimulus presentation (dynamic video or static image) for a maximum duration of 14 s. Participants were instructed to locate the target alarm symbol as quickly and accurately as possible. The trial terminated immediately upon response; if no response was made within 14 s, it ended automatically and proceeded to the next trial. A 5-min rest break was provided between blocks to reduce fatigue. The entire session lasted approximately 50 min.

4. Results

4.1. Data Collection and Analysis

The experimental data included both behavioral and eye-tracking measures. Behavioral data consisted of accuracy (ACC) and response time (RT), whereas eye-tracking data comprised the number of fixations (NOF), total fixation duration (TFD), scan paths, and heatmaps.
Behavioral and eye-tracking data were processed and analyzed in RStudio (2025.05.0+496). To account for the repeated-measures structure of the experimental design, mixed-effects modeling was employed. Accuracy (ACC), a binary outcome, was analyzed using logistic mixed-effects models (appropriate for categorical data) with participant and item treated as random intercepts. Response time (RT) and eye-tracking measures (NOF, TFD), as continuous variables, were analyzed using linear mixed-effects models, with auditory spatial cueing, display dynamics, interface complexity, and their interactions included as fixed effects, and participants included as random intercepts.
Model comparisons were performed via likelihood-ratio tests (χ2), which statistically evaluate whether adding a predictor significantly improves model fit. Post hoc pairwise comparisons were conducted using Tukey adjustment, a conservative method that controls the family-wise error rate across multiple comparisons.
In the model comparisons reported below, ΔAIC refers to the difference in Akaike Information Criterion between a candidate model and a reference (typically simpler) model. A negative ΔAIC indicates that the candidate model with the additional term provides a better fit to the data, whereas a positive ΔAIC suggests that the added complexity does not improve fit sufficiently to justify the extra parameters. Thus, in our reporting, a negative ΔAIC supports the inclusion of the corresponding factor or interaction.
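To make this procedure concrete, the sketch below reproduces the likelihood-ratio and ΔAIC logic for the RT models in Python with statsmodels (the analyses themselves were conducted in RStudio, presumably with lme4-style models). The data file and column names are hypothetical; maximum-likelihood (rather than REML) estimation is used so that nested fixed-effects structures can be compared.

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical per-trial data: rt, cue, dynamics, complexity, participant
df = pd.read_csv("trials.csv")

# Nested models differing only in the auditory-cue fixed effect,
# with a random intercept per participant; reml=False enables the LR test.
m0 = smf.mixedlm("rt ~ C(dynamics) + C(complexity)", df,
                 groups=df["participant"]).fit(reml=False)
m1 = smf.mixedlm("rt ~ C(cue) + C(dynamics) + C(complexity)", df,
                 groups=df["participant"]).fit(reml=False)

# Likelihood-ratio test: chi-square with df = number of added fixed effects
lr = 2 * (m1.llf - m0.llf)
ddf = len(m1.fe_params) - len(m0.fe_params)   # 2 for a three-level factor
p = stats.chi2.sf(lr, ddf)

# Delta-AIC computed by hand (k = fixed effects + 1 variance component);
# a negative value favors the richer model, matching the paper's convention.
k0, k1 = len(m0.fe_params) + 1, len(m1.fe_params) + 1
d_aic = (2 * k1 - 2 * m1.llf) - (2 * k0 - 2 * m0.llf)
print(f"chi2({ddf}) = {lr:.2f}, p = {p:.4f}, dAIC = {d_aic:.1f}")
```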
For the RT analysis, only correct trials were considered. Trials with RTs exceeding three median absolute deviations from each participant’s median RT were excluded. Data from three participants who failed to complete the experiment effectively were discarded, resulting in a final sample of 24 participants. Initial screening focused on the influence of pane type (2D map, 3D navigation, surveillance, and bar chart). The results indicated that adding pane type as a fixed factor did not improve model fit for ACC (ΔAIC = 3.1, χ2(3) = 2.9, p = 0.407), whereas it significantly improved model fit for RT (ΔAIC = −5, χ2(3) = 10.37, p = 0.016). Given that the primary focus of this study was on the effects of the three core independent variables—auditory spatial cueing, display dynamics, and interface complexity—and that pane type exhibited a constrained effect pattern, it was omitted from subsequent interaction analyses to improve model interpretability and maintain focus on the core variables.
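A minimal sketch of the RT trimming rule described above (correct trials only, RTs within three median absolute deviations of each participant’s median), again with hypothetical column names:

```python
import pandas as pd

def trim_rts(df: pd.DataFrame, k: float = 3.0) -> pd.DataFrame:
    """Keep correct trials whose RT falls within k median absolute
    deviations (MAD) of the participant's own median RT."""
    correct = df[df["acc"] == 1]

    def keep(g: pd.DataFrame) -> pd.DataFrame:
        med = g["rt"].median()
        mad = (g["rt"] - med).abs().median()
        return g[(g["rt"] - med).abs() <= k * mad]

    return correct.groupby("participant", group_keys=False).apply(keep)
```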
NOF represents the total count of fixations, reflecting the difficulty of the task, whereas TFD indicates the sum of all fixation durations on a specific area or element, reflecting visual search efficiency. Scan paths visualize the temporal sequence and spatial distribution of fixations during task performance, revealing participants’ search strategies. Heatmaps provide a complementary visualization of fixation density, where red regions indicate the most frequently attended areas, followed by yellow and green regions, which indicate decreasing levels of attention.

4.2. Behavioral Data

4.2.1. Auditory Spatial Cueing

The effects of the three auditory spatial cue conditions (no sound, binaural sound, and monaural sound) on visual search task performance are presented in Figure 9. Statistical analysis revealed that including auditory spatial cueing as a fixed factor did not improve model fit for ACC (ΔAIC = −0.23, χ2(2) = 4.23, p = 0.121), whereas it significantly improved model fit for RT (ΔAIC = −1233, χ2(2) = 1236, p < 0.001). Post hoc Tukey-adjusted pairwise comparisons based on model-estimated marginal means (EMMs) revealed a clear gradient in RT performance: the monaural sound condition yielded the shortest RT, followed by the binaural sound condition, and then the no-sound condition. All pairwise differences between the three auditory spatial cue conditions were statistically significant (Tukey-adjusted p < 0.001).
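The reported contrasts were computed on model-estimated marginal means with Tukey adjustment, analogous to R’s emmeans workflow. The simplified sketch below instead applies a Tukey HSD to per-participant condition means; it ignores the mixed-model structure and is intended only to illustrate the Tukey-adjusted pairwise logic, using the same hypothetical columns as above.

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("trials.csv")  # hypothetical per-trial data, as above

# Collapse to one mean RT per participant x cue cell, then run
# Tukey-adjusted pairwise comparisons across the three cue conditions.
means = df.groupby(["participant", "cue"], as_index=False)["rt"].mean()
print(pairwise_tukeyhsd(means["rt"], means["cue"], alpha=0.05))
```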
To further validate the overall facilitative effect of auditory spatial cueing, all sound-present conditions (binaural and monaural) were pooled and compared with the no-sound condition. The results confirmed that the presence of auditory spatial cues showed a trend toward higher accuracy, although this effect did not reach statistical significance (ΔAIC = −1.27, χ2(1) = 3.27, p = 0.071). In contrast, response time was significantly shortened (ΔAIC = −984, χ2(1) = 986.31, p < 0.001). These results are illustrated in Figure 10.
Further comparison between auditory conditions revealed no significant difference in ACC between binaural and monaural sound conditions (ΔAIC = 1.04, χ2(1) = 0.96, p = 0.330). However, RT under the monaural cues was significantly shorter than that under the binaural cues (ΔAIC = −271.3, χ2(1) = 273.26, p < 0.001), indicating that monaural cues provide superior search efficiency.

4.2.2. Display Dynamics

As shown in Figure 11, task performance varied across display dynamics conditions. Statistical comparisons revealed that adding display dynamics as a fixed factor did not significantly improve model fit for ACC (ΔAIC = 0.76, χ2(1) = 1.25, p = 0.264). For RT, although adding display dynamics slightly improved AIC (ΔAIC = −1), the likelihood-ratio test was not statistically significant (χ2(1) = 3.06, p = 0.080). Nonetheless, mean RTs were descriptively shorter under the static display condition, a pattern suggesting that motion in the display may increase visual search demands.

4.2.3. Interface Complexity

Participants’ performance across different interface complexity levels is presented in Figure 12. Adding interface complexity as a fixed factor significantly improved model fit for both ACC (ΔAIC = −4.91, χ2(2) = 8.91, p = 0.012) and RT (ΔAIC = −178, χ2(2) = 181.87, p < 0.001). Post hoc Tukey-adjusted pairwise comparisons based on model-estimated marginal means (EMMs) further revealed that ACC was significantly higher for the 4-pane interfaces than for the 12-pane interfaces, with a similar but non-significant trend observed for the 8-pane interfaces. For RT, mean values under the 12-pane interfaces were significantly longer than those under the 8-pane interfaces, which in turn were significantly longer than those under the 4-pane interfaces. These findings indicate that visual search efficiency declines significantly as interface complexity increases.

4.2.4. Interaction Effects

Adding the three-way interaction among auditory spatial cueing, display dynamics, and interface complexity did not significantly improve model fit for ACC (ΔAIC = 1.48, χ2(4) = 6.51, p = 0.164). Conversely, adding this three-way interaction significantly improved model fit for RT (ΔAIC = −27, χ2(4) = 35.48, p < 0.001). To further explore this three-way interaction for RT, additional analyses of the two-way interactions were conducted.
(1)
Auditory Spatial Cueing × Display Dynamics
As shown in Figure 13, no significant interaction was found between auditory spatial cueing and display dynamics for ACC (ΔAIC = 3.34, χ2(2) = 0.66, p = 0.717), though it reached significance for RT (ΔAIC = −4, χ2(2) = 8, p = 0.018). Specifically, while static displays yielded superior efficiency in no-sound and monaural conditions, the binaural condition reversed this trend, with dynamic displays producing shorter RTs. This pattern suggests that auditory spatial cueing—particularly binaural cues—modulated the impact of display dynamics on visual search efficiency.
(2)
Auditory Spatial Cueing × Interface Complexity
Figure 14 presents task performance across the three auditory spatial cue conditions under different interface complexity levels. Statistical analysis revealed that although the interaction between auditory spatial cueing and interface complexity was not significant for ACC (ΔAIC = 5.1, χ2(4) = 2.9, p = 0.575), a significant interaction effect was observed for RT (ΔAIC = −45, χ2(4) = 53.6, p < 0.001).
Further analyses collapsed monaural and binaural sound conditions into a single ‘sound present’ category and compared it with the no-sound condition. No significant interaction between sound presence and interface complexity was observed for ACC (ΔAIC = 3.66, χ2(2) = 0.34, p = 0.844), whereas a significant interaction was found for RT (ΔAIC = −7, χ2(2) = 11.2, p = 0.004). These results demonstrate that the effects of auditory spatial cueing and interface complexity on visual search efficiency are primarily manifested in the RT metric.
(3)
Display Dynamics × Interface Complexity
As shown in Figure 15, participants’ performance under dynamic and static conditions across the three interface complexity levels showed no significant interaction effects for either ACC (ΔAIC = 2.43, χ2(2) = 1.57, p = 0.456) or RT (ΔAIC = 1, χ2(2) = 2.83, p = 0.243).
These results indicate that display dynamics did not substantially alter the impact of interface complexity on visual search performance.

4.3. Eye-Tracking Data

4.3.1. Auditory Spatial Cueing

Figure 16 displays participants’ eye movement performance in the visual search task across the three auditory spatial cue conditions. Adding auditory spatial cueing as a fixed factor significantly improved model fit for both NOF (ΔAIC = −518, χ2(2) = 521.78, p < 0.001) and TFD (ΔAIC = −999, χ2(2) = 1003.28, p < 0.001). Consistent with the behavioral RT results, post hoc Tukey-adjusted pairwise comparisons based on model-estimated marginal means (EMMs) revealed a clear gradient of performance: the monaural sound condition yielded the highest search efficiency, significantly outperforming the binaural sound condition, which in turn performed better than the no-sound condition.
As shown in Figure 17, when comparing sound-present and no-sound conditions, significant differences were found for both NOF (ΔAIC = −382, χ2(1) = 384.33, p < 0.001) and TFD (ΔAIC = −804, χ2(1) = 805.44, p < 0.001). Specifically, the sound-present conditions elicited fewer fixations and shorter total fixation durations than the no-sound condition, providing strong evidence that auditory spatial cueing streamlines visual processing.
Further comparisons between monaural and binaural sound conditions also revealed significant differences in both NOF (ΔAIC = −52, χ2(1) = 53.35, p < 0.001) and TFD (ΔAIC = −144.8, χ2(1) = 146.83, p < 0.001), with monaural cues producing fewer fixations and shorter fixation durations than binaural cues. This indicates that monaural guidance facilitates more efficient visual search behavior.
As shown in Figure 18, using the 3D navigation task under the static display and high complexity condition as an example, heatmaps revealed that, compared to the no-sound condition, the binaural-sound condition induced a more localized fixation distribution with reduced hotspot intensity (fewer red regions). This pattern reflects reduced repeated fixations and improved information-acquisition efficiency. Correspondingly, the scan paths under binaural cues appeared more streamlined and organized, reflecting optimized visual search strategies and reduced cognitive load, as shown in Table 1.
As shown in Figure 19, using the 2D map task under the static display and high complexity condition as an example, the comparison of eye movement data between binaural and monaural cues revealed distinct patterns. The binaural cues elicited a broader fixation spread and more chaotic scanning trajectories, suggesting attentional dispersion and strategy uncertainty when multiple auditory spatial cues were presented. Conversely, the monaural cues showed a more concentrated fixation distribution and simpler scan paths, indicating that single, unambiguous auditory spatial cues enable users to locate targets more rapidly and execute more efficient search strategies. These results suggest that monaural cues guide visual attention more effectively than binaural cues, thereby improving visual search efficiency.

4.3.2. Display Dynamics

As shown in Figure 20, display dynamics demonstrated significant effects on all eye movement metrics. Adding display dynamics as a fixed factor significantly improved model fit for both NOF (ΔAIC = −68, χ2(1) = 69.44, p < 0.001) and TFD (ΔAIC = −34, χ2(1) = 36.42, p < 0.001). Specifically, the static condition yielded significantly higher NOF values than the dynamic condition, whereas the dynamic condition resulted in significantly longer TFD. These results indicate that display dynamics systematically influences users’ eye movement behavior: dynamic displays tend to induce fewer but longer fixations, reflecting sustained attention and deeper processing, while static displays elicit more frequent but shorter fixations, suggesting more efficient visual scanning.
This finding aligns with Roehrbein et al. [55], who reported that static information, due to its lower information load, promotes a high-frequency and short-duration fixation pattern indicative of parallel visual processing. Therefore, the static condition in this study likely reflects a lower cognitive load and a more efficient search process than the dynamic condition.
As shown in Figure 21, using the 3D navigation task under the monaural sound and high complexity condition as an example, eye movement patterns under dynamic and static conditions exhibit distinct spatial distributions. The heatmap for the dynamic condition shows a denser, broader fixation distribution (larger red areas), indicating longer fixation durations and greater cognitive effort. Correspondingly, the scan paths show higher trajectory complexity, with frequent regressions and repeated scanning loops. These patterns suggest that dynamic displays impose greater memory and attentional demands, thereby increasing cognitive load and reducing overall visual search efficiency.

4.3.3. Interface Complexity

Figure 22 presents participants’ eye-movement performance in the visual search task across different interface-complexity conditions. Adding interface complexity as a fixed factor significantly improved model fit for both NOF (ΔAIC = −353, χ2(2) = 357.93, p < 0.001) and TFD (ΔAIC = −100, χ2(2) = 103.39, p < 0.001). As interface complexity increased, both NOF and TFD showed substantial upward trends. These findings indicate that participants required more visual attention resources when processing 12-pane interfaces, while exhibiting higher visual search efficiency and faster target localization under 4-pane interfaces.
As shown in Figure 23, using the 3D navigation task under the binaural sound and static condition as an example, heatmaps revealed higher hotspot values in non-core regions for the 12-pane interfaces, manifested as more extensive red regions. This indicates that individual pane elements were fixated upon more frequently, reflecting lower search efficiency. Meanwhile, scan path visualization demonstrated more densely clustered fixations and more circuitous, random scanning paths under 12-pane interfaces, as shown in Table 2. These visual characteristics corroborate the findings derived from NOF and TFD metrics. Overall, increased interface complexity significantly elevates visual search difficulty, suggesting greater cognitive demands and increased allocation of gaze resources.

4.3.4. Interaction Effects

A significant three-way interaction was observed among auditory spatial cueing, display dynamics, and interface complexity for both NOF (ΔAIC = −25, χ2(4) = 32.7, p < 0.001) and TFD (ΔAIC = −14, χ2(4) = 21.97, p < 0.001). To further examine this effect, subsequent analyses of all two-way interactions were conducted.
(1)
Auditory Spatial Cueing × Display Dynamics
Figure 24 presents eye movement metrics across the three auditory cue conditions under different display dynamics. Adding the interaction between auditory spatial cueing and display dynamics significantly improved model fit for NOF (ΔAIC = −17, χ2(2) = 21.04, p < 0.001), and for TFD (ΔAIC = −6, χ2(2) = 9.99, p = 0.007).
Figure 25 displays eye movement metrics for dynamic and static conditions, with and without auditory spatial cues. Results revealed that although the interaction between auditory spatial cueing and display dynamics was not significant for TFD (ΔAIC = 0, χ2(1) = 1.73, p = 0.189), it proved significant for NOF (ΔAIC = −15, χ2(1) = 17.88, p < 0.001). Simple effects analysis demonstrated that the difference in NOF between dynamic and static conditions was significantly smaller under the sound-present condition than under the no-sound condition, indicating that auditory spatial cueing reduces the impact of display dynamics on fixation behavior. These findings suggest that auditory spatial cues can help stabilize visual attention and reduce unnecessary gaze shifts during visual search.
(2)
Auditory Spatial Cueing × Interface Complexity
Figure 26 presents eye-movement metrics across the three auditory spatial cue conditions at different interface complexity levels. Statistical results revealed significant interactions for both NOF (ΔAIC = −20, χ2(4) = 27.71, p < 0.001) and TFD (ΔAIC = −36, χ2(4) = 44.08, p < 0.001). The monaural cue significantly modulated the responses of both NOF and TFD to increasing complexity, demonstrating its role in mitigating the adverse effects of high interface complexity on visual search and attention.
As shown in Figure 27, further model comparisons indicated that, compared with the model without the interaction term, adding the interaction between sound presence (no sound vs. sound) and interface complexity improved model fit for TFD (ΔAIC = −4, χ2(2) = 8.55, p = 0.014), but not for NOF (ΔAIC = −1, χ2(2) = 4.59, p = 0.101). These findings suggest that auditory spatial cueing interacts with interface complexity to jointly regulate eye movement patterns, highlighting the synergistic role of sound guidance in supporting visual search efficiency under complex interface conditions.
(3)
Display Dynamics × Interface Complexity
Figure 28 illustrates the interaction effects between display dynamics and interface complexity on eye movement metrics. Statistical results indicated no significant interaction effects for either NOF (ΔAIC = 0, χ2(2) = 4.03, p = 0.130) or TFD (ΔAIC = 3, χ2(2) = 0.77, p = 0.681). These findings suggest that the effects of display dynamics and interface complexity on visual attention operate independently, without a combined influence on eye movement performance.

5. Discussion

5.1. Effects of Auditory Spatial Cueing

Auditory spatial cueing markedly improved users’ ability to detect target events in simulated geo-information interfaces, as evidenced by the significant gains observed in both RT and eye-tracking metrics. Monaural cues produced the strongest facilitation, followed by binaural cues, whereas the no-sound condition yielded the slowest and least focused performance. These results support the premise that spatialized audio can serve as an effective attentional guide in geo-information environments, where operators are required to monitor multiple spatial layers simultaneously [56,57,58,59].
The pronounced advantage of monaural cues can be explained through the lens of spatial disambiguation in multi-window map interfaces. In such layouts, visual information is spatially fragmented across multiple panes that often lack strong perceptual grouping. By providing unambiguous lateralized auditory signals that directly correspond to pane-level spatial locations, monaural cues effectively reduce spatial uncertainty and facilitate rapid geospatial orienting. In contrast, binaural cues, while enhancing general alertness via bilateral stimulation, distribute attention more diffusely due to their lack of directional specificity. This distinction aligns with cross-modal orienting models suggesting that unambiguous directional signals elicit faster exogenous shifts in attention than symmetric cues. Furthermore, this mechanism aligns with theories of visual–auditory spatial congruency in geospatial cognition, where aligned cross-modal cues enhance spatial updating and reduce cognitive load in complex map-based tasks [56]. This study extends prior laboratory findings by confirming that auditory spatial cueing remains effective even in complex, dynamic, multi-window map interfaces that simulate real-world monitoring systems. From a design perspective, incorporating spatialized auditory signals can compensate for visual overload and enhance user responsiveness in complex geo-information systems. Adaptive cueing algorithms could further optimize performance by modulating cue type and intensity based on detected workload or interface complexity.

5.2. Effects of Display Dynamics

Display dynamics significantly influenced search performance, with static displays supporting faster responses than dynamic displays. Dynamic displays, characterized by continuous visual changes, introduced greater perceptual uncertainty and were associated with longer fixation durations, suggesting increased attentional demands during information processing. This finding supports prior work indicating that motion and transients can consume attentional resources by competing with goal-directed search processes [60,61].
However, the present results contrast with studies reporting that motion enhances detection performance by increasing stimulus salience [62,63]. The discrepancy reflects challenges inherent in spatiotemporal visualization, where continuously changing map states, live sensor feeds, or animated spatial indicators frequently conflict with user-driven search strategies. Although animation can, under certain conditions, improve salience and facilitate detection of temporal patterns, the findings of this study suggest that in dense multi-pane geovisual interfaces, motion often behaves as a distractor rather than a facilitator.
Accordingly, implications for spatial decision support systems emphasize the need to use dynamic spatial content judiciously, particularly when users must perform rapid anomaly detection, and to consider adaptive display strategies that attenuate motion or animation under high cognitive load. Moreover, integrating auditory cues may help offset the attentional burden introduced by unpredictable dynamic displays. Collectively, these insights reinforce the broader need to balance temporal fidelity and cognitive ergonomics in spatiotemporal geovisualization.

5.3. Effects of Interface Complexity

The present findings confirm that increasing interface complexity substantially impairs visual search performance. As the number of panes increased, participants exhibited lower ACC and longer RT, accompanied by higher fixation counts and longer fixation durations. These results provide behavioral and eye-tracking evidence that elevated interface complexity increases perceptual and cognitive load, leading to less efficient attentional allocation. Such findings align with cognitive ergonomics research showing that display density and structural complexity increase attentional competition and reduce situational awareness [64,65,66,67].
The observed degradation likely stems from the limitations of visual working memory and the attentional bottleneck that arises when multiple information sources compete for processing resources. High visual load leads to more dispersed gaze patterns and promotes sequential rather than parallel search strategies. This interpretation aligns with perceptual load theory, which posits that under high perceptual demands, attentional resources are fully allocated to task-relevant processing, thereby reducing the capacity to handle competing information. From an applied perspective, these results emphasize the importance of interface modularity, visual hierarchy, and spatial grouping in the design of multi-view geo-information dashboards. Designers should consider hierarchical grouping and visual salience management to minimize unnecessary visual competition.

5.4. Interaction Effects Among Variables

The observed interaction effects reveal that the benefits of auditory spatial cueing depend on both display dynamics and interface complexity. Specifically, binaural cues partially offset the detrimental influence of dynamic displays, while monaural cues most effectively mitigated the negative impact of 12-pane interfaces. The interaction between monaural cues and 12-pane interfaces suggests that auditory spatial guidance is particularly beneficial in layouts where visual crowding impairs geographic feature discrimination. This supports the notion that auditory cues can serve as a compensatory spatial channel in dense geovisual displays, aiding operators in maintaining situational awareness when visual attention is fragmented across multiple data layers. These findings indicate that cueing strategies interact adaptively with task context: binaural cues enhance general alertness when visual uncertainty is temporal (dynamic displays), whereas monaural cues provide spatial precision under complex, crowded visual conditions.
This interaction pattern supports multisensory integration models proposing that the brain dynamically re-weights cross-modal information based on environmental load and cue reliability [68,69]. It also highlights the need to tailor auditory assistance strategies to specific visual and cognitive conditions rather than applying uniform cueing schemes. For example, systems might employ binaural “alert” tones when the visual field changes rapidly but switch to monaural directional cues when multiple static targets compete for attention.

5.5. Limitations and Future Research Directions

Although the study provides converging behavioral and eye-tracking evidence for the efficacy of auditory spatial cueing, the participant sample consisted mainly of young adults without professional monitoring experience. This limits the direct generalization of the findings to expert operators, who may exhibit different auditory–visual integration patterns and task strategies due to training and experience. Future research should recruit domain experts such as GIS analysts, monitoring operators, or emergency responders to examine population-specific cueing effects and to validate the ecological applicability of auditory spatial guidance in real-world monitoring systems. Additionally, while the experimental interface was designed to reflect the properties of diverse content displays, the search task was intentionally discrete and time-limited for experimental control rather than representing fully continuous monitoring conditions. Extending this paradigm to dynamic, multitasking scenarios would clarify how auditory spatial cueing interacts with sustained attention, task switching, and workload management. Finally, future research could incorporate neurophysiological measures such as EEG or fNIRS to complement behavioral and eye-tracking evidence, as the neural correlates of cross-modal attention and cognitive load in complex monitoring contexts remain insufficiently understood. Such integration would provide a more comprehensive understanding of the cognitive mechanisms underlying the observed behavioral and oculomotor improvements, thereby guiding future multimodal interface designs toward greater cognitive efficiency and adaptability.

6. Conclusions

This study aimed to investigate how auditory spatial cueing, display dynamics, and interface complexity jointly influence visual search performance in multi-view geo-information interfaces. To achieve this goal, a controlled laboratory experiment was conducted using a three-factor within-subjects design, combining behavioral measures (accuracy, response time) and eye-tracking metrics (number of fixations, fixation duration, scan paths, and heatmaps). By systematically manipulating auditory spatial cue conditions (none, binaural, monaural), display dynamics (dynamic, static), and interface complexity (4, 8, and 12 panes), the study advanced GIScience understanding of how multimodal interaction influences spatial cognition in complex geovisual environments.
The main findings are as follows: (1) Increasing interface complexity significantly reduced search accuracy and efficiency, confirming that dense visual layouts impose heavy cognitive and perceptual loads. (2) Dynamic displays further intensified attentional demands, leading to longer response times and sustained fixation behavior, whereas static displays facilitated more efficient search strategies. (3) Auditory spatial cueing markedly improved visual search performance, with monaural cues demonstrating the most substantial facilitative effect by providing explicit directional guidance. (4) Interaction analyses revealed that binaural cues mitigated the performance costs of dynamic displays, while monaural cues effectively compensated for high interface complexity. These results jointly suggest that auditory spatial guidance can serve as a compensatory mechanism for visual overload, enhancing attention allocation and reducing cognitive effort in multimodal environments.
Taken together, this study provides actionable insights for the design of cognitively ergonomic geo-information systems by validating the robustness of auditory spatial cueing effects under realistic interface conditions. Practically, it offers design insights for the development of intelligent auditory–visual interfaces in safety-critical domains such as air traffic control, smart city monitoring, and intelligent transportation systems. Future research should extend this approach to expert populations, continuous spatiotemporal monitoring scenarios, and adaptive multimodal systems that leverage real-time user-state estimation. Such efforts will accelerate the development of intelligent, resilient, and human-centered geo-information interfaces that support reliable decision-making in increasingly data-rich environments.

Author Contributions

Conceptualization, Jing Zhang; Methodology, Jing Zhang and Xiaoyu Zhu; Software, Xiaoyu Zhu; Validation, Jing Zhang, Xiaoyu Zhu and Wenzhe Tang; Formal analysis, Jing Zhang; Investigation, Jing Zhang; Resources, Wenzhe Tang and Yong Zhang; Data curation, Jing Zhang, Wenzhe Tang and Weijia Ge; Writing—original draft preparation, Jing Zhang and Xiaoyu Zhu; Writing—review and editing, Jing Zhang, Xiaoyu Zhu and Wenzhe Tang; Visualization, Xiaoyu Zhu and Weijia Ge; Supervision, Yong Zhang, Weijia Ge and Jing Li; Project administration, Yong Zhang, Weijia Ge and Jing Li; Funding acquisition, Yong Zhang and Jing Li. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 72201128), and the Jiangsu Provincial Social Science Fund Project (No. 22SHC012).

Institutional Review Board Statement

After consideration by the Institutional Review Board of this institution, the experimental design and protocol of the study were found to be scientifically sound, fair and impartial, and did not cause harm or risk to the subjects. Recruitment was conducted in accordance with the principles of voluntary participation and informed consent, ensuring the protection of participants’ rights, interests, and privacy. The study was determined to be free from conflicts of interest, ethical or moral violations, and legal noncompliance. It also complied with the ethical standards outlined in the Declaration of Helsinki. The Institutional Review Board confirmed that the project was proceeding as planned (protocol code No.2025022, approval date: 21 January 2025).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

We would like to thank all participants for taking the time to patiently complete the experimental content, and the College of Furnishings and Industrial Design for providing the experimental site.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wickens, C.D. Multiple resources and mental workload. Hum. Factors 2008, 50, 449–455.
  2. Dehais, F.; Karwowski, W.; Ayaz, H. Brain at Work and in Everyday Life as the Next Frontier: Grand Field Challenges for Neuroergonomics. Front. Neuroergonomics 2020, 1, 583733.
  3. Mandal, A.; Liesefeld, A.M.; Liesefeld, H.R. Tracking the Misallocation and Reallocation of Spatial Attention toward Auditory Stimuli. J. Neurosci. 2024, 44, e2196232024.
  4. Cai, M.; Bao, Y. Spatial attention modulates auditory dominance in audiovisual order judgment. Psych. J. 2023, 12, 537–539.
  5. Cinetto, S.; Blini, E.; Zangrossi, A.; Corbetta, M.; Zorzi, M. Spatial regularities in a closed-loop audiovisual search task bias subsequent free-viewing behavior. Psychon. Bull. Rev. 2025, 32, 2977–2989.
  6. Kern, L.; Niedeggen, M. Are auditory cues special? Evidence from cross-modal distractor-induced blindness. Atten. Percept. Psychophys. 2023, 85, 889–904.
  7. He, Y.; Guo, Z.; Wang, X.; Sun, K.; Lin, X.; Wang, X.; Li, F.; Guo, Y.; Feng, T.; Zhang, J.; et al. Effects of Audiovisual Interactions on Working Memory Task Performance—Interference or Facilitation. Brain Sci. 2022, 12, 886.
  8. Yuan, P.; Hu, R.; Zhang, X.; Wang, Y.; Jiang, Y. Cortical entrainment to hierarchical contextual rhythms recomposes dynamic attending in visual perception. eLife 2021, 10, e65118.
  9. Aguado-López, B.; Palenciano, A.F.; Peñalver, J.M.G.; Díaz-Gutiérrez, P.; López-García, D.; Avancini, C.; Ciria, L.F.; Ruz, M. Proactive selective attention across competition contexts. Cortex 2024, 176, 113–128.
  10. Xu, S.; Xu, M.; Kang, Q.; Yuan, X. Mobile Reading Attention of College Students in Different Reading Environments: An Eye-Tracking Study. Behav. Sci. 2025, 15, 953.
  11. Das, A.; Wu, Z.; Skrjanec, I.; Feit, A.M. Shifting focus with hceye: Exploring the dynamics of visual highlighting and cognitive load on user attention and saliency prediction. Proc. ACM Hum.-Comput. Interact. 2024, 8, 236.
  12. Zhang, C.; Wei, D.P.; Ji, Y.; Chen, D.; Li, X.Y.; Gong, X.D. The influence of interface attributes and interaction elements on user performance and cognitive load in task interruption scenarios. Int. J. Ind. Ergon. 2025, 108, 103761.
  13. Zimmer, U.; Wendt, M.; Pacharra, M. Enhancing allocation of visual attention with emotional cues presented in two sensory modalities. Behav. Brain Funct. 2022, 18, 10.
  14. Hu, J.; Badde, S.; Vetter, P. Auditory guidance of eye movements toward threat-related images in the absence of visual awareness. Front. Hum. Neurosci. 2024, 18, 1441915.
  15. Saccani, M.S.; Contemori, G.; Del Popolo Cristaldi, F.; Bonato, M. Attentional load impacts multisensory integration, without leading to spatial processing asymmetries. Sci. Rep. 2025, 15, 16240.
  16. Gao, Z.; Hui-Wang, Q.; Feng, G.; Lv, H. Exploring Sonification Mapping Strategies for Spatial Auditory Guidance in Immersive Virtual Environments. ACM Trans. Appl. Percept. 2022, 19, 9.
  17. Fu, J.; Guo, X.; Tang, X.; Wang, A.; Zhang, M.; Gao, Y.; Seno, T. The Effects of Bilateral and Ipsilateral Auditory Stimuli on the Subcomponents of Visual Attention. i-Perception 2021, 12, 20416695211058222.
  18. Tang, X.; Gu, J.; Lu, S.; Sun, J.; Du, Y. From Sound to Sight: The Cross-Modal Spread of Location-Based Inhibition of Return. Psychophysiology 2025, 62, e70123.
  19. O’Dowd, A.; Hirst, R.J.; Seveso, M.A.; McKenna, E.M.; Newell, F.N. Generalisation to novel exemplars of learned shape categories based on visual and auditory spatial cues does not benefit from multisensory information. Psychon. Bull. Rev. 2025, 32, 417–429.
  20. Jing, B.; Wu, C.; Pi, Z.; Zhou, Y.; Ma, H. Examining the Voice-Image Matching for Pedagogical Agents Presented in Instructional Videos. J. Exp. Educ. 2025, 1–21.
  21. Groznik, V.; De Gobbis, A.; Georgiev, D.; Semeja, A.; Sadikov, A. Machine Learning-Based Detection of Cognitive Impairment from Eye-Tracking in Smooth Pursuit Tasks. Appl. Sci. 2025, 15, 7785.
  22. Gaspelin, N.; Luck, S.J. The Role of Inhibition in Avoiding Distraction by Salient Stimuli. Trends Cogn. Sci. 2018, 22, 79–92.
  23. Forschack, N.; Gundlach, C.; Hillyard, S.; Müller, M.M. Dynamics of attentional allocation to targets and distractors during visual search. NeuroImage 2022, 264, 119759.
  24. Stolte, M.; Ansorge, U. Automatic capture of attention by flicker. Atten. Percept. Psychophys. 2021, 83, 1407–1415.
  25. Wu, Z.; Feit, A.M. Understanding and Predicting Temporal Visual Attention Influenced by Dynamic Highlights in Monitoring Task. IEEE Trans. Hum.-Mach. Syst. 2025, 55, 1053–1063.
  26. Kwon, J.; Schmidt, A.; Luo, C.; Jun, E.; Martinez, K. Visualizing Spatial Cognition for Wayfinding Design: Examining Gaze Behaviors Using Mobile Eye Tracking in Counseling Service Settings. ISPRS Int. J. Geo-Inf. 2025, 14, 406.
  27. Hemmerich, K.; Luna, F.G.; Martín-Arévalo, E.; Lupiáñez, J. Understanding vigilance and its decrement: Theoretical, contextual, and neural insights. Front. Cogn. 2025, 4, 1617561.
  28. Ko, S.; Kutchek, K.; Zhang, Y.; Jeon, M. Effects of Non-Speech Auditory Cues on Control Transition Behaviors in Semi-Automated Vehicles: Empirical Study, Modeling, and Validation. Int. J. Hum.-Comput. Interact. 2021, 38, 185–200.
  29. El Iskandarani, M.; Sara, L.R.; Bolton, M. Examining dual-task interference effects of visual and auditory perceptual load in virtual reality. Int. J. Hum.-Comput. Stud. 2025, 205, 103619.
  30. Dunifon, C.M.; Rivera, S.; Robinson, C.W. Auditory stimuli automatically grab attention: Evidence from eye tracking and attentional manipulations. J. Exp. Psychol. Hum. Percept. Perform. 2016, 42, 1947–1958.
  31. Rummukainen, O.; Mendonça, C. Task-Relevant Spatialized Auditory Cues Enhance Attention Orientation and Peripheral Target Detection in Natural Scenes. J. Eye Mov. Res. 2016, 9, 1–10.
  32. Du, P.; Li, D.; Liu, T.; Zhang, L.; Yang, X.; Li, Y. Crisis Map Design Considering Map Cognition. ISPRS Int. J. Geo-Inf. 2021, 10, 692.
  33. Yu, J.; Zhou, M.; Wang, X.; Pu, G.; Cheng, C.; Chen, B. A Dynamic and Static Context-Aware Attention Network for Trajectory Prediction. ISPRS Int. J. Geo-Inf. 2021, 10, 336.
  34. Wu, C.-F.; Gao, C.; Lin, K.-C.; Chang, Y.-H. Evaluating Impacts of Bus Route Map Design and Dynamic Real-Time Information Presentation on Bus Route Map Search Efficiency and Cognitive Load. ISPRS Int. J. Geo-Inf. 2022, 11, 338.
  35. Ehinger, K.A.; Wolfe, J.M. When is it time to move to the next map? Optimal foraging in guided visual search. Atten. Percept. Psychophys. 2016, 78, 2135–2151.
  36. Rymarkiewicz, W.; Cybulski, P.; Horbiński, T. Measuring Efficiency and Accuracy in Locating Symbols on Mobile Maps Using Eye Tracking. ISPRS Int. J. Geo-Inf. 2024, 13, 42.
  37. Siepmann, N.; Edler, D.; Keil, J.; Kuchinke, L.; Dickmann, F. The position of sound in audiovisual maps: An experimental study of performance in spatial memory. Cartogr. Int. J. Geogr. Inf. Geovisualization 2020, 55, 136–150.
  38. Medyńska-Gulij, B.; Gulij, J.; Cybulski, P.; Zagata, K.; Zawadzki, J.; Horbiński, T. Map design and usability of a simplified topographic 2D map on the smartphone in landscape and portrait orientations. ISPRS Int. J. Geo-Inf. 2022, 11, 577.
  39. Huang, C.; Mees, O.; Zeng, A.; Burgard, W. Audio Visual Language Maps for Robot Navigation. In Experimental Robotics; Ang, M.H., Jr., Khatib, O., Eds.; Springer Proceedings in Advanced Robotics; Springer: Cham, Switzerland, 2024; Volume 30, p. 10.
  40. Besançon, L.; Ynnerman, A.; Keefe, D.F.; Yu, L.; Isenberg, T. The state of the art of spatial interfaces for 3D visualization. Comput. Graph. Forum 2021, 40, 293–326.
  41. Kashevnik, A.; Lashkov, I.; Axyonov, A.; Ivanko, D.; Ryumin, D.; Kolchin, A.; Karpov, A. Multimodal Corpus Design for Audio-Visual Speech Recognition in Vehicle Cabin. IEEE Access 2021, 9, 34986–35003.
  42. Pramanik, A.; Sarkar, S.; Maiti, J. A real-time video surveillance system for traffic pre-events detection. Accid. Anal. Prev. 2021, 154, 106019.
  43. Coscia, A.; Suh, A.; Chang, R.; Endert, A. Preliminary Guidelines for Combining Data Integration and Visual Data Analysis. IEEE Trans. Vis. Comput. Graph. 2024, 30, 6678–6690.
  44. Broeckelmann, E.M.; Martin, T.; Glazebrook, C.M. Auditory Cues and Feedback in the Serial Reaction Time Task: Evidence for Sequence Acquisition and Sensory Transfer. J. Mot. Behav. 2025, 57, 182–197.
  45. Hirway, A.; Qiao, Y.; Murray, N. Spatial Audio in 360° Videos: Does it influence Visual Attention? In Proceedings of the 13th ACM Multimedia Systems Conference; Association for Computing Machinery: New York, NY, USA, 2022; pp. 39–51.
  46. Midtbø, T.; Larsen, E. Map Animations Versus Static Maps—When Is One of Them Better? In Proceedings of the Joint ICA Commissions Seminar on Internet-based Cartographic Teaching and Learning, Madrid, Spain, 6–8 July 2005.
  47. Medyńska-Gulij, B.; Wielebski, Ł.; Halik, Ł.; Smaczyński, M. Complexity Level of People Gathering Presentation on an Animated Map—Objective Effectiveness Versus Expert Opinion. ISPRS Int. J. Geo-Inf. 2020, 9, 117.
  48. Shao, J.; Wu, J.; Tang, W.; Xue, C. How dynamic information layout in GIS interface affects users’ search performance: Integrating visual motion cognition into map information design. Behav. Inf. Technol. 2023, 42, 1686–1703.
  49. Barvir, R.; Vozenilek, V. Developing Versatile Graphic Map Load Metrics. ISPRS Int. J. Geo-Inf. 2020, 9, 705.
  50. Lin, X.; Pan, P. The Impact of Information Layout and Auxiliary Instruction Display Mode on the Usability of Virtual Fitting Interaction Interfaces. Information 2025, 16, 862.
  51. Zhang, J.; Zhang, N.; Zhang, Y.; Xu, C. Research on Evaluation Method for Multimodal Information Interface Complexity. In Human-Computer Interaction—HCII 2025; Kurosu, M., Hashizume, A., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2025; Volume 15768, p. 31.
  52. Wang, Z.; Liu, F.; Lu, Z.; Jia, F.; Wang, J. EHMI: A Complexity Assessment Method for Automotive Intelligent Cockpit Human-Computer Interaction Interfaces: An Example from the Instrument Cluster. In Design, User Experience, and Usability—HCII 2025; Schrepp, M., Ed.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2025; Volume 15797, p. 21.
  53. Hsieh, M.C.; Chiu, M.C.; Hwang, S.L. An optimal range of information quantity on computer-based procedure interface design in the advanced main control room. J. Nucl. Sci. Technol. 2015, 52, 687–694.
  54. Zhang, N.; Zhang, J.; Jiang, S.; Ge, W. The Effects of Layout Order on Interface Complexity: An Eye-Tracking Study for Dashboard Design. Sensors 2024, 24, 5966.
  55. Roehrbein, F.; Coen-Cagli, R.; Schwartz, O. Dynamic scenes vs. static images: Differences in basic gazing behaviors for natural stimulus sets. J. Vis. 2011, 11, 486.
  56. Afonso-Jaco, A.; Katz, B.F.G. Spatial knowledge via auditory information for blind individuals: Spatial cognition studies and the use of audio-VR. Sensors 2022, 22, 4794.
  57. Boyer, E.O.; Portron, A.; Bevilacqua, F.; Lorenceau, J. Continuous Auditory Feedback of Eye Movements: An Exploratory Study toward Improving Oculomotor Control. Front. Neurosci. 2017, 11, 197.
  58. Fleming, J.T.; Noyce, A.L.; Shinn-Cunningham, B.G. Audio-visual spatial alignment improves integration in the presence of a competing audio-visual stimulus. Neuropsychologia 2020, 146, 107530.
  59. Eimontaite, I.; Gwilt, I.; Cameron, D.; Aitken, J.M.; Rolph, J.; Mokaram, S.; Law, J. Dynamic Graphical Signage Improves Response Time and Decreases Negative Attitudes Towards Robots in Human-Robot Co-working. In Human Friendly Robotics; Ficuciello, F., Ruggiero, F., Finzi, A., Eds.; Springer Proceedings in Advanced Robotics; Springer: Cham, Switzerland, 2019; Volume 7, p. 11.
  60. Yu, R.F.; Chan, A.H.S. Display movement velocity and dynamic visual search performance. Hum. Factors Ergon. Manuf. Serv. Ind. 2015, 25, 269–278.
  61. Mahadevan, V.; Vasconcelos, N. Spatiotemporal Saliency in Dynamic Scenes. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 171–177.
  62. Parise, C.V.; Ernst, M.O. Multisensory integration operates on correlated input from unimodal transient channels. eLife 2025, 12, RP90841.
  63. Frischkorn, G.T.; Wilhelm, O.; Oberauer, K. Process-oriented intelligence research: A review from the cognitive perspective. Intelligence 2022, 94, 101681.
  64. Li, W.C.; Yu, C.S.; Greaves, M.; Braithwaite, G. How cockpit design impacts pilots’ attention distribution and perceived workload during aiming a stationary target. Procedia Manuf. 2015, 3, 5663–5669.
  65. Yiu, C.Y.; Ng, K.K.H.; Li, Q.; Yuan, X. Gaze behaviours, situation awareness and cognitive workload of air traffic controllers in radar screen monitoring tasks with varying task complexity. Int. J. Occup. Saf. Ergon. 2025, 31, 504–515.
  66. Xue, H.; Wang, T.; Zhang, X. Visual search in vibration environments: Effects of spatial ability, stimulus size and stimulus density. Int. J. Ind. Ergon. 2020, 79, 102988.
  67. Yamamoto, S.; Miyazaki, M.; Iwano, T.; Kitazawa, S. Bayesian calibration of simultaneity in audiovisual temporal order judgments. PLoS ONE 2012, 7, e40379.
  68. Rohe, T.; Ehlis, A.C.; Noppeney, U. The neural dynamics of hierarchical Bayesian causal inference in multisensory perception. Nat. Commun. 2019, 10, 1907.
  69. Ferrari, A.; Noppeney, U. Attention controls multisensory perception via two distinct mechanisms at different levels of the cortical hierarchy. PLoS Biol. 2021, 19, e3001465.
Figure 1. The interface of a classic geo-information system.
Figure 2. The technical flowchart of this study.
Figure 3. Schematic illustration of the experimental apparatus.
Figure 4. Representative examples of four pane types.
Figure 5. Auditory spatial cue conditions.
Figure 6. Illustrations of the four representative functional pane types.
Figure 7. Illustrations of the three interface complexity levels.
Figure 8. Experimental procedure.
Figure 9. Comparison of ACC and RT across three auditory spatial cue conditions (no sound, binaural sound, and monaural sound).
Figure 10. Comparison of ACC and RT between sound-present condition and no-sound condition.
Figure 11. ACC and RT for dynamic versus static conditions.
Figure 12. Comparison of ACC and RT across three interface complexity levels.
Figure 13. Comparison of interaction effects between auditory spatial cueing and display dynamics on ACC and RT.
Figure 14. Comparison of interaction effects between auditory spatial cueing and interface complexity on ACC and RT.
Figure 15. Comparison of interaction effects between display dynamics and interface complexity on ACC and RT.
Figure 16. Comparison of three auditory spatial cue conditions (no sound, binaural sound, monaural sound) on NOF and TFD.
Figure 17. Comparison between the sound-present condition and the no-sound condition for NOF and TFD.
Figure 18. Comparison of heatmaps between no sound and binaural sound conditions under 3D navigation.
Figure 19. Comparison of heatmaps between binaural and monaural sound conditions under 2D map navigation.
Figure 20. Comparison between dynamic and static conditions on NOF and TFD.
Figure 21. Example heatmaps and scan paths for dynamic and static conditions under 3D navigation.
Figure 22. Comparison of three interface complexity conditions on NOF and TFD.
Figure 23. Example heatmaps for three interface-complexity conditions.
Figure 24. Comparison of interaction effects between auditory spatial cueing and display dynamics on NOF and TFD.
Figure 25. Comparison of interaction effects between display dynamics and sound presence on NOF and TFD.
Figure 26. Comparison of interaction effects between auditory spatial cueing and interface complexity on NOF and TFD.
Figure 27. Comparison of interaction effects between interface complexity and sound presence on NOF and TFD.
Figure 28. Comparison of interaction effects between display dynamics and interface complexity on NOF and TFD.
Table 1. Comparison of scan paths between no sound and binaural sound conditions.
	No Sound	Binaural Sound
Typical participant scanning path	(image)	(image)
The overlapped saccade tracks	(image)	(image)
Table 2. Comparison of scan paths across three interface complexity conditions.
	Low	Medium	High
Typical participant scanning path	(image)	(image)	(image)
The overlapped saccade tracks	(image)	(image)	(image)