1. Introduction
Cultural heritage represents humanity’s valuable cultural legacy, encompassing diverse forms such as architecture, archeological sites, and cultural landscapes. It serves as a testament to history and artistry while functioning as a key vehicle for community identity and continuity [
1]. As sustainable development gains global consensus, the preservation and revitalization of cultural heritage have transcended cultural boundaries to become an important pathway for advancing multidimensional economic, social, and environmental progress, thereby promoting inclusive and sustainable regional growth [
2]. As a representative region for Sino-Western cultural exchange, Macau’s Historic Centre was inscribed on the World Heritage List in 2005. Encompassing 22 historic buildings, including A-Ma Temple and the Ruins of St. Paul’s, as well as 8 squares, it holds unique cultural and historical value. Reinterpreting Macau’s cultural heritage through modern design language, while harmonizing traditional cultural identity with contemporary aesthetic and functional demands, can inject new vitality into its preservation and dissemination [
3].
Currently, education-oriented serious games serve as effective tools for promoting cultural heritage. Through their “learning by doing” design mechanisms, they integrate explicit learning objectives into gaming experiences, thereby enhancing user engagement and knowledge retention [
4,
5,
6,
7]. When combined with technologies such as virtual reality (VR), augmented reality (AR), and 3D digitalization, serious games can create immersive learning environments [
8,
9,
10,
11]. For instance, integrating AR technology into Macau’s cultural festivals allows visitors to scan QR codes and participate in virtual games that blend local architecture with Portuguese symbols, providing novel opportunities for cultural exploration [
12]. Through 3D spatial reconstruction, temporal narratives, and role-playing simulations, serious games deepen users’ understanding of historical contexts and archeological methodologies [
13,
14]. However, serious games focusing on cultural heritage continue to face challenges, including limited user retention, lack of long-term feedback, and insufficient iterative optimization. Game design must balance functionality with playability while ensuring the accuracy, diversity, and cognitive appropriateness of the cultural knowledge presented [
15].
Eye-tracking technology provides an objective and precise research method for revealing cognitive processing by recording users’ gaze and saccade behaviors within visual scenes [
16,
17]. Based on principles of infrared imaging, this technology extracts multidimensional metrics, including fixation duration, fixation count, saccade paths, and pupil diameter, thereby quantifying users’ attention allocation, cognitive load, and information processing [
18]. In recent years, eye-tracking technology has been widely applied in psychology, human–computer interaction, landscape visual assessment, and educational technology [
19,
20]. In serious games research, eye-tracking data are frequently used to evaluate the effectiveness of visual interface elements, thereby optimizing designs to enhance learning outcomes and user experience [
21]. Studies have shown that dynamic media and image panels sustain user attention longer than static labels and that video game graphic design effectively captures target users’ attention [
22]. In special education, eye-tracking studies involving children with autism spectrum disorder have deepened understanding of their cognitive processes, informing the design of joint attention interventions [
23]. In resource management simulation games, metrics such as eye movement frequency, saccade rate, and pupil diameter effectively predict task difficulty and user performance [
24]. Furthermore, dynamic information indicators such as transition entropy and stationary entropy have been applied to analyze fixation sequence switching patterns, assess attention distribution across areas of interest (AOIs), and reveal users’ visual cognitive strategies and their relationships with emotional arousal [
20,
25].
From a cognitive psychology perspective, how attention translates into emotional experiences and learning outcomes has been systematically examined across multiple theoretical frameworks. Cognitive Load Theory (CLT) and the Cognitive Theory of Multimedia Learning (CTML) indicate that learners have limited working-memory resources, and that different forms of visual and textual information can affect learning outcomes and psychological experience by shaping intrinsic load, extraneous load, and germane load [
26,
27]. Flow theory and immersion research further suggest that people are more likely to enter a highly engaged and enjoyable state when task goals are clear, feedback is timely, and interaction is smooth [
28]. The embodied cognition perspective also proposes that first-person spatial exploration and embodied interaction with the environment can strengthen contextual understanding and emotional immersion [
29,
30]. Building on these theories, this study examines how visual elements with different levels of interaction dominance shape users’ emotional experiences and learning outcomes in cultural heritage serious games through attention allocation, cognitive load, and the continuity of immersion.
Although previous studies have made preliminary explorations into the role of visual elements in serious games, several limitations remain. First, most research has not systematically distinguished between active and passive interactive visual elements from the perspective of interaction dominance, making it difficult to reveal their differential effects on users’ cognitive processes and emotional responses [
31]. Second, research methods predominantly rely on single data sources, lacking integrated multimodal analytical frameworks that combine eye-tracking, behavioral, and subjective data. This limitation hinders the comprehensive capture and interpretation of visual attention dynamics [
32]. Furthermore, empirical studies focusing on serious games for cultural heritage remain limited, particularly those grounded in specific cultural contexts. Consequently, providing precise references for gamified learning design within particular cultural settings remains a challenge [
10,
15].
Therefore, this study uses the “Macau Historic Centre Science Popularization System” as an experimental platform to construct and empirically test an interaction-dominance-based analytical framework linking “visual elements–interactive behaviors–cognitive and emotional experiences–learning outcomes,” systematically examining the distinct influence mechanisms of active and passive visual elements. The study contributes on three levels. (1) Theoretically, it develops an integrated analytical framework from an “interaction dominance” perspective that incorporates both active-interaction and passive-presentation visual elements, elucidating how visual elements influence learning outcomes through differentiated behavioral pathways and affective experiences. (2) Methodologically, it integrates multimodal data, including eye-tracking metrics, information dynamism indicators, behavioral logs, the Game Experience Questionnaire (GEQ), pre- and post-test learning assessments, and semi-structured interviews. This approach captures the process from visual presentation to behavioral response, subjective experience, and learning outcomes, going beyond existing studies that rely on single questionnaires or static eye-tracking metrics. (3) Practically, using the World Heritage site of Macau Historic Centre as an empirical setting, it identifies critical design trade-offs among high emotional arousal, cognitive load, and interaction complexity, and provides directly applicable empirical evidence for designing digital guided tours, narratives, and spatial experiences for similar cultural heritage sites.
This study proposes the following research questions and hypotheses (
Figure 1):
RQ1: Can serious games about cultural heritage significantly improve users’ learning performance in terms of cultural knowledge retention and comprehension?
H1. Users participating in serious games about cultural heritage will demonstrate significantly improved cultural knowledge retention and comprehension compared to pre-test levels, as reflected in higher scores on knowledge performance assessment questionnaires.
RQ2: Are there significant differences in eye-tracking metrics (TFD, AFD, FC, TFF, FFD, VC) across different categories of visual elements? Which elements receive the highest user attention?
H2. Significant differences exist in eye-tracking metrics such as total fixation duration (TFD) and fixation count (FC) across categories of visual elements; users show the highest visual engagement with props, text boxes, architectural light and shadow shows, and historic buildings.
RQ3: In games, do player-driven, active-interaction visual elements (NPCs, menus, characters, props, etc.) elicit greater emotional arousal potential than environment-driven, passive-interaction visual elements (historic buildings, architectural light and shadow shows, text boxes) in terms of information dynamism?
H3. The information dynamism of active-interaction visual elements, measured by transition entropy and stationary entropy, elicits significantly greater emotional arousal than that of passive-interaction visual elements.
RQ4: Do different visual elements exert varying effects on active and passive interaction behaviors? What are the specific correlations and degrees of influence?
H4. NPCs and characters significantly affect active interaction behaviors, whereas architectural light and shadow shows and text boxes have a more pronounced influence on passive interaction behaviors.
3. Results
First, consistency tests were conducted on the GEQ scale. Cronbach’s α for all dimensions exceeded 0.70 (Table 3). The results indicate that under the experimental conditions, each questionnaire factor demonstrated high internal consistency [61], satisfying the requirements for experimental analysis.
3.1. Effect on Academic Performance Improvement (RQ1)
To evaluate the effectiveness of cultural heritage serious games in enhancing user learning outcomes, this study analyzed the pre- and post-intervention knowledge test scores of 30 participants. The results indicate that participants’ total knowledge test scores, as well as scores for knowledge retention and comprehension, showed significant improvement following the game intervention compared to pre-intervention levels.
Descriptive statistics (Table 4) show that the post-intervention total knowledge test score (M = 6.73, SD = 2.03) was higher than the pre-intervention score (M = 3.93, SD = 2.08). For knowledge retention, post-intervention scores (M = 3.40, SD = 1.22) exceeded pre-intervention scores (M = 1.53, SD = 0.97). For knowledge comprehension, post-intervention scores (M = 3.33, SD = 1.03) also exceeded pre-intervention scores (M = 2.40, SD = 1.38).
To further examine the significance of these differences, this study employed the Shapiro–Wilk test to assess the normality of the pre- and post-test score differences. The results indicated that the total knowledge test score differences followed a normal distribution (W = 0.964, p = 0.380). Therefore, a paired samples t-test was conducted. The t-test results revealed a significant difference in total knowledge test scores before and after the intervention (t(29) = −10.958, p < 0.001), with an extremely large effect size (Cohen’s d = 2.001). The differences in knowledge retention (W = 0.898, p = 0.008) and knowledge comprehension (W = 0.899, p = 0.008) did not follow a normal distribution; hence, they were analyzed using the Wilcoxon signed-rank test. The results showed that both knowledge retention scores (z = 4.765, p < 0.001) and knowledge comprehension scores (z = 4.128, p < 0.001) were significantly higher post-intervention than pre-intervention (Table 5). In summary, the findings indicate that participation in cultural heritage serious games significantly enhances users’ learning performance in cultural knowledge retention and comprehension, supporting research hypothesis H1. Item analysis and internal consistency test results for the assessment are summarized in Supplementary Materials S2.
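As a consistency check on the reported effect size, Cohen’s d for a paired design can be recovered from the t statistic and the number of pairs as d = |t| / √n. The sketch below assumes this paired-differences definition; the function name is illustrative and is not taken from the study’s analysis scripts.

```python
import math


def cohens_d_from_paired_t(t_value: float, n_pairs: int) -> float:
    """Paired-design effect size, d = |t| / sqrt(n) (assumed definition)."""
    if n_pairs < 2:
        raise ValueError("a paired t-test needs at least two pairs")
    return abs(t_value) / math.sqrt(n_pairs)


# Values reported above: t(29) = -10.958 with 30 participants.
d_total = cohens_d_from_paired_t(-10.958, 30)
print(round(d_total, 3))  # 2.001
```

The result agrees with the reported d = 2.001, which suggests that the effect size was computed on the paired score differences.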
3.2. Differences in Eye-Tracking Behavior Across Visual Elements (RQ2)
To investigate users’ attention allocation patterns toward different categories of visual elements during gameplay, this study examined ten key visual elements: menu icons, function icons, text boxes, historic buildings, spatial navigation, characters, NPCs, dialogue boxes, props, and architectural light and shadow shows. A systematic analysis was conducted based on six eye-tracking metrics: TFD, FC, TFF, FFD, AFD, and VC.
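To make the six metrics concrete, the following minimal sketch derives them for one AOI from a list of fixation records. The expansions of the abbreviations (total fixation duration, fixation count, time to first fixation, first fixation duration, average fixation duration, visit count), the `Fixation` record, and the `aoi_metrics` function are assumptions made for illustration; they do not reproduce the export format or definitions of the analysis software used in the study.

```python
from dataclasses import dataclass


@dataclass
class Fixation:
    aoi: str         # label of the area of interest that was fixated
    start: float     # onset in seconds from stimulus start
    duration: float  # fixation length in seconds


def aoi_metrics(fixations, aoi):
    """Return assumed TFD, FC, TFF, FFD, AFD and VC for one AOI, or None if never fixated."""
    ordered = sorted(fixations, key=lambda f: f.start)
    hits = [f for f in ordered if f.aoi == aoi]
    if not hits:
        return None
    # A visit is a run of consecutive fixations that stay inside the AOI.
    visits = 0
    previous = None
    for f in ordered:
        if f.aoi == aoi and previous != aoi:
            visits += 1
        previous = f.aoi
    total = sum(f.duration for f in hits)
    return {
        "TFD": total,              # total fixation duration
        "FC": len(hits),           # fixation count
        "TFF": hits[0].start,      # time to first fixation
        "FFD": hits[0].duration,   # first fixation duration
        "AFD": total / len(hits),  # average fixation duration
        "VC": visits,              # visit count
    }


# Hypothetical gaze record for one participant.
sample = [
    Fixation("prop", 0.5, 0.2),
    Fixation("prop", 0.8, 0.4),
    Fixation("text box", 1.3, 0.3),
    Fixation("prop", 1.7, 0.6),
]
prop_metrics = aoi_metrics(sample, "prop")
```

In this hypothetical record the prop AOI receives three fixations in two separate visits, which illustrates why FC and VC can diverge for elements that users return to repeatedly.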
Descriptive statistics (Figure 6) revealed significant differences in gaze behavior across visual elements under multiple eye-tracking metrics. For TFD, props (M = 48.40, SD = 27.55) and text boxes (M = 47.38, SD = 25.76) exhibited the highest means, while menu icons (M = 0.59, SD = 0.84) had the lowest mean. FC showed a similar trend, with text boxes (M = 172.07, SD = 79.70), props (M = 146.80, SD = 84.01), and architectural light and shadow shows (M = 81.17, SD = 59.55) exceeding other elements. For TFF, architectural light and shadow shows (M = 296.74, SD = 114.05) required the longest time, whereas menu icons (M = 17.81, SD = 32.63) were captured most rapidly.
Shapiro–Wilk normality tests, together with histogram and Q–Q plot inspections, indicated no overall severe deviation from normality at the aggregated “subject × AOI type” level, with only mild skewness observed for a small number of AOIs on a few metrics. Mauchly’s sphericity tests showed that all metrics violated the sphericity assumption (
p < 0.05); therefore, results were reported with Greenhouse–Geisser corrections. Repeated-measures ANOVA results (
Table 6) revealed significant main effects of visual elements on all six eye-tracking metrics, with
F values ranging from 2.853 to 71.397 (
p ≤ 0.023) and partial η² values ranging from 0.090 to 0.711. Bonferroni-corrected post hoc comparisons further indicated that props, text boxes, historic buildings, and architectural light and shadow shows had significantly higher mean values for TFD and FC than most other elements, suggesting that these four categories primarily carried information and conveyed context throughout the experience. Architectural light and shadow shows also showed significantly higher TFF means than other elements, consistent with their role as reward-based feedback at the end of the sequence. Spatial navigation and function icons displayed lower mean values for TFD and FC than the four categories above, while menu icons and NPCs ranked lowest overall in gaze duration and fixation frequency. These results support H2, indicating systematic differences in eye-tracking metrics across visual element categories, with the highest visual attention directed toward props, text boxes, architectural light and shadow shows, and historic buildings.
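The reported effect sizes can be related to the F values through the standard identity partial η² = F·df_effect / (F·df_effect + df_error). The sketch below assumes uncorrected degrees of freedom of 9 and 261, inferred here from ten visual elements and 30 participants rather than reported in the text. Because the Greenhouse–Geisser correction scales both degrees of freedom by the same factor, it does not change this ratio.

```python
def partial_eta_squared(f_value: float, df_effect: float, df_error: float) -> float:
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    if f_value < 0 or df_effect <= 0 or df_error <= 0:
        raise ValueError("F must be non-negative and degrees of freedom positive")
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Assumed degrees of freedom: (10 - 1) = 9 and (10 - 1) * (30 - 1) = 261.
largest = partial_eta_squared(71.397, 9, 261)
smallest = partial_eta_squared(2.853, 9, 261)
print(round(largest, 3), round(smallest, 3))  # 0.711 0.09
```

Under these assumed degrees of freedom, the two extreme F values reproduce the reported range of 0.090 to 0.711.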
3.3. Eye-Tracking Heat Map Distribution and Analysis
This study utilized the Tobii Pro Lab analysis platform to overlay eye-tracking data from 30 participants, generating a comprehensive visual attention heatmap. The heatmap uses a color gradient to indicate the concentration of gaze points, with red representing the longest fixation duration and highest attention intensity. Yellow and green denote areas receiving secondary attention, while transparent or colorless regions indicate areas that did not receive significant visual attention.
During the system interface and navigation phase, AOIs such as menu icons (a), function icons (b), and spatial navigation (e) showed relatively dispersed visual attention distribution in the heatmap, with overall low color intensity dominated by green tones. This indicates that users engaged in goal-oriented rapid scanning and target localization during this phase. This finding is consistent with the quantitative analysis, in which the mean TFD and FC values for these elements were at their lowest levels, demonstrating that the interaction design is efficient and minimally disruptive in guiding functionality.
During the exploration and mission execution phases, significant yellow-to-orange hotspots appeared in the heatmap across AOIs, including historic buildings (d), characters (f), NPCs (g), dialogue boxes (h), and props (i1–i3). Props (i1–i3), serving as core carriers of mission objectives, formed sustained and concentrated red hotspots. Historic buildings, representing the cultural heritage core within the virtual environment, elicited moderate gaze intensity, reflecting users’ active observation of architectural forms and details. Text boxes (c) generated high-density linear hotspots during information reading, indicating sequential reading behavior. In contrast, narrative elements such as NPCs, characters, and dialogue boxes elicited moderate gaze intensity, predominantly appearing in light yellow to green hues. This indicates that while they serve narrative and interactive functions, their visual appeal remains lower than that of high-feedback tasks and dynamic content. The spatial distribution of heatmap patterns explains the significant differences in gaze metrics across various visual elements. These findings further validate the quantitative analysis results at the spatial distribution level, providing intuitive visual support for H2 and H4.
During the reward phase (j) of the architectural light and shadow shows, the heatmap revealed a highly concentrated red core hotspot that nearly covered the animation display area. The results indicate that as a passive interactive visual reward element, the architectural light and shadow shows possess significant advantages in visual appeal and sustaining user attention. This finding corroborates the “visual impact” and “emotional motivation” experiences reflected in user interviews, providing further support for Hypothesis H4. It demonstrates that dynamic visual effects can effectively promote positive user emotions, immersion, and a sense of accomplishment. In summary, the eye-tracking heatmap reveals that user attention is spatially concentrated on task-relevant, information-dense, and dynamically responsive visual objects (e.g., props, text boxes, architectural light and shadow shows, and historic buildings), while functional and navigational elements exhibit attention allocation patterns at the periphery of visual cognition (
Figure 7 and
Supplementary Materials S4).
3.4. The Relationship Between Information Dynamism and Emotional Arousal (RQ3)
To examine the relationship between information dynamics and emotional arousal of visual elements across different interaction types and to assess interaction-type differences [
58], this study analyzed the relationships between the transition entropy and stationary entropy of active and passive interaction visual elements, respectively, and each affective dimension of the GEQ using Spearman’s correlation coefficient [
25]. Given that the correlation analysis is based on uncorrected
p-values and is intended for exploratory identification of potential associations, the results presented herein do not support causal inference. As this study is exploratory and some visual element categories have small sample sizes (
N ≈ 20–30), statistical power is limited. To maximize the identification of potentially important effects, this report retains results reaching the trend-level threshold (
p < 0.10) as preliminary findings requiring further validation in future large-sample studies (
Table 7).
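The two information dynamism indicators can be illustrated with a short sketch. It assumes a commonly used gaze-entropy formulation in which stationary entropy is the Shannon entropy of the fixation distribution over AOIs, and transition entropy is the entropy of the next AOI given the current one, weighted by that distribution. The function name, the use of observed fixation proportions as the stationary distribution, and the counting of within-AOI refixations as transitions are assumptions of this sketch, not details reported by the study.

```python
import math
from collections import Counter, defaultdict


def gaze_entropies(aoi_sequence):
    """Return (stationary_entropy, transition_entropy) in bits for a sequence of AOI labels."""
    if len(aoi_sequence) < 2:
        raise ValueError("need at least two fixations to observe a transition")

    # Count first-order transitions between consecutive fixations.
    transitions = defaultdict(Counter)
    for source, target in zip(aoi_sequence, aoi_sequence[1:]):
        transitions[source][target] += 1

    # Assumption: the stationary distribution is the share of fixations per AOI.
    counts = Counter(aoi_sequence)
    total = len(aoi_sequence)
    stationary = {aoi: n / total for aoi, n in counts.items()}

    stationary_entropy = -sum(p * math.log2(p) for p in stationary.values())

    transition_entropy = 0.0
    for source, row in transitions.items():
        row_total = sum(row.values())
        row_entropy = -sum(
            (n / row_total) * math.log2(n / row_total) for n in row.values()
        )
        transition_entropy += stationary[source] * row_entropy

    return stationary_entropy, transition_entropy


# Hypothetical scanpath that alternates strictly between two AOIs.
h_s, h_t = gaze_entropies(["NPC", "prop", "NPC", "prop"])
```

For this strictly alternating scanpath, attention is split evenly between the two AOIs (stationary entropy of 1 bit) while every transition is fully predictable (transition entropy of 0), which shows how the two indicators capture different aspects of gaze behavior.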
Among active interactive visual elements, both transition entropy (
r = 0.442,
p < 0.05) and stationary entropy (
r = 0.500,
p < 0.05) of NPCs showed significant positive correlations with negative emotions, suggesting that increased visual complexity of NPCs may trigger stronger negative emotional responses in players. Furthermore, character transition entropy exhibited a significant negative correlation with flow experience (
r = −0.372,
p < 0.05), indicating that dynamic changes in this element may partially disrupt players’ immersive experiences. In addition to significant correlations, several marginally significant trends were observed: the transition entropy and stationary entropy of NPCs showed marginally significant positive correlations with boredom (r = 0.355, p = 0.082 and r = 0.391, p = 0.053, respectively; both p < 0.10), while the stationary entropy of NPCs showed a marginally significant positive correlation with fatigue (
r = 0.375,
p = 0.065 < 0.10). The transition entropy of function icons showed a marginally significant negative correlation with fatigue (
r = −0.362,
p = 0.054 < 0.10), while stationary entropy exhibited a marginally significant positive correlation with competence (
r = 0.314,
p = 0.097 < 0.10). The transition entropy and stationary entropy of menu icons both showed marginally significant negative correlations with positive emotion (r = −0.402, p = 0.079 and r = −0.440, p = 0.052, respectively; both p < 0.10). The transition entropy of menu icons also exhibited a marginally significant negative correlation with tension (
r = −0.435,
p = 0.055 < 0.10). Among passive interactive visual elements, only the stationary entropy of historic buildings showed a significant positive correlation with tension (
r = 0.433,
p < 0.05). These results suggest that the information dynamism of active interactive elements is more strongly associated with emotional arousal than that of passive interactive elements, providing preliminary support for Hypothesis H3 [
62]. Consequently, the conclusions presented in this section primarily serve to propose mechanistic insights and directions for subsequent validation. Their robustness requires further testing with larger samples and appropriate multiple-comparison control procedures.
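One option for the multiple-comparison control mentioned above is the Holm step-down procedure. The sketch below is a minimal implementation under that assumption; the study does not specify which procedure would be used, and the p-values in the usage lines are hypothetical rather than taken from Table 7.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    if any(p < 0 or p > 1 for p in p_values):
        raise ValueError("p-values must lie in [0, 1]")
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[index])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[index] = running_max
    return adjusted


# Hypothetical uncorrected p-values for three correlations.
raw = [0.01, 0.04, 0.03]
corrected = holm_adjust(raw)
significant = [p <= 0.05 for p in corrected]
```

With these hypothetical inputs, only the smallest p-value remains below 0.05 after adjustment, which illustrates how correlations that are nominally significant before correction may not survive it.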
3.5. The Impact of Visual Elements on Interactive Behavior and User Experience (RQ4)
Using Spearman correlation analysis, this study examined the relationships between active interaction clicks, passive interaction eye-tracking metrics, and each dimension of the GEQ scale. The analysis (
Table 8) revealed that, regarding active interaction behaviors, the interactive mechanism linking NPCs and dialogue boxes showed significant negative correlations with multiple dimensions of user experience. Specifically, NPC click counts showed significant negative correlations with positive emotion (
r = −0.582,
p < 0.01), immersion (
r = −0.390,
p < 0.05), sense of accomplishment (
r = −0.423,
p < 0.05), and tension (
r = −0.374,
p < 0.05). Clicks on the dialogue box showed significant negative correlations with immersion (
r = −0.440,
p < 0.05) and positive emotions (
r = −0.405,
p < 0.05). This indicates that while this interactive design aids task progression, it disrupts narrative flow, thereby negatively affecting emotional immersion and positive emotions [
63]. In contrast, character clicks showed a significant positive correlation with sense of accomplishment (
r = 0.455,
p < 0.05), and AFD also correlated positively with immersion (
r = 0.361,
p < 0.05). This suggests that, as a core interactive object, characters effectively enhance players’ goal attainment and immersive experience. However, character FC showed a significant negative correlation with competence (
r = −0.463,
p < 0.05), suggesting the need to optimize its visual feedback mechanism.
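For readers less familiar with rank-based correlation, the coefficient used in this section can be sketched as the Pearson correlation of the ranked data, with tied values given their average rank. The function names and the sample data below are hypothetical and serve only to illustrate the computation; they are not the study’s data or code.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation computed as the Pearson correlation of average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least two")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical click counts and GEQ immersion scores for five participants.
npc_clicks = [12, 7, 15, 9, 20]
immersion = [2.8, 3.6, 2.5, 3.1, 2.0]
rho = spearman_rho(npc_clicks, immersion)
```

In this hypothetical sample, the participant with more clicks always reports lower immersion, so the coefficient reaches its minimum of −1; the coefficients reported above describe much weaker monotonic associations.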
Eye-tracking analysis of active interactive elements (Table 9) further revealed that the AFD of function icons showed significant positive correlations with perceived fatigue (
r = 0.403,
p < 0.05) and perceived tension (
r = 0.466,
p < 0.01). In contrast, the AFD of FC showed a significant positive correlation with negative affect (
r = 0.378,
p < 0.05). The AFD of menu icons was negatively correlated with sense of accomplishment (
r = −0.451,
p < 0.05), while the AFD of spatial navigation was negatively correlated with negative emotions (
r = −0.400,
p < 0.05), indicating that clear navigation design helps reduce users’ negative experiences. The TFF of props showed a significant negative correlation with competence (
r = −0.383,
p < 0.05). These findings indicate that active interaction involves cognitive load, necessitating improved user experience through optimized information hierarchy and enhanced interactive feedback.
At the passive interaction level (
Table 9), environmental narrative elements significantly enhanced emotional engagement. The FC of architectural light and shadow shows showed significant positive correlations with positive emotions (
r = 0.526,
p < 0.01), sense of accomplishment (
r = 0.375,
p < 0.05), and immersion (
r = 0.381,
p < 0.05), indicating that it effectively promotes users’ emotional engagement through atmosphere creation. Text box FC also showed significant correlations with emotional engagement, exhibiting strong positive correlations with immersion (
r = 0.417,
p < 0.05) and sense of accomplishment (
r = 0.378,
p < 0.05), as well as a significant positive correlation with positive emotion (
r = 0.363,
p < 0.05). Simultaneously, text box FC showed a significant positive correlation with boredom (
r = 0.378,
p < 0.05), suggesting that optimizing information presentation is necessary to enhance content appeal.
The findings indicate that active and passive interactive elements exert distinct effects on user experience through different behavioral patterns. These results support Hypothesis 4. Specifically, NPCs, dialogue boxes, and characters—as active interactive elements—significantly influence user experience, with their impact depending on the fluidity of the interaction process. Conversely, architectural light and shadow shows and text boxes—as passive interactive elements—significantly enhance users’ positive emotional engagement through visual gaze behavior.
3.6. Thematic Analysis of Interview Results
Through thematic analysis of interview transcripts from 30 participants, this study identified four core themes: learning performance, attention to visual elements, interactive experience, and emotional and engagement experience. The qualitative findings complemented the quantitative data, providing additional support for the research hypotheses.
Regarding learning outcomes, 21 respondents (70%) affirmed the game’s positive influence in facilitating the internalization of cultural heritage knowledge, noting that interactive tasks effectively directed their attention to architectural details and cultural symbols. For example, P8 stated, “Through some activities in the game, I was able to examine the ornamental features of these buildings in greater detail.” This finding aligns with Hypothesis H1, as the post-test knowledge scores significantly exceeded pre-test scores (p < 0.001). Additionally, 29 respondents (96.7%) reported that the game inspired their interest in revisiting sites in person. As P23 noted: “After playing this game, I notice more details and want to revisit the actual locations.” However, text overload was widely mentioned. Seventeen respondents (56.7%) indicated that lengthy text blocks caused fatigue and boredom, leading them to feel disconnected from visual elements. P16 commented, “Text should be broken up and integrated with scenes; otherwise, you forget the preceding content when viewing images.” This finding corresponds with quantitative analysis, which showed that text boxes had the highest FC and a positive correlation with “boredom.”
Regarding visual element engagement, respondent feedback indicated that environment-driven passive interaction elements, such as architectural light and shadow shows, effectively captured visual attention and evoked positive emotions. Twenty-one respondents (70%) described the architectural light and shadow shows as “stunning” and “eye-opening” (P15). P30 remarked, “The projection segment was incredibly impactful, significantly enhancing my sense of participation.” This aligns with quantitative findings that revealed a strong positive correlation between the architectural light and shadow show’s FC scores and positive emotions/immersion, supporting H4. In contrast, while active interaction elements like NPCs played a crucial role in task progression, 18 respondents (60%) noted deficiencies in interface guidance. These included unclear task icon meanings, illogical NPC positioning and orientation that caused spatial confusion (P17), and the absence of a more explicit mini-map navigation system (P23).
Regarding the interactive experience, users generally expressed high satisfaction with the interaction mechanisms, finding designs such as first-person exploration and location-triggered tasks intuitive and user-friendly (P14). However, 13 respondents (43.3%) observed that the current depth of interaction remained insufficient. They expressed a desire to move beyond shallow “drag-and-drop/right-or-wrong” interaction patterns, to include contextual explanations for cultural symbols, and to enhance the interactivity of historic buildings rather than treating them merely as static backdrops (P11, P30). P8 further suggested that “NPCs should allow diverse interactions with each character, akin to the open world of The Legend of Zelda.” This indicates that embedding interactive behaviors within narrative and cultural contexts is crucial for enhancing immersion and the depth of learning.
In terms of emotional engagement and participatory experience, most participants acknowledged the unique value of virtual experiences: twenty-one respondents (70%) believed that the game provided perspectives and immersive environments difficult to achieve in real life. P5 remarked, “The real-life Ruins of St. Paul’s is packed with people, but the game version is empty—it instantly made me feel immersed.” The architectural light and shadow shows were identified as a core feedback mechanism. However, 10 participants (33.3%) suggested strengthening their connection to historical and cultural themes and introducing more diverse reward formats, such as virtual badges, to sustain a sense of achievement (P11, P18).
The thematic analysis deepened the understanding of the research questions from users’ subjective perspectives. The qualitative findings triangulated effectively with the quantitative results, suggesting plausible underlying mechanisms and clarifying the logic of user experience. These results provide empirical evidence and practical directions for optimizing visual-interaction-affective design in cultural heritage serious games.
4. Discussion
4.1. Enhancing Cultural Heritage Learning Through Serious Games
This study employed a pre- and post-test comparative analysis and found that users’ scores on cultural knowledge retention and comprehension tests significantly improved (
p < 0.001) after experiencing the cultural heritage serious game, validating Hypothesis H1. The results indicate that the serious game intervention based on the “Macau Historic Centre Science Popularization System” effectively enhances learning outcomes in cultural heritage knowledge. This finding aligns with the educational value of serious games, which use gamification mechanisms to strengthen learning motivation and knowledge retention [
64].
The game adopts a progressive learning framework of “exploration–learning–reward,” integrating knowledge of historical architectural features and cultural symbols into first-person exploration, NPC dialogue interactions, and contextualized missions. This design provides users with an immersive learning experience and supports them in actively constructing knowledge through interaction, aligning with the “learning by doing” principle emphasized in experiential learning theory. Interview data further corroborate this conclusion. User feedback indicates that the game “helps notice architectural details previously overlooked” (P8) and “sparks interest in visiting sites in person” (P23). Thus, the game not only enhances knowledge retention but also fosters emotional engagement and motivation. This demonstrates that serious games can convey knowledge content and promote active learning behaviors through immersive experiences [10].
4.2. The Influence of Visual Elements on Interactive Behavior and Cognitive Resource Allocation
Eye-tracking data analysis revealed significant differences in user gaze behavior across different categories of visual elements (p < 0.001), supporting Hypothesis H2. Props, text boxes, architectural light and shadow shows, and historic buildings exhibited significantly higher TFD and FC values than other interface elements, indicating that these components more effectively captured users’ visual attention [65].
As the core vehicle for task objectives, props command significantly longer fixation durations than visual elements such as menu icons, reflecting users’ prioritization of functional components. Text boxes, which serve as primary carriers of cultural information, also attracted considerable attention, indicating that users allocate greater cognitive resources to textual reading and semantic processing [66]. Furthermore, the significant difference in First Fixation Duration (FFD) (p = 0.023) reveals users’ cognitive processing strategies across different visual hierarchies. Users exhibited extremely brief first fixations (0.16 s) on persistent UI elements such as menu icons, indicating strong intuitive recognition and low extraneous load. In contrast, spatial navigation (0.38 s) and character elements (0.32 s) elicited longer initial fixations, reflecting users’ need for immediate spatial orientation and semantic decoding when encountering narrative scene elements. This differentiated pattern of “UI rapid recognition” versus “scene deep processing” confirms that the interface design effectively established a visual hierarchy between functional and content layers. Architectural light and shadow shows, functioning as dynamic reward-based content [67], substantially extended user fixation duration through their visual appeal.
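The gaze metrics compared in this subsection can be illustrated with a minimal Python sketch. It assumes a hypothetical fixation log in which each record carries an AOI label, an onset time, and a duration in seconds; the record layout, the function name `aoi_metrics`, and the sample values are illustrative assumptions rather than the study's actual pipeline or data. TFD is taken as the summed fixation duration on an AOI, FC as the number of fixations, FFD as the duration of the first fixation, and AFD as TFD divided by FC.

```python
from collections import defaultdict
from typing import NamedTuple


class Fixation(NamedTuple):
    aoi: str         # hypothetical AOI label, e.g. "prop" or "menu_icon"
    onset: float     # fixation start time in seconds
    duration: float  # fixation duration in seconds


def aoi_metrics(fixations):
    """Return {aoi: {"TFD", "FC", "FFD", "AFD"}} for a list of fixations."""
    grouped = defaultdict(list)
    # Sort by onset so the first stored duration per AOI is the first fixation.
    for fix in sorted(fixations, key=lambda f: f.onset):
        grouped[fix.aoi].append(fix.duration)
    metrics = {}
    for aoi, durations in grouped.items():
        total = sum(durations)
        metrics[aoi] = {
            "TFD": total,                    # total fixation duration
            "FC": len(durations),            # fixation count
            "FFD": durations[0],             # first fixation duration
            "AFD": total / len(durations),   # average fixation duration
        }
    return metrics


# Made-up sample log, for illustration only.
sample = [
    Fixation("menu_icon", 0.00, 0.16),
    Fixation("prop", 0.20, 0.40),
    Fixation("prop", 0.70, 0.60),
    Fixation("text_box", 1.40, 0.50),
]
result = aoi_metrics(sample)
```

Per-AOI dictionaries of this kind could then be aggregated per participant before any comparison across element categories.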
Based on interview feedback, users generally affirmed the visual appeal of architectural light and shadow shows, describing them as “impressive” (P15), but also noted that overly dense textual information can be “fatiguing” (P16). This aligns with cognitive load theory’s perspective that extraneous load impairs emotional engagement and learning motivation. Serious game design should therefore prioritize optimizing the information structure and presentation of high-attention elements to balance cognitive load and user experience. From the perspectives of cognitive load theory and the cognitive theory of multimedia learning, text boxes with high information density impose excessive extraneous load on limited working memory resources. This shifts attentional resources away from processing the cultural content itself toward managing textual form and interface structure, more readily inducing fatigue and boredom [26,27]. To optimize text processing, we recommend a “layered presentation” strategy that decomposes dense text into visually prominent core points and expandable details. We also recommend a “multimodal redundancy” design that provides brief voice narrations for key content, so that the auditory channel shares cognitive load with the visual channel. In contrast, passive visual elements such as architectural light and shadow shows enhance emotional arousal and interest through atmospheric immersion without significantly increasing intrinsic load, thereby indirectly boosting learning motivation and memory retention [26,28]. This reveals a critical trade-off in serious game visual design. Highly interactive, task-driven elements such as props effectively capture visual attention and promote deep cognitive processing but carry higher cognitive load and fatigue risks. Conversely, high-quality, narrative-driven passive elements such as architectural light and shadow shows offer lower active information transfer efficiency yet provide essential emotional buffering, immersive experiences, and memory consolidation. Designers should therefore not rely solely on any single type of element. Instead, they should establish a dynamic and rhythmic narrative balance between high-cognitive-load interactive tasks and high-arousal experiential scenarios, addressing both cognitive and emotional needs.
4.3. Differential Impact of Interactivity on Emotional Experience
This study analyzed the impact of active and passive interaction elements on users’ emotional experiences [68]. The findings provided exploratory evidence and mechanism clues supporting Hypotheses H3 and H4, suggesting potential differences in their emotional arousal mechanisms [62]. The dynamic nature of information in active interaction elements exerted a more pronounced effect on emotional arousal than that in passive interaction elements [69].
Active interactive elements play a critical role in advancing task progression, yet their design quality directly influences user experience. The analysis of NPC elements showed a significant correlation between their information dynamism and negative emotions. Specifically, NPC transition entropy exhibited a significant positive correlation with negative emotions (r = 0.442, p < 0.05). Transition entropy and stationary entropy both demonstrated marginally significant positive correlations with boredom (r = 0.355, p = 0.082; r = 0.391, p = 0.053, respectively), while stationary entropy showed a marginally significant positive correlation with fatigue (r = 0.375, p = 0.065). These findings indicate that NPC interaction processes characterized by insufficient fluidity and feedback delays may cause users to experience operational interruptions and cognitive resistance [63]. User comments in interviews, such as “NPC dialogues feel slightly lengthy” (P11) and “Unreasonable NPC positioning and orientation design leads to spatial cognitive confusion” (P17), further reflect this issue. These representative quotes suggest that high-frequency interactions may be accompanied by experiential friction and spatial orientation costs. Poorly designed dynamic interactions may necessitate more frequent visual searches and repeated confirmations, thereby increasing subjective stress and correlating with higher levels of negative emotional experiences. This provides a qualitative, mechanism-level explanation for the positive association between NPC information dynamism metrics and negative emotions. From a flow theory perspective, excessively long and poorly paced dialogues, along with interactions that frequently interrupt exploration of the main storyline, disrupt the equilibrium of “clear goals–immediate feedback–smooth control,” thereby diminishing immersive experiences [28]. From an embodied cognition perspective, NPC spatial positioning and orientation that misalign with the player’s embodied movement paths and viewpoint disrupt the environmental spatial reference system, increasing cognitive load and disorientation [29,30]. Together, these mechanisms may explain why NPC interactions more readily induce tension and frustration.
Analysis of character elements revealed a significant positive correlation between click frequency and sense of accomplishment (r = 0.455, p < 0.05), while AFD also exhibited a significant positive correlation with immersion (r = 0.361, p < 0.05), demonstrating their role in goal attainment and narrative engagement. However, FC showed a significant negative correlation with competence (r = −0.463, p < 0.05), while transition entropy demonstrated a significant negative correlation with flow experience (r = −0.372, p < 0.05). This indicates that character interactions require optimization in terms of visual feedback clarity and operational fluidity [34]. When information dynamics are excessively high and clear feedback guidance is lacking, users’ flow state is easily disrupted [28].
Analysis of function icons further reveals the complexity of how visual elements influence user experience, with information dynamism metrics and eye-tracking metrics exhibiting distinct patterns of impact. Regarding information dynamics, transition entropy showed a marginally significant negative correlation with fatigue (r = −0.362, p = 0.054), while stationary entropy exhibited a marginally significant positive correlation with competence (r = 0.314, p = 0.097). However, in eye-tracking metrics, AFD showed significant positive correlations with fatigue (r = 0.403, p < 0.05) and tension (r = 0.466, p < 0.01), while FC also correlated significantly with negative emotions (r = 0.378, p < 0.05). Therefore, function icon design must balance recognition efficiency with visual guidance. This requires enhancing icon intuitiveness to reduce cognitive load while refining information architecture to provide clear visual exploration pathways.
In contrast, passive interactive elements such as architectural light and shadow shows, as well as text boxes, demonstrated more positive emotional enhancement effects. The fixation count (FC) for architectural light and shadow shows was positively correlated with positive emotions, immersion, and a sense of accomplishment, indicating that, as highly arousing visual reward elements, they effectively strengthened users’ emotional engagement [70,71]. Text boxes enhanced immersion and a sense of accomplishment [72]; however, when text density was excessively high, their FC was positively correlated with boredom (p < 0.05), suggesting a need to optimize information presentation [66]. This finding aligns with cognitive load theory, which posits that excessive extraneous load can induce fatigue and boredom [26,27].
Therefore, serious game design should balance both types of interaction elements. First, for active interaction design, given the significant correlation between NPCs’ high information dynamism and negative emotions, the core design focus should be on reducing cognitive friction during interactions. Specifically, implementing a “low-friction interaction redesign” strategy is recommended: abandoning modal pop-ups that disrupt task flow and flow states in favor of non-modal, persistent dialogue bubbles for feedback. Simultaneously, optimizing NPCs’ spatial positioning and viewpoint orientation maintains cognitive consistency within the virtual environment. Thus, active interactions must prioritize operational fluidity and timely feedback to prevent negative emotions arising from interaction barriers. Second, passive interaction design should fully leverage its strengths in atmosphere creation and emotional motivation while reasonably controlling information load. This complements active interaction, synergistically enhancing the overall experience’s coherence and enjoyment.
4.4. Limitations and Future Work
This study has certain limitations. It did not include a delayed post-test one to two weeks after the intervention to assess delayed retention. Both the pre-test and post-test were completed within the same experimental session with an interval of approximately 30 min, so the results primarily reflect immediate learning gains. Future research could incorporate delayed retention assessments and, building on them, tasks that measure learning transfer. Regarding sample adequacy, while N = 30 aligns with typical eye-tracking experiment sample sizes, the complex experimental design involving both eye-tracking data collection and interviews prevented large-scale sampling and rigorous a priori power analysis. Consequently, interpretations rely primarily on effect size measures such as Cohen’s d and confidence intervals, rather than on p-values alone. Future research may explore streamlining experimental procedures to increase sample size, thereby validating the robustness of this study’s conclusions.
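To make the reliance on confidence intervals concrete, the following Python sketch computes an approximate 95% interval for a Pearson correlation using the Fisher z-transformation. It is only an illustration: the function name `pearson_ci` is hypothetical, and the example assumes that all 30 participants contributed to the correlation, which may not hold for every analysis in the study.

```python
import math


def pearson_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for a Pearson r via the Fisher z-transformation."""
    if n <= 3:
        raise ValueError("n must exceed 3 for the Fisher approximation")
    z = math.atanh(r)             # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)   # approximate standard error of z
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper


# Example: a correlation of r = 0.442, assuming n = 30 paired observations.
low, high = pearson_ci(0.442, 30)
```

Under these assumptions the interval for r = 0.442 spans roughly 0.1 to 0.7, which illustrates how imprecise a single correlation estimate is at this sample size.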
This study has limitations regarding the system’s accessibility and universal design. The current game prototype was developed primarily for experimental purposes, prioritizing variable control and standardized information presentation within the experimental context. Consequently, the system does not yet integrate configurable accessibility features such as subtitle toggles, color schemes optimized for users with color vision deficiencies, or font and interface scaling controls adaptable to diverse visual needs. Future practice-oriented research should reference established accessibility design standards and best practices to systematically implement features like synchronized subtitles, optimized color contrast, and adjustable interface elements. These should be tested and validated across broader, more diverse user groups to enhance the fairness and universal value of digital cultural heritage experiences.
Nevertheless, this study ensured internal validity through specific sample controls. First, regarding sample composition, while participants were primarily recruited from a university setting (N = 30), their demographic characteristics exhibited internal diversity rather than representing a homogeneous group of young students. Participants ranged in age from 18 to 45, with mature individuals aged 26 and above constituting 70% of the sample; occupational backgrounds encompassed educators (26.7%), corporate employees (16.7%), and graduate students (56.6%). This cohort represents “knowledge-based users” possessing high digital literacy and cultural curiosity, constituting the core target audience for serious games on cultural heritage and immersive cultural tourism. Additionally, participants’ gaming habits spanned the full spectrum from “daily gamers” (20%) to “rarely play” (37%). Consequently, the study’s conclusions possess high ecological validity in explaining the cognitive and emotional patterns of this core user group.
Second, the study controlled for the “cultural familiarity” variable in its design, as all participants had prior experience visiting the Historic Centre of Macau in person. This control effectively minimized confounding interference from “cultural unfamiliarity,” ensuring the experiment focused on testing the impact of the “visual-interaction mechanism” itself on cognition and emotion. While this may somewhat limit the direct generalizability of conclusions to groups without relevant background knowledge, it provides targeted empirical evidence for digital design strategies aimed at enhancing “post-visit experiences” or stimulating “revisit intentions.” Furthermore, while the serious game materials used in this study focused on Macau, the interaction mechanisms it revealed are grounded in universal principles of human cognitive load and affective arousal: poorly designed active interaction elements (e.g., NPCs) can induce cognitive friction, whereas passive visual rewards (e.g., architectural light and shadow shows) effectively drive positive emotions. Consequently, these design principles possess cross-contextual applicability and can inform digital design for other cultural heritage sites, including museums and archeological sites.
Statistical validation also presents limitations. Constrained by the sample size (N = 30) for high-precision eye-tracking experiments, current analyses such as correlation and analyses of variance sufficiently reveal significant associations between variables and meet standard statistical power requirements. However, the sample size remains insufficient to support complex multivariate mediation models like structural equation modeling (SEM) for rigorously quantifying mediating or causal effects across the full “visual–interaction–emotion–learning” pathway. Furthermore, this study was not formally preregistered. Consequently, the findings related to H3–H4 primarily serve as a basis for hypothesis generation and require further validation in future preregistered studies with larger samples.
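Because the H3–H4 evidence rests on a family of exploratory correlations, a false discovery rate procedure is one option for the multiplicity control that future validation would need. The Python sketch below implements the standard Benjamini–Hochberg step-up procedure; the function name and the p-values in the example are hypothetical and are not taken from the study's results.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the BH procedure rejects H0."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for one family of exploratory correlations.
p_vals = [0.004, 0.012, 0.030, 0.041, 0.053, 0.082]
flags = benjamini_hochberg(p_vals)
```

In this made-up family only the two smallest p-values survive correction, even though four fall below the unadjusted 0.05 threshold.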
Regarding the application of information dynamics metrics, interactive game data may exhibit temporal non-stationarity, potentially affecting entropy metric estimates based on global transition matrices. This study employs an intra-trial local stationarity assumption, calculating entropy values per trial and avoiding cross-trial mixed estimation to mitigate bias risks from global non-stationarity. Basic sequence quality control thresholds are also established, such as excluding trials with insufficient effective transitions from calculations, to mitigate estimation biases caused by short sequences. Future research may further incorporate methods like sliding-window entropy analysis or dynamic Markov models to characterize attention transfer structures at finer temporal granularity, thereby enhancing the depth of processing and interpretability for non-stationary data.
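The per-trial estimation and short-sequence exclusion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: it uses a common first-order Markov formulation of gaze entropy, estimates the stationary distribution from the share of transitions leaving each AOI, counts every consecutive pair of AOI visits as a transition, and uses a hypothetical threshold `MIN_TRANSITIONS`. The AOI labels and sequences are made up.

```python
import math
from collections import Counter

MIN_TRANSITIONS = 5  # hypothetical quality-control threshold


def trial_entropies(aoi_sequence, min_transitions=MIN_TRANSITIONS):
    """Return (stationary_entropy, transition_entropy) in bits, or None
    when the trial has too few transitions for a stable estimate."""
    pairs = list(zip(aoi_sequence, aoi_sequence[1:]))
    if len(pairs) < min_transitions:
        return None
    pair_counts = Counter(pairs)
    source_counts = Counter(src for src, _ in pairs)
    total = len(pairs)

    # Stationary entropy: H_s = -sum_i pi_i * log2(pi_i)
    h_stationary = 0.0
    for count in source_counts.values():
        pi = count / total
        h_stationary -= pi * math.log2(pi)

    # Transition entropy: H_t = -sum_i pi_i * sum_j p_ij * log2(p_ij)
    h_transition = 0.0
    for (src, _), count in pair_counts.items():
        pi = source_counts[src] / total
        p_ij = count / source_counts[src]
        h_transition -= pi * p_ij * math.log2(p_ij)
    return h_stationary, h_transition


# Entropy is estimated within each trial, never on pooled sequences.
trials = {
    "trial_1": ["npc", "text_box", "npc", "text_box", "npc", "text_box", "npc"],
    "trial_2": ["npc", "prop"],  # too few transitions, excluded
}
per_trial = {name: trial_entropies(seq) for name, seq in trials.items()}
```

A strictly alternating sequence such as `trial_1` yields a stationary entropy of 1 bit and a transition entropy of 0 bits, since attention is split evenly across two AOIs but every transition is fully predictable. A sliding-window variant would apply the same function to overlapping sub-sequences of each trial.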
Future research may explore the following avenues. First, expand sample sizes and employ structural equation modeling for path and mediation analyses to formally test the specific effects of attention and emotion within the influence mechanism. Second, broaden participant diversity by including users with varying educational backgrounds, age groups, and cultural contexts to validate the applicability of current findings across broader audiences. Additionally, longitudinal experiments could track the long-term impact of serious games on cultural heritage learning. Concurrently, integrating multimodal data such as EEG and heart rate could provide a more comprehensive assessment of user experience. Finally, the theoretical framework should be validated and extended across more diverse cultural contexts and game mechanics. Exploring adaptive game design based on real-time user behavior data could enable personalized learning support.
5. Conclusions
This study employed a mixed-methods approach to systematically investigate how visual elements in cultural heritage serious games influence user interaction behaviors, cognitive resource allocation, and emotional experiences. Using the “Macau Historic Centre Science Popularization System” as an empirical platform, and integrating eye-tracking, behavioral log, questionnaire, and interview data, the following key conclusions were drawn:
Regarding cultural heritage learning performance, this study confirmed the effectiveness of serious games in cultural heritage education. Following the game intervention, users demonstrated significant improvements in cultural knowledge retention and comprehension (p < 0.001), confirming that game designs based on the “exploration–learning–reward” framework promote knowledge internalization and emotional engagement, supporting Hypothesis H1. Interview findings further indicate that the game not only enhanced short-term learning outcomes but also stimulated users’ long-term interest in exploring cultural heritage.
Regarding visual attention allocation, the study identified significant differences in how various visual elements capture user attention. Eye-tracking data revealed that four types of elements (prop objects conveying core task information and cultural narratives, text boxes, historic buildings, and reward-based dynamic feedback elements such as architectural light and shadow shows) exhibited significantly higher TFD and FC values than other interface elements (p < 0.001). Although the effect size for first fixation duration (FFD) was smaller (ηp² = 0.09), it remained statistically significant (p < 0.05), further revealing that users differentiated between function icons and narrative elements during the initial stages of visual processing. Overall results indicate that functionally relevant, information-dense, and dynamically presented visual elements more readily capture users’ cognitive resources, supporting Hypothesis H2.
Regarding the impact of interaction types on emotional experiences, this study reveals distinct mechanisms through which active and passive interaction elements influence user emotional responses. Results indicate that the dynamic nature of information in active interaction elements exhibits an association pattern with emotional arousal consistent with Hypothesis H3, providing exploratory trend evidence and mechanistic clues for H3. The relevant association requires further validation with larger samples and appropriate multiple-comparison control procedures. However, when the interaction design involves complex processes or delayed feedback, users experience cognitive friction that negatively affects immersion. For instance, certain NPC dialogue interactions, characterized by excessive hierarchical levels and unclear guidance, caused operational interruptions and frustration, partially offsetting the positive effects of emotional arousal. Therefore, although active interaction elements play a key role in eliciting emotional responses, their overall user experience outcomes depend heavily on the fluidity and intuitiveness of interaction design. In contrast, passive interaction elements demonstrated significant advantages in promoting positive emotional experiences. Architectural light and shadow shows and text boxes reveal differentiated effects consistent with Hypothesis H4 across experiential metrics such as positive emotions, immersion, and sense of accomplishment, providing exploratory evidence for H4. The robustness of these findings requires further validation through larger samples and more rigorous multiple-comparison controls. While text boxes aided cognitive immersion, their high information density correlated positively with frustration and boredom, indicating the need to optimize information presentation. These findings suggest that in serious game design, atmosphere creation and environmental storytelling are as important as functional operations.
Optimization should focus on simplifying active interaction paths to minimize frustration while enhancing the visual expressiveness and emotional motivation of passive interaction elements. This approach can achieve a balanced integration of functionality and emotional engagement in the user experience.
At both theoretical and practical levels, this study integrates perspectives from cognitive load theory, flow theory, and embodied cognition. By combining eye-tracking data, information dynamics analysis, and user reports, it constructs a “visual elements–interaction behaviors–cognition-emotion” analytical framework, providing a multidimensional methodology and empirical foundation for serious games research in cultural heritage preservation. Methodologically, this study integrates traditional eye-tracking metrics with information dynamics indicators based on transition entropy, pre- and post-test learning performance, the GEQ game experience questionnaire, and semi-structured interview data. This enables multimodal, process-oriented measurement of user experience. Compared to existing research often reliant on single questionnaires or static eye-tracking metrics, it systematically reveals the connections from visual presentation and behavioral responses to subjective experience and learning outcomes. Practically, using the World Heritage site of Macau Historic Centre as a case study, findings indicate that architectural light and shadow shows serve as passive visual elements with high emotional arousal and low interaction burden. Meanwhile, NPC interactions and text box designs require careful balancing between information density, interaction complexity, and cognitive load. Based on these findings, it is recommended that serious game design simplify active interaction processes, enhance the emotional motivation of passive elements, and adopt multimodal approaches to reduce textual cognitive load. These discoveries provide evidence-based, concrete guidance for the visual and interactive design of serious games and digital guide systems in similar heritage contexts.
Furthermore, this study offers significant implications for cultural heritage institutions’ digital policy development. First, in exhibition planning, a “digital-first” strategy can be adopted, positioning serious games as “cognitive scaffolding” for physical exhibitions. Leveraging online experiences to pre-construct visitors’ knowledge schemas enhances the depth of understanding and immersion during offline visits. Second, in content development and procurement, evaluation criteria centered on “cognitive usability” should be established, prioritizing cognitive load management and interaction fluidity. This approach effectively mitigates experience fragmentation and user attrition caused by poorly designed digital content in public cultural facilities.
This study’s limitations include a limited sample size, a relatively homogeneous participant background, and the absence of physiological indicator data. Future research could address these by expanding the sample size, conducting cross-cultural comparisons, and integrating physiological measurement methods such as electroencephalography (EEG) or galvanic skin response (GSR) to further explore the neural mechanisms underlying users’ cognitive and emotional responses. Furthermore, adaptive game mechanisms could be developed to dynamically adjust visual and interactive designs based on real-time user behavior data, thereby achieving personalized learning experiences.