1. Introduction
Narratives play a central role in serious game design because they can transform trivial tasks into enjoyable activities. By embedding players' actions in a narrative, designers lead players to perceive those actions as part of a larger unfolding story rather than as a bare set of mechanics [1,2]. This storytelling technique creates emotional investment and motivation, keeping players engaged for longer. Even simple narrative elements, such as those in Space Invaders, can turn basic gameplay into a series of emotional moments, while the more complex stories of role-playing games such as The Elder Scrolls can draw players into a world of remarkable depth and scope [1].
Studies have identified several core principles of successful narrative design: decentralized narratives that allow alternate engagement routes [3,4]; seamless integration, in which tasks and narrative are tightly interwoven [5,6,7,8]; relatable characters that enhance identification and empathy [9,10,11,12]; and dynamic narratives that respond to player choices and environments [3,7,13,14,15]. All of these factors contribute to participation, motivation, and better learning outcomes.
Although previous research has shown that storytelling can enhance engagement and learning in educational and serious games [16,17], its role in Games-with-a-Purpose (GWAPs), especially those focused on linguistic annotation and on differences in text style or content, is not well studied. GWAPs typically aim to gather descriptive language data across various text types to support computational language modeling. In our research, we focus on narrative texts to determine how their connection to the game world influences the player experience. It is not yet clear whether the format of a narrative text, i.e., whether it is scene-based (associated with the game world) or non-scene-based (unrelated to the game world), affects player engagement and cognitive load. To fill this gap, our study explores how narrative alignment, operationalized through GPT-generated texts incorporated into two game versions (one scene-based, one non-scene-based), impacts player experience in a coreference annotation GWAP, using both quantitative and qualitative methods.
4. Methods
4.1. Experimental Framework
This study investigates the impact of aligning narrative texts with the game environment on user engagement and cognitive load in a gamified annotation task. To collect Arabic coreference annotations, a three-dimensional virtual world game was created, set in a desert-cave environment that draws on the aesthetics of a Middle Eastern old town. Participants responded to thematic matching questions embedded in short narrative passages, allowing us to evaluate the extent to which thematic alignment shapes the experience. The visual space, comprising a desert setting, an old marketplace, and a cave, was kept identical across all versions of the game to control for everything except story matching. With thematic relevance of the narrative text as the only difference between versions, any difference in engagement or cognitive load can be attributed to the text condition, creating a controlled framework for probing the effects of narrative-driven gamification.
Participants received standardized instructions defining the annotation task, e.g., finding words or phrases in a text that describe the same entity (for instance, connecting the ancient city and Al-Zahra). The game interface also featured tutorials familiarizing players with the navigation and annotation mechanics, so that the interface was accessible to both gamers and non-gamers. This methodological setup enabled us to focus on the effect of narrative presentation on player engagement and cognitive effort in a GWAP setting.
4.2. Experimental Design
The research used a between-subjects design, with participants randomly assigned to one of two versions of the game:
Scene-based condition: The annotation tasks were embedded in narrative passages that reflected the desert-cave setting, including tales of a trip across a sandy desert wasteland or a visit to a secret cave.
Non-scene-based condition: The annotation tasks were embedded in narratives with no connection to the environment, such as a story set in a forest or a medieval tower, presented in the same desert-cave setting.
The independent variable was narrative alignment (scene-based versus non-scene-based), and the dependent variables were user engagement (self-reported focus, perceived reward, and other measures) and cognitive load (measured by subjective ratings and task performance indicators). To enable a strong comparison, an A/B test method was employed, which supports causal inference because all other aspects of the game, including visuals, interface, and task difficulty, remained unchanged [35]. The A/B framework is widely recognized as providing statistically valid estimates of the impact of a single design factor on user experience [35]. Participants were randomly assigned to conditions, and each group received the same number of annotation tasks so that comparable datasets were obtained.
4.3. Stroll with a Scroll: A 3D Virtual World Game
Stroll with a Scroll is a virtual world game created specifically to support Arabic NLP work; it uses a treasure-hunt theme to involve players in coreference annotation. The task is to find words or phrases that refer to the same entity (such as the relationship between the city of Cairo and the capital) in narrative texts. The game uses a third-person camera, and players steer an on-screen avatar (arrow keys to move, shift key to change speed) through a three-dimensional environment comprising a fictional ancient town, a bustling bazaar full of colorful stalls, and a desert cave (see Figure 2). Avatars wearing culturally inspired Middle Eastern attire help increase immersion. A red, yellow, and green navigation system signals proximity to concealed chests, guiding players through the world.
Upon reaching a chest, players opened an in-game scroll UI that displayed a narrative passage with an embedded annotation task (Arabic excerpt) and the labeling interface. Each play session presented two annotation passages in sequence, with all passages generated and curated as detailed in Section 4.4. The first passage appeared on a scroll in the ancient town. After completing it, participants followed the color-coded navigation system through the bazaar to the cave, where the second passage was presented on a new scroll. These passages appear in two forms: scene-based texts, which reference the in-game environment (e.g., describing a desert storm or cave exploration), and non-scene-based texts, which present the same annotation tasks without ties to the visual setting (e.g., describing a forest journey).
The visual environments and interaction loop (navigate → open chest → read scroll → annotate) remained identical across conditions. Players complete the assigned task by identifying coreferential expressions, followed by further exploratory actions. The design combines a treasure-hunt mechanic with linguistic annotation, using visually salient stimuli, in this case illuminated chests, to hold players' interest and increase their motivation.
4.4. LLM-Driven Story Generation and Preprocessing
A large language model (LLM) provided through the OpenAI API, namely GPT-4o (client library version 1.26.0, released and accessed in May 2024), was used to generate the narrative texts because of its strong ability to produce coherent and contextually appropriate Arabic. A Python 3.12.3 script automated the generation process, using a single prompt template that specified the model's role and task together with a detailed input structure. The prompt inputs were: story theme or title, text length, narration perspective, and writing style. These inputs were identical for every generated text: between 400 and 800 words, second-person perspective, a short descriptive style suited to a video game, and a fixed structure (exploration → obstacles → travel journey → reflective ending). The two prompt types differed only in thematic alignment. The first type produced stories in Modern Standard Arabic set in the imaginary city of Al-Zahra (The Radiant City); these narratives portrayed a second-person journey through a desert to a prehistoric cave with two exits and ended with a trip back to the town, and the prompt explicitly required thematic alignment with the desert-cave environment. The second type produced contrasting stories that intentionally departed from the game environment, for example a sea voyage to the Emerald Forest, with rivers, lush foliage, and towers of stone, sharing no thematic overlap with the desert-cave setting.
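The automated generation step can be sketched as follows. The template wording, function names, and defaults are illustrative assumptions rather than the study's exact script; only the stated inputs (theme, length, perspective, style, structure) and the sampling parameters reported below are taken from the text.

```python
# Sketch of the automated story-generation step (hypothetical prompt wording;
# the study's exact template is not reproduced here).

PROMPT_TEMPLATE = (
    "You are a narrative writer for a video game.\n"
    "Write a story in Modern Standard Arabic.\n"
    "Theme/title: {theme}\n"
    "Length: {min_words}-{max_words} words\n"
    "Perspective: {perspective}\n"
    "Style: {style}\n"
    "Structure: exploration -> obstacles -> travel journey -> reflective ending"
)

def build_prompt(theme: str, min_words: int = 400, max_words: int = 800,
                 perspective: str = "second person",
                 style: str = "short, descriptive, video-game-like") -> str:
    """Fill the shared template with the per-story inputs."""
    return PROMPT_TEMPLATE.format(theme=theme, min_words=min_words,
                                  max_words=max_words,
                                  perspective=perspective, style=style)

def generate_story(client, theme: str) -> str:
    """Call GPT-4o with controlled sampling parameters (requires an API key)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": build_prompt(theme)}],
        temperature=0.7,   # balance creativity and coherence
        top_p=0.9,         # lexical diversity without excess randomness
    )
    return response.choices[0].message.content
```

Keeping the template fixed and varying only the theme input mirrors the controlled design: the two prompt types share everything except thematic content.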
The generated XML file contains several layers: to create <baseLayer>, we tokenized the text; for <markableLayer>, we made an additional API call with a prompt asking the model to extract all nouns and places; and for <anaphoraLayer>, we asked the model to extract all coreferent mentions linked to the mentions in the markable layer.
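A minimal sketch of such a layered file might look like the following; the element and attribute names are hypothetical, since the study's exact schema is not shown.

```xml
<!-- Hypothetical sketch of the layered XML structure; element and attribute
     names are illustrative, not the study's exact schema. -->
<document>
  <baseLayer>
    <token id="t1">المدينة</token>
    <token id="t2">القديمة</token>
  </baseLayer>
  <markableLayer>
    <markable id="m1" span="t1..t2" type="place"/>
  </markableLayer>
  <anaphoraLayer>
    <link anaphor="m2" antecedent="m1" relation="coref"/>
  </anaphoraLayer>
</document>
```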
The parameters of both prompt types were held constant: the temperature was set to 0.7 to balance creativity and coherence, and the top-p value was set to 0.9 to maintain lexical diversity without excessive randomness. The generated texts, presented in Figure 3, were stored in XML format containing both the raw narrative text and additional layers, including word segmentation, named entity recognition (NER), and coreference links. Both narratives underwent a thorough manual review against four criteria: (1) thematic consistency of the aligned text with the desert-cave environment; (2) narrative consistency across sentences; (3) linguistic clarity in Arabic; and (4) suitability for accurate mention extraction and coreference annotation. Texts that met all four requirements were then manually checked by the main author and truncated to ensure similarity on measurable attributes such as word count and sentence length, as presented in Table 1. Truncation was performed by removing descriptive passages until both narratives were equal in length while preserving their cohesion and meaning.
To obtain high-quality annotations, several pretrained Arabic Natural Language Processing (NLP) models were tested, including Hugging Face models [36], Gemini, Stanford CoreNLP-Arabic, asafaya/bert-base-arabic, and hatmimoha/arabic-ner, with results obtained through the Hugging Face Inference API. These models performed tokenization, named entity recognition, and coreference resolution on the Arabic texts. However, their outputs were less accurate and less contextually consistent than GPT-4o's when identifying entity mentions in complex stories. GPT-4o was therefore chosen as the main model for both story generation and annotation processing. A Python script was written to automate the pipeline, converting GPT-4o outputs into the XML format and streamlining the subsequent components, tokenization, named entity extraction, and coreference resolution on the Arabic text, together with markable layer generation and anaphora detection. This automation provided uniformity and scalability in preparing the narrative texts for the game.
4.5. Ethical Approval
This study received ethical approval from the Research Ethics Committee at Queen Mary University of London under approval reference number QMERC20.565.DSEECS23.010. All participant data were anonymized to ensure confidentiality; participants were informed about the purpose of the study, and informed consent was obtained from all individuals prior to participation.
5. Experiment 1: A Quantitative Study on Scene-Based vs. Non-Scene-Based Texts and Their Relationship to Engagement and Cognitive Load
In this section, we present our quantitative study comparing scene-based and non-scene-based texts. This study examines the effects of these text types within an in-game environment, focusing on their impact on player experience and engagement. The following sections detail the experimental design, materials, and procedures used to evaluate these text types.
5.1. Participants
A power analysis was performed using G*Power to calculate the sample size required for a t-test with α = 0.05, power of 80%, and a large effect size [33]; a sample of 52 was required to achieve this statistical power.
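As a rough check on this figure, a normal-approximation sketch of the same calculation (our own simplification; G*Power uses the exact noncentral-t computation, which yields 26 per group and hence 52 in total) is:

```python
import math

# Standard normal quantiles (hard-coded to avoid external dependencies)
Z_ALPHA_2 = 1.959964  # two-sided alpha = 0.05
Z_BETA = 0.841621     # power = 0.80

def n_per_group(d: float) -> int:
    """Approximate per-group n for a two-sided independent-samples t-test."""
    return math.ceil(2 * ((Z_ALPHA_2 + Z_BETA) / d) ** 2)

# Cohen's "large" effect size, d = 0.8: this approximation gives 25 per
# group; G*Power's exact noncentral-t answer is 26 per group (52 total).
```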
The final sample size was 80 participants, with 53.8% females and 46.3% males, with the majority in the 25–34-year age group (41.3%) and then 28.7% in the 35–44 age group, 17.5% in the 18–24 age group, 10% in the 45–54 age group, and only 2.5% aged 55 years or above.
5.2. Materials and Measures
The User Engagement Scale (UES) was selected to measure engagement [37], as it is one of the most widely validated instruments for assessing engagement with digital systems. We used the short form (UES-SF) [37], which retains the validated factor structure of the original while reducing participant burden through a shorter questionnaire. Each subscale comprises three items rated on a five-point Likert scale. The UES-SF measures engagement along four major dimensions: Focused Attention (FA), an indicator of immersion and concentration; Perceived Usability (PU), an indicator of ease of use and frustration levels; Aesthetic Appeal (AE), an indicator of visual and sensory attractiveness; and Reward (RW), an indicator of perceived benefit or satisfaction with the experience. For hypothesis testing, a single summative Engagement score was calculated by averaging all UES-SF items across these subscales; this summative score served as the primary metric for evaluating H1. Individual subscale analyses were conducted for supplementary insight only; they were considered exploratory, were not treated as independent hypothesis tests, and were not adjusted for multiple comparisons.
Internal consistency was good for every subscale: FA (ω = 0.82), PU (ω = 0.86), AE (ω = 0.84), and RW (ω = 0.81), and the scale has demonstrated construct and face validity [37]. Each subscale is scored as the mean of its items, and the overall UES score is the mean of the subscale scores.
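The scoring scheme just described can be sketched as follows; the item values are hypothetical.

```python
# Sketch of UES-SF scoring: each subscale is the mean of its three items,
# and the summative Engagement score (used for H1) is the mean of the four
# subscale scores. Item values below are hypothetical.
from statistics import mean

responses = {               # 5-point Likert items per subscale
    "FA": [4, 3, 4],        # Focused Attention
    "PU": [5, 4, 4],        # Perceived Usability
    "AE": [3, 3, 4],        # Aesthetic Appeal
    "RW": [4, 4, 5],        # Reward
}

subscale_scores = {k: mean(v) for k, v in responses.items()}
engagement = mean(subscale_scores.values())
```

Because every subscale has the same number of items, the mean of the subscale means equals the mean over all twelve items, so both descriptions of the summative score coincide.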
Subjective cognitive load was measured with NASA-TLX [38,39], which assesses six dimensions: mental demand (MD), physical demand (PD), temporal demand (TD), performance (P), effort (E), and frustration (F). Each dimension is scored from 0 to 100 in increments of 5, indicating the physical and mental load the participant may have experienced [38]. These dimensions cover the different sources of cognitive workload across a wide range of tasks and have been validated repeatedly [39,40,41]. For hypothesis testing (H2), a single overall workload score was calculated by averaging the scores across all six dimensions; this summative score served as the primary measure of subjective cognitive load. Individual subscale results were examined for descriptive context only; they were considered exploratory, were not treated as separate hypothesis tests, and were not adjusted for multiple comparisons.
This study used the unweighted version of NASA-TLX. The decision was informed by a previous study [42] reporting a high correlation (r = 0.94) between weighted and unweighted TLX scores, with no significant difference between the two procedures. These results are consistent with earlier reports, including Hill et al. [43], who found that weighted scoring is unnecessary in certain circumstances. Moreover, Moroney et al. [42] showed that a 15 min delay in the provision of ratings produces results similar to those obtained in previous studies.
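Raw (unweighted) TLX scoring as used here reduces to a simple mean of the six ratings; a sketch with hypothetical ratings:

```python
# Sketch of unweighted (raw) NASA-TLX scoring: each of the six dimensions is
# rated 0-100 in steps of 5, and the overall workload score (used for H2) is
# their unweighted mean. Ratings below are hypothetical.
from statistics import mean

tlx = {"MD": 40, "PD": 10, "TD": 25, "P": 20, "E": 35, "F": 15}

# Validate the rating scale before aggregating
assert all(0 <= v <= 100 and v % 5 == 0 for v in tlx.values())

overall_workload = mean(tlx.values())   # raw TLX: all dimensions weighted equally
```

The weighted variant would instead multiply each rating by a pairwise-comparison weight; as noted above, the two procedures correlate at r = 0.94, which motivates the simpler raw form.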
This study relied on participants' subjective responses; no objective measures were used. Future studies might therefore examine, for example, time on task or annotation accuracy against gold-standard data alongside subjective measures to aid interpretation.
5.3. Procedure
Participants were recruited through Prolific; the screening criteria required Arabic as the first language and fluency in English. Before the study began, participants read and signed a consent form confirming their understanding of the study's purpose and the voluntary nature of their participation. Unique identifiers in the data tracking ensured anonymity. Participants were paid at the UK minimum wage and could withdraw at any time.
The survey collected demographic data to provide background for the analysis. User engagement and cognitive load were assessed with standardized measures: the User Engagement Scale (Short Form) and NASA-TLX. The procedure followed ethical guidelines and provided a holistic view of participants' experiences.
5.4. Data Analysis and Design
IBM SPSS Statistics (version 29) was used to analyze the data. Descriptive statistics, the mean, mode, median, and range, were calculated for the overall sample and for each participant group. Normality was checked with the Shapiro–Wilk test, and, where required by specific tests, Levene's test was used to assess homogeneity of variances. To compare engagement and cognitive load between the scene-based and non-scene-based groups, the independent t-test was used for normally distributed data. For variables where equal variances could not be assumed, particularly non-normally distributed data, the Mann–Whitney rank sum test was used. For all tests, the significance level was set at 0.05; any p-value below this threshold was considered significant. This combination of tests ensured comprehensive coverage and reliable conclusions.
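The same decision logic can be sketched in Python with scipy (the study itself used SPSS 29); the data below are synthetic and the function name is our own.

```python
# Sketch of the normality-then-test decision logic; synthetic data only.
import random
from scipy import stats

random.seed(0)
scene = [random.gauss(3.4, 0.4) for _ in range(40)]
non_scene = [random.gauss(3.3, 0.4) for _ in range(40)]

def compare_groups(a, b, alpha=0.05):
    """Shapiro-Wilk for normality, then independent t-test or Mann-Whitney."""
    normal = (stats.shapiro(a).pvalue > alpha and
              stats.shapiro(b).pvalue > alpha)
    if normal:
        _ = stats.levene(a, b)              # check variance homogeneity
        result = stats.ttest_ind(a, b)      # independent-samples t-test
        test_name = "t-test"
    else:
        result = stats.mannwhitneyu(a, b)   # rank-based alternative
        test_name = "Mann-Whitney"
    return test_name, result.pvalue

test_name, p = compare_groups(scene, non_scene)
```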
5.5. Results
The collected dataset was first preprocessed to enhance reliability. No data-entry problems were found; all records were complete, with no missing values. Boxplots of each variable were visually examined for outliers, and none were found, validating the dataset for the statistical analysis that followed the exploratory phase. These measures supported the methodological and statistical soundness of the results and guarded against biased conclusions.
Table 2 provides the descriptive statistics of all variables. The initial four variables are related to the User Engagement Scale, which involves different aspects of user engagement, and the six additional variables are associated with the NASA-TLX dimensions that measure cognitive load in a variety of task factors. This table provides a summary of the most important measures discussed in the research.
5.6. Scene-Based Analysis
This section presents the scene-based analysis results and the relationship between thematic alignment and both user engagement and cognitive load.
Table 3 presents descriptive statistics for all study variables, separated into the scene-based and non-scene-based groups for clear comparison. The Shapiro–Wilk test was used to check normality: FA (p = 0.653), AE (p = 0.206), and Mental Demand (p = 0.154) are normally distributed and satisfy the equality-of-variances assumption, whereas all other variables are skewed (p < 0.05).
In Table 4, independent t-tests compared the engagement subscales Focused Attention and Aesthetic Appeal, as well as the cognitive load dimension Mental Demand, between the scene-based and non-scene-based groups. These comparisons were treated as exploratory, were not used to directly test the hypotheses, and were therefore not adjusted for multiple comparisons. The Focused Attention score was significantly higher in the scene-based group (M = 3.44, SD = 0.78) than in the non-scene-based group (M = 3.03, SD = 0.63), t(78) = 2.58, p = 0.012, Cohen's d = 0.58. Neither Aesthetic Appeal nor Mental Demand differed significantly between the groups, indicating that thematic alignment did not affect these variables.
Mann–Whitney tests for the skewed variables are shown in Table 5; these too were treated as exploratory, were not used to directly test the hypotheses, and were not adjusted for multiple comparisons. They showed no significant differences between the scene-based and non-scene-based groups, except for the Reward variable, where thematic alignment corresponded to significantly higher perceived satisfaction: the RW score was higher in the scene-based group (Mdn = 3.33, IQR = 1.33) than in the non-scene-based group (Mdn = 2.5, IQR = 1.33), U = 1011.5, p = 0.040, with effect size r = 0.26.
An independent-samples t-test was conducted (Table 6) to compare engagement and cognitive load between the scene-based and non-scene-based conditions and to test our two hypotheses. For engagement, participants in the scene-based condition (M = 3.47, SD = 0.37) reported significantly higher scores than those in the non-scene-based condition (M = 3.26, SD = 0.41), t(78) = 2.43, p = 0.017, 95% CI [0.039, 0.386], indicating a moderate effect (Cohen's d = 0.54). This suggests that scene-based content, thematically aligned with the game, enhanced users' engagement during the task. In contrast, for cognitive load, the difference between the scene-based (M = 28.69, SD = 6.95) and non-scene-based (M = 31.63, SD = 7.38) conditions was not statistically significant, t(78) = −1.83, p = 0.071, 95% CI [−6.13, 0.25]. The effect size, Cohen's d = −0.41, indicates a small-to-moderate effect, with the scene-based group reporting lower load than the non-aligned group.
While there was a trend toward lower cognitive load in the scene-based condition, this difference did not reach significance.
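The reported effect sizes can be recovered from the t statistics. Assuming the 80 participants split evenly (40 per group, consistent with df = 78), Cohen's d for an independent-samples t-test is d = t·√(1/n₁ + 1/n₂):

```python
import math

def cohens_d_from_t(t: float, n1: int, n2: int) -> float:
    """Cohen's d recovered from an independent-samples t statistic."""
    return t * math.sqrt(1 / n1 + 1 / n2)

# With n1 = n2 = 40 (assumed equal split of the 80 participants):
# Engagement:        t(78) =  2.43 -> d ≈  0.54
# Focused Attention: t(78) =  2.58 -> d ≈  0.58
# Cognitive load:    t(78) = -1.83 -> d ≈ -0.41
```

The agreement with the reported values (0.54, 0.58, −0.41) supports the assumed equal group sizes.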
5.7. Discussion
The results show that scene-based texts significantly increased engagement compared to non-scene-based texts. Participants in the scene-based condition gave higher Reward scores, reflecting greater enjoyment and intrinsic motivation, consistent with previous studies emphasizing the relevance of context to immersion [30,31,32]. The scene-based texts also yielded higher Focused Attention scores, suggesting that players were more absorbed when the narrative was congruent with the game's desert-cave setting. However, the effect sizes were small to medium, meaning that thematic alignment can improve engagement, but only modestly. Such modest gains can still help sustain attention in large-scale annotation tasks, but designers should also consider other factors affecting attention and cognitive load. In addition, the observed engagement effects are moderate in size and context-dependent, and results from short-session annotation tasks should not be overgeneralized to longer narrative serious game experiences.
Even though participants in the scene-based condition reported a lower mean cognitive load, including lower mental effort and frustration, the difference did not reach statistical significance (p = 0.071). We therefore cannot yet conclude that scene-based texts decrease cognitive load.
These findings accord with [20], which suggests that a lack of coherence or realism acts as a barrier to immersion: matching the text with the visual location likely minimized the need to re-orient, making it easier to remain engaged. Consistent with [21], participants' descriptions corresponded to the immersion indicators of Focused Attention and involvement, and this pattern is echoed in our quantitative results (higher Focused Attention and Reward in the aligned condition).
6. Experiment 2: A Qualitative Study on Scene-Based vs. Non-Scene-Based Texts and Their Relationship with Engagement and Cognitive Load
The quantitative study showed that thematic alignment contributes to the user experience. To explore this further, a qualitative follow-up study was designed to examine these relationships more closely.
6.1. Participants
Eight subjects were recruited, including four in the scene-based (Players 3, 4, 5, and 6) and four in the non-scene-based (Players 1, 2, 7, and 8) condition.
6.2. Measures and Procedure
This part outlines the open-ended questions used to capture qualitative information about engagement, cognitive load, and text-specific feedback within the gamified annotation exercise. The questions aim to capture participants' experiences of reading scene-based and non-scene-based texts, with a focus on their emotional reactions, perceived complexity, and narrative style preferences.
6.2.1. Engagement
What was your experience during the process of reading and annotating the text?
Follow up: Did you feel engaged, disengaged, or indifferent? What affected your emotions?
Was the type of text interesting to you?
Follow-up: What did you find attractive or unattractive about it (story content, style of writing, etc.)?
Would you have the motivation to play the game, using this type of text?
Follow-up: What prompts you to continue or not?
6.2.2. Cognitive Load
How difficult was it to annotate the text?
Follow-up: What text characteristics made the task easier or more challenging?
What was the extent of mental effort needed to comprehend and label the text?
Did you feel overwhelmed or tired at any time?
Follow-up: What in the text do you think evoked that feeling?
6.2.3. Text-Specific Feedback
Did the text affect how you carried out the annotation exercise?
What was your favorite or least favorite thing about the style and the content of the text?
Follow-up: What is the influence of these factors on your experience?
How might the text be shortened or edited to make it more interesting or easier to annotate?
Follow-up: Why do you think these changes would improve your experience?
6.3. Results
To explore participants’ experiences and perceptions, we conducted a thematic analysis of the think-aloud protocols and interview data, following Braun and Clarke’s reflexive thematic analysis framework [
43,
44,
45].
The data were analyzed by the author alone, without inter-coder reliability checks; while this does not invalidate the findings, it should be acknowledged as a limitation. The data were first transcribed and then coded into an affinity diagram, from which themes emerged; quotes were translated into English where necessary. Sessions were conducted face-to-face and synchronously. After sessions with two participants in each group, three recurring patterns and insights were identified. No additional themes emerged from the remaining participants, indicating thematic saturation; no new perspectives appeared in the last four interviews, which was considered sufficient for the study. This process led to three central themes that capture the key aspects of user engagement and task interaction within the game. These themes are presented below, each supported by illustrative quotes from the participants.
6.3.1. Theme 1: Contextual Relevance Enhances Engagement
Respondents reported greater engagement when the text matched the gaming environment. Participants in the scene-based condition were highly interested; one said, “I lived the story by going to the locations where it was set and behaving like the protagonist of the narrative, in which case I traveled through the town to the treasure location in the cave”. Similarly, Player 5 liked the combination of text and environment, remarking, “You play the story in a very rich environment… I wasn’t bored, the story was interesting.” Such reactions indicate that scene-congruent narratives may contribute to a more engaging, game-like reading process. Clarity of language and vivid description also supported the immersive experience; in the words of Player 3, “The language was simple, and the description of the place made it intriguing… it’s thrilling to read similar stories to know how they would end.”
By contrast, participants in the non-scene-based group often reported disengagement, especially when the subject matter seemed irrelevant to the game world. Player 1 mentioned, “I was bored… It took me time to understand it.” Player 2 had a similar opinion: “I liked visiting places in the game, but the content wasn’t engaging,” expressing clear dissatisfaction with the text content. Player 7, who read a non-scene-based text, was interested at first but noted, “I was interested at the start, but then I got bored.” This decrease in engagement seems to stem not only from the lack of thematic relevance but also from the design of the experience. The game presented the story in two different places, and to continue reading, participants had to physically move to the second location. Such a break in the flow of the story, combined with a narrative irrelevant to the game setting, could have disrupted attention and recall of earlier information; the second part of the story then felt disconnected and less interesting, so the player may have lost emotional involvement and found it harder to become immersed, leading to boredom. In addition, Player 8 felt that the disconnection between text and context restricted narrative engagement.
6.3.2. Theme 2: Scene Integration Eases Cognitive Processing
Scene-based texts were consistently described as less mentally demanding. Participants found the reading and labelling activities smooth and intuitive. Player 3 stated, “It was easy… the story was exciting, and I was focused,” emphasizing how the narrative pulled attention in and made comprehension almost effortless. Player 4 expressed similar sentiments, saying, “The answers were very clear… I didn’t feel overwhelmed because the text was very clear.” Player 5 reported minimal cognitive strain, saying, “Almost no effort, it was very easy,” and even noted that the coherence of the text helped him stay on task: “The text is very connected and to the point, so I didn’t get distracted.”
Unlike the scene-based texts, non-scene-based texts tended to interfere with attentional focus and create friction. Player 1 described the effort involved as moderate and reported getting distracted while reading: “I am easily distracted when reading. I was merely responding and scanning it rather than reading it comprehensively.” Although Player 2 described the text as easy to understand, he noted a lack of motivation: “You work hard when you like the thing… I liked the game; however, the text was distracting”. Fatigue also emerged for some participants: Player 2 admitted feeling “mentally tired at the end,” and Player 7 recalled “feeling overwhelmed at the end of the story.” While the texts may have been structurally simple, this suggests that the lack of thematic connection may have contributed to reduced focus.
6.3.3. Theme 3: Narrative Coherence Shapes Labelling Approach
Participants’ approaches to the labelling tasks differed according to how clearly the text conveyed its meaning and how well it fit the game environment. Respondents in the scene-based condition felt that the stories supported their labelling decisions by making the content smoother to read. Player 3 reported that the story was exciting and kept them concentrated, which points to high engagement predisposing them to be attentive during annotation. Likewise, Player 5 explained that the text was “very connected and to the point,” so they did not get distracted, suggesting that combining game context with narrative clarity helps minimize the mental friction associated with switching between reading and annotation. Player 4 likewise asserted that the task was very clear, making the labelling almost automatic.
Conversely, participants who dealt with non-scene-based texts tended to become less attentive. Player 1 confessed to skimming and answering without reading the text through, indicating a lack of comprehension and interest. Though Player 8 found the language simple, they also noted that the game interrupted the reading process by requiring them to move to a different in-game location in order to resume the narrative. Disengagement during the annotation process was often associated with the absence of thematic coherence and a sense of narrative continuity in the non-scene-based texts.
6.4. Discussion
The qualitative findings suggest that when the story matched the scene (scene-based), players may have felt more absorbed and focused; when it did not (non-scene-based), some reported boredom and frustration. We treat these as impressions rather than proof of mechanism and consider possible alternative explanations (for example, topic interest, fatigue, or breaks in narrative flow between locations).
7. Discussion
In this section, we discuss the findings and address our research hypotheses based on the results obtained.
H1: Scene-based text will significantly enhance user engagement compared to non-scene-based text by creating immersive and contextually relevant narratives.
To test this hypothesis, an overall Engagement score was calculated by summing all twelve User Engagement Scale—Short Form (UES-SF) item scores—spanning Focused Attention, Perceived Usability, Aesthetic Appeal, and Reward Factor—and dividing by twelve. The subscales are treated as exploratory and are not used to directly test the hypotheses.
Because the Engagement scores were normally distributed, an independent-samples t-test was conducted; it yielded p = 0.017. Since this p-value is below the standard significance threshold of 0.05, the null hypothesis was rejected, indicating a statistically significant difference in overall engagement between conditions.
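As an illustration, the scoring and test-selection procedure described above could be implemented as in the following sketch. The function names and the explicit Shapiro–Wilk normality check are our assumptions; the paper states only that the values were normally distributed and a t-test was used.

```python
import numpy as np
from scipy import stats


def overall_engagement(item_scores):
    """Average the 12 UES-SF item scores (Focused Attention, Perceived
    Usability, Aesthetic Appeal, Reward Factor) into one Engagement score."""
    scores = np.asarray(item_scores, dtype=float)
    assert scores.shape[-1] == 12, "UES-SF short form has 12 items"
    return scores.mean(axis=-1)


def compare_groups(scene, non_scene, alpha=0.05):
    """Use an independent-samples t-test when both groups pass a
    Shapiro-Wilk normality check; otherwise fall back to Mann-Whitney U."""
    normal = (stats.shapiro(scene).pvalue > alpha
              and stats.shapiro(non_scene).pvalue > alpha)
    if normal:
        return "t-test", stats.ttest_ind(scene, non_scene).pvalue
    return "mann-whitney", stats.mannwhitneyu(scene, non_scene).pvalue
```

For example, `overall_engagement([3] * 12)` returns `3.0`, and `compare_groups` would be called once per outcome with the two groups' per-participant scores.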
The observed significance in overall engagement was driven mainly by significant effects in several subscales of the User Engagement Scale. Specifically, significant differences were found in Focused Attention and Reward Factor, which contributed substantially to the overall engagement result. Focused Attention was significantly higher in the scene-based group (M = 3.44, SD = 0.78) than in the non-scene-based group (M = 3.03, SD = 0.63), t(78) = 2.58, p = 0.012, Cohen’s d = 0.58. The Reward Factor (RW) was also significantly higher in the scene-based group (Mdn = 3.33, IQR = 1.33) than in the non-scene-based group (Mdn = 2.5, IQR = 1.33), U = 1011.5, p = 0.040, with an effect size of 0.26.
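For transparency, the two effect sizes above can be recomputed from the reported summary statistics alone. The sketch below assumes equal groups of n = 40 each (inferred from t(78) with two groups) and uses the standard pooled-SD formula for Cohen's d and the rank-biserial correlation for the Mann-Whitney U effect size; the function names are ours.

```python
import math


def cohens_d(m1, sd1, m2, sd2):
    """Cohen's d from group means and SDs, using the pooled SD
    (equal group sizes assumed)."""
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (m1 - m2) / pooled_sd


def rank_biserial(u, n1, n2):
    """Rank-biserial correlation from a Mann-Whitney U statistic."""
    return 1 - (2 * u) / (n1 * n2)


# Focused Attention: scene-based (M=3.44, SD=0.78) vs non-scene-based (M=3.03, SD=0.63)
print(round(cohens_d(3.44, 0.78, 3.03, 0.63), 2))    # 0.58
# Reward Factor: U = 1011.5 with n = 40 per group
print(round(abs(rank_biserial(1011.5, 40, 40)), 2))  # 0.26
```

Both values match the statistics reported in the text, which supports the equal-groups reading of t(78).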
For Aesthetic Appeal, the overall outcome was not significantly different, although there was a slight leaning towards the scene-based group in the mean score of all three items (see Table 3) and in the score of each item separately. This implies that theme-based text reading could have produced this minor variation, although the visual scene remained constant across conditions. The lack of a significant difference is plausible, since aesthetic appeal relates mostly to the game’s visual and sensory elements rather than the text. Perceived Usability likewise showed no statistically significant difference between the scene-based and non-scene-based groups, indicating that usability perceptions were comparable across conditions.
Participants who were presented with scene-based texts consistently reported greater engagement, describing the stories as immersive and consistent with the game environment. Their reactions underscored the importance of environmental relatability for maintaining interest and involvement. Several participants expressed interest in the plot and, consequently, a desire to engage further with the story, which encouraged them to read more and to spend more time reading. These results indicate that narrative coherence and topical relevance are important factors in user experience and interaction. In contrast, participants who received non-scene-based texts reported disengagement and dissatisfaction. Many saw the content as irrelevant to the context of the game, which may be one reason why attention and the desire to continue decreased. Although a few admitted that they read the texts, they did not engage meaningfully or sustain their reading. Suggestions for improvement included incorporating visuals (context) or more relevant information, supporting the idea that narrative isolation may be counterproductive for the overall user experience.
H2: Scene-based text is expected to reduce cognitive load by providing thematic coherence and minimizing extraneous mental effort compared to non-scene-based text.
An overall subjective workload score was calculated by summing all NASA Task Load Index (NASA-TLX) item scores—spanning Mental Demand, Physical Demand, Temporal Demand, Performance, Effort, and Frustration—and dividing by six. Because the values were normally distributed, a t-test was conducted; it yielded p = 0.071. Since this p-value exceeds the conventional significance threshold of 0.05, we fail to reject the null hypothesis, suggesting that the results do not provide sufficient evidence for a difference between the conditions.
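This unweighted averaging of the six subscales corresponds to the common "Raw TLX" variant of the instrument; a minimal sketch (the function name is ours):

```python
import numpy as np


def overall_workload(tlx_ratings):
    """Unweighted ("Raw TLX") workload score: average the six NASA-TLX
    subscale ratings (Mental Demand, Physical Demand, Temporal Demand,
    Performance, Effort, Frustration)."""
    ratings = np.asarray(tlx_ratings, dtype=float)
    assert ratings.shape[-1] == 6, "NASA-TLX has six subscales"
    return ratings.mean(axis=-1)


# Hypothetical example: one participant's six subscale ratings.
print(overall_workload([30, 20, 40, 10, 50, 30]))  # 30.0
```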
The Mental Demand subscale of the NASA-TLX showed no significant difference (p = 0.060). However, the mean mental load score for the scene-based group was 30.63, compared to 37.25 for the non-scene-based group, hinting that scene-based texts may reduce cognitive demands.
The Frustration subscale of the NASA-TLX yielded a p-value of 0.051, which is not statistically significant. These results are in line with PU1 of the UES-SF, the item that specifically measures frustration: the average score in the scene-based condition was 1.8, compared with 2.76 in the non-scene-based condition. PU1’s p-value of 0.077 likewise suggests that scene-based text may help create a less straining user experience, although this difference is also not statistically significant.
The NASA-TLX subscales also showed no significant differences in Physical Demand (p = 0.961), Temporal Demand (p = 0.625), or Effort, indicating that perceived effort did not differ between the scene-based and non-scene-based text conditions. For Performance, where smaller scores imply better performance, the scene-based group had a lower mean rank (25.00) than the non-scene-based group (35.00), with p = 0.075, indicating no significant difference in perceived performance.
These results are further illuminated by the qualitative responses. Participants who were shown the scene-based texts found the task easy and straightforward, and many stated that they remained focused because of the relevance and clarity of the content. Fatigue and confusion were rarely reported. For example, one respondent stated that they were not distracted, and another felt motivated to carry on.
Conversely, participants who read non-scene-based texts reported instances of boredom and lapses in concentration, such that they had to exert more mental effort and repeatedly refocus. Some reported an energy decline or loss of concentration, particularly when the texts did not seem related to the game. One participant mentioned that having to move to new locations interrupted their reading of the unrelated text. Others expressed no interest in putting effort in and said that the material was not interesting enough to hold their focus.
Although overall cognitive load did not differ significantly between conditions, certain NASA-TLX subscales, in particular Mental Demand and Frustration, showed noteworthy trends in mean values. The scene-based group scored lower on these subscales, which may imply that thematic coherence and contextual integration help reduce mental effort. The qualitative feedback strongly supports these trends, as participants in many instances reported that scene-based tasks were easier and kept them more focused. Even though the overall difference in cognitive load was not significant, these subscale patterns may shed light on ways in which scene-based content can alleviate certain aspects of mental load.
8. Conclusions
To test how different narrative alignments influenced players’ engagement and cognitive experience in a game environment, a customized 3D gaming environment was created with distinct zones and two narrative conditions embedded in the desert-cave world: a scene-based narrative aligned with the environment and a non-aligned narrative. We found that the scene-based narrative significantly improved user engagement, especially Focused Attention and Reward, implying that players felt more immersed and motivated when narratives corresponded to the visual setting. These statistically significant but small-to-medium effects suggest that thematic alignment provides a meaningful, though not transformative, engagement benefit that can help sustain attention during large-scale annotation tasks. This implies that narrative compatibility is one of a number of factors affecting engagement.
Although there were no statistically significant differences in overall cognitive load between the scene- and non-scene-based conditions, trends in Mental Demand and Frustration favored the scene-based texts. This was echoed by the qualitative responses, in which participants described the scene-based narratives as more engaging and easier to follow and felt that they created a sense of participation in the story. By contrast, non-scene-based texts were perceived as fragmented and less interesting, which decreased engagement. These findings highlight the importance of a follow-up study, as our short experimental session and limited stimuli may have reduced our ability to detect significant effects. Future work could have participants spend longer on the task, since extended time on task is essential for complex linguistic processes such as coreference resolution, and could test more clearly unrelated text types, e.g., news articles or research documents. In addition, variables such as prior gaming experience and familiarity with annotation were not investigated, so future research should examine the effect of such factors on engagement and cognitive workload. Such studies may also refine how narratives are used to gamify linguistic annotation activities.