1. Introduction
Enabling machines to comprehend human language’s full richness, ambiguity, and flexibility remains a central challenge in artificial intelligence and human–computer interaction (HCI). While recent advances in natural language processing (NLP) have spurred impressive progress, the question of how to systematically evaluate and deepen machine semantic understanding, beyond surface-level syntax to context, nuance, and communicative intent, remains unresolved. Addressing this gap calls for innovative testbeds that expose systems to unpredictable, creative, and open-ended human inputs, requiring real-time interpretation and adaptation.
In this work, we introduce Semantrix, an interactive semantic word-guessing game designed not merely for entertainment but as a platform to probe and accelerate the development of machine semantic understanding. The game situates state-of-the-art Transformer models in a dynamic, user-driven environment where every user’s guess and system response become a data point for the real-time study of ambiguity, context, and adaptive language processing. In Semantrix, machine learning models must not only score and evaluate user guesses for semantic proximity but also generate personalised, context-sensitive hints, requiring both deep comprehension and generative skill.
Word games and puzzle-based interfaces have a long tradition in cognitive science as tools for investigating the capacity of humans, and, by extension, artificial agents, to reason, associate, and communicate meaning under uncertainty. We move beyond static benchmarking and toward ecological, interactive assessment by embedding modern language models within such a framework. Whereas previous NLP evaluation tasks have typically relied on fixed datasets or limited response formats, the gameplay context elicits a wide variety of free-form, user-driven language, a crucible for examining machine robustness to error, ambiguity, and creativity. Beyond text-only contexts, recent advances in adaptive representation learning show how meta-learning can improve robustness under domain shift. For example, CAMeL introduces cross-modality adaptive meta-learning for text-based person retrieval, emphasising task diversity and domain-agnostic pretraining to generalise from biased synthetic image–text data to real benchmarks; this underscores the value of adaptation strategies that we pursue here in an interactive, user-facing context [1].
The rise of Transformer architectures such as BERT [2] and GPT-4 [3] has dramatically expanded the capabilities of NLP systems to encode, compare, and generate natural language. Yet, it remains to be seen how these models perform in environments that demand fine-grained semantic judgement, such as assessing the “distance” between two words or ideas and generating instructive, meaningful feedback. Building on successes in grounded language learning [4] and the emerging literature on educational and gamified AI interfaces [5,6], Semantrix merges these two threads: robust semantic modelling and adaptive, generative dialogue in an ecologically valid, user-facing scenario.
Despite these advances, contemporary systems still grapple with interpretive failures: idiomatic, ambiguous, or contextually entangled inputs frequently derail dialogue or hinder effective feedback. In word-guessing games, such challenges are amplified, as users often generate intentionally indirect, metaphorical, or creatively oblique guesses. An effective system must do more than parse tokens; it must infer meaning, respond adaptively, and know when and how to scaffold user learning with timely, personalised hints.
Semantrix thus serves as a living laboratory that supports several interconnected research objectives. First, it enables the evaluation of contemporary Transformer-based models concerning their capacity to model and compare semantic content at the word and short-phrase level, all within realistic, unconstrained contexts. Second, the platform allows one to investigate how generative models can adaptively support user understanding and engagement throughout the interaction when used for dynamic hint production. Finally, Semantrix provides a unique opportunity to study the interplay between algorithmic advances and human factors such as motivation, engagement, and strategy that naturally emerge in digital language games. The platform leverages Sentence Transformers for robust multilingual semantic scoring, alongside large language models such as GPT-4 for generating context- and performance-sensitive hints. Extensive logging and experimental controls permit analysis and ablation across model types, hint strategies, and user behaviours. From an HCI standpoint, unified experience models offer a complementary lens on what interactive AI should optimise. A recent unified model for Haptic Experience synthesises pragmatic, hedonic, and eudaimonic qualities to guide the design of technology-mediated touch. Although our modality is textual/visual, this framework highlights the importance of designing for experiential richness and suggests directions for multimodal hints and accessibility features in future iterations of Semantrix [7].
This paper claims that interactive, game-based frameworks such as Semantrix do not merely entertain; they create new frontiers for studying and refining machine semantic understanding in the wild. The insights from such environments extend beyond gaming, illuminating pathways for advancing more intuitive, adaptive, and human-aware NLP systems across domains.
The remainder of this paper is structured as follows.
Section 2 reviews the state of the art in semantic embedding and generative language models.
Section 3 details the NLP models, natural language generation technologies, and web-deployment frameworks employed in Semantrix.
Section 4 presents the whole design of the semantic guessing game, including game mechanics, feedback processes, and the adaptive hint generation system.
Section 5 describes the integration, architecture, and deployment of the Gradio-based web application.
Section 6 outlines the methodology, experimental design, and participant recruitment.
Section 7 reports the behavioural outcomes and statistical analyses.
Section 8 discusses the results regarding interpretation, limitations, and future directions. Finally,
Section 9 summarises this work’s main contributions and implications.
2. Background: Semantic Embedding and Generative Models
Understanding and operationalising semantic similarity remains a core challenge in NLP and HCI. Semantic word games, such as the one addressed in this work, demand models that can assess proximity between user guesses and secret targets based on minimal input (typically a single word or short phrase), and in multilingual or code-switching contexts. This section surveys classic and state-of-the-art alternatives, offering a comparative perspective and motivating the final selections for our application.
Early approaches to semantic similarity leveraged static distributed embeddings, most notably Word2Vec [8] and GloVe [9]. These models represent each word as a fixed vector, efficiently capturing distributional relationships from large corpora and supporting many languages. However, assigning each word a single embedding regardless of context fails to resolve polysemy or ambiguity, a significant limitation for applications like word games where user inputs can be brief and heavily context-dependent. Furthermore, while multilingual variants exist, their cross-lingual capabilities are typically weaker and less robust, particularly for short or ambiguous queries.
The advent of Transformer-based architectures, such as BERT [2], RoBERTa [10], MPNet [11], and their multilingual derivatives (e.g., XLM-RoBERTa [12]), has enabled a new level of contextualisation in language representations. Embeddings generated by these models are sensitive to the surrounding text, better capturing the nuance required for semantic similarity tasks. Adaptations like Sentence-BERT (SBERT) [13] further tailor these models for short-text or sentence-level comparison, providing more reliable polysemy resolution and significantly improving robustness to ambiguity. Moreover, many of these Transformers have been fine-tuned for word- or sentence-level similarity across multiple languages, which is crucial for mixed-language user contexts. Notably, they lead recent benchmarks such as the Massive Text Embedding Benchmark (MTEB) [14] for tasks relevant to our setting.
In our survey of candidate models for semantic similarity, we considered Word2Vec, for its efficiency and interpretability, although acknowledging its lack of contextual sensitivity, as well as a range of Transformer-based models, including SBERT and paraphrase-multilingual-mpnet-base-v2, which offer advanced contextualisation and strong multilingual support, as demonstrated by recent MTEB results. Additional Transformer variants, such as RoBERTa, XLM-R, and MPNet, with or without further fine-tuning, were also evaluated for potential applicability. Word2Vec and Transformers were compared empirically to assess their impact on user experience and behaviour in the experimental phase.
Beyond modelling semantic similarity, providing real-time, contextually relevant feedback and hints is central to interactive word games. Rule-based or manually scripted responses often prove insufficient: the deterministic nature of pre-written hints restricts the system’s capacity to adapt to unanticipated user actions, making it challenging to offer practical assistance in dynamic scenarios. Moreover, scripted strategies possess limited creativity and variation, which experienced users may exhaust or learn to predict, ultimately reducing engagement and replayability.
The emergence of large autoregressive language models (LLMs), developed by organisations such as OpenAI (e.g., GPT-3, GPT-4 [3]), Google (Gemini 2.5 [15]), Mistral [16], and others, has transformed the capabilities of natural language generation. In numerous languages and formats, these models can produce flexible, high-quality text outputs for various content types, including hints, definitions, analogies, and curiosity facts. Critically, they generate personalised, context-sensitive responses that can dynamically adapt to user actions and learning trajectories, providing supportive feedback that evolves with user experience. This high naturalness has substantially increased user engagement and the perceived value of in-game assistance.
Our pilot testing showed that GPT-4 delivered significant qualitative and practical improvements for dynamic, multilingual, and challenging hint-generation tasks, outperforming earlier models and numerous open-source alternatives. Nevertheless, our selection process also weighed strong contenders from other providers, such as Google Gemini and Mistral’s advanced LLMs, which offer competitive performance and diverse deployment options.
The final system was thus constructed around two main components: for semantic similarity calculation, we selected paraphrase-multilingual-mpnet-base-v2 [13] due to its robustness with short inputs and strong multilingual capabilities; for natural language generation, we adopted GPT-4, balancing high output quality with API accessibility despite higher resource demands. Importantly, our modular platform design preserves flexibility for future experimentation, including potential integration of alternative LLMs, such as Gemini, Mistral, or DeepSeek, as their capabilities evolve.
3. Materials
In exploring the design and implementation of a semantic word-guessing game, we focus on leveraging modern natural language processing techniques within a web-based interface. The primary objective of this section is to present the technical framework, detailing our integration of Transformer-based models for semantic similarity evaluation and dynamic hint generation, all accessed through an interactive Gradio application.
In the Semantrix application, semantic similarity between user guesses and target words is computed using the publicly available paraphrase-multilingual-mpnet-base-v2 Sentence Transformer model, which provides robust, contextualised embeddings suitable for real-time gameplay. No additional fine-tuning was performed; the model is used as released for efficient extraction of semantic relationships from user input, supporting multilingual interaction and accurate feedback throughout the game. The application currently supports English and Spanish. For each language, we provide UI text, rules and word lists. To add another language, one can supply its word list, and, if needed, adjust the UI text and rules in configuration files. No source-code changes are required.
Central to our approach is the use of Gradio (https://www.gradio.app/, accessed on 24 May 2025; version 5.21.0), a Python-based framework that enables the development of interactive machine learning interfaces. The complete application is deployed on Hugging Face Spaces (https://huggingface.co/spaces, accessed on 24 May 2025), which provides the hosting infrastructure and integrates seamlessly with Gradio to maximise accessibility and maintainability for the research team and the user community. This combination offers several practical advantages for developing and disseminating NLP-powered applications. First, Gradio supports rapid prototyping and iterative model integration, making it straightforward to develop, refine, and evaluate interactive systems with state-of-the-art NLP components. Deploying the application on Hugging Face Spaces further streamlines the process, as updates and new model versions can be pushed live with minimal overhead, eliminating the need for dedicated server maintenance or complex configuration.
The resulting platform is accessible from any modern web browser without installation or user registration, lowering participation barriers and facilitating broad experimental reach. Its design is responsive and inherently multiplatform, ensuring a consistent user experience across devices such as smartphones, tablets, and desktop computers, and automatically adapting to different screen sizes and input methods.
Regarding reproducibility and open science, the deployment on Hugging Face Spaces allows our semantic game to remain version-controlled and openly accessible, supporting independent review, reuse, and further extension by the research community. Additionally, Gradio’s integration with the Hugging Face model repository and backend infrastructure simplifies the technical implementation and ensures that the interface components are closely aligned with the underlying model resources.
Overall, this architecture provides a robust, modular foundation for ongoing experimentation and future extension, facilitating the rapid translation of NLP research advances into accessible, interactive applications.
The core mechanism for gameplay, measuring semantic proximity between user guesses and the secret word, requires a model that can produce highly expressive and context-robust word embeddings. Unlike sentence-level similarity, the word-guessing context is challenging: user input is brief, often a single word.
Our approach employs Sentence Transformers, a bi-encoder architecture wherein a Transformer model encodes both target and guess in parallel.
Figure 1 illustrates the bi-encoder architecture used for evaluating semantic similarity between words in our system. In this setup, each input word, the target word (Word A) and the user’s guess (Word B), is independently processed by an identical encoder model. The models encode each input word, generating contextualised token representations. These representations are subsequently transformed into fixed-size sentence embeddings via a pooling operation (such as taking the [CLS] token or averaging all token embeddings), resulting in two dense vectors, denoted u and v. Finally, the semantic similarity between these two vectors is computed using cosine similarity. The resulting score quantifies the semantic proximity between the guess and the target word, enabling the system to provide meaningful feedback based on the user’s input. This architecture allows for efficient and scalable evaluation, as both words can be encoded in parallel, and is well suited for real-time applications like interactive word games.
For simplicity, throughout this paper, we refer to paraphrase-multilingual-mpnet-base-v2 as Sentence Transformers.
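To make the scoring pipeline concrete, the following minimal sketch shows how the publicly released checkpoint can score a guess against a secret word; the helper name, example words, and the ×10 scaling (mirroring the 0–10 score described in Section 4) are illustrative rather than the exact production code.

```python
from sentence_transformers import SentenceTransformer, util

# Load the multilingual bi-encoder used for semantic scoring
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

def semantic_score(secret_word: str, guess: str) -> float:
    """Return a 0-10 semantic proximity score between a guess and the secret word."""
    # Both words are encoded independently (bi-encoder), then compared
    embeddings = model.encode([secret_word, guess], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    return round(similarity * 10, 2)

print(semantic_score("apple", "pear"))    # semantically close -> high score
print(semantic_score("apple", "carpet"))  # unrelated -> low score
```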
Semantic scoring is performed with an encoder-only model, but hint generation requires dynamic, natural-sounding output, which calls for a decoder-only, autoregressive language model. Encoder models capture “meaning”; decoder models generate “language” in context. For tailored hints, clues, definitions, and “curiosity facts,” we require a model that can interpret prompts flexibly and produce suitable, context-appropriate responses.
All generation tasks are handled by OpenAI’s GPT-4, accessed via its API. Key attributes:
Foundation: Pretrained on enormous web, book, and dialogue corpora; further aligned with human feedback [3].
Capabilities: Multilingual, robust in general and domain-specific query answering, creativity (emoji, poetry), and varied response formality.
Integration: Prompt templates specify output language, type, and difficulty. The model generates hints in response to gameplay events.
Limits: Online inference only, subject to latency and API reliability.
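For illustration, a minimal hint-generation sketch using the openai Python client is shown below; the prompt template, parameters, and function name are simplified assumptions rather than the production prompts, and a valid API key is required.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt template; the production templates also encode
# difficulty, output language, and the specific hint type requested.
HINT_PROMPT = (
    "You are the hint generator for a word-guessing game. "
    "Write a short definition-style hint in {language} for the word "
    "'{secret}', without ever using the word itself."
)

def generate_hint(secret: str, language: str = "English") -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": HINT_PROMPT.format(language=language, secret=secret)}],
        temperature=0.7,
    )
    return response.choices[0].message.content

print(generate_hint("apple", language="Spanish"))
```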
All system components, including embedding models, hint generation, session management, and user feedback, are integrated within a unified Gradio interface and deployed on Hugging Face Spaces. This setup provides a convenient, cross-platform web application accessible via any browser, enabling broad user participation for experimental studies. The combination of Gradio and Hugging Face simplifies deployment and model updates, supports reproducible experiments through automated logging and data collection, and allows easy maintenance and rapid prototyping as research evolves. Moreover, the public deployment facilitates transparency and open science, as the system code and configuration are available for review and reuse by other researchers. This infrastructure ensures a robust foundation for our experiments and supports future platform extensions, including multimodal or robotic interface integration.
4. Game Design
In this section, we aim to present the overall design of the game experience, detailing its structure and individual modules, which are elaborated upon later. The game’s main objective is for the user to guess a secret word. Initially, the user may attempt different words randomly, but their guesses should become more refined as the game progresses; this way, a secondary objective is to make this process of knowledge refinement mentally stimulating for the user.
4.1. Initialisation
The interaction flow of the game has been structured to provide a seamless and engaging user experience, as illustrated in Figure 2. Upon initialising the game, as described in the first row of the figure (green section), users are greeted with an introduction screen. They are then offered the option to review the game’s rules; this is particularly important for first-time users or those who need a refresher. If they choose to review the rules, the system provides a detailed description before proceeding to the game setup.
Once past the introductory phase, users can select the difficulty level, a feature crucial for determining the secret word’s complexity and the nature of hints provided. Our system offers four difficulty levels: easy, normal, hard, and expert. When users choose the easy or normal level, the secret word is drawn from a basic vocabulary list, making the game accessible to a broad audience. In contrast, selecting hard or expert levels requires users to guess words from a more complex, domain-specific vocabulary, thus increasing the challenge. All word lists have been curated from language learning resources and classified according to their semantic complexity.
The selected difficulty level also governs the hint system: on easy mode, users only receive advanced hints, while expert mode offers no hints. Players receive the standard initial and advanced hints sequence for normal and hard modes. A detailed description of hint types is provided below.
This design ensures the game adapts to varying user knowledge and preferences by modulating both the lexical complexity of the target word and the level of support provided via hints. After users configure their preferred settings, the system randomly selects a secret word from the appropriate list and generates its embedding vector to optimise further game processes.
4.2. Main Loop: Semantic Understanding
With all preparations complete, the main game loop begins, as illustrated in the second row (blue section) of Figure 2. The process starts when the user inputs a guess (Input Guess Word). We then Compute Semantic Similarity between that guess and the secret word using cosine similarity; the dashed link from Secret Word Chosen in Figure 2 indicates the reuse of the precomputed embedding of the secret word at this step. Each guessed word is converted into an embedding vector, allowing the game to compute the semantic similarity between the guessed and pre-selected secret words. This similarity score generates feedback, guiding the user towards the correct answer. This feedback mechanism is central to assisting users throughout their gameplay experience.
To generate the similarity score, each word undergoes processing by the Sentence Transformer model, which produces an embedding vector. The specific model employed generates vectors with a dimensionality of 768. Once embedding vectors for the guessed and the secret words have been obtained, the similarity score is calculated using the cosine similarity measure. Cosine similarity was chosen due to its widespread use and effectiveness in comparing high-dimensional semantic embeddings: it quantifies the cosine of the angle between two vectors, thereby capturing semantic closeness based on their orientation in the embedding space, independent of their magnitude. We derive the cosine similarity value from the dot product equation, as illustrated in Equation (1). We can isolate the vectors from this equation, enabling us to compute the cosine similarity. The dot product equation is mathematically expressed as:

$$\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{n} u_i v_i \qquad (1)$$

where $\mathbf{u}$ and $\mathbf{v}$ are the vectors representing the guessed word and the secret word, respectively, and $n$ is the dimensionality of these vectors. To transition from the dot product to the cosine similarity, we utilise Equation (2)

$$\cos(\theta) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert} \qquad (2)$$

where $\cos(\theta)$ denotes the cosine similarity, $\mathbf{u} \cdot \mathbf{v}$ is the dot product, and $\lVert \mathbf{u} \rVert$ and $\lVert \mathbf{v} \rVert$ are the magnitudes of vectors $\mathbf{u}$ and $\mathbf{v}$, respectively. The magnitudes are computed as follows:

$$\lVert \mathbf{u} \rVert = \sqrt{\sum_{i=1}^{n} u_i^2}, \qquad \lVert \mathbf{v} \rVert = \sqrt{\sum_{i=1}^{n} v_i^2}$$

By substituting these magnitudes back into the cosine similarity formula, we obtain Equation (3).

$$\cos(\theta) = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2}\,\sqrt{\sum_{i=1}^{n} v_i^2}} \qquad (3)$$
We derive the cosine value through this transformation. Then, once multiplied by ten, we obtain a semantic score over ten, quantifying the semantic similarity between the guessed and secret words. This quantified measure of similarity forms the foundation of our feedback. Similarly, all the scores obtained during each round are recorded to analyse the user’s performance within the hint system. This data collection also serves the secondary purpose of informing the user about the words they have attempted during the game that are closest to the secret word. This allows users to leverage this information in the game’s feedback mechanism. Tracking these scores can give users detailed insights into their guessing patterns and improve the gameplay experience. The stored data help refine the hint system and ensure that users can learn from previous attempts, enhancing their strategy with each subsequent guess.
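A small worked example, using toy vectors in place of the 768-dimensional embeddings, illustrates Equations (1)–(3) and the ×10 scaling of the semantic score:

```python
import numpy as np

def cosine_score(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity (Equations (1)-(3)) scaled to the game's 0-10 range."""
    dot = np.dot(u, v)                    # Equation (1): dot product
    norm_u = np.sqrt(np.sum(u ** 2))      # magnitude of u
    norm_v = np.sqrt(np.sum(v ** 2))      # magnitude of v
    cosine = dot / (norm_u * norm_v)      # Equations (2)-(3)
    return cosine * 10                    # semantic score over ten

# Toy 4-dimensional vectors standing in for the real embeddings
u = np.array([0.2, 0.8, 0.1, 0.4])
v = np.array([0.25, 0.7, 0.05, 0.5])
print(round(cosine_score(u, v), 2))       # similar orientation -> score near 10
```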
Initially, if the user’s guess is far from the secret word, the system provides encouraging but non-specific feedback, indicating whether they are “hot” or “cold” in semantic proximity. The feedback becomes increasingly targeted as the guesses draw closer to the secret word. While the feedback does not directly give hints about the word, it operates alongside the dynamic hint system, which offers progressively specific clues when appropriate, narrowing down the available options and aiding the user in their quest to identify the secret word. In the web application, feedback is conveyed through the graphical user interface and on-screen messages, rather than by physical devices or speech. All system outputs, including feedback and hints, are presented visually within the web interface. Because input is unrestricted, users can and often do try metaphorical or slang words. In this context, every guess is embedded and scored identically against the secret word; the interface immediately shows the numerical score and qualitative “hot/cold” feedback and updates the best-tries table. When a sequence of such guesses shows a sustained downward trend according to the detector, as extended use of ambiguous words can cause, the system gives a hint that narrows the semantic search space without revealing the answer.
4.3. Dynamic Hint System
Identifying the optimal moment to provide hints to the user presents a nuanced challenge, as there is a risk of the user perceiving the assistance as either offensive or condescending. Bearing this in mind, our objective is to offer subtle assistance when we detect that the user is experiencing difficulty or is stagnating without being intrusive. This approach ensures that the user remains engaged and enjoys the experience.
As previously mentioned, data collection enables us to analyse the user’s performance throughout their interaction with the game. To effectively capture trends and fluctuations in user performance, we adopted the Exponential Moving Average (EMA), a widely used technique in performance analysis across fields such as sports [17] and finance [18]. The first EMA applied in our system uses a window of five samples ($n = 5$), corresponding to a smoothing factor of $\alpha = 0.33$, as shown in Equation (4) below. This EMA provides a smoothed representation of the user’s score progression by assigning exponentially greater weight to more recent observations, making it more responsive to current changes than a simple arithmetic mean. This improved sensitivity to recent data points allows the system to detect trends in user performance more rapidly, helping us provide timely and context-appropriate hints.

$$\text{EMA}_t = \alpha \, x_t + (1 - \alpha) \, \text{EMA}_{t-1} \qquad (4)$$

where $\text{EMA}_t$ is the Exponential Moving Average at time $t$; $\alpha$ is the smoothing factor, defined as $\alpha = \frac{2}{n+1}$, with $n$ being the number of samples considered; $x_t$ is the observed value (e.g., performance score) at time $t$; and $\text{EMA}_{t-1}$ is the EMA from the previous period.
The application of the EMA allows us to smooth out irregularities and short-term fluctuations in performance, while retaining sufficient sensitivity to recent user actions. This balance is essential for adaptive educational interventions, such as calibrating the timing and specificity of game hints.
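The following sketch implements the EMA of Equation (4) with α = 2/(n + 1); the per-guess scores are illustrative values.

```python
def ema(values: list[float], n: int) -> list[float]:
    """Exponential Moving Average (Equation (4)) with smoothing factor 2 / (n + 1)."""
    alpha = 2 / (n + 1)
    smoothed = [values[0]]  # seed with the first observation
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

scores = [3.1, 2.4, 4.0, 3.8, 2.9, 2.5, 2.2, 5.6]   # per-guess semantic scores
score_ema = ema(scores, n=5)                         # first EMA (window of five)
print([round(s, 2) for s in score_ema])
```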
Figure 3 illustrates this process. In the upper half of the figure, the black dots represent individual user scores for each guess, which can fluctuate substantially from one attempt to the next. The blue line shows these scores’ EMA, providing a smoother trajectory that reveals the underlying trend in user performance, rather than being dominated by noise or outliers. For example, even though there may be sharp drops or spikes in individual scores, the EMA curve responds more gradually, making it easier to distinguish between random variation and sustained changes in performance.
Beyond smoothing, it is crucial to determine the direction and rate of change in performance to identify optimal moments for intervention. To this end, we evaluate the slope of the Exponential Moving Average, which reflects whether user performance is improving, declining, or stable. The slope is estimated by computing the first discrete derivative of the EMA, as detailed in Equation (5).

$$\nabla \text{EMA}_t = \text{EMA}_t - \text{EMA}_{t-1} \qquad (5)$$

Given the slope of user performance in each game iteration, we can observe a trend in the performance metrics. However, these metrics are notably volatile and significantly influenced by game iteration changes. To address this, we further smooth the slope signal by applying a second EMA to the derivative, using a window of three samples ($n = 3$), corresponding to a smoothing factor of $\alpha = 0.5$. This approach provides a more nuanced smoothing, balancing responsiveness to immediate changes against the mitigation of aggressive spikes in our metrics. The outcomes of our approach are illustrated in the lower part of Figure 3, which displays the derivative (slope) of the EMA (green squares), alongside an EMA-smoothed estimate of the derivative itself (red diamonds). When the slope is close to zero, user performance is stable. Positive slope values signal improving performance, while negative slope values suggest declining performance. For instance, in guesses 5 to 9, the negative slope indicates a decline in performance, which could be an optimal point to trigger an intervention or provide a targeted hint. Conversely, after guess 13, the trend reverses, reflecting a rapid improvement captured by the raw and smoothed slope curves.
To leverage these data, we must establish a threshold, based on the trends extracted from the moving-average statistics, that acts as the trigger for providing hints at the optimal moment. Specifically, the system monitors the EMA of the first discrete derivative of the score EMA. A hint is triggered when this derived value falls below zero, that is, when user performance exhibits a sustained downward trend, indicating the user is likely struggling or stagnating. When this condition is met, the interface shows the hint panel. To avoid premature intervention, the system requires at least three consecutive guesses to have occurred since the last hint before considering a new one. Furthermore, because the initial EMA is computed over a window of five samples, the system only begins evaluating these thresholds and offering its first hint after a minimum of five user guesses.
This combined threshold mechanism was defined heuristically based on pilot experiments with several users. As a result, we can robustly identify when a user is lost or not making progress solely from performance metrics. The next step involves devising dynamic and creative strategies to assist the user, ensuring that the help is valuable and entertaining while not compromising the game experience. This requires a carefully balanced approach, giving hints or guidance that enhances the user’s engagement and enjoyment without spoiling the core challenges of the game.
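Putting the pieces together, a minimal sketch of the trigger logic described above (first EMA over the scores, discrete derivative, second EMA with n = 3, a warm-up of five guesses, and at least three guesses between hints) might look as follows; variable names and example values are illustrative.

```python
def ema(values, n):
    alpha = 2 / (n + 1)
    out = [values[0]]
    for x in values[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out

def should_show_hint(scores, guesses_since_last_hint):
    """Trigger a hint when the smoothed trend of the score EMA turns negative."""
    if len(scores) < 5 or guesses_since_last_hint < 3:
        return False                      # warm-up and spacing constraints
    score_ema = ema(scores, n=5)          # first EMA over the raw scores
    slope = [b - a for a, b in zip(score_ema, score_ema[1:])]  # discrete derivative
    slope_ema = ema(slope, n=3)           # second EMA smooths the slope
    return slope_ema[-1] < 0              # sustained downward trend -> intervene

scores = [4.2, 3.9, 3.5, 3.0, 2.6, 2.4]
print(should_show_hint(scores, guesses_since_last_hint=4))  # True: user is stagnating
```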
Given this framework, we meticulously designed a set of dynamic hints categorised into two distinct tiers. The first tier consists of subtle cues that provide slight indications regarding the target word, assisting users in navigating the semantic theme without directly revealing the solution. Conversely, the second tier comprises more explicit and conspicuous hints. While these clues do not directly disclose the answer, they significantly narrow down the possibilities, guiding the user towards an imminent discovery.
In addition to hint categories, as mentioned above, our system adapts to the initial difficulty level the user selects. For moderate difficulty, the system is more likely to provide second-tier hints when users encounter impediments. At higher difficulty levels, users predominantly receive first-tier hints unless prolonged stagnation is detected, in which case, more advanced hints may be provided. In the “expert” mode, users face all challenges entirely unaided, receiving no hints during gameplay.
The dynamic hint system incorporates various hint types based on natural language processing and generative models.
Table 1 summarises these hint categories, describes their purpose, and indicates the model(s) used and the type of output involved. For added clarity, concrete examples of each kind of hint are provided immediately after the table.
Word Ranking: e.g., “fruit: 92%, car: 15%, school: 54%, blue: 35%.” The user sees a ranked list expressing the semantic closeness of various words to the secret word.
Film Representation: e.g., a sequence of emojis referencing the film “Snow White”.
Word Definition: e.g., “A fruit with sweet red or green skin and crisp flesh.”
Word Representation: e.g., an apple emoji for “apple” (an emoji or pictogram representation of the word).
Poem: e.g., “Red or green, crisp and sweet, Harvest joy you love to eat.”
As detailed previously in Section 3, given the unpredictability of the secret word selection, our system dynamically generates hints irrespective of the specific word while ensuring consistent structural patterns across different games. We utilise autoregressive models, particularly those optimised for text generation tasks such as GPT models, to achieve this. We designed dedicated processes for each predefined hint structure that combine state-of-the-art semantic and generative models. For example, to generate the “Word Ranking” hint, we select three random words and one semantically close to the secret word, as determined by the Sentence Transformers embedding model. These four words are then presented to the user in a ranked list, indicating their semantic proximity to the secret word. For other hint types such as “Word Definition”, “Film Representation”, “Poem”, or “Word Representation”, we employ carefully designed, task-specific prompts for each hint type, submitting them directly to GPT-4, which generates the corresponding output (e.g., definitions, relevant emoji sequences, or creative texts). This modular approach enables dynamic, context-appropriate, and engaging hint generation for any secret word, while ensuring consistency in structure and user experience throughout the game.
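As an illustration of the “Word Ranking” process described above, the following sketch ranks one semantically close word and three random distractors against the secret word; the vocabulary, helper name, and percentage rounding are assumptions for the example, not the production implementation.

```python
import random
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

def word_ranking_hint(secret: str, vocabulary: list[str]) -> list[tuple[str, int]]:
    """'Word Ranking' hint: one close word plus three random words, ranked by similarity."""
    secret_emb = model.encode(secret, convert_to_tensor=True)
    sims = util.cos_sim(secret_emb, model.encode(vocabulary, convert_to_tensor=True))[0]
    scores = {word: round(float(sim) * 100) for word, sim in zip(vocabulary, sims)}
    close_word = max(scores, key=scores.get)                  # semantically closest candidate
    distractors = random.sample([w for w in vocabulary if w != close_word], 3)
    chosen = [close_word] + distractors
    return sorted(((w, scores[w]) for w in chosen), key=lambda p: p[1], reverse=True)

vocab = ["pear", "car", "school", "blue", "river", "banana"]
print(word_ranking_hint("apple", vocab))   # e.g., [('pear', 71), ('banana', 65), ('car', 22), ('blue', 18)]
```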
Although we adopted an EMA-based trigger, other policies are reasonable. Rule-based triggers (e.g., triggering after a fixed number of failed attempts, or when scores remain below a threshold) are simple and transparent, but they may ignore the magnitude and direction of change, require frequent readjustment, and sometimes fire after harmless oscillations. Time-based triggers are easy to implement, yet they conflate reflection time, writing speed, and device or network latency, so they may not reliably indicate that a player is actually stuck. User-initiated hints maximise autonomy and transparency but can introduce biases that complicate comparisons between conditions; in pilot tests where we enabled the option to request hints, several participants avoided asking for help even when they were stuck, which increased variability in frustration. In contrast, our EMA-based policy is lightweight and reacts to trends rather than to noise or isolated peaks. The two-stage EMA described above reduces fluctuation while remaining sensitive enough to detect when players get stuck, although it still requires parameter choices and a brief warm-up period.
4.4. Game End
The game concludes when the participant either accurately identifies the secret word or opts to withdraw. Upon conclusion, users are presented with a comprehensive performance summary detailing the number of attempts taken, the overall accuracy of their guesses, and a curiosity about the secret word, which is dynamically generated via a language model. This feature enhances the application’s interactive and educational attributes, providing users with a sense of accomplishment and motivating continued play.
A key aspect for sustaining user engagement is increasing the game’s replayability. To this end, users are prompted to start a new game with a new secret word at the end of each session. This mechanism introduces variety and unpredictability, encouraging repeated playthroughs and maintaining user interest. The game’s design aims to balance challenge and approachability, offering an engaging experience that appeals to many users.
5. Integration in the Semantrix Web Application
The Semantrix word association game was developed as a highly modular and configurable web application, enabling research and public deployment scenarios. The web-based architecture supports dynamic experimental conditions, session management, multilingual interaction, and robust data handling, all operated through an intuitive user interface built with the Gradio framework.
Figure 4 presents an overview of the Semantrix web application architecture, highlighting the modular front-end, backend, session management, configuration, and data flows described in detail in the following sections. Arrows indicate interaction and data flow; coloured backgrounds are used solely to distinguish component groups, without further visual meaning.
5.1. Application Architecture and Technology Stack
The Semantrix web application was implemented in Python (version 3.11.5) and leverages the Gradio Blocks API (version 5.21.0) to construct flexible, component-based user interfaces. The backend consists of custom Python classes for all game logic, state management, and interaction with natural language processing models.
The overall architecture and the relationships among core components are depicted in Figure 4. The main architectural layers are as follows:
User Interface (Gradio Blocks): Handles all interactive elements (textboxes, radio buttons, action buttons, tables, and dynamic images) and user inputs/outputs.
Session Manager: Maintains unique session identifiers and manages per-user game state isolation throughout each play session.
Backend Game Logic: Implements core game mechanics, including game flow, input evaluation, scoring, hint generation, and state transitions.
Configuration Files: JSON-based files used to load experimental parameters, user language, menu content, and model selection at runtime. Language is handled entirely via JSON. Per-session settings bind the UI language, the corresponding rules text, and the language-specific secret-word list. Hints are generated in the selected language via prompt templates. To change languages, only the configuration needs to be updated; the backend game logic remains unchanged.
Data Storage (Rankings, Local + Hub Commits): Stores play sessions and rankings in local JSON files with automated saving and synchronisation to a Hugging Face Hub repository for persistent data management.
Commit Scheduler: Batches and orchestrates data commits from local storage to the Hub for research data collection.
Logging: Uses the Python logging module across all components for monitoring, debugging, and ensuring traceability and reproducibility.
5.2. User Interface and Experience
The user interface (UI) is dynamically constructed based on the game state, experimental condition, and user actions. Major UI components include the following:
Header Markdown: Game instructions, experiment information, and session feedback.
Image Display: Game logo and animated win/lose images for engagement.
Textboxes and Buttons: For submitting guesses, requesting hints, and controlling game flow.
Radio Buttons: For option selection (e.g., difficulty, or rules viewing).
Ranking and History Tables: Dynamically updated tables show prior guesses, hint status, and user ranking.
Session Timer: Ensures sessions are closed adequately on user inactivity, supporting user experience and experimental cleanliness.
Stateful interactions are managed via Gradio’s State and BrowserState objects, with event-driven updates coordinated by backend functions.
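A minimal sketch of such an event-driven, stateful interface using the Gradio Blocks API is given below; the components, the placeholder scoring function, and the layout are illustrative and greatly simplified with respect to the deployed application.

```python
import gradio as gr

def submit_guess(guess: str, history: list):
    """Illustrative guess handler: score the guess and update per-session history."""
    score = round(len(set(guess)) % 10, 2)        # placeholder for the real semantic score
    history = history + [(guess, score)]
    feedback = f"'{guess}' scored {score}/10"
    return feedback, history, history

with gr.Blocks() as demo:
    gr.Markdown("# Semantrix (sketch)")
    history = gr.State([])                        # per-session game state
    guess_box = gr.Textbox(label="Your guess")
    feedback_md = gr.Markdown()
    history_table = gr.Dataframe(headers=["Guess", "Score"], interactive=False)
    guess_box.submit(submit_guess,
                     inputs=[guess_box, history],
                     outputs=[feedback_md, history, history_table])

if __name__ == "__main__":
    demo.launch()
```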
5.3. Session Management and Request Handling
The Semantrix web application is deployed on Hugging Face Spaces, enabling global, concurrent user participation. Each user session is fully isolated to guarantee experimental integrity and fairness when evaluating user interactions with language models. Every user is assigned a unique session identifier and experiences a distinct game instance. This prevents participant interference and ensures that user inputs, feedback, and performance data are recorded independently.
While session isolation is a standard engineering practice in web applications, in this context, it is essential for providing reliable, reproducible data for analysing gameplay and user interactions.
This approach not only preserves data integrity but also creates a controlled environment for investigating the effectiveness and behaviour of adaptive hint generation and language understanding in real-time gameplay scenarios.
5.4. Configuration Files and Experimental Setup
As shown in Figure 4, configuration files are loaded by the Backend Game Logic at application start. Before any user session, these files specify the user’s language, experimental parameters, and model selection. While it is possible to modify these files and update conditions between sessions, configuration changes do not take effect within an active game; adjustments are applied only when the application (or session) is re-initialised. This design allows the following features:
Switching user language and updating the corresponding UI, before the start of a session.
Selecting embedding models (e.g., Sentence Transformers or Word2Vec) and controlling hint availability to support different experimental conditions, without modifying the core code, but only at start time.
Flexibly adjusting or extending experiments by editing the condition JSON files before launching a new session.
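For illustration, a condition file might be written and read as sketched below; the key names and values are assumptions, not the exact schema used in Semantrix.

```python
import json

# Illustrative session configuration; key names are assumptions, not the exact schema.
example_config = {
    "language": "es",                                # UI language and rules text
    "word_list": "words/es_basic.json",              # language-specific secret-word list
    "embedding_model": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    "hints_enabled": True,                           # experimental condition toggle
    "difficulty_levels": ["easy", "normal", "hard", "expert"],
}

with open("session_config.json", "w", encoding="utf-8") as f:
    json.dump(example_config, f, ensure_ascii=False, indent=2)

# At application start, the backend game logic reads this file to set the condition
with open("session_config.json", encoding="utf-8") as f:
    config = json.load(f)
print(config["language"], config["hints_enabled"])
```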
5.5. Data Storage, Logging, and Commit Operations
All gameplay data are stored in local JSON files, segregated by session ID, for efficient tracking and easy export. Rankings and play histories are also session-bound. Periodically, the system batches and commits local user data to a Hugging Face Hub repository. This automated process enables collaborative research, reproducibility, and centralised analytics, while cleaning up local files after successful commits to optimise storage.
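A minimal sketch of this pattern using huggingface_hub’s CommitScheduler is shown below; the repository name, commit interval, and logging helper are illustrative, and running it requires a valid Hugging Face token with write access.

```python
import json
from pathlib import Path
from huggingface_hub import CommitScheduler

DATA_DIR = Path("gameplay_data")
DATA_DIR.mkdir(exist_ok=True)

# Periodically pushes everything under gameplay_data/ to a (hypothetical) dataset repo
scheduler = CommitScheduler(
    repo_id="your-org/semantrix-logs",   # illustrative repository name
    repo_type="dataset",
    folder_path=DATA_DIR,
    path_in_repo="sessions",
    every=10,                            # commit batched changes every 10 minutes
)

def log_round(session_id: str, record: dict) -> None:
    """Append one round's data to the session file; the scheduler handles the upload."""
    session_file = DATA_DIR / f"{session_id}.json"
    with scheduler.lock:                 # avoid writing while a commit is in progress
        history = json.loads(session_file.read_text()) if session_file.exists() else []
        history.append(record)
        session_file.write_text(json.dumps(history, ensure_ascii=False, indent=2))

log_round("session-0001", {"guess": "pear", "score": 7.8, "hint_shown": False})
```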
5.6. Extensibility and Customisation
Semantrix’s modular architecture ensures that new languages, embedding models, or experimental conditions can be incorporated with minimal code changes, usually only requiring updates to configuration files or model wrappers. UI elements can be rapidly adapted thanks to the flexibility of the Gradio Blocks API, and additional analytics or feedback modules can be integrated as required.
6. Methods
This section outlines the methodological approach followed to investigate the effects of semantic model selection and adaptive support on user experience in the Semantrix web application. Drawing from established guidelines in human–computer interaction research, our methods employed a preregistered, between-subject factorial design with random assignment to conditions, adherence to research ethics and GDPR compliance, and using validated self-report questionnaires alongside objective behavioural metrics. By deploying the study in a real-world, web-based setting while systematically controlling experimental variables, we sought to maximise both robustness and ecological validity.
6.1. Rationale and Overall Study Design
The design of Semantrix targets a central question in HCI and intelligent systems: How do underlying computational mechanisms and system-level scaffolds (such as adaptive hints) alter both the measurable and felt quality of human–digital game interaction? Rather than focusing solely on technical accuracy or raw performance, we aimed to develop a rich, multidimensional understanding of user experience, capturing engagement, perceived competence, enjoyment, and the subtle interplay between user intent and system feedback.
To this end, we adopted a between-subject factorial design. Several considerations drove this choice. First, manipulating both the semantic embedding model (Sentence Transformers, representing a state-of-the-art contextual approach, and Word2Vec as a widely recognised baseline) and the presence or absence of dynamic hinting allows us to independently and jointly assess their contributions to end-user outcomes and system usability. To limit factorial growth and preserve power, we fixed the hint-trigger policy to the EMA-based detector described in Section 4.3. Second, the between-subject paradigm, wherein each participant experiences only one experimental condition, was preferred to within-subject alternatives to preclude confounding effects due to carry-over, learning, or fatigue. This design minimises the risk that experiences in one condition would bias or influence responses in another, an important consideration when eliciting sensitive subjective appraisals and minimising participant burden [19]. Before data collection, all aspects of the study protocol, including hypotheses, variable manipulations, and planned analyses, were preregistered. This step affirmed the transparency and reproducibility of our experiment, aligning our approach with ongoing recommendations in computational and behavioural research communities [20].
6.2. Experimental Conditions and Interventions
At the heart of our experiment were two independent variables, each modelled as a binary factor:
Semantic Embedding Model. Participants either interacted with a version of the game powered by a Sentence Transformer model capable of nuanced and context-sensitive semantic similarity judgments or a more traditional Word2Vec model, which relies on static word-level vector relationships. The rationale for this contrast is twofold: While Sentence Transformers have demonstrated superiority in capturing context and meaning in NLP research, it remains an open empirical question whether such technological advances meaningfully translate into richer, more satisfying user experiences in a live system context. Moreover, direct comparison allows us to contextualise the added value of state-of-the-art AI within the lived realities of end-users.
Dynamic Hint System. We further established whether participants received only basic semantic score feedback or whether the full dynamic hint system was enabled, providing progressively more informative cues and support when users encountered difficulty or stagnation. This adaptive support mechanism is intended to enhance perceived competence and enjoyment by offering timely help, though it may also reduce perceived challenge or the sense of autonomy when overused, a trade-off noted in prior educational research [21]; thus, its effects are subject to empirical examination. Crossing these two binary factors yields four experimental conditions:
(1) Sentence Transformer + Dynamic Hints: Users experience both advanced semantic scoring and adaptive feedback.
(2) Sentence Transformer + No Dynamic Hints: Advanced scoring, minimal feedback.
(3) Word2Vec + Dynamic Hints: Classic embedding paired with adaptive support.
(4) Word2Vec + No Dynamic Hints: Baseline in both model and feedback.
Random assignment to these conditions was implemented automatically upon each participant’s entry to the web application, ensuring balanced condition sizes and controlling for self-selection and diurnal or batch effects.
Figure 5 provides an overview of the participant flow, from entry and random assignment to session activities and survey invitation. All phases were fully automated in the backend.
6.3. Participants and Recruitment
Our sampling strategy prioritised diversity and external validity while remaining feasible for a controlled experimental deployment. We recruited 42 individuals (age range: 18–64 years) using multimodal online outreach (university mailing lists, professional association newsletters, and public social media groups). This approach was chosen to avoid over-reliance on a single, possibly homogeneous population and to reflect the broad demographic reach of contemporary digital platforms.
Of the 42 participants, 31 (73.81%) identified as women, 10 (23.81%) as men, and 1 (2.38%) preferred not to disclose their gender. The educational level was high: 81% (34) had or were pursuing a university degree (bachelor’s or higher), and 15 reported having a master’s or PhD. Three reported completing only secondary education (high school or vocational training). Participants also varied in their self-assessed familiarity with technology: 14 rated themselves as “high” familiarity, 7 as “medium-high”, 15 as “medium”, 3 as “medium-low”, and 3 as “low”. Overall, the sample skewed toward higher formal education and moderate to high digital literacy. While this may align with the potential primary users of web-based language games, it may limit the generalisability of our findings to populations with lower education levels or less experience with digital games and applications based on NLP.
All participants reported normal or corrected-to-normal vision, which was required for engagement with the visual interface and fair comparison of usability metrics. After reviewing a detailed privacy and data protection statement (following GDPR), each prospective participant provided informed, digitally recorded consent. Completing the post-game survey and open-ended comment fields was entirely optional to ensure voluntary participation and minimise pressure, with no compensation contingent on these elements.
6.4. Experimental Procedure
The entire participant experience was designed to be seamless and unobtrusive, mirroring naturalistic interaction with web-based entertainment applications. The experimental flow is summarised in Figure 5. The UI sequence that instantiates this flow is shown in Figure 6, which maps onto the game phases in Figure 2: initialisation (green; Figure 6a–e), main loop (blue; Figure 6f,g), and game end (pink; Figure 6h).
Data Consent and Onboarding: After reading the data protection summary and providing informed consent, participants reached the landing and welcome screens (Figure 6a,b), which introduced the goal and the “hot/cold” feedback idea. This corresponds to the initialisation phase in Figure 2.
Instructional Support: Participants were asked whether they wished to view the rules (Figure 6c). The rules and scoring overview (semantic score 0–10 and best-tries ranking) were then presented (Figure 6d), followed by the ready-to-start screen before the first guess (Figure 6e). These steps complete the initialisation phase in Figure 2.
Game Session: Participants played one or more rounds, guessing the secret word with semantic feedback. The gameplay view (Figure 6f) showed the best-tries table and last-guess feedback. When the EMA-based detector (Figure 3) indicated that participants were getting stuck, the system displayed a dynamic hint panel (Figure 6g; Word Ranking shown as an example). These elements instantiate the main loop in Figure 2.
Transition to Survey: At the end of a round, the interface displayed the solution and a short “curiosity” fact, with options to play again or finish (Figure 6h), after which participants were optionally invited to complete the post-game survey. This corresponds to the game-end phase in Figure 2.
All experimental operations, from randomisation to data saving, were automated in the backend to minimise demand effects or unintentional bias.
6.5. Measurement Instruments and Rationale
Game Metrics: Throughout gameplay, fine-grained behavioural data were collected, leveraging the affordances of digital logging in web applications. Logged metrics encompassed the following:
Number of rounds played: Serves as a broad proxy for engagement and perseverance.
Attempts and correct guesses: Support the analysis of learning and error patterns.
Cumulative/average semantic scores: Permit quantification of performance quality under different feedback regimes.
Hint requests and hint usage: Directly inform the relationship between support structure and observed reliance on system scaffolding.
Completion times: Offer additional insight into cognitive load, fluency, and decision strategies.
These measures offer the potential to disentangle qualitative phenomena such as engagement and challenge from their behavioural correlates.
Self-Reported Metrics: To capture the multi-faceted and often subtle subjective experience of interacting with Semantrix, we assembled a unified post-experiment questionnaire incorporating validated constructs from contemporary HCI and psychology. Questionnaire selection was informed by coverage of relevant theoretical domains, prior use in engagement and digital entertainment studies, and brevity to minimise user fatigue. Components consisted of the following:
User Engagement Scale—Short Form (UES-SF) [22]: This questionnaire measures four dimensions: Focused Attention, Perceived Usability (reverse-coded), Aesthetic Appeal, and Reward Factor. We included all four subscales in our study, using the full set of 12 questions.
Intrinsic Motivation Inventory (IMI) [23]: From this questionnaire, we administered only the Interest/Enjoyment and Perceived Competence subscales, well established as being predictive of engagement and positive appraisal in digital gameplay. A total of 13 items from these subscales were used.
FunQ [24]: We used this questionnaire by selecting the dimensions most relevant to our research focus: immersion, delight, challenge, and stress. These dimensions were chosen to align with the specific aspects of user experience we aimed to investigate in the context of Semantrix. The Loss of Social Barriers and Autonomy dimensions were omitted, as they were less relevant to our study objectives and the solitary nature of the game. In total, 12 items from this adapted version were used.
In total, participants were presented with a fixed sequence of 33 questions. The order of the questions was not randomised, ensuring that all participants shared the same experience and facilitating cross-user comparison. We took active steps to minimise redundancy by cross-referencing items for conceptual overlap during questionnaire integration. For non-English-speaking participants, a rigorous translation and back-translation protocol (following ITC guidelines) was enacted to preserve each item’s semantic and psychometric integrity.
Scoring: All survey items used a consistent five-point Likert scale (1 = strongly disagree, 5 = strongly agree). For analytic consistency, scores on each construct (e.g., engagement, motivation) were computed as the arithmetic mean of their respective items, except for the FunQ, where a sum score approach, reflecting factor structure, was used, as recommended by the original authors. Open-ended feedback provided qualitative insights, supplementing quantitative results.
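For illustration, the scoring rule can be sketched as follows; the item groupings and response values are invented for the example.

```python
from statistics import mean

# Illustrative Likert responses (1-5) grouped by construct; item keys are made up.
responses = {
    "UES_focused_attention": [4, 5, 4],
    "IMI_interest_enjoyment": [5, 4, 5, 4],
    "FunQ_challenge": [3, 4, 2],
}

def construct_score(name: str, items: list[int]) -> float:
    """Mean of items for UES-SF/IMI constructs; sum score for FunQ dimensions."""
    return float(sum(items)) if name.startswith("FunQ") else mean(items)

for name, items in responses.items():
    print(name, round(construct_score(name, items), 2))
```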
6.6. Hypotheses
Grounded in theoretical models of user experience in digital games and prior findings on adaptive system support and intelligent interfaces, we formulated distinct sets of hypotheses for three core experiential constructs: engagement, intrinsic motivation, and fun. Each construct was operationalised through validated measurement questionnaires as described above.
Engagement, comprising attentional focus, aesthetic appreciation, ease of use, and willingness to re-engage, is a key predictor of positive user experiences in digital interaction [22]. Building on existing literature suggesting that both system intelligence and adaptive feedback can play roles in sustaining user attention and immersion, we hypothesised the following:
H_eng1: There will be significant differences in reported engagement as a function of the underlying semantic model; specifically, participants experiencing the Sentence Transformer-based system will report higher levels of engagement than those assigned to the Word2Vec-based system.
H_eng2: Activation of the dynamic hint system will yield higher self-reported engagement compared to the minimal-feedback conditions.
H_eng3: There will be significant interaction effects such that the combination of a sophisticated semantic model and rich dynamic feedback will produce higher engagement scores than can be attributed to either variable in isolation.
Intrinsic motivation reflects the degree to which users perceive the activity as inherently interesting, enjoyable, and satisfying, independent of external rewards, a critical driver of persistence and deep learning in serious games and HCI contexts [
23]. Based on self-determination theory and prior empirical work with adaptive systems, we proposed the following hypotheses:
Him1: The use of the Sentence Transformer model will result in significantly higher intrinsic motivation scores relative to the Word2Vec model, as advanced semantic understanding may promote curiosity and perceived task meaningfulness.
Him2: Participants with access to the dynamic hint system will report greater intrinsic motivation, reflecting enhanced feelings of competence and effective challenge.
Him3: The impact of dynamic hints on intrinsic motivation will be amplified in the context of the advanced semantic model, resulting in a statistically significant interaction effect.
Fun encapsulates both affective enjoyment and playful challenge, underpinning the appeal of many serious games [24]. It was conceptualised here not merely as hedonic pleasure but as a multidimensional construct comprising immersion, delight, and the balance of challenge and stress:
Hfun1: Participants in the Sentence Transformer conditions will report higher levels of overall fun, as advanced semantic scoring is expected to create a more dynamic and responsive interaction, promoting states of flow and playful absorption.
Hfun2: Activation of the dynamic hint system will be associated with increased fun ratings, by supporting playful exploration while buffering against excessive frustration.
Hfun3: The presence of both an advanced semantic model and a dynamic hint system will have a synergistic effect, yielding the highest observed fun scores and demonstrating a significant interaction between the two factors.
Each set of hypotheses was tested using statistical analyses selected according to the observed properties of the collected data. When assumptions such as multivariate normality and adequate correlation among the dependent variables were satisfied, a factorial MANOVA was applied to investigate the main and interaction effects of system intelligence and feedback. The final analytical approach was determined after verifying the suitability of these statistical assumptions.
7. Results
This section reports the results of the Semantrix online platform experiment, including in-game behavioural data and self-reported user experience measures. We first present the core user outcome metrics across experimental conditions, followed by formal statistical analyses addressing our preregistered hypotheses (Heng1–3, Him1–3, Hfun1–3). The statistical analyses were carried out using IBM SPSS Statistics (version 26), complemented by information extracted from the comments left by the 42 participants in the experiment. An alpha significance level of was used for the statistical analyses.
7.1. User Outcome Metrics
To assess user engagement, task performance, and experience, we collected a range of behavioural (e.g., playing time, percentage of games won, hints received, word attempts, rounds completed) and self-report measures (UES–SF for engagement, FunQ for fun, IMI for intrinsic motivation).
Table 2 and Table 3 summarise these outcome variables by experimental condition.
Notably, substantial differences in gameplay behaviour were observed across conditions, as shown in Table 2. Conditions with Sentence Transformers and/or enabled hints were characterised by longer average playing times per round and more word attempts per round, most notably in the Sentence Transformers + Hints group, compared with the much shorter sessions and fewer attempts in the Word2Vec + No Hints condition (per-round means are reported in Table 2).
The number of rounds played per user was also highest in the Sentence Transformers + Hints condition, again suggesting that the combination of more advanced semantic processing and adaptive hinting fostered greater persistence and re-engagement.
These game metrics could indicate that advanced language models and dynamic hints contributed to greater user persistence (longer play sessions, more attempts, and rounds played). In contrast, their absence, most markedly in the Word2Vec + No Hints group, was associated with shorter sessions and reduced behavioural engagement. These objective patterns reinforce the trends observed in the self-report engagement, fun, and motivation scores.
Longer playing times and more hints received were thus most prominent when dynamic hints were available, especially in the Sentence Transformers + Hints group. This convergence of game and self-report data could suggest that adaptive feedback and the use of advanced semantic models both sustain user engagement and promote persistence in solving the game.
Descriptively, the self-reported outcomes in Table 3 show a coherent pattern across conditions. The combined condition (Sentence Transformers + Hints) reached the highest means for engagement (UES-SF = 3.63 ± 0.46; 95% CI: 3.33–3.92), fun (FunQ = 3.87 ± 0.25; 95% CI: 3.71–4.03), and intrinsic motivation (IMI = 3.86 ± 0.53; 95% CI: 3.52–4.20). Disabling the hint system was associated with lower ratings under both models, with the largest declines observed for IMI: from 3.86 to 2.67 with Sentence Transformers and from 3.30 to 2.71 with Word2Vec. When hints were available, Sentence Transformers slightly exceeded Word2Vec on UES-SF (3.63 vs. 3.56) and FunQ (3.87 vs. 3.79), whereas motivation appeared primarily driven by hint availability rather than model choice. The weakest profile corresponded to Word2Vec without hints (UES-SF = 3.15; FunQ = 3.32; IMI = 2.71), though the wider confidence intervals in this group, reflecting its smaller sample size (n = 5; see Section 7.3), suggest caution. These descriptive trends motivate the inferential analyses reported next. While both game metrics and self-reported questionnaire data were collected as outcome measures, our hypothesis testing and subsequent multivariate analyses focused specifically on the questionnaire-based measures: UES-SF (engagement), FunQ (fun), and IMI (intrinsic motivation).
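The condition-level summaries of the form "mean ± SD; 95% CI" reported above can be obtained with a standard t-based interval. The sketch below (hypothetical ratings, not study data) shows one way to compute it:

```python
import numpy as np
from scipy import stats

def mean_ci(scores, confidence=0.95):
    """Return the mean and a t-based confidence interval for one condition's scores."""
    scores = np.asarray(scores, dtype=float)
    m = scores.mean()
    se = stats.sem(scores)  # standard error: sd / sqrt(n)
    half_width = se * stats.t.ppf((1 + confidence) / 2, df=len(scores) - 1)
    return m, m - half_width, m + half_width

# Hypothetical UES-SF ratings for one condition (n = 12, as in most groups here).
example_scores = [3.2, 3.8, 4.0, 3.5, 3.9, 3.6, 3.4, 3.7, 3.3, 4.1, 3.5, 3.6]
print(mean_ci(example_scores))  # (mean, lower bound, upper bound)
```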
7.2. Assumption Checks
Before conducting multivariate analysis, we assessed all primary outcome variables against the relevant assumptions. Shapiro–Wilk tests indicated no significant deviations from normality for UES–SF, FunQ, or IMI. Levene’s tests likewise indicated that the assumption of homogeneity of variances was satisfied. Furthermore, inspection of Pearson correlation coefficients among the three dependent variables showed moderate, statistically significant positive correlations between UES–SF and FunQ and between FunQ and IMI, whereas the correlation between UES–SF and IMI was weaker and not significant. These results indicate that, while the constructs are related, they are not redundant, and their intercorrelation is sufficient to justify a multivariate analysis approach. Given that all necessary assumptions were satisfied, we proceeded with a factorial MANOVA to evaluate the effects of the semantic embedding model and the dynamic hint system on the outcome measures.
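The analyses reported here were run in SPSS; purely for illustration, equivalent assumption checks could be scripted as follows (Python with SciPy; the data-frame layout and column names are assumptions, not the study's actual pipeline):

```python
import pandas as pd
from scipy import stats

def check_assumptions(df: pd.DataFrame):
    """Normality, homogeneity of variance, and DV intercorrelations for a 2x2 design.

    `df` is assumed to hold one row per participant with columns
    'UES_SF', 'FunQ', 'IMI', 'model', and 'hints' (illustrative names).
    """
    for dv in ["UES_SF", "FunQ", "IMI"]:
        _, p_normality = stats.shapiro(df[dv])                      # Shapiro-Wilk test
        cells = [g[dv].values for _, g in df.groupby(["model", "hints"])]
        _, p_variance = stats.levene(*cells)                        # Levene's test across cells
        print(f"{dv}: Shapiro-Wilk p = {p_normality:.3f}, Levene p = {p_variance:.3f}")

    # Pairwise Pearson correlations among the dependent variables
    for a, b in [("UES_SF", "FunQ"), ("FunQ", "IMI"), ("UES_SF", "IMI")]:
        r, p = stats.pearsonr(df[a], df[b])
        print(f"r({a}, {b}) = {r:.2f}, p = {p:.3f}")
```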
7.3. Participant Decay and Condition-Specific Dropout
During data collection, a pronounced pattern of participant dropout was observed in one experimental condition: Word2Vec + Hints Disabled. Out of 15 users who entered this group, only 5 completed the post-game self-report questionnaire. In contrast, completion rates were substantially higher in all other conditions (Sentence Transformer + Hints, Sentence Transformer + No Hints, Word2Vec + Hints), with 12 valid responses per condition.
This markedly higher decay rate (only 33.3% completion in Word2Vec + No Hints compared to nearly 85% in other conditions) is noteworthy. Most participants either exited the game before reaching the survey or chose not to continue after the gameplay session, suggesting a significant lack of engagement or dissatisfaction uniquely associated with that configuration. This outcome could indicate reduced engagement, precisely the facet measured by our primary dependent variables, and aligns with our finding that advanced models and/or hinting systems are critical for sustained user involvement in a semantic guessing game context.
Consequently, for subsequent statistical analyses, the Word2Vec + No Hints condition contained a much reduced sample size (n = 5), while each of the other conditions included 12 participants. This imbalance was taken into account in the interpretation of the inferential statistics (see Section 8). Nonetheless, this dropout phenomenon may reflect reduced engagement in the absence of advanced semantic modelling and adaptive support, although we cannot definitively attribute causality to these factors.
7.4. Statistical Analysis and Hypothesis Testing
A multivariate analysis of variance (MANOVA) was conducted to examine the main and interaction effects of the semantic embedding model (Sentence Transformers vs. Word2Vec) and the dynamic hint system (enabled vs. disabled) on engagement (UES–SF), perceived fun (FunQ), and intrinsic motivation (IMI). This approach directly addressed our primary preregistered hypotheses (Heng1–3, Hfun1–3, Him1–3).
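For context, a 2 × 2 factorial MANOVA of this form could be specified as in the minimal sketch below (Python with statsmodels, for illustration only; the reported analysis was run in SPSS and the column names are assumptions):

```python
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

def run_factorial_manova(df: pd.DataFrame):
    """2x2 MANOVA: semantic model x hint system on UES_SF, FunQ, and IMI."""
    mv = MANOVA.from_formula("UES_SF + FunQ + IMI ~ C(model) * C(hints)", data=df)
    # mv_test() reports Wilks' lambda, Pillai's trace, etc. for each main and interaction effect.
    print(mv.mv_test())
```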
The main effect of the Model factor indicated that, across all conditions, participants using Sentence Transformers reported higher engagement and fun than those using Word2Vec (cf. Table 4, Figure 5). The main effect of the Hints factor showed that, independent of Model, enabling dynamic hints significantly increased all three outcomes (engagement, fun, and motivation), supporting hypotheses Heng2, Hfun2, and Him2. The main effect of Model was not significant for intrinsic motivation, thus only partially supporting Him1.
Crucially, a significant interaction effect (Model × Hints) demonstrated that the effect of one variable depends on the level of the other. For instance, as detailed in Table 4, the benefit of dynamic hints on engagement and fun was especially strong when combined with the Sentence Transformer model. For motivation (IMI), the combination produced a larger increase than either factor alone; still, the motivational benefit of Sentence Transformers became apparent mainly when hints were absent, suggesting possible “redundancy” in their combined application (see Figure 7).
To clarify the effects of the individual variables with respect to our hypotheses, we conducted Bonferroni-corrected univariate ANOVAs for each dependent variable. Table 4 details the test statistics for all main and interaction effects. The estimated marginal means (±95% CI) for engagement (UES-SF), fun (FunQ), and intrinsic motivation (IMI) by Model and Hint condition are shown in Figure 7, Figure 8 and Figure 9; these plots complement the main and interaction effects reported in Table 4.
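These follow-up tests could be scripted along the following lines (an illustrative sketch in Python/statsmodels; applying the Bonferroni correction to the alpha level rather than to the p-values is an assumption here):

```python
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

def followup_anovas(df, dvs=("UES_SF", "FunQ", "IMI"), alpha=0.05):
    """One two-way ANOVA per dependent variable with a Bonferroni-adjusted alpha."""
    adjusted_alpha = alpha / len(dvs)   # Bonferroni correction across the three DVs
    for dv in dvs:
        fit = smf.ols(f"{dv} ~ C(model) * C(hints)", data=df).fit()
        table = anova_lm(fit, typ=2)    # Type II sums of squares
        print(f"--- {dv}: compare p-values against alpha = {adjusted_alpha:.4f} ---")
        print(table)
```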
Across the outcomes reported in Table 4, dynamic hints showed large effects on engagement (UES-SF), fun (FunQ), and intrinsic motivation (IMI). The semantic model showed moderate-to-large effects on engagement and fun, and a small-to-moderate effect on motivation. The interaction effects were moderate, indicating that Sentence Transformers amplified the benefit of hints.
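For reference, the effect sizes discussed here are partial eta-squared values, defined as \eta_p^2 = SS_{effect} / (SS_{effect} + SS_{error}); by the common convention, values of roughly 0.01, 0.06, and 0.14 are read as small, medium, and large effects, respectively.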
In summary, engagement and fun benefited from both advanced models and dynamic hints, especially in combination, while adaptive hints most reliably enhanced motivation. This pattern supports all engagement and fun hypotheses and those relating to the motivational effects of adaptive support.
8. Discussion
This study aimed to elucidate how advanced semantic modelling and the inclusion of a dynamic hint system influence engagement, intrinsic motivation, and fun among users interacting with an online semantic word-guessing game. Through a controlled factorial design, we were able to isolate the effects of Transformer-based semantic embeddings and adaptive feedback mechanisms on both behavioural and subjective metrics.
The results revealed clear patterns across the primary dependent variables: First, using Transformer-based embeddings and dynamic hints significantly enhanced user engagement as measured by the UES-SF. The effect was statistically robust and reflected behavioural persistence, with longer session times and higher completion rates in these conditions. The fact that engagement was most pronounced when both factors were combined highlights the importance of aligning technical capabilities (deep semantic matching) with the adaptivity of user support mechanisms.
User-reported fun, as assessed by the FunQ questionnaire, showed a parallel pattern. Both main effects and their interaction reached significance, with the highest fun reported under the dual presence of rich semantic modelling and dynamic hints. This is consistent with the theoretical view that enjoyment in digital games is fostered when the challenge is well matched to user abilities and is scaffolded in ways that avoid excessive stagnation or frustration.
Intrinsic motivation, measured via the adapted IMI, was reliably increased by providing adaptive hints but not as a main effect of the underlying semantic model. Notably, a significant interaction was found, indicating that the motivational advantage of semantic Transformers manifests most strongly when hints are absent. This pattern suggests that motivational benefits may arise from either deeper semantic feedback or a supportive hint system, but not necessarily from both simultaneously, pointing to a kind of “motivational redundancy” in their combined application. Although the Model’s main effect on intrinsic motivation did not reach significance, the partial eta-squared suggested a small-to-moderate trend. Together with the significant Model × Hints interaction and the estimated marginal means shown in Figure 7, this is compatible with a motivational redundancy account: adaptive hints robustly increase perceived competence and thus intrinsic motivation, while richer semantic feedback from Sentence Transformers contributes to motivation mainly when hints are disabled. On the other hand, the unequal sample sizes, particularly in the Word2Vec + No Hints condition, may have reduced sensitivity to small main effects on IMI; these estimates should therefore be treated as relative (see Section 7.3 and the Limitations below). In practical terms, these results point to the importance of detecting the right moment to intervene and calibrating the level of hints so as to preserve “productive struggle”; they also suggest that advanced semantic modelling may be particularly valuable for maintaining motivation when the hint system cannot provide sufficient support, while benefiting engagement and enjoyment more consistently across circumstances.
The observed effects may be understood through cognitive and affective mechanisms. Enhanced semantic modelling likely enables more accurate and rewarding feedback; users can sense that the system “understands” their intent at a meaning-based level, which can foster a sense of competence and relatedness. Meanwhile, the adaptive hint system prevents stagnation, averts frustration, and may restore user self-efficacy when progress stalls. The synergistic effect of the two systems maintains a “productive struggle” that is optimal for engagement and positively affects gameplay. The effect sizes obtained, particularly for dynamic hints, suggest practical significance beyond statistical reliability, while the small-to-moderate model effect on motivation indicates a trend that may be resolved with larger samples.
Although we adopted an EMA-based trigger, other policies are also reasonable. Rule-based triggers (e.g., firing after a fixed number of failed attempts, or when scores remain below a threshold) are simple and transparent, but they can ignore the magnitude and direction of change, require frequent readjustment, and sometimes fire after harmless oscillations. Time-based triggers are easy to implement, but they conflate reflection time, writing speed, and device or network latency, so they may not reliably indicate that a player is actually stuck. User-initiated hints maximise autonomy and transparency but can introduce biases that complicate comparisons between conditions; in pilot tests where we enabled the option to request hints, several participants avoided asking for help even when they were stuck, which increased the variability in frustration. In contrast, our EMA-based policy is lightweight and reacts to trends rather than to noise or isolated peaks. The two-stage EMA described above reduces fluctuation while remaining sensitive enough to detect when a player is stuck, though it still requires parameter choices and a brief warm-up period.
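To make the flavour of such a policy concrete, the following is a minimal sketch of a two-stage EMA trigger of the kind described; the parameter names, default values, and exact firing rule are illustrative assumptions, not the deployed implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EmaHintTrigger:
    """Two-stage EMA over per-guess semantic scores (all parameters illustrative)."""
    alpha_fast: float = 0.5      # smoothing factor of the fast, reactive EMA
    alpha_slow: float = 0.2      # smoothing factor of the slow, trend EMA
    warmup_guesses: int = 3      # guesses to observe before any hint may fire
    stall_margin: float = 0.05   # how far the fast EMA may fall below the slow one

    fast: Optional[float] = None
    slow: Optional[float] = None
    seen: int = 0

    def update(self, similarity: float) -> bool:
        """Feed the latest guess's semantic score; return True if a hint should fire."""
        self.seen += 1
        if self.fast is None or self.slow is None:
            self.fast = self.slow = similarity
            return False
        self.fast = self.alpha_fast * similarity + (1 - self.alpha_fast) * self.fast
        self.slow = self.alpha_slow * similarity + (1 - self.alpha_slow) * self.slow
        if self.seen < self.warmup_guesses:
            return False
        # Fire on a sustained stalled or declining trend, not on a single poor guess.
        return (self.slow - self.fast) > self.stall_margin
```

In use, the game loop would call `update()` with the similarity score of each new guess and surface a hint whenever it returns True; the warm-up and smoothing factors are the parameters that, as noted above, still require tuning.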
Limitations
While our results are promising, several limitations must be acknowledged. Though adequate for detecting main effects, the sample size was restricted in some subgroups due to pronounced participant dropout. Notably, in the condition lacking advanced semantics and dynamic hints (Word2Vec + No Hints), only a few participants completed the experiment. Such an imbalance can reduce power, widen confidence intervals, and may make interaction estimates more variable and sensitive to deviation from normality and equal variances across groups. The dropout in that condition may also be related to the outcomes (e.g., lower engagement), which could introduce bias into condition means. For example, if less-engaged participants were less likely to complete the survey, the remaining users might overestimate engagement. To minimise these risks, we checked distributional assumptions, reported effect sizes and confidence intervals, and used multivariate tests, which are commonly applied with unequal n’s; even so, effects involving the smallest sample size, especially interactions, should be interpreted with caution.
Our group of participants was predominantly highly educated, and many reported a medium to high level of technological knowledge. This may have several implications. First, it could produce ceiling effects in basic usability and in the understanding of instructions, which could overstate perceived ease of use. Second, digitally experienced users may be more tolerant of new interaction patterns and of system-generated hints, and may exhibit different help-seeking behaviours than less experienced users. These factors could attenuate or amplify the observed effects, so external validity for populations with less digital experience should be considered provisional.
While advantageous for reach and ecological validity, the web-based study design can introduce selection biases related to digital access and user motivation, limiting the extent to which these results can be generalised across different user populations or deployment scenarios.
Finally, reliance on remote API calls for generative models (e.g., GPT-4) may constrain responsiveness and scalability for real-time or edge applications. Technical optimisations regarding cost, latency, and robustness should be addressed in future work.
9. Conclusions
This work demonstrated that integrating Transformer-based NLP models and adaptive hinting mechanisms significantly enhanced user engagement and satisfaction in web-based semantic guessing games. Our results showed that the combination of advanced semantic embedding (Sentence Transformers) and dynamic feedback yielded statistically significant improvements in self-reported engagement, fun, and motivation compared to static word embeddings and minimal feedback baselines. Notably, user attrition rates were dramatically reduced in adaptive conditions, supporting the necessity of both technological and interactional scaffolding to sustain meaningful human–AI interaction.
Beyond technical performance, our findings suggest potential implications for designing intelligent educational and entertainment systems, particularly in the context of semantic word games and similar adaptive interfaces. The successful deployment of Semantrix illustrates the value of responsive, personalised support in promoting richer and potentially longer-lasting user experiences within this domain.
While this study demonstrates the benefits of adaptive and semantic feedback for user engagement and motivation, several routes remain for further exploration that could broaden the implications and applicability of our findings. Increasing the sample size and expanding demographic diversity would enable a more robust assessment of generalisability across varied populations and usage contexts. Methodological control measures would also help reduce imbalance and improve statistical reliability, such as setting per-condition quotas based on a priori power analysis and using stratified randomisation, for example by educational level and digital literacy, to limit uneven distribution across conditions. Alternative types of hints (e.g., iconographic or audio-supported) would likewise be valuable for better supporting non-gamers, older adults, and users with less digital literacy.
Longitudinal studies may also help examine whether observed engagement and motivation improvements translate to lasting educational or cognitive benefits. Additionally, it would be beneficial to examine the effects of different feedback modalities (e.g., audio or visual cues) and variants in game structure (such as collaborative or competitive modes), to better understand the mechanisms underlying the observed effects and their potential transferability. Looking ahead, it might be useful to develop the alternatives mentioned (rule-based and time-based triggers, etc.) and explore a comparison that also considers user-initiated hints. It might also be insightful to examine a hybrid policy that combines EMA detection with user confirmation and a conservative timeout, along with sensitivity analyses of EMA parameters.
These directions are not exhaustive but suggest natural extensions of the present work. Addressing them would help establish a broader understanding of how adaptive feedback systems can be deployed effectively across settings, while clarifying the scope of their educational and motivational impact.
While preparing this work, the authors used GPT-4o, from OpenAI, and Grammarly, from Grammarly Inc., to improve the work’s readability and language. After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.