WarehouseGame Training: A Gamified Logistics Training Platform Integrating ChatGPT, DeepSeek, and Grok for Adaptive Learning

by Juan José Romero Marras *,†, Luis De la Torre and Dictino Chaos García
Departamento de Informática y Automática, ETSI Informatica, UNED, Juan del Rosal 16, 28015 Madrid, Spain
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2025, 15(12), 6392; https://doi.org/10.3390/app15126392
Submission received: 2 May 2025 / Revised: 1 June 2025 / Accepted: 3 June 2025 / Published: 6 June 2025

Abstract

Modern warehouses play a fundamental role in today’s logistics, serving as strategic hubs for the reception, storage, and distribution of goods. However, training warehouse operators presents a significant challenge due to the complexity of logistics processes and the need for efficient and engaging learning methods. Training in logistics operations requires practical experience and the ability to adapt to real-world scenarios, which can result in high training costs. In this context, gamification and artificial intelligence emerge as innovative solutions to enhance training by increasing operator motivation, reducing learning time, and optimizing costs through personalized approaches. But is it possible to effectively apply these techniques to logistics training? This study introduces WarehouseGame Training, a gamified training tool developed in collaboration with Mecalux Software Solutions and implemented in Unity 3D. The solution integrates large language models (LLMs) such as ChatGPT, DeepSeek, and Grok to enhance adaptive learning. These models dynamically adjust challenge difficulty, provide contextual assistance, and evaluate user performance in logistics training scenarios. Through this gamified training tool, the performance of these AI models is analyzed and compared, assessing their ability to improve the learning experience and determining which one best adapts to this type of training.

1. Introduction

A Statista report [1] suggests that by 2025 the global e-commerce sector is expected to reach around $6.3 trillion in sales, a compound annual growth rate (CAGR) of 11.8% since 2021. This increase in demand creates substantial hurdles for the logistics supply chain, making continuous improvement crucial in aspects such as storage, transportation, and distribution. Warehouses serve a key function in contemporary logistics by acting as centers for the effective receiving, storing, and distributing of products. The incorporation of digitalization and automation in warehouses has greatly boosted their productivity: technologies like automated guided vehicles (AGVs), digital twins, and virtual reality have optimized decision-making, lowered errors, and enhanced efficiency [2,3,4,5].

Nonetheless, despite these technological innovations, the human element remains vital in many warehouse logistics operations. Human operators remain essential for various logistical responsibilities, even within highly automated settings. Repetitive activities, such as order picking and inventory management, can lead to lower motivation. Self-determination theory (SDT) emphasizes that intrinsic motivation flourishes when people feel competent, autonomous, and socially connected [6]. This is where training tools can significantly aid operators during their learning journeys.

In recent times, a variety of technologies have emerged to improve learning, including serious games, gamification, and artificial intelligence. Serious games are digital solutions specifically designed for educational and training purposes, intertwining training goals with engaging gameplay; their ability to enhance learning across multiple subjects has led to increased adoption in educational environments [7,8,9]. Because serious games actively involve users through engaging and challenging interactions, many organizations have implemented them for employee training in areas like leadership, sales, or vital operations. Another significant technology is gamification, which applies game-like elements and mechanics in settings outside of gaming, such as business and education [10]. When gamification is incorporated into training tools, it boosts user engagement, motivation, and learning results. Lastly, artificial intelligence, especially the use of large language models (LLMs), is a key change agent in education and gamification. LLMs are sophisticated neural networks trained on extensive textual datasets, designed to comprehend, generate, and adapt language in context. By using architectures like transformers and attention mechanisms, these models can understand queries, produce coherent replies, and flexibly adjust information based on user interactions. Within this platform, LLM-based tools like ChatGPT 4, DeepSeek, and Grok [11,12,13] can enhance personalization in gamification and serious games, fluidly adjusting challenges and content to fit the skills and requirements of the players [14,15].

This research aimed to assess and contrast the performance of the leading AI models available today, to find out which one best fits gamification environments. Important evaluation factors included the ability to customize challenges, accuracy in providing relevant training content, and adaptability to various user experience levels by tailoring experiences and facilitating practical learning through the integration of LLMs.

2. Background

The incorporation of gamification, serious games, and large language models (LLMs) into the logistics sector is transforming both training and everyday tasks. These innovations not only improve the efficiency and motivation of operators but also prepare companies to tackle future supply chain issues with increased flexibility and effectiveness.

Several investigations [16,17] have looked into how serious games merge hands-on learning with interactive simulations, resulting in better outcomes for logistics education. Serious games give users the chance to learn from their mistakes without any real-life risks. For example, ref. [18] highlights the favorable effects of serious games on independent learning. In addition, gamification enhances logistics operations and workforce training by introducing challenges, rewards, and simulations, which stimulate operators and boost their performance, leading to less downtime and fewer mistakes [19,20,21]. From an educational standpoint, gamified platforms that feature leaderboards and performance metrics increase engagement in tasks and satisfaction in developing competencies [22,23].

LLMs also offer many advantages for logistics training. They facilitate real-time error analysis and personalized improvement suggestions, as well as the design of more realistic and adaptable training scenarios. These models can fluidly respond to the unique demands of logistics training, delivering tailored feedback and guidance. Operators trained with LLMs and serious games show quicker learning rates and better skill development [24].

Most studies conducted so far in the academic field have focused on how serious games can function as teaching aids for logistics education. The studies in [25,26] created game models simulating genuine supply chain scenarios, tackling issues like storage, distribution, and transport. In a similar vein, ref. [27] examined over 40 games, analyzing the influence of game-based learning on supply chain management, particularly how the complexity of game design affects learning and operational results. Although these studies have offered useful insights, they have predominantly aimed at training students in educational settings. This creates some constraints, as the absence of application in actual business environments may lead to games that fail to realistically portray logistical operations or that have not been tested for practical use. Expanding upon this previous research, the current study applies these methods (serious games, gamification, and LLMs) within a more practical context.

3. Materials and Methods

In collaboration with Mecalux Software Solutions [28], researchers have developed WarehouseGame Training (see Figure 1), a gamification platform specifically designed to train and enhance logistical processes. This platform aims to explore the effects of gamification on training in logistical processes, specifically in the simulation of order picking. In a 3D virtual environment, the platform incorporates key gamification elements, such as scoring systems and engaging visual and narrative components. WarehouseGame Training was conceived with the following guiding principles:
1. Realistic and immersive simulation.
2. Gamification.
3. Personalized learning.

3.1. Realistic and Immersive Simulation

The solution proposed in this paper is designed to replicate warehouse operations with a high degree of realism, accurately representing workflows, equipment, and the common challenges encountered in daily operations. This immersive environment allows trainees to familiarize themselves with warehouse processes in a risk-free setting, facilitating a smoother and more effective transition to real-world tasks. It was developed using the Unity3D engine [29] (see Figure 2), renowned for its advanced graphics and simulation capabilities, incorporating essential elements such as automated storage areas, picking stations, loading and unloading docks, and sorting zones. The goal is to enhance user immersion, making the training experience more engaging and effective while fostering a deeper understanding and better adaptation to real-world scenarios. The simulated warehouse is based on realistic customer cases provided by Mecalux Software Solutions. Key elements of warehouse work are included, such as radio frequency terminals, shelf tag scanning, and product barcode scanning, reflecting common tasks in these environments.
A standout feature is the simulation of a real warehouse management system (WMS) on the radio frequency terminal. Players receive instructions through the terminal as if they are interacting with an actual WMS, enabling them to perform tasks as if they are working in a real warehouse. The simulated WMS software used is Easy WMS [28], a leading solution in warehouse management. The simulation is designed to encompass a wide variety of logistical tasks. The activities include order preparation (see Figure 3), material receiving, material placement, and forklift operations. This approach ensures comprehensive skill development, enabling employees to perform various roles and assume multiple responsibilities within the warehouse.
Each task category is designed with multiple difficulty levels, which are tailored to the player’s skills and progress. For example, in forklift operation tasks (see Figure 4) the initial challenges focus on basic movements, while more advanced levels require precise coordination. Similarly, order preparation tasks start with simple activities and progress to more complex scenarios, involving multiple simultaneous orders and time constraints.

3.2. Gamification

WarehouseGame Training has been designed to maximize player motivation and engagement through gamification. Research indicates that gamification enhances task engagement, skill acquisition, and knowledge retention [30,31,32]. However, its use in logistics training remains underexplored, underscoring the need for strategies tailored to individual profiles to maximize real-world impact [33,34,35]. Various gamification elements are carefully integrated into the learning experience (see Figure 5). These elements aim not only to make the process more dynamic and entertaining but also to strengthen the technical and operational competencies required in a real logistics environment. This approach encourages continuous improvement, enabling users to track their progress and strive for better results in each task category, from material reception to forklift operation. Additionally, each challenge is designed with time constraints, fostering speed and precision in task execution—key aspects in a logistics context.
Among the standout features is a points-and-levels system that rewards players for their successes and penalizes their mistakes. At the end of each challenge, the player’s results are displayed (see Figure 6).
Another improvement has been incorporated: an expanded narrative (see Figure 7). The expanded narrative provides detailed explanations of the challenges and concepts related to warehouse operations, offering a clearer and more engaging context that enhances the understanding of logistics processes.

3.3. Personalized Learning

One of the key innovations is the ability to customize training, using artificial intelligence (AI). The incorporation of advanced AI models, such as large language models (LLMs), distinguishes this study, enabling personalized training plans tailored to each player’s skill level. Through the simulation, players can interact with challenges designed by the following AI models:
1. ChatGPT [11]: Developed by OpenAI using the GPT-4 architecture, ChatGPT stands out for its versatility and ease of use. It excels in tasks such as creative writing, coding assistance, and general inquiries. Its chain-of-thought reasoning capability allows it to generate detailed and contextually relevant responses. However, it has limitations in accessing up-to-date information, as it cannot perform real-time web searches. GPT-4 was the model employed throughout the present study, accessed via OpenAI’s API.
2. DeepSeek [12]: Developed by the Chinese startup of the same name, DeepSeek is a language model based on a mixture-of-experts (MoE) architecture. Its latest version, DeepSeek-V3, features 671 billion parameters, activating 37 billion per token to enable efficient inference through multi-head latent attention (MLA). The model has surpassed ChatGPT in downloads on the App Store and has been adopted by companies such as Great Wall Motor. DeepSeek-V3 has demonstrated competitive performance and was accessed via its API throughout this research.
3. Grok [13]: Developed by xAI, the artificial intelligence company founded by Elon Musk in 2023, Grok is a chatbot designed to answer questions and provide broad insights. Its latest version, Grok 3, is noted for its advanced reasoning capabilities, outperforming competitors like ChatGPT. Additionally, xAI has integrated source citations into responses and introduced multimodal models, such as grok-2-vision-1212, which enhance accuracy and instruction-following capabilities across multiple languages. While Grok 3 was released during the course of this study, it was not yet accessible via API, so Grok 2 was employed instead.
These models enable the personalization of the training plan, based on each player’s initial skill level. The skill level can be determined in two ways: (1) through an initial survey (see Figure 8), designed based on Mecalux Software Solutions’ expertise, consisting of two questions per category to assess the player’s proficiency (see Table A9), or (2) by direct selection of the skill level by the player prior to starting the simulation (see Figure 9). The survey evaluates five key warehouse processes: general, material reception, order preparation, container placement, and forklift handling (see Figure 8). If the survey is used, the AI analyzes the responses to assign an appropriate skill level (e.g., beginner, medium, advanced, or expert).
The LLMs used in this study were accessed via their respective APIs (see next section) without additional training or fine-tuning. Instead, personalization was achieved through carefully designed prompts, leveraging zero-shot and few-shot learning techniques to adapt the models’ responses to the training context.

3.3.1. Learning Techniques

Two primary techniques are employed:
  • Zero-shot learning: The LLMs generated responses based solely on the instructions provided in the prompts, without prior examples. This approach is used for initial challenge generation, where clear and detailed instructions are sufficient to obtain valid challenges.
  • Few-shot learning: In cases requiring higher precision or specific response formats, prompts include contextual examples. For instance, to ensure a consistent structure for challenge instructions, prompts provide one or two examples of desired outputs (see the sketch after this list).
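To make these two techniques concrete, the following is a minimal C# sketch of how a challenge-generation prompt could be assembled. The class and method names are illustrative assumptions rather than the platform’s actual code; the embedded example output follows the JSON layout of Table A2:

```csharp
using System.Text;

public static class PromptBuilder
{
    // Builds a challenge-generation prompt; with fewShot = false this is the
    // zero-shot variant (instructions only), with fewShot = true one example
    // output is embedded so the model reproduces the exact JSON layout.
    public static string BuildChallengePrompt(string category, string level, bool fewShot)
    {
        var sb = new StringBuilder();
        sb.AppendLine($"The player is about to undertake {category} challenges.");
        sb.AppendLine($"Their level in the {category} category is {level}.");
        sb.AppendLine("I need a challenge appropriate for their level, so that they can progress and reach expert level.");
        if (fewShot)
        {
            sb.AppendLine("Respond exactly in this format, for example:");
            sb.AppendLine("{ \"orders\": 1, \"tasks\": 5, \"multi-reference\": false, " +
                          "\"level\": \"beginner\", \"time\": 15, \"failures\": 3, " +
                          "\"explanation\": \"...\", \"help\": true, \"minimap\": false }");
        }
        return sb.ToString();
    }
}
```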

3.3.2. Prompt Design Strategy

Two types of prompts were developed and tested (see Table A1):
  • Initial Level Assessment Prompts: Based on the results of a two-question survey per training category, the system uses these prompts to classify the player’s experience level (beginner, intermediate, advanced, expert) across key logistics skills. The survey is based on real-world experience with Mecalux Software Solutions and reflects practical warehouse scenarios.
  • Error-based Instructional Prompts: A separate set of prompt templates was created to generate context-aware feedback based on user performance. For example, if a player commits a common error (such as misplacing a container or skipping a verification step), the game engine captures this action and feeds a predefined error prompt to the LLM. These prompts contain embedded variables, such as the player’s level, the task type, and the mistake made, which the model uses to generate adaptive feedback; a sketch of such a template follows this list.
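As an illustration, here is a minimal sketch of such a template. The wording is adapted from the prompt examples in Table A1, and all identifiers (ErrorPrompts, Build, and its parameters) are illustrative assumptions, not the platform’s source code:

```csharp
public static class ErrorPrompts
{
    // playerLevel, taskType, and mistake are the embedded variables described above;
    // challengeJson carries the parameters of the challenge in progress.
    public static string Build(string playerLevel, string taskType,
                               string mistake, string challengeJson)
    {
        return $"The player has just made an error during a {taskType} task: {mistake}. " +
               $"Their current level is {playerLevel}. " +
               "Analyze whether it is necessary to assist the player: " +
               "(A) add more in-game assistance, such as activating the minimap; " +
               "(B) in addition, reduce the difficulty of the challenge if mistakes are excessive; " +
               "(C) do nothing. " +
               $"The challenge in progress is: {challengeJson}";
    }
}
```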

3.3.3. Dynamic Progression and Adaptation

Following the initial assessment of the player’s skill level, either through a survey or manual selection, the artificial intelligence system initiates a customized training trajectory composed of progressively challenging scenarios. Each training category (e.g., order preparation, material reception) commences with a task adapted to the player’s level (e.g., beginner). Upon successful completion, the system incrementally increases the difficulty of subsequent tasks, thereby facilitating a structured advancement toward expert-level competency. This adaptive mechanism ensures that progression aligns with the player’s performance. When a player fails to complete a challenge successfully, the AI regenerates the task at the same difficulty level, incorporating modifications designed to enhance understanding, such as increased instructional support or reduced complexity. In instances of partial errors, the system may provide targeted assistance in the form of contextual hints or corrective guidance (see Figure 10).

The interaction between the player and the AI is governed by a closed-loop feedback cycle consisting of the following stages: Action → Evaluation → Prompt → Response → Adjustment. Throughout each challenge, the game engine continuously monitors the player’s actions (e.g., scanning items, positioning containers, operating a forklift) and evaluates them against predefined performance criteria. The outcomes, ranging from successful completions to errors, are encapsulated in a structured prompt and transmitted to a large language model (LLM). The LLM, in turn, processes this contextual information to generate personalized feedback, adjust the current challenge, or propose subsequent tasks within the training framework. This closed-loop system enables real-time personalization of the training experience, ensuring that the learning process dynamically adapts to the player’s evolving capabilities. Upon achieving expert-level performance in a given category, the system automatically advances the player to the next instructional module.

A demonstration of WarehouseGame Training is available in a video [36].
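The cycle above can be summarized in code. The following self-contained C# sketch shows one pass of the loop, reusing the hypothetical ErrorPrompts helper from Section 3.3.2; the delegate stands in for the API call described in Section 4, and none of these names come from the platform’s source code:

```csharp
using System;
using System.Threading.Tasks;

// The game engine's evaluation of a player action (Action -> Evaluation).
public record Evaluation(string Level, string TaskType, string Mistake, string ChallengeJson);

public class FeedbackLoop
{
    private readonly Func<string, Task<string>> _askLlm; // stands in for the LLM API call

    public FeedbackLoop(Func<string, Task<string>> askLlm) => _askLlm = askLlm;

    // One pass of Prompt -> Response -> Adjustment for an already evaluated action.
    public async Task<string> StepAsync(Evaluation eval)
    {
        string prompt = ErrorPrompts.Build(eval.Level, eval.TaskType,
                                           eval.Mistake, eval.ChallengeJson); // Prompt
        string adjustmentJson = await _askLlm(prompt);                        // Response
        return adjustmentJson; // Adjustment: parsed and applied by the game engine
    }
}
```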

4. Implementation

To enable personalization, a communication protocol is implemented, using the APIs of LLMs. The AI models share a common communication model based on a REST architecture. In this architecture, the client sends HTTP requests to the API endpoints exposed by the respective servers. Any application that interacts with the API can function as a client, including chatbots, virtual assistants, or a Unity C# application such as WarehouseGame Training. Each platform provides extensive documentation, including sample code in multiple programming languages, simplifying the integration process. Before our software can communicate with any of these APIs, several preliminary steps must be completed:
1. Account creation: A user account must be created on each platform.
2. API key generation: An API key must be generated to authenticate and secure communication between our software and the API.
3. Credit balance: Since API usage is not free, it is necessary to add funds to the account. Each platform applies different pricing, based on the model and number of requests. If the user lacks sufficient credit, the service will not process requests.
Once these steps are completed, our software can establish communication with the selected API and process the responses accordingly. During interaction, the application sends requests based on user input, processes the JSON response returned by the API, and interprets it to provide real-time feedback through the 3D interface. The communication flow is straightforward (see Figure 11):
1. Request AI assistance: From the Unity-based application, which serves as a 3D simulator, requests are sent to the AI for assistance in various aspects of training. These requests are handled by the corresponding class within the application, depending on the AI model used, ensuring efficient and adaptive communication with the selected API.
2. Send message: Once the request type has been identified, the system constructs the message to be sent to the AI. There are several types of messages. Each message consists of two parts: a description of the request’s objective (see Table A1) and a specification of the expected response format (see Table A2). The first part is a textual instruction directed at the AI, while the second part is a standardized JSON structure, ensuring that the Unity application can correctly interpret and process the response.
3. HTTP POST request: Once the message is constructed, the next step is to send a POST request to the LLM server. This request must follow a specific structure to ensure that the server can correctly process it. The message must be encapsulated in a format that complies with the API’s requirements. Fortunately, as previously mentioned, all the LLM APIs share architectural similarities, making the POST request structure identical across all of them. The required format is as follows:
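(The original article renders this listing as an image. The following is a plausible reconstruction based on the field descriptions below; the model name and message contents are placeholders.)

```json
{
  "model": "gpt-4",
  "messages": [
    { "role": "system", "content": "You are an assistant for warehouse logistics training." },
    { "role": "user", "content": "The player is about to undertake order preparation challenges..." }
  ]
}
```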
where
  • “model” specifies the language model to be used;
  • “messages” contains the structured conversation;
  • “role”: “system” defines the AI’s general behavior;
  • “role”: “user” represents the message sent by the user.
Once this structure is prepared, the HTTP client sends a POST request to a generic API endpoint, such as https://server_llm/v1/chat/completions, where server_llm represents the server of the LLM model (ChatGPT, Grok, or DeepSeek). This endpoint is consistent across all models, with only the server name changing depending on the LLM provider.
4. HTTP POST response: The server processes the request, executes the language model, and generates a response based on the conversation history and the defined parameters. The response is returned in JSON format with the following structure:
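(As above, the original listing is an image. A plausible reconstruction, following the OpenAI-style chat-completions response shape; field values are placeholders.)

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "gpt-4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{ \"orders\": 1, \"tasks\": 5, \"multi-reference\": false, ... }"
      },
      "finish_reason": "stop"
    }
  ]
}
```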
The ‘content’ field contains the response generated by the AI. This response adheres to the JSON format previously specified in the request sent to the model. The specific JSON structure varies depending on the type of response, as detailed in Table A2.
5. Process JSON response: Finally, the AI class in Unity deserializes the response and processes its content for interpretation and use within the simulation environment. At this stage, the system analyzes the AI-generated response and executes the appropriate actions based on the type of request. For instance, if the response pertains to an in-progress challenge assistance request, the system dynamically adjusts the help parameters to provide additional guidance to the user. If the request involves generating a new challenge, the system utilizes the AI-provided information to configure and present a tailored task suited to the player’s needs. Moreover, the AI response is displayed to the user within the Unity 3D environment through the simulator interface, offering a detailed explanation that seamlessly integrates with the training context (see Figure 10). This ensures that users can fully understand the recommendations or adjustments made in real time, thereby enhancing the learning experience.
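As an illustration of steps 3–5, the following Unity C# sketch sends the request and hands the raw JSON back to the caller for deserialization. It assumes an OpenAI-compatible endpoint as described above; the class name, serialized fields, and callback are illustrative, not the platform’s actual implementation:

```csharp
using System.Collections;
using UnityEngine;
using UnityEngine.Networking;

public class LlmClient : MonoBehaviour
{
    [SerializeField] private string endpoint = "https://server_llm/v1/chat/completions";
    [SerializeField] private string apiKey = "YOUR_API_KEY"; // see "API key generation" above

    // Sends the already-constructed request JSON (step 3) and returns the raw
    // response JSON (step 4) to the caller, which deserializes it (step 5).
    public IEnumerator SendPrompt(string requestJson, System.Action<string> onResponseJson)
    {
        using (var req = new UnityWebRequest(endpoint, "POST"))
        {
            byte[] body = System.Text.Encoding.UTF8.GetBytes(requestJson);
            req.uploadHandler = new UploadHandlerRaw(body);
            req.downloadHandler = new DownloadHandlerBuffer();
            req.SetRequestHeader("Content-Type", "application/json");
            req.SetRequestHeader("Authorization", "Bearer " + apiKey);

            yield return req.SendWebRequest();

            if (req.result == UnityWebRequest.Result.Success)
                onResponseJson(req.downloadHandler.text); // extract choices[0].message.content next
            else
                Debug.LogError(req.error);
        }
    }
}
```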
The source code of the implementation developed and described in this work is available on Zenodo [37], where it can be accessed for review and use.

5. Experimental Design for Evaluating LLM Performance

This experiment aimed to assess the performance of different large language models (LLMs) within WarehouseGame Training, focusing on their ability to generate, adjust, and optimize logistics challenges based on user performance and skill level. Simulated user profiles were categorized by skill level (beginner, medium, advanced, and expert) in order preparation (picking) tasks. The evaluation objectives were as follows:
1. Evaluation of difficulty progression.
2. Evaluation of model adaptability.
3. Evaluation of response compliance.
4. Evaluation of response time.
This methodology allowed for a structured evaluation of the LLMs, identifying the most effective models for adapting warehouse training scenarios.
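As a sketch of how such an evaluation can be logged, the following self-contained C# class tallies how often each generated challenge configuration appears across repeated requests (the consistency percentages reported in Tables A3–A6) together with round-trip latencies (Table A8). It is an illustrative assumption, not the authors’ actual harness:

```csharp
using System.Collections.Generic;
using System.Linq;

public class LlmEvaluationLog
{
    private readonly Dictionary<string, int> _configCounts = new Dictionary<string, int>();
    private readonly List<long> _latenciesMs = new List<long>();
    private int _iterations; // 50 iterations per model and level in this study

    public void Record(string challengeJson, long elapsedMs)
    {
        _configCounts.TryGetValue(challengeJson, out int count);
        _configCounts[challengeJson] = count + 1;
        _latenciesMs.Add(elapsedMs);
        _iterations++;
    }

    // Consistency: share of iterations that produced a given configuration.
    public double ConsistencyPercent(string challengeJson)
    {
        _configCounts.TryGetValue(challengeJson, out int count);
        return 100.0 * count / _iterations;
    }

    public double MeanLatencyMs() => _latenciesMs.Average();
}
```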

5.1. Evaluation of Difficulty Progression

The LLMs were evaluated based on their ability to dynamically adjust the difficulty of the challenges according to user performance. An effective progression system ensured a balanced learning curve, preventing both frustration from excessive difficulty and disengagement due to lack of challenge. The evaluation focused on the following criteria:
1. Generating challenges appropriate to the user’s current skill level.
2. Gradually increasing difficulty when the user demonstrated consistent mastery.
3. Preventing difficulty escalation if the user had not yet corrected previous errors.

5.2. Evaluation of Model Adaptability

The adaptability of each LLM was assessed to determine its ability to personalize training experiences in real-time. A well-adapted model should ensure that challenges evolve logically based on user progress, promoting an effective and engaging learning process. The evaluation focused on the following criteria:
1. Dynamically adjusting challenges according to user performance.
2. Retaining user history for logical progression.
3. Identifying recurring errors and adapting challenge complexity to prevent frustration and reinforce learning.

5.3. Evaluation of Response Compliance

The evaluation of each LLM focused on its ability to generate responses that strictly adhered to the predefined communication protocol. Ensuring compliance with the expected response format was essential for seamless integration with WarehouseGame Training and to prevent system errors. The assessment criteria included:
1. JSON format compliance: The response had to strictly follow the expected JSON structure to ensure correct processing by the software.
2. Data integrity: All required fields had to be present and correctly structured within the response.
3. Error handling: The model should not generate malformed responses or omit critical information that would prevent interpretation by the system.
Maintaining response compliance ensured that the AI-generated outputs were functional, reliable, and seamlessly integrated into the training environment.

5.4. Evaluation of Response Time

The response speed of each LLM was analyzed to assess its ability to provide real-time feedback without perceptible delays. A fast response time was crucial for maintaining a seamless and engaging user experience, as excessive delays could disrupt the learning flow and negatively impact user interaction.

6. Results

6.1. Evaluation of Difficulty Progression

The difficulty progression of each LLM was evaluated based on its ability to gradually increase challenge complexity, with 50 iterations per level (see Table A3).

Grok demonstrated the most structured and consistent difficulty progression, with orders (1 → 3) and tasks (5 → 20) scaling logically, multi-reference containers introduced at the medium level, failures reduced (3 → 2), and assistance removed by the advanced/expert levels. It demonstrated high consistency (72.22% medium, 87.72% expert). However, 41.34% of advanced-level responses were misclassified as medium, indicating significant level response errors that could disrupt difficulty alignment.

ChatGPT provided satisfactory progression, with orders (1 → 3), tasks (5 → 15), multi-reference containers at the advanced level, and assistance phased out by the expert level. It retained three failures at the advanced level (55.32%), potentially hindering mastery, and assistance until advanced (53.82%). Consistency was lower (36% beginner, 29.17% medium), but no level response mismatches were observed, ensuring challenge alignment.

DeepSeek displayed irregularities, with an abrupt task increase (10 → 12) at the medium level, no multi-reference containers at this stage, and lenient time limits at the expert level (25 s). It was highly consistent at the beginner (100%) and expert (94%) levels but less so at the medium (38%) and advanced (60%) levels. No level response mismatches occurred, a notable strength.

Based on the evaluation, ChatGPT is the most suitable model, with potential adjustments to failure tolerances and assistance removal to enhance training efficacy.

6.2. Evaluation of Model Adaptability

The analysis of the results demonstrates how each model adapted the difficulty when players reached the maximum number of allowed failures. The evaluation criteria encompassed reductions in orders and tasks, increases in allowed failures, and reactivity, quantified through the Average Error Ratio (errors/failures × 100) and Average Difference (errors − failures). Table A4, Table A5 and Table A6 present the simulation results, detailing for each model the level and challenge, the adjusted challenge designed by the AI, and the consistency percentage (% of agreement), based on 50 iterations per model/level.

ChatGPT implemented substantial reductions in orders (from 2–3 to 1 at the medium and advanced levels) and tasks (from 15 to 5–12), increasing allowed failures to 10–12. However, it exhibited variable consistency (13.04–100%) and delayed reactivity (ratios: 128.43–424.69; differences: −0.84 to −7.22), indicating tardy adjustments.

DeepSeek reduced tasks (from 12–15 to 8–10) and increased failures (to 5–8, with 100% consistency in most cases) but retained two orders at the medium level (in 80.95% of cases), limiting adaptability. Its reactivity was moderate (ratios: 100–381; differences: 0.00 to −2.98).

Grok balanced task reductions (from 15–20 to 10–15) and failure increases (to 3–5), with high consistency (100% at advanced). Its reactivity was optimal (ratios: 89.33–200; differences: 0.32 to −2.00), particularly at the advanced level (ratio: 101.75; difference: 0.00).

Overall, ChatGPT provided aggressive adjustments but its slow reactivity compromised the user experience; DeepSeek was consistent but limited in adaptability, due to minimal order reductions; and Grok integrated balanced adjustments with rapid reactivity, minimizing player frustration. Grok was the most effective model, owing to its precise reactivity and well-calibrated adjustments, optimizing the learning experience.
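For reference, the two reactivity metrics quoted above can be written as follows. This is a plausible formalization of the definitions given in the text, assuming an average over the N simulated runs per model and level, with e_i the errors committed in run i and f_i the failures allowed by the challenge:

```latex
\text{Average Error Ratio} = \frac{100}{N} \sum_{i=1}^{N} \frac{e_i}{f_i},
\qquad
\text{Average Difference} = \frac{1}{N} \sum_{i=1}^{N} \left( e_i - f_i \right)
```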

6.3. Evaluation of Response Compliance

The evaluation of response compliance revealed that all the tested models successfully adhered to the predefined communication protocol. Each LLM generated responses that strictly followed the expected JSON structure, ensuring seamless integration with WarehouseGame Training. Throughout the assessment, no errors were detected in the interpretation of requests, and all the models correctly understood the context of the queries. The generated responses consistently included all the required fields in the expected format, demonstrating full compliance with data integrity and error-handling criteria. This result confirms the reliability of these models in structured data generation tasks, particularly within controlled environments that require strict adherence to predefined formats. The ability of each LLM to consistently provide well-formed responses reduces the risk of system errors and facilitates automated processing within the training environment. In summary, the tested models demonstrated a high level of accuracy in response formatting and compliance, validating their suitability for structured interactions in gamified training applications.

6.4. Evaluation of Response Time

The response time of large language models (LLMs) is a critical factor influencing the overall user experience, particularly in real-time training environments such as WarehouseGame Training. The response time directly affects the fluidity of interactions, as excessive delays may lead to user frustration and disrupt the learning process. Table A8 presents the response time statistics (in milliseconds) for each LLM evaluated.

Grok demonstrated the best performance in terms of response speed. With a mean response time of 1924.86 ms and a median of 1852.0 ms, the model responded consistently and quickly. Its standard deviation (299.42 ms) indicated minimal variability across interactions, and the 95% confidence interval (1883.10 ms to 1966.61 ms) confirmed the model’s stability. Furthermore, its minimum (1384 ms) and maximum (3235 ms) response times remained within acceptable thresholds, reinforcing its suitability for real-time interaction.

ChatGPT provided moderate performance. Its mean response time was 5823.45 ms, with a median of 5578.0 ms. The standard deviation (1415.13 ms) suggested higher variability, occasionally resulting in longer waiting times. The 95% confidence interval (5626.12 ms to 6020.77 ms) showed general consistency, but the range between its minimum (3313 ms) and maximum (11,956 ms) values highlights the potential for noticeable delays. While ChatGPT can still be considered viable for real-time training, these peaks in response time may interrupt the user experience in more dynamic scenarios.

DeepSeek exhibited the longest response times among the evaluated models. With a mean of 10,213.75 ms and a median of 10,098.5 ms, the model consistently responded more slowly. Although the standard deviation (718.28 ms) was smaller than ChatGPT’s, its base latency was significantly higher. The 95% confidence interval (10,113.59 ms to 10,313.91 ms) indicated consistent delay, and the minimum (8513 ms) and maximum (12,836 ms) response times confirmed performance issues that could compromise real-time usability in an interactive learning environment.

When comparing the three models,
  • Grok clearly outperformed the others, offering the lowest latency and highest consistency, making it ideal for seamless integration in real-time educational platforms.
  • ChatGPT provided acceptable performance with occasional delays, suitable for applications where slight lags are tolerable.
  • DeepSeek, while functional, presented response times that might hinder interactivity and user engagement in scenarios requiring prompt feedback.
Overall, Grok emerged as the most efficient model in terms of response time, delivering fast and reliable outputs suitable for interactive training applications. Its combination of low latency, minimal variance, and consistency underlines its advantage for real-time adaptive learning systems like WarehouseGame. While ChatGPT remains a strong alternative, further optimization would be required to match Grok’s responsiveness. DeepSeek, on the other hand, would need substantial improvements in response latency to meet the demands of such environments.

7. Discussion

The comprehensive evaluation of three large language models (LLMs) (ChatGPT, DeepSeek, and Grok) within WarehouseGame Training provides clear insights into their suitability for optimizing logistics simulation, considering four critical dimensions: difficulty progression, adaptability, response compliance, and response time. Each model exhibited distinct strengths and limitations that shape their applicability in this context.

In difficulty progression, ChatGPT excelled by generating challenges consistently aligned with the requested level, ensuring a coherent experience across all skill levels, from beginner to expert. This consistency is vital for maintaining a clear learning curve. However, its lenient failure tolerances and prolonged assistance in advanced levels could hinder the development of autonomous skills. Grok delivered a well-structured progression, logically increasing complexity, but its reliability was undermined by alignment errors between requested and responded levels, particularly at the advanced stage. DeepSeek exhibited less refined scalability, with abrupt difficulty transitions that could disorient players, especially at medium levels.

Regarding adaptability, Grok proved superior in dynamically adjusting challenges based on user performance. Its rapid response to excessive errors, through balanced difficulty reductions and increased failure allowances, minimized frustration and fostered a fluid learning environment. ChatGPT, while implementing more aggressive adjustments, suffered from slower reactivity, potentially prolonging players’ exposure to overly challenging tasks. DeepSeek, despite its consistency, was less effective due to insufficient order reductions at medium levels, limiting its personalization capabilities.

For response time, Grok emerged as the clear leader, delivering the fast and consistent responses essential for maintaining interactivity in a real-time training environment. ChatGPT performed acceptably but exhibited occasional delays that could disrupt fluidity in dynamic scenarios. DeepSeek, with significantly longer response times, was less suitable for applications requiring immediate feedback.

Overall, Grok is the most suitable model for gamified simulation, excelling in adaptability and response time and offering robust progression despite alignment issues. ChatGPT is a strong alternative, particularly for lower levels, but it requires improvements in reactivity and rigor. DeepSeek is not recommended, due to its irregular scalability and high latency, which compromise the user experience. A hybrid approach, using ChatGPT for lower levels and Grok for higher levels, could maximize training effectiveness: ChatGPT would be ideal for the beginner and medium levels, where its perfect level alignment ensures a clear and coherent introduction to logistics tasks, while Grok could take over as players progress to the advanced and expert levels, leveraging its superior adaptability and fast response times to manage more complex, personalized challenges, provided its alignment errors are addressed.

8. Future Work and New Research Directions

While this research has demonstrated the potential of LLM models to dynamically adjust difficulty in a gamified logistics environment, it is important to acknowledge certain limitations that should be addressed in future studies:
  • Diversity of evaluated profiles: The study focused on a specific group of users, without considering a wide range of profiles with varying levels of experience and training in logistics. Evaluating LLM performance with a more diverse sample would validate their adaptability to different learning styles and competency levels.
  • Exploration of other language models: The research was limited to a specific set of LLM models (Grok, ChatGPT, and DeepSeek). Alternatives such as LLaMA (Meta) or other emerging models, which could offer improvements in adaptability, efficiency, and personalization of the learning experience, were not assessed.
  • Application to other warehouse tasks: The study concentrated on order picking, without extending its analysis to other key areas of the logistics environment, such as receiving, material sorting, or inventory management. Exploring the applicability of LLMs to these processes would allow for an evaluation of their impact across a broader range of warehouse operations.
Recognizing these limitations provides a foundation for future research and optimizations, enabling the enhancement of LLM integration in logistical environments and expanding their applicability to more complex and realistic scenarios.

Future research should focus on refining the integration of LLMs into gamified logistics environments, improving their ability to personalize the learning experience based on real-time user performance. One of the primary aspects to optimize is the adjustment of error tolerance and the adaptive scalability of difficulty, ensuring smoother and more effective progression. Likewise, enhancing the structure and timing of visual and contextual aids could increase the effectiveness of the training process.

Another relevant line of research is the use of virtual and mixed reality environments in combination with LLMs to enhance immersion and learning effectiveness. Simulation in a more realistic setting could improve the transfer of knowledge to practical scenarios, optimizing training in logistical environments.

Another key direction for future research is the application of this system in a real-world setting, transferring the dynamics developed in the game to an operational warehouse with actual order-picking processes. This approach would allow for an evaluation of the impact of AI-driven gamification under real working conditions, validating its effectiveness in optimizing logistical tasks. Furthermore, incorporating a collaborative multiplayer environment within the game could strengthen learning through user cooperation, fostering teamwork dynamics and problem-solving in gamified logistics settings.

Addressing these areas will contribute to enhancing the role of AI-powered gamification in workforce training, making logistics learning more efficient, engaging, and tailored to the individual needs of each user.

Author Contributions

Study conception and design: J.J.R.M., L.D.l.T., D.C.G.; development, testing, and data collection: J.J.R.M.; analysis and interpretation of results: J.J.R.M., L.D.l.T., D.C.G.; J.J.R.M. wrote the manuscript with support from L.D.l.T. and D.C.G. All authors reviewed the results and approved the final version of the manuscript.

Funding

This research was funded by the Spanish Ministry of Science and Innovation through a research project with reference PID2022-139187OB-I00.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available in a publicly accessible repository at https://doi.org/10.5281/zenodo.15002938, accessed on 10 March 2025.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Message descriptions and prompt messages.
  • Message description: Before starting, the AI is informed of the player’s level, and a challenge is requested.
    Prompt request example: “The player is about to undertake order preparation challenges. Their level in the order preparation category is beginner. I need a challenge appropriate for their level, so that they can progress and reach expert level.”
  • Message description: Once the challenge is completed, the results are sent to the AI for evaluation and to generate the next challenge. Regardless of the AI model used, the messages sent remain the same for challenge design, adjustment, and evaluation.
    Prompt request example: “Analyze the results obtained by the player in the challenge. The player had to complete the following challenge: 'orders': 1, 'tasks': 5, 'multi-reference': false, 'level': 'beginner', 'time': 500, 'errors': 2, 'assistance': true, 'minimap': false. The results were: errors: 5 and the time spent: 10 min. Knowing that the player’s current level in the order preparation category is beginner, I need you to determine whether they have successfully completed the challenge.”
  • Message description: If the player makes a mistake, the AI is notified with details about where the error occurred, the challenge the player was attempting, and a request for an assessment of the mistake.
    Prompt request example (location-scanning error): “The player has just made an error while scanning the location of the container. The scanned label is incorrect. I want you to analyze whether it is necessary to assist the player. The assistance you can provide falls into three categories: (A) Add more in-game assistance: in this case, you could activate the minimap and highlight the location to help the player orient themselves better. (B) In addition to the above, if the player has made too many mistakes, reduce the difficulty of the challenge. This means adjusting the challenge to a lower level. (C) Do nothing: if the number of mistakes is not high enough compared to the difficulty of the challenge and the player’s level, no adjustments are necessary.”
    Prompt request example (picking error): “The player has just made a mistake picking the product from the container. The product and the quantity are not correct. They should have picked a quantity of 10 of the product picking.Stock. But that has not been the case. This may be due to several causes: 1. They have placed the product in the wrong client container; they should have selected the client container CL01. 2. They have not picked the selected product. 3. They have not picked the full quantity for the indicated product. To perform the picking, the player must do as follows: 1. Select the client container, using the F1, F2, or F3 keys (they must check the radio frequency terminal to know which to select). 2. Move the hand with the radio frequency terminal, using the cursor, to the container/product indicated in the task. 3. Select the box with the product (it will appear in violet when selected). 4. Confirm with the mouse. Next, I will provide you with the data for the challenge that the player is currently performing. The player had to complete the following challenge: 'orders': 1, 'tasks': 10, 'multi-reference': false, 'level': beginner, 'time': 15, 'mistakes': 3, 'help': true, 'minimap': true.”
Table A2. Message descriptions and JSON responses.
  • Message description: Before starting, the AI is informed of the player’s level, and a challenge is requested.
    JSON response: { 'orders', 'tasks', 'multi-reference', 'level', 'time', 'failures', 'explanation', 'help', 'minimap' }
  • Message description: Once the challenge is completed, the results are sent to the AI for evaluation and to generate the next challenge. Regardless of the AI model used, the messages sent remain the same for challenge design, adjustment, and evaluation.
    JSON response: { 'OvercomeChallenge', 'AdjustChallenge', 'orders', 'tasks', 'multi-reference', 'level', 'time', 'failures', 'explanation', 'help', 'minimap' }
  • Message description: If the player makes a mistake, the AI is notified with details about where the error occurred, the challenge the player was attempting, and a request for an assessment of the mistake.
    JSON response: { 'OptionHelp', 'AdjustChallenge', 'orders', 'tasks', 'multi-reference', 'level', 'time', 'failures', 'explanation', 'help', 'minimap' }
Table A3. Evaluation of difficulty progression (50 iterations per model and level; MultiRef and Help: 1 = enabled, 0 = disabled).

Model | Level | Level Response | Orders | Tasks | MultiRef | Time | Failures | Help | %
ChatGPT | beginner | beginner | 1 | 5 | 0 | 15 | 3 | 1 | 36.00
ChatGPT | beginner | beginner | 2 | 5 | 0 | 20 | 5 | 1 | 16.00
ChatGPT | beginner | beginner | 1 | 5 | 0 | 15 | 5 | 1 | 14.00
ChatGPT | beginner | beginner | 2 | 5 | 0 | 15 | 3 | 1 | 10.00
ChatGPT | beginner | beginner | 2 | 5 | 0 | 20 | 3 | 1 | 10.00
ChatGPT | medium | medium | 2 | 15 | 0 | 25 | 3 | 1 | 29.17
ChatGPT | medium | medium | 2 | 10 | 0 | 20 | 3 | 1 | 22.92
ChatGPT | medium | medium | 2 | 15 | 0 | 20 | 3 | 1 | 14.58
ChatGPT | medium | medium | 2 | 10 | 0 | 20 | 3 | 1 | 10.42
ChatGPT | medium | medium | 2 | 15 | 0 | 25 | 3 | 1 | 8.33
ChatGPT | advanced | advanced | 2 | 15 | 1 | 25 | 3 | 1 | 31.91
ChatGPT | advanced | advanced | 2 | 15 | 1 | 25 | 3 | 0 | 21.28
ChatGPT | advanced | advanced | 2 | 15 | 1 | 25 | 2 | 1 | 19.15
ChatGPT | advanced | advanced | 2 | 15 | 1 | 25 | 2 | 0 | 12.77
ChatGPT | advanced | advanced | 3 | 15 | 1 | 25 | 3 | 0 | 2.13
ChatGPT | expert | expert | 3 | 15 | 1 | 25 | 2 | 0 | 50.91
ChatGPT | expert | expert | 3 | 15 | 1 | 25 | 3 | 0 | 10.91
ChatGPT | expert | expert | 3 | 15 | 1 | 30 | 2 | 0 | 7.27
ChatGPT | expert | expert | 2 | 15 | 1 | 25 | 2 | 0 | 7.27
ChatGPT | expert | expert | 2 | 15 | 1 | 25 | 3 | 0 | 5.45
DeepSeek | beginner | beginner | 1 | 5 | 0 | 10 | 3 | 1 | 100.0
DeepSeek | medium | medium | 2 | 12 | 0 | 15 | 3 | 1 | 38.0
DeepSeek | medium | medium | 2 | 12 | 0 | 15 | 3 | 1 | 30.0
DeepSeek | medium | medium | 2 | 10 | 0 | 15 | 3 | 1 | 18.0
DeepSeek | medium | medium | 2 | 10 | 0 | 15 | 3 | 1 | 14.0
DeepSeek | advanced | advanced | 2 | 15 | 1 | 25 | 2 | 1 | 60.0
DeepSeek | advanced | advanced | 2 | 15 | 1 | 25 | 2 | 0 | 26.0
DeepSeek | advanced | advanced | 2 | 15 | 1 | 20 | 2 | 0 | 4.0
DeepSeek | advanced | advanced | 3 | 15 | 1 | 25 | 2 | 1 | 4.0
DeepSeek | advanced | advanced | 2 | 12 | 1 | 20 | 2 | 1 | 2.0
DeepSeek | expert | expert | 3 | 15 | 1 | 25 | 1 | 0 | 94.0
DeepSeek | expert | expert | 3 | 15 | 1 | 25 | 2 | 0 | 6.0
Grok | beginner | beginner | 1 | 5 | 0 | 10 | 3 | 1 | 58.00
Grok | beginner | beginner | 1 | 5 | 0 | 15 | 3 | 1 | 42.00
Grok | medium | medium | 2 | 15 | 1 | 25 | 3 | 1 | 72.22
Grok | medium | medium | 2 | 15 | 0 | 20 | 3 | 1 | 16.67
Grok | medium | medium | 2 | 10 | 1 | 20 | 3 | 1 | 5.56
Grok | medium | medium | 2 | 15 | 0 | 25 | 3 | 1 | 5.56
Grok | advanced | advanced | 3 | 15 | 1 | 25 | 2 | 0 | 58.67
Grok | advanced | medium | 2 | 15 | 1 | 25 | 3 | 1 | 22.67
Grok | advanced | medium | 2 | 15 | 1 | 25 | 2 | 1 | 8.00
Grok | advanced | medium | 2 | 15 | 1 | 25 | 3 | 0 | 8.00
Grok | advanced | medium | 2 | 15 | 1 | 25 | 2 | 0 | 2.67
Grok | expert | expert | 3 | 20 | 1 | 30 | 2 | 0 | 87.72
Grok | expert | advanced | 3 | 15 | 1 | 30 | 2 | 0 | 10.53
Grok | expert | medium | 2 | 15 | 1 | 25 | 2 | 0 | 1.75
Table A4. Evaluation of model adaptability—ChatGPT.

Model | Level | Orders | Tasks | Failures | New Orders | New Tasks | New Failures | %
ChatGPT | beginner | 1 | 5 | 3 | 1 | 3 | 5 | 90.91
ChatGPT | beginner | 1 | 5 | 5 | 1 | 3 | 10 | 36.36
ChatGPT | beginner | 1 | 5 | 5 | 1 | 3 | 6 | 27.27
ChatGPT | beginner | 2 | 5 | 3 | 1 | 3 | 8 | 40.00
ChatGPT | beginner | 2 | 5 | 3 | 1 | 3 | 10 | 40.00
ChatGPT | beginner | 2 | 5 | 4 | 1 | 3 | 5 | 100.00
ChatGPT | beginner | 2 | 5 | 5 | 1 | 3 | 8 | 45.45
ChatGPT | beginner | 2 | 5 | 5 | 1 | 3 | 5 | 18.18
ChatGPT | medium | 2 | 10 | 3 | 1 | 5 | 10 | 57.14
ChatGPT | medium | 2 | 10 | 3 | 1 | 5 | 8 | 21.43
ChatGPT | medium | 2 | 12 | 3 | 1 | 8 | 8 | 50.00
ChatGPT | medium | 2 | 12 | 3 | 1 | 8 | 10 | 50.00
ChatGPT | medium | 2 | 15 | 3 | 1 | 10 | 10 | 50.00
ChatGPT | medium | 2 | 15 | 3 | 1 | 10 | 8 | 15.62
ChatGPT | medium | 2 | 15 | 5 | 1 | 10 | 5 | 100.00
ChatGPT | advanced | 2 | 15 | 2 | 1 | 10 | 5 | 27.27
ChatGPT | advanced | 2 | 15 | 2 | 2 | 12 | 6 | 18.18
ChatGPT | advanced | 2 | 15 | 2 | 2 | 10 | 5 | 18.18
ChatGPT | advanced | 2 | 15 | 3 | 2 | 10 | 8 | 26.09
ChatGPT | advanced | 2 | 15 | 3 | 2 | 12 | 8 | 17.39
ChatGPT | advanced | 2 | 15 | 3 | 1 | 10 | 10 | 13.04
ChatGPT | advanced | 3 | 15 | 2 | 2 | 10 | 5 | 50.00
ChatGPT | advanced | 3 | 15 | 2 | 2 | 10 | 10 | 50.00
ChatGPT | advanced | 3 | 15 | 3 | 2 | 12 | 10 | 50.00
ChatGPT | advanced | 3 | 15 | 3 | 2 | 10 | 12 | 50.00
ChatGPT | expert | 2 | 15 | 2 | 2 | 10 | 6 | 40.00
ChatGPT | expert | 2 | 15 | 2 | 1 | 10 | 6 | 20.00
ChatGPT | expert | 2 | 15 | 3 | 1 | 10 | 8 | 50.00
ChatGPT | expert | 2 | 15 | 3 | 2 | 10 | 10 | 50.00
ChatGPT | expert | 3 | 15 | 2 | 2 | 10 | 8 | 46.15
ChatGPT | expert | 3 | 15 | 2 | 2 | 12 | 8 | 15.38
ChatGPT | expert | 3 | 15 | 3 | 2 | 10 | 5 | 83.33
ChatGPT | expert | 3 | 15 | 3 | 2 | 10 | 6 | 16.67
ChatGPT | expert | 3 | 18 | 2 | 2 | 15 | 7 | 100.00
Table A5. Evaluation of model adaptability—DeepSeek.

Model | Level | Orders | Tasks | Failures | New Orders | New Tasks | New Failures | %
DeepSeek | beginner | 1 | 5 | 3 | 1 | 3 | 5 | 100.00
DeepSeek | medium | 2 | 10 | 3 | 2 | 8 | 8 | 28.57
DeepSeek | medium | 2 | 10 | 3 | 2 | 8 | 7 | 19.05
DeepSeek | medium | 2 | 10 | 3 | 2 | 7 | 8 | 19.05
DeepSeek | medium | 2 | 10 | 3 | 1 | 8 | 8 | 14.29
DeepSeek | medium | 2 | 12 | 3 | 2 | 8 | 8 | 62.07
DeepSeek | medium | 2 | 12 | 3 | 1 | 8 | 8 | 37.93
DeepSeek | advanced | 2 | 12 | 2 | 1 | 8 | 6 | 100.00
DeepSeek | advanced | 2 | 15 | 2 | 1 | 10 | 6 | 58.70
DeepSeek | advanced | 2 | 15 | 2 | 2 | 10 | 6 | 23.91
DeepSeek | advanced | 3 | 15 | 2 | 2 | 10 | 8 | 100.00
DeepSeek | expert | 3 | 15 | 1 | 2 | 10 | 5 | 100.00
DeepSeek | expert | 3 | 15 | 2 | 2 | 10 | 8 | 100.00
Table A6. Evaluation of model adaptability—Grok.

Model | Level | Orders | Tasks | Failures | New Orders | New Tasks | New Failures | %
Grok | beginner | 1 | 5 | 3 | 1 | 3 | 5 | 66.00
Grok | beginner | 1 | 5 | 3 | 1 | 3 | 2 | 18.00
Grok | beginner | 1 | 5 | 3 | 1 | 3 | 4 | 16.00
Grok | medium | 2 | 15 | 3 | 2 | 10 | 5 | 38.71
Grok | medium | 2 | 15 | 3 | 1 | 10 | 5 | 29.03
Grok | medium | 2 | 15 | 3 | 1 | 10 | 3 | 22.58
Grok | medium | 3 | 15 | 2 | 2 | 10 | 5 | 94.00
Grok | medium | 3 | 15 | 2 | 2 | 10 | 4 | 6.00
Grok | advanced | 2 | 15 | 2 | 2 | 10 | 3 | 100.00
Grok | advanced | 2 | 15 | 3 | 2 | 10 | 5 | 88.89
Grok | advanced | 2 | 15 | 3 | 2 | 10 | 4 | 5.56
Grok | expert | 3 | 20 | 2 | 2 | 15 | 5 | 70.00
Grok | expert | 3 | 20 | 2 | 2 | 15 | 3 | 26.00
Table A7. Relationship between actual user errors and the allowed error threshold per level.

Model | Level | Avg Error Ratio | Std. Dev | Mean Difference | Std. Diff
ChatGPT | beginner | 128.43 | 35.73 | −0.84 | 1.18
ChatGPT | medium | 220.82 | 33.76 | −3.67 | 1.07
ChatGPT | advanced | 312.93 | 77.30 | −5.12 | 1.64
ChatGPT | expert | 424.69 | 74.73 | −7.22 | 1.19
DeepSeek | beginner | 100.00 | 0.00 | 0.00 | 0.00
DeepSeek | medium | 154.00 | 18.92 | −1.62 | 0.57
DeepSeek | advanced | 193.00 | 17.53 | −1.86 | 0.35
DeepSeek | expert | 381.00 | 52.38 | −2.98 | 0.14
Grok | beginner | 89.33 | 18.37 | 0.32 | 0.55
Grok | medium | 126.13 | 31.28 | −0.48 | 0.71
Grok | advanced | 101.75 | 26.00 | 0.00 | 0.58
Grok | expert | 200.00 | 14.29 | −2.00 | 0.29
Table A8. Summary of response time statistics (ms).

Model | Mean | Median | Std. Deviation | Minimum | Maximum | 95% Confidence Interval
ChatGPT | 5823.445 | 5578.0 | 1415.125 | 3313 | 11,956 | (5626.122, 6020.767)
DeepSeek | 10,213.75 | 10,098.5 | 718.281 | 8513 | 12,836 | (10,113.593, 10,313.906)
Grok | 1924.855 | 1852.0 | 299.421 | 1384 | 3235 | (1883.104, 1966.605)
Table A9. Survey.

Category | Question | Answer
General | Do you have previous experience working in logistics? | (X) Yes, more than one year. (A) Yes, less than one year. (B) No, I’m completely new.
General | How comfortable are you with using devices such as RF terminals? | (X) Very comfortable. (A) Somewhat comfortable. (B) I have never used one.
Material reception | In your experience, how do you verify that a received material is correct? | (X) I compare the labels with the reception documents. (A) I use an automated or digital system (e.g., RF). (B) I have no experience with this.
Material reception | How would you identify discrepancies in a multi-reference container? | (X) I manually review each reference. (A) I use digital tools to do so. (B) I have never done this before.
Order preparation | Have you performed picking tasks? | (X) Yes, several times. (A) Only in basic exercises or training. (B) No, I have never done it.
Order preparation | How quickly do you think you could prepare an order with multiple references? | (X) Very quickly, I know how to optimize time. (A) It would depend on the system used. (B) I would need time to learn.
Container placement | What methods do you use to locate materials in a warehouse? | (X) With a specific plan or digital system. (A) Manually, following labels or instructions. (B) I have no experience.
Container placement | How would you handle a full warehouse or one with limited space to store materials? | (X) I would reorganize the space according to priorities. (A) I would ask for help or guidance. (B) I’m not sure.
Forklift handling | Have you operated a forklift? | (X) Yes, regularly. (A) Yes, but only a few times. (B) No, never.
Forklift handling | How confident are you maneuvering in tight spaces? | (X) Very confident. (A) Somewhat unsure, I would need practice. (B) Very unsure, I have no experience.

References

  1. Statista Research Department. E-Commerce Worldwide—Statistics & Facts. 2025. Available online: https://www.statista.com/topics/871/online-shopping/ (accessed on 2 January 2025).
  2. Weng, X.; She, W.; Fan, H.; Zhang, J.; Yun, L. A Reliable Location Model for Charging Piles of Automated Guided Vehicles in the Logistics Center Based on Queuing. 2025. Available online: https://ssrn.com/abstract=5229750 (accessed on 2 January 2025).
  3. Chen, Q.; Han, Y.; Pan, N.; Guo, X.; Zhang, L. Unmanned Aerial Vehicle 3D Trajectory Planning Based on Background of Complex Industrial Product Warehouse Inventory. Available online: https://pdfs.semanticscholar.org/838b/fe3f0b00c1ee958ca469156703af052d60d9.pdf (accessed on 15 June 2019).
  4. Reif, R.; Günthner, W.A. Pick-by-vision: Augmented reality supported order picking. Vis. Comput. 2009, 25, 461–467. [Google Scholar] [CrossRef]
  5. Andaluz, V.H.; Castillo-Carrión, D.; Miranda, R.J.; Alulema, J.C. Virtual Reality Applied to Industrial Processes. In Proceedings of the Augmented Reality, Virtual Reality, and Computer Graphics: 4th International Conference, AVR 2017, Ugento, Italy, 12–15 June 2017; Proceedings, Part I, pp. 59–74. [Google Scholar] [CrossRef]
  6. Deci, E.L.; Ryan, R.M. The “what” and “why” of goal pursuits: Human needs and the self-determination of behavior. Psychol. Inq. 2000, 11, 227–268. [Google Scholar] [CrossRef]
  7. Hernandez, L. Digital Pedagogy and Pentiment (2022): Playing with Critical Art History. Available online: https://www.digitalrhetoriccollaborative.org/2024/04/13/digital-pedagogy-and-pentiment-2022-playing-with-critical-art-history/ (accessed on 5 June 2025).
  8. Putra, A.; Zainul, R. Serious Games in Science Education: A Review of Virtual Laboratory Development for Indicator of Acid-Base Solution Concepts. In Chemistry Smart; Universitas Bengkulu: Bengkulu, Indonesia, 2024. [Google Scholar]
  9. Rezaeirad, M.; Jafarkhani, F.; Maghami, H. Educational Design Based on Community Language Learning and its Effect on Self-Directed Learning, Academic Motivation, and Academic Self-Efficacy of Students. Teach. Res. J. 2024, 12, 161–184. [Google Scholar]
  10. Ponte, I.; Silva, B.; Batista, L. Serious Games as Tools for Food and Nutrition Education: A Systematic Review. ABCS Health Sci. 2024, 49, e024306. [Google Scholar] [CrossRef]
  11. OpenAI. ChatGPT: Language Model for Conversational AI. 2024. Available online: https://chat.openai.com (accessed on 5 June 2025).
  12. Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd. DeepSeek; Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd.: Hangzhou, China, 2023. [Google Scholar]
  13. xAI. Grok; xAI: San Francisco Bay Area, CA, USA, 2023. [Google Scholar]
  14. Steenstra, I.; Murali, P.; Perkins, R.; Joseph, N. Engaging and Entertaining Adolescents in Health Education Using LLM-Generated Fantasy Narrative Games and Virtual Agents. In Proceedings of the Extended Abstracts of the ACM Conference on Health, Singapore, 13–17 May 2024. [Google Scholar]
  15. Raguraman, R.; Subbulakshmi, P.; Raju, J.S. Adaptive NPC in Serious Games Using Artificial Intelligence. 2024. Available online: https://ssrn.com/abstract=4806061 (accessed on 5 June 2025).
  16. Franke, S.; Hermes, S.; Roidl, M. An Educational Game to Learn Picking Techniques in Warehousing–WareMover. Simul. Gaming 2024, 55, 964–975. [Google Scholar] [CrossRef]
  17. Alcantar-Nieblas, C.; Glasserman-Morales, L.D. EGame-flow: Psychometric properties of the scale in the Mexican context. J. Appl. Res. High. Educ. 2024, 17, 1003–1014. [Google Scholar] [CrossRef]
  18. Pacheco-Velazquez, E.; Rodés Paragarino, V.; Glasserman, L.D.; Carlos Arroyo, M. Playing to learn: Developing self-directed learning skills through serious games. J. Int. Educ. Bus. 2024, 17, 416–430. [Google Scholar] [CrossRef]
  19. Bombelli, A.; Atasoy, B.; Fazi, S.; Boschma, D. From the ORy to Application: Learning to Optimize with Operations Research in an Interactive Way; Delft University of Technology: Delft, The Netherlands, 2024. [Google Scholar]
  20. Pacheco, E.; Palma-Mendoza, J. Using Serious Games in Logistics Education. In Proceedings of the 2nd International Conference on Industrial Engineering and Industrial Management, Barcelona, Spain, 8–11 January 2021; pp. 51–55. [Google Scholar]
  21. Bright, A.; Ponis, S. Introducing Gamification in the AR-Enhanced Order Picking Process. Logistics 2021, 5, 14. [Google Scholar] [CrossRef]
  22. Sailer, M.; Hense, J.U.; Mayr, S.K.; Mandl, H. How gamification motivates: An experimental study of the effects of specific game design elements on psychological need satisfaction. Comput. Hum. Behav. 2017, 69, 371–380. [Google Scholar]
  23. Bahr, W.; Mavrogenis, V.; Sweeney, E. Gamification of Warehousing: Exploring Perspectives of Warehouse Managers. Int. J. Logist. Res. 2022, 25, 247–259. [Google Scholar] [CrossRef]
  24. Zhao, Y.; Pan, J.; Dong, Y.; Dong, T.; Wang, G. Language Urban Odyssey: A serious game for enhancing second language acquisition through large language models. In Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems; Association for Computing Machinery (ACM): New York, NY, USA, 2024. [Google Scholar]
  25. Rzeczycki, A.; Chrzastek, G.; Niemcewicz, P. Serious games in logistics education: Analysis of effectiveness and students’ decision-making processes. Eur. Res. Stud. J. 2024, XXVII, 925–967. [Google Scholar] [CrossRef]
  26. Pacheco-Velázquez, E.; Palma-Mendoza, J. Using Serious Games in Logistics Education; ACM: New York, NY, USA, 2021.
  27. Deghedi, G.A. Game-Based Learning for Supply Chain Management: Assessing the Complexity of Games. Int. J. Game-Based Learn. (IJGBL) 2023, 13, 1–20. [Google Scholar] [CrossRef]
  28. Mecalux Software Solutions. Easy WMS: Advanced Warehouse Management System. Available online: https://www.mecalux.com/software/warehouse-management-system-wms (accessed on 5 June 2025).
  29. Unity Technologies. Unity 3D Game Engine; Version used: 2024; Unity Technologies: San Francisco, CA, USA, 2024. [Google Scholar]
  30. Attali, Y.; Arieli-Attali, M. Gamification in assessment: Do points affect test performance? Comput. Educ. 2015, 83, 57–63. [Google Scholar] [CrossRef]
  31. Fortes Tondello, G.; Premsukh, H.; Nacke, L. A Theory of Gamification Principles Through Goal-Setting Theory. In Proceedings of the Hawaii International Conference on System Sciences, Hilton Waikoloa Village, HI, USA, 3–6 January 2018. [Google Scholar]
  32. Deterding, S.; Dixon, D.; Khaled, R.; Nacke, L. From Game Design Elements to Gamefulness: Defining “Gamification”; MindTrek ’11; ACM: New York, NY, USA, 2011; pp. 9–15. [Google Scholar]
  33. Tooma, E.; Badr, N.; Hage, H.S. The impact of intrinsic motivation on employees’ job satisfaction, productivity, and turnover intentions: A study of information technology employees in Lebanon. Int. J. Hum. Resour. Dev. Manag. 2021, 21, 1–20. [Google Scholar]
  34. Vogt, L. The relationship between intrinsic motivation and productivity in a German retail company. Int. J. Organ. Anal. 2019, 27, 634–652. [Google Scholar]
  35. Papalexandris, N.; Panayotopoulou, L.; Vassilopoulou, J. The impact of intrinsic and extrinsic motivation on employee engagement during times of organizational change. J. Bus. Res. 2019, 98, 371–381. [Google Scholar]
  36. Marras, J.R.; de la Torre Cubillo, L.; García, D.C. WarehouseGame Training: Demonstration Video. Available online: https://www.youtube.com/watch?v=Cr84hdYre8I&t=6s (accessed on 1 June 2025).
  37. Marras, J.R.; de la Torre Cubillo, L.; García, D.C. WarehouseGame Training: A Gamified Logistics Training Platform. 2025. Available online: https://zenodo.org/records/15002939 (accessed on 1 June 2025).
Figure 1. WarehouseGame Training.
Figure 2. Unity engine.
Figure 3. Order picking.
Figure 4. Forklift operation tasks.
Figure 5. Gamification elements.
Figure 6. Player’s results.
Figure 7. Narrative.
Figure 8. Initial survey.
Figure 9. Selection level.
Figure 10. AI assistance.
Figure 11. Communication architecture.