1. Introduction
The logistics and supply chain management sector faces unprecedented challenges in the contemporary global economy, characterized by increasing complexity and volatility. The emergence of Industry 4.0 and its subsequent evolution toward Industry 5.0 have catalyzed a fundamental transformation in how logistics processes are modeled, simulated, and optimized [1]. This paradigmatic shift is driven by an integrated ecosystem of cyber-physical systems, the Internet of Things (IoT), and advanced artificial intelligence [2].
Artificial intelligence has become a key enabler of Logistics 4.0. Alongside “classical” approaches, generative artificial intelligence has gained prominence by producing text, code, and other content based on large language models (LLMs) [3].
In logistics, generative AI is increasingly employed not only for predictive tasks but also for generating transportation plans, simulating supply chain disruption scenarios, and supporting strategic planning. Its hybrid use opens new possibilities through natural language interfaces [2,4].
Simulation modeling remains a cornerstone methodology for analyzing complex logistics systems. Three primary simulation paradigms dominate logistics modeling: discrete-event simulation, agent-based modeling, and continuous simulation [5]. Traditional simulation software platforms such as Arena, FlexSim, ExtendSim, and Tecnomatix Plant Simulation provide graphical interfaces, yet they still require substantial domain expertise and technical proficiency. The inherent complexity of these tools often creates a bottleneck in simulation adoption, particularly for domain experts who lack formal training in simulation methodologies [6,7].
In an era marked by global disruptions, the ability to model, simulate, and optimize logistics systems with AI-driven intelligence has become a strategic imperative. Integrating AI systems with existing logistics infrastructure presents significant technical challenges and requires organizational changes [8,9]. It is therefore essential to also examine how generative AI capabilities for model development can be implemented within traditional simulation software platforms.
The recent emergence of LLMs, such as GPT-3 and GPT-4, represents a breakthrough in assisted modeling. These systems have demonstrated remarkable capabilities in code generation, natural language understanding, and domain-specific reasoning. Research indicates that LLMs can extract structured information from textual descriptions and map it directly to simulation constructs [6], enabling the translation of verbal descriptions into functional simulation code [10].
In this context, the “copilot” metaphor has emerged as the dominant framework for human–AI collaboration in simulation development. In this paradigm, users provide high-level descriptions and iteratively refine AI-generated outputs through suggestions and corrections. This approach leverages complementary strengths: humans provide domain expertise and strategic judgment, while LLMs contribute rapid code generation and tireless iteration [11,12,13].
Multiple studies have demonstrated successful applications of generative AI to Python-based simulation frameworks, where models generate valid queuing, inventory-control, healthcare DES, DEVS, and physics-based PyChrono simulations with substantial gains in setup speed and code accuracy [6,10,11,12].
Most existing studies focus on programming-based approaches or specialized frameworks, which may limit their immediate practical applicability for organizations invested in commercial platforms. Only a limited number of studies explicitly address commercial simulation platforms. Romero-Guerrero et al. demonstrated the use of ChatGPT to automatically generate FlexSim (version 23.0.1, 64-bit, educational license) DES models from industrial layouts. This approach transforms the simulation modeling process by automating the translation of physical layouts into executable FlexSim models using FlexScript coding, thereby offering advantages in speed and accessibility. However, it still faces limitations in interpreting complex visual inputs, which require manual validation [7].
Reider et al. explored the use of generative AI for automated simulation model generation in Tecnomatix Plant Simulation. Their synergistic approach aims to democratize access to simulation modeling by lowering technical barriers. In their framework, generative AI interfaces directly with Tecnomatix through a programmed mediator. While the study highlights benefits in terms of speed and accessibility, it lacks an implementation, validation of the framework, and a detailed discussion of its limitations [14].
Wiśniewski et al. investigated the integration of GPT-4 with Arena DES for intelligent management in agile projects. Their study benchmarked GPT-4 outputs against Arena simulations. However, the two models were developed independently, with the Arena models serving as the baseline for assessing the adequacy of the AI-generated results. There is no evidence that the LLM contributed to the creation of the Arena models [15].
Successful implementation of LLMs in model creation faces significant hurdles. LLM capabilities remain limited to relatively simple problems due to constraints in context windows and the quality of training data [6,12].
While modern LLMs feature extensive context windows capable of processing millions of tokens, research indicates that model performance degrades as input length increases. A critical phenomenon is the “Lost in the Middle” (LiM) effect, where models demonstrate superior retrieval for information located at the beginning (primacy bias) or end (recency bias) of a prompt. Positional biases in reasoning are inherited from failures in information retrieval [16,17].
A significant gap exists between the sequential, text-centric training of LLMs and the topological requirements of block-based DES. LLMs are trained to predict the next word in a sequence, making them proficient at generating code but less effective at directly constructing the spatial and logical dependencies of graphical simulation platforms such as ExtendSim or Arena. DES models are networks of concurrent logic and event scheduling, whereas LLMs process information as a sequence [13,18].
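The contrast can be made concrete. A block-based DES model is naturally a directed graph, while an LLM must emit it as a linear sequence of tokens or build steps. The sketch below (block names are illustrative, not actual ExtendSim identifiers) serializes a small block network into the kind of ordered instruction list an LLM produces, using a topological sort:

```python
from graphlib import TopologicalSorter

# Illustrative block network: each block maps to the blocks feeding into it.
# Names are hypothetical, not actual ExtendSim block identifiers.
network = {
    "Create": [],
    "Queue": ["Create"],
    "Activity": ["Queue"],
    "Exit": ["Activity"],
}

# An LLM must flatten this graph into a sequence of build steps;
# any valid topological order is one such linearization.
order = list(TopologicalSorter(network).static_order())
print(order)  # ['Create', 'Queue', 'Activity', 'Exit']
```

The linearization is lossy in one direction: the edge set (which connector feeds which block) must be re-encoded explicitly in each step, which is precisely where sequence-trained models tend to drop or confuse topological dependencies.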
This mismatch often results in loss of nuance, logical breakdowns, and hallucinations. The most critical issue is AI hallucination, where an LLM generates plausible but incorrect code or fabricated references. This necessitates an iterative process in which human experts rigorously verify and validate AI-generated models to ensure they remain reasonable under real-world conditions [10,13,19].
Prior research has not explicitly addressed the integration of generative AI within traditional desktop simulation software nor analyzed specific applications in logistics modeling. The question of whether generative AI can create or assist in developing logistics simulation models remains unexplored.
This paper addresses this gap by evaluating the application of generative AI via LLMs in developing a logistics model using ExtendSim 10 Pro 2024 within a case study. The primary objective is to assess generative AI support in creating traditional desktop-based simulation models. During the model development process, Perplexity and ChatGPT are employed as LLMs. The research contribution lies in the analysis of three specific use cases of LLMs:
independent creation of a simulation model;
determination of simulation outputs based on input parameters and process information;
copilot-assisted guidance.
3. Results
3.1. Independent Development of a Simulation Model by LLMs
Both LLMs responded to the first prompt by confirming their ability to create a simulation model. Perplexity provided a Growth model example with source code. ChatGPT proposed a DES model concept, described required inputs, and offered Python code.
For the second prompt, both LLMs stated that they cannot directly interact with the traditional desktop-based simulation software ExtendSim. They offered alternatives: code generation or a step-by-step guide for manual model creation.
For the third prompt, they explained the lack of direct access to local software. They suggested enabling such access would require an appropriate Application Programming Interface (API) or scripting.
A supplementary prompt was used to check whether the limitation stemmed from a lack of ExtendSim modeling knowledge. Both models denied this. Perplexity referenced ExtendSim sources and manuals. ChatGPT listed available libraries, software blocks, and example models without citations.
3.2. Determination of DES Results by LLMs
This experiment builds directly on the description of the construction material production process (including its schematic in Figure 1) and the quantitative parameters listed in Table 1 from Section 2. Both LLMs initially focused on the declared production range of 280–320 pallets per 12 h from Table 1. They estimated losses from process waste and weight loss after the drying activity.
The LLMs requested additional details. Perplexity specified mass-to-pallet ratio, exact material input rates, weight and dimensions of the finished product, and waste rates. ChatGPT asked for unit definitions, batch sizes, processing time distributions, pallet conversion rate, and simulation duration.
Within prompt refinement, a detailed textual description of the process was provided, standardizing flow to one ton regardless of whether it represented raw materials, products, or pallets.
Perplexity reported an initial output of 45,128 tons of packaged products. The computations were based on converting average flow rates and total mass losses, scaled to consume all 64,000 tons of input clay. Disabling this scaling reduced the result to 5082 tons, still well below the benchmark value of 9721 tons. Additional refinement prompts included model settings checks and an explicit definition of a DES model, in which one item equals one ton of material. Following the DES definition, Perplexity declared that it was reproducing discrete-event simulation behavior and generated the outputs shown in Table 2.
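The gap between Perplexity's two modeling stances can be illustrated with a toy calculation. The loss fraction, processing rate, and horizon below are assumed for illustration only (they are not the case-study parameters); the point is that a steady-state mass balance consumes all available input regardless of time, while a discrete, time-limited flow is bounded by the bottleneck rate and the run length:

```python
# Illustrative comparison (parameters are assumed, not the case-study values):
# a steady-state mass balance scales output to consume all input, while a
# time-limited discrete flow is capped by processing rate and run length.
input_clay_tons = 64_000      # total available input (from the case study)
loss_fraction = 0.15          # assumed overall mass loss
rate_tons_per_hour = 30       # assumed bottleneck processing rate
run_hours = 12                # assumed simulation horizon

# Mass-balance view: all input is consumed regardless of time.
mass_balance_output = input_clay_tons * (1 - loss_fraction)

# Discrete-event view: one item = one ton; throughput is capped by time.
des_output = min(input_clay_tons, rate_tons_per_hour * run_hours) * (1 - loss_fraction)

print(mass_balance_output)  # 54400.0
print(des_output)           # 306.0
```

Under these assumed numbers the mass-balance estimate exceeds the time-limited one by two orders of magnitude, mirroring the direction of Perplexity's initial overestimate.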
ChatGPT immediately applied a DES approach, initially yielding 1303 and 1404 tons. Prompt refinements had to explicitly define storage units and clarify the interpretation of water consumption in the simulation model. After the iterative refinement process, ChatGPT produced a final output range of 7500–9600 tons.
Following the second experiment, an accuracy assessment of the LLM results was conducted. The quantitative metric MAPE (Mean Absolute Percentage Error) quantifies the deviation of AI-generated outputs from the ExtendSim model benchmark of 9721 tons. The evaluation in Table 3 lists the LLM outputs after each prompt iteration; a lower MAPE indicates higher accuracy.
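Since each iteration yields a single estimate, MAPE here reduces to the absolute percentage deviation from the 9721 ton benchmark, which can be reproduced directly:

```python
def mape(estimate: float, benchmark: float = 9721.0) -> float:
    """Absolute percentage deviation of one LLM output from the benchmark."""
    return abs(estimate - benchmark) / benchmark * 100.0

# Values reported in the text for Perplexity's iterations:
print(round(mape(45_128)))    # 364  (initial steady-state estimate)
print(round(mape(9_734), 1))  # 0.1  (after the explicit DES definition)
```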
3.3. Copilot Approach to Assist in Simulation Model Development
The third experiment evaluated the copilot approach, in which LLMs assist the user in simulation model development by providing detailed step-by-step instructions. ChatGPT generated instructions for a DES model, while Perplexity attempted to construct a broader model of a continuous, or rather discrete-rate, nature. The instructions provided by Perplexity were inaccurate, did not correspond to the software’s available blocks, and contained non-functional equations that prevented the simulation from running. Ultimately, it was not possible to derive a model with a coherent and stable structure.
The task was therefore explicitly redefined for Perplexity as a DES model in which one item represents one ton. It subsequently generated a procedure consisting of twelve steps. Each step described in detail which block to use, where to set capacity, and how to define the appropriate distribution function. The steps were followed from the perspective of an inexperienced user.
Figure 2 depicts the initial ExtendSim model generated based on the first LLM prompt response. At first glance, it appears error-free. Detailed examination reveals missing blocks, unconnected entity connectors, incorrect flow merging, and swapped blocks. Red rectangles highlight locations with the highest concentration of errors, while underlined block labels indicate improper internal configurations.
The model simulation could not be executed because ExtendSim generated error messages related to equations lacking inputs or outputs, or Gate blocks without shift settings. By comparing the model and its settings with the process scheme and input parameter values, the following deficiencies were identified:
Insufficient input value for clay in the Create block;
Missing input storage units in the model;
Absence of blocks representing material feeding into the production process;
Coarse and Fine Milling activities combined into a single activity without accounting for delays in the distribution function;
LLM instructions did not describe how to connect flows from two silos into a single gate;
Gate blocks do not function because no block was identified for setting the shift schedule;
Press Mixer and Cutting activities combined into a single activity without accounting for delays in the distribution function;
Cutting waste is routed exclusively to the first silo, although this is not specified in the scheme;
Drying waste is not returned to the clay storage because it does not exist;
LLM instructions did not define how to set the equation for mass loss;
No initial inventory of 400 tons is generated in the finished goods storage at the start of the simulation;
Palletizing and Packaging activities combined into a single activity without accounting for delays in the distribution function.
According to the process scheme, the flows must be merged without any loss of entities, whereas the Batch block combines flows such that it requires one entity from each input and then creates a single entity from the three, meaning that 3 tons enter the block but only 1 ton exits. During refinement, some issues were resolved correctly and immediately, while others required extended interaction or remained unresolved.
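The behavioral difference can be sketched in plain Python as a simplified abstraction of the block semantics (this is not ExtendSim code): a Batch-style merge consumes one entity from each input and emits a single combined entity, while a Select-style merge passes every entity through unchanged:

```python
def batch_merge(inputs):
    """Batch-style: take one entity from each input, emit one combined entity."""
    rounds = min(len(q) for q in inputs)
    return [tuple(q[i] for q in inputs) for i in range(rounds)]

def select_merge(inputs):
    """Select Item In-style: pass every entity through, losing none."""
    return [item for q in inputs for item in q]

# Three feeders, each holding one 1-ton item:
feeders = [["ton"], ["ton"], ["ton"]]
print(len(batch_merge(feeders)))   # 1  -> 3 tons in, only 1 ton out
print(len(select_merge(feeders)))  # 3  -> all tons preserved
```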
The first inappropriate solution arose when merging two flows from the silos into a single Gate block. The initial and preferred recommendation was again the Batch block, which causes waiting and, additionally, entity loss. The schedule for Gate blocks is correctly set using the Shift block, but the LLM provided incorrect guidance on its configuration and connection to the Gate block. For recycling 1% back to both silos, the LLM proposed a complex method involving division and randomization with blocks that do not exist in ExtendSim. It also modeled weight loss after drying incorrectly. After a refinement clarifying that items cannot be divided, it suggested an incorrect attribute-based approach, leading the users to adopt an alternative method that the LLM then described correctly.
For initializing the inventory of finished goods in storage, the LLM initially proposed methods using blocks unavailable in ExtendSim. After refinement, it began describing the correct solution, although the later steps it suggested rendered the solution non-functional. The post-refinement solutions still imposed a capacity limit of 400, despite this not being defined in the scheme. The remaining identifiable issues were resolved correctly.
After several refinement iterations, the LLM generated procedures that resolved most errors identified following the initial prompt.
Figure 3 depicts a nearly correct and more complex ExtendSim model resulting from applying these procedures. The simulation software now executes runs, yielding an output of 971 tons instead of the case study benchmark of 9721 tons. Red arrows in the figure denote the four remaining unresolved entity flow merging issues, which cause the low simulation output.
In response to iterative prompts, the LLM generated infeasible solutions or non-existent blocks. When prompted to use blocks that work with entity flows, such as Select Item Out or Select Item In, it produced the correct solution, explained for merging the flows from the material feeders. The Select Item In block can equally be applied in the other instances of incorrect Batch block usage.
The resulting model is shown in Figure 4, where the Batch blocks have been replaced by Select Item In blocks. Model validation identified no further errors. Upon simulation run, this model achieves an output of 9718 tons with utilization rates of 56% for wet semi-finished product manufacturing, 99.9% for finished product manufacturing, and 48% for dispatch operations. These outputs closely match the benchmark model results.
The copilot approach using ChatGPT followed a very similar procedure. The instructions comprised 15 steps. Likewise, the initial model was not executable, and the identifiable errors largely overlapped with those from the previous case. A new issue specific to ChatGPT arose in that it used 1 ton of water per 1 ton of material mass; this was resolved through refinement prompting that clarified the underlying principle. The primary challenge remained the modeling of item-flow merging. For weight loss, ChatGPT did not propose the simpler correct solution but instead relied on attributes. However, the described configuration enabled even an inexperienced user to implement it successfully.
All LLM errors in the copilot approach were categorized. Table 4 summarizes their frequencies, total counts across LLMs, and the percentage distribution by category. The classification followed this schema:
Syntactic errors refer to incompatible blocks or settings, such as equations lacking inputs/outputs;
Logical fallacies refer to logically inappropriate modeling solutions, such as entity loss during flow merging;
LLM hallucinations refer to proposals of non-existent blocks or unsuitable methods, such as attribute-based merging;
Parameter misinterpretations refer to incorrect parameter values or model connections, such as imposing a 400 ton capacity limit on finished goods storage.
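The percentage distribution follows from simple relative frequencies over these categories. In the sketch below, only the total of 25 errors and the 28% hallucination share (both reported in the Conclusions for Perplexity) are taken from the text; the split of the remaining categories is assumed for illustration:

```python
from collections import Counter

# Illustrative per-category counts for one LLM. Only the total (25) and the
# hallucination share (28%) match the reported figures; the remaining split
# is assumed.
errors = Counter({
    "syntactic": 6,
    "logical": 7,
    "hallucination": 7,
    "parameter": 5,
})

total = sum(errors.values())
shares = {cat: round(100 * n / total) for cat, n in errors.items()}
print(total)                   # 25
print(shares["hallucination"]) # 28
```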
4. Discussion
This research relies exclusively on a single case study. The case features a manufacturing process with demanding logistical challenges under real-world constraints: its complexity includes feedback waste loops, mass loss during drying, and variable work shifts. This design highlights LLM limitations in advanced, real-world scenarios. The applicability to alternative logistics scenarios may differ according to system complexity, as specific characteristics such as material blending, flow splitting, and merging increase the difficulty of LLM comprehension.
Both LLMs confirmed their capability to create a simulation model. Perplexity provided a Growth model example with source code, while ChatGPT proposed a DES concept, including input descriptions and Python code. However, in response to subsequent prompts, they acknowledged the inability to directly interact with ExtendSim due to the lack of access to local software, which would require an API.
Unlike the study by Romero-Guerrero et al., in which generative AI automated FlexSim model creation via FlexScript [7], generative AI cannot independently build models in ExtendSim. ExtendSim lacks a direct API and integrated blocks for code-based programming, thereby preventing programmatic interaction with its drag-and-drop environment. For AI-assisted modeling attempts, simulation software with an open architecture must be selected.
In the second experiment, LLMs successfully approximated the ExtendSim model benchmark results but revealed critical limitations in interpreting inputs and system dynamics. Iterative prompt refinement was key to achieving accuracy, underscoring the necessity of human intervention. LLMs repeatedly requested data already present in the prompt, exposing issues with context retention and application. Providing a detailed textual specification of the process improved DES behavior approximation.
Perplexity initially applied a continuous steady-state mass balance model, ignoring time intervals and process time distributions. It exhibited less intuitive logistical understanding, resulting in a high initial error of 364%. Only after explicit DES definition did it achieve 9734 tons with MAPE of 0.1%, demonstrating broader analytical coverage and the ability to iteratively develop and refine responses.
ChatGPT started with a DES approach and showed more consistent intuitive logistical comprehension across iterations. However, it misinterpreted water as a fixed additive, yielding initial MAPE errors around 86%. After refinement iterations, final error variability stabilized in the 1.2–22.8% range, with lower errors achievable only near the upper bound of 9600 tons.
LLMs can generate accurate outputs only under human-guided iterative refinement. A simulation model generated largely by an LLM faces additional challenges in verification and validation, particularly when its internal logic remains opaque due to hallucinations or unstated assumptions. In practice, experts cannot confidently trace causal relationships or assumptions without deep manual inspection. LLMs should therefore serve as a supporting tool during model construction, ensuring developer control over structure and assumptions for reliable verification and validation.
In Experiment 3, the copilot approach represents the most promising form of LLM assistance in creating simulation models in ExtendSim, despite significant initial errors in the instructions. The initial model was inexecutable due to syntactic errors (e.g., equations lacking inputs), logical deficiencies, hallucinations, and incorrect parameters.
Targeted refinements resolved approximately 80% of the errors through new LLM instructions, though human critical thinking and intuition were required in some cases. A persistent issue was the incorrect recommendation of the Batch (or, in ChatGPT, Merge) block for entity merging, which in ExtendSim batches entities (causing entity loss) rather than merging flows; the correct solution is the Select Item In block. These refinements resulted in an executable model with an output of only 971 tons compared to the benchmark of 9721 tons.
Hallucinations likely arose from the functional similarity of blocks in ExtendSim, misinterpretation of source materials, or insufficient depth of LLM knowledge regarding specific simulation software. One possible solution is future work involving Retrieval-Augmented Generation (RAG) knowledge bases tailored specifically to ExtendSim to mitigate such issues.
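The retrieval step of such a RAG pipeline can be sketched with standard-library tools: score short, platform-specific documentation snippets by token overlap with the user's request and prepend the best match to the prompt, so the LLM is grounded in actual block names. The snippets and the overlap scoring below are illustrative only; a production pipeline would use embedding-based retrieval over the real ExtendSim documentation:

```python
# Minimal keyword-overlap retrieval over hypothetical documentation snippets;
# a real RAG setup would use embeddings, but the grounding idea is the same.
snippets = [
    "Select Item In merges several item flows into one flow without entity loss",
    "Batch waits for one item from each input and combines them into one item",
    "Shift defines a schedule that can open and close a Gate block",
]

def retrieve(query: str, docs: list[str]) -> str:
    """Return the snippet sharing the most tokens with the query."""
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

best = retrieve("how to merge item flows without loss", snippets)
print(best)  # the Select Item In snippet
```

Prepending `best` to the prompt before asking for modeling instructions would bias the LLM toward blocks that actually exist in the target software.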
A shift to a one-shot prompt with a focus on entity flow generated correct instructions, resulting in a validated model with an output of 9718 tons and utilization rates of 56% (wet semi-finished products), 99.9% (finished products), and 48% (dispatch), which are nearly identical to the benchmark.
Overall, within the copilot approach, ChatGPT generated slightly fewer errors. Error analysis comparing Perplexity and ChatGPT indicates that ChatGPT has fewer syntax-related issues. When it identifies the correct block, it typically also selects appropriate parameter settings. Across the remaining metrics, LLM-generated instructions are comparable in quality and frequency. Most errors and hallucinations stem from an incorrect understanding of system dynamics.
Perplexity also provided sources alongside its solutions. However, these sources were drawn from three different simulation platforms: ExtendSim, FlexSim, and Visual Components. This lack of source specificity may explain many of the hallucinations, as it led to suggestions involving non-existent blocks or incorrect methods.
From a time-efficiency perspective, the copilot approach shows clear advantages over traditional ExtendSim modeling. In both approaches, the tasks were performed by experienced ExtendSim users, which represents a limitation when generalizing the time metrics.
In the case study, classical model development involved interpreting outputs, identifying suitable methods, and validating them through iterative simulations before assembling and refining the final model. This process was further prolonged by the tendency of experienced users to explore alternative approaches. Model construction alone required approximately two working days, preceded by at least five days of system analysis and input parameter identification. Additionally, solution development often involved time-consuming review of materials, which could lead to suboptimal choices and repeated analysis.
In contrast, the copilot approach enabled complete model development, including verification and validation, within 8 to 10 h of working time. The LLM reduced the need for manual search and method selection by providing targeted suggestions, making the process more focused. However, unresolved LLM errors remain a limitation and require additional user intervention. Furthermore, the efficiency of this approach may differ if the LLM is not provided with preprocessed system analysis and input data.
LLMs are most effective when used as assistants during model development rather than as autonomous model generators. The copilot approach is a viable way to integrate LLMs into traditional software environments. No single LLM is clearly superior, and the choice depends largely on user preference. However, rigorous validation remains essential to ensure the accuracy and reliability of the resulting models.
Even inexperienced users can develop functional simulation models with LLM assistance, although the risk of hidden errors remains. As a result, model outputs may not accurately reflect real-world systems. Hallucinations and incomplete instructions highlight the need for validation by more experienced users.
5. Conclusions
The literature confirms the transformative potential of generative AI in logistics simulation, particularly through successes in programming frameworks such as Python/SimPy. Studies highlight the copilot approach as an effective paradigm for integrating generative AI into modeling and simulation. A few research studies have addressed commercial simulation software, but comprehensive evaluations for desktop platforms like ExtendSim have been lacking, especially in complex cases encountered in real industrial environments.
This study fills this gap by evaluating the modeling and simulation of a real manufacturing system using LLMs Perplexity and ChatGPT. The evaluation covers the following three scenarios: autonomous model creation, output estimation, and copilot assistance. The benchmarked complex DES model of construction materials production in ExtendSim yields 9721 tons of finished products.
Findings indicate that LLMs cannot directly construct models without an API; ExtendSim lacks an open architecture supporting one, rendering autonomous model construction impossible. Output estimation reached benchmark levels only after iterative prompt refinement, with final MAPE deviations of 0.1% (Perplexity) and 1.2–22.8% (ChatGPT). LLMs face additional challenges in verification and validation, as experts cannot confidently trace causal relationships or assumptions without deep manual inspection.
The copilot approach enabled creation of a functional model yielding 9718 tons. Initial instructions contained errors preventing simulation execution, but refinement iterations produced a correct, functional model. Error analysis revealed 25 errors for Perplexity and 22 for ChatGPT, with hallucinations comprising 28% (Perplexity) and 32% (ChatGPT). ChatGPT exhibited fewer syntactic errors. The copilot approach substantially reduced model development time versus traditional ExtendSim modeling, shortening the process from several days to approximately 8–10 h. However, as both were performed by experienced users, time savings may not fully generalize to novices.
Future work should incorporate Retrieval-Augmented Generation (RAG) knowledge bases tailored to ExtendSim, systematic long-term monitoring of AI consistency, comparative analyses across commercial platforms, and evaluations of output quality under varied prompting strategies.
Generative AI performs best as a copilot rather than autonomously. No LLM is unequivocally superior. Practical deployment requires structured human–AI workflows, with validation essential to ensure model accuracy and reliability.