1. Introduction
The logistics and supply chain management sector faces unprecedented challenges in the contemporary global economy, characterized by increasing complexity and volatility. The emergence of Industry 4.0 and its subsequent evolution toward Industry 5.0 have catalyzed a fundamental transformation in how logistics processes are modeled, simulated, and optimized [1]. This paradigmatic shift is driven by an integrated ecosystem of cyber-physical systems, the Internet of Things (IoT), and advanced artificial intelligence [2].
Artificial intelligence has become a key enabler of Logistics 4.0. Alongside “classical” approaches, generative artificial intelligence has gained prominence by producing text, code, and other content based on large language models (LLMs) [3].
In logistics, generative AI is increasingly employed not only for predictive tasks but also for generating transportation plans, simulating supply chain disruption scenarios, and supporting strategic planning. Its hybrid use opens new possibilities through natural language interfaces [2,4].
Simulation modeling remains a cornerstone methodology for analyzing complex logistics systems. Three primary simulation paradigms dominate logistics modeling: discrete-event simulation, agent-based modeling, and continuous simulation [5]. Traditional simulation software platforms such as Arena, FlexSim, ExtendSim, and Tecnomatix Plant Simulation provide graphical interfaces, yet they still require substantial domain expertise and technical proficiency. The inherent complexity of these tools often creates a bottleneck in simulation adoption, particularly for domain experts who lack formal training in simulation methodologies [6,7].
In an era marked by global disruptions, the ability to model, simulate, and optimize logistics systems with AI-driven intelligence has become a strategic imperative. Integrating AI systems with existing logistics infrastructure presents significant technical challenges and requires organizational changes [8,9]. It is therefore essential to also examine how generative AI capabilities for model development can be implemented within traditional simulation software platforms.
The recent emergence of LLMs, such as GPT-3 and GPT-4, represents a breakthrough in assisted modeling. These systems have demonstrated remarkable capabilities in code generation, natural language understanding, and domain-specific reasoning. Research indicates that LLMs can extract structured information from textual descriptions and map it directly to simulation constructs [6], enabling the translation of verbal descriptions into functional simulation code [10].
In this context, the “copilot” metaphor has emerged as the dominant framework for human–AI collaboration in simulation development. In this paradigm, users provide high-level descriptions and iteratively refine AI-generated outputs through suggestions and corrections. This approach leverages complementary strengths: humans provide domain expertise and strategic judgment, while LLMs contribute rapid code generation and tireless iteration [11,12,13].
Multiple studies have demonstrated successful applications of generative AI to Python-based simulation frameworks, where models generate valid queuing, inventory-control, healthcare DES, DEVS, and physics-based PyChrono simulations with substantial gains in setup speed and code accuracy [6,10,11,12].
Most existing studies focus on programming-based approaches or specialized frameworks, which may limit their immediate practical applicability for organizations invested in commercial platforms. Only a limited number of studies explicitly address commercial simulation platforms. Romero-Guerrero et al. demonstrated the use of ChatGPT to automatically generate FlexSim (version 23.0.1, 64-bit, educational license) DES models from industrial layouts. This approach transforms the simulation modeling process by automating the translation of physical layouts into executable FlexSim models using FlexScript coding, thereby offering advantages in speed and accessibility. However, it still faces limitations in interpreting complex visual inputs, which require manual validation [7].
Reider et al. explored the use of generative AI for automated simulation model generation in Tecnomatix Plant Simulation. Their synergistic approach aims to democratize access to simulation modeling by lowering technical barriers. In their framework, generative AI interfaces directly with Tecnomatix through a programmed mediator. While the study highlights benefits in terms of speed and accessibility, it lacks an implementation, validation of the framework, and a detailed discussion of its limitations [14].
Wiśniewski et al. investigated the integration of GPT-4 with Arena DES for intelligent management in agile projects. Their study benchmarked GPT-4 outputs against Arena simulations. However, the two models were developed independently, with the Arena models serving as the baseline for assessing the adequacy of the AI-generated results. There is no evidence that the LLM contributed to the creation of the Arena models [15].
Successful implementation of LLMs in model creation faces significant hurdles. LLM capabilities remain limited to relatively simple problems due to constraints in context windows and the quality of training data [6,12].
While modern LLMs feature extensive context windows capable of processing millions of tokens, research indicates that model performance degrades as input length increases. A critical phenomenon is the “Lost in the Middle” (LiM) effect, where models demonstrate superior retrieval for information located at the beginning (primacy bias) or end (recency bias) of a prompt. Positional biases in reasoning are inherited from failures in information retrieval [16,17].
A significant gap exists between the sequential, text-centric training of LLMs and the topological requirements of block-based DES. LLMs are trained to predict the next word in a sequence, making them proficient at generating code but less effective at directly constructing the spatial and logical dependencies of graphical simulation platforms such as ExtendSim or Arena. DES models are networks of concurrent logic and event scheduling, whereas LLMs process information as a sequence [13,18].
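The contrast can be made concrete. A block-based DES model is naturally a directed graph, while an LLM must emit it as a linear sequence of tokens or build steps. The sketch below (block names are illustrative, not actual ExtendSim identifiers) serializes a small block network into the kind of ordered instruction list an LLM produces, using a topological sort:

```python
from graphlib import TopologicalSorter

# Illustrative block network: each block maps to the blocks feeding into it.
# Names are hypothetical, not actual ExtendSim block identifiers.
network = {
    "Create": [],
    "Queue": ["Create"],
    "Activity": ["Queue"],
    "Exit": ["Activity"],
}

# An LLM must flatten this graph into a sequence of build steps;
# any valid topological order is one such linearization.
order = list(TopologicalSorter(network).static_order())
print(order)  # ['Create', 'Queue', 'Activity', 'Exit']
```

The linearization is lossy in one direction: the edge set (which connector feeds which block) must be re-encoded explicitly in each step, which is precisely where sequence-trained models tend to drop or confuse topological dependencies.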
This mismatch often results in loss of nuance, logical breakdowns, and hallucinations. The most critical issue is AI hallucination, where an LLM generates plausible but incorrect code or fabricated references. This necessitates an iterative process in which human experts rigorously verify and validate AI-generated models to ensure they remain reasonable under real-world conditions [10,13,19].
Prior research has not explicitly addressed the integration of generative AI within traditional desktop simulation software nor analyzed specific applications in logistics modeling. The question of whether generative AI can create or assist in developing logistics simulation models remains unexplored.
This paper addresses this gap by evaluating the application of generative AI via LLMs in developing a logistics model using ExtendSim 10 Pro 2024 within a case study. The primary objective is to assess generative AI support in creating traditional desktop-based simulation models. During the model development process, Perplexity and ChatGPT are employed as LLMs. The research contribution lies in the analysis of three specific use cases of LLMs:
independent creation of a simulation model;
determination of simulation outputs based on input parameters and process information;
copilot-assisted guidance.
3. Results
3.1. Independent Development of a Simulation Model by LLMs
Both LLMs responded to the first prompt by confirming their ability to create a simulation model. Perplexity provided a Growth model example with source code. ChatGPT proposed a DES model concept, described required inputs, and offered Python code.
For the second prompt, both LLMs stated that they cannot directly interact with the traditional desktop-based simulation software ExtendSim. They offered alternatives: code generation or a step-by-step guide for manual model creation.
For the third prompt, they explained the lack of direct access to local software. They suggested enabling such access would require an appropriate Application Programming Interface (API) or scripting.
A supplementary prompt was used to check whether the limitation stemmed from a lack of ExtendSim modeling knowledge. Both models denied this. Perplexity referenced ExtendSim sources and manuals. ChatGPT listed available libraries, software blocks, and example models without citations.
3.2. Determination of DES Results by LLMs
This experiment builds directly on the description of the construction material production process (including its schematic in Figure 1) and the quantitative parameters listed in Table 1 from Section 2. Both LLMs initially focused on the declared production range of 280–320 pallets per 12 h from Table 1. They estimated losses from process waste and weight loss after the drying activity.
The LLMs requested additional details. Perplexity specified mass-to-pallet ratio, exact material input rates, weight and dimensions of the finished product, and waste rates. ChatGPT asked for unit definitions, batch sizes, processing time distributions, pallet conversion rate, and simulation duration.
Within prompt refinement, a detailed textual description of the process was provided, standardizing flow to one ton regardless of whether it represented raw materials, products, or pallets.
Perplexity reported an initial output of 45,128 tons of packaged products. The computations were based on converting average flow rates and total mass losses, scaled to consume all 64,000 tons of input clay. Disabling this scaling reduced the result to 5082 tons, still well below the benchmark value of 9721 tons. Additional refinement prompts included model settings checks and an explicit definition of a DES model, in which one item equals one ton of material. Following the DES definition, Perplexity declared that it was reproducing discrete-event simulation behavior and generated the outputs shown in Table 2.
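The gap between Perplexity's two modeling stances can be illustrated with a toy calculation. The loss fraction, processing rate, and horizon below are assumed for illustration only (they are not the case-study parameters); the point is that a steady-state mass balance consumes all available input regardless of time, while a discrete, time-limited flow is bounded by the bottleneck rate and the run length:

```python
# Illustrative comparison (parameters are assumed, not the case-study values):
# a steady-state mass balance scales output to consume all input, while a
# time-limited discrete flow is capped by processing rate and run length.
input_clay_tons = 64_000      # total available input (from the case study)
loss_fraction = 0.15          # assumed overall mass loss
rate_tons_per_hour = 30       # assumed bottleneck processing rate
run_hours = 12                # assumed simulation horizon

# Mass-balance view: all input is consumed regardless of time.
mass_balance_output = input_clay_tons * (1 - loss_fraction)

# Discrete-event view: one item = one ton; throughput is capped by time.
des_output = min(input_clay_tons, rate_tons_per_hour * run_hours) * (1 - loss_fraction)

print(mass_balance_output)  # 54400.0
print(des_output)           # 306.0
```

Under these assumed numbers the mass-balance estimate exceeds the time-limited one by two orders of magnitude, mirroring the direction of Perplexity's initial overestimate.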
ChatGPT immediately applied a DES approach, initially yielding 1303 and 1404 tons. Prompt refinements had to explicitly define storage units and clarify the interpretation of water consumption in the simulation model. After the iterative refinement process, ChatGPT produced a final output range of 7500–9600 tons.
Following the second experiment, an accuracy assessment of the LLM results was conducted. The quantitative metric MAPE (Mean Absolute Percentage Error) quantifies the deviation of AI-generated outputs from the ExtendSim model benchmark of 9721 tons. The evaluation in Table 3 lists the LLM outputs after each prompt iteration; a lower MAPE indicates higher accuracy.
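Since each iteration yields a single estimate, MAPE here reduces to the absolute percentage deviation from the 9721 ton benchmark, which can be reproduced directly:

```python
def mape(estimate: float, benchmark: float = 9721.0) -> float:
    """Absolute percentage deviation of one LLM output from the benchmark."""
    return abs(estimate - benchmark) / benchmark * 100.0

# Values reported in the text for Perplexity's iterations:
print(round(mape(45_128)))    # 364  (initial steady-state estimate)
print(round(mape(9_734), 1))  # 0.1  (after the explicit DES definition)
```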
3.3. Copilot Approach to Assist in Simulation Model Development
The third experiment evaluated the copilot approach, in which LLMs assist the user in simulation model development by providing detailed step-by-step instructions. ChatGPT generated instructions for a DES model, while Perplexity attempted to construct a broader model of a continuous, or rather discrete-rate, nature. The instructions provided by Perplexity were inaccurate, did not correspond to the software’s available blocks, and contained non-functional equations that prevented the simulation from running. Ultimately, it was not possible to derive a model with a coherent and stable structure.
The task was therefore explicitly redefined for Perplexity as a DES model in which one item represents one ton. It subsequently generated a procedure consisting of twelve steps. Each step described in detail which block to use, where to set capacity, and how to define the appropriate distribution function. The steps were followed from the perspective of an inexperienced user.
Figure 2 depicts the initial ExtendSim model generated based on the first LLM prompt response. At first glance, it appears error-free. Detailed examination reveals missing blocks, unconnected entity connectors, incorrect flow merging, and swapped blocks. Red rectangles highlight locations with the highest concentration of errors, while underlined block labels indicate improper internal configurations.
The model simulation could not be executed because ExtendSim generated error messages related to equations lacking inputs or outputs, or Gate blocks without shift settings. By comparing the model and its settings with the process scheme and input parameter values, the following deficiencies were identified:
Insufficient input value for clay in the Create block;
Missing input storage units in the model;
Absence of blocks representing material feeding into the production process;
Coarse and Fine Milling activities combined into a single activity without accounting for delays in the distribution function;
LLM instructions did not describe how to connect flows from two silos into a single gate;
Gate blocks do not function because no block was identified for setting the shift schedule;
Press Mixer and Cutting activities combined into a single activity without accounting for delays in the distribution function;
Cutting waste is routed exclusively to the first silo, although this is not specified in the scheme;
Drying waste is not returned to the clay storage because it does not exist;
LLM instructions did not define how to set the equation for mass loss;
No initial inventory of 400 tons is generated in the finished goods storage at the start of the simulation;
Palletizing and Packaging activities combined into a single activity without accounting for delays in the distribution function.
According to the process scheme, the flows must be merged without any loss of entities, whereas the Batch block combines flows such that it requires one entity from each input and then creates a single entity from the three, meaning that 3 tons enter the block but only 1 ton exits. During refinement, some issues were resolved correctly and immediately, while others required extended interaction or remained unresolved.
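The behavioral difference can be sketched in plain Python as a simplified abstraction of the block semantics (this is not ExtendSim code): a Batch-style merge consumes one entity from each input and emits a single combined entity, while a Select-style merge passes every entity through unchanged:

```python
def batch_merge(inputs):
    """Batch-style: take one entity from each input, emit one combined entity."""
    rounds = min(len(q) for q in inputs)
    return [tuple(q[i] for q in inputs) for i in range(rounds)]

def select_merge(inputs):
    """Select Item In-style: pass every entity through, losing none."""
    return [item for q in inputs for item in q]

# Three feeders, each holding one 1-ton item:
feeders = [["ton"], ["ton"], ["ton"]]
print(len(batch_merge(feeders)))   # 1  -> 3 tons in, only 1 ton out
print(len(select_merge(feeders)))  # 3  -> all tons preserved
```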
The first inappropriate solution arose when merging two flows from the silos into a single Gate block. The initial and preferred recommendation was again the Batch block, which causes waiting and, additionally, entity loss. The schedule for Gate blocks is correctly set using the Shift block, but the LLM provided incorrect guidance on its configuration and connection to the Gate block. For recycling 1% back to both silos, the LLM proposed a complex method involving division and randomization with blocks that do not exist in ExtendSim. It also modeled weight loss after drying incorrectly. After a refinement clarifying that items cannot be divided, it suggested an incorrect attribute-based approach, leading the users to adopt an alternative method that the LLM then described correctly.
For initializing the inventory of finished goods in storage, the LLM initially proposed methods using blocks unavailable in ExtendSim. After refinement, it began describing the correct solution, although the later steps it suggested rendered the solution non-functional. The post-refinement solutions still imposed a capacity limit of 400, despite this not being defined in the scheme. The remaining identifiable issues were resolved correctly.
After several refinement iterations, the LLM generated procedures that resolved most errors identified following the initial prompt.
Figure 3 depicts a nearly correct and more complex ExtendSim model resulting from applying these procedures. The simulation software now executes runs, yielding an output of 971 tons instead of the case study benchmark of 9721 tons. Red arrows in the figure denote the four remaining unresolved entity flow merging issues, which cause the low simulation output.
In response to iterative prompts, the LLM generated infeasible solutions or non-existent blocks. When prompted to use blocks that work with entity flows, such as Select Item Out or Select Item In, it produced the correct solution, explained for merging the flows from the material feeders. The Select Item In block can equally be applied in the other instances of incorrect Batch block usage.
The resulting model is shown in Figure 4, where the Batch blocks have been replaced by Select Item In blocks. Model validation identified no further errors. Upon simulation run, this model achieves an output of 9718 tons with utilization rates of 56% for wet semi-finished product manufacturing, 99.9% for finished product manufacturing, and 48% for dispatch operations. These outputs closely match the benchmark model results.
The copilot approach using ChatGPT followed a very similar procedure. The instructions comprised 15 steps. Likewise, the initial model was not executable, and the identifiable errors largely overlapped with those from the previous case. A new issue specific to ChatGPT arose in that it used 1 ton of water per 1 ton of material mass; this was resolved through refinement prompting that clarified the underlying principle. The primary challenge remained the modeling of item-flow merging. For weight loss, ChatGPT did not propose the simpler correct solution but instead relied on attributes. However, the described configuration enabled even an inexperienced user to implement it successfully.
All LLM errors in the copilot approach were categorized. Table 4 summarizes their frequencies, total counts across LLMs, and the percentage distribution by category. The classification followed this schema:
Syntactic errors refer to incompatible blocks or settings, such as equations lacking inputs/outputs;
Logical fallacies refer to logically inappropriate modeling solutions, such as entity loss during flow merging;
LLM hallucinations refer to proposals of non-existent blocks or unsuitable methods, such as attribute-based merging;
Parameter misinterpretations refer to incorrect parameter values or model connections, such as imposing a 400 ton capacity limit on finished goods storage.
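The percentage distribution follows from simple relative frequencies over these categories. In the sketch below, only the total of 25 errors and the 28% hallucination share (both reported in the Conclusions for Perplexity) are taken from the text; the split of the remaining categories is assumed for illustration:

```python
from collections import Counter

# Illustrative per-category counts for one LLM. Only the total (25) and the
# hallucination share (28%) match the reported figures; the remaining split
# is assumed.
errors = Counter({
    "syntactic": 6,
    "logical": 7,
    "hallucination": 7,
    "parameter": 5,
})

total = sum(errors.values())
shares = {cat: round(100 * n / total) for cat, n in errors.items()}
print(total)                   # 25
print(shares["hallucination"]) # 28
```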
4. Discussion
This research relies exclusively on a single case study. The case features a manufacturing process with demanding logistical challenges under real-world constraints: its complexity includes feedback waste loops, mass loss during drying, and variable work shifts. This design highlights LLM limitations in advanced, real-world scenarios. The applicability to alternative logistics scenarios may differ according to system complexity, as specific characteristics such as material blending, flow splitting, and merging increase the difficulty of LLM comprehension.
Both LLMs confirmed their capability to create a simulation model. Perplexity provided a Growth model example with source code, while ChatGPT proposed a DES concept, including input descriptions and Python code. However, in response to subsequent prompts, they acknowledged the inability to directly interact with ExtendSim due to the lack of access to local software, which would require an API.
Unlike the study by Romero-Guerrero et al., in which generative AI automated FlexSim model creation via FlexScript [7], generative AI cannot independently build models in ExtendSim. ExtendSim lacks a direct API and integrated blocks for code-based programming, thereby preventing programmatic interaction with its drag-and-drop environment. For AI-assisted modeling attempts, simulation software with an open architecture must be selected.
In the second experiment, LLMs successfully approximated the ExtendSim model benchmark results but revealed critical limitations in interpreting inputs and system dynamics. Iterative prompt refinement was key to achieving accuracy, underscoring the necessity of human intervention. LLMs repeatedly requested data already present in the prompt, exposing issues with context retention and application. Providing a detailed textual specification of the process improved DES behavior approximation.
Perplexity initially applied a continuous steady-state mass balance model, ignoring time intervals and process time distributions. It exhibited less intuitive logistical understanding, resulting in a high initial error of 364%. Only after explicit DES definition did it achieve 9734 tons with MAPE of 0.1%, demonstrating broader analytical coverage and the ability to iteratively develop and refine responses.
ChatGPT started with a DES approach and showed more consistent intuitive logistical comprehension across iterations. However, it misinterpreted water as a fixed additive, yielding initial MAPE errors around 86%. After refinement iterations, final error variability stabilized in the 1.2–22.8% range, with lower errors achievable only near the upper bound of 9600 tons.
LLMs can generate accurate outputs only under human-guided iterative refinement. A simulation model generated largely by an LLM faces additional challenges in verification and validation, particularly when its internal logic remains opaque due to hallucinations or unstated assumptions. In practice, experts cannot confidently trace causal relationships or assumptions without deep manual inspection. LLMs should therefore serve as a supporting tool during model construction, ensuring developer control over structure and assumptions for reliable verification and validation.
In Experiment 3, the copilot approach represents the most promising form of LLM assistance in creating simulation models in ExtendSim, despite significant initial errors in the instructions. The initial model was inexecutable due to syntactic errors (e.g., equations lacking inputs), logical deficiencies, hallucinations, and incorrect parameters.
Targeted refinements resolved approximately 80% of the errors through new LLM instructions, though human critical thinking and intuition were required in some cases. A persistent issue was the incorrect recommendation of the Batch (or, in ChatGPT, Merge) block for entity merging, which in ExtendSim batches entities (causing entity loss) rather than merging flows; the correct solution is the Select Item In block. These refinements resulted in an executable model with an output of only 971 tons compared to the benchmark of 9721 tons.
Hallucinations likely arose from the functional similarity of blocks in ExtendSim, misinterpretation of source materials, or insufficient depth of LLM knowledge regarding specific simulation software. One possible solution is future work involving Retrieval-Augmented Generation (RAG) knowledge bases tailored specifically to ExtendSim to mitigate such issues.
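The retrieval step of such a RAG pipeline can be sketched with standard-library tools: score short, platform-specific documentation snippets by token overlap with the user's request and prepend the best match to the prompt, so the LLM is grounded in actual block names. The snippets and the overlap scoring below are illustrative only; a production pipeline would use embedding-based retrieval over the real ExtendSim documentation:

```python
# Minimal keyword-overlap retrieval over hypothetical documentation snippets;
# a real RAG setup would use embeddings, but the grounding idea is the same.
snippets = [
    "Select Item In merges several item flows into one flow without entity loss",
    "Batch waits for one item from each input and combines them into one item",
    "Shift defines a schedule that can open and close a Gate block",
]

def retrieve(query: str, docs: list[str]) -> str:
    """Return the snippet sharing the most tokens with the query."""
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

best = retrieve("how to merge item flows without loss", snippets)
print(best)  # the Select Item In snippet
```

Prepending `best` to the prompt before asking for modeling instructions would bias the LLM toward blocks that actually exist in the target software.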
A shift to a one-shot prompt with a focus on entity flow generated correct instructions, resulting in a validated model with an output of 9718 tons and utilization rates of 56% (wet semi-finished products), 99.9% (finished products), and 48% (dispatch), which are nearly identical to the benchmark.
Overall, within the copilot approach, ChatGPT generated slightly fewer errors. Error analysis comparing Perplexity and ChatGPT indicates that ChatGPT has fewer syntax-related issues. When it identifies the correct block, it typically also selects appropriate parameter settings. Across the remaining metrics, LLM-generated instructions are comparable in quality and frequency. Most errors and hallucinations stem from an incorrect understanding of system dynamics.
Perplexity also provided sources alongside its solutions. However, these sources were drawn from three different simulation platforms: ExtendSim, FlexSim, and Visual Components. This lack of source specificity may explain many of the hallucinations, as it led to suggestions involving non-existent blocks or incorrect methods.
From a time-efficiency perspective, the copilot approach shows clear advantages over traditional ExtendSim modeling. In both approaches, the tasks were performed by experienced ExtendSim users, which represents a limitation when generalizing the time metrics.
In the case study, classical model development involved interpreting outputs, identifying suitable methods, and validating them through iterative simulations before assembling and refining the final model. This process was further prolonged by the tendency of experienced users to explore alternative approaches. Model construction alone required approximately two working days, preceded by at least five days of system analysis and input parameter identification. Additionally, solution development often involved time-consuming review of materials, which could lead to suboptimal choices and repeated analysis.
In contrast, the copilot approach enabled complete model development, including verification and validation, within 8 to 10 h of working time. The LLM reduced the need for manual search and method selection by providing targeted suggestions, making the process more focused. However, unresolved LLM errors remain a limitation and require additional user intervention. Furthermore, the efficiency of this approach may differ if the LLM is not provided with preprocessed system analysis and input data.
LLMs are most effective when used as assistants during model development rather than as autonomous model generators. The copilot approach is a viable way to integrate LLMs into traditional software environments. No single LLM is clearly superior, and the choice depends largely on user preference. However, rigorous validation remains essential to ensure the accuracy and reliability of the resulting models.
Even inexperienced users can develop functional simulation models with LLM assistance, although the risk of hidden errors remains. As a result, model outputs may not accurately reflect real-world systems. Hallucinations and incomplete instructions highlight the need for validation by more experienced users.
5. Conclusions
The literature confirms the transformative potential of generative AI in logistics simulation, particularly through successes in programming frameworks such as Python/SimPy. Studies highlight the copilot approach as an effective paradigm for integrating generative AI into modeling and simulation. A few research studies have addressed commercial simulation software, but comprehensive evaluations for desktop platforms like ExtendSim have been lacking, especially in complex cases encountered in real industrial environments.
This study fills this gap by evaluating the modeling and simulation of a real manufacturing system using LLMs Perplexity and ChatGPT. The evaluation covers the following three scenarios: autonomous model creation, output estimation, and copilot assistance. The benchmarked complex DES model of construction materials production in ExtendSim yields 9721 tons of finished products.
Findings indicate that LLMs cannot directly construct models without an API; ExtendSim lacks an open architecture supporting one, rendering autonomous model construction impossible. Output estimation reached benchmark levels only after iterative prompt refinement, with final MAPE deviations of 0.1% (Perplexity) and 1.2–22.8% (ChatGPT). LLMs face additional challenges in verification and validation, as experts cannot confidently trace causal relationships or assumptions without deep manual inspection.
The copilot approach enabled creation of a functional model yielding 9718 tons. Initial instructions contained errors preventing simulation execution, but refinement iterations produced a correct, functional model. Error analysis revealed 25 errors for Perplexity and 22 for ChatGPT, with hallucinations comprising 28% (Perplexity) and 32% (ChatGPT). ChatGPT exhibited fewer syntactic errors. The copilot approach substantially reduced model development time versus traditional ExtendSim modeling, shortening the process from several days to approximately 8–10 h. However, as both were performed by experienced users, time savings may not fully generalize to novices.
Future work should incorporate Retrieval-Augmented Generation (RAG) knowledge bases tailored to ExtendSim, systematic long-term monitoring of AI consistency, comparative analyses across commercial platforms, and evaluations of output quality under varied prompting strategies.
Generative AI performs best as a copilot rather than autonomously. No LLM is unequivocally superior. Practical deployment requires structured human–AI workflows, with validation essential to ensure model accuracy and reliability.