Evaluating the Performance of Large Language Models for Geometry and Simulation File Generation in Physics-Based Simulations

Shafiq, Ossama; Rahmat, Amin; Alexiadis, Alessio; Ghiassi, Bahman

doi:10.3390/app152212114

¹ School of Engineering, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK

² School of Chemical Engineering, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(22), 12114; https://doi.org/10.3390/app152212114

Submission received: 16 October 2025 / Revised: 8 November 2025 / Accepted: 12 November 2025 / Published: 14 November 2025

Abstract

Finite-element simulations and computer-aided design workflows require complex preprocessing, with geometry creation and simulation setup traditionally demanding significant manual expertise. The question emerges: can machine learning, namely large language models, help automate these processes? This study evaluates how well nine large language models can automate finite-element simulations starting from natural language prompts, generating both the geometry files for meshing (using Gmsh, an open-source geometry and mesh generator) and the input files needed for the solver (using Elmer, an open-source multiphysics simulation tool). Two standard test cases, a simple bar and a wheel and axle assembly, are used to evaluate and compare their performance. A set of criteria and a scoring system are introduced to assess performance across geometry and simulation setup, covering aspects such as file completeness, Boolean operations, shape fidelity, and displacement error. Results show that most LLMs excel at generating solver input files, achieving a 78–88% success rate with <1% displacement error when executed. Geometry generation proves more challenging, with 70% success for simple shapes but only 56% for assemblies. Critically, no model successfully implemented Boolean operations required for merging components; GPT-4o uniquely attempted these operations but failed due to volume reuse errors. This 0% success rate for Boolean operations represents the primary bottleneck for assembly automation. Notable findings include extreme performance variability in the smallest model (PHI-3 Mini, varying 0–97% between similar tasks) and complete elimination of unit errors when explicitly prompted for SI units. The results reveal a clear capability gap: while LLMs reliably generate physics solver inputs, they cannot produce ready-to-mesh assemblies, requiring manual intervention for Boolean operations.
While the study focuses on a Gmsh–Elmer pipeline, the results likely generalise to other simulation software.

Keywords: large language models; physics-based simulations; geometry; Gmsh; GPT-4; LLAMA; artificial intelligence in engineering design

1. Introduction

Most engineers use simulation platforms like COMSOL or ANSYS to model and optimise physical systems. However, setting up a simulation is often complex. It involves several steps: creating the geometry, generating a mesh, defining boundary conditions, and configuring the solver. Each step requires expertise in specialised software. Recent advances in artificial intelligence suggest a possible alternative. Large Language Models (LLMs) could automate significant parts of preprocessing, such as generating geometry files for meshing tools and preparing input files for simulation solvers. Their strength lies in natural language understanding. In this context, they can ‘translate’ plain-English descriptions of a simulation into the structured input files required by physics-based software.

Alexiadis and Ghiassi (2024) [1] assessed the feasibility of this approach. We build on their work by comparing nine different LLMs, from smaller models (PHI-3-Mini and LLAMA-3-8B) to recent state-of-the-art models (GPT-4, GPT-4o, LLAMA-3-70B, and MIXTRAL variants) using two representative geometries: a simple square bar and a wheel and axle assembly [1]. We also introduce a rigorous, quantitative scoring system for (i) geometry completeness, Boolean-operation usage, and shape fidelity; and (ii) simulation input completeness, material-property inclusion, and output accuracy.

The exploration of LLMs in the realm of physics-based simulations is still in its early stages, but several studies have begun to explore their potential and limitations. Early research showed that LLMs, such as GPT-3, can be employed for generating code and documentation for simple physics simulations [2]. However, these early models often struggle with cases that go beyond very simple systems due to their inherent limitations in understanding physical laws and ensuring numerical accuracy [3].

The spatial reasoning limitations observed in Computer Aided Design (CAD) generation align with broader documented challenges LLMs face in geometric tasks. Studies on Scalable Vector Graphics (SVG) generation from text prompts (e.g., “pelican riding a bicycle”) demonstrate that while LLMs produce syntactically correct vector graphics code, they struggle with spatial composition and maintaining geometric relationships between elements [4]. Research on geometry problem solving shows particular difficulty with tasks requiring understanding of coaxial arrangements and multi-body assemblies [5].

Current studies highlight the potential of LLMs to enhance Model-Based Systems Engineering (MBSE) methodologies, particularly through the integration of simulation frameworks that can support complex system analyses and physical modelling [6]. However, the existing literature indicates that the development of comprehensive modelling and simulation infrastructures remains insufficient, limiting the full application of MBSE in practical scenarios [5].

In recent years, artificial intelligence has been used to improve computational physics workflows. Many studies have explored machine learning methods to accelerate simulation by approximating or replacing traditional Partial Differential Equations (PDE) solvers [7,8,9]. These approaches focus on the solution stage and use tools such as physics-informed machine learning or neural operators. This has a different scope from the present study. While both involve AI and physics, the goals and methods are entirely different, and the reader should not confuse the two.

Also not to be confused is recent work on using LLMs within digital twin frameworks [9]. That study focuses on improving user interaction, automating reporting, and supporting decision-making in industrial systems. While it involves LLMs, its goals and technical focus are different from those of the present study.

Several recent studies have explored the intersection of LLMs and computational engineering, though with different scopes and objectives. Verduzco et al. [10] demonstrated GPT-4’s capability to interface with LAMMPS for molecular dynamics simulations, while Kumar et al. [11] developed MyCrunchGPT for physics-informed neural networks (PINNs) applications. Li et al. [12] introduced FEABench, which employs LLMs to generate COMSOL API calls, though their evaluation focuses solely on syntactic correctness rather than engineering validity. The work most closely related to ours is CADBench [13], which, alongside Alexiadis and Ghiassi (2024) [1], explores LLM-generated CAD models.

However, our study makes several key advances beyond these works. Unlike CADBench, we incorporate physics-based simulations to validate functional correctness, not merely geometric form. We move beyond qualitative assessments (visual inspection, bounding box measurements, and face counting) to introduce a rigorous quantitative framework encompassing structural completeness, Boolean operation validation, and simulation accuracy metrics. Furthermore, while previous studies typically evaluate a single model [13], we present a comprehensive comparison of nine LLMs spanning different architectures and parameter scales, revealing important insights about model capabilities and limitations in engineering applications.

2. Methodology

In this study, we explore how well LLMs can generate the input files needed to run physics-based simulations. This includes geometry files (.geo) and simulation input files (.sif), used by Gmsh (v4.12.0), a mesh generator [14], and Elmer (v9.0), a multiphysics solver [15]. We use these two software programmes because they are open-source and freely available, but the approach is applicable to other simulation tools. The .geo file is a plain-text script written in Gmsh’s native language. It defines the geometry of the domain to be meshed, including points, lines, surfaces, volumes, and any Boolean operations needed to combine shapes. The .sif file is also a plain-text script, written in Elmer’s format. It describes the physical setup of the simulation: equations to be solved, material properties, boundary conditions, solver settings, and time-stepping options. The goal is to describe the simulation setup in plain language, let LLMs generate both files, and compare the results produced by several LLMs.
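To make the .geo format concrete, the following is a minimal sketch of a Python helper that writes a Gmsh script for the square bar from points, lines, surfaces, and one volume. It is a hand-written illustration, not the output of any evaluated model; the function name `square_bar_geo`, the mesh size, the physical-group label, and the choice of metres as units are assumptions made here.

```python
def square_bar_geo(length=0.10, width=0.01, height=0.01, mesh_size=0.005):
    """Return Gmsh .geo text for a box built from 8 points, 12 lines, 6 faces, 1 volume."""
    x, y, z = length, width, height
    corners = [(0, 0, 0), (x, 0, 0), (x, y, 0), (0, y, 0),
               (0, 0, z), (x, 0, z), (x, y, z), (0, y, z)]
    edges = [(1, 2), (2, 3), (3, 4), (4, 1),   # bottom face
             (5, 6), (6, 7), (7, 8), (8, 5),   # top face
             (1, 5), (2, 6), (3, 7), (4, 8)]   # vertical edges
    # Signed line ids: a negative id traverses the line in reverse.
    loops = [(1, 2, 3, 4), (5, 6, 7, 8), (1, 10, -5, -9),
             (2, 11, -6, -10), (3, 12, -7, -11), (4, 9, -8, -12)]
    out = ["// Square bar, SI units (metres); illustrative only"]
    for i, (px, py, pz) in enumerate(corners, start=1):
        out.append(f"Point({i}) = {{{px}, {py}, {pz}, {mesh_size}}};")
    for i, (a, b) in enumerate(edges, start=1):
        out.append(f"Line({i}) = {{{a}, {b}}};")
    for i, loop in enumerate(loops, start=1):
        ids = ", ".join(str(n) for n in loop)
        out.append(f"Curve Loop({i}) = {{{ids}}};")
        out.append(f"Plane Surface({i}) = {{{i}}};")
    out.append("Surface Loop(1) = {1, 2, 3, 4, 5, 6};")
    out.append("Volume(1) = {1};")
    out.append('Physical Volume("bar") = {1};')
    return "\n".join(out) + "\n"


geo_text = square_bar_geo()
print(geo_text)
```

The explicit point-line-surface-volume construction mirrors the primitives counted in the geometry evaluation; a script using a single box primitive would be shorter but would not exercise the same elements.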

As benchmarks, we selected two specific geometries for evaluation: a simple square bar and a more complex wheel and axle system. These geometries were chosen to represent a spectrum of complexity, from straightforward shapes to multi-component assemblies.

2.1. Workflow

This section outlines the systematic workflow followed to generate and evaluate geometry and simulation files using LLMs for physics-based simulations. The workflow, as illustrated in Figure 1, consists of several key stages, each supported by specific code implementations designed to automate and streamline the process.

We deliberately keep the code in this workflow relatively simple. If the goal of this study is to leverage LLMs to enable non-expert users to set up multiphysics simulations, the same principle of accessibility must apply to the code used in this publication. Therefore, we explain all aspects of the code in detail and, wherever possible, minimise its complexity so that users with varying levels of expertise can understand and implement it.

The process begins with setting up the environment and creating scripts that allow for the manipulation of .geo and .sif files, which are necessary for generating geometries and running simulations in Gmsh and Elmer, respectively. These scripts were written in Python, using libraries such as LangChain [16] for interacting with LLMs and Gmsh’s Python API for meshing tasks.

In the LangChain framework, the LLMChain object is configured to connect an LLM with a prompt, incorporating a memory mechanism and specific output controls. The conversation buffer window memory is set to retain the last eight interactions, allowing the model to maintain context across multiple turns in a conversation. The language model is further configured with a maximum of 4096 tokens, which limits the response length, and a temperature of 0.00, which makes the output effectively deterministic. This setup enables the generation of contextually aware, controlled responses while efficiently managing conversational history.
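The windowed-memory idea can be illustrated without any LLM library. The sketch below is not LangChain's implementation; the class `WindowMemory` and the names `GENERATION_SETTINGS`, `add`, and `as_prompt` are hypothetical. It only shows how a buffer limited to the last eight interactions behaves, alongside the two generation settings quoted above.

```python
from collections import deque

# Generation settings quoted in the text (the key names are illustrative).
GENERATION_SETTINGS = {"max_tokens": 4096, "temperature": 0.0}


class WindowMemory:
    """Keep only the k most recent (user, assistant) exchanges."""

    def __init__(self, k=8):
        self.turns = deque(maxlen=k)  # older exchanges are dropped automatically

    def add(self, user_message, assistant_reply):
        self.turns.append((user_message, assistant_reply))

    def as_prompt(self):
        """Render the retained history as plain text to prepend to the next prompt."""
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)


memory = WindowMemory(k=8)
for i in range(10):
    memory.add(f"prompt {i}", f"reply {i}")

# Only the last eight exchanges remain after ten turns.
print(len(memory.turns), memory.turns[0])
```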

2.1.1. Development of Prompt Templates

To facilitate effective interaction with the LLMs, we developed prompt templates, comprising a tailored system prompt (Figure 2) and a sequence of user prompts (Figure 3), designed to guide the models in generating the required outputs. These templates were structured to ensure clarity and consistency across different models. Specifically, the templates provided the necessary structure and instructions for the LLMs to generate .geo files, which describe the geometrical properties of the models, and .sif files, which are used as input for Elmer physics-based simulations. This structured approach ensured that the generated files met the requirements for accurate geometry representation and simulation.

2.1.2. Implementation of Functions

To process the outputs from the LLMs and facilitate the generation and validation of geometry and simulation files, we implemented several key functions in Python, leveraging specific libraries tailored to interact with the LLMs and simulation tools. These functions are crucial in automating the workflow and ensuring that the outputs are correctly formatted and ready for simulation. The functions described below are also integrated into the workflow presented in Figure 1, which illustrates how these components fit into the overall process:

extract_and_save_geo_file: This function extracts the content of a .geo file from the response text generated by the LLM and saves it to a file. The LLM, accessed via the LangChain library, generates text that describes the geometry. This text is then parsed by the function to identify relevant sections and save them in a structured .geo file format compatible with Gmsh (Figure 4a and Figure 5a).
generate_mesh: Utilising the Gmsh Python API, this function creates a 3D mesh from the .geo file. The function initialises Gmsh, sets mesh size options (using default settings for consistency), generates the mesh, and writes it to an output .msh file. This mesh file is an intermediate step between geometry creation and simulation setup, essential for discretising the model into elements that Elmer can process (Figure 4b and Figure 5b).
generate_ELMER_mesh: This function uses the ElmerGrid tool, a command-line utility that comes with the Elmer software, to convert the .msh file generated by Gmsh into a format that Elmer can use. This conversion is necessary because Elmer requires a specific mesh format to perform simulations, and this function automates that conversion process.
extract_and_save_sif_file: Similar to the extract_and_save_geo_file function, this function extracts the content of a .sif file from the LLM’s response and saves it to a file. The .sif file contains simulation parameters such as material properties, boundary conditions, and solver settings. This function ensures that the text generated by the LLM is structured correctly for Elmer to execute the simulation (Figure 4c and Figure 5c).
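The extraction and conversion steps listed above could look roughly as follows. This is a sketch under stated assumptions, not the authors' code: it assumes the LLM wraps the file in a Markdown-style code fence, and the helper names `extract_code`, `save_llm_file`, and `elmergrid_command` are hypothetical. The ElmerGrid arguments use its format codes 14 (Gmsh input) and 2 (ElmerSolver output); the command is only built here, not executed.

```python
import re
from pathlib import Path

FENCE = "`" * 3  # Markdown code-fence marker
CODE_BLOCK = re.compile(FENCE + r"[A-Za-z]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response_text):
    """Return the first fenced block, or the whole reply if no fence is found."""
    match = CODE_BLOCK.search(response_text)
    body = match.group(1) if match else response_text
    return body.strip() + "\n"


def save_llm_file(response_text, path):
    """Extract the file content from an LLM reply and write it to `path`."""
    content = extract_code(response_text)
    Path(path).write_text(content, encoding="utf-8")
    return content


def elmergrid_command(msh_path):
    """Build the ElmerGrid call that converts a Gmsh .msh file for ElmerSolver."""
    return ["ElmerGrid", "14", "2", str(msh_path), "-autoclean"]


reply = f"Here is the file:\n{FENCE}geo\nPoint(1) = {{0, 0, 0, 0.005}};\n{FENCE}\nDone."
geo_text = extract_code(reply)
print(geo_text)
print(elmergrid_command("bar.msh"))
```

The command list can be passed to `subprocess.run` once ElmerGrid is installed and on the PATH.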

Interaction with the LLM was managed through API calls, where prompt templates were used to generate the desired output text for the .geo and .sif files.

2.2. Test Cases and Boundary Conditions

Test Case 1: Square bar (10 cm × 1 cm × 1 cm). Fixed at one end (face), point load F = 100 MN (meganewtons) at the other end (face).

Test Case 2: Wheel and axle assembly. Two wheels (r = 5 cm, w = 2 cm) connected by an axle (r = 1 cm, l = 20 cm). The assembly is treated as a single unified volume: it is fixed at the wheel on one end of the axle, and a point load F = 5 GN (giganewtons) is applied to the wheel on the other end. Material properties: Steel (E = 210 GPa, ν = 0.3, ρ = 7850 kg/m³).

Selection of LLMs: We selected a diverse set of LLMs to evaluate their performance in generating the necessary files for physics-based simulations. The models included MIXTRAL 8X7B, MIXTRAL 8X22B [17], LLAMA-2-70B [18], LLAMA-3-8B, LLAMA-3-70B [19], GPT-3.5 Turbo [20], GPT-4 [21], GPT-4o [22], and PHI-3-Mini [7]. This selection provided a broad spectrum of models with varying capabilities, ensuring a comprehensive evaluation of their strengths and weaknesses.

Software: Python 3.9, LangChain for LLM interaction, Gmsh Python API for mesh generation, meshio 5.3.4 for mesh analysis, SciPy 1.9.0 for distance calculations, and ElmerGrid for mesh conversion.

2.3. Geometry File Evaluation

Geometry files (.geo format, Gmsh v4.12.0) were assessed on three criteria:

Structural Completeness (40%)
Presence of required geometric primitives: Square bar: 8 points, 12 lines, 6 faces, 1 volume. Wheel and axle assembly: ≥3 cylinders, ≥2 volumes, ≥1 physical volume.
Dimensional Accuracy (40% simple geometry, 25% assemblies)
Point coordinates were extracted considering variable definitions, with bounding box dimensions compared against specifications using a ±10% tolerance (scoring heuristic inspired by the general-tolerance framework of ISO 2768-1:1989 [8] and common rapid-prototyping practices [9,23]).
Boolean Operations (15%, assemblies only)
Detection and validation of union/difference operations required for component merging, including syntax verification and volume reference consistency. Weights reflect engineering priorities: completeness and accuracy are fundamental requirements [24], while Boolean operations enable manufacturability for assemblies [25].
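As an illustration of the structural completeness criterion for the square bar, the sketch below counts primitives in a .geo script and compares them with the required 8 points, 12 lines, 6 faces, and 1 volume. The averaging rule in `structural_completeness` is an assumption made here (the exact formula is not given above), and the sample script is a hypothetical incomplete file, not a model's actual output.

```python
import re

REQUIRED_BAR = {"Point": 8, "Line": 12, "Surface": 6, "Volume": 1}

# Patterns are matched at the start of each ';'-separated statement, so
# "Surface Loop(...)" and "Physical Volume(...)" are not counted as primitives.
PATTERNS = {
    "Point": r"Point\s*\(",
    "Line": r"Line\s*\(",
    "Surface": r"(?:Plane\s+)?Surface\s*\(",
    "Volume": r"Volume\s*\(",
}


def count_primitives(geo_text):
    """Count primitive definitions in a .geo script, ignoring // comments."""
    text = re.sub(r"//.*", "", geo_text)
    statements = [s.strip() for s in text.split(";")]
    return {name: sum(1 for s in statements if re.match(pattern, s))
            for name, pattern in PATTERNS.items()}


def structural_completeness(counts, required):
    """Assumed rule: mean of found/required per primitive type, capped at 1."""
    ratios = [min(counts.get(name, 0) / n, 1.0) for name, n in required.items()]
    return 100.0 * sum(ratios) / len(ratios)


sample = """
L = 2;  // hypothetical incomplete script
Point(1) = {0, 0, 0};
Point(2) = {L, 0, 0};
Point(3) = {L, 1, 0};
Line(1) = {1, 2};
"""
counts = count_primitives(sample)
score = structural_completeness(counts, REQUIRED_BAR)
print(counts, round(score, 1))
```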

2.3.1. Quality Categories

Excellent (≥90%): Production-ready geometry.
Good (70–89%): Minor corrections required.
Fair (50–69%): Significant manual intervention needed.
Poor (<50%): Fundamental reconstruction required.

These thresholds align with the CAD model quality standards in Product Lifecycle Management (PLM) systems [26].

2.3.2. Geometry Implementation

Variable-aware parsing handled common patterns where dimensions are defined symbolically (e.g., L = 10; Point(1) = {0,0,L}). Cylinder components were classified by radius to distinguish wheels (3–7 cm) from axles (0.5–2 cm). Boolean validation specifically checked for volume reuse errors that cause Computer Aided Design (CAD) kernel failures.
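A minimal sketch of variable-aware parsing and the bounding-box check might look like this. It handles only coordinates that are plain numbers or single variable names (no arithmetic expressions), compares sorted extents so that the bar's orientation does not matter, and uses the radius ranges quoted above for classification; all function names are hypothetical and these simplifications are assumptions.

```python
import re

ASSIGN = re.compile(r"^([A-Za-z_]\w*)\s*=\s*([-+0-9.eE]+)$")
POINT = re.compile(r"^Point\s*\(\s*\d+\s*\)\s*=\s*\{([^}]*)\}$")


def resolve(token, variables):
    """Turn a coordinate token (number or variable name, optional minus) into a float."""
    token = token.strip()
    sign = 1.0
    if token.startswith("-"):
        sign, token = -1.0, token[1:].strip()
    value = variables[token] if token in variables else float(token)
    return sign * value


def bounding_box(geo_text):
    """Return the [dx, dy, dz] extents of all points, resolving symbolic dimensions."""
    variables, points = {}, []
    text = re.sub(r"//.*", "", geo_text)
    for stmt in (s.strip() for s in text.split(";")):
        if m := ASSIGN.match(stmt):
            variables[m.group(1)] = float(m.group(2))
        elif m := POINT.match(stmt):
            points.append([resolve(t, variables) for t in m.group(1).split(",")[:3]])
    return [max(p[i] for p in points) - min(p[i] for p in points) for i in range(3)]


def within_tolerance(dims, spec, tol=0.10):
    """Compare sorted extents with sorted specification, each within +/- tol."""
    return all(abs(d - s) <= tol * s for d, s in zip(sorted(dims), sorted(spec)))


def classify_cylinder(radius_cm):
    """Classify a cylinder by radius using the ranges stated in the text."""
    if 3.0 <= radius_cm <= 7.0:
        return "wheel"
    if 0.5 <= radius_cm <= 2.0:
        return "axle"
    return "unknown"


sample = "L = 10; W = 1;\nPoint(1) = {0, 0, 0};\nPoint(2) = {W, 0, 0};\nPoint(3) = {W, W, 0};\nPoint(4) = {0, W, L};\n"
dims = bounding_box(sample)
print(dims, within_tolerance(dims, [10, 1, 1]))
```

With this rule, the 10 × 10 × 0.5 cm geometry reported for one model in the results would fail the check against the 10 × 1 × 1 cm specification.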

2.4. Simulation File Evaluation

2.4.1. Test Specifications

LLMs generated Elmer Solver Input Format (SIF) files for the geometries described in Section 2.2. The test conditions were as follows:

Square bar: Fixed end, 100 MN point load;
Wheel and axle: Fixed wheel face, 5 GN point load;
Material: Steel (E = 210 GPa, ν = 0.3, ρ = 7850 kg/m³).

2.4.2. Evaluation Metrics

We assessed SIF files using a weighted scoring system based on finite element analysis requirements [27]:

Structural completeness (25%): Presence of five mandatory sections (Header, Simulation, Material, Boundary Condition, and Solver).
Material properties (35%): Correct definition of E, ν, and ρ with appropriate units.
Boundary conditions (30%): Valid constraints and loads defining a well-posed problem.
Solver configuration (10%): Appropriate equation type and settings.

These weights reflect each component’s impact on simulation validity per ASME V&V 10-2019 guidelines [28]. Material properties receive the highest weight as errors directly scale with solution errors [29].
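The weighted combination can be written out as a short calculation. In this sketch the component scores are fractions between 0 and 1, and the example values are hypothetical (a file that is complete except for a partial solver score, chosen purely for illustration); only the four weights come from the text.

```python
# Weights from the evaluation metrics above; they sum to 1.
SIF_WEIGHTS = {
    "structure": 0.25,   # five mandatory sections present
    "material": 0.35,    # E, nu, rho correct
    "boundary": 0.30,    # constraints and loads
    "solver": 0.10,      # equation type and settings
}


def weighted_score(component_scores, weights=SIF_WEIGHTS):
    """Combine per-component fractions (0..1) into a percentage score."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return 100.0 * sum(weights[name] * component_scores[name] for name in weights)


# Hypothetical component scores for illustration only.
example = {"structure": 1.0, "material": 1.0, "boundary": 1.0, "solver": 0.7}
score = weighted_score(example)
print(round(score, 1))
```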

2.4.3. Validation Criteria

Files were categorised as follows:

Excellent (≥90%): Production-ready;
Good (70–89%): Minor corrections needed;
Fair (50–69%): Significant intervention required;
Poor (<50%): Fundamental errors.

For executed simulations, we compared displacement fields against reference solutions, with <1% maximum nodal error considered ‘Excellent’ per NAFEMS benchmarking standards [30].
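The category thresholds and the displacement check can be sketched as two small functions. The error definition used here (largest nodal deviation divided by the largest reference displacement magnitude, with nodes assumed to be in the same order in both fields) is an assumption for illustration; the text does not spell out the exact normalisation or the node-matching procedure.

```python
import math


def quality_category(score):
    """Map a percentage score to the four categories listed above."""
    if score >= 90:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Fair"
    return "Poor"


def max_nodal_error_percent(computed, reference):
    """Largest nodal deviation as a percentage of the largest reference displacement.

    Both arguments are sequences of (ux, uy, uz) tuples for matching nodes.
    """
    worst = max(math.dist(u, r) for u, r in zip(computed, reference))
    scale = max(math.hypot(*r) for r in reference)
    return 100.0 * worst / scale


# Hypothetical two-node displacement fields, for illustration only.
reference = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
computed = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.005)]
error = max_nodal_error_percent(computed, reference)
print(quality_category(85), round(error, 3))
```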

2.4.4. Simulation Implementation

Evaluation employed regex pattern matching for section detection and property extraction. Material properties were validated against expected values with engineering tolerances (±20% for E, ±10% for ν and ρ) to accommodate unit variations.
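A regex-based check of this kind could be sketched as follows. The patterns, helper names, and the sample SIF text are assumptions written for illustration (the sample deliberately omits the Solver section and the density); the expected values and tolerances are those stated above, expressed in SI units.

```python
import re

SECTIONS = ["Header", "Simulation", "Material", "Boundary Condition", "Solver"]

# keyword -> (expected SI value, relative tolerance)
EXPECTED = {
    "Youngs modulus": (210e9, 0.20),
    "Poisson ratio": (0.3, 0.10),
    "Density": (7850.0, 0.10),
}


def present_sections(sif_text):
    """Return the mandatory section names that open a block in the file."""
    found = []
    for name in SECTIONS:
        pattern = rf"^\s*{re.escape(name)}(?:\s+\d+)?\s*$"
        if re.search(pattern, sif_text, re.IGNORECASE | re.MULTILINE):
            found.append(name)
    return found


def material_value(sif_text, keyword):
    """Extract a numeric property such as 'Youngs modulus = 210e9', or None."""
    pattern = rf"{re.escape(keyword)}\s*=\s*(?:Real\s+)?([-+0-9.eE]+)"
    match = re.search(pattern, sif_text, re.IGNORECASE)
    return float(match.group(1)) if match else None


def material_checks(sif_text):
    """Check each property against its expected value and tolerance."""
    results = {}
    for keyword, (expected, tol) in EXPECTED.items():
        value = material_value(sif_text, keyword)
        results[keyword] = value is not None and abs(value - expected) <= tol * expected
    return results


sample_sif = """Header
  Mesh DB "." "."
End

Simulation
  Simulation Type = Steady state
End

Material 1
  Youngs modulus = 210e9
  Poisson ratio = 0.3
End

Boundary Condition 1
  Target Boundaries(1) = 1
  Displacement 1 = 0
End
"""
sections = present_sections(sample_sif)
checks = material_checks(sample_sif)
print(sections, checks)
```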

3. Results and Discussion

3.1. Geometry Generation Results

Table 1 presents the square bar evaluation with columns indicating Structure (percentage of required geometric elements present), Dimensions (✓ = correct within 10% tolerance, ✗ = incorrect), Quality (overall category based on combined scores, with * indicating special conditions), Score (percentage of maximum possible points), and Key Issues (primary deficiency if any). Table 2 adds Boolean Ops (✓ = correctly implemented, ✗ = not attempted, ⚠ = attempted but failed) for assembly evaluation. In Table 2, Quality ratings with asterisks (e.g., Good * and Fair *) indicate geometries that are structurally sound but lack the Boolean operations necessary for creating a unified meshable volume, thus requiring manual intervention despite otherwise good scores.

3.1.1. Simple Geometry Performance

For the square bar task, four of nine models achieved perfect scores, correctly generating all structural elements with accurate dimensions. GPT-4o, GPT-4, LLaMA-3-70B, and Mixtral 8X22B (Figure 6) tied for first place with 100% scores, producing complete geometric specifications including eight vertices, twelve edges, six surfaces, and one volume with the specified 10 cm × 1 cm × 1 cm dimensions. GPT-3.5 ranked next with a strong 85% score, demonstrating robust performance with only minor deficiencies. Mid-tier performance was observed in Mixtral 8X7B (60%) and LLaMA-3-8B (40%), both exhibiting structural completeness but dimensional errors, with Mixtral 8X7B generating a 10 × 10 × 0.5 cm geometry (Figure 7) and LLaMA-3-8B producing a 1 × 1 × 1 cm cube. PHI-3 Mini (26%) and LLaMA-2-70B (23%) ranked last, showing the most severe deficiencies; PHI-3 Mini generated only three points with an incorrect length specification (L = 2). The average score of 70% indicates strong capability for simple geometry generation among modern LLMs, though a clear performance hierarchy emerged, with proprietary models and select open-source variants (LLaMA-3-70B and Mixtral 8X22B) dominating the top ranks.

3.1.2. Assembly Generation Challenges

Performance decreased markedly for the wheel and axle assembly, with the average score dropping to 56%. GPT-4o achieved the highest score (80%) and uniquely attempted Boolean operations, though it implemented them incorrectly by reusing Volume {1} after deletion, causing CAD kernel errors. Mixtral 8X7B ranked next with 70%, successfully defining all required components (two wheels with 5 cm radius and one axle with 1 cm radius) with proper structure. LLaMA-3-70B, GPT-4, and GPT-3.5 tied for third with 60%, generating complete component definitions but omitting Boolean operations. PHI-3 Mini and Mixtral 8X22B both scored 50%, showing moderate component completion (60%) without Boolean operations, while LLaMA-3-8B and LLaMA-2-70B achieved only 35%, with minimal component generation (30%). Critical failures emerged in the Boolean operations necessary for component merging: eight of nine models did not attempt any Boolean operations, leaving components as separate volumes unsuitable for unified boundary condition application (Figure 8 and Figure 9). This 0% success rate for functional Boolean operations represents the primary bottleneck in assembly generation. Notably, assembly task rankings differed substantially from simple geometry rankings: Mixtral 8X22B, a top performer on the square bar, dropped to 50%, indicating that task complexity introduces different performance requirements.
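The volume reuse error described above can be detected with a simple textual heuristic, sketched here. It scans Gmsh Boolean statements in order, records volumes consumed with `Delete`, and flags any later operation that references one of them. The sketch assumes one `Volume{...}` list per operand, does not model how Gmsh assigns tags to the results of a Boolean operation, and is not the validation code used in the study; the two sample scripts are hypothetical.

```python
import re

OPERAND = r"\{([^{}]*\{[^{}]*\}[^{}]*)\}"
BOOLEAN = re.compile(
    r"Boolean(?:Union|Difference|Intersection|Fragments)\s*"
    r"(?:\(\s*\d+\s*\)\s*=\s*)?" + OPERAND + r"\s*" + OPERAND
)
VOLUME = re.compile(r"Volume\s*\{([^{}]*)\}")


def volume_tags(operand):
    """Collect the integer volume tags listed inside one Boolean operand."""
    return {int(t) for body in VOLUME.findall(operand)
            for t in body.split(",") if t.strip().isdigit()}


def volume_reuse_errors(geo_text):
    """Return (operation index, reused tags) for operations using deleted volumes."""
    deleted, errors = set(), []
    for index, op in enumerate(BOOLEAN.finditer(geo_text), start=1):
        used, removed = set(), set()
        for operand in op.groups():
            tags = volume_tags(operand)
            used |= tags
            if re.search(r"\bDelete\b", operand):
                removed |= tags
        reused = sorted(used & deleted)
        if reused:
            errors.append((index, reused))
        deleted |= removed  # volumes consumed by this operation
    return errors


faulty = """
BooleanUnion{ Volume{1}; Delete; }{ Volume{2}; Delete; }
BooleanUnion{ Volume{1}; Delete; }{ Volume{3}; Delete; }
"""
explicit = """
BooleanUnion(10) = { Volume{1}; Delete; }{ Volume{2}; Delete; };
BooleanUnion(11) = { Volume{10}; Delete; }{ Volume{3}; Delete; };
"""
faulty_errors = volume_reuse_errors(faulty)
explicit_errors = volume_reuse_errors(explicit)
print(faulty_errors, explicit_errors)
```

Assigning an explicit tag to each result, as in the second sample, is one way to avoid referring to a volume that an earlier operation has already consumed.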

3.1.3. Engineering Implications

The results reveal a critical capability gap between generating individual CAD primitives (70% average success) and performing Boolean operations essential for practical assemblies (0% success). This limitation necessitates hybrid workflows combining LLM-generated components with manual Boolean operations. For immediate deployment, LLMs should focus on single-part geometry where they demonstrate competence. The stark performance difference between simple and assembly tasks indicates that current LLMs lack a deep understanding of CAD topology and construction sequences, with significant implications for automation strategies in engineering design.

3.2. Simulation File Generation Results

Table 3 and Table 4 present simulation evaluations with File Quality (category based on weighted component scores), Status (Ready = executable without changes, Ready * = minor fixes beneficial, Not ready = requires intervention), Score (percentage of maximum 100 points), and Accuracy (displacement error for executed simulations: Excellent < 1%, or ‘Did not run’).

3.2.1. Overall Performance

For the square bar geometry, seven of nine models achieved perfect scores, with Mixtral 8X22B, Mixtral 8X7B, LLaMA-3-70B, LLaMA-3-8B, GPT-4o, GPT-4, and GPT-3.5 all tied at 100%, generating production-ready SIF files requiring no manual intervention. LLaMA-2-70B ranked next with 5%, while PHI-3 Mini failed completely at 0%. All successfully executed simulations achieved excellent accuracy with less than 1% maximum nodal error, validating the engineering correctness of LLM-generated files. Performance remained strong for the wheel and axle assembly, with seven of nine models achieving scores of 97% or higher, though rankings shifted compared to the square bar task. Six models achieved 100% (Mixtral 8X22B, Mixtral 8X7B, LLaMA-3-70B, GPT-4o, GPT-4, and GPT-3.5), followed by PHI-3 Mini at 97%, a dramatic improvement from its square bar failure (0%). LLaMA-3-8B ranked third at 83% and LLaMA-2-70B fourth at 10%. The average score did not fall with complexity (it rose from 78% to 88%), in sharp contrast to the significant degradation observed in geometry generation tasks, indicating that simulation file generation is less sensitive to problem complexity.

3.2.2. Model Size and Consistency

A critical finding emerged regarding model reliability as a function of size. PHI-3 Mini, the smallest model evaluated (3.8B parameters versus 7B-70B+ for others), exhibited extreme performance variability between tasks. Despite receiving similar prompts without any intervening learning opportunity, PHI-3 Mini failed completely on the square bar task (0%, missing Header section) yet achieved near-perfect performance on the wheel and axle (97%, excellent file quality with only solver equation type unspecified). This erratic behaviour contrasts starkly with larger models, which demonstrated consistent performance patterns across both tasks—either succeeding or failing in predictable ways. LLaMA-2-70B consistently underperformed (5% and 10%), while the Mixtral variants, GPT models, and LLaMA-3-70B consistently excelled (100% on both tasks).

3.2.3. Failure Analysis

Three distinct failure patterns emerged across the evaluation. Missing sections represented the most severe failures, with PHI-3 Mini omitting the Header section (square bar) and LLaMA-2-70B missing Simulation and Material sections across tasks. Incomplete property definitions manifested in LLaMA-3-8B’s omission of density, which, while theoretically non-critical for static analysis, prevented solver execution due to Elmer’s requirements. Solver specification errors appeared in PHI-3 Mini’s otherwise excellent wheel and axle file. Notably, no model exhibited unit confusion for material properties when explicitly prompted to use SI units, successfully avoiding a common source of engineering errors.

3.2.4. Capability Decoupling

Comparing geometry and simulation results reveals the independence of these capabilities. Average geometry scores decreased from 70% (square bar) to 56% (assembly), while average simulation scores showed no degradation and in fact increased from 78% to 88% (largely reflecting PHI-3 Mini’s anomalous improvement). Models achieving perfect simulation scores despite moderate geometry performance, such as the Mixtral variants, confirm that simulation file generation and geometry creation represent distinct competencies. This decoupling suggests that LLMs can be effectively deployed for simulation setup even when geometry generation requires human intervention.
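The four averages quoted here can be reproduced from the per-model scores reported in Sections 3.1 and 3.2. The short calculation below simply transcribes those scores into dictionaries (the variable names are illustrative) and rounds the means to whole percentages.

```python
from statistics import mean

# Per-model scores (%) as reported in the text.
geometry_bar = {
    "GPT-4o": 100, "GPT-4": 100, "LLaMA-3-70B": 100, "Mixtral 8X22B": 100,
    "GPT-3.5": 85, "Mixtral 8X7B": 60, "LLaMA-3-8B": 40,
    "PHI-3 Mini": 26, "LLaMA-2-70B": 23,
}
geometry_assembly = {
    "GPT-4o": 80, "Mixtral 8X7B": 70, "LLaMA-3-70B": 60, "GPT-4": 60,
    "GPT-3.5": 60, "PHI-3 Mini": 50, "Mixtral 8X22B": 50,
    "LLaMA-3-8B": 35, "LLaMA-2-70B": 35,
}
simulation_bar = {
    "Mixtral 8X22B": 100, "Mixtral 8X7B": 100, "LLaMA-3-70B": 100,
    "LLaMA-3-8B": 100, "GPT-4o": 100, "GPT-4": 100, "GPT-3.5": 100,
    "LLaMA-2-70B": 5, "PHI-3 Mini": 0,
}
simulation_assembly = {
    "Mixtral 8X22B": 100, "Mixtral 8X7B": 100, "LLaMA-3-70B": 100,
    "GPT-4o": 100, "GPT-4": 100, "GPT-3.5": 100,
    "PHI-3 Mini": 97, "LLaMA-3-8B": 83, "LLaMA-2-70B": 10,
}

averages = {
    "geometry, square bar": round(mean(geometry_bar.values())),
    "geometry, assembly": round(mean(geometry_assembly.values())),
    "simulation, square bar": round(mean(simulation_bar.values())),
    "simulation, assembly": round(mean(simulation_assembly.values())),
}
for task, value in averages.items():
    print(f"{task}: {value}%")
```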

3.2.5. Practical Implications

The results demonstrate that simulation input generation represents a mature application for LLMs, with most models achieving high success rates. The 78–88% average success rate indicates strong automation potential for standard structural problems. However, the high variability observed in PHI-3 Mini’s performance highlights the importance of consistency validation, rather than relying solely on average performance metrics.

Key recommendations include (1) implement robust validation protocols regardless of model choice, as even high-performing models may exhibit task-specific failures; (2) consider performance consistency alongside accuracy when selecting models for production use; (3) leverage the geometry-simulation decoupling by using LLMs for simulation setup even when geometry requires manual creation; and (4) test models thoroughly across representative tasks before deployment, as performance on one task may not predict performance on similar tasks, particularly for models that show high variability.

These findings indicate that, while most LLMs can effectively automate simulation setup, deployment strategies should emphasise validation and consistency testing. The observed variability in some models suggests that newer architectures and training approaches may yield different performance characteristics than those observed in this study, warranting continued evaluation as model development progresses.

Our evaluation employed zero-shot prompting to establish baseline capabilities. Future work should investigate improvement pathways, including (1) few-shot learning with example .geo/.sif files to guide generation patterns, (2) chain-of-thought prompting to explicitly decompose spatial and physics tasks, (3) fine-tuning on domain-specific CAD and simulation corpora, (4) transfer learning from visual-language models to provide stronger geometric priors, and (5) retrieval-augmented generation with software documentation. These techniques may particularly address the Boolean operation failures and spatial reasoning limitations identified in our results.

4. Conclusions

The results demonstrate a clear capability asymmetry between geometry and simulation tasks. For simple geometries, 4 out of 9 models achieved perfect scores, with average performance reaching 70%. However, performance degraded significantly for multi-component assemblies (average 56%), with the critical finding that no model successfully implemented functional Boolean operations. GPT-4o uniquely attempted Boolean unions but failed due to volume reference errors, while all other models omitted these essential operations entirely. This 0% success rate for Boolean operations represents the primary bottleneck preventing fully automated CAD generation for assemblies.

In contrast, simulation file generation proved remarkably tractable, with 7 out of 9 models generating perfect SIF files for the simple geometry and 6 out of 9 doing so for the more complex assembly. All successfully executed simulations achieved excellent accuracy (<1% error), validating that LLMs effectively capture physics specifications when file syntax is correct. The absence of performance degradation with increased complexity (the average rose from 78% to 88%) indicates that simulation file generation is largely insensitive to geometric complexity.

Our analysis reveals that geometry generation and simulation setup represent independent capabilities. Models such as the Mixtral variants achieved perfect simulation scores despite moderate geometry performance, confirming this decoupling. This finding has immediate practical implications: organisations can deploy LLMs for simulation automation even when geometry creation requires human intervention, enabling hybrid workflows that leverage each technology’s strengths.

Model performance stratified into clear tiers, with GPT-4, GPT-4o, Mixtral-8x22B, and LLaMA-3-70B consistently excelling across all tasks. An unexpected finding emerged with PHI-3 Mini, which exhibited extreme performance variability (0% to 97%) between similar tasks despite similar prompting. This inconsistency, contrasting with the predictable performance of larger models, highlights the importance of validation protocols and consistency testing over average performance metrics.

Author Contributions

Conceptualization, O.S., A.R., A.A. and B.G.; methodology, O.S., A.R., A.A. and B.G.; software, O.S.; validation, O.S.; formal analysis, O.S.; investigation, O.S.; resources, A.A. and B.G.; data curation, O.S.; writing—original draft preparation, O.S.; writing—review and editing, A.A. and B.G.; visualisation, O.S.; supervision, A.A. and B.G.; project administration, O.S.; funding acquisition, A.A. and B.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data that support the results of this study are available at GitHub.

Acknowledgments

During the preparation of this work, the author(s) used generative AI in order to refine the language and enhance clarity. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Alexiadis, A.; Ghiassi, B. From text to tech: Shaping the future of physics-based simulations with AI-driven generative models. Results Eng. 2024, 21, 101721.
Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; de Oliveira Pinto, H.P.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating Large Language Models Trained on Code. arXiv 2021, arXiv:2107.03374.
Marcus, G. The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence. arXiv 2020, arXiv:2002.06177.
Bubeck, S.; Chandrasekaran, V.; Eldan, R.; Gehrke, J.; Horvitz, E.; Kamar, E.; Lee, P.; Lee, Y.T.; Li, Y.; Lundberg, S.; et al. Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv 2023, arXiv:2303.12712.
Yamada, Y.; Bao, Y.; Lampinen, A.K.; Kasai, J.; Yildirim, I. Evaluating Spatial Understanding of Large Language Models. arXiv 2023, arXiv:2310.14540.
Xie, K.; Zhang, L.; Li, X.; Gu, P.; Chen, Z. SES-X: A MBSE Methodology Based on SES/MB and X Language. Information 2022, 14, 23.
Abdin, M.; Aneja, J.; Awadalla, H.; Awadallah, A.; Awan, A.A.; Bach, N.; Bahree, A.; Bakhtiari, A.; Bao, J.; Behl, H.; et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv 2024, arXiv:2404.14219.
ISO 2768-1:1989; General Tolerances for Linear and Angular Dimensions. International Organization for Standardization: Geneva, Switzerland, 1989.
Budynas, R.; Nisbett, K. Shigley’s Mechanical Engineering Design in SI Units, 10th ed.; McGraw-Hill: Columbus, OH, USA, 2014.
Verduzco, J.C.; Holbrook, E.; Strachan, A. GPT-4 as an interface between researchers and computational software: Improving usability and reproducibility. arXiv 2023, arXiv:2310.11458.
Kumar, V.; Gleyzer, L.; Kahana, A.; Shukla, K.; Karniadakis, G.E. MyCrunchGPT: A ChatGPT assisted framework for scientific machine learning. arXiv 2023, arXiv:2306.15551.
Li, W.; Zhang, X.; Guo, Z.; Mao, S.; Luo, W.; Peng, G.; Huang, Y.; Wang, H.; Li, S. FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation. arXiv 2025, arXiv:2503.06680.
Du, Y.; Chen, S.; Zan, W.; Li, P.; Wang, M.; Song, D.; Li, B.; Hu, Y.; Wang, B. BlenderLLM: Training Large Language Models for Computer-Aided Design with Self-improvement. arXiv 2024, arXiv:2412.14203.
Geuzaine, C.; Remacle, J. Gmsh: A 3-D finite element mesh generator with built-in pre- and post-processing facilities. Int. J. Numer. Methods Eng. 2009, 79, 1309–1331.
CSC—IT Center for Science. Elmer FEM Solver. [Online]. Available online: https://www.csc.fi/web/elmer (accessed on 29 May 2025).
Reynolds, L. LangChain: Open-Source Library for Building LLM Applications. Available online: https://github.com/langchain-ai/langchain (accessed on 14 June 2025).
Jiang, A.Q.; Sablayrolles, A.; Roux, A.; Mensch, A.; Savary, B.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Hanna, E.B.; Bressand, F.; et al. Mixtral of Experts. arXiv 2024, arXiv:2401.04088.
Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288.
Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783.
Peng, A.; Wu, M.; Allard, J.; Kilpatrick, L.; Heidel, S. GPT-3.5 Turbo Fine-Tuning and API Updates. OpenAI 2023. Available online: https://openai.com/index/gpt-3-5-turbo-fine-tuning-and-api-updates/ (accessed on 14 June 2025).
OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774.
OpenAI; Hurst, A.; Lerer, A.; Goucher, A.P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.J.; Welihinda, A.; Hayes, A.; et al. GPT-4o System Card. arXiv 2024, arXiv:2410.21276.
Formlabs Blog. Understanding Accuracy, Precision & Tolerance in 3D Printing. 2025. Available online: https://formlabs.com/global/3d-printers/?srsltid=AfmBOoqj5eOY38gafH7hZcmuIaYwrnFsMPytmpGMlqTlMpgKCB18xAy7 (accessed on 14 June 2025).
González-Lluch, C.; Company, P.; Contero, M.; Camba, J.D.; Plumed, R. A survey on 3D CAD model quality assurance and testing tools. Comput. Aided Des. 2017, 83, 64–79.
Mantyla, M. An Introduction to Solid Modeling; Computer Science Press: New York, NY, USA, 1988.
Y14.41; Digital Product Definition Data Practices. ASME: Houston, TX, USA, 2019.
V&V 10-2019; Standard for Verification and Validation in Computational Solid Mechanics. ASME: Houston, TX, USA, 2019.
Cook, R.D. Concepts and Applications of Finite Element Analysis; John Wiley & Sons: New York, NY, USA, 2001.
Zienkiewicz, O.; Taylor, R. The Finite Element Method for Solid and Structural Mechanics; Butterworth-Heinemann: Oxford, UK, 2013.
Roache, P. Verification and Validation in Computational Science and Engineering; Hermosa Publishers: Albuquerque, NM, USA, 1998.

Figure 1. Geometry and simulation file generation overall protocol.

Figure 2. System prompt to guide the models in generating the required outputs.

Figure 3. A sequence of user prompts utilised in the chat.

Figure 4. The expected and correct visual representation of the square bar: (a) geometry and (b) mesh prior to running any simulation, and (c) the square bar after the simulation run.

Figure 5. The expected and correct visual representation of the wheel and axle: (a) geometry and (b) mesh prior to running any simulation, and (c) the wheel and axle after the simulation run.

Figure 6. The output for the square bar geometry from MIXTRAL 8X22B was consistent with the expected output.

Figure 7. The (a) original and (b) updated output for the square bar geometry from MIXTRAL 8X7B.

Figure 8. The (a) original and (b) updated output for the wheel and axle geometry from LLAMA-3-70B.

Figure 9. The (a) original and (b) updated output for the wheel and axle geometry from MIXTRAL 8X7B.

Table 1. Square bar geometry evaluation for each LLM.

LLM	Structure	Dimensions	Quality	Score
PHI-3 Mini	65%	✗	Poor	26%
Mixtral 8X22B	100%	✓	Excellent	100%
Mixtral 8X7B	100%	✗	Fair	60%
LLaMA-3-70B	100%	✓	Excellent	100%
LLaMA-3-8B	100%	✗	Poor	40%
LLaMA-2-70B	57%	✗	Poor	23%
GPT-4o	100%	✓	Excellent	100%
GPT-4	100%	✓	Excellent	100%
GPT-3.5	62%	✓	Good	85%

Table 2. Wheel and axle assembly evaluation for each LLM. * Asterisk indicates geometries that are structurally sound but lack the Boolean operations needed for a unified mesh. ⚠ indicates that Boolean operations were attempted but failed.

LLM	Components	Boolean Ops	Quality	Score
PHI-3 Mini	60%	✗	Fair	50%
Mixtral 8X22B	60%	✗	Fair	50%
Mixtral 8X7B	100%	✗	Good *	70%
LLaMA-3-70B	100%	✗	Good *	60%
LLaMA-3-8B	30%	✗	Poor	35%
LLaMA-2-70B	30%	✗	Poor	35%
GPT-4o	100%	⚠	Fair *	80%
GPT-4	100%	✗	Good *	60%
GPT-3.5	100%	✗	Good *	60%

Table 3. Square bar simulation file evaluation for each LLM.

LLM	File Quality	Status	Score	Accuracy
PHI-3 Mini	Poor	Not ready	0%	Did not run
Mixtral 8X22B	Excellent	Ready	100%	Excellent
Mixtral 8X7B	Excellent	Ready	100%	Excellent
LLaMA-3-70B	Excellent	Ready	100%	Excellent
LLaMA-3-8B	Excellent	Ready	100%	Excellent
LLaMA-2-70B	Poor	Not ready	5%	Did not run
GPT-4o	Excellent	Ready	100%	Excellent
GPT-4	Excellent	Ready	100%	Excellent
GPT-3.5	Excellent	Ready	100%	Excellent

Table 4. Wheel and axle simulation file evaluation for each LLM. * Asterisk indicates a file that is ready to run but would benefit from minor fixes.

LLM	File Quality	Status	Score	Accuracy
PHI-3 Mini	Excellent	Ready	97%	Did not run
Mixtral 8X22B	Excellent	Ready	100%	Excellent
Mixtral 8X7B	Excellent	Ready	100%	Excellent
LLaMA-3-70B	Excellent	Ready	100%	Excellent
LLaMA-3-8B	Good	Ready *	83%	Did not run
LLaMA-2-70B	Poor	Not ready	10%	Did not run
GPT-4o	Excellent	Ready	100%	Excellent
GPT-4	Excellent	Ready	100%	Excellent
GPT-3.5	Excellent	Ready	100%	Excellent

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Shafiq, O.; Rahmat, A.; Alexiadis, A.; Ghiassi, B. Evaluating the Performance of Large Language Models for Geometry and Simulation File Generation in Physics-Based Simulations. Appl. Sci. 2025, 15, 12114. https://doi.org/10.3390/app152212114

