Evaluating the Performance of Large Language Models for Geometry and Simulation File Generation in Physics-Based Simulations
Abstract
1. Introduction
2. Methodology
2.1. Workflow
2.1.1. Development of Prompt Templates
2.1.2. Implementation of Functions
- extract_and_save_geo_file: This function extracts the content of a .geo file from the response text generated by the LLM and saves it to a file. The LLM, accessed via the LangChain library, generates text that describes the geometry. This text is then parsed by the function to identify relevant sections and save them in a structured .geo file format compatible with Gmsh (Figure 4a and Figure 5a).
- generate_mesh: Utilising the Gmsh Python API, this function creates a 3D mesh from the .geo file. The function initialises Gmsh, sets mesh size options (using default settings for consistency), generates the mesh, and writes it to an output .msh file. The mesh file is an intermediate product between geometry creation and simulation setup: it discretises the model into elements that ELMER can process (Figure 4b and Figure 5b).
- generate_ELMER_mesh: This function calls ElmerGrid, a command-line utility distributed with ELMER, to convert the .msh file generated by Gmsh into ELMER's own mesh format. ELMER requires this format to perform simulations, so the function automates the conversion.
- extract_and_save_sif_file: Similar to the extract_and_save_geo_file function, this function extracts the content of a .sif file from the LLM’s response and saves it to a file. The .sif file contains simulation parameters such as material properties, boundary conditions, and solver settings. This function ensures that the text generated by the LLM is structured correctly for ELMER to execute the simulation (Figure 4c and Figure 5c).
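The four functions above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the LLM returns the file inside a fenced block, and the helper names `extract_file_content` and `elmergrid_command` are hypothetical. The Gmsh calls use default mesh settings, and the `-autoclean` flag passed to ElmerGrid is an optional choice made here, not something stated in the text.

```python
import re
import subprocess
from pathlib import Path

# Assumption: the LLM wraps the .geo/.sif content in a fenced block.
TICKS = "`" * 3
FENCE = re.compile(TICKS + r"[^\n]*\n(.*?)" + TICKS, re.DOTALL)


def extract_file_content(response_text: str) -> str:
    """Return the first fenced block of the response, or the whole text if none."""
    match = FENCE.search(response_text)
    body = match.group(1) if match else response_text
    return body.strip() + "\n"


def extract_and_save(response_text: str, out_path: str) -> None:
    """Write the extracted content to a .geo or .sif file."""
    Path(out_path).write_text(extract_file_content(response_text), encoding="utf-8")


def generate_mesh(geo_path: str, msh_path: str) -> None:
    """Create a 3D mesh from a .geo file with the Gmsh Python API."""
    import gmsh  # imported lazily so the other helpers work without Gmsh

    gmsh.initialize()
    try:
        gmsh.open(geo_path)          # load the geometry script
        gmsh.model.mesh.generate(3)  # 3 = volume mesh
        gmsh.write(msh_path)         # format is chosen from the .msh extension
    finally:
        gmsh.finalize()


def elmergrid_command(msh_path: str) -> list:
    """Build the ElmerGrid call: 14 = Gmsh input, 2 = ElmerSolver output."""
    return ["ElmerGrid", "14", "2", msh_path, "-autoclean"]


def generate_elmer_mesh(msh_path: str) -> None:
    """Convert the Gmsh mesh to ELMER's format; requires ElmerGrid on PATH."""
    subprocess.run(elmergrid_command(msh_path), check=True)
```

Keeping the extraction logic separate from file writing makes it possible to check the parsing step without Gmsh or ELMER installed.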
2.2. Test Cases and Boundary Conditions
2.3. Geometry File Evaluation
- Structural Completeness (40%): Presence of required geometric primitives. Square bar: 8 points, 12 lines, 6 faces, 1 volume. Wheel and axle assembly: ≥3 cylinders, ≥2 volumes, ≥1 physical volume.
- Dimensional Accuracy (40% simple geometry, 25% assemblies)
- Boolean Operations (15%, assemblies only): Detection and validation of union/difference operations required for component merging, including syntax verification and volume reference consistency.
Weights reflect engineering priorities: completeness and accuracy are fundamental requirements [24], while Boolean operations enable manufacturability for assemblies [25].
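As an illustration of the structural completeness criterion for the square bar, the following sketch counts primitive definitions in a .geo file. It assumes one statement per line in Gmsh's built-in kernel syntax and does not see entities created indirectly (for example by Extrude). The completeness fraction it returns is a hypothetical formula chosen for this example; the exact formula behind the reported percentages is not given here.

```python
import re

# Required primitive counts for the square bar, as listed above.
SQUARE_BAR_REQUIRED = {"Point": 8, "Line": 12, "Surface": 6, "Volume": 1}

# Match definitions such as "Point(1) = {...};" at the start of a line.
# "Line Loop", "Surface Loop" and "Physical ..." statements are not counted.
_PATTERNS = {
    "Point": re.compile(r"^[ \t]*Point[ \t]*\(", re.MULTILINE),
    "Line": re.compile(r"^[ \t]*Line[ \t]*\(", re.MULTILINE),
    "Surface": re.compile(r"^[ \t]*(?:Plane[ \t]+)?Surface[ \t]*\(", re.MULTILINE),
    "Volume": re.compile(r"^[ \t]*Volume[ \t]*\(", re.MULTILINE),
}


def count_primitives(geo_text: str) -> dict:
    """Count point, line, surface and volume definitions in .geo text."""
    return {kind: len(p.findall(geo_text)) for kind, p in _PATTERNS.items()}


def structural_completeness(geo_text: str, required: dict) -> float:
    """Fraction of required primitives that are present (assumed formula)."""
    counts = count_primitives(geo_text)
    met = sum(min(counts.get(kind, 0), n) for kind, n in required.items())
    return met / sum(required.values())
```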
2.3.1. Quality Categories
- Excellent (≥90%): Production-ready geometry.
- Good (70–89%): Minor corrections required.
- Fair (50–69%): Significant manual intervention needed.
- Poor (<50%): Fundamental reconstruction required.
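The four bands above map a percentage score to a category. A minimal sketch, assuming that non-integer scores fall into the band whose lower bound they reach (so 89.5% is treated as Good); the function name is hypothetical:

```python
def quality_category(score_percent: float) -> str:
    """Map a 0-100 score to the quality bands listed above."""
    if score_percent >= 90:
        return "Excellent"
    if score_percent >= 70:
        return "Good"
    if score_percent >= 50:
        return "Fair"
    return "Poor"
```

For example, scores of 26%, 60%, 85% and 100% fall into Poor, Fair, Good and Excellent, respectively.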
2.3.2. Geometry Implementation
2.4. Simulation File Evaluation
2.4.1. Test Specifications
- Square bar: Fixed end, 100 MN point load;
- Wheel and axle: Fixed wheel face, 5 GN point load;
- Material: Steel (E = 210 GPa, ν = 0.3, ρ = 7850 kg/m³).
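To show how the steel properties above would appear in a .sif file, the sketch below renders a Material section. It assumes SI units (Pa, kg/m³) and uses ELMER's `Youngs modulus`, `Poisson ratio` and `Density` keywords; the helper `material_section` is hypothetical and is not taken from the study's prompts or outputs.

```python
# Steel properties from the test specification, converted to SI units.
STEEL = {
    "Youngs modulus": 210e9,  # Pa (210 GPa)
    "Poisson ratio": 0.3,     # dimensionless
    "Density": 7850.0,        # kg/m^3
}


def material_section(name: str, props: dict, index: int = 1) -> str:
    """Render an ELMER-style Material block from a property dictionary."""
    lines = [f"Material {index}", f'  Name = "{name}"']
    lines += [f"  {key} = {value:g}" for key, value in props.items()]
    lines.append("End")
    return "\n".join(lines)


if __name__ == "__main__":
    print(material_section("Steel", STEEL))
```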
2.4.2. Evaluation Metrics
- Structural completeness (25%): Presence of five mandatory sections (Header, Simulation, Material, Boundary Condition, and Solver).
- Material properties (35%): Correct definition of E, ν, and ρ with appropriate units.
- Boundary conditions (30%): Valid constraints and loads defining a well-posed problem.
- Solver configuration (10%): Appropriate equation type and settings.
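The weighting above can be written as a simple weighted sum. In this sketch the structural part is computed from the share of mandatory sections found in the .sif text, while the material, boundary condition and solver sub-scores are passed in as fractions between 0 and 1. How those sub-scores were assessed in the study is not specified here, so they are treated as inputs; the function names are hypothetical.

```python
import re

# Weights and mandatory sections as listed above.
WEIGHTS = {"structure": 0.25, "material": 0.35, "boundary": 0.30, "solver": 0.10}
MANDATORY_SECTIONS = ("Header", "Simulation", "Material", "Boundary Condition", "Solver")


def present_sections(sif_text: str) -> set:
    """Return the mandatory sections whose header line appears in the .sif text."""
    found = set()
    for name in MANDATORY_SECTIONS:
        # A header is a line holding only the section name and an optional index,
        # e.g. "Material 1"; keyword lines like "Simulation Type = ..." do not match.
        pattern = rf"^[ \t]*{re.escape(name)}(?:[ \t]+\d+)?[ \t]*$"
        if re.search(pattern, sif_text, re.MULTILINE | re.IGNORECASE):
            found.add(name)
    return found


def sif_score(sif_text: str, material: float, boundary: float, solver: float) -> float:
    """Weighted score in percent; sub-scores are fractions in [0, 1]."""
    parts = {
        "structure": len(present_sections(sif_text)) / len(MANDATORY_SECTIONS),
        "material": material,
        "boundary": boundary,
        "solver": solver,
    }
    return 100 * sum(WEIGHTS[key] * parts[key] for key in WEIGHTS)
```

Under these assumptions, a file with four of the five sections, correct material properties and solver settings, but no valid boundary conditions would score 0.25 × 0.8 + 0.35 + 0.10 = 65%.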
2.4.3. Validation Criteria
- Excellent (≥90%): Production-ready;
- Good (70–89%): Minor corrections needed;
- Fair (50–69%): Significant intervention required;
- Poor (<50%): Fundamental errors.
2.4.4. Simulation Implementation
3. Results and Discussion
3.1. Geometry Generation Results
3.1.1. Simple Geometry Performance
3.1.2. Assembly Generation Challenges
3.1.3. Engineering Implications
3.2. Simulation File Generation Results
3.2.1. Overall Performance
3.2.2. Model Size and Consistency
3.2.3. Failure Analysis
3.2.4. Capability Decoupling
3.2.5. Practical Implications
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Alexiadis, A.; Ghiassi, B. From text to tech: Shaping the future of physics-based simulations with AI-driven generative models. Results Eng. 2024, 21, 101721.
- Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; de Oliveira Pinto, H.P.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating Large Language Models Trained on Code. arXiv 2021, arXiv:2107.03374.
- Marcus, G. The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence. arXiv 2020, arXiv:2002.06177.
- Bubeck, S.; Chandrasekaran, V.; Eldan, R.; Gehrke, J.; Horvitz, E.; Kamar, E.; Lee, P.; Lee, Y.T.; Li, Y.; Lundberg, S.; et al. Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv 2023, arXiv:2303.12712.
- Yamada, Y.; Bao, Y.; Lampinen, A.K.; Kasai, J.; Yildirim, I. Evaluating Spatial Understanding of Large Language Models. arXiv 2023, arXiv:2310.14540.
- Xie, K.; Zhang, L.; Li, X.; Gu, P.; Chen, Z. SES-X: A MBSE Methodology Based on SES/MB and X Language. Information 2022, 14, 23.
- Abdin, M.; Aneja, J.; Awadalla, H.; Awadallah, A.; Awan, A.A.; Bach, N.; Bahree, A.; Bakhtiari, A.; Bao, J.; Behl, H.; et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv 2024, arXiv:2404.14219.
- ISO 2768-1:1989; General Tolerances for Linear and Angular Dimensions. International Organization for Standardization: Geneva, Switzerland, 1989.
- Budynas, R.; Nisbett, K. Shigley’s Mechanical Engineering Design in SI Units, 10th ed.; McGraw-Hill: Columbus, OH, USA, 2014.
- Verduzco, J.C.; Holbrook, E.; Strachan, A. GPT-4 as an interface between researchers and computational software: Improving usability and reproducibility. arXiv 2023, arXiv:2310.11458.
- Kumar, V.; Gleyzer, L.; Kahana, A.; Shukla, K.; Karniadakis, G.E. MyCrunchGPT: A ChatGPT assisted framework for scientific machine learning. arXiv 2023, arXiv:2306.15551.
- Li, W.; Zhang, X.; Guo, Z.; Mao, S.; Luo, W.; Peng, G.; Huang, Y.; Wang, H.; Li, S. FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation. arXiv 2025, arXiv:2503.06680.
- Du, Y.; Chen, S.; Zan, W.; Li, P.; Wang, M.; Song, D.; Li, B.; Hu, Y.; Wang, B. BlenderLLM: Training Large Language Models for Computer-Aided Design with Self-improvement. arXiv 2024, arXiv:2412.14203.
- Geuzaine, C.; Remacle, J. Gmsh: A 3-D finite element mesh generator with built-in pre- and post-processing facilities. Int. J. Numer. Methods Eng. 2009, 79, 1309–1331.
- CSC—IT Center for Science. Elmer FEM Solver. Available online: https://www.csc.fi/web/elmer (accessed on 29 May 2025).
- Reynolds, L. LangChain: Open-Source Library for Building LLM Applications. Available online: https://github.com/langchain-ai/langchain (accessed on 14 June 2025).
- Jiang, A.Q.; Sablayrolles, A.; Roux, A.; Mensch, A.; Savary, B.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Hanna, E.B.; Bressand, F.; et al. Mixtral of Experts. arXiv 2024, arXiv:2401.04088.
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288.
- Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783.
- Peng, A.; Wu, M.; Allard, J.; Kilpatrick, L.; Heidel, S. GPT-3.5 Turbo Fine-Tuning and API Updates. OpenAI 2023. Available online: https://openai.com/index/gpt-3-5-turbo-fine-tuning-and-api-updates/ (accessed on 14 June 2025).
- OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774.
- OpenAI; Hurst, A.; Lerer, A.; Goucher, A.P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.J.; Welihinda, A.; Hayes, A.; et al. GPT-4o System Card. arXiv 2024, arXiv:2410.21276.
- Formlabs Blog. Understanding Accuracy, Precision & Tolerance in 3D Printing. 2025. Available online: https://formlabs.com/global/3d-printers/?srsltid=AfmBOoqj5eOY38gafH7hZcmuIaYwrnFsMPytmpGMlqTlMpgKCB18xAy7 (accessed on 14 June 2025).
- González-Lluch, C.; Company, P.; Contero, M.; Camba, J.D.; Plumed, R. A survey on 3D CAD model quality assurance and testing tools. Comput. Aided Des. 2017, 83, 64–79.
- Mantyla, M. An Introduction to Solid Modeling; Computer Science Press: New York, NY, USA, 1988.
- Y14.41; Digital Product Definition Data Practices. ASME: Houston, TX, USA, 2019.
- V&V 10-2019; Standard for Verification and Validation in Computational Solid Mechanics. ASME: Houston, TX, USA, 2019.
- Cook, R.D. Concepts and Applications of Finite Element Analysis; John Wiley & Sons: New York, NY, USA, 2001.
- Zienkiewicz, O.; Taylor, R. The Finite Element Method for Solid and Structural Mechanics; Butterworth-Heinemann: Oxford, UK, 2013.
- Roache, P. Verification and Validation in Computational Science and Engineering; Hermosa Publishers: Albuquerque, NM, USA, 1998.

| LLM | Structure | Dimensions | Quality | Score |
|---|---|---|---|---|
| PHI-3 Mini | 65% | ✗ | Poor | 26% |
| Mixtral 8X22B | 100% | ✓ | Excellent | 100% |
| Mixtral 8X7B | 100% | ✗ | Fair | 60% |
| LLaMA-3-70B | 100% | ✓ | Excellent | 100% |
| LLaMA-3-8B | 100% | ✗ | Poor | 40% |
| LLaMA-2-70B | 57% | ✗ | Poor | 23% |
| GPT-4o | 100% | ✓ | Excellent | 100% |
| GPT-4 | 100% | ✓ | Excellent | 100% |
| GPT-3.5 | 62% | ✓ | Good | 85% |

| LLM | Components | Boolean Ops | Quality | Score |
|---|---|---|---|---|
| PHI-3 Mini | 60% | ✗ | Fair | 50% |
| Mixtral 8X22B | 60% | ✗ | Fair | 50% |
| Mixtral 8X7B | 100% | ✗ | Good * | 70% |
| LLaMA-3-70B | 100% | ✗ | Good * | 60% |
| LLaMA-3-8B | 30% | ✗ | Poor | 35% |
| LLaMA-2-70B | 30% | ✗ | Poor | 35% |
| GPT-4o | 100% | ⚠ | Fair * | 80% |
| GPT-4 | 100% | ✗ | Good * | 60% |
| GPT-3.5 | 100% | ✗ | Good * | 60% |

| LLM | File Quality | Status | Score | Accuracy |
|---|---|---|---|---|
| PHI-3 Mini | Poor | Not ready | 0% | Did not run |
| Mixtral 8X22B | Excellent | Ready | 100% | Excellent |
| Mixtral 8X7B | Excellent | Ready | 100% | Excellent |
| LLaMA-3-70B | Excellent | Ready | 100% | Excellent |
| LLaMA-3-8B | Excellent | Ready | 100% | Excellent |
| LLaMA-2-70B | Poor | Not ready | 5% | Did not run |
| GPT-4o | Excellent | Ready | 100% | Excellent |
| GPT-4 | Excellent | Ready | 100% | Excellent |
| GPT-3.5 | Excellent | Ready | 100% | Excellent |

| LLM | File Quality | Status | Score | Accuracy |
|---|---|---|---|---|
| PHI-3 Mini | Excellent | Ready | 97% | Did not run |
| Mixtral 8X22B | Excellent | Ready | 100% | Excellent |
| Mixtral 8X7B | Excellent | Ready | 100% | Excellent |
| LLaMA-3-70B | Excellent | Ready | 100% | Excellent |
| LLaMA-3-8B | Good | Ready * | 83% | Did not run |
| LLaMA-2-70B | Poor | Not ready | 10% | Did not run |
| GPT-4o | Excellent | Ready | 100% | Excellent |
| GPT-4 | Excellent | Ready | 100% | Excellent |
| GPT-3.5 | Excellent | Ready | 100% | Excellent |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Shafiq, O.; Rahmat, A.; Alexiadis, A.; Ghiassi, B. Evaluating the Performance of Large Language Models for Geometry and Simulation File Generation in Physics-Based Simulations. Appl. Sci. 2025, 15, 12114. https://doi.org/10.3390/app152212114
