Fine-Tuning Large Language Models for the Efficient and Concurrent Extraction of Fuel Properties
Featured Application
Abstract
1. Introduction
- We introduce IgnitionGPT, a fine-tuned LLM optimized for ignition-property extraction from technical corpora, with demonstrated superiority over general-purpose LLMs.
- We provide a rigorously curated, domain-specific dataset of ignition-related properties for 581 compounds, encompassing alkanes, alcohols, ethers, furans, aromatics, and selected inorganic or polymeric fuels.
- We release all code and data openly under the MIT License (https://github.com/AI4CHEMIA/IgnitionGPT (accessed on 22 January 2026)), ensuring transparency and reproducibility.
2. Methods
2.1. Data Collection and Curation
2.1.1. Data Sources and Retrieval
2.1.2. Screening and Selection
2.1.3. Dataset Composition
2.1.4. Incremental Training Splits
2.2. Model Architecture and Fine-Tuning
2.2.1. Baseline Model Selection
2.2.2. Training Data Representation
2.2.3. Fine-Tuning Framework
3. Results and Analyses
3.1. Comparing Zero-Shot Learning with Fine-Tuning
3.1.1. Overall Results
3.1.2. Zero-Shot Detailed Evaluation
3.2. Analyzing Training and Testing Accuracy
3.2.1. Descriptive Analysis of Fine-Tuning Logs
3.2.2. Variations in Token Accuracy and Loss While Training
3.3. Overfitting and Loss Analysis
3.3.1. Observation of Training Progress
3.3.2. Observing Overfitting Possibilities
4. Discussion
5. Conclusions and Future Directions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AI | Artificial Intelligence |
| ASTM | American Society for Testing and Materials |
| CFD | Computational Fluid Dynamics |
| CN | Cetane Number |
| CP | Complex Polymer |
| DOI | Digital Object Identifier |
| IUPAC | International Union of Pure and Applied Chemistry |
| LLM | Large Language Model |
| MON | Motor Octane Number |
| NLP | Natural Language Processing |
| QSPR | Quantitative Structure-Property Relationship |
| RON | Research Octane Number |
References
- Kalghatgi, G.T. Fuel anti-knock quality-Part I. Engine studies. In SAE Transactions 1993–2004; Society of Automotive Engineers: Warrendale, PA, USA, 2001. [Google Scholar]
- Mehl, M.; Pitz, W.J.; Westbrook, C.K.; Curran, H.J. Kinetic modeling of gasoline surrogate components and mixtures under engine conditions. Proc. Combust. Inst. 2011, 33, 193–200. [Google Scholar] [CrossRef]
- Sarathy, S.M.; Oßwald, P.; Hansen, N.; Kohse-Höinghaus, K. Alcohol combustion chemistry. Prog. Energy Combust. Sci. 2014, 44, 40–102. [Google Scholar] [CrossRef]
- Pitsch, H. The transition to sustainable combustion: Hydrogen- and carbon-based future fuels and methods for dealing with their challenges. Proc. Combust. Inst. 2024, 40, 105638. [Google Scholar] [CrossRef]
- Wilk-Jakubowski, J.L.; Pawlik, L.; Frej, D.; Wilk-Jakubowski, G. Data-driven computational methods in fuel combustion: A review of applications. Appl. Sci. 2025, 15, 7204. [Google Scholar] [CrossRef]
- Al-Rabiah, A.A.; Alshehri, A.S.; Ibn Idriss, A.; Abdelaziz, O.Y. Comparative Kinetic Analysis and Process Optimization for the Production of Dimethyl Ether via Methanol Dehydration over a γ-Alumina Catalyst. Chem. Eng. Technol. 2022, 45, 319–328. [Google Scholar] [CrossRef]
- Stolonogova, T. Change in the Functional Properties of Automobile Gasolines in the Presence of Mixtures of Ethanol and a Glycerin Ether. Chem. Technol. Fuels Oils 2025, 61, 898–902. [Google Scholar] [CrossRef]
- ASTM D2700-18; Standard Test Method for Motor Octane Number of Spark Ignition Engine Fuel. ASTM International: West Conshohocken, PA, USA, 2011.
- ASTM D2699-12; Standard Test Method for Research Octane Number of Spark-Ignition Engine Fuel. ASTM International: West Conshohocken, PA, USA, 2012.
- Alshehri, A.S.; Tula, A.K.; Zhang, L.; Gani, R.; You, F. A Platform of Machine Learning-Based Next-Generation Property Estimation Methods for CAMD. In Computer Aided Chemical Engineering; Elsevier: Amsterdam, The Netherlands, 2021; Volume 50. [Google Scholar]
- Suzuki, S.; Mori, S. Flame synthesis of carbon nanotube through a diesel engine using normal dodecane/ethanol mixing fuel as a feedstock. J. Chem. Eng. Jpn. 2017, 50, 178–185. [Google Scholar] [CrossRef]
- Üstün, C.E.; Freitas, R.D.S.M.D.; Okafor, E.C.; Shahbakhti, M.; Jiang, X.; Paykani, A. Machine learning applications for predicting fuel ignition and flame properties: Current status and future perspectives. Energy Fuels 2025, 39, 13281–13314. [Google Scholar] [CrossRef]
- Tang, X.; Liao, H.; Gong, J. Machine Learning Approaches to Ignitability Classification of Solid Combustibles. In Combustion Science and Technology; Taylor & Francis: Oxfordshire, UK, 2025; pp. 1–24. [Google Scholar]
- Rittig, J.G.; Ritzert, M.; Schweidtmann, A.M.; Winkler, S.; Weber, J.M.; Morsch, P.; Heufer, K.A.; Grohe, M.; Mitsos, A.; Dahmen, M. Graph machine learning for design of high-octane fuels. AIChE J. 2023, 69, e17971. [Google Scholar] [CrossRef]
- Schweidtmann, A.M.; Rittig, J.G.; König, A.; Grohe, M.; Mitsos, A.; Dahmen, M. Graph neural networks for prediction of fuel ignition quality. Energy Fuels 2020, 34, 11395–11407. [Google Scholar] [CrossRef]
- Alshehri, A.S.; Gani, R.; You, F. Deep learning and knowledge-based methods for computer-aided molecular design—Toward a unified approach: State-of-the-art and future directions. Comput. Chem. Eng. 2020, 141, 107005. [Google Scholar] [CrossRef]
- Ye, G. De novo drug design as GPT language modeling: Large chemistry models with supervised and reinforcement learning. J. Comput.-Aided Mol. Des. 2024, 38, 20. [Google Scholar] [CrossRef]
- Alshehri, A.S.; Horstmann, K.A.; You, F. Versatile Deep Learning Pipeline for Transferable Chemical Data Extraction. J. Chem. Inf. Model. 2024, 64, 5888–5899. [Google Scholar] [CrossRef]
- Mavračić, J.; Court, C.J.; Isazawa, T.; Elliott, S.R.; Cole, J.M. ChemDataExtractor 2.0: Autopopulated Ontologies for Materials Science. J. Chem. Inf. Model. 2021, 61, 4280–4289. [Google Scholar] [CrossRef]
- Abdul Jameel, A.G.; Van Oudenhoven, V.; Emwas, A.-H.; Sarathy, S.M. Predicting octane number using nuclear magnetic resonance spectroscopy and artificial neural networks. Energy Fuels 2018, 32, 6309–6329. [Google Scholar] [CrossRef]
- Goldsmith, C.F.; Magoon, G.R.; Green, W.H. Database of small molecule thermochemistry for combustion. J. Phys. Chem. A 2012, 116, 9033–9057. [Google Scholar] [CrossRef] [PubMed]
- Keyvanpour, M.R.; Shirzad, M.B. An analysis of QSAR research based on machine learning concepts. Curr. Drug Discov. Technol. 2021, 18, 17–30. [Google Scholar] [CrossRef] [PubMed]
- Kessler, T.; Sacia, E.R.; Bell, A.T.; Mack, J.H. Predicting the cetane number of furanic biofuel candidates using an improved artificial neural network based on molecular structure. In Internal Combustion Engine Division Fall Technical Conference; American Society of Mechanical Engineers: New York, NY, USA, 2016; Volume 50503, p. V001T02A010. [Google Scholar]
- Li, R.; Herreros, J.M.; Tsolakis, A.; Yang, W. Machine learning-quantitative structure property relationship (ML-QSPR) method for fuel physicochemical properties prediction of multiple fuel types. Fuel 2021, 304, 121437. [Google Scholar] [CrossRef]
- Patel, R.; Rajaraman, T.; Rana, P.H.; Ambegaonkar, N.J.; Patel, S. A review on techno-economic analysis of lignocellulosic biorefinery producing biofuels and high-value products. Results Chem. 2025, 13, 102052. [Google Scholar] [CrossRef]
- Klein-Marcuschamer, D.; Simmons, B.A.; Blanch, H.W. Techno-economic analysis of a lignocellulosic ethanol biorefinery with ionic liquid pre-treatment. Biofuels Bioprod. Biorefin. 2011, 5, 562–569. [Google Scholar] [CrossRef]
- Zhongyang, L.; Oppong, F.; Wang, H.; Li, X.; Xu, C.; Wang, C. Investigating the laminar burning velocity of 2-methylfuran. Fuel 2018, 234, 1469–1480. [Google Scholar] [CrossRef]
- Cheng, Z.; He, S.; Xing, L.; Wei, L.; Li, W.; Li, T.; Yan, B.; Ma, W.; Chen, G. Experimental and kinetic modeling study of 2-methylfuran pyrolysis at low and atmospheric pressures. Energy Fuels 2017, 31, 896–903. [Google Scholar] [CrossRef]
- Guo, X. Feature-Based Localization Methods for Autonomous Vehicles. Doctoral Dissertation, Freie Universität Berlin Repository, Berlin, Germany, 2017. [Google Scholar]
- Swain, M.C.; Cole, J.M. ChemDataExtractor: A toolkit for automated extraction of chemical information from the scientific literature. J. Chem. Inf. Model. 2016, 56, 1894–1904. [Google Scholar] [CrossRef]
- Tshitoyan, V.; Dagdelen, J.; Weston, L.; Dunn, A.; Rong, Z.; Kononova, O.; Persson, K.A.; Ceder, G.; Jain, A. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 2019, 571, 95–98. [Google Scholar] [CrossRef]
- Weston, L.; Tshitoyan, V.; Dagdelen, J.; Kononova, O.; Trewartha, A.; Persson, K.A.; Ceder, G.; Jain, A. Named entity recognition and normalization applied to large-scale information extraction from the materials science literature. J. Chem. Inf. Model. 2019, 59, 3692–3702. [Google Scholar] [CrossRef] [PubMed]
- Wang, W.; Jiang, X.; Tian, S.; Liu, P.; Dang, D.; Su, Y.; Lookman, T.; Xie, J. Automated pipeline for superalloy data by text mining. npj Comput. Mater. 2022, 8, 9. [Google Scholar] [CrossRef]
- Agichtein, E.; Gravano, L. Snowball: Extracting relations from large plain-text collections. In Proceedings of the Fifth ACM Conference on Digital Libraries; Association for Computing Machinery: New York, NY, USA, 2000; pp. 85–94. [Google Scholar]
- Court, C.J.; Cole, J.M. Auto-generated materials database of Curie and Néel temperatures via semi-supervised relationship extraction. Sci. Data 2018, 5, 180111. [Google Scholar] [CrossRef]
- Isazawa, T.; Cole, J.M. Automated construction of a photocatalysis dataset for water-splitting applications. Sci. Data 2023, 10, 651. [Google Scholar] [CrossRef]
- Sierepeklis, O.; Cole, J.M. A thermoelectric materials database auto-generated from the scientific literature using ChemDataExtractor. Sci. Data 2022, 9, 648. [Google Scholar] [CrossRef]
- Huang, S.; Cole, J.M. BatteryDataExtractor: Battery-aware text-mining software embedded with BERT models. Chem. Sci. 2022, 13, 11487–11495. [Google Scholar] [CrossRef]
- Krishnan, N.A.; Kodamana, H.; Bhattoo, R. Machine Learning for Materials Discovery: Numerical Recipes and Practical Applications; Springer: Berlin/Heidelberg, Germany, 2024. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; Volume 30. [Google Scholar]
- Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A pretrained language model for scientific text. arXiv 2019, arXiv:1903.10676. [Google Scholar] [CrossRef]
- Huang, M.-S.; Han, J.-C.; Lin, P.-Y.; You, Y.-T.; Tsai, R.T.-H.; Hsu, W.-L. Surveying biomedical relation extraction: A critical examination of current datasets and the proposal of a new resource. Brief. Bioinform. 2024, 25, bbae132. [Google Scholar] [CrossRef]
- Alshehri, A.S.; Tantisujjatham, B.; Alrashed, M.M. Uncertainty-Aware Deep Reinforcement Learning Approach for Computational Molecular Design. Ind. Eng. Chem. Res. 2025, 64, 10117–10130. [Google Scholar] [CrossRef]
- Alshehri, A.S.; Bergman, M.T.; You, F.; Hall, C.K. Biophysics-guided uncertainty-aware deep learning uncovers high-affinity plastic-binding peptides. Digit. Discov. 2025, 4, 561–571. [Google Scholar] [CrossRef]
- Decardi-Nelson, B.; Alshehri, A.S.; You, F. Generative artificial intelligence in chemical engineering spans multiple scales. Front. Chem. Eng. 2024, 6, 1458156. [Google Scholar] [CrossRef]
- Almomtan, M.; Ibrahim, E.A.; Farooq, A. Fuelprop: Fuel property prediction from ATR-FTIR spectroscopic data. arXiv 2025, arXiv:2506.01601. [Google Scholar] [CrossRef]
- Polykovskiy, D.; Zhebrak, A.; Sanchez-Lengeling, B.; Golovanov, S.; Tatanov, O.; Belyaev, S.; Kurbanov, R.; Artamonov, A.; Aladinskiy, V.; Veselov, M.; et al. Molecular sets (MOSES): A benchmarking platform for molecular generation models. Front. Pharmacol. 2020, 11, 565644. [Google Scholar] [CrossRef] [PubMed]
- Kuzhagaliyeva, N.; Horváth, S.; Williams, J.; Nicolle, A.; Sarathy, S.M. Artificial intelligence-driven design of fuel mixtures. Commun. Chem. 2022, 5, 111. [Google Scholar] [CrossRef]
- Schweidtmann, A.M.; Rittig, J.G.; Weber, J.M.; Grohe, M.; Dahmen, M.; Leonhard, K.; Mitsos, A. Physical pooling functions in graph neural networks for molecular property prediction. Comput. Chem. Eng. 2023, 172, 108202. [Google Scholar] [CrossRef]
- Decardi-Nelson, B.; Alshehri, A.S.; Ajagekar, A.; You, F. Generative AI and process systems engineering: The next frontier. Comput. Chem. Eng. 2024, 187, 108723. [Google Scholar] [CrossRef]
- Krallinger, M.; Rabal, O.; Leitner, F.; Vazquez, M.; Salgado, D.; Lu, Z.; Leaman, R.; Lu, Y.; Ji, D.; Lowe, D.M.; et al. The CHEMDNER corpus of chemicals and drugs and its annotation principles. J. Cheminform. 2015, 7, S2. [Google Scholar] [CrossRef] [PubMed]
- Zhang, Y.; Wang, C.; Soukaseum, M.; Vlachos, D.G.; Fang, H. Unleashing the power of knowledge extraction from scientific literature in catalysis. J. Chem. Inf. Model. 2022, 62, 3316–3330. [Google Scholar] [CrossRef]
- Zhang, Y.; Vlachos, D.G.; Liu, D.; Fang, H. Rapid adaptation of chemical named entity recognition using few-shot learning and llm distillation. J. Chem. Inf. Model. 2025, 65, 4334–4345. [Google Scholar] [CrossRef]
- Yang, X.; Zhuo, Y.; Zuo, J.; Zhang, X.; Wilson, S.; Petzold, L. Pcmsp: A dataset for scientific action graphs extraction from polycrystalline materials synthesis procedure text. In Findings of the Association for Computational Linguistics: EMNLP 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 6033–6046. [Google Scholar]
- Xing, X.; Chen, P. Entity extraction of key elements in 110 police reports based on large language models. Appl. Sci. 2024, 14, 7819. [Google Scholar] [CrossRef]
- Tunstall, L.; Reimers, N.; Jo, U.E.S.; Bates, L.; Korat, B.; Wasserblat, M.; Pereg, O. Efficient few-shot learning without prompts. arXiv 2022, arXiv:2209.11055. [Google Scholar] [CrossRef]
- Chen, P.; Wang, J.; Lin, H.; Zhao, D.; Yang, Z. Few-shot biomedical named entity recognition via knowledge-guided instance generation and prompt contrastive learning. Bioinformatics 2023, 39, btad496. [Google Scholar] [CrossRef]
- Han, R.; Yang, C.; Peng, T.; Tiwari, P.; Wan, X.; Liu, L.; Wang, B. An empirical study on information extraction using large language models. arXiv 2023, arXiv:2305.14450. [Google Scholar]
- Eschbach-Dymanus, J.; Essenberger, F.; Buschbeck, B.; Exel, M. Exploring the effectiveness of llm domain adaptation for business it machine translation. In Proceedings of the 25th Annual Conference of the European Association for Machine Translation; European Association for Machine Translation: Tampere, Finland, 2024; Volume 1, pp. 610–622. [Google Scholar]
- Van Herck, J.; Victoria Gil, M.; Maik Jablonka, K.; Abrudan, A.; Anker, A.S.; Asgari, M.; Blaiszik, B.; Buffo, A.; Choudhury, L.; Corminboeuf, C.; et al. Assessment of fine-tuned large language models for real-world chemistry and material science applications. Chem. Sci. 2025, 16, 670–684. [Google Scholar] [CrossRef]
- Foppiano, L.; Lambard, G.; Amagasa, T.; Ishii, M. Mining experimental data from materials science literature with large language models: An evaluation study. Sci. Technol. Adv. Mater. Methods 2024, 4, 2356506. [Google Scholar] [CrossRef]
- Belkin, M.; Hsu, D.J.; Mitra, P. Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate. In Proceedings of the Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2018; Volume 31. [Google Scholar]
- Akhtar, M.; Reuel, A.; Soni, P.; Ahuja, S.; Ammanamanchi, P.S.; Rawal, R.; Zouhar, V.; Yadav, S.; Whitehouse, C.; Ki, D.; et al. When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation. arXiv 2026, arXiv:2602.16763. [Google Scholar] [CrossRef]
- Li, Y.; Shankar, V.S.B.; Yalamanchi, K.K.; Badra, J.; Nicolle, A.; Sarathy, S.M. Understanding the blending octane behaviour of unsaturated hydrocarbons: A case study of C4 molecules and comparison with toluene. Fuel 2020, 275, 117971. [Google Scholar] [CrossRef]
- Echekki, T.; Farooq, A.; Ihme, M.; Sarathy, S. Machine learning for combustion chemistry. In Machine Learning and Its Application to Reacting Flows: ML and Combustion; Springer International Publishing: Cham, Switzerland, 2023; pp. 117–147. [Google Scholar]
- Ji, W.; Su, X.; Pang, B.; Li, Y.; Ren, Z.; Deng, S. SGD-based optimization in modeling combustion kinetics: Case studies in tuning mechanistic and hybrid kinetic models. Fuel 2022, 324, 124560. [Google Scholar] [CrossRef]
| An Example JSONL Representation: |
|---|
| { "Properties": 1, "Example": 1, "DOI": "CN115287106B", "Text": { "Text": "the total content of the carbon six and the carbon seven alkanes is more than 60%, the benzene content is less than 0.5%, the olefin content is less than 3%, the density, the octane number and the vapor pressure are lower, and the cleanliness of small amount of aromatic hydrocarbon, olefin and components is higher. Further, the sum of the volume percentages of the 2-methylpentane and the 3-methylpentane in the isohexane is more than 99%. Further, the n-heptane has a research octane number of 0, and the main component of the n-heptane is seven alkane, the volume percentage content of the n-heptane is more than 98%, and the volume percentage content of the rest alkane components is less than 2%. Furthermore, the distillation range of the reformed gasoline is concentrated at 35–195°C" }, "Chemical": { "name": "n-heptane", "value": "0", "property": ["Research Octane Number"] } } |
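The record above nests the annotation under a `Chemical` object whose `property` field is a list. A minimal sketch of how one such JSONL line can be flattened into (chemical, property, value) triples for evaluation or downstream use (the field names follow the example; the helper `parse_record` is illustrative, not part of the released codebase):

```python
import json

# One record in the dataset format shown above (text field abbreviated).
record_line = json.dumps({
    "Properties": 1,
    "Example": 1,
    "DOI": "CN115287106B",
    "Text": {"Text": "Further, the n-heptane has a research octane number of 0, ..."},
    "Chemical": {
        "name": "n-heptane",
        "value": "0",
        "property": ["Research Octane Number"],
    },
})

def parse_record(line: str) -> list[tuple[str, str, str]]:
    """Flatten one JSONL record into (chemical, property, value) triples."""
    rec = json.loads(line)
    chem = rec["Chemical"]
    # One triple per listed property; the value is shared across them.
    return [(chem["name"], prop, chem["value"]) for prop in chem["property"]]

print(parse_record(record_line))
# [('n-heptane', 'Research Octane Number', '0')]
```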
| Instruction-Tuning Instance |
|---|
| { "messages": [ { "role": "system", "content": "IgnitionGPT is a fuel property extraction assistant. It identifies chemical names and their fuel-relevant properties (e.g., Research Octane Number, Motor Octane Number (MON), and Cetane Number) in scientific text." }, { "role": "user", "content": "Extract all occurrences of CHEMICAL, VALUE, and PROPERTY (e.g., Research Octane Number, Motor Octane Number (MON), and Cetane Number) for each fuel-related substance mentioned in the sentence below:\n$text_segment$\n\nOnly return the result as a strict JSON array of dictionaries.\n\nHere is an example:\nText: \"Propane has a motor octane number of 112 and a research octane number of 105.\"\n\nOutput:\n[{\"CHEMICAL\": \"propane\", \"VALUE\": \"112\", \"PROPERTY\": \"Motor Octane Number (MON)\"}, {\"CHEMICAL\": \"propane\", \"VALUE\": \"105\", \"PROPERTY\": \"Research Octane Number (RON)\"}]" }, { "role": "assistant", "content": "$jsonl_style_output$" } ] } |
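Each instruction-tuning instance pairs a system prompt, a user prompt containing the text segment, and the gold extraction serialized as the assistant turn. A sketch of how such an instance might be assembled into one JSONL line (the `build_instance` helper and the shortened prompt wording are assumptions for illustration; the actual prompts appear in the table above):

```python
import json

SYSTEM_PROMPT = (
    "IgnitionGPT is a fuel property extraction assistant. It identifies "
    "chemical names and their fuel-relevant properties (e.g., Research Octane "
    "Number, Motor Octane Number (MON), and Cetane Number) in scientific text."
)

def build_instance(text_segment: str, gold: list[dict]) -> str:
    """Serialize one chat-format fine-tuning instance as a JSONL line."""
    user_prompt = (
        "Extract all occurrences of CHEMICAL, VALUE, and PROPERTY for each "
        "fuel-related substance mentioned in the sentence below:\n"
        f"{text_segment}\n\n"
        "Only return the result as a strict JSON array of dictionaries."
    )
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
        # Gold labels are themselves JSON, embedded as a string in the turn.
        {"role": "assistant", "content": json.dumps(gold)},
    ]
    return json.dumps({"messages": messages})

line = build_instance(
    "Propane has a motor octane number of 112.",
    [{"CHEMICAL": "propane", "VALUE": "112",
      "PROPERTY": "Motor Octane Number (MON)"}],
)
```

Writing one such line per training example yields a file directly consumable by standard chat-format fine-tuning pipelines.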
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Alshehri, A.S. Fine-Tuning Large Language Models for the Efficient and Concurrent Extraction of Fuel Properties. Appl. Sci. 2026, 16, 3320. https://doi.org/10.3390/app16073320