Article

Fine-Tuning Large Language Models for the Efficient and Concurrent Extraction of Fuel Properties

by
Abdulelah S. Alshehri
Department of Chemical Engineering, College of Engineering, King Saud University, Riyadh 11421, Saudi Arabia
Appl. Sci. 2026, 16(7), 3320; https://doi.org/10.3390/app16073320
Submission received: 5 March 2026 / Revised: 26 March 2026 / Accepted: 27 March 2026 / Published: 29 March 2026
(This article belongs to the Special Issue Information Retrieval: From Theory to Applications)

Featured Application

An efficiently fine-tuned large language model with highly accurate and concurrent extraction capabilities is introduced as a scalable automated tool for building robust fuel datasets to accelerate next-generation fuel design and predictive combustion modeling.

Abstract

Large datasets of fuel properties are indispensable for predictive combustion modeling and next-generation fuel design. However, resource-intensive experiments restrict existing databases to 200–500 compounds, capturing an infinitesimal fraction of the C1-C20 hydrocarbon space. Furthermore, conventional rule-based and supervised learning extraction methods are constrained by poor scalability, domain-specific nomenclature, and weak contextual inference. To address these limitations, we introduce IgnitionGPT, a large language model fine-tuned from GPT-4.1-mini for the automated, concurrent extraction of three ignition metrics: Research Octane Number, Motor Octane Number, and Cetane Number. The model was trained on a human-annotated JSONL dataset of 304 sources (263 peer-reviewed articles, 41 patents) encompassing 581 diverse compounds. By evaluating IgnitionGPT directly against its zero-shot foundation, we isolate the impact of domain-specific fine-tuning. The model overcomes baseline overgeneralization (47.8% token-level accuracy) to achieve saturated extraction accuracy on unseen data (i.e., 100% for the best model). Remarkably, it reaches this saturation on an 85% held-out test split using a mere 10% of the data for fine-tuning, demonstrating robustness across heterogeneous literature. Ultimately, by open-sourcing our data and methods, this fine-tuning framework transitions chemical information retrieval from fragmented, rule-based heuristics to unified, concurrent extraction, bridging the gap between experimental limitations and data-driven molecular design and modeling.

1. Introduction

Accurate characterization of fuel ignition quality through properties such as the Research Octane Number (RON), Motor Octane Number (MON), and Cetane Number (CN) is indispensable for combustion modeling, engine optimization, and emissions prediction [1,2,3,4,5]. These parameters directly govern ignition delay, laminar flame propagation, and the formation of pollutant precursors, and thereby exert a profound influence on both engine efficiency and environmental performance [4,6,7]. Standardized measurement protocols, typically defined by American Society for Testing and Materials (ASTM) procedures, remain the primary means of property acquisition, supplemented by compilations from literature surveys [8,9]. However, these protocols require significant experimental resources and extended timelines, often weeks to months per compound, while covering only a fraction of the theoretical chemical space relevant to emerging or proprietary fuels [2,10,11]. Consequently, the lack of scalable and automated mechanisms for collecting and structuring ignition-related property datasets constitutes a persistent bottleneck in both fuel design and combustion modeling [12].
Recent methodological advances have sought to alleviate these limitations. Machine learning approaches, including surrogate-based regression, quantitative structure-property relationship (QSPR) models, and integrated computational workflows, have demonstrated promise in predicting fuel properties using existing empirical datasets [13,14,15]. Such models enable efficient screening of candidate molecules and design of optimized blends [16]. Nevertheless, their accuracy and transferability remain constrained by the limited scale, bias, and quality of the underlying datasets. In this context, large language models (LLMs) offer a potential route to ameliorate such issues by extracting and structuring property information directly from the unstructured academic literature and technical documentation [17,18,19]. As such, this route can substantially enlarge the available training data towards reducing the issues of scale, bias, and quality while eliminating the need for manual curation [12,15,20]. The resulting databases can subsequently enhance predictive model development, improve robustness across chemical classes, and strengthen the integration of computational predictions with experimental validation [2,21,22].
Current datasets for RON, MON, and CN are predominantly derived from manual curation or direct experimental acquisition, typically encompassing only 200–500 unique molecular structures [23,24]. This coverage is negligible relative to the estimated one to ten million potential hydrocarbon candidates encompassing molecules with 1–20 carbons [25,26,27,28]. The restricted scope not only limits chemical diversity but also induces systematic bias toward commercially available and well-studied fuels, thus hampering the generalization capabilities of predictive models. Although a wealth of property data exists within journal articles, patents, technical reports, and regulatory filings, such resources remain underexploited owing to the heterogeneity of formats and the complexity of extracting structured, machine-readable information [29,30,31]. Rule-based text mining approaches provide only partial solutions, as they lack scalability across large corpora, perform poorly with complex chemical nomenclature, and fail to capture contextual associations between compounds and measured properties [18,32]. Traditional pipelines (e.g., ChemDataExtractor [19,30], ChemREL [18], SuperalloyDigger [33]) rely heavily on rigid, handcrafted pattern-matching algorithms, such as the Snowball algorithm [34,35], and require large volumes of supervised training samples. As noted in the recent literature, these legacy systems exhibit brittle transferability, suffering F1-score drops of up to 36% when applied to novel extraction contexts [36,37], and demand substantial reprogramming to handle concurrent, multi-property extraction [19,38]. The absence of accurate and automated extraction pipelines thus represents a critical barrier to the expansion of ignition-property datasets required for modern predictive modeling [35,39].
LLMs, built on transformer-based architectures, have emerged as state-of-the-art tools for processing technical and scientific text [40,41]. By leveraging large-scale pretraining, these models achieve superior performance in extracting structured information from heterogeneous sources, identifying implicit relationships between molecular structures and thermophysical properties, and producing quantitatively annotated outputs suitable for integration into modeling pipelines [42]. Recent studies have demonstrated their applicability to tasks including chemical named entity recognition, property extraction with uncertainty quantification, and descriptor-property mapping [43,44,45]. Instruction-tuning and few-shot learning further enable the adaptation of LLMs to highly specialized chemical domains while maintaining generalization across structurally diverse molecular classes [46].
In response to these challenges, we present IgnitionGPT, a domain-adapted LLM fine-tuned on the general-purpose LLM GPT-4.1-mini. IgnitionGPT is designed for the automated extraction of ignition-relevant properties, including RON, MON, and CN, from unstructured scientific text. Our model fine-tuning employed a manually curated dataset of 304 documents (263 peer-reviewed articles and 41 patents) encompassing 581 distinct compounds across diverse chemical classes, with property annotations encoded in structured JSONL format. This dataset enables reproducible supervised fine-tuning and provides a benchmark for subsequent extraction tasks. Sequential training and testing experiments spanned 10–90% of the dataset. The main contributions of this work are threefold:
  • We introduce IgnitionGPT, a fine-tuned LLM optimized for ignition-property extraction from technical corpora, with demonstrated superiority over general-purpose LLMs.
  • We provide a rigorously curated, domain-specific dataset of ignition-related properties for 581 compounds, encompassing alkanes, alcohols, ethers, furans, aromatics, and selected inorganic or polymeric fuels.
  • We release all code and data openly under the MIT License (https://github.com/AI4CHEMIA/IgnitionGPT (accessed on 22 January 2026)), ensuring transparency and reproducibility.
The core significance and unique innovation of this work lie in transitioning chemical data extraction from rule-based, single-task legacy pipelines [18,19,30,33,37] to a highly data-efficient, concurrent LLM framework. IgnitionGPT simultaneously disentangles multiple interrelated ignition properties (RON, MON, CN) across highly heterogeneous document structures. The extracted dataset significantly expands the coverage of ignition properties and diversity of fuels towards enabling improved training of predictive surrogate models and enhancing generalization to out-of-distribution chemical classes [47,48]. The integration of such data into computational frameworks supports multi-objective fuel design workflows, wherein deep learning predictors are combined with optimization algorithms to identify blends satisfying performance and emissions constraints [44,49,50]. Furthermore, its incorporation into combustion simulations improves predictions of ignition delay, emissions profiles, and engine-fuel compatibility. Advances in foundation models and retrieval-augmented generation approaches are expected to further enhance extraction fidelity, supporting continuous database expansion [51].

2. Methods

This section describes the methodological framework employed in this study, including the systematic collection and curation of domain-specific documents, the annotation and structuring of chemical-property relationships, and the design and fine-tuning of a large language model for property extraction, as illustrated in Figure 1. First, we outline the data collection process, detailing the sources, selection criteria, and statistical composition of the final dataset. We then describe the annotation strategy, including entity labeling, property linking, and schema representation in structured JSONL format. Subsequently, we present the model architecture, fine-tuning procedure, and incremental training strategy, emphasizing reproducibility and generalization across dataset sizes. Finally, we discuss evaluation metrics, baseline comparisons, and experimental validation.

2.1. Data Collection and Curation

In this subsection, we describe the systematic collection of a domain-specific dataset for ignition-property extraction. We outline the sources and retrieval strategies, criteria for screening and selecting relevant documents, the composition and structural characteristics of the curated dataset, and the construction of incremental training splits used for evaluating model performance and data efficiency.

2.1.1. Data Sources and Retrieval

To construct a domain-specific dataset for automated extraction of ignition-relevant fuel properties, we systematically curated documents from both peer-reviewed research articles and patent literature. The objective was to capture explicit mentions of chemical entities and their associated ignition-quality metrics, namely RON, MON, and CN. Text segments containing modified or ambiguous descriptors (e.g., ethanol blend) were excluded to minimize variability introduced by non-standard nomenclature. Document retrieval employed multiple search platforms, including Google Scholar, Google Patents, and general web indexing tools, in addition to publisher databases such as the American Chemical Society (ACS), ScienceDirect, and MDPI, while adhering to publisher-specific text mining guidelines [18,52]. Targeted keyword searches (“Cetane Number,” “Motor Octane Number,” and “Research Octane Number”) guided initial collection, while inclusion of non-open-access excerpts (e.g., Elsevier journals) broadened corpus diversity and ensured representation across variable document structures and lengths. Each record was linked to unique identifiers, including Digital Object Identifiers (DOIs) or patent registration numbers, to preserve traceability and reproducibility.

2.1.2. Screening and Selection

An initial pool of over 600 candidate documents was screened according to relevance, accessibility, and presence of unambiguous property mentions. The final dataset consisted of 304 documents, of which 263 (86.5%) were journal articles and 41 (13.5%) were patents. Extracted passages ranged from 43 to 400 words, with an average of approximately 200 words. Research articles exhibited greater lexical richness and systematic nomenclature, whereas patents frequently employed bullet points, chemical symbols, and numeric formatting, introducing structural heterogeneity. Chemical entity mentions included systematic IUPAC names, molecular formulas, and trivial or trade names. Each mention was labeled by name type and subsequently categorized into broader chemical classes (e.g., alkanes, alcohols, ethers). Figure 2a,b summarize passage length distributions and frequencies of chemical representation types, while Figure 3 illustrates the distribution and diversity of compounds across chemical categories.

2.1.3. Dataset Composition

The curated dataset comprised 581 unique compounds organized into structured JSONL records that link raw text passages, chemical entities, and associated properties. Metadata fields included document identifiers (DOI or patent ID), extracted text, and annotated entities with corresponding property values. Hydrocarbons constituted the largest group, primarily alkanes (230 compounds), followed by oxygenated compounds such as alcohols (158) and ethers (58). Less represented categories included furan and cyclic ethers (41), aromatics (22), and a small subset of polymers and inorganic species. This distribution reflects the dataset’s chemical diversity of fuels, sufficient to enable generalization not only in property extraction but also in the resulting estimation models. An illustrative JSONL entry is shown in Table 1; the schema uses nested entity-property linkage and permits the extraction of multiple chemicals and property values per passage.
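To make the record structure concrete, the following Python sketch constructs and appends one such record. The key names follow the spirit of Table 1 but are illustrative assumptions, as are the placeholder identifier, passage, and property values.

import json

# One dataset record linking a passage, a chemical entity, and a property.
# Key names ("doi", "text", "entities", ...) are illustrative assumptions.
record = {
    "doi": "10.xxxx/placeholder",
    "text": "The research octane number of 2-methylfuran is approximately 103.",
    "entities": [
        {
            "name": "2-methylfuran",
            "name_type": "trivial",  # systematic | formula | trivial/trade
            "category": "furans",
            "properties": [{"property": "RON", "value": 103.0}],
        }
    ],
}

# JSONL format: one JSON object per line.
with open("ignition_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")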
While a corpus of 304 documents may appear small compared to general-domain NLP datasets, it reflects a standard scale for fine-tuning LLMs on highly specialized, human-annotated chemical extraction tasks. Comparable studies in the literature have shown LLMs learning efficiently from similarly sized corpora, such as 185 documents for catalysis data [53,54], 305 documents for polycrystalline materials [54,55], and 110 documents for structured reports [56]. Furthermore, because our approach relies on fine-tuning and instruction-tuning a pretrained foundation model rather than training from scratch, it requires significantly less data to align to the task. Recent studies demonstrate that LLMs can achieve exceptional domain adaptation and substantial F1 score improvements with as few as 5 to 50 high-quality examples [54,57,58].

2.1.4. Incremental Training Splits

To evaluate scalability and learning stability, the dataset was partitioned into incremental training (fine-tuning) subsets corresponding to 10%, 20%, 50%, 80%, and 90% of the total 581 samples. In each case, the remaining fraction was reserved for testing. Training token counts for these subsets were 85,710 (10%), 167,130 (20%), 413,615 (50%), 659,885 (80%), and 743,645 (90%). This experimental design enabled a systematic assessment of performance progression and data requirement as a function of dataset size. Since we perform each experiment separately, the testing documents are entirely unseen during the fine-tuning process. As such, they serve effectively as a large-scale external testing dataset, providing a rigorous assessment of the model’s real-world robustness and generalization capabilities.
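A minimal sketch of this incremental design is given below, assuming the records live in a JSONL file as described in Section 2.1.3; the function and file names are illustrative.

import json
import random

def make_split(records, train_fraction, seed=0):
    # Shuffle once, take the leading fraction for fine-tuning, and hold out
    # the remainder as an entirely unseen test set for that experiment.
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

with open("ignition_dataset.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

for fraction in (0.10, 0.20, 0.50, 0.80, 0.90):
    train, test = make_split(records, fraction)
    print(f"{fraction:.0%}: {len(train)} fine-tuning / {len(test)} test records")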

2.2. Model Architecture and Fine-Tuning

In this subsection, we detail the experimental design and adaptation of a large language model for fuel-property extraction. We describe the selection of the baseline model, the structure of training data for supervised fine-tuning, and the implementation of the fine-tuning framework, including optimization protocols and training procedures to ensure reproducible and efficient model performance.

2.2.1. Baseline Model Selection

All experiments employed GPT-4.1-mini (release 14 April 2025) as the baseline architecture. This model, developed by OpenAI, represents a transformer-based large language model optimized for computational efficiency while retaining competitive performance in structured text extraction tasks. Its architecture supports multi-turn conversational supervision, which is critical for aligning raw scientific text with structured annotation formats such as JSON. Two model configurations were evaluated: (i) the unmodified base GPT-4.1-mini in a zero-shot setting, and (ii) a fine-tuned variant, hereafter referred to as IgnitionGPT.

2.2.2. Training Data Representation

Training instances were encoded in JSONL format, where each entry captured a structured conversational exchange. The system role defined the extraction task, the user role contained the raw input text, and the assistant role provided the gold-standard structured output. This schema ensured deterministic supervision and reproducibility. An illustrative fine-tuning and instruction alignment instance is provided in Table 2.
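The sketch below shows one such instance in the chat-style JSONL layout expected by OpenAI's fine-tuning endpoint; the instruction wording, input passage, and output schema are illustrative rather than the exact contents of Table 2.

import json

instance = {
    "messages": [
        {   # system role: defines the extraction task
            "role": "system",
            "content": "Extract every chemical and its RON, MON, and CN "
                       "values from the passage. Respond in JSON.",
        },
        {   # user role: the raw input text
            "role": "user",
            "content": "n-Heptane, the low-octane reference fuel, has a RON "
                       "of 0 and a cetane number of approximately 54.",
        },
        {   # assistant role: the gold-standard structured output
            "role": "assistant",
            "content": json.dumps({"n-heptane": {"RON": 0, "MON": None, "CN": 54}}),
        },
    ]
}

with open("train_split.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(instance) + "\n")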

2.2.3. Fine-Tuning Framework

Supervised fine-tuning, referred to interchangeably with training here, was conducted using OpenAI’s platform with the following settings (batch size = 16, learning rate = 5 × 10−5 with 5% warmup, weight decay = 0.01, three epochs per split, gradient clipping = 1.0), a maximum sequence length of 2048 tokens, and mixed-precision (bfloat16) training to ensure stability, efficiency, and reproducibility. Preliminary experiments exploring alternative hyperparameter configurations yielded either higher computational costs or inferior performance. It should also be noted that OpenAI’s fine-tuning platform does not disclose the percentage of trainable parameters of the base model being fine-tuned, nor the base model’s total number of weights.
To evaluate data efficiency and generalization capacity, fine-tuning was conducted sequentially on progressively larger subsets of the dataset, comprising 10%, 20%, 50%, 80%, and 90% of the available 581 annotated samples, with the corresponding token counts reported in Section 2.1.4. For each configuration, a small random fraction (5%) was held out as a validation set to guide checkpoint selection, while the remaining fraction was used for final testing. This incremental design enables systematic characterization of scaling behavior, convergence dynamics, and sensitivity to the data requirements of this task.
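For reference, a hedged sketch of submitting one such run through the OpenAI Python SDK follows. Note that the hosted platform exposes only a subset of the settings listed above (epochs, batch size, and a learning-rate multiplier rather than an absolute rate), and the model snapshot identifier used here is an assumption.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the fine-tuning split and the 5% validation hold-out.
train_file = client.files.create(file=open("train_split.jsonl", "rb"), purpose="fine-tune")
val_file = client.files.create(file=open("val_split.jsonl", "rb"), purpose="fine-tune")

# Launch a supervised fine-tuning job on the GPT-4.1-mini base model.
job = client.fine_tuning.jobs.create(
    model="gpt-4.1-mini-2025-04-14",  # assumed snapshot name
    training_file=train_file.id,
    validation_file=val_file.id,
    hyperparameters={"n_epochs": 3, "batch_size": 16},
)
print(job.id, job.status)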

3. Results and Analyses

This section presents the empirical results obtained from evaluating GPT-4.1-mini (Version: 14 April 2025) in both its zero-shot configuration and a fine-tuned variant, IgnitionGPT. The structure of the section follows a logical progression from overall performance comparisons to detailed observations of training dynamics and potential overfitting. Section 3.1 compares zero-shot and fine-tuned performance, highlighting token-level accuracy improvements from fine-tuning, with 3.1.1 quantifying gains and 3.1.2 outlining baseline capabilities. Section 3.2 examines training and testing accuracy trends, summarizing fine-tuning logs and analyzing intra-epoch fluctuations to assess learning stability and convergence. Section 3.3 explores the relationship between model fit and generalization, evaluating training progression through accuracy and loss, and identifying overfitting by examining the divergence between training and testing performance.

3.1. Comparing Zero-Shot Learning with Fine-Tuning

Since existing pipelines in the literature mostly do not support zero-shot extraction [18,19,30,39], the performance of IgnitionGPT was benchmarked against the base GPT-4.1-mini under identical conditions. In the zero-shot setting, the general-purpose GPT-4.1-mini achieves approximately 48% token-level accuracy (64.7% F1), reflecting a poor grasp of fuel-property language from general pretraining, as shown in Figure 4. Lower precision than recall indicates overgeneralization, with the model detecting relevant contexts but lacking specificity in token extraction. As shown in Figure 4, fine-tuning to IgnitionGPT increases accuracy to 100%, with this maximum reached using only 10% of the training data. This rapid saturation demonstrates efficient data utilization, consistent with the learning efficiency patterns observed in instruction-tuned models. These results indicate that IgnitionGPT requires limited supervision to achieve high accuracy, with immense potential for reducing annotation effort and facilitating deployment in specialized domains.

3.1.1. Overall Results

Our experimental findings show that fine-tuning leads to a 52.2 percentage point increase in IgnitionGPT’s token-level accuracy compared to zero-shot results of GPT-4.1-mini, improving from 47.8% in the zero-shot setting to 100% on the testing set. The model also achieves an F1-score of 64.7% before fine-tuning, which increases to 100% afterward. This change reflects the model’s ability to learn domain-specific extraction behavior when trained on a structured and annotated dataset. The results demonstrate that general-purpose pretraining alone is insufficient for accurate token-level extraction in specialized tasks, and that domain-specific fine-tuning is necessary to reach high levels of accuracy.
After fine-tuning, the model consistently predicts correct token boundaries for fuel-property terms such as RON, MON, and CN. This improvement indicates that the model internalizes patterns present in the training data rather than memorizing specific examples. Given the well-defined structure of the annotation schema and low ambiguity in target token classes, 100% testing accuracy does not necessarily reflect overfitting. Instead, it suggests convergence on a consistent extraction function under clear task constraints. The results also show that IgnitionGPT can learn accurate extraction behavior from relatively compact datasets if the data covers representative linguistic variations. This suggests that supervised fine-tuning can be efficiently applied to specialized domains without requiring large-scale annotation, provided that the data is consistent and structurally informative.

3.1.2. Zero-Shot Detailed Evaluation

In the zero-shot setting, GPT-4.1-mini achieves 47.8% overall accuracy, with a precision of 61.0%, a recall of 68.9%, and an F1 score of 64.7%, as shown in Figure 4. These values indicate partial recognition of fuel-property entities in unseen text without any task-specific training. Recall exceeding precision by 7.9 percentage points indicates overprediction of positive tokens, increasing false positives. This suggests overgeneralization driven by reliance on loose contextual cues rather than strict token boundaries. This is consistent with the nature of general-purpose large pretrained language models, which are optimized for general language understanding rather than fine-grained token classification in technical subdomains. These results define the model’s baseline performance and clarify the gap addressed by fine-tuning. While zero-shot inference captures some relevant tokens, the precision-recall imbalance and low token accuracy show that the model lacks the ability to apply domain-specific extraction rules. These limitations provide a clear justification for fine-tuning and help contextualize the scale of improvement achieved with supervised training.
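To clarify how these figures relate, the following sketch computes the token-level metrics from parallel gold/predicted binary labels; it is a generic implementation of the standard definitions, not code from the IgnitionGPT pipeline.

def token_metrics(gold, pred):
    # gold, pred: parallel lists of 0/1 labels marking property-relevant tokens.
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    tn = sum(g == 0 and p == 0 for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(gold)
    return precision, recall, f1, accuracy

# Toy example: extra positive predictions push recall above precision,
# mirroring the zero-shot overgeneralization pattern described above.
gold = [0, 1, 1, 0, 0, 1, 0, 0]
pred = [0, 1, 1, 1, 1, 1, 0, 0]
print(token_metrics(gold, pred))  # (0.6, 1.0, 0.75, 0.75)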

3.2. Analyzing Training and Testing Accuracy

In this section, we evaluate IgnitionGPT’s accuracy, data efficiency, and training stability across varying dataset sizes. Figure 5a shows that the model achieves consistently high mean accuracy (0.978–0.989) with just 10% of the training data, reaching 1.00 maximum accuracy across all data sizes. Low test loss and narrow standard deviation bands (0.010–0.025) indicate stable fine-tuning and low sensitivity to initialization or data splits, as shown in Figure 5a. Training performance shows early saturation, consistent with NLP scaling law predictions, suggesting rapid convergence once sufficient domain-relevant input is seen. The model also maintains high minimum accuracy (>0.87) (Figure 5b), implying robustness even under limited training conditions. Despite high token-level metrics, span-level extraction remains a limitation, particularly for multi-token entities and relational contexts. These results support IgnitionGPT’s efficiency and generalization capability, while highlighting areas for further structural evaluation and fine-grained assessment.

3.2.1. Descriptive Analysis of Fine-Tuning Logs

IgnitionGPT achieves high mean token-level accuracy across training splits, ranging from 0.978 at 10% of the dataset to 0.989 at 80%, with performance gains diminishing beyond the 20% threshold. This early plateau suggests that the model generalizes effectively from limited supervision, likely due to inductive biases from pretraining and alignment between model architecture and task structure, as shown in Figure 5a. Maximum accuracy reaches 1.00 across all data splits, including low-resource conditions, and standard deviation remains between 0.010 and 0.025, indicating low variability across runs. As illustrated in Figure 5a, accuracy variance follows a U-shaped trend, with increased spread at 10% and 50%, likely due to sampling noise or fold-specific distributional differences. At 20% and 80%, the narrower variance bands may reflect more representative training subsets or redundancy in domain-specific signals, as supported by the patterns in both panels of Figure 5.
Minimum accuracy values range from 0.878 to 0.955 across training conditions, with the lowest point observed when trained on 50% of the dataset, possibly due to higher heterogeneity within that subset, as shown in Figure 5b. These results suggest that model performance remains consistent under varied data availability and is not substantially affected by random partitioning. The rapid saturation of accuracy aligns with scaling law expectations, where most learning gains are realized early in fine-tuning for LLMs [54,57]. However, the evaluation remains limited to token-level correctness. Multi-token spans such as chemical properties may be inconsistently labeled or partially predicted, and the model’s ability to associate numerical values with the correct property is not evaluated. A more complete assessment would therefore require span-level and relation-level evaluation metrics to capture structural and semantic correctness beyond token boundaries.
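A minimal sketch of such a stricter check is shown below: exact matching over (entity, property, value) triples, so a partially recovered multi-token name or a misattributed value no longer scores. The names and example triples are illustrative.

def span_f1(gold_triples, pred_triples):
    # Each argument is a set of (entity, property, value) tuples; credit is
    # given only for exact triple matches, unlike token-level scoring.
    tp = len(gold_triples & pred_triples)
    precision = tp / len(pred_triples) if pred_triples else 0.0
    recall = tp / len(gold_triples) if gold_triples else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = {("2-methylfuran", "RON", 103.0), ("n-heptane", "CN", 54.0)}
pred = {("2-methylfuran", "RON", 103.0), ("heptane", "CN", 54.0)}  # truncated name
print(span_f1(gold, pred))  # 0.5: only one of two triples matches exactly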

3.2.2. Variations in Token Accuracy and Loss While Training

This section analyzes the joint evolution of token accuracy and loss during IgnitionGPT fine-tuning. Figure 6 presents joint distribution plots of training and validation token-level accuracy, showing strong clustering along the diagonal near (1.0, 1.0) and indicating consistent generalization across data splits. A strong linear correlation between training and validation accuracy confirms that improvements on training data translate effectively to validation sets without overfitting. Variability in accuracy follows a U-shaped pattern in Figure 6, with higher variance at the 10% and 50% data splits, likely due to limited domain representation and uneven inclusion of complex patterns, respectively, while the 20% and 80% splits exhibit reduced variance from more representative data, as shown in Figure 6b,d. These results highlight the importance of data stratification in low-resource settings and demonstrate that IgnitionGPT learns robust token-level representations across different training regimes.
Training and validation loss curves converge steadily across all data splits, showing monotonic decreases during early epochs and plateauing after limited updates. This behavior indicates stable optimization without oscillations, loss spikes, or divergence, consistent with early saturation trends. The absence of instability across runs suggests that IgnitionGPT’s parameter space is effectively navigated using the standard optimization settings described in Section 2.2.3, without the need for additional, specialized regularization.
As per Figure 6, no evidence of mode collapse or degraded validation accuracy is observed across experiments, supporting reproducibility and robustness in the fine-tuning process. Consistency across random seeds and data folds implies that the model’s architecture and training setup are well-suited to the task. This also indicates that high token-level accuracy results from systematic learning rather than favorable initialization, reinforcing the reliability of IgnitionGPT’s performance under controlled fine-tuning conditions.
Figure 7 presents the joint distribution of training and validation loss across dataset sizes, showing that losses remain low and closely aligned, indicating effective generalization. Figure 7b,c illustrate that at low data levels (10–20%), validation loss is low with modest variance, suggesting that minimal supervision suffices for salient feature extraction, likely due to inductive biases from pretraining. As dataset size increases from 50% to 90%, Figure 7d–f show a decrease in loss variance, reflecting more stable convergence and reduced sensitivity to stochastic factors.
No overfitting is evident across Figure 7, as low training loss does not correspond with high validation loss, indicating effective regularization and avoidance of memorization. Rare outliers with elevated validation loss occur primarily in low-data splits but do not alter overall trends. Figure 7c also shows that loss reductions plateau beyond 20% training data, with further increases primarily reducing variance rather than mean loss, consistent with NLP scaling laws. These findings confirm that IgnitionGPT achieves stable, generalizable optimization with diminishing returns at higher data volumes, supporting efficient fine-tuning under limited supervision.

3.3. Overfitting and Loss Analysis

IgnitionGPT shows minimal overfitting across all data regimes, with validation accuracy consistently exceeding 97% and no significant divergence between training and validation at higher data sizes (80–90%), as per Figure 8. Figure 9 shows that loss decreases smoothly during training, with low variance and no signs of instability or gradient noise, particularly in large-data settings. In low-data conditions, slight divergence between training and validation loss provides reliable early stopping signals, and best-performing checkpoints consistently appear near the final epochs, indicating effective convergence, as shown in Figure 8.

3.3.1. Observation of Training Progress

Training progress curves in Figure 8 show that IgnitionGPT’s accuracy improves with increased supervision, with early saturation and wider divergence between training and validation in low-data settings (10–20%). In contrast, higher data fractions (80–90%) yield tightly aligned accuracy curves, indicating more consistent generalization and reduced distributional shift between training and validation data. Best-performing checkpoints in these settings typically occur at later epochs (Figure 8), reflecting stable convergence under continued optimization.
Mid-range configurations like 50% show partial convergence in Figure 8, with validation accuracy stabilizing mid-training, suggesting that generalization becomes limited without broader coverage of domain variability. Loss trajectories in Figure 9 support these trends: training and validation loss curves converge smoothly across all splits but exhibit tighter alignment and lower variance at higher data sizes. In small-data regimes, loss curves are more dispersed with occasional spikes, indicating higher sensitivity to stochastic effects and reduced stability in learning dynamics.

3.3.2. Observing Overfitting Possibilities

Figure 8 shows increasing divergence between training and validation token accuracy over time in low-data settings (10–20%), indicating growing specialization to training examples. This divergence is mitigated by selecting checkpoints just before it intensifies, enabling the model to capture relevant patterns while avoiding overfitting. As shown in Figure 8, best-performing checkpoints consistently occur between 80% and 100% of training steps, indicating that validation accuracy, rather than training accuracy, drives model selection.
Validation loss trends in Figure 9 support this, showing occasional late-stage increases in low-data regimes. Early stopping based on these signals prevents loss degradation and sustains validation accuracy above 97% even under limited supervision. Together, these results show that the model does not overfit and maintains generalization across different data conditions through effective checkpointing.

4. Discussion

The results presented herein demonstrate that domain-adapted fine-tuning of a general-purpose language model substantially improves chemical property extraction accuracy and robustness. IgnitionGPT, derived from GPT-4.1-mini, achieved complete token-level accuracy following supervised fine-tuning, in stark contrast to the 47.8% observed in the zero-shot baseline. This performance improvement, illustrated in Figure 4, underscores the critical importance of domain-specific fine-tuning (instruction-tuning) for specialized chemical tasks, particularly when pretraining alone provides only partial recognition of chemical entities and their associated properties. The model’s enhanced precision, recall, and F1-score further indicate its ability to internalize structured patterns from annotated datasets rather than memorizing examples, consistent with prior studies on instruction-tuned chemical NLP models [43]. Notably, as shown in Figure 5, IgnitionGPT achieves near-perfect extraction performance with only 10% of the training data. Such high data efficiency is consistent with recent findings in biomedical and chemical information extraction, which demonstrate that pretrained LLMs can achieve substantial domain adaptation and accurate entity recognition utilizing as few as 5 to 50 instruction-tuned examples [54,57,58]. Such a zero-shot comparison isolates the performance gains directly attributable to the main contribution, domain-specific fine-tuning, and quantifies the resolution of zero-shot overgeneralization, as supported by the literature [59,60,61,62].
Automated extraction of fuel-relevant metrics, including RON, MON, and CN, enables direct integration of empirical data into combustion modeling workflows. By systematically mining literature and patent sources, IgnitionGPT expands dataset coverage beyond the limitations of traditional experimental campaigns, providing a broader representation of the C1-C20 hydrocarbon space. Figure 3 illustrates the chemical category distributions captured in the curated dataset, showing that hydrocarbons, alcohols, and ethers dominate, yet minor classes such as furan derivatives and aromatics are also represented, supporting generalization to diverse chemical structures. Structuring unstructured text into JSONL facilitates downstream application in QSPR and surrogate models, including graph-based neural networks, where enhanced property coverage improves predictive accuracy and reduces uncertainty [14,15,50]. Unlike rule-based [19] or supervised extraction approaches [18,19,30], LLM-based pipelines efficiently capture complex linguistic and structural patterns, producing high-quality datasets with minimal supervision and supporting rapid integration into computational fuel design.
Despite these strengths, several limitations merit consideration. The annotated corpus, while comprehensive, remains a finite representation of fuel-property literature and may not encompass rare linguistic variants such as documents containing multilingual tokens or complex multi-token entities. For instance, scientific texts frequently embed raw structural formulas as synonyms (e.g., from the IgnitionGPT dataset, CH 2=CH-C(CH 3) 3), utilize non-standard Greek letter substitutions (e.g., from the IgnitionGPT dataset, α-methyl naphthalene), or contain typographical issues like spacing within long IUPAC names (e.g., from the IgnitionGPT dataset, 2,2,4,4,6, 8,8-heptamethylnonane). Consequently, although not reflected in our token-level results, span-level and relational extraction of numerical-property associations remain potential sources of misclassification, particularly for intricate formulations typically described in patents. Nonetheless, because our final model is fine-tuned on just 10% of the data and tested on 85%, this significant held-out fraction of documents serves as a large-scale external dataset. Reaching this empirical performance ceiling on a much larger test set suggests a negligible generalization gap and renders comparisons to lesser baselines redundant [63,64]. Such extensive evaluation suggests that the model can be highly robust to out-of-distribution variations and deliver reliable real-world extraction performance.
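For illustration, a hedged pre-processing pass over the noisy mention patterns quoted above might look as follows; the regular expressions are heuristic assumptions, not part of the IgnitionGPT pipeline.

import re

def normalize_mention(text: str) -> str:
    # Collapse stray spaces inside structural formulas, e.g. "CH 2=CH-C(CH 3) 3".
    text = re.sub(r"(?<=[A-Za-z)]) (?=\d)", "", text)
    # Remove stray spaces after commas in IUPAC locant lists, e.g. "...,6, 8,8-...".
    text = re.sub(r"(?<=\d), (?=\d)", ",", text)
    return text

print(normalize_mention("CH 2=CH-C(CH 3) 3"))                 # CH2=CH-C(CH3)3
print(normalize_mention("2,2,4,4,6, 8,8-heptamethylnonane"))  # 2,2,4,4,6,8,8-heptamethylnonane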
Another limitation we observe is that the evaluation metrics primarily assess token-level alignment during fine-tuning, as illustrated in Figure 6, and minor discrepancies may occur in multi-property sentences where numerical values are embedded in dense contextual expressions. This issue is evident in highly parallel assignments, particularly in the zero-shot setting, where relational misattribution occurred for the following sample: ‘the RON and MON values assigned for iso-octane and n-heptane are both 100 and 0, respectively’ [65]. Fundamentally, the reliance on pretraining inductive biases and the structured annotation schema may introduce subtle corpus-specific patterns, which could limit generalizability to unseen chemical subdomains or alternative document formats. Thus, despite fine-tuning IgnitionGPT on a few samples with high structural density, we urge end-users to verify that their documents match the predominant types and formats represented here, to avoid misclassifications and misattributions.
From a technical perspective, IgnitionGPT exhibits stable learning dynamics across varying dataset fractions. As shown in Figure 7, the early saturation of accuracy, narrow variance bands, and low validation loss indicate effective generalization and minimal overfitting. Observed U-shaped variance trends at low and intermediate data fractions highlight sensitivity to sample heterogeneity, emphasizing the value of stratified data selection and human-in-the-loop fine-tuning strategies to maximize performance in low-resource settings. The robustness of token-level predictions, with minimal variability across random seeds and data splits, further underscores the alignment between architectural inductive biases and domain-specific extraction demands.
The structured outputs generated by IgnitionGPT can be incorporated into QSPR-based models to predict RON, MON, and CN from molecular descriptors, as previously demonstrated for surrogate modeling frameworks [12]. By expanding the training dataset to 581 compounds (Figure 3), these models benefit from increased chemical diversity, enabling predictive identification of fuel formulations optimized for ignition delay, auto-ignition resistance, and emission reduction without extensive experimental testing. The integration into hybrid workflows, combining surrogate predictions with combustion, chemical kinetics and/or CFD simulations, would enhance both the accuracy and computational efficiency of engine performance predictions [66]. Furthermore, AI-guided property prediction pipelines, informed by structured datasets, support in silico generation of high-octane, low-sooting fuel blends, providing actionable tools for accelerated and cleaner fuel design [49,67].
Broader implications of this work extend to chemical informatics and industrial applications. High-fidelity extracted property data facilitates the construction of large-scale, relational datasets from unstructured sources, enabling downstream tasks such as property prediction, fuel discovery, and regulatory compliance analysis. The efficiency observed in Figure 5 suggests that similar approaches could be extended and adapted to other combustion subdomains or properties with minimal additional annotation effort. Moreover, the model’s robustness and generalization potential indicate compatibility with emerging methods, including uncertainty quantification and retrieval-augmented generation, enhancing the reliability of data-driven chemical research workflows.

5. Conclusions and Future Directions

We developed IgnitionGPT, a fine-tuned large language model based on GPT-4.1-mini, for automated extraction of fuel-relevant properties, namely RON, MON, and CN, from unstructured literature and patents. Using a curated dataset of 304 documents covering 581 compounds, IgnitionGPT outperformed the zero-shot baseline, increasing token-level accuracy from 47.8% to 100% and F1-score from 64.7% to 100%, consistently capturing correct token boundaries without false positives or negatives. High performance was achieved using only 10% of the training data, demonstrating the highly efficient learning of domain-specific linguistic patterns characteristic of modern instruction-tuning [54]. Concurrently, evaluating the model against the remaining 85% of unseen data strictly validated its real-world robustness, proving that massive, from-scratch datasets are not a prerequisite for deploying highly accurate, domain-specific chemical extraction tools. Analyses confirmed stable optimization and smooth loss convergence without overfitting. Performance saturation beyond 20% of the dataset reflects rapid convergence consistent with NLP scaling laws, while validation-based checkpoint selection ensured reproducibility under limited supervision.
Ultimately, the innovation of IgnitionGPT extends beyond immediate performance metrics. Rather, it establishes a scalable baseline for chemical informatics. By overcoming the constraints of rigid, rule-based legacy systems, the devised framework demonstrates that concurrent, multi-property extraction can be achieved with exceptional data efficiency and robustness. In the immediate subdomain of fuels informatics, IgnitionGPT provides a reproducible, highly transferable foundation for constructing structured fuel-property datasets from heterogeneous sources, supporting surrogate modeling, combustion simulations, and multi-objective fuel design.
Future work may extend this framework to extract a wider array of fuel properties and complex ignition mechanisms. To further enhance extraction versatility and generalization, subsequent iterations can explore the integration of advanced LLM methodologies, including model distillation, prompt contrastive learning, retrieval-augmented generation, and active learning. Moreover, expanding the current evaluation protocols to encompass span- and relation-level metrics can enable a more rigorous assessment of structural and semantic correctness across highly heterogeneous corpora towards accelerating data-driven fuel development and predictive combustion modeling.

Funding

This research was funded through the Waed Program (W25-5) by the Deanship of Scientific Research at King Saud University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that supports the findings of this study are openly available on GitHub at https://github.com/AI4CHEMIA/IgnitionGPT (accessed on 22 January 2026).

Acknowledgments

The author would like to extend appreciation to the Deanship of Scientific Research at King Saud University for funding this work through the Waed Program (W25-5).

Conflicts of Interest

No potential conflicts of interest were reported by the author.

Abbreviations

AI: Artificial Intelligence
ASTM: American Society for Testing and Materials
CFD: Computational Fluid Dynamics
CN: Cetane Number
CP: Complex Polymer
DOI: Digital Object Identifier
IUPAC: International Union of Pure and Applied Chemistry
LLM: Large Language Model
MON: Motor Octane Number
NLP: Natural Language Processing
QSPR: Quantitative Structure-Property Relationship
RON: Research Octane Number

References

  1. Kalghatgi, G.T. Fuel anti-knock quality-Part I. Engine studies. In SAE Transactions 1993–2004; Society of Automotive Engineers: Warrendale, PA, USA, 2001. [Google Scholar]
  2. Mehl, M.; Pitz, W.J.; Westbrook, C.K.; Curran, H.J. Kinetic modeling of gasoline surrogate components and mixtures under engine conditions. Proc. Combust. Inst. 2011, 33, 193–200. [Google Scholar] [CrossRef]
  3. Sarathy, S.M.; Oßwald, P.; Hansen, N.; Kohse-Höinghaus, K. Alcohol combustion chemistry. Prog. Energy Combust. Sci. 2014, 44, 40–102. [Google Scholar] [CrossRef]
  4. Pitsch, H. The transition to sustainable combustion: Hydrogen-and carbon-based future fuels and methods for dealing with their challenges. Proc. Combust. Inst. 2024, 40, 105638. [Google Scholar] [CrossRef]
  5. Wilk-Jakubowski, J.L.; Pawlik, L.; Frej, D.; Wilk-Jakubowski, G. Data-driven computational methods in fuel combustion: A review of applications. Appl. Sci. 2025, 15, 7204. [Google Scholar] [CrossRef]
  6. Al-Rabiah, A.A.; Alshehri, A.S.; Ibn Idriss, A.; Abdelaziz, O.Y. Comparative Kinetic Analysis and Process Optimization for the Production of Dimethyl Ether via Methanol Dehydration over a γ-Alumina Catalyst. Chem. Eng. Technol. 2022, 45, 319–328. [Google Scholar] [CrossRef]
  7. Stolonogova, T. Change in the Functional Properties of Automobile Gasolines in the Presence of Mixtures of Ethanol and a Glycerin Ether. Chem. Technol. Fuels Oils 2025, 61, 898–902. [Google Scholar] [CrossRef]
  8. ASTM D2700-18; Standard Test Method for Motor Octane Number of Spark Ignition Engine Fuel. ASTM: West Conshohocken, PA, USA, 2011.
  9. ASTM D2699-12; Standard Test Method for Research Octane Number of Spark-Ignition Engine Fuel. ASTM International: West Conshohocken, PA, USA, 2012.
  10. Alshehri, A.S.; Tula, A.K.; Zhang, L.; Gani, R.; You, F. A Platform of Machine Learning-Based Next-Generation Property Estimation Methods for CAMD. In Computer Aided Chemical Engineering; Elsevier: Amsterdam, The Netherlands, 2021; Volume 50. [Google Scholar]
  11. Suzuki, S.; Mori, S. Flame synthesis of carbon nanotube through a diesel engine using normal dodecane/ethanol mixing fuel as a feedstock. J. Chem. Eng. Jpn. 2017, 50, 178–185. [Google Scholar] [CrossRef]
  12. Üstün, C.E.; Freitas, R.D.S.M.D.; Okafor, E.C.; Shahbakhti, M.; Jiang, X.; Paykani, A. Machine learning applications for predicting fuel ignition and flame properties: Current status and future perspectives. Energy Fuels 2025, 39, 13281–13314. [Google Scholar] [CrossRef]
  13. Tang, X.; Liao, H.; Gong, J. Machine Learning Approaches to Ignitability Classification of Solid Combustibles. In Combustion Science and Technology; Taylor & Francis: Oxfordshire, UK, 2025; pp. 1–24. [Google Scholar]
  14. Rittig, J.G.; Ritzert, M.; Schweidtmann, A.M.; Winkler, S.; Weber, J.M.; Morsch, P.; Heufer, K.A.; Grohe, M.; Mitsos, A.; Dahmen, M. Graph machine learning for design of high-octane fuels. AIChE J. 2023, 69, e17971. [Google Scholar] [CrossRef]
  15. Schweidtmann, A.M.; Rittig, J.G.; König, A.; Grohe, M.; Mitsos, A.; Dahmen, M. Graph neural networks for prediction of fuel ignition quality. Energy Fuels 2020, 34, 11395–11407. [Google Scholar] [CrossRef]
  16. Alshehri, A.S.; Gani, R.; You, F. Deep learning and knowledge-based methods for computer-aided molecular design—Toward a unified approach: State-of-the-art and future directions. Comput. Chem. Eng. 2020, 141, 107005. [Google Scholar] [CrossRef]
  17. Ye, G. De novo drug design as GPT language modeling: Large chemistry models with supervised and reinforcement learning. J. Comput.-Aided Mol. Des. 2024, 38, 20. [Google Scholar] [CrossRef]
  18. Alshehri, A.S.; Horstmann, K.A.; You, F. Versatile Deep Learning Pipeline for Transferable Chemical Data Extraction. J. Chem. Inf. Model. 2024, 64, 5888–5899. [Google Scholar] [CrossRef]
  19. Mavračić, J.; Court, C.J.; Isazawa, T.; Elliott, S.R.; Cole, J.M. ChemDataExtractor 2.0: Autopopulated Ontologies for Materials Science. J. Chem. Inf. Model. 2021, 61, 4280–4289. [Google Scholar] [CrossRef]
  20. Abdul Jameel, A.G.; Van Oudenhoven, V.; Emwas, A.-H.; Sarathy, S.M. Predicting octane number using nuclear magnetic resonance spectroscopy and artificial neural networks. Energy Fuels 2018, 32, 6309–6329. [Google Scholar] [CrossRef]
  21. Goldsmith, C.F.; Magoon, G.R.; Green, W.H. Database of small molecule thermochemistry for combustion. J. Phys. Chem. A 2012, 116, 9033–9057. [Google Scholar] [CrossRef] [PubMed]
  22. Keyvanpour, M.R.; Shirzad, M.B. An analysis of QSAR research based on machine learning concepts. Curr. Drug Discov. Technol. 2021, 18, 17–30. [Google Scholar] [CrossRef] [PubMed]
  23. Kessler, T.; Sacia, E.R.; Bell, A.T.; Mack, J.H. Predicting the cetane number of furanic biofuel candidates using an improved artificial neural network based on molecular structure. In Internal Combustion Engine Division Fall Technical Conference; American Society of Mechanical Engineers: New York, NY, USA, 2016; Volume 50503, p. V001T02A010. [Google Scholar]
  24. Li, R.; Herreros, J.M.; Tsolakis, A.; Yang, W. Machine learning-quantitative structure property relationship (ML-QSPR) method for fuel physicochemical properties prediction of multiple fuel types. Fuel 2021, 304, 121437. [Google Scholar] [CrossRef]
  25. Patel, R.; Rajaraman, T.; Rana, P.H.; Ambegaonkar, N.J.; Patel, S. A review on techno-economic analysis of lignocellulosic biorefinery producing biofuels and high-value products. Results Chem. 2025, 13, 102052. [Google Scholar] [CrossRef]
  26. Klein-Marcuschamer, D.; Simmons, B.A.; Blanch, H.W. Techno-economic analysis of a lignocellulosic ethanol biorefinery with ionic liquid pre-treatment. Biofuels Bioprod. Biorefin. 2011, 5, 562–569. [Google Scholar] [CrossRef]
  27. Zhongyang, L.; Oppong, F.; Wang, H.; Li, X.; Xu, C.; Wang, C. Investigating the laminar burning velocity of 2-methylfuran. Fuel 2018, 234, 1469–1480. [Google Scholar] [CrossRef]
  28. Cheng, Z.; He, S.; Xing, L.; Wei, L.; Li, W.; Li, T.; Yan, B.; Ma, W.; Chen, G. Experimental and kinetic modeling study of 2-methylfuran pyrolysis at low and atmospheric pressures. Energy Fuels 2017, 31, 896–903. [Google Scholar] [CrossRef]
  29. Guo, X. Feature-Based Localization Methods for Autonomous Vehicles. Doctoral Dissertation, Freie Universität Berlin Repository, Berlin, Germany, 2017. [Google Scholar]
  30. Swain, M.C.; Cole, J.M. ChemDataExtractor: A toolkit for automated extraction of chemical information from the scientific literature. J. Chem. Inf. Model. 2016, 56, 1894–1904. [Google Scholar] [CrossRef]
  31. Tshitoyan, V.; Dagdelen, J.; Weston, L.; Dunn, A.; Rong, Z.; Kononova, O.; Persson, K.A.; Ceder, G.; Jain, A. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 2019, 571, 95–98. [Google Scholar] [CrossRef]
  32. Weston, L.; Tshitoyan, V.; Dagdelen, J.; Kononova, O.; Trewartha, A.; Persson, K.A.; Ceder, G.; Jain, A. Named entity recognition and normalization applied to large-scale information extraction from the materials science literature. J. Chem. Inf. Model. 2019, 59, 3692–3702. [Google Scholar] [CrossRef] [PubMed]
  33. Wang, W.; Jiang, X.; Tian, S.; Liu, P.; Dang, D.; Su, Y.; Lookman, T.; Xie, J. Automated pipeline for superalloy data by text mining. npj Comput. Mater. 2022, 8, 9. [Google Scholar] [CrossRef]
  34. Agichtein, E.; Gravano, L. Snowball: Extracting relations from large plain-text collections. In Proceedings of the Fifth ACM Conference on Digital Libraries; Association for Computing Machinery: New York, NY, USA, 2000; pp. 85–94. [Google Scholar]
  35. Court, C.J.; Cole, J.M. Auto-generated materials database of Curie and Néel temperatures via semi-supervised relationship extraction. Sci. Data 2018, 5, 180111. [Google Scholar] [CrossRef]
  36. Isazawa, T.; Cole, J.M. Automated construction of a photocatalysis dataset for water-splitting applications. Sci. Data 2023, 10, 651. [Google Scholar] [CrossRef]
  37. Sierepeklis, O.; Cole, J.M. A thermoelectric materials database auto-generated from the scientific literature using ChemDataExtractor. Sci. Data 2022, 9, 648. [Google Scholar] [CrossRef]
  38. Huang, S.; Cole, J.M. BatteryDataExtractor: Battery-aware text-mining software embedded with BERT models. Chem. Sci. 2022, 13, 11487–11495. [Google Scholar] [CrossRef]
  39. Krishnan, N.A.; Kodamana, H.; Bhattoo, R. Machine Learning for Materials Discovery: Numerical Recipes and Practical Applications; Springer: Berlin/Heidelberg, Germany, 2024. [Google Scholar]
  40. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  41. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; Volume 30. [Google Scholar]
  42. Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A pretrained language model for scientific text. arXiv 2019, arXiv:1903.10676. [Google Scholar] [CrossRef]
  43. Huang, M.-S.; Han, J.-C.; Lin, P.-Y.; You, Y.-T.; Tsai, R.T.-H.; Hsu, W.-L. Surveying biomedical relation extraction: A critical examination of current datasets and the proposal of a new resource. Brief. Bioinform. 2024, 25, bbae132. [Google Scholar] [CrossRef]
  44. Alshehri, A.S.; Tantisujjatham, B.; Alrashed, M.M. Uncertainty-Aware Deep Reinforcement Learning Approach for Computational Molecular Design. Ind. Eng. Chem. Res. 2025, 64, 10117–10130. [Google Scholar] [CrossRef]
  45. Alshehri, A.S.; Bergman, M.T.; You, F.; Hall, C.K. Biophysics-guided uncertainty-aware deep learning uncovers high-affinity plastic-binding peptides. Digit. Discov. 2025, 4, 561–571. [Google Scholar] [CrossRef]
  46. Decardi-Nelson, B.; Alshehri, A.S.; You, F. Generative artificial intelligence in chemical engineering spans multiple scales. Front. Chem. Eng. 2024, 6, 1458156. [Google Scholar] [CrossRef]
  47. Almomtan, M.; Ibrahim, E.A.; Farooq, A. Fuelprop: Fuel property prediction from ATR-FTIR spectroscopic data. arXiv 2025, arXiv:2506.01601. [Google Scholar] [CrossRef]
  48. Polykovskiy, D.; Zhebrak, A.; Sanchez-Lengeling, B.; Golovanov, S.; Tatanov, O.; Belyaev, S.; Kurbanov, R.; Artamonov, A.; Aladinskiy, V.; Veselov, M.; et al. Molecular sets (MOSES): A benchmarking platform for molecular generation models. Front. Pharmacol. 2020, 11, 565644. [Google Scholar] [CrossRef] [PubMed]
  49. Kuzhagaliyeva, N.; Horváth, S.; Williams, J.; Nicolle, A.; Sarathy, S.M. Artificial intelligence-driven design of fuel mixtures. Commun. Chem. 2022, 5, 111. [Google Scholar] [CrossRef]
  50. Schweidtmann, A.M.; Rittig, J.G.; Weber, J.M.; Grohe, M.; Dahmen, M.; Leonhard, K.; Mitsos, A. Physical pooling functions in graph neural networks for molecular property prediction. Comput. Chem. Eng. 2023, 172, 108202. [Google Scholar] [CrossRef]
  51. Decardi-Nelson, B.; Alshehri, A.S.; Ajagekar, A.; You, F. Generative AI and process systems engineering: The next frontier. Comput. Chem. Eng. 2024, 187, 108723. [Google Scholar] [CrossRef]
  52. Krallinger, M.; Rabal, O.; Leitner, F.; Vazquez, M.; Salgado, D.; Lu, Z.; Leaman, R.; Lu, Y.; Ji, D.; Lowe, D.M.; et al. The CHEMDNER corpus of chemicals and drugs and its annotation principles. J. Cheminform. 2015, 7, S2. [Google Scholar] [CrossRef] [PubMed]
  53. Zhang, Y.; Wang, C.; Soukaseum, M.; Vlachos, D.G.; Fang, H. Unleashing the power of knowledge extraction from scientific literature in catalysis. J. Chem. Inf. Model. 2022, 62, 3316–3330. [Google Scholar] [CrossRef]
54. Zhang, Y.; Vlachos, D.G.; Liu, D.; Fang, H. Rapid adaptation of chemical named entity recognition using few-shot learning and LLM distillation. J. Chem. Inf. Model. 2025, 65, 4334–4345. [Google Scholar] [CrossRef]
55. Yang, X.; Zhuo, Y.; Zuo, J.; Zhang, X.; Wilson, S.; Petzold, L. PcMSP: A dataset for scientific action graphs extraction from polycrystalline materials synthesis procedure text. In Findings of the Association for Computational Linguistics: EMNLP 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 6033–6046. [Google Scholar]
  56. Xing, X.; Chen, P. Entity extraction of key elements in 110 police reports based on large language models. Appl. Sci. 2024, 14, 7819. [Google Scholar] [CrossRef]
57. Tunstall, L.; Reimers, N.; Jo, U.E.S.; Bates, L.; Korat, D.; Wasserblat, M.; Pereg, O. Efficient few-shot learning without prompts. arXiv 2022, arXiv:2209.11055. [Google Scholar] [CrossRef]
  58. Chen, P.; Wang, J.; Lin, H.; Zhao, D.; Yang, Z. Few-shot biomedical named entity recognition via knowledge-guided instance generation and prompt contrastive learning. Bioinformatics 2023, 39, btad496. [Google Scholar] [CrossRef]
  59. Han, R.; Yang, C.; Peng, T.; Tiwari, P.; Wan, X.; Liu, L.; Wang, B. An empirical study on information extraction using large language models. arXiv 2023, arXiv:2305.14450. [Google Scholar]
60. Eschbach-Dymanus, J.; Essenberger, F.; Buschbeck, B.; Exel, M. Exploring the effectiveness of LLM domain adaptation for business IT machine translation. In Proceedings of the 25th Annual Conference of the European Association for Machine Translation; European Association for Machine Translation: Tampere, Finland, 2024; Volume 1, pp. 610–622. [Google Scholar]
  61. Van Herck, J.; Victoria Gil, M.; Maik Jablonka, K.; Abrudan, A.; Anker, A.S.; Asgari, M.; Blaiszik, B.; Buffo, A.; Choudhury, L.; Corminboeuf, C.; et al. Assessment of fine-tuned large language models for real-world chemistry and material science applications. Chem. Sci. 2025, 16, 670–684. [Google Scholar] [CrossRef]
  62. Foppiano, L.; Lambard, G.; Amagasa, T.; Ishii, M. Mining experimental data from materials science literature with large language models: An evaluation study. Sci. Technol. Adv. Mater. Methods 2024, 4, 2356506. [Google Scholar] [CrossRef]
  63. Belkin, M.; Hsu, D.J.; Mitra, P. Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate. In Proceedings of the Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2018; Volume 31. [Google Scholar]
  64. Akhtar, M.; Reuel, A.; Soni, P.; Ahuja, S.; Ammanamanchi, P.S.; Rawal, R.; Zouhar, V.; Yadav, S.; Whitehouse, C.; Ki, D.; et al. When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation. arXiv 2026, arXiv:2602.16763. [Google Scholar] [CrossRef]
  65. Li, Y.; Shankar, V.S.B.; Yalamanchi, K.K.; Badra, J.; Nicolle, A.; Sarathy, S.M. Understanding the blending octane behaviour of unsaturated hydrocarbons: A case study of C4 molecules and comparison with toluene. Fuel 2020, 275, 117971. [Google Scholar] [CrossRef]
  66. Echekki, T.; Farooq, A.; Ihme, M.; Sarathy, S. Machine learning for combustion chemistry. In Machine Learning and Its Application to Reacting Flows: ML and Combustion; Springer International Publishing: Cham, Switzerland, 2023; pp. 117–147. [Google Scholar]
  67. Ji, W.; Su, X.; Pang, B.; Li, Y.; Ren, Z.; Deng, S. SGD-based optimization in modeling combustion kinetics: Case studies in tuning mechanistic and hybrid kinetic models. Fuel 2022, 324, 124560. [Google Scholar] [CrossRef]
Figure 1. High-level schematic of the IgnitionGPT development pipeline for automated extraction of fuel ignition properties from the scientific literature, including data curation, annotation, fine-tuning of a general-purpose LLM, and large-scale structured property generation.
Figure 2. Text and chemical entity statistics of the dataset: (a) distribution of word counts in extracted texts; (b) frequency of chemical representations in the fuel properties dataset.
Figure 3. Chemical categories and category groups in the fuel properties dataset, showing the number of chemicals per category (hydrocarbons: HC, alkanes: Alk, alkenes: AlkE, cycloalkanes: CycAlk, aromatics: Arom, oxygenated compounds: Oxy, alcohols: Alc, ethers: Eth, furan and cyclic ethers: FCE, aldehydes and ketones: A/K, carboxylic acids: CbxAc, esters: Est, polymers and macromolecules: Poly, complex polymers: CP; inorganic/elemental compounds: Inorg, elements: Elem).
Figure 4. Comparison of zero-shot GPT-4.1-mini performance on the full dataset with the best fine-tuned IgnitionGPT model results, emphasizing measurable improvements across evaluation metrics.
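The per-metric comparison in Figure 4 amounts to exact-match scoring of extracted (CHEMICAL, VALUE, PROPERTY) triples against the human annotations. A minimal Python scorer is sketched below for illustration; it is a generic exact-match implementation, not the evaluation script released with the paper.

def triple_scores(gold, predicted):
    """Precision, recall, and F1 over exact-match (chemical, value, property) triples."""
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)  # true positives: triples present in both sets
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example with the Table 1 annotation as ground truth:
# triple_scores([("n-heptane", "0", "Research Octane Number")],
#               [("n-heptane", "0", "Research Octane Number")])  # -> (1.0, 1.0, 1.0)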
Figure 5. Token-level accuracy of IgnitionGPT during fine-tuning across varying training data proportions. (a) Scatter plot with trend line showing how token accuracy changes with increasing data percentage; (b) Box plots summarizing the distribution of token accuracies at each data level, highlighting variance and stability across runs.
Figure 6. Distribution and correlation of token-level accuracy during IgnitionGPT training across the full dataset (a) and individual data splits: 10% (b), 20% (c), 50% (d), 80% (e), and 90% (f), illustrating intra-split variability and convergence trends throughout training epochs.
Figure 7. Joint distribution and correlation of training versus validation loss during IgnitionGPT fine-tuning. Panel (a) displays the aggregated distribution across the full dataset. Panels (b–f) illustrate loss dynamics for incremental data splits ranging from 10% to 90%. The tight clustering along the diagonal indicates strong correlation between training and validation performance, while the decreasing spread at higher data volumes (d–f) reflects improved stability and convergence.
Figure 8. Token accuracy progression during training across varying data sizes. Best-performing model checkpoints are indicated by red circles, highlighting points of optimal generalization before potential overfitting. The dashed lines represent the overall trend lines for both the training and validation accuracy.
Figure 9. Loss progression during training across varying data sizes. Best-performing model checkpoints are indicated by red circles, highlighting points of optimal generalization before potential overfitting. The dashed lines represent the overall trend lines for both the training and validation loss.
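The token-level accuracy tracked in Figures 5–9 is the position-wise match rate between generated tokens and the reference completion; the reported curves come from the fine-tuning provider's training logs. A stand-alone approximation is sketched below, assuming a tiktoken tokenizer (the cl100k_base encoding is an illustrative choice, not necessarily the one used internally by the model).

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding for illustration

def token_accuracy(reference: str, prediction: str) -> float:
    """Fraction of reference tokens matched position-wise by the prediction."""
    ref, pred = enc.encode(reference), enc.encode(prediction)
    if not ref:
        return 0.0
    return sum(r == p for r, p in zip(ref, pred)) / len(ref)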
Table 1. Representative JSONL entry from the curated fine-tuning dataset. The record demonstrates the data structure used to train IgnitionGPT, linking unstructured source text (from a patent) to structured ground-truth annotations, including the document identifier, chemical entity (n-heptane), and its associated ignition property (RON).
An Example JSONL Representation:
{
  "Properties": 1,
  "Example": 1,
  "DOI": "CN115287106B",
  "Text": {
    "Text": "the total content of the carbon six and the carbon seven alkanes is more than 60%, the benzene content is less than 0.5%, the olefin content is less than 3%, the density, the octane number and the vapor pressure are lower, and the cleanliness of small amount of aromatic hydrocarbon, olefin and components is higher. Further, the sum of the volume percentages of the 2-methylpentane and the 3-methylpentane in the isohexane is more than 99%. Further, the n-heptane has a research octane number of 0, and the main component of the n-heptane is seven alkane, the volume percentage content of the n-heptane is more than 98%, and the volume percentage content of the rest alkane components is less than 2%. Furthermore, the distillation range of the reformed gasoline is concentrated at 35–195°C"
  },
  "Chemical": {
    "name": "n-heptane",
    "value": "0",
    "property": ["Research Octane Number"]
  }
}
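Records in this layout can be streamed and validated line by line. The following minimal sketch assumes only the key schema shown above; the file name and helper are illustrative rather than part of the released tooling.

import json

REQUIRED_KEYS = {"Properties", "Example", "DOI", "Text", "Chemical"}

def load_records(path):
    """Yield validated annotation records from a JSONL file (one JSON object per line)."""
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            yield record

# Example: collect (chemical, property, value) tuples for inspection.
# triples = [(r["Chemical"]["name"], r["Chemical"]["property"][0], r["Chemical"]["value"])
#            for r in load_records("ignition_annotations.jsonl")]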
Table 2. Representative instruction-tuning instance used for supervised fine-tuning. The JSON object follows the conversational chat format, comprising three key components: a system message defining the model’s persona and task, a user message containing the input text, extraction constraints, and a one-shot example, and an assistant message containing the ground-truth structured output.
Instruction-Tuning Instance
{
  "messages": [
    {
      "role": "system",
      "content": "IgnitionGPT is a fuel property extraction assistant. It identifies chemical names and their fuel-relevant properties (e.g., Research Octane Number, Motor Octane Number (MON), and Cetane Number) in scientific text."
    },
    {
      "role": "user",
      "content": "Extract all occurrences of CHEMICAL, VALUE, and PROPERTY (e.g., Research Octane Number, Motor Octane Number (MON), and Cetane Number) for each fuel-related substance mentioned in the sentence below:\n$text_segment$\n\nOnly return the result as a strict JSON array of dictionaries.\n\nHere is an example:\nText: \"Propane has a motor octane number of 97 and a research octane number of 112.\"\n\nOutput:\n[{\"CHEMICAL\": \"propane\", \"VALUE\": \"97\", \"PROPERTY\": \"Motor Octane Number (MON)\"}, {\"CHEMICAL\": \"propane\", \"VALUE\": \"112\", \"PROPERTY\": \"Research Octane Number (RON)\"}]"
    },
    {
      "role": "assistant",
      "content": "$jsonl_style_output$"
    }
  ]
}
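A record in the Table 1 layout maps onto this chat format mechanically. The sketch below illustrates one such conversion under the assumption that the assistant target is the serialized list of annotated triples; the wording mirrors the templates above (the one-shot example in the user template is omitted for brevity), and the helper and file names are illustrative, not the paper's released code.

import json

SYSTEM_MSG = ("IgnitionGPT is a fuel property extraction assistant. It identifies "
              "chemical names and their fuel-relevant properties (e.g., Research Octane "
              "Number, Motor Octane Number (MON), and Cetane Number) in scientific text.")

def to_chat_instance(record):
    """Convert one annotated record (Table 1 layout) into a chat-format training instance."""
    text = record["Text"]["Text"]
    chem = record["Chemical"]
    target = [{"CHEMICAL": chem["name"], "VALUE": chem["value"], "PROPERTY": p}
              for p in chem["property"]]
    user = ("Extract all occurrences of CHEMICAL, VALUE, and PROPERTY for each "
            f"fuel-related substance mentioned in the sentence below:\n{text}\n\n"
            "Only return the result as a strict JSON array of dictionaries.")
    return {"messages": [{"role": "system", "content": SYSTEM_MSG},
                         {"role": "user", "content": user},
                         {"role": "assistant", "content": json.dumps(target)}]}

# Appending instances to a fine-tuning file, one JSON object per line:
# with open("ignitiongpt_train.jsonl", "a", encoding="utf-8") as f:
#     f.write(json.dumps(to_chat_instance(record), ensure_ascii=False) + "\n")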
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
