1. Introduction
In recent years, computational gastronomy has emerged as an interdisciplinary field that applies computational techniques such as data mining and machine learning to the study of food and cooking. The aim is to understand and model the complex interactions between ingredients, cooking methods, and the human perception of taste and nutrition. One of the primary goals of this field is to develop methods for ingredient substitution, aimed at improving the nutritional content, preserving the flavor integrity, and aligning meals with specific dietary needs. A key focus has been the integration of phytochemically enriched ingredients into diets, which has shown in silico potential to target biological networks of chronic diseases like cancer [
1], Alzheimer’s disease (AD) [
2], and COVID-19 [
3].
Phytochemicals, bioactive compounds found in plants, have gathered significant attention due to their antioxidant, anti-inflammatory, and anti-carcinogenic properties. Preclinical studies suggest that these compounds may play a role in disease prevention and treatment. For instance, brassinolide, a phytochemical present in tea, has shown potential to inhibit tumor growth and induce apoptosis in cancer cells [
4]. In the context of AD, quercetin, found in extra virgin olive oil, has been linked to improved brain health by exhibiting antioxidant and anti-inflammatory effects [
5]. Moreover, genistein, a phytochemical in blackcurrant, has been investigated for its immune-supporting properties, including its potential to modulate inflammation and interfere with viral replication, making it relevant in the study of COVID-19 [
6].
Initial attempts at ingredient substitution utilized statistical methods, such as Term Frequency–Inverse Document Frequency (TF-IDF), to identify potential substitutes based on occurrence patterns within large recipe datasets. TF-IDF is a numerical statistic that reflects how important a word (or ingredient) is to a document (or recipe) in a corpus. It does this by considering both the frequency of the word in the document and the rarity of the word across all documents. In the context of ingredient substitution, TF-IDF helps highlight ingredients that are significant within certain recipes but not ubiquitous across all recipes, thus identifying potential substitutes based on uniqueness and relevance [
7,
8,
9]. Later, co-occurrence-based methods refined this approach by constructing ingredient networks that map relationships across recipes. In these networks, ingredients are represented as nodes, and the edges between them indicate how frequently they appear together. By analyzing these networks, researchers could suggest substitutes based on ingredients’ mutual presence in culinary contexts, identifying clusters of ingredients that are often used interchangeably [
10,
11,
12,
13,
14]. The introduction of language model-based methods marked a significant evolution, utilizing natural language processing techniques such as word2vec [
15], BERT [
16], and R-BERT [
17] to capture semantic relationships between ingredients. Word2vec generates vector representations of words (or ingredients) based on their contexts in the text, allowing the model to identify ingredients with similar contexts or meanings. BERT (Bidirectional Encoder Representations from Transformers) goes further by understanding the bidirectional context of words, providing a deeper semantic understanding. R-BERT specializes in relation extraction, identifying and classifying relationships between entities—in this case, between ingredients. This approach proved effective in improving ingredient substitution tasks through learned embeddings that capture complex semantic relationships [
18], although language models require substantial computational resources and may not always capture the full culinary context, such as flavor profiles or cooking techniques.
More recently, graph neural networks (GNNs) have been utilized to combine the relational information encoded in ingredient graphs with the specific context of given recipes, leading to a deeper understanding of ingredient interactions [
18]. GNNs are designed to operate on graph structures, modeling the dependencies and relationships between nodes (ingredients) and edges (their relationships). Large-scale graphs, such as FlavorGraph, have been introduced to explore ingredient substitutions and food pairings [
19]. FlavorGraph connects ingredients based on shared flavor compounds and culinary usage, providing a rich dataset for analyzing how ingredients relate on a molecular level. This graph-based approach allows for the identification of substitutes that are not only contextually appropriate but also compatible in terms of flavor and chemistry. However, success in this area relies heavily on the quality and curation of the underlying graph data; inaccuracies or omissions can significantly affect the model’s performance. Building on this approach, GISMo was introduced—a GNN-based model that incorporates both recipe-specific contexts and ingredient relationships from FlavorGraph. By constructing a benchmark dataset, Recipe1MSubs, which includes ingredient substitution pairs extracted from user comments, GISMo significantly outperforms previous methods in ranking plausible ingredient substitutions. Specifically, it achieved a performance improvement of at least 14% in the top substitute ingredient prediction, as measured by the Hit@1 metric, over existing models [
20]. In the context of ingredient substitution, Hit@1 evaluates whether the model’s first (most confident) suggested substitute matches the actual substitute used in the recipe. This metric is important because it reflects the model’s effectiveness in providing accurate substitution suggestions on the first attempt, which is essential for real-world application.
The latest stage in this evolving field is represented by LLMs, which promise to overcome the limitations of previous approaches by leveraging their capacity for understanding and generating human-like text [
21,
22]. The introduction of LLMs, such as GPT-3 developed by OpenAI [
21], presents an approach to address the limitations of previous methods for ingredient substitution. Furthermore, while language model-based methods and GNNs represent significant advancements, they still face challenges in capturing the full culinary context and ensuring gastronomically sensible substitutions [
20]. LLMs, trained on extensive and diverse culinary datasets, can potentially offer more contextually aware and accurate ingredient substitutions by leveraging their understanding of both the syntax and semantics of culinary texts [
23]. This capacity for high-level language comprehension and manipulation allows for considering factors such as ingredient compatibility. Importantly, LLMs can be fine-tuned for specific tasks such as ingredient substitution [
22].
Recognizing the limitations of statistical, co-occurrence, language, and GNN-based methods, our research proposes a unique approach by leveraging the capabilities of LLMs for ingredient substitution. LLMs, such as GPT-3.5 [
22], DaVinci [
21], and Meta’s TinyLlama [
24], have demonstrated state-of-the-art performance across a range of natural language processing tasks, from text generation to semantic understanding [
16,
25]. By fine-tuning these models on a dataset of recipes and ingredient substitutes, we aim to develop an algorithm that not only understands the interplay of flavors and nutritional aspects in cooking but also tailors suggestions to the preferences and requirements of each user. In this paper, we benchmarked our ingredient substitution algorithm against the current state of the art GISMo to demonstrate its superiority in generating contextually appropriate ingredient substitutions. We used the Hit@1 accuracy metric to benchmark our models’ performance against state-of-the-art methods. After identifying phytochemically enriched substitutes, we generated a new set of recipes aimed at targeting biological networks associated with cancer, AD, and COVID-19 (
Figure 1).
The research hypothesis of the article is that LLMs can achieve higher accuracy in ingredient substitution tasks compared to the current state-of-the-art GISMo model when evaluated on a standardized dataset. The main contributions of this paper are (1) enhanced accuracy in ingredient substitution, (2) a novel dataset filtration process, and (3) the generation of phytochemically enriched recipes. The rest of this paper is organized as follows. In 
Section 2, we detail the materials and methods employed, including the datasets used, the fine-tuning of LLMs, and the evaluation metrics for ingredient substitution accuracy. 
Section 3 presents the results of our experiments, comparing the performance of fine-tuned LLMs with the GISMo model and showing the generation of phytochemically enriched recipes. 
Section 4 provides a discussion of our findings, highlighting the improvements achieved, the implications for computational gastronomy, and the limitations of our approach. Finally, 
Section 5 concludes the paper by summarizing our contributions and suggesting directions for future research in integrating AI with nutritional science to promote healthier eating practices.
  2. Materials and Methods
  2.1. Recipe and Ingredient Substitution Datasets
Our research started with the study of the Recipe1MSubs dataset, provided by Meta, containing 70,520 pairs of ingredient substitutes with the respective recipes [
20], which is a subset of the Recipe1M dataset [
26]. The Recipe1MSubs dataset was separated into 49,044 data points designated for training, 10,729 for validation, and 10,747 for testing. Each recipe within this dataset is organized in a structured format, beginning with the recipe title, followed by the list of ingredients, with associated quantities, and finally, the cooking instructions. The original GISMo model was trained on this dataset using a methodology focused on ingredient context and co-occurrence as the benchmark for our study.
  2.2. GISMo Benchmark
To establish a baseline for comparison with our new LLM-based models, we re-implemented and re-ran the GISMo as described in the original study. We set the learning rate to 5 × 10
−5, weight decay to 0.0001, and used an embedding dimension of 300 to represent ingredients in a continuous vector space. The model consists of two graph convolutional layers, each with 300 hidden units, and applies a dropout rate of 0.25 to reduce overfitting. Training was conducted over 400 epochs using regular negative sampling, where for each positive substitution pair, negative examples were generated by randomly selecting non-substitutable ingredients, and embeddings were initialized randomly. Average pooling was used for contextual embedding to aggregate information from neighboring nodes, enhancing the model’s context sensitivity without altering the original dataset’s composition. We re-ran GISMo not only to replicate the results of the original study but also to establish a standard benchmark against which we could evaluate the performance of our newer LLM-based models, ensuring that any improvements in ingredient substitution accuracy were attributable to the capabilities of the LLMs rather than differences in the experimental setup. Furthermore, we introduced enhancements to the GISMo model by incorporating each ingredient’s food category as an additional node feature in the graph, providing higher-level semantic knowledge to potentially improve substitution suggestions (as described in 
Section 2.3). Additionally, we applied a dataset filtration process (as described in 
Section 2.4) to the original Recipe1MSubs dataset to remove incorrect or unsuitable substitutions, training GISMo on this filtered dataset to assess whether cleaner training data could enhance the model’s performance.
  2.3. Incorporation in GISMo of a Food Category Feature
Using GPT-4-0613, the latest of OpenAI’s language models, we categorized ingredients into predefined culinary groups. This process involved a Python script utilizing the pandas library for dataset manipulation and the openai library for API interactions. A function, categorize_ingredient, was used to query GPT-4-0613 with each ingredient, requesting its classification into one of 23 categories ranging from common food groups like Fruits and Vegetables to more specialized ones such as Confectioneries and Aquatic foods (
Appendix A). By setting the temperature parameter to 0, the script prioritized reproducibility to ensure consistency in GPT-4-0613’s responses. This approach processed a CSV file of ingredients, appending a category column with the GPT-4-0613-determined categories to the dataset. The augmented dataset, saved as a new CSV file, served as a tool for ingredient substitution models, enabling more contextually relevant substitutions.
  2.4. Dataset Filtration Based on Substitution Validity
To enhance the ingredient substitution model’s accuracy, we used GPT-3.5-Turbo with an asynchronous Python script to evaluate the validity of the proposed ingredient substitutions. This process involved sending detailed prompts to GPT-3.5-Turbo, asking if one ingredient could feasibly substitute another within a specific recipe, and classifying responses into Correct, Potential, or Incorrect to determine their suitability. By processing substitutions in multiple batches using the aiohttp library for asynchronous HTTP requests, we efficiently assessed the 70,520 substitutions, thereby accelerating the evaluation process. Substitutions categorized as Correct were considered suitable and retained, Potential indicated possible suitability requiring further consideration, and Incorrect were considered inappropriate, leading to their removal from the dataset. The final results, saved into a JSON file, formed a filtered dataset for retraining the model, ensuring it was based on accurate substitution data. Key settings included a prediction temperature of 0.5, a limit of 10 output tokens, and five runs to evaluate the prediction stability. In addition, there was a batch size of 100 substitutions with respective recipes to avoid reaching the maximum number of requests per second.
  2.5. Fine-Tuning Language Models for Substitution Predictions
We used GPT-3.5-Turbo-1106, DaVinci-002, and TinyLlama-1.1B to predict viable ingredient substitutions, fine-tuning each with consistent specifications to ensure comparability. Key settings included a prediction temperature of 0.5, a limit of 10 output tokens, and five runs to evaluate the prediction stability, all conducted over a single epoch.
For the fine-tuning process for TinyLlama-1.1B models in our experimental configuration, we refined our model’s fine-tuning process with selected hyperparameters encapsulated within the TrainingArguments setup. This configuration specified an output directory, a per-device train batch size of 8 (due to memory constraints) and applied gradient accumulation over 4 steps to efficiently balance computational demand and memory constraints. The model optimization was conducted using paged_adamw_32bit with a learning rate set at 5 × 10−4, and a cosine learning rate scheduler was employed for optimal learning rate adjustments throughout the training phase. A save strategy based on epochs was utilized, coupled with logging and evaluation intervals set at 25 and 50 steps, respectively, aligning with an evaluation strategy that triggers at specified steps to closely monitor the model’s performance. The training was streamlined to complete within 1 epoch to ensure quick adaptation while preventing overfitting, without setting a maximum step limit and avoiding mixed precision training to maintain computational accuracy. The SFTTrainer was used in the training process, directly interfacing with the training and validation datasets, and was configured with peft_config for tailored pre-fine-tuning adjustments. This allowed us to set our specified hyperparameters and training configurations. Text preprocessing was managed using a specified dataset_text_field and tokenizer, with packing disabled and a maximum sequence length of 512 to standardize input data handling. This approach aimed at enhancing the model’s learning efficiency, prioritizing a balance between optimizing the computational resources and achieving high-quality model training.
Building upon the filtration methodology outlined above, we randomly chose one of the filtered datasets and fine-tuned four final models considering only the Correct substitutions to further refine the accuracy of predictions. TinyLlama-1.1B, DaVinci-002, GPT-3.5-Turbo-1106, and GISMo models were fine-tuned incorporating these high-quality substitutions.
Training samples were provided in prompt completion format for DaVinci-002 and TinyLlama-1.1B and chat completion format for GPT-3.5-Turbo-1106. The number of epochs, training steps, and batch sizes chosen are detailed in 
Appendix B.
  2.6. Evaluation of Ingredient Substitution Accuracy
To validate the accuracy of the ingredient substitution predictions generated by LLMs, we developed an algorithm to standardize and process ingredient names before comparing them to a ground truth dataset derived from the Recipe1M dataset. We began by extracting predictions from the model output, where each line contained an original ingredient, its corresponding ground truth substitute, and the predicted substitution. To ensure consistency across ingredient names, several preprocessing steps were applied, including converting all text to lowercase, removing numeric values, and applying predefined rules to replace or eliminate special characters. This normalization was intended to maintain uniformity in ingredient representation. After preprocessing, a clustering mechanism was used to group similar ingredients, accounting for variations in lexical forms such as singular and plural versions or different types of the same ingredient (e.g., basmati rice and long grain rice). Each ingredient was assigned a unique cluster identifier to ensure that similar ingredients were treated as equivalent during comparison.
Once the LLM-predicted ingredient names were uniformized and categorized, the core of the evaluation involved comparing the predicted substitutions against the ground truth using the Hit@1 metric. This metric assessed the model’s precision by determining whether the first predicted substitution matched the ground truth or fell within the same ingredient cluster. For example, if the ground truth substitution was barley and the model predicted basmati rice, both ingredients would be considered correct if they belong to the same grain cluster. Hit@1 focuses on measuring the accuracy of the model’s top recommendation, as this is the most critical in real-world applications where users often act on the first suggestion. By prioritizing precision in the initial substitution, Hit@1 provides a measure of the LLM’s ability to generate viable and contextually appropriate ingredient substitutions.
  2.7. Phytochemically Enriched Recipe Generation
Finally, we integrated phytochemically enriched ingredients based on their ability to target molecular networks responsible for disease development in cancer [
1], AD [
2], and COVID-19 [
3]. By applying the best-performing model from our comparative analysis, we substituted all ingredients across our dataset with alternatives that elevated the content of the targeted phytochemicals. The recipes were then evaluated and ranked based on their cumulative phytochemical profile. Only salads were considered given the lower number of cooking processes involved in their preparation and, consequently, the higher chances of phytochemical preservation [
27] (
Figure 2).
  4. Discussion
Our study validated the research hypothesis that LLMs can achieve higher accuracy in ingredient substitution tasks compared to the current state-of-the-art GISMo model when evaluated on a standardized dataset. The fine-tuned GPT-3.5-Turbo-1106 model achieved a Hit@1 accuracy of 54.46% on the filtered Recipe1MSubs dataset, significantly outperforming GISMo’s 40.24%. This substantial improvement demonstrates that LLMs have a higher capacity to understand and generate contextually appropriate ingredient substitutions. Building upon this validation, we discuss in more detail the aspects that contributed to the improved performance of the LLMs over GISMo. The following subsections discuss the incorporation of food category features, the impact of dataset filtration based on substitution validity, the fine-tuning process of the LLMs, the generation of phytochemically enriched recipes, and the ethical and economic considerations of our approach.
  4.1. Incorporation in GISMo of a Food Category Feature
An initial strategy we explored was the enhancement of the GISMo model through the incorporation of an additional node feature—food categories for each ingredient, classified into one of the 23 categories utilized in FooDB, based on classifications retrieved via the GPT-4. However, contrary to our expectations, this modification did not yield any improvements in the model’s performance. This outcome may be attributed to several factors. Firstly, including this additional categorical information might have led to overfitting the model to the training data, compromising its ability to generalize to unseen data in the test set (available in our repository). Additionally, another potential reason could be that part of the value of ingredient categorization might have been indirectly achieved by the model’s consideration of ingredient co-occurrence in recipes alongside the presence of flavor molecules. These inherent features within the training data might already provide a basis for the model to make substitution predictions without the need for explicit categorical labels.
  4.2. Dataset Filtration Based on Substitution Validity
With the goal of optimizing ingredient substitution, our study introduced an improvement by integrating the capabilities of GPT-3.5 with the GISMo model. While GISMo independently showcased a threefold enhancement in performance compared to prior methods [
28], our approach to refine the GISMo model through the preliminary filtration of the original dataset via GPT-3.5’s API further increased this improvement. This filtration process involved the exclusion of Potential and Incorrect substitutions from the dataset, thereby ensuring a higher quality of data for model training and application.
The filtration step encompassed five different datasets, and although one was randomly selected to rerun GISMo, the improved results are generalizable across all, due to their almost perfect ingredient substitute similarity across the training, validation, and testing datasets. To demonstrate the consistency of our filtration process, here are examples of substitutions that were consistently classified across the five runs: (A) correct substitutions: orange juice to pineapple juice, carrot to red pepper, black bean to chickpea, basil to dried oregano, onion to shallot; (B) potential substitutions: lemon to orange, apple to peach, apple to apricot, water to wine, blueberry to strawberry; (C) incorrect substitutions: seedless watermelon to lime, fresh cilantro to ground coriander, horseradish to honey, carrot to seasoning salt, clove to garlic.
  4.3. LLM Fine-Tuning for Ingredient Substitution
Using Recipe1MSubs dataset, our experiments explored the benefits of fine-tuning DaVinci, TinyLlama, and GPT-3.5. The first two models did not demonstrate any performance enhancements over the initial method. In contrast, the fine-tuned model leveraging the GPT-3.5 showed a 4% improvement in performance over the GISMo model. Building upon this Recipe1MSubs filtered dataset, we ventured to fine-tune the same three models. Again, the GPT-3.5 model was the only one that showed an increase in performance (20%) when compared with current state of the art.
The findings of this study underscore the importance of data quality and model compatibility in the development of ingredient substitution algorithms. The superior performance achieved through the combination of GPT-3.5’s advanced language processing capabilities and the GISMo model’s framework highlights the potential of leveraging state-of-the-art AI technologies to refine and enhance existing computational models.
  4.4. Phytochemically Enriched Recipe Generation
We specifically selected examples of recipes with ingredients phytochemically enriched targeting COVID-19; COVID-19 and AD; and COVID-19, AD, and cancer molecular networks. Those were Watercress Salad, Kale and Quinoa Salad, and Thai-Style Beef Salad, respectively (
Appendix E). We exclusively considered salads in this analysis due to their minimal food processing steps. This choice was made because fewer processing steps generally help preserve the phytochemicals with the health benefits discussed. Salads undergo minimal thermal processing, which helps maintain the integrity of essential nutrients and active compounds compared to more extensively cooked dishes [
27].
  4.5. Ethical and Economical Considerations
Our research advances computational gastronomy with significant economic and ethical implications, as highlighted in studies on LLMs in food science [
29,
30,
31]. Economically, LLMs enable cost reduction and innovation through ingredient substitution and recipe optimization, promoting personalized nutrition services and creating new revenue streams for the food industry and healthcare sectors. Additionally, AI-driven personalized recommendation systems, including multimedia food logging and geolocation-based food maps, enhance customer satisfaction and loyalty [
30]. Ethically, the deployment of LLMs raises concerns about data biases, misinformation, and privacy, necessitating careful data curation and transparency to ensure fairness and to prevent misleading consumers. Integrating QR code technologies into food labeling further promotes ethical practices by providing transparent detailed product information, thereby enhancing food safety and consumer trust [
29]. Balancing these economic benefits with ethical considerations is essential to responsibly harness AI’s potential in food science.
  4.6. Limitations
One inherent limitation is the diversity of the training datasets used to fine-tune the LLMs. Although these datasets are extensive, they may not fully capture the vast diversity of global cuisines and dietary preferences, potentially impacting the model’s ability to generalize across different culinary traditions and suggest culturally and regionally appropriate substitutions. Additionally, the methodology primarily focuses on textual data, which might not capture the full spectrum of culinary contexts, including taste profiles, textures, and the interplay of flavors. LLMs, while proficient in parsing and generating text, have limited capacity to understand and replicate the sensory experiences of cooking and eating.
Additionally, the fine-tuning process, especially when using a limited set of high-quality substitutions, poses a risk of overfitting, where models may become overly specialized to the training data and less capable of generalizing to unseen recipes or ingredients.
Furthermore, the reliance on the Hit@1 metric, while providing a clear measure of the model’s ability to suggest the correct first substitution, does not capture the overall utility and flexibility of the model in providing a range of suitable alternatives.
Finally, the computational resources required for fine-tuning and deploying LLMs may also limit the accessibility of these advanced tools to researchers and practitioners with limited resources.