Article

Using Large Language Models to Extract Structured Data from Health Coaching Dialogues: A Comparative Study of Code Generation Versus Direct Information Extraction

by Sai Sangameswara Aadithya Kanduri, Apoorv Prasad and Susan McRoy *
Department of Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI 53211, USA
* Author to whom correspondence should be addressed.
BioMedInformatics 2025, 5(3), 50; https://doi.org/10.3390/biomedinformatics5030050
Submission received: 4 June 2025 / Revised: 9 August 2025 / Accepted: 28 August 2025 / Published: 4 September 2025
(This article belongs to the Section Methods in Biomedical Informatics)

Abstract

Background: Virtual coaching can help people adopt new healthful behaviors by encouraging them to set specific goals and helping them review their progress. One challenge in creating such systems is analyzing clients’ statements about their activities. Limiting people to selecting among predefined answers detracts from the naturalness of conversations and user engagement. Large Language Models (LLMs) offer the promise of covering a wide range of expressions. However, using an LLM for simple entity extraction would not necessarily perform better than functions coded in a programming language, while creating higher long-term costs. Methods: This study uses a real data set of annotated human coaching dialogs to develop LLM-based models for two training scenarios: one that generates pattern-matching functions and another that performs direct extraction. We use models of different sizes and complexity, including Meta-Llama, Gemma, and ChatGPT, and calculate their speed and accuracy. Results: LLM-generated pattern-matching functions took an average of 10 milliseconds (ms) per item, compared to 900 ms (ChatGPT-3.5-Turbo) to 5 s (Llama 2 70B) for direct extraction. The accuracy of pattern matching was 99% on real data, while LLM accuracy ranged from 90% (Llama 2 70B) to 100% (ChatGPT-3.5-Turbo) on both real and synthetically generated examples created for fine-tuning. Conclusions: These findings suggest promising directions for future research that combines both methods (reserving the LLM for cases that cannot be matched directly) or that uses LLMs to generate synthetic training data with more expressive variety, which can be used to improve the coverage of either generated code or fine-tuned models.

1. Introduction

Virtual health coaches should be able to follow expert-defined counseling protocols such as Brief Action Planning [1] or Cognitive Behavioral Therapy [2]. However, implementing systems that follow a protocol requires significant manual effort, including providing explicit descriptions of specific dialogue sequences as turn-by-turn scripts that cover what the coach says, how the client can respond, and what task-related values those responses indicate, as in authoring for narrative games [3,4]. Another limitation of such approaches is that, because user inputs are restricted to a fixed set of responses, the range and form of these expressions must be anticipated by the authors of scripts. Such explicit authoring can result in rigid and predictable interactions, potentially diminishing user engagement. For example, to elicit the specific entities that form parts of a plan, a scripted system might ask the patient a long list of simple questions (“How often will you walk?” or “How many steps will you try to do?” etc.). This type of interaction quickly becomes repetitive and may discourage long-term use.
One way to avoid this rigidity would be to allow free-text inputs within the context of a scripted sequence that follows a protocol, allowing coaches to keep the dialog focused while also allowing users to respond in their own words, using complete sentences. Allowing free text requires robust methods to extract the expressions that correspond to the goal-related entities within the text, such as the target number of steps. There are two primary methodologies for automatically extracting task-related information: methods based on rules and methods based on machine learning, including Large Language Models (LLMs). LLMs are relatively new and very powerful. Thus, it may be tempting to use them for the entire coaching task, including extracting goal-related entities. However, there are tradeoffs to consider. These include the difficulty of implementing a solution, the efficiency and accuracy of that solution, and the ongoing costs involved.
Rule-based pattern-matching methods explicitly specify the sequence of words that can express each type of entity. A common format for specifying patterns is “regular expressions” (RegEx), which are sequences of literal characters, metacharacters, and various operators that define the rules. Support for processing text using RegEx is included in many programming languages and is near instantaneous at runtime. However, manually designing comprehensive patterns that accommodate the variability of natural language is complex and time-intensive, even for experts [5]. Machine learning (ML) methods can automatically build models that identify sequences of words that comprise known entity types, a task known as “entity recognition,” but the internal functions for recognizing entities are not directly observable or editable by humans. Instead, these models are trained using datasets of examples that have been manually coded. A wide range of ML methods have been used for entity recognition, including support vector machines (SVMs) [6], Hidden Markov Models (HMMs) [7], Maximum Entropy Markov Models (MEMMs) [8], Conditional Random Fields (CRFs) [9], and, most recently, LLMs [10].
Generative LLMs [11,12,13] are models that have been trained to output a sequence of symbols (which can be text, a sequence of classification labels, a set of feature–value pairs, or software code, including regular expressions) when given an input sequence and embedded instructions. These instructions often include a small sample of manually coded data to increase their accuracy. LLMs are “pretrained” on large collections of general unstructured text, with internal parameters that have been jointly optimized for a wide range of tasks. LLMs are generative when they can output sequences never seen in the training data, merging information learned from different tasks. Generative LLMs provide a promising alternative for entity extraction in cases where only a small amount of data has been annotated and a large amount of expressive flexibility is desired. A known disadvantage of using LLMs for ongoing information extraction is the extensive power and memory needed, as they have billions of internal parameters. Using them on a personal device requires either interaction with cloud services or reliance on models that are less powerful than the state of the art. Any use of an LLM also requires some experimentation. One must determine which instructions (prompts), training strategies (fine-tuning), models, and internal settings provide the best output quality, as small differences can have a significant impact on results, which can vary slightly with each conversation. There is also no way to predict how long it will take an LLM to produce an answer, as it depends on a combination of factors, including the number of generated tokens, the model size and architecture, and the hardware used to run the model. However, even if using an LLM proves to be unsuitable for some tasks (such as real-time virtual coaching), they may still be useful for others, such as helping a programmer to create a set of RegEx patterns that covers a given dataset.
To determine the relevance of models for creating real-time services, such as automated health coaches or computer-assisted tools for human coaches, one must consider both accuracy and execution time. Offline tasks can focus on accuracy, as execution time will not matter, whereas real-time tasks need to be both fast and accurate. Offline extraction would be sufficient to support follow-up in a future interaction or for documenting progress. Real-time extraction would be necessary for virtual coaching or for making real-time suggestions to a human coach that are specific to a goal under discussion. For online services, it is well established that response times of up to about 500 ms are acceptable [14,15].
This study compares (a) the use of state-of-the-art LLMs on a “one-time” basis to derive explicit pattern-matching functions, written to use regular expressions, and (b) the use of a sample of medium-sized (7B parameter) and larger (70B parameter) LLMs to directly extract information, which would require running inferences for every statement in real time. In the first case, we use LLMs only to avoid the effort of manually coding an extensive set of patterns as regular expressions, while keeping the interpretability and speed of executing the code. (We allow that it might be necessary to perform some minor refinements to the final Python code (version 3.13.5), as described below.) For direct extraction, we develop and optimize a representative set of LLMs to interpret user inputs. Given the variability in LLM performance across different models and configurations, we also compare multiple prompting strategies and fine-tuning techniques to determine the most effective method for integrating LLMs into real-time dialogue systems. For fine-tuning, we used a combination of real and synthesized data, creating over 10,000 examples to address the data needs of the approach. This study aims to contribute to the advancement of hybrid conversational models, bridging the gap between explicit scripting and flexible conversational interactions performed with LLMs.

2. Materials and Methods

2.1. Large Language Models Used

To systematically evaluate the direct extraction capabilities of large language models, we conducted experiments using a diverse set of both open-source and commercial models. This selection strategy enables comprehensive assessment across different model scales, architectural approaches, and accessibility paradigms (see Table 1). So-called “open” models (Meta-Llama 2 series and Google Gemma 7B) provide transparency and reproducibility for detailed analysis, while the commercial model (ChatGPT-3.5-Turbo) represents current industry standards for practical deployment scenarios. In practice, the open models provide open access to internal trained parameters (the weights), but not the data used to create them, so one can download them to use them offline, but one cannot recreate them independently, and they may have licensing restrictions.
The experimental design specifically incorporates models ranging from 7 billion to 70 billion parameters to investigate how model scale influences extraction performance. Additionally, by including both instruction-tuned variants (Chat and Instruct models) and an RLHF (Reinforcement Learning from Human Feedback)-optimized commercial system, we can evaluate how different training methodologies affect direct extraction capabilities across various use cases.
The instructions and examples given to an LLM at runtime are known as a “context.” Context window sizes differ among the chosen models, ranging from 4096 tokens (Meta-Llama 2) and 8192 tokens (Gemma) for the open-model options, to 16,385 tokens for the commercial ChatGPT-3.5-Turbo model [16,17,18]. This context length enables processing of substantial input documents while preserving the ability to generate detailed extraction outputs, as the total token budget encompasses both input text and generated responses. Furthermore, the instruction-following capabilities inherent in these variants make them ideal for zero-shot and few-shot extraction tasks, where models must interpret extraction requirements from natural language prompts without extensive fine-tuning.

2.2. Architecture

This study compares the accuracy and execution time of extracting information from conversational text using (1) RegEx-based pattern-matching code created with the assistance of Large Language Models (LLMs) versus (2) the direct use of end-to-end instructed LLMs (see Figure 1). Both approaches accept the same text input, e.g., “I want to walk 10,000 steps to stay healthy.” Both attempt to extract key attributes from user statements about goals, such as their specificity, measurability, attainability, and frequency, aligned with the SMART goal framework [19], and produce output as formatted text. For the pattern-matching approach, regular expressions were produced by an LLM specifically prompted to generate RegEx rules, embedded in a simple Python function that extracts relevant entities and converts them into a structured set of attribute–value pairs, e.g., “{attribute1: value1, attribute2: value2, …},” known as a Python “dictionary.” For the direct interpretation approach, Large Language Models are prompted to extract the same attributes from free-form text input and provide a similar set of attribute–value pairs as the output. The key differences will be their average accuracy, coverage, and execution time.
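To make the contrast concrete, the sketch below shows the general shape of the first pipeline: a plain Python function with a few RegEx rules applied to one user statement, returning a dictionary of attribute–value pairs. The patterns here are simplified illustrations, not the rules actually generated in this study.

```python
import re

def extract_goal_attributes(text: str) -> dict:
    """Illustrative sketch of an LLM-drafted pattern-matching extractor.

    The patterns below are simplified examples, not the RegEx rules generated
    in the study; they only show the overall structure: match each attribute,
    then return a dictionary of attribute-value pairs.
    """
    result = {"Measurability": None, "Specificity": None,
              "Attainability": None, "Frequency": None}

    # Measurability: a numeric step target such as "10,000 steps".
    m = re.search(r"(\d{1,3}(?:,\d{3})*|\d+)\s*(?:steps?)", text, re.IGNORECASE)
    if m:
        result["Measurability"] = m.group(1).replace(",", "")
        result["Specificity"] = "steps"

    # Frequency: a few common period expressions.
    f = re.search(r"\b(daily|weekly|weekend|monday\s*(?:to|-)\s*friday)\b",
                  text, re.IGNORECASE)
    if f:
        result["Frequency"] = f.group(1).title()

    return result

print(extract_goal_attributes("I want to walk 10,000 steps daily to stay healthy."))
# {'Measurability': '10000', 'Specificity': 'steps', 'Attainability': None, 'Frequency': 'Daily'}
```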

2.3. Input Format

For this study, the input data consists of text messages sent between a patient and a human health coach. This data has already been annotated to identify four key attributes related to goal-setting (discussed in more detail below) [20]. These four attributes are as follows:
  • Measurability (e.g., “walk 10,000 steps”);
  • Specificity (e.g., “steps” or “running”);
  • Attainability (estimated difficulty level);
  • Frequency (e.g., “daily” or “weekly”).
Each goal attribute has a variety of possible expressions, including precise numeric goals (“walk 10,000 steps”), range-based goals (“walk between 8000 and 10,000 steps”), and implicit frequency goals, where users do not explicitly mention how often they will perform the activity but the goal can be inferred from the context.

2.4. Output Format

Once the four attributes are extracted, the structured output provides a list of labeled values that could be processed by any automated system easily. The extracted goal attributes are formatted as shown at the bottom of Figure 2.

2.5. Dataset

The dataset used for training and evaluation is the “Health Coaching Dialogue Corpus” (also called “Dataset 1”) described in Gupta et al. (2020) [20]. This corpus includes free text from interactions between a health coach from a university-run medical clinic and 27 patients between the ages of 21 and 65 years, engaged in a study to increase subjects’ physical activity. In the study, a coach trained in SMART goal-setting interacted with patients for four weeks using a text-messaging service on a smartphone to complement the use of a fitness tracker. This dataset has been manually annotated with SMART goal attributes (specificity, measurability, frequency, and attainability) by the original researchers. These attributes include both numeric and discrete information; measurability refers to numerical goals, specificity refers to the type of activity, frequency is the occurrence rate (such as daily, weekly, or some custom repetition), and attainability is a scaled expression of the difficulty level. They also annotated the dialogue to indicate its purpose, which they refer to as the “dialogue phase.” Defined phases include identification, refining, anticipating barriers, solving barriers, and negotiation. The dialogue phase most related to SMART goals is goal identification, which is the first part of goal-setting. For this research, a subset of 40 dialogues specifically focused on goal-identification interactions within goal-setting was used. In the data, only two types of activity goals are present, walking and stairs, which were tracked using a “Fitbit Alta.” This device is a wireless-enabled wristband that automatically counts steps taken and can upload the data to a company website that the coach used to access the data during the study. In the SMS data, expressions about stair-taking are expressed as “walking the stairs,” which we mapped to walking due to the small size of the dataset.

2.6. Automatic Generation of Pattern-Matching Functions

2.6.1. Initial Design and Setup

Open AI’s ChatGPT-4 [21,22] was leveraged to propose and refine RegEx patterns for each attribute iteratively, focusing first on detecting clear numeric values, frequency indicators (like specific days of the week), and basic activity types (such as walking or running). Asking an LLM to create code is an example of prompt engineering [23], where the prompt suggests the role that the LLM should assume (e.g., an expert Python programmer) and describes the output task. Figure 3 shows an example prompt for requesting code that employs RegEx to extract information from unstructured text.

2.6.2. Extracting and Processing Data

Since numeric expressions in user statements can appear in different formats (“two thousand,” “2K,” “2000”), the prompt requested RegEx patterns that recognize both abbreviations and fully written numbers. For uniformity of the final outputs, the Python module word2number was used to convert all number words into numeric digits. Additionally, as many users express step goals as ranges (e.g., “1000 to 2000 steps”), a new function was created to extract these values and interpret them as a single number, selecting the lower bound by default as a conservative estimate.
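The snippet below is a minimal sketch of this normalization step, combining the word2number module mentioned above with illustrative range and abbreviation handling; it is not the code generated in the study.

```python
import re
from word2number import w2n  # converts number words to digits

def normalize_steps(text: str):
    """Sketch of numeric normalization: ranges, digits, and number words."""
    # Range such as "1000 to 2000 steps": conservatively keep the lower bound.
    rng = re.search(r"(\d[\d,]*)\s*(?:to|-|and)\s*(\d[\d,]*)", text)
    if rng:
        low = int(rng.group(1).replace(",", ""))
        high = int(rng.group(2).replace(",", ""))
        return min(low, high)

    # Plain digits, possibly with thousands separators or a "K" suffix.
    num = re.search(r"(\d[\d,]*)\s*([kK])?", text)
    if num and num.group(1):
        value = int(num.group(1).replace(",", ""))
        return value * 1000 if num.group(2) else value

    # Fully written numbers such as "two thousand".
    try:
        return w2n.word_to_num(text)
    except ValueError:
        return None

print(normalize_steps("walk between 8000 and 12,000 steps"))  # 8000
print(normalize_steps("about 2K steps"))                      # 2000
print(normalize_steps("two thousand"))                        # 2000
```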
Beyond numeric data, frequency detection was refined to distinguish between specific periods (“Monday to Friday”) and general terms (“weekly,” “daily”). Determining how achievable a goal is requires clinical expertise, so for evaluation purposes a simple scoring system was developed that could be refined in the future. The attainability score was calculated based on measurability and frequency, with higher-effort goals (e.g., “10,000 steps on weekends”) assigned lower scores than more manageable ones (e.g., “3000 steps on weekends”).
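Because the exact scoring rule is not published here, the function below is only a hypothetical illustration of the idea that higher step counts and more frequent goals receive lower attainability scores.

```python
def attainability_score(steps: int, frequency: str) -> int:
    """Hypothetical attainability heuristic (1 = hardest, 10 = easiest).

    This is NOT the formula used in the study; it only illustrates the idea
    that higher step counts and more frequent goals receive lower scores.
    """
    days_per_week = {"daily": 7, "Monday-Friday": 5, "Monday-Saturday": 6,
                     "weekly": 1, "weekend": 2}.get(frequency, 7)
    weekly_steps = steps * days_per_week
    # Map weekly effort onto a 1-10 scale, clamped at both ends.
    score = 10 - weekly_steps // 10000
    return max(1, min(10, score))

print(attainability_score(3000, "weekend"))   # lighter goal -> higher score (10)
print(attainability_score(10000, "weekend"))  # heavier goal -> lower score (8)
```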

2.6.3. Testing and Iterative Refinement

Number handling, activity recognition, and frequency detection were consolidated into a single function. Extensive testing was conducted using real and synthetic examples and then incorporated into specific requests for the LLM to create code with appropriate RegEx, including explicit instructions for how to handle missing words that might be assumed from the context and to create patterns that allow for paraphrases—for example, “When you find no specificity just use steps as default,” “The frequency can also be different, not only days of the week like ‘daily,’ ‘Monday–Friday,’ ‘Monday–Saturday,’ ‘weekly,’ or ‘weekend,’” and “The things inside the single backticks contain some of the frequencies, so add them to pattern, or if you write it in a better way just do it, but be a bit intelligent—for example, the user can say ‘I can walk in my office on working days,’ which means he can walk daily, i.e., Monday–Friday.” With these refined instructions, a few remaining gaps in the coverage were observed, which were addressed by making manual refinements. Also, the final set of patterns was manually reordered to prioritize exact matches over broad patterns to provide more reliable data extraction. These are common programming tasks that do not require a high level of skill and take only a few minutes. To see how much manual effort was needed, Figure 4 shows the RegEx for the frequency attribute as created directly by the LLM (on the left) compared to the edited version (on the right).
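Figure 4 shows the actual before-and-after RegEx; the fragment below merely illustrates the kind of refinement described, in which broader phrases (hypothetical examples such as “working days”) map to a small set of canonical frequency values and exact multi-word patterns are tried before broad single-word ones.

```python
import re

# Ordered list: exact multi-word patterns first, broad single words last,
# so that a phrase like "Monday to Saturday" is not swallowed by a generic rule.
FREQUENCY_PATTERNS = [
    (r"\bmonday\s*(?:to|-|through)\s*friday\b|\bworking\s+days?\b|\bweekdays?\b", "Monday-Friday"),
    (r"\bmonday\s*(?:to|-|through)\s*saturday\b", "Monday-Saturday"),
    (r"\bweekends?\b|\bsaturday\s*(?:to|-|and)\s*sunday\b", "weekend"),
    (r"\bevery\s+day\b|\bdaily\b|\bper\s+day\b", "daily"),
    (r"\bweekly\b|\bper\s+week\b", "weekly"),
]

def extract_frequency(text: str, default: str = "daily") -> str:
    for pattern, label in FREQUENCY_PATTERNS:
        if re.search(pattern, text, re.IGNORECASE):
            return label
    return default  # fall back to a default when nothing matches

print(extract_frequency("I can walk in my office on working days"))  # Monday-Friday
print(extract_frequency("I will walk from Monday to Saturday"))      # Monday-Saturday
```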

2.7. Direct Extraction of Goal-Related Attributes by LLMs

The alternative to using an LLM to help write patterns for extracting information from unstructured text is to instruct the LLM to perform the extraction task, which is another example of prompt engineering [23]. Since the type of text to be analyzed and the attributes to be extracted would not be part of the original pretrained model, this approach requires some experimentation to optimize the model—for example, comparing results obtainable by just providing examples as part of a prompt versus results from fine-tuning a model with a larger training set, which changes the values of some internal parameters.

2.7.1. Development of Prompts for LLMs Without Fine-Tuning

To assess the results obtained without fine-tuning, various standard prompting strategies were tested, including:
  • Zero-shot learning (ZSL)—Direct input parsing without prior examples;
  • One-shot learning (1SL)—Input with just a single example to guide the model;
  • Few-shot learning (FSL)—Input with multiple examples to improve accuracy.
Simple structured prompts were developed, instructing the model to return responses in a predefined attribute–value format. When initial tests revealed that the outputs contained unnecessary surrounding text, the prompts were refined to instruct the model to wrap responses within <result> tags. We then compared alternative LLMs and parameter settings, such as “temperature,” which controls the amount of flexibility permitted by the model.
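A minimal sketch of the post-processing this implies (an assumed helper, not code from the paper) strips any surrounding text and keeps only the payload between the <result> tags:

```python
import json
import re

def parse_result(raw_response: str):
    """Extract the attribute-value payload from an LLM response
    that wraps its answer in <result> ... </result> tags."""
    match = re.search(r"<result>\s*(\{.*?\})\s*</result>", raw_response, re.DOTALL)
    if not match:
        return None  # model ignored the format; treat as out of coverage
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None

raw = ('Sure! Here is the analysis: <result>{"Measurability": "1000", '
       '"Specificity": "steps", "Attainability": "8", "Frequency": "daily"}</result>')
print(parse_result(raw)["Frequency"])  # daily
```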

2.7.2. Evaluation and Optimization of Alternative Pretrained LLMs

To assess direct extraction, experiments were conducted using various open-source and commercial LLMs, including the Meta-Llama 2 7B Chat model, the Gemma 7B Instruct model, the Meta-Llama 2 70B Chat model, and OpenAI’s ChatGPT-3.5-Turbo. These are all smaller models than ChatGPT-4 and more reasonable for real-time use. The first step was to determine the optimal number of examples to provide and the optimal degree of randomness (i.e., temperature). To establish baseline performance, we ran tests using the Meta-Llama 2 7B Chat model. A systematic evaluation was conducted using different prompting strategies, including zero-shot (ZS), one-shot (1S), two-shot (2S), and five-shot (5S) learning paradigms. Each configuration was tested at five different temperature settings: 0.1, 0.3, 0.5, 0.75, and 1.0. The results are summarized below (a sketch of the evaluation loop appears after this summary):
  • Zero-Shot Learning: Initial trials involved direct instruction-based prompting, where a simple user input, such as “I want to walk 1000 steps daily,” was provided for the model. The response format was explicitly defined using a structured template enclosed within <result> tags. At lower temperature settings, the model generated structured responses with accurate frequency values but lacked specificity in certain parameters. Higher temperature settings introduced increased variability, sometimes leading to incorrect but contextually relevant outputs.
  • One-Shot Learning: To enhance the model’s comprehension of the expected output format, an explicit example was included in the prompt. This refined approach led to improved consistency in the output structure. Notably, repeated trials with identical inputs at lower temperatures yielded stable outputs, while higher temperatures introduced minor variations, particularly in the “attainability” field.
  • Two-Shot Learning: Given the limitations observed in one-shot learning, an additional example with varied input characteristics was incorporated. This adjustment aimed to enhance the model’s understanding of different input styles. However, despite improvements at lower temperatures, inconsistencies persisted at higher temperatures.
  • Five-Shot Learning: To further improve performance, the prompt was extended to include five diverse input examples. These covered a range of input styles, including numerical values, step count ranges, and missing frequency attributes. Overall, for this small model, five-shot performed well for a wide range of inputs.
The results indicate that the five-shot configuration significantly improved output consistency across temperature settings, with more ability to generalize across different input variations. (The prompts used in each test are listed in Appendix A.)
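The evaluation grid itself is mechanically simple; the sketch below assumes hypothetical generate and parse callables (a wrapper around whichever model is under test and a <result>-tag parser such as the one sketched above) and records accuracy for each prompting strategy and temperature. Neither callable is part of the study’s released code.

```python
from typing import Callable

def evaluate_grid(generate: Callable[[str, float], str],
                  parse: Callable[[str], dict],
                  prompts: dict,    # e.g. {"zero-shot": "...", "five-shot": "..."}
                  test_set: list,   # list of (user input, gold attribute dict) pairs
                  temperatures=(0.1, 0.3, 0.5, 0.75, 1.0)) -> dict:
    """Score every prompting strategy at every temperature on a labeled test set."""
    scores = {}
    for name, prompt_template in prompts.items():
        for temp in temperatures:
            correct = 0
            for text, gold in test_set:
                raw = generate(prompt_template + "\n[INST] " + text + " [/INST]", temp)
                correct += int(parse(raw) == gold)
            scores[(name, temp)] = correct / len(test_set)
    return scores
```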
After establishing baseline performance, the same experiment was conducted with two additional models: the Gemma 7B Instruct model and the much larger Meta-Llama 2 70B Chat model. The Gemma model performed similarly to the Meta-Llama 2 7B Chat model. The Meta-Llama 2 70B Chat model achieved the best performance in the two-shot learning configuration, even better than five-shot. The worst results were seen at higher temperatures using the five-shot approach, with the LLM appearing to disregard the target task entirely. (Complete results are provided in a later section.)

2.7.3. Creation of Synthetic Training Data and Evaluation of Fine-Tuned Models

After establishing optimal performance using only pretrained models, we assessed a fine-tuning approach [24]. Fine-tuning involves providing a training set and allowing the model to change its internal weights. However, fine-tuning a model with even 7 billion parameters requires a substantial dataset, typically with at least 5000 labeled examples. Since the available dataset was insufficient, a synthetic data generation process was used to generate additional samples [25]. We used multiple strategies to generate sentences with similar intentions to the original data but with more linguistic variability and with a broader range of values for each of the entity types to be extracted. Below is a summary of the approaches used, followed by a minimal illustrative generation sketch:
  • Basic Goal-Setting Sentences: First, basic dataset entries comprising goal-oriented statements related to daily steps were created. These entries included measurable goals (number of steps), specificity (mentioning “steps”), attainability (a rating from 1 to 10), and the frequency with which the goals should be achieved.
  • Intelligent Attainability Calculation: The dataset was improved by introducing an intelligent mechanism that calculates the “attainability” score based on the number of steps taken and their frequency. This adjustment made the dataset more realistic, as it factored in the difficulty of achieving higher step counts or more frequent activity.
  • Inclusion of Ranges and Frequencies: To introduce more variability, flexible step goals were used instead of fixed numbers, such as “4000 to 5000 steps.” Prompts were given to instruct the model to use a frequency value from the list of frequencies provided, including “daily,” “Monday–Friday,” “Monday–Saturday,” “weekly,” and “weekend,” to have consistency in the frequency value.
  • Goal Update Sentences: To allow for examples that update an existing goal, examples were created that used phrases like “Let’s increase the daily target to 8000 steps.”
  • Conversational Context: To make the dataset entries appear more natural, samples were created to integrate the step goals into casual or unrelated conversations. This approach incorporated additional context not directly related to a goal, such as remarks about the weather, personal motivation, or other daily activities.
  • Variety in Sentence Structures: A variety of creative scenarios were included to enhance the dataset. These scenarios reflect a diverse range of motivations for walking, such as adopting a dog or starting a new job near a park. Alternative reply styles were also used to infuse the dataset with different tones, from casual to motivational. Additionally, scenarios that emphasize goals following medical advice or personal health commitments were crafted to focus on health awareness and motivation.
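As a rough illustration of how such entries can be produced programmatically (in the study they were generated with an LLM and then manually reviewed), the sketch below combines a few hypothetical templates with randomized values and derives the gold label directly from the values it inserted.

```python
import json
import random

TEMPLATES = [
    "I want to walk {steps} steps {freq}.",
    "Let's increase the {freq} target to {steps} steps.",
    "The weather is great, so I'll aim for {steps} steps {freq}!",
]
FREQUENCIES = ["daily", "Monday-Friday", "Monday-Saturday", "weekly", "weekend"]
DAYS_PER_WEEK = {"daily": 7, "Monday-Friday": 5, "Monday-Saturday": 6,
                 "weekly": 1, "weekend": 2}

def make_example(rng: random.Random) -> dict:
    """Produce one synthetic labeled sample (input text plus gold attributes)."""
    steps = rng.randrange(1000, 15001, 500)
    freq = rng.choice(FREQUENCIES)
    # Same idea as the attainability heuristic sketched earlier: harder goals score lower.
    attainability = max(1, min(10, 10 - (steps * DAYS_PER_WEEK[freq]) // 10000))
    text = rng.choice(TEMPLATES).format(steps=steps, freq=freq)
    label = {"Measurability": str(steps), "Specificity": "steps",
             "Attainability": str(attainability), "Frequency": freq}
    return {"input": text, "output": label}

print(json.dumps(make_example(random.Random(0)), indent=2))
```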
To ensure that the synthetic data was of comparable quality to the original, we manually reviewed a random sample of 100 generated entries, assessing their correctness and coherence. When discrepancies were identified within the sample, they were corrected and then the model was fine-tuned with the revised augmented dataset. We repeated this process iteratively: generating a revised dataset, evaluating a new sample of one hundred, and refining the model based on any errors found. This iterative process continued until the data quality met the desired standards. The final file was used as the dataset to fine-tune the LLM.
The final dataset consisted of 10,512 structured examples. The dataset was preprocessed, and a prompt was made from each sample. For training purposes, the 1S prompt structure was used, where the structure of each prompt after preprocessing was as follows:
{"text": "\n <s>[INST] <<SYS>> surround the answer in between <result> and </result> tags <</SYS>>\n INPUT TEXT [/INST]\n <result>{\n"Measurability":"value",\n"Specificity":"value",\n"Attainability":"value",\n"Frequency":"value"\n}</result> </s>\n"}
This prompt has a system instruction about surrounding the values with result tags, followed by the input text and its associated output.
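A minimal sketch of assembling one such training record from a labeled sample is shown below; the helper itself is an assumption for illustration, not part of the released code.

```python
import json

def to_training_record(input_text: str, label: dict) -> dict:
    """Wrap one labeled sample in the Llama-style 1S prompt format used for fine-tuning."""
    system = "<<SYS>> surround the answer in between <result> and </result> tags <</SYS>>"
    answer = json.dumps(label, indent=0)  # one attribute-value pair per line
    text = (f"\n <s>[INST] {system}\n {input_text} [/INST]\n "
            f"<result>{answer}</result> </s>\n")
    return {"text": text}

sample = to_training_record(
    "I want to walk 1000 steps daily",
    {"Measurability": "1000", "Specificity": "steps",
     "Attainability": "8", "Frequency": "daily"},
)
print(sample["text"])
```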
The dataset was then partitioned into training (80%), validation (10%), and testing (10%) subsets. The fine-tuning process was conducted using the LoRA (Low-Rank Adaptation) method [26] within the MLX framework [27,28]. The training was executed on an Apple MacBook Pro with an M2 Max chip and completed within six hours.
After fine-tuning, the model exhibited improved zero-shot performance, demonstrating high accuracy without requiring additional prompt engineering. Given that the training instructions aligned closely with the test prompts, no additional few-shot learning was attempted.

2.7.4. Comparison Tests with ChatGPT

As a final test, we compared the results obtained with the open-source models to those of a commercial model, ChatGPT-3.5-Turbo. For these experiments, the premium version of the OpenAI API was acquired and integrated into the experimental framework. Detailed instructions for model prompting were developed, and the outcomes were systematically recorded. Notably, the model’s zero-shot capability demonstrated high accuracy in its initial trials, so we did not run additional tests.
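For reference, a call of this kind can be made with the OpenAI Python client roughly as follows; the prompt wording and parameter choices here are illustrative assumptions, not the exact instructions used in the study.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative system prompt; the study's actual instructions were more detailed.
SYSTEM_PROMPT = (
    "Extract Measurability, Specificity, Attainability, and Frequency from the "
    "user's physical-activity goal. Return JSON wrapped in <result></result> tags."
)

def extract_with_chatgpt(user_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.1,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ],
    )
    return response.choices[0].message.content

print(extract_with_chatgpt("I want to walk 10,000 steps to stay healthy."))
```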

3. Results

This section presents the results obtained from evaluating pattern-matching functions and Large Language Models (LLMs) for extracting relevant information from user inputs that involve setting and reviewing goals for physical activity. The results include the calculated overall accuracy and average execution time per item for different models and methodologies, including pattern matching and direct extraction. The results from fine-tuning experiments are also analyzed.
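Accuracy and per-item latency can be measured with a small harness such as the one below (a sketch assuming an extractor callable and a labeled test set, not the exact evaluation script used in this study).

```python
import time
from statistics import mean

def benchmark(extractor, test_set):
    """Return (accuracy, average seconds per item) for any extraction callable."""
    latencies, correct = [], 0
    for text, gold in test_set:
        start = time.perf_counter()
        predicted = extractor(text)
        latencies.append(time.perf_counter() - start)
        correct += int(predicted == gold)
    return correct / len(test_set), mean(latencies)
```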

3.1. Results of Information Extraction Using a Pattern-Matching Function

Pattern-matching techniques were tested on a dataset comprising 100 user inputs, evaluating their ability to correctly extract entities. The metrics calculated were accuracy and execution time. The results showed an accuracy of 99% (99 out of 100 test cases). The pattern-matching code created by the LLM with manual revisions incorrectly matched an example with overlapping patterns. When given the user input “I will walk from Monday to Saturday,” it incorrectly generated “Daily” instead of the correct output, “Monday–Saturday.” The average execution time was 0.01 s.

3.2. Results of Direct Extraction Using Large Language Models (LLMs)

The accuracy for direct extraction ranged from 90% to 100%, with execution times ranging from 0.9 s to 5 s. The best accuracy was achieved using the commercial ChatGPT-3.5-Turbo model, but 0.9 s is too slow for real-time interaction. Figure 5 shows the accuracies for four different models in their optimal configuration: the LLaMA 2 7B Chat model with five-shot, the Gemma 7B Instruct model with five-shot, the LLaMA 2 70B Chat model with two-shot, and the LLaMA 2 70B Chat model with five-shot for the three main attributes (specificity, measurability, and frequency). In the remainder of the section, we shall review the results of the other variations that were tested.
The analysis in Figure 6 reveals that lower temperatures (0.1–0.5) generally provided more reliable attainability assessments, with Gemma 7B (five-shot) showing superior consistency across all temperature settings. The temperature of 0.5 appeared to be optimal for most Llama models, with Llama 2 70B (two-shot) achieving its peak performance of 86.4% at this setting. Llama 2 7B (five-shot) dropped significantly from 83% to 71.6% when the temperature increased from 0.5 to 0.9, indicating higher sensitivity to randomness in output generation.

3.2.1. Results for LLaMA 2 (7B) and Gemma 7B Without Fine-Tuning

Experiments with these smaller open-source models were conducted under different settings (zero-shot, one-shot, two-shot, and few-shot) at a range of temperatures. Below we summarize these results.
  • Zero-Shot (ZS): The models completely failed to understand the requirement and were unable to generate a relevant output.
  • One-Shot (1S): With the addition of a single example, the model generated an appropriate analysis for inputs of the given style. However, when presented with other input formats, such as those containing a range, the model failed to extract the correct data. For instance, while the model accurately processed the input “I want to walk 10,000 steps daily,” when asked to process “I want to walk between 8000 and 12,000 steps daily,” it failed to capture and interpret the range correctly.
  • Two-Shot (2S): No significant improvement was observed with 2S.
  • Few-Shot (5S): Accuracy improved, but the outputs still contained several errors; increasing the temperature did not help and sometimes resulted in format inconsistencies. Even at lower temperatures, the model generated incorrect values for certain entities.

3.2.2. Results of Fine-Tuning a Smaller Model

As larger open-weight models such as LLaMA 2 70B were too slow (5 s on average) and the smaller models were too error-prone (90% accuracy on average), we investigated fine-tuning a smaller model. Fine-tuning was conducted using LoRA (Low-Rank Adaptation) within the MLX framework, optimizing the LLaMA 2 7B Chat model based on 9460 training samples. The training process concluded with a validation loss of 0.282. After training, an additional set of 526 samples was used for model evaluation. The resulting test loss was measured at 0.216, with a corresponding test perplexity of 1.241. The accuracy on the test set was 98%; the average execution time was 2 s.

3.2.3. Results of ChatGPT-3.5

A prompt with detailed instructions was designed to guide ChatGPT’s responses. The OpenAI API was utilized for the prompting process. The model was then evaluated on the test cases, where it achieved 100% accuracy. The average execution time was 0.9 s.

3.3. Comparative Analysis

Figure 7 presents a comparison of the accuracy and average execution time of the evaluated approaches. While ChatGPT-3.5-Turbo achieved perfect accuracy (100%), the pattern-matching approach closely followed, with 99% accuracy, demonstrating only a minimal trade-off in precision. The fine-tuned Llama 2 7B model achieved 98% (up from 78% without fine-tuning). The larger Llama 2 70B (five-shot) model reached 90% accuracy without fine-tuning. The fine-tuned Llama 2 7B took 2 s on average versus 5 s for the Llama 2 70B (five-shot).

4. Discussion

This study compares the performance of several state-of-the-art AI models and an AI-facilitated pattern-matching function that could support automatic documentation or interactive free-text dialogue for protocol-based health coaching. The analysis covers small and larger open-source models, namely the Meta-Llama models (7B and 70B) and Gemma 7B Instruct, as well as the commercial ChatGPT-3.5-Turbo. Each approach presents both strengths and weaknesses in terms of accuracy, speed, transparency, and adaptability. Table 2 summarizes our main findings and recommendations for future use.
The regular expression-based pattern-matching function exhibits high accuracy for the small test set of real data (99%). However, it cannot address new sentence patterns or sentences with multiple matching entities without refinement, as its coverage is fixed. When a novel expression is encountered, pattern matching returns a null result, so it is clear when an update is needed. To update the matching function, a set of new inputs could be gathered, added to the existing training set, and fed to a generative AI model instructed to produce new regular expressions that expand the function’s scope. One could manually choose when to perform such updates or create another function to schedule updates automatically. Doing so would require a programmer (possibly on a temporary basis) to manually review and potentially revise the LLM-generated function. The key advantages of this method are its near-instantaneous response time and its transparency, along with high accuracy on data within its scope, that is, data similar to its training data. Thus, there should be no issues with regulatory compliance.
With LLMs, coverage expands automatically, leveraging the pretrained model and provided examples, at the cost of lower accuracy. Also, when coverage is exceeded, it will not be immediately obvious, as the LLM output may include errors that are complete fabrications. For direct extraction, Meta-Llama 2 70B demonstrated the best accuracy among open-source LLMs without fine-tuning. However, it required more computational resources and had the slowest response time (5 s), making it unsuitable for real-time applications, although it could still be used for documenting interactions offline based on a transcript. It was also significantly less accurate than the pattern-matching approach (90% vs. 99%) and required expert manual effort to determine an optimal temperature and prompting configuration, so it is hard to justify.
Fine-tuning Meta Llama 2 7B resulted in a significant accuracy improvement, up to 98%. This result was attained with a potentially “harder” dataset that included synthetic data crafted to include more variety (requiring greater coverage). The tradeoff was the need to first create an augmented dataset. Also, the execution time was unsatisfactory for real-time use (2 s, where a latency of less than 500 ms is considered good).
ChatGPT-3.5-Turbo showed the highest accuracy across different test cases. At 100%, even on the hardest data, without fine-tuning, it would be best for offline documentation tasks. However, its response time (around 0.9 s) was much slower than that of the pattern-matching function (0.01 s). While this is faster than the other AI models tested, it is still too slow for conversational interfaces, although it is approaching an acceptable latency. These results suggest that specialized “distilled” models will likely be needed for dialogues that require real-time data extraction, although given the loss of transparency and predictability, it is unclear whether there is a strong need for these new models for simple entity extraction if other methods will suffice. One reason to use LLMs for real-time interactions might be to perform multiple tasks at once, including those that might be hard for people to recognize, such as assessing a patient’s confidence, emotions, or lack of engagement. However, these benefits should be weighed against the need for data privacy, transparency, and regulatory compliance and the potential risk of bias [29].
We acknowledge some limitations. This research focused on one type of reported physical activity, walking, and the task of setting an initial goal. We focused on this activity because of the availability of real data involving a professional health coach [20]. Future work might address other activities and their measures, if suitable data were available; however, increasing step counts is an important task for health coaching, especially for counseling people with knee osteoarthritis [30,31]. Other health behaviors where software-supported SMART goal-setting might be useful include goals to reduce stress, improve sleep, or improve nutrition. For example, the patient might create a plan to “meditate 10 min each morning, sleep 7 h per night, or consume 5 servings of fruits and vegetables daily” [32]. While there are applications for people to log these actions by hand, these are straightforward tasks for information extraction similar to what we have shown here. It would also be beneficial to address how to handle other types of information, such as addressing barriers or patients’ confidence in their ability to achieve their goals.

5. Conclusions

The performance of various AI models and pattern-matching functions naturally differs significantly across different learning contexts and application domains. While pattern-matching functions are extremely fast, they may not be able to handle new or complex sentence structures. On the other hand, generative models like Meta-Llama 2 7B and Gemma 7B Instruct offer flexibility, making them suitable for tasks requiring adaptability and instructional clarity, but they lack the speed necessary for use in an interactive conversational system where one also wants to create a log of assessed parameters. ChatGPT, while slower than pattern-matching functions, offers excellent accuracy and comes close to real-time behavior. Future advances may further address this limitation. Overall, the choice of model or function still depends heavily on the specific requirements for accuracy, response time, and computational resources, highlighting the need for a strategic approach in selecting the appropriate AI tool for specific tasks.

Author Contributions

Conceptualization, S.S.A.K., A.P. and S.M.; methodology, S.S.A.K. and S.M.; software, S.S.A.K.; validation, S.S.A.K.; data curation, S.S.A.K.; writing—original draft preparation, S.S.A.K.; writing—review and editing, S.S.A.K., A.P. and S.M.; visualization, S.S.A.K. and A.P.; supervision, S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The program code and the synthetic data used for fine-tuning are available at https://github.com/Aadithya180600/Enhancing-scripted-dialogue-systems (accessed on 27 August 2025).

Acknowledgments

The authors acknowledge the support of their colleagues at UW-Milwaukee, including Harshawardhan Vijayan.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LLM: Large language model
ML: Machine learning
LoRA: Low-Rank Adaptation
AI: Artificial intelligence

Appendix A

Table A1. Prompts used for zero-shot, one-shot, two-shot, and five-shot settings.
Prompt Type Prompt
Zero-shot: <<sys>>For the given message, generate a response in JSON format that includes the following information: {Measurability: “value “, Specificity: “value”, Attainability: “value”, Frequency: “value”} surround the answer in between <result> and </result> tags. <<sys>>
One-shot: <<SYS>> surround the answer in between <result> and </result> tags. <</SYS>> [INST]”the goal for this week is to walk 2,000 steps per day every day.” [/INST]
   <result> {
       “Measurability”: “2000”,
       “Specificity”: “steps”,
       “Attainability”: “8”,
       “Frequency”: “daily”
   } </result>
Two-shot: <<SYS>> surround the answer in between <result> and </result> tags. <</SYS>> [INST] “the goal for this week is to walk 2,000 steps per day every day.” [/INST]
    <result> {
       “Measurability”: “2000”,
       “Specificity”: “steps”,
       “Attainability”: “8”,
       “Frequency”: “daily”
    } </result>
[INST] “I wanna TRY between 1,000 and 2,000 from Monday to Friday” [/INST]
    <result> {
       “Measurability”: “1000”,
       “Specificity”: “steps”,
       “Attainability”: “8”,
       “Frequency”: “Monday-Friday”
    } </result>
Five-shot: <<SYS>> surround the answer in between <result> and </result> tags. <</SYS>> [INST] “the goal for this week is to walk 2,000 steps per day every day.” [/INST]
    <result> {
       “Measurability”: “2000”,
       “Specificity”: “steps”,
       “Attainability”: “8”
       “Frequency”: “daily”
    } </result>
[INST] “I wanna TRY between 1,000 and 2,000 from Monday to Friday” [/INST]
    <result> {
       “Measurability”: “1000”,
       “Specificity”: “steps”,
       “Attainability”: “8”,
       “Frequency”: “Monday-Friday”
    } </result>
[INST] “Hi! Ive been struggling a bit lately, so lets aim for a more achievable goal of 3,000 steps per day.” [/INST]
    <result> {
       “Measurability”: “3000”,
       “Specificity”: “steps”,
       “Attainability”: “5”,
       “Frequency”: “daily”
    } </result>
[INST] “Good morning! Im feeling really determined this week. Lets push for 9,000 steps from Saturday to Sunday.” [/INST]
    <result> {
       “Measurability”: “9000”,
       “Specificity”: “steps”,
       “Attainability”: “10”,
       “Frequency”: “weekend”
    } </result>
[INST] “15,000 steps…..” [/INST]
    <result> {
       “Measurability”: “15000”,
       “Specificity”: “steps”,
       “Attainability”: “7”,
       “Frequency”: “daily”
    } </result>

References

  1. Gutnick, D.; Reims, K.; Davis, C.; Gainforth, H.; Jay, M.; Cole, S. Brief action planning to facilitate behavior change and support patient self-management. J. Sci. Commun. 2014, 21, 17–29. [Google Scholar]
  2. Lungu, A.; Boone, M.; Chen, S.; Chen, C.; Walser, R. Effectiveness of a cognitive behavioral coaching program delivered via video in real world settings. Telemed. e-Health 2021, 27, 47–54. [Google Scholar] [CrossRef] [PubMed]
  3. Beinema, T.; Davison, D.; Reidsma, D.; Banos, O.; Bruijnes, M.; Donval, B.; Valero, Á.F.; Heylen, D.; Hofs, D.; Huizing, G.; et al. Agents United: An open platform for multi-agent conversational systems. In Proceedings of the 21st ACM International Conference on Intelligent Virtual Agents, Virtual Event, Kyoto, Japan, 14–17 September 2021; pp. 17–24. [Google Scholar]
  4. Yarn Spinner. Available online: https://yarnspinner.dev/ (accessed on 3 June 2025).
  5. Van Hoan, N.; Hung, P. Arext: Automatic Regular Expression Testing Tool Based on Generating Strings with Full Coverage. In Proceedings of the 13th International Conference on Knowledge and Systems Engineering (KSE), Bangkok, Thailand, 10–12 November 2021. [Google Scholar]
  6. Kazama, J.; Makino, T.; Ohta, Y.; Tsujii, J. Tuning Support Vector Machines for biomedical named entity recognition. In Proceedings of the ACL-02 Workshop on Natural Language Processing in Biomedical Applications, Philadelphia, PA, USA, 11 July 2002; Volume 3, pp. 1–8. [Google Scholar]
  7. Zhao, S. Named entity recognition in biomedical texts using an HMM model. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and Its Applications, Geneva, Switzerland, 28–29 August 2004; Association for Computational Linguistics, USA. 2004; pp. 84–87. [Google Scholar]
  8. McCallum, A.; Freitag, D.; Pereira, F. Maximum Entropy Markov Models for information extraction and segmentation. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford, CA, USA, 29 June–2 July 2000; pp. 591–598. [Google Scholar]
  9. McCallum, A.; Li, W. Early results for named entity recognition with Conditional Random Fields, feature induction and web-enhanced lexicons. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL, Edmonton, AB, Canada, 31 May–1 June 2003; pp. 188–191. [Google Scholar]
  10. Hu, Y.; Chen, Q.; Du, J.; Peng, X.; Keloth, V.; Zuo, X.; Zhou, Y.; Li, Z.; Jiang, X.; Lu, Z.; et al. Improving large language models for clinical named entity recognition via prompt engineering. J. Am. Med. Inform. Assoc. 2024, 31, 1812–1820. [Google Scholar] [CrossRef] [PubMed]
  11. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  12. Touvron, H.; Martin, L.; Stone, K.; Scialom, T. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  13. Mesnard, T.; Hardin, C.; Dadashi, R.; Bhupatiraju, S.; Pathak, S.; Sifre, L.; Rivière, M.; Kale, M.S.; Love, J. Gemma: Open Models Based on Gemini Research and Technology. arXiv 2024, arXiv:2403.08295. [Google Scholar] [CrossRef]
  14. Dabrowski, J.; Munson, E.V. 40 years of searching for the best computer system response time. Interact. Comput. 2011, 23, 555–564. [Google Scholar] [CrossRef]
  15. Yu, M.; Zhou, R.; Cai, Z.; Tan, C.W.; Wang, H. Unravelling the relationship between response time and user experience in mobile applications. Internet Res. 2020, 30, 1353–1382. [Google Scholar] [CrossRef]
  16. Meta. Llama 2: Open Source, Free for Research and Commercial Use. Available online: https://www.llama.com/llama2/ (accessed on 21 July 2025).
  17. Hugging Face. Access to Gemma on Hugging Face. Available online: https://huggingface.co/google/gemma-7b (accessed on 21 July 2025).
  18. Open AI. GPT 3.5 Turbo: Legacy GPT Model for Cheaper Chat and Non-Chat Tasks. Available online: https://platform.openai.com/docs/models/gpt-3.5-turbo (accessed on 21 July 2025).
  19. Doran, G.T. There’s a SMART way to write management’s goals and objectives. Manag. Rev. 1981, 70, 35–36. [Google Scholar]
  20. Gupta, I.; Eugenio, B.D.; Ziebart, B.; Baiju, A.; Liu, B.; Gerber, B.; Sharp, L.; Nabulsi, N.; Smart, M. Human-Human Health Coaching via Text Messages: Corpus, Annotation, and Analysis. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue, Online, 21–24 July 2020; Association for Computational Linguistics. pp. 246–256. [Google Scholar]
  21. OpenAI. Available online: https://openai.com/ (accessed on 3 June 2025).
  22. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  23. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS ‘20), Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020; pp. 1877–1901. [Google Scholar]
  24. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  25. Guo, X.; Chen, Y. Generative AI for Synthetic Data Generation: Methods, Challenges and the Future. arXiv 2024, arXiv:2403.04190. [Google Scholar] [CrossRef]
  26. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  27. Hannun, A.; Digani, J.; Katharopoulos, A.; Collobert, R. MLX. Available online: https://github.com/ml-explore (accessed on 3 June 2025).
  28. MLX. Available online: https://ml-explore.github.io/mlx/build/html/index.html (accessed on 3 June 2025).
  29. Wen, B.; Norel, R.; Liu, J.; Stappenbeck, T.; Zulkernine, F.; Chen, H. Leveraging Large Language Models for Patient Engagement: The Power of Conversational AI in Digital Health. In Proceedings of the 2024 IEEE International Conference on Digital Health (ICDH), Shenzhen, China, 7–13 July 2024; IEEE: New York, NY, USA, 2024; pp. 104–113. [Google Scholar]
  30. Li, L.; Sayre, E.; Xie, H.; Falck, R.S.; Best, J.R.; Liu-Ambrose, T.; Grewal, N.; Hoens, A.M.; Noonan, G.; Feehan, L.M. Efficacy of a Community-Based Technology-Enabled Physical Activity Counseling Program for People with Knee Osteoarthritis: Proof-of-Concept Study. J. Med. Internet Res. 2018, 20, e159. [Google Scholar] [CrossRef] [PubMed]
  31. Li, L.; Feehan, L.; Xie, H.; Lu, N.; Shaw, C.; Gromala, D.; Zhu, S.; Aviña-Zubieta, J.; Hoens, A.; Koehn, C.; et al. Effects of a 12-Week Multifaceted Wearable-Based Program for People with Knee Osteoarthritis: Randomized Controlled Trial. JMIR Mhealth Uhealth 2020, 8, e19116. [Google Scholar] [CrossRef] [PubMed]
  32. White, N.; Bautista, V.; Lenz, T.; Cosimano, A. Using the SMART-EST Goals in Lifestyle Medicine Prescription. Am. J. Lifestyle Med. 2020, 14, 271–273. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
Figure 1. Overview of architecture for pattern matching (top) and direct extraction (bottom).
Figure 2. Format of expected output.
Figure 3. An example prompt for generating a Python function for information extraction.
Figure 4. Regular expressions created by ChatGPT before (left) and after manual revision (right).
Figure 5. Accuracies for direct extraction using LLaMA 2 7B Chat (five-shot), Gemma 7B Instruct (five-shot), LLaMA 2 70B Chat (two-shot), and LLaMA 2 70B Chat (five-shot).
Figure 6. Average attainability of models at different temperatures (0.1, 0.5, 0.9) using LLaMA 2 7B Chat (five-shot), Gemma 7B Instruct (five-shot), LLaMA 2 70B Chat (two-shot), and LLaMA 2 70B Chat (five-shot).
Figure 7. Comparison of accuracies and average execution time between LLMs and the pattern-matching function.
Table 1. Summary of large language models used in direct extraction assessment.
Model | Developer | Size | Type | Architecture | Context Window | Reproducibility
Meta-Llama 2 7B Chat | Meta | 7 billion | Open weights | Transformer decoder | 4096 tokens | High
Meta-Llama 2 70B Chat | Meta | 70 billion | Open weights | Transformer decoder | 4096 tokens | High
Gemma 7B Instruct | Google | 7 billion | Open weights | Transformer decoder | 8192 tokens | High
ChatGPT-3.5-Turbo | OpenAI | ~175 billion * | Commercial | Transformer | 16,385 tokens | Limited (API variations)
* Estimated parameter count; exact specifications not publicly disclosed by OpenAI.
Table 2. Summary of advantages and disadvantages of the tested approaches.
Accuracy
  • Python code with LLM-drafted RegEx: Very high; 99% on covered data, but no answer on out-of-coverage data (unless logic for missing or default values is explicitly specified).
  • Downloadable LLM without fine-tuning: Moderate; the best models for accuracy, such as Llama 2 70B or Gemma 7B, do well depending on the type of entity (e.g., specificity up to 100%, measurability up to 98%, frequency up to 90%).
  • State-of-the-art LLM without fine-tuning: Excellent; ChatGPT-3.5-Turbo achieved 100% accuracy, with broad coverage.
  • Downloadable LLM with fine-tuning: Very high; the best fine-tuned models, such as Llama 2 7B, achieved 98%, with broad coverage.
Execution speed
  • Python code with LLM-drafted RegEx: Always real time; 10 ms on average.
  • Downloadable LLM without fine-tuning: Not real time; the best of these models took 5 s on average.
  • State-of-the-art LLM without fine-tuning: Nearly real time; ChatGPT-3.5-Turbo took 900 ms on average.
  • Downloadable LLM with fine-tuning: Not real time; the best of these models took 2 s on average.
Transparency
  • Python code with LLM-drafted RegEx: Mostly clear; rules are readable (with some expertise); clear when out of coverage; when rules overlap, the answer will depend on the order of the rules.
  • Downloadable LLM without fine-tuning: Opaque; weights have no interpretation; no indication of out of coverage (models may hallucinate).
  • State-of-the-art LLM without fine-tuning: Opaque; weights have no interpretation; no indication of out of coverage (models may hallucinate).
  • Downloadable LLM with fine-tuning: Opaque; weights have no interpretation; no indication of out of coverage (models may hallucinate).
Maintainability and reliability
  • Python code with LLM-drafted RegEx: High; can be updated through manual editing or by providing new labeled examples; no risk if out of coverage.
  • Downloadable LLM without fine-tuning: Low; models are not editable, results do not improve with more data, and some risk if out of coverage.
  • State-of-the-art LLM without fine-tuning: Mixed; models are not editable but perform well without effort; updates are controlled by the API owner.
  • Downloadable LLM with fine-tuning: Moderate; models are not editable but can be downloaded and fine-tuned offline if new data is available; some risk if out of coverage.
Best use case(s)
  • Python code with LLM-drafted RegEx: Either low- or high-budget real-time interaction, such as virtual or assisted counselling.
  • Downloadable LLM without fine-tuning: Low-budget offline documentation tasks or quick prototyping.
  • State-of-the-art LLM without fine-tuning: High-budget offline documentation tasks or creation of synthetic training data.
  • Downloadable LLM with fine-tuning: Low-budget offline documentation tasks or creation of synthetic training data.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
