Article

Context Is King: Large Language Models’ Interpretability in Divergent Knowledge Scenarios

by Andrés Piñeiro-Martín 1,2,*, Francisco-Javier Santos-Criado 1,†, Carmen García-Mateo 2,*, Laura Docío-Fernández 2 and María del Carmen López-Pérez 2

1 Balidea Consulting & Programming S.L., Witland Building, Camiños da Vida Street, 15701 Santiago de Compostela, Spain
2 GTM Research Group, AtlanTTic Research Center, University of Vigo, Maxwell Street, 36310 Vigo, Spain
* Authors to whom correspondence should be addressed.
† The research was carried out while the author was employed at Balidea.
Appl. Sci. 2025, 15(3), 1192; https://doi.org/10.3390/app15031192
Submission received: 14 November 2024 / Revised: 21 December 2024 / Accepted: 21 January 2025 / Published: 24 January 2025

Abstract:
Large language models (LLMs) have revolutionized the field of artificial intelligence in both academia and industry, transforming how we communicate, search for information, and create content. However, these models face knowledge cutoffs and costly updates, driving a new ecosystem for LLM-based applications that leverage interaction techniques to extend capabilities and facilitate knowledge updates. As these models grow more complex, understanding their internal workings becomes increasingly challenging, posing significant issues for transparency, interpretability, and explainability. This paper proposes a novel approach to interpretability by shifting the focus to understanding the model’s functionality within specific contexts through interaction techniques. Rather than dissecting the LLM itself, we explore how contextual information and interaction techniques can elucidate the model’s thought processes. To this end, we introduce the Context-Driven Divergent Knowledge Evaluation (CDK-E) methodology, along with the Divergent Knowledge Dataset (DKD), for evaluating the interpretability of LLMs in context-specific scenarios that diverge from the model’s inherent knowledge. The empirical results demonstrate that advanced LLMs achieve high alignment with divergent contexts, validating our hypothesis that contextual information significantly enhances interpretability. Moreover, the strong correlation between LLM-based metrics and semantic metrics confirms the reliability of our evaluation framework.

1. Introduction

Large language models (LLMs) have ushered in a new era of artificial intelligence (AI), fundamentally transforming our interaction with technology and unlocking unprecedented capabilities. The introduction of models such as BERT [1] and the subsequent release of ChatGPT, based on the GPT-3 model [2], acted as a catalyst, sparking widespread excitement and revolutionizing the LLM landscape. Following these releases, a new wave of LLMs flooded the field, with models such as Llama [3,4], the Gemini family [5], Mistral [6], and Phi [7]. Built on massive datasets and sophisticated architectures, these models have an unparalleled ability to understand, reason, and generate human-like text.
Leading tech companies like Google, Meta, and Microsoft have developed their own models, integrating and deploying them into their commercial products to enhance their functionalities. The emergence of the first open-source models has further democratized technology access, enabling the broader community to integrate and utilize these models in a wide range of applications. From facilitating natural language understanding to enabling creative content generation, LLMs have become indispensable tools across academia and industry. They are being used for tasks such as automated customer support [8], personalized recommendations [9], real-time translation [10], summarization of complex documents [11], and even aiding in scientific research by analyzing large datasets, generating hypotheses and better representations [12]. The versatility and robustness of these models make them critical assets in driving innovation and efficiency across diverse sectors.
However, as these models have grown more powerful, they have also become increasingly complex, resembling opaque black boxes whose inner workings are difficult to interpret. This lack of transparency poses significant challenges for model interpretability and accountability [13,14]. The intricate architectures and vast parameter spaces make it challenging to understand how these models arrive at their decisions. This opacity can lead to several negative consequences, such as the generation of fictitious information (typically called hallucinations) [15], where the model produces plausible-sounding but incorrect or nonsensical outputs. In addition, without a clear understanding of the decision-making process, it becomes difficult to identify and mitigate biases embedded in the models [16], potentially leading to unfair or discriminatory outcomes [17,18,19]. The lack of transparency not only undermines user trust but also raises ethical and security concerns, particularly when LLMs are used in critical applications such as healthcare [20], legal advice [21], education [22], and autonomous systems [23]. Addressing these issues is crucial to ensuring the responsible and effective use of LLMs and highlights the need for robust explainability frameworks and methods that illustrate the decision-making processes of these models.
In terms of artificial intelligence, explainability refers to the ability to understand, explain, and interpret the decisions and behavior of AI systems in human terms [24,25,26]. Within LLMs, explainability is particularly crucial due to the models’ inherent complexity and the high stakes of their applications. The ability to provide clear, understandable explanations for the decisions made by LLMs can significantly enhance user trust and facilitate the adoption of these technologies in sensitive areas.
Given the opaque nature of LLMs, achieving explainability requires innovative approaches that go beyond traditional model introspection. One promising direction is the use of contextual information to illuminate the decision-making processes of these models, enhancing interpretability. Such techniques have proven to be effective in dealing with knowledge cutoffs and the need to access updated content without retraining the model, as well as mitigating issues such as biases and hallucinations [27]. Furthermore, the widespread adoption of these techniques is driven by the ease of creating LLM-based applications and gaining greater control over data. By situating LLMs within specific controlled contexts, we can better interpret and understand the decision-making processes.
In this work, we propose a novel approach to interpretability by leveraging interaction techniques that provide additional context to LLMs. Rather than dissecting the LLMs themselves, we focus on how contextual information can be utilized to interpret and explain the models’ reasoning, chains of thoughts, and outcomes. The aim of this study is to evaluate this capability, shedding light on the reliability and interpretability of LLM outputs in relation to the popular interaction techniques used today. This complements, rather than replaces, the existing explainability and interpretability techniques in the literature.
To test our approach, we introduce the Context-Driven Divergent Knowledge Evaluation (CDK-E) methodology, designed to evaluate the interpretability of large language models within context-divergent scenarios. The novelty of our contribution lies in how we incorporate fabricated contexts to systematically test a model’s ability to adapt its reasoning process. This is achieved through the Divergent Knowledge Dataset (DKD), which includes a series of questions based on fabricated historical events and facts that act as context, diverging from the model’s inherent knowledge. The CDK-E evaluates whether the model can produce accurate and contextually aligned responses, providing insights into how its reasoning adapts to the given context.
Our methodology establishes a baseline using state-of-the-art LLMs through prompt engineering and a robust performance assessment module. By structuring the experiment around divergent contexts, we directly address the research question of how well LLMs can align their outputs with external information, offering a more thorough assessment of their interpretability in complex scenarios. The empirical results show that advanced LLMs achieve strong alignment with the provided divergent contexts, confirming our hypothesis that contextual information significantly enhances interpretability. Furthermore, the strong correlation between LLM-based metrics and semantic evaluations affirms the reliability of our assessment module.
The main contributions of this paper are as follows:
  • The introduction of the Context-Driven Divergent Knowledge Evaluation (CDK-E) methodology along with the Divergent Knowledge Dataset (DKD), a novel methodology and dataset for evaluating the interpretability of LLMs in context-specific scenarios that diverge from the model’s inherent knowledge.
  • The presentation of empirical results demonstrating the effectiveness of contextual information in achieving interpretability and explainability.
  • The provision of analysis and discussion for the integration of LLMs, focusing on improving interpretability and enhancing the understanding of their decision-making processes, along with a framework for future research in explainable AI (XAI).
The rest of the paper is organized as follows: Section 2 discusses the related work and context-based interaction techniques, Section 3 presents the CDK-E methodology and the DKD, Section 4 describes the experimental setup, Section 5 details the results and discussion, and Section 6 provides the conclusions of our work.

2. Related Work

In order to address the challenges posed by the increasing complexity and opacity of large language models, significant research has been dedicated to enhancing their interpretability, explainability, and trustworthiness. These efforts have led to the development of a range of methodologies aimed at making LLMs more transparent and reliable. In this section, we explore the most prominent approaches in explainable AI for LLMs, with a particular focus on the context-based interaction techniques relevant to our work.

2.1. Context-Based Interaction Techniques

The field of prompt engineering has emerged as a new way to explore the potential of LLMs by developing and refining natural language prompts. These techniques enable us to exploit the full potential of LLMs while addressing their inherent limitations [28]. Through effective prompt engineering, we can enhance the performance of LLMs in a wide range of tasks, from question answering to complex reasoning.
Prompt engineering involves more than just designing complex prompts; it encompasses a variety of techniques, skills, and methods to extract information, process it, and interact with LLMs. By crafting robust and efficient prompts, we can significantly improve the safety, reliability, and overall performance of these models [29]. Moreover, the approaches to providing information and interacting with LLMs—such as zero-shot prompting, multi-turn dialogues, chain-of-thought prompting, and self-consistency—are crucial for achieving the desired results. These techniques are particularly important for integrating LLMs with specific knowledge or external tools, or for sequencing their tasks, enabling the models to provide relevant and updated information that extends beyond their training.
The essence of many of these techniques is based on the use of context together with the original request. Context-based interaction techniques leverage the additional information provided within the original request to guide the LLMs’ responses. This context can include specific instructions, examples, or any pertinent information that helps to shape the model’s output in a desired manner. By embedding context, we can make the models’ responses more coherent, accurate, and aligned with human understanding [30].
One of the primary advantages of context-based techniques is their ability to mitigate the limitations of LLMs’ knowledge cutoffs. Given that these models can only consider a limited amount of text at a time, crafting prompts that efficiently utilize this window is essential. Effective prompt design ensures that the most critical information is included within the context window, thereby improving the model’s ability to generate meaningful and accurate responses.
In the following, we list several advanced interaction techniques that have been developed to further enhance the capabilities of LLMs:
  • Few-shot Prompting: In this technique, a few examples are provided to the model as part of the prompt. This approach enables in-context learning to help the model understand the desired output format and the type of responses expected [31].
  • Chain-of-Thought Prompting: Chain-of-thought (CoT) prompting encourages the model to break down complex problems into smaller manageable steps [32]. By prompting the model to “think aloud” and generate intermediate reasoning steps, users can improve the model’s ability to handle tasks that require logical reasoning and multi-step problem solving. This technique enhances the transparency and interpretability of the model’s decision-making process.
  • Retrieval-Augmented Generation: One of the most popular techniques nowadays due to its ability to connect LLMs with external information sources, Retrieval-Augmented Generation (RAG) [33] combines the LLMs’ understanding and generation capabilities with external information retrieval. These retrieval mechanisms fetch relevant and up-to-date information from external sources, enabling LLMs to access an authoritative knowledge base beyond their training data before generating a response. RAG enhances the already powerful capabilities of LLMs by integrating specific domain knowledge or an organization’s internal knowledge base, all without the need for model retraining. It is a cost-effective approach that addresses the limitation of knowledge cutoffs, ensuring that outcomes are relevant, accurate, and useful in various contexts.
  • Prompt Chaining: This technique breaks down complex tasks requiring iterative processing or multi-step reasoning into smaller manageable tasks. By connecting multiple prompts in a sequence, each building on the previous one, prompt chaining guides the model through a series of steps. This approach helps to improve the reliability, performance, and understanding of LLMs’ interactions, resulting in more accurate and comprehensive responses.
  • ReAct: ReAct (synergizing reasoning and acting) explores the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner [34]. By prompting the model to not only reason through a problem but also suggest specific actions, the system can induce, track, and update action plans while handling exceptions. Additionally, ReAct enables the model to integrate external sources such as knowledge bases or environments to inform its actions.
These are just some of the most commonly used context-based interaction techniques, but there are many others, such as self-consistency [35], Active-Prompt [36], Directional Stimulus Prompting [37], and Reflexion [38], or more complex techniques that integrate multimodal inputs, such as Graph Prompting [39] and Multimodal CoT [40]. All these techniques rely on the effective use of context and how it is extracted, prepared, and integrated into prompts.
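To make these techniques concrete, the following minimal Python sketch (not taken from the paper or any specific library) illustrates how a few-shot worked example and a chain-of-thought instruction can be combined into a single prompt; the example content and wording are illustrative assumptions.

```python
# Minimal illustrative sketch: combining a few-shot worked example with a
# chain-of-thought instruction in a single prompt. The example content and
# wording are assumptions, not taken from the paper.
FEW_SHOT_EXAMPLE = (
    "Q: A train leaves at 14:00 and arrives at 17:30. How long is the trip?\n"
    "A: Let's think step by step. From 14:00 to 17:00 is 3 hours; 17:00 to 17:30 "
    "adds 30 minutes. Final answer: 3 hours and 30 minutes.\n"
)

def cot_prompt(question: str) -> str:
    """Prepend a worked example and ask the model to reason step by step."""
    return f"{FEW_SHOT_EXAMPLE}\nQ: {question}\nA: Let's think step by step."

if __name__ == "__main__":
    print(cot_prompt("A meeting starts at 09:15 and lasts 110 minutes. When does it end?"))
```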
Despite the potential improvements that context-based techniques offer, it is crucial to be aware of their limitations. Irrelevant context or noisy ground-truth labels can degrade the model’s performance [41,42]. Ensuring the quality and relevance of the context is essential. This means not only selecting the appropriate information but also structuring it in a way that aligns well with the task at hand. By meticulously curating context, we can mitigate these risks and maximize the effectiveness of LLMs, harnessing their full potential to generate high-quality context-aware responses.
In summary, context-based interaction techniques through prompt engineering are key for maximizing the utility of LLMs and integrating them into diverse and specific contexts. These techniques offer structured methods to interact with LLMs, addressing their limitations and unlocking new capabilities. Their effectiveness in addressing LLM limitations and extending their capabilities has led to widespread adoption, underscoring the importance of basing model explainability on context-based techniques.

2.2. Explainable AI and Interpretability in Large Language Models

The transformative impact of LLMs opens up new possibilities for XAI research, posing a major challenge in terms of their explainability, interpretability, and ethical and reliable use. The traditional methods and strategies for interpreting machine learning models, such as feature importance analysis and decision tree visualization, are not directly applicable to LLMs due to their massive size and complex architecture. Therefore, researchers have explored various approaches tailored to the unique characteristics of LLMs to provide insights into their decision-making processes [43].
One prominent approach to enhancing the interpretability of LLMs involves the use of attention mechanisms [44]. Attention mechanisms, which are integral to the architecture of many LLMs, have been leveraged to provide insights into which parts of the input data the model is focusing on when making predictions. For instance, studies have shown that visualizing attention weights can help to trace the reasoning paths taken by the model, thereby offering a glimpse into its decision-making process [45]. However, the complexity of attention patterns often requires sophisticated methods to decode meaningful explanations.
Another significant area of research has been the development of post hoc explanation techniques [26]. These methods aim to provide insights into the decision-making processes of LLMs without altering their underlying structure. One common approach is the use of Local Interpretable Model-agnostic Explanations (LIMEs) [46], which approximate the model locally with a simpler interpretable model to explain individual predictions. Another technique, Shapley Additive Explanations (SHAPs) [47,48], leverages cooperative game theory to attribute the contribution of each feature to the final prediction, offering a detailed understanding of feature importance. Additionally, counterfactual explanations have gained appeal [49], where slight modifications are made to the input data to observe how these changes affect the output, thereby highlighting the critical factors influencing the model’s decisions. These post hoc methods are essential for uncovering biases, identifying potential errors, and enhancing user trust in LLMs, especially in high-stakes applications.
Despite these advancements, challenges remain to ensure that the explanations provided by XAI techniques are faithful, understandable, and meaningful. The inherent complexity of LLMs means that, even with advanced methods like attention visualization and post hoc techniques, it is difficult to guarantee that the explanations accurately reflect the true reasoning process of the model. These techniques can sometimes oversimplify the complex operations within LLMs, posing the risk of misrepresentation. Additionally, research challenges persist, such as providing explanations without ground-truths, developing metrics to measure the quality of explanations, mitigating biases, addressing hallucinations, and adapting traditional interpretability techniques to the scale and complexity of LLMs.
There is also ongoing exploration into understanding model decisions through contextual explanations and explainable prompting [50,51]. A key differentiator between LLMs and traditional artificial intelligence models is their ability to accept input data in the form of natural language during model inference [52]. This characteristic offers more intuitive and user-friendly insights into these powerful models’ decision-making processes, enabling us to analyze, understand, and draw conclusions using natural language. Indeed, such techniques are being widely adopted and have proven to be effective in addressing the problems of knowledge cutoffs and accessing up-to-date information [53,54].
Building on these foundations, our work proposes a novel approach to interpretability through context-based interaction techniques. By focusing on the contextual information inherent in the inputs and outputs of LLMs, we aim to offer explanations and interpretations that are more intuitive and user-friendly. This approach not only enhances transparency but also aligns closely with how humans understand and generate language, making the explanations more accessible and actionable for users. To the best of our knowledge, this is the first time that the performance of LLMs has been evaluated using these interaction techniques within divergent content. This evaluation highlights context as a powerful means to interpret and understand the outputs of the models.

3. CDK-E: Context-Driven Divergent Knowledge Evaluation

In this section, we introduce the Context-Driven Divergent Knowledge Evaluation (CDK-E) methodology, the principal contribution of our work. CDK-E is designed to evaluate the interpretability of large language models in scenarios that rely on external context rather than the model’s internal knowledge. Specifically, the methodology tests LLMs using contexts that diverge from their inherent knowledge, meaning that the models can only generate correct responses based on the provided information. By assessing the accuracy of these context-dependent responses, CDK-E offers a more interpretable and reliable evaluation of the models’ decision-making processes. This approach complements other XAI techniques by establishing a framework for leveraging context-based methods to interpret model outputs under controlled knowledge conditions.
To support the implementation of this methodology, we also introduce the Divergent Knowledge Dataset (DKD), a dataset specifically designed to provide these divergent contexts necessary for testing the approach. It contains fabricated contexts, questions, and ground-truths that challenge the models’ inherent knowledge, forcing them to rely on the provided context to produce accurate responses.
The CDK-E methodology evaluates the interpretability of an LLM by obtaining answers to the DKD questions given a specific context and assessing their correctness under controlled knowledge conditions. Figure 1 illustrates a diagram with the CDK-E methodology’s key components and process flow.
In this process, the fabricated contexts and DKD questions are integrated with the prompt in the inference pipeline, forming the query sent to the LLM under evaluation. In this study, we assess multiple models to compare their performance in these divergent-context scenarios. Then, the generated responses are analyzed within the performance assessment module, which utilizes independent evaluator LLMs to compute key performance and interpretability metrics. This module compares the generated answers with ground-truth answers to calculate the metrics, thus providing insights into the model’s ability to interpret and adapt to specific contexts while addressing key challenges in XAI for LLMs, such as model reliability, transparency, and alignment with provided contexts.
Specifically, the CDK-E methodology addresses the following:
  • Biases and Hallucinations: By using divergent contexts, CDK-E enhances the detection of hallucinations and biases in LLM responses as these issues become more apparent when the model is exposed to information that conflicts with its internal knowledge.
  • Contextual Misalignment: By evaluating LLM outputs against the provided context, CDK-E helps to identify any misalignment between the generated responses and the ground-truths.
  • Scalability: The model-independent pipeline, combined with prompt engineering, enables scalable and flexible evaluation across various models and datasets.
In the following sections, the key components are outlined, including dataset fabrication and the performance assessment module.

3.1. DKD: Divergent Knowledge Dataset

As part of the CDK-E methodology, we introduce the Divergent Knowledge Dataset (DKD) to evaluate the interpretability of LLMs in scenarios that deviate from their internal knowledge. The DKD consists of contexts intentionally modified to differ from information the models inherently possess. For each context, we created corresponding questions and ground-truth answers that require the models to rely solely on the provided divergent information to produce accurate responses. This dataset is a foundational component of our approach, ensuring rigorous and contextually relevant evaluation.
Figure 2 illustrates the dataset fabrication process. First, real contexts are altered to create divergent scenarios, and questions are generated based on these modified contexts. This is accomplished using a data generator LLM, which automates the process. Dataset generation is conducted only once, and the resulting data are used across all evaluated models, ensuring consistency.
Next, ground-truth answers are created for each question. These answers are carefully reviewed by human annotators to ensure their accuracy and relevance to the modified contexts. Finally, a comprehensive human review is conducted to validate the quality of the fabricated contexts, the generated questions, and the ground-truth answers, ensuring that each element aligns with the intended divergent knowledge conditions. Each of these steps is described in detail in the following sections.

3.1.1. Divergent Context Fabrication

We create divergent contexts by systematically altering existing information to challenge the inherent knowledge of the models. These contexts involve modifications to elements such as dates, names, locations, logical sequences, and events across various topics, including history, literature, popular culture, and basic mathematics. The modifications are crafted to be plausible yet absent from the training data of the LLMs, ensuring that models rely solely on the altered information to generate accurate responses.
To construct these contexts, articles from reliable and comprehensive databases can serve as source material. Using a single LLM, we modify key details in the contexts to create coherent yet divergent scenarios. This process involves altering data points—such as dates, names, and locations—to ensure that the contexts are sufficiently challenging and novel for the models. The prompts guiding the LLM in this task are specifically designed to yield contexts that are both coherent and contextually rich.
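As an illustration only, the following Python sketch shows how such a data generator LLM might be invoked to fabricate a divergent context; the prompt wording, model name, and client setup are assumptions, and the actual prompt used in this work is given in Appendix A.1.

```python
# Hypothetical sketch of the fabrication step: a data generator LLM is asked to
# alter dates, names, locations, and events in a source article. The prompt
# wording and model name are assumptions for illustration; the actual prompt
# used in this work is given in Appendix A.1.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FABRICATION_PROMPT = (
    "Rewrite the following article so that key dates, names, locations, and events "
    "are plausibly but deliberately altered, while keeping the text coherent and "
    "internally consistent:\n\n{article}"
)

def fabricate_divergent_context(article: str, model: str = "gpt-4o") -> str:
    """Return a divergent version of the input article produced by the generator LLM."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": FABRICATION_PROMPT.format(article=article)}],
        temperature=0.7,  # some randomness helps produce varied alterations
    )
    return response.choices[0].message.content
```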
Figure 3 provides examples of divergent contexts included in the DKD, illustrating modifications to dates, places, and the narratives surrounding key events. These are limited excerpts from the complete set of generated divergent contexts.

3.1.2. Question Design

From the divergent contexts, a series of questions are formulated that require models to rely exclusively on the provided information to produce accurate and coherent responses. These questions vary in complexity, from moderate to high, to ensure they can be answered solely based on the altered context. A single LLM is used to generate a substantial number of questions for each context, with prompts carefully crafted to create questions that are both challenging and directly relevant to the specific alterations introduced. Figure 3 includes two examples of questions generated for the divergent context excerpts.

3.1.3. Validation and Annotation

The divergent contexts and generated questions were manually reviewed by two human annotators who established specific criteria to guide the annotation process. This validation involved each annotator independently checking each generated question for logical coherence, alignment with the modified context, and manually providing answers to establish the ground-truth dataset. The use of human annotators ensures high-quality contextually relevant ground-truths that automated systems might not provide as human judgment captures nuanced contexts and assesses the logical coherence of the questions and answers. When discrepancies arose, annotators discussed and refined their criteria to reach consensus, ensuring consistency and reliability in the annotated data.

3.2. Performance Assessment Module

The CDK-E incorporates a robust performance assessment module based on RAGAS [55] and metrics analyzed by evaluator LLMs. The RAGAS metrics measure the alignment of LLM responses with ground-truth data and contextual information, capturing semantic alignment. Furthermore, in order to simulate a human-based assessment and to complement these metrics, we introduce three additional metrics derived from evaluator LLM interpretations. These additional metrics are essential for assessing aspects such as response accuracy and completeness beyond mere semantic alignment, offering a more comprehensive view of model performance in relation to the altered context.
Thus, our assessment employs two primary RAGAS metrics—Answer Semantic Similarity and Answer Correctness—to evaluate correctness and relevance based on embeddings and factual accuracy. These are complemented by our three proposed LLM-based evaluator metrics, which are detailed in the following sections.

3.2.1. Answer Semantic Similarity

The Answer Semantic Similarity (ASS) metric evaluates how closely the generated response matches the ground-truth answer in terms of meaning and context. This metric uses semantic similarity measures to compare the two texts, ensuring that the generated response contains the same information as the ground-truth, even if the wording is different. The score ranges from 0 to 1, with higher values indicating greater semantic alignment.
To calculate the ASS metric, embeddings (vectorized representations) of both the generated answer and the ground-truth answer are obtained using a language model encoder. The embeddings are then compared using cosine similarity, which is calculated as follows:
\[ \mathrm{ASS} = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}} \tag{1} \]
where
  • A represents the embeddings (vectorized form) of the generated answer.
  • B represents the embeddings (vectorized form) of the ground-truth answer.
  • $A_i$ and $B_i$ are the components of vectors A and B, respectively.
  • n is the number of dimensions in the vector space.
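For illustration, a minimal Python sketch of this computation is shown below; the toy embedding vectors are placeholders, since in practice both embeddings are produced by a language model encoder.

```python
# Minimal sketch of the ASS computation: cosine similarity between the embedding
# of the generated answer and that of the ground-truth answer. The toy vectors
# below are placeholders; in practice both come from a language model encoder.
import math

def answer_semantic_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

generated_embedding = [0.12, 0.80, 0.31]     # placeholder embedding of the generated answer
ground_truth_embedding = [0.20, 0.74, 0.38]  # placeholder embedding of the ground-truth answer
print(f"ASS = {answer_semantic_similarity(generated_embedding, ground_truth_embedding):.3f}")
```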

3.2.2. Answer Correctness

The Answer Correctness (AC) metric measures the accuracy of the generated answer when compared to the ground-truth, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground-truth.
The calculation of the AC is based on the use of two measures: the already introduced Answer Semantic Similarity (ASS) (1) and Factual Correctness (FC), also defined by RAGAS [55]. The Factual Correctness quantifies the factual overlap between the generated answer and the ground-truth answer and is calculated as the F1 score as follows:
\[ \mathrm{FC} = \frac{TP}{TP + \frac{1}{2}\,(FP + FN)} \tag{2} \]
where
  • TP (true positive) represents the number of facts or statements that are present in both the ground-truth and the generated answer.
  • FP (false positive) represents the number of facts or statements that are present in the generated answer but not in the ground-truth.
  • FN (false negative) represents the number of facts or statements that are present in the ground-truth but not in the generated answer.
With the Factual Correctness and Answer Semantic Similarity obtained, the Answer Correctness is calculated as
\[ \mathrm{AC} = (1 - w)\,\mathrm{ASS} + w\,\mathrm{FC} \tag{3} \]
where (1 − w) is the weight assigned to the ASS, and w is the weight assigned to the FC, with 0 ≤ w ≤ 1.
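The following minimal sketch implements these two formulas directly; the TP/FP/FN counts are placeholders, as in RAGAS the extraction and matching of factual statements is itself carried out by an evaluator LLM.

```python
# Minimal sketch of the FC and AC formulas above. The TP/FP/FN counts are
# placeholders: in RAGAS, the extraction and matching of factual statements is
# itself carried out by an evaluator LLM.
def factual_correctness(tp: int, fp: int, fn: int) -> float:
    """F1-style factual overlap between the generated and ground-truth answers."""
    return tp / (tp + 0.5 * (fp + fn))

def answer_correctness(ass: float, fc: float, w: float = 0.25) -> float:
    """AC = (1 - w) * ASS + w * FC, with 0 <= w <= 1 (w = 0.25 is the RAGAS default)."""
    return (1 - w) * ass + w * fc

# Example: 3 shared facts, 1 extra fact in the answer, 1 fact missing from it.
fc = factual_correctness(tp=3, fp=1, fn=1)
print(f"FC = {fc:.3f}, AC = {answer_correctness(ass=0.91, fc=fc):.3f}")
```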

3.2.3. LLM-Score-Based Metrics for Accuracy and Completeness

We propose a set of metrics based on a large language model’s scoring system to evaluate the accuracy and completeness of generated answers relative to the ground-truth:
  • Accuracy (Acc): Assesses whether the LLM-generated answer reflects the factual information presented in the ground-truth, calculated by prompting “Does the LLM-generated answer correctly reflect the facts presented in the ground-truth?”.
  • Completeness (Cm): Evaluates whether the answer includes all key points, calculated by prompting “Does the LLM-generated answer cover all key points mentioned in the ground-truth?”.
The key points of the evaluation criteria for both metrics are shown in Table 1. The complete prompts used for these metrics can be found in Appendix A.6.
These metrics provide a detailed assessment of both the accuracy and completeness of LLM-generated responses, enabling a nuanced evaluation of the model’s alignment with the ground-truth. Each evaluation considers the context in the prompt, the factual content of the ground-truth, and the LLM’s generated response.

3.2.4. LLM Answer Validation

The LLM Answer Validation (LAV) metric uses a large language model to determine, in a binary fashion, whether the generated response is correct when compared to the ground-truth. As the third metric proposed in this study, it ensures consistency with human judgments by leveraging the advanced validation capabilities of language models.
\[ \mathrm{LAV} = \begin{cases} 1 & \text{if the evaluator LLM considers the response valid} \\ 0 & \text{otherwise} \end{cases} \tag{4} \]
This metric provides straightforward binary validation, enhancing the robustness of the evaluation framework by ensuring that the responses are not only semantically and factually correct but also validated by a sophisticated language model. LAV is the main metric in our framework because it most closely aligns with human evaluative processes, making it capable of discerning subtle differences that determine the correctness of a response. Unlike similarity-based metrics, which can produce high scores for minimal changes, LAV provides a more nuanced assessment of correctness. The prompt used for the evaluation can be found in Appendix A.7.
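As an illustrative sketch only, the LAV check can be implemented by asking an evaluator LLM for a binary verdict, as below; the prompt wording and evaluator model name are assumptions (the actual prompt is given in Appendix A.7), and the Acc and Cm metrics follow the same pattern with the questions quoted in Section 3.2.3.

```python
# Illustrative sketch of the LAV check: an evaluator LLM returns a binary
# verdict on the generated answer. The prompt wording and evaluator model name
# are assumptions (the actual prompt is given in Appendix A.7); the Acc and Cm
# metrics follow the same pattern with the questions quoted in Section 3.2.3.
from openai import OpenAI

client = OpenAI()

def llm_answer_validation(context: str, ground_truth: str, answer: str,
                          model: str = "gpt-4-turbo") -> int:
    prompt = (
        "Given the context and the ground-truth answer, is the generated answer "
        "correct? Reply with exactly 'yes' or 'no'.\n\n"
        f"Context:\n{context}\n\n"
        f"Ground-truth answer:\n{ground_truth}\n\n"
        f"Generated answer:\n{answer}"
    )
    verdict = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic judgement
    ).choices[0].message.content.strip().lower()
    return 1 if verdict.startswith("yes") else 0
```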

4. Experimental Setup

In this section, we describe the experimental setup used to evaluate the interpretability of LLMs using the CDK-E. The primary objective is to assess the alignment of LLM responses with the provided divergent contexts and the correctness of these responses based on the ground-truth. We begin by detailing the configuration and creation of the Divergent Knowledge Dataset, followed by a presentation of the baseline models and the evaluation pipeline used in the experiments.

4.1. Configuration and Creation of DKD

The Divergent Knowledge Dataset was meticulously constructed to evaluate the interpretability of LLMs under conditions that challenge their inherent knowledge. This dataset includes 50 contexts covering a broad range of topics, such as history and historical figures, literature and authors, popular culture and entertainment, and science and mathematics. Each context was modified by altering dates, names, locations, logical sequences, and events to create scenarios that are sufficiently challenging for the models.
The contexts were sourced from Wikipedia articles and range from 4000 to 8000 characters in length (approximately 1200 to 2000 tokens). Modifications were performed using GPT-4o [56], transforming the original information into coherent divergent scenarios. The specific prompt used to guide the LLM in this task is detailed in Appendix A.1, and the details of the fabricated context can be found in Appendix B.
For question generation, GPT-4o [56] was also used to create 20 questions per context, resulting in a total of 1000 questions across the dataset. These questions are designed to be answerable solely based on the provided contexts, asking specifically for the altered information. The prompt used to generate these questions is detailed in Appendix A.2.
Finally, a manual annotation and review of the contexts, questions, and ground-truths was conducted by two annotators, who established specific criteria before beginning the review process. This involved verifying the logical coherence and relevance of the questions, as well as manually annotating the ground-truth answers for all 1000 questions. This rigorous validation process ensures the reliability and accuracy of the dataset, providing a robust foundation for evaluating the interpretability of LLMs.

4.2. Evaluated Models and Configuration

The baseline models and pipeline were established using prompt engineering techniques to directly create the queries and integrate the models through APIs provided by OpenAI (for GPT models) and Amazon Web Services (AWS) Bedrock (for the other models). This approach enables precise control over the prompt construction process, ensuring that the appropriate context from the Divergent Knowledge Dataset is injected into each question and that additional instructions can be included as needed.
By leveraging the APIs from OpenAI and AWS Bedrock, we ensure seamless integration of different LLMs without requiring modifications to the rest of the pipeline. This flexibility facilitates the dynamic construction of contextually enriched prompts, guiding the models to generate accurate and relevant responses. In this work, several state-of-the-art models were evaluated using this setup, ensuring a diverse representation of model architectures and capabilities. The models evaluated include the following:
  • GPT-3.5 Turbo [2]: An enhanced version of GPT-3, known for its improved performance and efficiency in generating human-like text. It has 175 billion parameters and is trained on a diverse range of internet text. Inference for GPT-3.5 Turbo was conducted using the OpenAI API.
  • GPT-4o [56]: The latest iteration in the GPT series, featuring even more parameters and fine-tuned capabilities. GPT-4o offers superior accuracy and a deeper understanding of context compared to its predecessors. Inference for GPT-4o was conducted using the OpenAI API.
  • Llama 3 (70B) [57]: Developed by Meta, this model features 70 billion parameters and is designed for large-scale language tasks. Llama 3 demonstrates state-of-the-art performance on a wide range of industry benchmarks and includes new capabilities such as improved reasoning and code generation. Inference for Llama 3 was conducted using AWS Bedrock.
  • Mixtral 8x7B [58]: Developed by Mistral AI, this is a Sparse Mixture-of-Experts (SMoE) model that utilizes only 12.9 billion active parameters out of a total of 46.7 billion. It is designed for efficiency and performance, with strong capabilities in multiple languages and coding tasks. Inference for Mixtral was conducted using AWS Bedrock. It is the smallest model tested, included to determine whether our interpretability thesis holds for less powerful models.
For all models, the inference configuration included a temperature of 0.2, a top-k value of 50 (used only for Llama 3 and Mixtral), and a top-p of 0.9. A low temperature setting, such as 0.2, reduces randomness in the output, encouraging the model to produce more consistent and focused responses by prioritizing high-probability words. The top-k sampling, set to 50, limits the choice in next words to the 50 most probable options, allowing some variability while preventing unlikely words from being considered. Lastly, a top-p value of 0.9, or nucleus sampling, dynamically narrows the selection to a subset of words whose cumulative probability reaches 90%, balancing precision with slight diversity. This consistent configuration ensures a fair and controlled comparison across different models, optimizing for accuracy and coherence in responses.
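As a minimal sketch, assuming the OpenAI Python SDK, the shared decoding configuration can be passed as follows; the Bedrock-hosted models (Llama 3 and Mixtral 8x7B) additionally receive top-k = 50 through the AWS Bedrock API, which the OpenAI API does not expose.

```python
# Minimal sketch, assuming the OpenAI Python SDK, of how the shared decoding
# configuration is applied for the GPT models; the Bedrock-hosted models
# (Llama 3 and Mixtral 8x7B) additionally receive top-k = 50 through the
# AWS Bedrock API, which the OpenAI API does not expose.
from openai import OpenAI

client = OpenAI()

def answer_with_context(prompt: str, model: str = "gpt-4o") -> str:
    """Send a context-enriched prompt with the evaluation's decoding settings."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # low randomness: consistent, focused responses
        top_p=0.9,        # nucleus sampling over the top 90% of probability mass
    )
    return response.choices[0].message.content
```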
By utilizing these diverse models, we aim to provide a comprehensive evaluation of the CDK-E methodology across different architectures and capabilities. This approach ensures that our findings are robust and generalizable across various state-of-the-art LLMs.

4.3. Contextual Prompt Injection, Evaluator LLMs, and Metric Configuration

For each question, the context is dynamically injected into the prompt, incorporating the full context to instruct the model in generating a concise response. Prompts were crafted to emphasize the relevance of the given context, ensuring models rely solely on the injected information rather than prior knowledge. Specific prompt formats for OpenAI models, LLama 3, and Mixtral 8x7B are detailed in Appendix A.3, Appendix A.4 and Appendix A.5.
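The following sketch illustrates one possible form of this injection; the instruction wording and example item are assumptions, and the exact prompt formats used for each model family are those given in Appendix A.3, Appendix A.4 and Appendix A.5.

```python
# Illustrative sketch of the contextual prompt injection for one DKD item. The
# instruction wording below is an assumption; the exact per-model prompt formats
# are those given in Appendix A.3, Appendix A.4 and Appendix A.5.
def inject_context(context: str, question: str) -> str:
    """Build a query that forces the model to rely only on the injected context."""
    return (
        "Answer the question using only the information in the context below. "
        "Ignore any prior knowledge that conflicts with it, and answer concisely.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# One hypothetical DKD item: a fabricated (divergent) context and its question.
dkd_item = {
    "context": "The novel was first published in 1957 in Buenos Aires.",
    "question": "Where was the novel first published?",
}
print(inject_context(dkd_item["context"], dkd_item["question"]))
```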
To evaluate LLM performance, we calculate Answer Semantic Similarity (ASS), Answer Correctness (AC), LLM-based accuracy (Acc), completeness (Cm), and LLM Answer Validation (LAV). For ASS, embeddings are obtained using OpenAI’s embedding model.
Metrics are calculated using GPT-4 Turbo [59] and Claude 3 Haiku [60] as evaluator LLMs. For AC, facts are retrieved using both models, with w = 0.25, the default in RAGAS, prioritizing semantic alignment over ground-truth factual overlap.

5. Results and Discussion

In this section, we present and discuss the results obtained from applying the CDK-E methodology to our version of the DKD employing a baseline with different LLMs. Our evaluation focused on the overall correctness of these responses based on a predefined ground-truth as well as on how well these models align their responses with the provided divergent contexts.

5.1. Empirical Results

Table 2 shows the empirical results of our work using the DKD, comprising 50 contexts with a total of 1000 questions, with GPT-4 Turbo (γ) and Claude 3 Haiku (κ) as evaluators. The table presents the LAV precision, i.e., the proportion of correct answers out of the total number of questions, and the averages obtained for Answer Semantic Similarity (ASS), Answer Correctness (AC), and the LLM-based accuracy (Acc) and completeness (Cm) metrics across the 50 divergent contexts.
LAV precision will be the reference measure in our analysis, as it is the most human-like metric: it determines whether the answer is correct, and thus whether the model has aligned with the divergent knowledge, regardless of whether the answer is also accurate, complete, or lies within the same vector space as the ground-truth.
As shown, the high values of measures such as LAV precision, ASS, or accuracy for both evaluator models indicate that it is indeed possible to interpret, understand, and explain the outputs of these models in terms of the divergent contexts provided, which confirms the main thesis of our methodology, CDK-E.
Table 2 also shows that our interpretability thesis is best fulfilled when the model performs better. GPT-4o shows the highest adherence to the provided divergent contexts with LAVs of 96.6% and 96.9% for GPT-4 Turbo and Claude 3 Haiku as the evaluators, respectively. This indicates that it was highly effective in generating responses that aligned with the altered contexts, reflecting its advanced understanding and reasoning capabilities. This performance proves that leveraging contextual information significantly enhances the interpretability, understanding, and explainability of model outcomes, even when such contexts diverge significantly from inherent knowledge.
GPT-3.5 and Llama 3 also exhibit strong performance in LAV precision, with GPT-3.5 ranking as the second-best model across all the metrics. Llama 3 obtains particularly low results in completeness and Answer Correctness (AC). Upon analyzing the answers provided during inference, we observe that this model tends to respond much more tersely than the rest of the models analyzed. Although it adheres well to the divergent information, it often fails to include all the details present in the ground-truth. Such short but correct answers significantly affect both metrics, distorting the real performance of the model.
Mixtral 8x7B, the smallest model analyzed, demonstrates slightly lower, yet still impressive, adherence to divergent contexts, with LAV scores of 84.8% and 91.9%, and with high values for ASS, outperforming Llama 3 in completeness and Answer Correctness. These results confirm that even models with fewer parameters can effectively leverage context to generate accurate and coherent responses.
Regarding the evaluative models, GPT-4 Turbo and Claude 3 Haiku yield generally similar trends, reinforcing the observation that higher-performing models tend to align better with the divergent context provided in CDK-E. However, Claude 3 Haiku tends to assign slightly higher scores to models like Llama 3 and Mixtral compared to GPT-4 Turbo, particularly on LLM-based metrics such as accuracy and completeness. This suggests a slight difference in sensitivity between the evaluators, although the overall patterns remain consistent. Full tables of results for both evaluators are presented in Appendix C, and the estimated costs of using LLMs in the study, along with the exact versions used for each model, are provided in Appendix D.

5.2. Detailed Response Analysis

A detailed analysis of the models’ responses and errors reveals noteworthy findings. GPT-4o demonstrates the strongest understanding and interpretation of divergent contexts, with no hallucinations or major misinterpretations observed. Instances of misalignment with divergent knowledge are rare; however, these typically occur when answers require deeper interpretative reasoning or involve strongly ingrained facts in the model’s internal knowledge. In our analysis, three main types of errors were identified:
  • Hallucination errors: These occur when models generate information that is entirely unrelated to the provided context or divergent knowledge. The models with inferior performance, especially Mixtral 8x7B, exhibited a relatively high frequency of hallucination errors. For example, in response to a question about a fictional historical event, Mixtral 8x7B introduced unrelated historical facts not included in the prompt, likely drawing from its internal knowledge base.
  • Context misalignment: This type of error occurs when models fail to use the divergent context correctly even though relevant information is available. This is the most common error and occurs mainly when the answer to the question is not explicitly found in the divergent context but the model needs to assimilate some of that knowledge to achieve an indirect answer. For example, when a question involves calculating a celebrity’s age based on an altered date of birth, the model may default to its internal knowledge instead of processing the divergent information. Such cases highlight opportunities for improvement, particularly through refined prompting techniques. Approaches like chain-of-thought [32], task decomposition, and sequential reasoning could be employed to guide the model’s focus toward divergent information more effectively. These methods are increasingly utilized in the most advanced models, such as OpenAI’s GPT-o1 series, and show promise in enhancing alignment with context-specific knowledge.
  • Verbose and extraneous information: Predominantly found in the GPT models, this error involves providing overly detailed or verbose responses that include information not directly relevant to the question. For instance, when asked about a simple fictional fact, GPT-3.5 added unnecessary background information, diluting the focus of the answer. On the other hand, Llama 3 and Mixtral 8x7B were more concise in their responses, sometimes to the point of omitting important information.
Regarding model performance by content type, the analysis shows that complex topics such as scientific subjects and historical contexts tend to yield more errors across all the models. In particular, questions on topics like deep learning, relativity, and mathematics showed a greater incidence of both hallucination and misalignment errors, likely due to the higher cognitive demand these topics place on the models to override inherent knowledge with fabricated information. In contrast, the models performed more reliably with fictional contexts (e.g., literature and fantasy), showing stronger alignment with divergent knowledge and fewer hallucinations, likely due to the more flexible and creative nature of these contexts.

5.3. Correlations Between Metrics

In this section, we analyze the correlations between various evaluation metrics to validate the effectiveness of our proposed methodology. First, we focus on how LLM-based evaluation metrics (LAV, Acc, and Cm) correlate with objective metrics, such as ASS, which is based on cosine similarity. By understanding these correlations, we aim to identify any alignment or divergence between subjective model-based assessments and objective vector-space metrics, thereby evaluating the overall usefulness of the proposed measures.
Figure 4 shows the correlation between a weighted combination of the LLM-based metrics (0.5 for LAV, 0.25 for Acc, and 0.25 for Cm) and ASS, employing GPT-4 Turbo (Figure 4a) and Claude 3 Haiku (Figure 4b) as evaluators. The weighting was chosen to reflect the relative importance of each metric in capturing core aspects of model alignment with the context: LAV, as the primary indicator of correct answers in our analysis, is assigned the highest weight (0.5), while Acc and Cm each contribute 0.25 to balance precision with thoroughness.
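For clarity, the weighted combination and its correlation with ASS can be computed as in the following sketch; the per-context scores shown are placeholders, not values from Table 2.

```python
# Minimal sketch of the weighted combination used in Figure 4 and its Pearson
# correlation with ASS across contexts; the per-context scores are placeholders,
# not values from Table 2.
from statistics import correlation  # Pearson correlation, Python 3.10+

def weighted_llm_score(lav: float, acc: float, cm: float) -> float:
    """0.5 * LAV + 0.25 * Acc + 0.25 * Cm."""
    return 0.5 * lav + 0.25 * acc + 0.25 * cm

# Placeholder per-context values (one entry per DKD context).
lav = [0.95, 0.90, 1.00, 0.85, 0.92]
acc = [0.93, 0.88, 0.97, 0.80, 0.90]
cm  = [0.90, 0.86, 0.96, 0.78, 0.89]
ass = [0.94, 0.89, 0.98, 0.82, 0.91]

combined = [weighted_llm_score(l, a, c) for l, a, c in zip(lav, acc, cm)]
print(f"Pearson r between combined LLM-based score and ASS: {correlation(combined, ass):.3f}")
```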
The figure presents the results for each of the 50 contexts in the Divergent Knowledge Dataset (DKD) for the four models analyzed, highlighting the correlation area between ASS and the weighted LLM-based metrics. These results illustrate the alignment between ASS and the combined LLM-based metrics (LAV, accuracy, and completeness), with GPT-4o demonstrating the highest overall alignment. The strong correlation observed between ASS and the weighted LLM-based metrics supports the validity of our proposed evaluation framework, suggesting that models performing well in generating semantically similar answers (high ASS) also score highly in the combined subjective metrics.
In particular, the figure shows that GPT-4o and GPT-3.5 Turbo exhibit the highest correlation between ASS and the weighted metrics with both evaluators. This indicates that models that excel in adhering to divergent contexts are also effective at producing semantically aligned responses. GPT-4o’s high scores across both measures underscore its advanced capability in handling context-specific information, while GPT-3.5 also achieves strong alignment, albeit with slightly less consistency than GPT-4o.
Analyzing the performance for Llama 3 and Mixtral 8x7B, we observe that both models exhibit lower ASS, reflecting a tendency for more concise or less contextually aligned responses. However, GPT-4 Turbo as the evaluator tends to assign lower scores in the LLM-based metrics, while Claude 3 Haiku generally provides higher, although more variable, scores. This contrast suggests that Haiku may be overestimating subjective metrics (or GPT-4 Turbo undervaluing them), highlighting slight differences between the evaluation models.
Furthermore, if we analyze the correlation between LAV precision and LLM-based accuracy, also employing GPT-4 Turbo as the evaluator, we obtain the diagram shown in Figure 5. This diagram shows the correlation between both measures across the 50 contexts for the model that committed the most errors, Mixtral 8x7B. The figure clearly shows that the analysis of response accuracy correlates highly with response validity, confirming the soundness of the proposed metrics.
Some of these metrics rely on the evaluative LLM’s ability to respond or score accurately, making them somewhat subjective. Therefore, we decided to manually review the results, especially those of LAV. Additionally, to confirm the consistency of the metrics, we tested the model outputs using the real context while keeping the altered ground-truth and confirmed that the metrics yielded very poor values in every context.
In summary, the high correlation between LAV, ASS, AC, and the LLM-based metrics indicates that models that perform well in maintaining context accuracy also tend to produce responses that are semantically similar, factually correct, and complete. This relationship suggests that improving a model’s ability to adhere to provided contexts can enhance its overall performance across multiple evaluation metrics.

5.4. Discussion

The empirical results obtained confirm that our CDK-E methodology effectively achieves interpretability and explainability for LLMs in context-divergent interactions. This supports the extension of these conclusions to many of the widely used context-based interaction techniques prevalent today. The study contributes valuable insights into whether it is feasible to trust, interpret, and understand model responses when using common context-based interaction techniques.
Unlike the existing XAI methods, which often focus on examining model internals or analyzing feature importance, CDK-E focuses on the interpretability of outputs in scenarios where the context provided diverges from the model’s inherent knowledge. This approach does not aim to replace the current interpretability techniques but rather to complement them by evaluating the models’ ability to adapt to and align with new context-specific information. Given the unique scope of our methodology, it was not feasible to directly compare it with traditional interpretability approaches as CDK-E’s primary goal is not to probe model internal mechanisms but to assess contextual adaptability and coherence.
Models with superior performance, such as GPT-4o, align better with divergent knowledge, committing fewer errors overall. This high alignment underscores the potential of advanced LLMs to provide reliable and accurate explanations for their outputs. The correlations observed between LLM-based metrics and cosine similarity metrics, along with tests using real contexts and meticulous manual validations, confirm the reliability of the metrics used for evaluating the methodology.
While CDK-E demonstrates significant strengths, it is worth noting certain challenges and considerations inherent to this approach. The methodology relies heavily on the construction and quality of divergent contexts, which requires a detailed and resource-intensive annotation process. Additionally, given the focus on context divergence, this study emphasizes controlled static contexts and may require further adaptation to evaluate real-time dynamic scenarios. Nonetheless, these challenges do not detract from the primary contribution of CDK-E as a practical and complementary tool for interpretability.
The results underscore the significance of context in enhancing the interpretability of LLMs. The ability of these models to align their responses with divergent contexts illustrates that contextual information can be effectively used to mitigate issues like hallucinations and biases. However, CDK-E alone does not fully address all the challenges related to explainability, particularly in cases where contextual misalignment leads to subtle biases that may go unnoticed. Additional layers of analysis, such as user studies or error analyses focused on specific biases, could provide a more holistic understanding of how models interact.
In conclusion, the CDK-E methodology offers a robust framework for evaluating the interpretability of LLMs in context-divergent scenarios. The high performance of models across multiple metrics reaffirms the effectiveness of leveraging contextual information to enhance model transparency and trustworthiness. Nonetheless, CDK-E should be viewed as one component of a broader XAI toolkit rather than a standalone solution.

5.4.1. Implications for Future Research

The development of more sophisticated context-based techniques can further improve the interpretability of LLMs, making them more trustworthy for critical applications. Investigating the performance of smaller models, such as Mixtral 8x7B, can provide insights into optimizing resource usage while maintaining high explainability standards. Furthermore, exploring additional evaluation frameworks, domain dependency, and biases can help to refine and ensure comprehensive and consistent assessments of LLM interpretability.

5.4.2. Challenges and Limitations

Although CDK-E demonstrates strong potential for enhancing the interpretability of LLM outputs, certain limitations inherent to the approach must be acknowledged. One key limitation is the dependence on the DKD, which currently consists of a relatively small sample—50 contexts and 1000 questions. Expanding the DKD with additional contexts and question types across a wider range of domains could enhance the robustness of CDK-E, providing a more comprehensive testbed for evaluating LLMs in context-divergent scenarios. However, scaling the dataset involves a resource-intensive process as each context must be carefully curated and validated to ensure that it accurately diverges from general knowledge while maintaining internal coherence.
Additionally, while CDK-E effectively addresses contextual misalignment in a controlled setting, it does not capture all the dimensions of explainability. For instance, incorrect responses that superficially align with the context can still carry subtle biases or inaccuracies that CDK-E alone may not detect. Further layers of analysis, such as targeted bias detection or user-centered studies, could provide deeper insights into model behavior in cases where simple context alignment might mask underlying issues.

6. Conclusions

This study presents the Context-Driven Divergent Knowledge Evaluation (CDK-E) methodology, demonstrating its effectiveness in enhancing the interpretability of large language models through divergent contexts. By leveraging the Divergent Knowledge Dataset (DKD), we evaluated how well LLMs can align their responses with provided contexts that diverge from their inherent knowledge.
The empirical results show that advanced models like GPT-4o and GPT-3.5 Turbo achieve the highest alignment with divergent contexts, with LAV precision scores of 96.6% and 94.2%, respectively, using GPT-4 Turbo as the evaluator. These findings validate our hypothesis that contextual information can significantly improve the interpretability of LLM outputs. The strong correlation between LLM-based metrics (LAV, Acc, and Cm) and Answer Semantic Similarity (ASS) further supports the reliability of our evaluation framework.
Our results also indicate that models with superior performance in traditional metrics (e.g., GPT-4o) are better suited to maintaining contextual accuracy, thereby providing more reliable and transparent responses. This suggests that improving a model’s alignment with provided contexts can enhance its performance across multiple evaluation dimensions. Importantly, CDK-E is intended to complement existing XAI techniques by focusing on the model’s adaptability to context rather than replacing other methods.
In summary, the CDK-E methodology offers a robust complementary framework for evaluating and enhancing the interpretability of LLMs, reinforcing their suitability for high-stakes applications where trust and transparency are critical. Regarding future research, we envision three key directions for expanding CDK-E and advancing explainable AI research:
  • Extended Datasets: Expanding the DKD to include more diverse and complex contexts, thereby increasing the robustness of the evaluation framework.
  • Advanced Metrics: Developing and integrating new metrics that capture additional dimensions of explainable AI, such as causal inference and user interpretability.
  • Interactive Interpretability: Exploring interactive approaches where users can engage with LLMs to understand and refine their responses, enhancing the transparency and usability of AI systems.

Author Contributions

Conceptualization, investigation, and formal analysis: A.P.-M.; methodology, software, validation, data curation, and writing—original draft, A.P.-M. and F.-J.S.-C.; writing—review and editing, A.P.-M., F.-J.S.-C., C.G.-M. and L.D.-F.; project administration and funding acquisition, C.G.-M., L.D.-F. and M.d.C.L.-P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Galician Innovation Agency (GAIN) and the Consellería de Cultura, Educación, Formación profesional e Universidades of the Xunta de Galicia through the program Doutoramento Industrial [61]. It also received funding from the Consellería de Cultura, Educación, Formación profesional e Universidades of the Xunta de Galicia for the “Centro singular de investigación de Galicia” accreditation 2019–2022, the “Axudas para a consolidación e estructuración de unidades de investigación competitivas do Sistema Universitario de Galicia -ED431B 2024/36”, and the European Union from the “European Regional Development Fund—ERDF”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The DKD can be accessed from https://github.com/andrespimartin/Divergent-Knowledge-Dataset (accessed on 23 January 2025).

Conflicts of Interest

Authors Andrés Piñeiro-Martín, Francisco-Javier Santos-Criado and María del Carmen López-Pérez were employed by the company Balidea Consulting & Programming S.L. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

    The following abbreviations are used in this manuscript:
Acc	Accuracy
AWS	Amazon Web Services
AC	Answer Correctness
ASS	Answer Semantic Similarity
API	Application Programming Interface
AI	Artificial Intelligence
BERT	Bidirectional Encoder Representations from Transformers
CoT	Chain of Thought
Cm	Completeness
CDK-E	Context-Driven Divergent Knowledge Evaluation
DKD	Divergent Knowledge Dataset
XAI	Explainable Artificial Intelligence
FC	Factual Correctness
GPT	Generative Pre-trained Transformer
LLM	Large Language Model
LAV	LLM Answer Validation
LIMEs	Local Interpretable Model-agnostic Explanations
RAG	Retrieval-Augmented Generation
RAGAS	Retrieval-Augmented Generation Assessment
ReAct	Reasoning and Acting
SHAPs	Shapley Additive Explanations

Appendix A. Prompts Utilized in CDK-E

This appendix shows the exact prompts we used for the implementation of the CDK-E.

Appendix A.1. Prompt Utilized for the Development of the Fabricated Context of the Dataset

This prompt was used for the creation of the contexts in the DKD. It was completed with the real context, and the LLM was responsible for modifying the details to produce the divergent knowledge context:
prompt=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": f"""Your task is to rewrite the given
context by changing all specific details like names, dates, places
and colors while keeping the main theme and facts intact.
 
Please ensure that:
1. The overall structure and important events remain unchanged.
2. Change specific details such as names, dates, places, and other
particulars to new, consistent, and plausible alternatives but invented.
3. Keep the theme of the text consistent with the original genre.
4. The rewritten content should read naturally and logically.
 
context:
{context}
"""}
  ]
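For reference, a minimal sketch of how such a message list can be sent to a chat-completion endpoint is given below. The client setup, model name, and helper name are illustrative assumptions for this sketch, not the exact configuration used in the study.
# Hypothetical sketch: sending the fabrication prompt above through the OpenAI Chat Completions API.
# The model name and the absence of sampling parameters are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def fabricate_divergent_context(prompt: list[dict]) -> str:
    # 'prompt' is the message list shown above, already filled with the real context.
    response = client.chat.completions.create(model="gpt-4-turbo", messages=prompt)
    return response.choices[0].message.content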

Appendix A.2. Prompt Utilized for the Creation of Contextual Questions

The following prompt was used to create questions for each fabricated context. It utilized few-shot prompting: the fabricated context was inserted, and the language model generated the related questions:
prompt=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": f"""Your task is to formulate exactly
20 questions from the given context.
 
End each question with a ‘?’ character and then in a newline
write the answer to that question using only the context provided.
Each question must start with "question:". Examples:
  question: What is Real Madrid’s nickname in Spanish?
  context_used: The club has traditionally worn a white home kit
  since its inception.
  question: When was Real Madrid founded??
  context_used: Founded in 1902 as Madrid Football Club, the club
  has traditionally worn a white home kit since its inception.
 
The question must satisfy the rules given below:
1. The question should make sense to humans even when read without
the given context.
2. The question should be fully answered from the given context.
3. The question should be framed from a part of context that
contains important information. It can also be from tables, code...
4. The answer to the question should not contain any links.
5. The question should be of moderate/high difficulty.
6. The question must be reasonable and must be understood and
responded to by humans.
7. Do not use phrases like ’provided context’, etc in the question.
8. Avoid framing questions using the word "and" that can be
decomposed into more than one question.
9. The question should not contain more than 10 words; make use of
abbreviations wherever possible.
 
context:
{context}
"""}
  ]
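Because the model returns the questions and their grounding as free text following the "question:" / "context_used:" convention above, the pairs have to be parsed before they can be added to the DKD. The parser below is a hypothetical sketch of this step, not the study’s exact implementation.
# Hypothetical parser: recovers (question, context_used) pairs from the raw model output,
# relying on the "question:" and "context_used:" markers requested in the prompt above.
def parse_generated_questions(raw_output: str) -> list[tuple[str, str]]:
    pairs: list[tuple[str, str]] = []
    question, answer_lines = None, []
    for line in raw_output.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("question:"):
            if question is not None:
                pairs.append((question, " ".join(answer_lines).strip()))
            question = stripped.split(":", 1)[1].strip()
            answer_lines = []
        elif stripped and question is not None:
            answer_lines.append(stripped.removeprefix("context_used:").strip())
    if question is not None:
        pairs.append((question, " ".join(answer_lines).strip()))
    return pairs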

Appendix A.3. Prompt Utilized for CDK-E Inference with OpenAI API

The following prompt was used for inference during the CDK-E assessment with the OpenAI models:
      prompt=[
      {"role": "user", "content": f"""You are a helpful assistant.
          Answer the following question based strictly and exclusively on
          the provided context. Do not use any prior knowledge or
          information outside of what is given in the context.
          IMPORTANT RESTRICTION: Don’t write that you based the question
          on the provided context but use it.
          context:
          {context}
 
          Question: {question}
          """}
      ]
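At inference time this prompt is instantiated once per (context, question) pair in the DKD. The driver loop below is a hypothetical sketch of that stage; the data structure, the helper build_inference_prompt(), and the model name are assumptions for illustration, and 'client' is the OpenAI client from the sketch in Appendix A.1.
# Hypothetical evaluation loop over the DKD. Each item is assumed to hold a fabricated
# context and its (question, ground_truth) pairs; build_inference_prompt() fills the
# message template shown above.
def run_inference(dkd: list[dict], model_name: str = "gpt-4o") -> list[dict]:
    results = []
    for item in dkd:
        for question, ground_truth in item["questions"]:
            prompt = build_inference_prompt(item["context"], question)
            response = client.chat.completions.create(model=model_name, messages=prompt)
            results.append({
                "question": question,
                "ground_truth": ground_truth,
                "answer": response.choices[0].message.content,
            })
    return results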

Appendix A.4. Prompt Utilized for CDK-E Inference with AWS Bedrock API for Llama 3

The following prompt was used for inference during the CDK-E assessment with the Llama 3 model:
      prompt=[
      {"role": "user", "content": f"""""<|begin_of_text|>
          <|start_header_id|>system<|end_header_id|>
 
          You are the helpful assistant.
          <|eot_id|><|start_header_id|>user<|end_header_id|>
 
          Answer the question briefly and concisely with only the final
          answer based ONLY on the following context, don’t use your prior
          knowledge.
 
          Do not include any multiple-choice options or
          explanations, just the answer.
 
          context:
          {context}
 
          Question:
          {question}
 
          <|eot_id|><|start_header_id|>assistant<|end_header_id|>
          """}
      ]
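For the Bedrock-hosted models, the templated prompt string is wrapped in a JSON request body and sent with boto3 instead of the OpenAI client. The snippet below is a hedged sketch assuming the boto3 bedrock-runtime client and the request/response fields documented for Llama 3 on Bedrock; the generation parameters are illustrative.
# Hypothetical sketch of invoking Llama 3 (70B) through AWS Bedrock.
# Field names follow the Bedrock Llama request/response schema; values are illustrative.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-2")

def invoke_llama3(prompt_text: str) -> str:
    body = json.dumps({
        "prompt": prompt_text,   # the templated prompt shown above
        "max_gen_len": 512,
        "temperature": 0.0,
    })
    response = bedrock.invoke_model(modelId="meta.llama3-70b-instruct-v1:0", body=body)
    return json.loads(response["body"].read())["generation"]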

Appendix A.5. Prompt Utilized for CDK-E Inference with AWS Bedrock API for Mixtral 8x7B

The following prompt was used for inference during the CDK-E assessment with the Mixtral 8x7B model:
      prompt=[
      {"role": "user", "content":
 
          f"""<s>[INST]
 
          Answer the following question based strictly and exclusively on
          the provided context. Do not use any prior knowledge or
          information outside of what is given in the context.
 
          context:
          {context}
 
          Question:
          {question}
          [/INST]
          """}
      ]
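The Mixtral invocation follows the same boto3 pattern with Mistral’s request schema. This is again a hedged sketch, under the assumption that the Bedrock Mistral models accept a "prompt"/"max_tokens" body and return an "outputs" list; parameter values are illustrative.
# Hypothetical sketch of invoking Mixtral 8x7B through AWS Bedrock, reusing the 'bedrock'
# client and 'json' import from the previous sketch. Field names follow the Bedrock Mistral schema.
def invoke_mixtral(prompt_text: str) -> str:
    body = json.dumps({
        "prompt": prompt_text,   # the [INST]-wrapped prompt shown above
        "max_tokens": 512,
        "temperature": 0.0,
    })
    response = bedrock.invoke_model(modelId="mistral.mixtral-8x7b-instruct-v0:1", body=body)
    return json.loads(response["body"].read())["outputs"][0]["text"]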

Appendix A.6. Prompt Utilized for LLM-Based Accuracy and Completeness

The following prompt was used for LLM-based accuracy and completeness as part of our evaluation framework:
      prompt=[
      {"role": "user", "content":
 
          f"""You are provided with a
          question, the ground-truth answer, and the LLM-generated answer.
          Evaluate the quality of the LLM-generated answer based on its
          alignment with the given ground-truth, without considering the
          factual correctness of the ground-truth.
 
          Pay attention to the names, places, dates... because they have to
          be fully based in the ground-truth.
 
          Evaluation Criteria:
 
          Accuracy: Does the LLM-generated answer correctly reflect the
          facts presented in the ground-truth?
 
          Completeness: Does the LLM-generated answer cover all key points
          mentioned in the ground-truth?
 
          Scoring Scale for Accuracy and Completeness:
          0.0: Completely incorrect or irrelevant answer.
          0.25: Partially correct but largely incomplete or irrelevant.
          0.5: Somewhat correct, covering some key points but missing others;
          may include irrelevant information.
          0.75: Mostly correct, covering most key points with minor
          omissions or irrelevant details.
          1.0: Completely correct, covering all key points accurately and
          concisely.
 
          Question:
          {question}
 
          Ground Truth Answer:
          {ground_truth}
 
          LLM-Generated Answer:
          {answer}
 
          Provide your evaluation as:
          Accuracy: <score>
          Completeness: <score>
 
          """}
      ]
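The evaluator replies in free text ending with "Accuracy: &lt;score&gt;" and "Completeness: &lt;score&gt;", so the two scores must be extracted before averaging them per context. The following parsing sketch is an assumption about how that extraction might be done, not the study’s exact code.
# Hypothetical extraction of the two scores from the evaluator's reply.
import re

def parse_acc_cm(evaluation_text: str) -> tuple[float, float]:
    acc = re.search(r"Accuracy:\s*([01](?:\.\d+)?)", evaluation_text)
    cm = re.search(r"Completeness:\s*([01](?:\.\d+)?)", evaluation_text)
    if acc is None or cm is None:
        raise ValueError("Evaluator reply did not contain both scores")
    return float(acc.group(1)), float(cm.group(1))

# Example: parse_acc_cm("Accuracy: 0.75\nCompleteness: 1.0") returns (0.75, 1.0).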

Appendix A.7. Prompt Utilized for LLM Answer Validation

The following prompt was used for LLM Answer Validation:
      prompt=[
      {"role": "user", "content": f"""You are provided with a question,
          the ground-truth answer, and the LLM-generated answer.
          Evaluate the quality of the LLM-generated answer based on its
          alignment with the given ground-truth, without considering the
          factual correctness of the ground-truth.
          Pay attention to the names, places, dates... because they have to
          be fully based in the ground-truth.
          Use the following evaluation criteria and scoring scale.
 
          Evaluation Criteria:
          LAV: Is the LLM-generated answer valid based on its alignment with
          the ground-truth and the question?
 
          Scoring Scale for LAV:
          1: if valid
          0: otherwise
 
          Question:
          {question}
          Ground Truth Answer:
          {ground_truth}
          LLM-Generated Answer:
          {answer}
 
          Provide your evaluation as:
          LAV: <score>
          """}
      ]
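Since LAV is a binary verdict per question, the LAV precision reported in the result tables is simply the percentage of answers judged valid for a given context and model. A minimal aggregation sketch follows; the helper name is hypothetical.
# Hypothetical aggregation of binary LAV verdicts into the LAV precision (%) reported per context.
def lav_precision(lav_scores: list[int]) -> float:
    return 100.0 * sum(lav_scores) / len(lav_scores) if lav_scores else 0.0

# Example: 19 valid answers out of 20 questions gives a LAV precision of 95.0.
print(lav_precision([1] * 19 + [0]))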

Appendix B. DKD Details

The following table shows the details of the fabricated/altered contexts, organized by theme. For each context, the name, topic, total number of characters, and total number of tokens are shown.
Table A1. DKD details for contexts.
ID | Name | Topic | Characters | Tokens
1 | Agatha Christie | Literature | 7375 | 1605
2 | Al Capone | History | 8002 | 1817
3 | Artificial Intelligence | Science/Tech | 6726 | 1220
4 | Christopher Columbus | History | 7975 | 1747
5 | Cryptography RSA | Science/Tech | 7895 | 1585
6 | Deep Learning | Science/Tech | 7348 | 1392
7 | DeepL Translator | Science/Tech | 5780 | 1194
8 | Doctor Who | Movies/Television | 8113 | 1698
9 | Elon Musk | Science/Tech | 7983 | 1696
10 | Fahrenheit 451 | Literature | 6963 | 1469
11 | FIFA World Cup | Sports | 8014 | 1635
12 | French Revolution | History | 7520 | 1544
13 | Friends | Movies/Television | 7899 | 1740
14 | Gabriel García Márquez | Literature | 7867 | 1779
15 | Google PageRank Algorithm | Science/Tech | 7665 | 1339
16 | Gravity | Maths/Physics | 7715 | 1538
17 | Greek Mythology | History | 6867 | 1375
18 | Harry Potter | Literature | 7888 | 1715
19 | Hogwarts | Fiction | 7839 | 1583
20 | Isaac Asimov | Literature | 7983 | 1790
21 | James Bond | Movies/Television | 7233 | 1607
22 | Jane Austen | Literature | 7913 | 1720
23 | Jules Verne | Literature | 7358 | 1738
24 | J.R.R. Tolkien | Literature | 7964 | 1774
25 | Llama (animal) | Other | 8000 | 1887
26 | Machine Learning | Science/Tech | 6257 | 1139
27 | Madrid Football Club | Sports | 7991 | 1844
28 | Mad Max | Movies/Television | 7997 | 1714
29 | Mars (planet) | Maths/Physics | 6855 | 1562
30 | Michael Jackson | Movies/Television | 8006 | 1821
31 | Miguel Delibes | Literature | 7910 | 1861
32 | Nicolas Flamel | History | 7336 | 1783
33 | Philip K. Dick | Literature | 7917 | 1750
34 | Problem P versus NP | Science/Tech | 7919 | 1751
35 | Pyramids of Guimar | History | 8025 | 1735
36 | Pythagoras | Maths/Physics | 7946 | 1840
37 | Rebecca Yarros | Literature | 4231 | 1117
38 | Russian Revolution | History | 7157 | 1443
39 | Sherlock Holmes | Literature | 7868 | 1616
40 | Spider Man | Fiction | 8022 | 1618
41 | Star Trek | Movies/Television | 7380 | 1500
42 | Stephen King | Literature | 8033 | 1761
43 | Telenovela | Literature | 7326 | 1673
44 | Theory of Relativity | Maths/Physics | 8018 | 1587
45 | The Lord of the Rings | Literature | 7980 | 1909
46 | Thomas Mann | Literature | 7628 | 1668
47 | Thomas More | History | 7832 | 1721
48 | Tony Hawk | Sports | 7990 | 1840
49 | Traditionalism | History | 7595 | 1610
50 | Winnie the Pooh | Literature | 7508 | 1668
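The character counts above follow directly from the context strings, while the token counts depend on the tokenizer. As an illustration only, counts of this kind can be reproduced with a tokenizer such as tiktoken; the encoding name below is an assumption, not necessarily the one used to build the table.
# Hypothetical sketch of computing the character and token counts reported in Table A1.
import tiktoken

def context_stats(context: str) -> tuple[int, int]:
    encoding = tiktoken.get_encoding("cl100k_base")  # assumed encoding for illustration
    return len(context), len(encoding.encode(context))

# Example: characters, tokens = context_stats(divergent_context_text)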

Appendix C. Complete Results

The following tables show the complete empirical results in percentages (%) for the 5 metrics analyzed (LAV precision and the averages for Answer Semantic Similarity (ASS), Answer Correctness (AC), and the LLM-based accuracy (Acc) and completeness (Cm)). Table A2 shows the results when using GPT-4 Turbo as evaluator model, and Table A3 the results with Claude 3 Haiku as evaluator.
Table A2. Complete empirical results in percentages (%) for the four models analyzed with GPT-4 Turbo as evaluator model. The last row shows the mean ( μ ) per metric and model evaluated.
Columns: ID, followed by LAV, Acc, Cm, ASS, and AC; each metric is reported for GPT-3.5 Turbo (3.5), GPT-4o (4o), Llama 3 (Ll3), and Mixtral 8x7B (Mx).
11001001008097.598.898.87588.89562.567.594.598.28690.170.287.948.467
295.210010090.59496.496.486.991.796.464.386.99698.584.594.278.384.641.478.2
39585656587.58067.556.287.577.557.556.29393.187.988.470.456.248.652.2
4959595959592.592.59090.593.866.886.295.397.686.889.171.980.642.959.7
510010090.58198.896.495.28192.995.264.38194.597.588.493.574.783.961.374.2
610010010010096.496.49490.598.892.964.377.496.997.685.19179.582.842.163.1
7809580808587.5757578.89053.863.795.998.186.792.172.38249.560.9
810010095.266.795.295.291.766.785.791.754.851.294.297.787.289.565.687.250.443.2
990.510010085.791.792.998.88188.19464.378.695.89787.590.876.78153.167.4
109090957586.2858571.29081.261.372.595.196.787.792.271.973.749.155.2
118190.595.28183.388.191.777.492.997.654.889.395.498.48594.465.885.746.271.8
1210095909598.893.888.89597.593.863.788.899.498.688.195.390.588.460.882
13100100959092.5100909092.598.86588.897.19785.591.69289.84774.7
14951001007092.596.29571.29598.862.572.594.997.787.790.369.186.952.363.5
1510010090709582.582.573.895906571.297.797.19092.378.979.961.760.2
16958590809085807586.277.55572.595.395.189.592.76970.760.567.5
179595859087.591.281.287.582.582.56576.291.891.592.589.366.46870.960.8
1895.210010090.591.798.897.686.991.798.860.78196.298.788.792.687.791.150.563.2
1995.295.295.290.591.79486.982.191.792.963.176.295.897.689.393.37683.963.475.3
2090100100859593.888.886.2929056.277.595.497.384.793.96877.254.171.3
21100100958097.598.891.281.296.291.253.87596.49685.690.880.374.646.254.2
221001009010093.898.888.891.29591.261.387.596.496.686.993.273.381.645.771.1
23859585809093.88581.28592.557.58096.497.685.292.270.372.150.958.4
241001001008010098.897.5809598.871.286.296.199.186.394.582.990.662.677.1
2595100907591.296.283.876.292.596.266.273.895.596.486.192.370.885.449.754.6
26959510090959597.588.887.598.871.28593.695.59091.572.582.25575
279090959087.588.89083.891.282.558.877.597.79684.994.285.98351.276.1
2895.210085.78191.789.385.782.191.786.953.68195.495.490.491.976.673.751.862.7
2995100859091.296.283.888.887.593.870859596.388.293.374.275.767.381.9
3090100100759098.893.872.587.593.86566.295.597.385.19178.184.361.961
3195100909097.596.288.893.896.296.262.581.295.197.884.688.778.387.645.464
329590909091.288.883.887.59092.57083.896.29886.690.979.887.549.262.8
331009590859593.887.582.593.888.863.777.594.395.385.991.475.664.832.965.6
348595907582.591.278.877.583.88053.872.596.796.487.893.376.977.537.670.4
359510010010092.593.897.59593.89071.286.297.196.690.493.883.785.760.974.3
3695100907593.89587.57588.896.263.773.894.197.386.89282.384.859.857.2
3794.794.794.710093.494.793.496.189.597.459.276.394.398.785.889.781.786.444.563.9
3885.781818184.585.782.185.785.786.959.585.7989887.195.1827965.876.4
399595958596.292.591.282.596.292.571.277.596.297.188.691.877.479.466.166.6
4085100908081.298.886.28076.297.561.367.594.497.387.490.372.58045.254.7
41801001008080909086.278.89058.888.894.998.487.294.372.581.338.568.1
4210010085909597.583.886.291.292.563.78095.696.389.592.275.271.452.962.3
431001001007592.597.591.273.892.588.86558.895.595.890.790.86673.353.548.9
4490.576.27071.484.573.868.864.382.16953.860.795.496.587.893.271.973.949.560.8
45100100100909591.292.5809083.862.578.89795.587.893.27976.251.762.8
46901001009591.292.59086.29091.263.783.895.497.487.893.47781.350.174.8
4795100959096.298.893.887.590100659094.19786.3937291.346.668.9
4895.210095.210096.492.991.796.495.291.75679.897.298.883.789.375.483.53963.4
491001001009098.893.892.59597.595758597.898.290.594.886.184.174.571.6
5095100959592.596.292.587.596.298.861.386.295.997.185.993.58087.551.780
μ 94.296.692.884.892.193.188.882.490.391.462.377.795.797.187.392.176.180.852.266
Table A3. Complete empirical results in percentages (%) for the four models analyzed with Claude 3 Haiku as evaluator model. The last row shows the mean ( μ ) per metric and model evaluated.
Columns: ID, followed by LAV, Acc, Cm, ASS, and AC; each metric is reported for GPT-3.5 Turbo (3.5), GPT-4o (4o), Llama 3 (Ll3), and Mixtral 8x7B (Mx).
11001001008510010097.582.598.810093.881.294.598.28690.193.898.188.680.3
295.210095.290.592.998.896.490.592.998.89490.59698.584.594.289.591.28384.6
39595759086.283.871.277.588.887.568.878.89393.187.988.483.679.372.570.2
41009510010097.59596.296.296.29597.59595.397.686.889.187.993.57782.7
510010010010097.698.897.69497.698.89497.694.597.588.493.58390.983.375.8
610010010010097.698.895.296.497.697.692.995.296.997.685.19188.99281.388.1
785100959082.596.286.28082.597.581.283.895.998.186.792.182.786.574.674.4
810010095.271.496.497.69472.696.496.488.171.494.297.787.289.592.492.383.868
910010010090.595.297.698.890.595.296.495.292.995.89787.590.890.789.593.886.4
10909095859091.291.282.588.891.288.88595.196.787.792.2858479.469
118185.790.58186.989.390.585.788.189.385.789.395.798.48594.483.189.286.880.9
121009590951009588.89598.89587.59599.498.688.195.396.794.58793.8
1395908590959082.588.895908088.897.19785.591.687.985.577.684.7
14951001008598.810098.89010010097.591.294.997.787.790.382.486.993.277.5
159095909587.593.883.89088.893.88086.297.797.19092.390.887.68075.1
169095959088.891.288.888.887.591.287.587.595.395.189.592.783.582.686.177.3
171001009510092.593.89092.586.293.886.288.891.891.592.589.383.879.685.678.6
1895.210010095.29498.897.692.99498.89492.996.298.788.792.684.895.586.982.8
1995.210095.290.590.595.290.586.991.79485.789.395.897.689.393.382.590.976.681.5
2090951009092.593.895909093.891.291.295.497.384.793.989.893.279.990.7
21100100909097.510091.291.297.598.887.591.296.49685.690.890.691.483.877.4
229595959590959092.59091.288.893.896.496.686.993.282.48474.988.1
2390100859091.296.28588.892.598.88588.896.497.685.292.281.286.16979.7
241001001009010010097.59010098.896.292.596.199.186.394.584.290.575.882.5
25100100959597.598.891.286.297.598.888.888.895.596.486.192.37987.575.373.6
2695100951009597.593.897.593.897.593.898.893.695.59091.586.385.587.979.1
27958090809588.891.286.297.588.89088.897.79684.994.290.38988.186.6
2895.295.290.595.291.79489.386.995.292.984.589.395.495.490.491.982.785.178.980.5
2995959010091.293.888.892.588.893.886.291.29596.388.293.390.687.182.984.4
309095959087.59588.888.88593.887.588.895.597.385.19186.494.79382.7
311001009010096.298.89096.297.597.588.896.295.197.884.688.787.991.772.487.4
329595909593.893.886.291.292.510086.291.296.29886.690.991.387.587.285.6
33100100908597.596.287.586.297.596.286.287.594.395.385.991.487.879.487.782.9
34909590858587.58078.887.587.577.577.596.796.487.893.383.480.475.785.2
3510010010010097.596.210097.597.597.597.596.297.196.690.493.886.585.780.784.5
36959590859593.887.583.893.893.88583.894.197.386.8928186.479.873
3794.710094.710094.798.794.798.794.710094.798.794.398.785.889.779.489.774.185.6
3890.595.285.795.288.191.786.990.590.591.785.790.5989887.195.185.788.67688
399595959098.893.89587.598.89596.29096.297.188.691.890.68884.285.9
408090758081.29078.88081.29076.281.294.497.387.490.381.684.985.774.2
4195100959592.510091.29591.298.888.897.594.998.487.294.381.289.576.875.4
42100100859597.598.885959597.583.892.595.696.389.592.281.389.178.682.4
43951009585959591.276.296.296.288.87595.595.890.790.880.477.570.374.8
4495.295.271.410088.188.173.890.588.186.972.691.795.496.587.493.286.182.167.975.2
4510095959098.897.593.887.592.59583.8909795.587.893.285.985.583.876.7
46951001009592.596.29588.891.296.288.886.295.497.487.893.493.991.787.581.6
479095908591.2959088.891.2959091.294.19786.3938890.481.681.6
4810010095.210010097.69497.698.897.690.596.497.298.883.789.384.994.87682.8
4910010010010010097.597.597.598.897.59598.897.898.290.594.892.592.490.986.5
50100100959597.598.891.292.598.898.888.891.295.997.185.993.590.191.285.983.8
μ 95.296.992.991.993.695.390.589.393.395.288.189.795.797.187.392.186.388.281.481

Appendix D. Cost Estimation for the Use of LLMs

This appendix provides a detailed cost estimation for the use of various large language models throughout the stages of the CDK-E methodology. The models used include GPT-3.5 Turbo, GPT-4 Turbo, and GPT-4o via OpenAI, and Llama 3, Mixtral 8x7B, and Claude 3 Haiku via AWS Bedrock.
The cost estimation involves calculating the expenses based on the usage metrics provided by the respective platforms (OpenAI and AWS Bedrock) for each model. The primary factors considered for cost estimation are as follows:
  • Token Usage: The number of tokens processed by each model.
  • Token Cost: The cost associated with the processed tokens.
The following table shows the token cost breakdown per 1000 tokens, the usage cost per model, and the total cost of the study.
Table A4. Cost breakdown for each model, indicating the model version. For AWS, the selected region is eu-west-2 (Europe, London).
Model Name | Model Version | Price per 1000 Input Tokens ($) | Price per 1000 Output Tokens ($) | Usage Cost ($)
GPT-3.5 Turbo | gpt-3.5-turbo-0125 | 0.0005 | 0.0015 | 12
GPT-4-Turbo | gpt-4-turbo-2024-04-09 | 0.01 | 0.03 | 16.9
GPT-4o | gpt-4o-2024-05-13 | 0.005 | 0.015 | 12.95
Llama 3 (70B) | meta.llama3-70b-instruct-v1:0 | 0.00345 | 0.00455 | 16.36
Mixtral 8x7b | mistral.mixtral-8x7b-instruct-v0:1 | 0.00059 | 0.00091 | 1.65
Claude 3 Haiku | anthropic.claude-3-haiku-20240307-v1:0 | 0.00025 | 0.00125 | 7.32
Total Cost |  |  |  | 67.18
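As a worked example of how the usage costs in Table A4 are obtained, the cost of a run is the number of input and output tokens divided by 1000 and multiplied by the corresponding prices. The token counts below are illustrative, not the study’s actual usage figures.
# Cost = input_tokens / 1000 * input_price + output_tokens / 1000 * output_price.
def run_cost(input_tokens: int, output_tokens: int,
             price_in_per_1k: float, price_out_per_1k: float) -> float:
    return input_tokens / 1000 * price_in_per_1k + output_tokens / 1000 * price_out_per_1k

# GPT-4o prices from Table A4: $0.005 per 1000 input tokens and $0.015 per 1000 output tokens.
print(round(run_cost(2_000_000, 200_000, 0.005, 0.015), 2))  # 13.0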

References

  1. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2019; pp. 4171–4186. [Google Scholar]
  2. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  3. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  4. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
  5. Anil, R.; Borgeaud, S.; Wu, Y.; Alayrac, J.B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K.; et al. Gemini: A family of highly capable multimodal models. arXiv 2023, arXiv:2312.11805. [Google Scholar]
  6. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; Casas, D.d.l.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar]
  7. Abdin, M.; Jacobs, S.A.; Awan, A.A.; Aneja, J.; Awadallah, A.; Awadalla, H.; Bach, N.; Bahree, A.; Bakhtiari, A.; Behl, H.; et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv 2024, arXiv:2404.14219. [Google Scholar]
  8. Subagja, A.D.; Ausat, A.M.A.; Sari, A.R.; Wanof, M.I.; Suherlan, S. Improving customer service quality in MSMEs through the use of ChatGPT. J. Minfo Polgan 2023, 12, 380–386. [Google Scholar] [CrossRef]
  9. Bao, K.; Zhang, J.; Zhang, Y.; Wang, W.; Feng, F.; He, X. Tallrec: An effective and efficient tuning framework to align large language model with recommendation. In Proceedings of the 17th ACM Conference on Recommender Systems, Singapore, 18–22 September 2023; pp. 1007–1014. [Google Scholar]
  10. Zhang, B.; Haddow, B.; Birch, A. Prompting large language model for machine translation: A case study. In Proceedings of the International Conference on Machine Learning. PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 41092–41110. [Google Scholar]
  11. Zhang, T.; Ladhak, F.; Durmus, E.; Liang, P.; McKeown, K.; Hashimoto, T.B. Benchmarking large language models for news summarization. Trans. Assoc. Comput. Linguist. 2024, 12, 39–57. [Google Scholar] [CrossRef]
  12. Ostendorff, M.; Rethmeier, N.; Augenstein, I.; Gipp, B.; Rehm, G. Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 11670–11688. [Google Scholar]
  13. Piñeiro-Martín, A.; García-Mateo, C.; Docío-Fernández, L.; López-Pérez, M.d.C. Ethical Challenges in the Development of Virtual Assistants Powered by Large Language Models. Electronics 2023, 12, 3170. [Google Scholar] [CrossRef]
  14. Singh, C.; Inala, J.P.; Galley, M.; Caruana, R.; Gao, J. Rethinking Interpretability in the Era of Large Language Models. arXiv 2024, arXiv:2402.01761. [Google Scholar]
  15. Weidinger, L.; Mellor, J.; Rauh, M.; Griffin, C.; Uesato, J.; Huang, P.S.; Cheng, M.; Glaese, M.; Balle, B.; Kasirzadeh, A.; et al. Ethical and social risks of harm from language models. arXiv 2021, arXiv:2112.04359. [Google Scholar]
  16. Ferrara, E. Should ChatGPT be biased? Challenges and risks of bias in large language models. arXiv 2023, arXiv:2304.03738. [Google Scholar]
  17. Abid, A.; Farooqi, M.; Zou, J. Persistent anti-muslim bias in large language models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, Virtual, 19–21 May 2021; pp. 298–306. [Google Scholar]
  18. Abid, A.; Farooqi, M.; Zou, J. Large language models associate Muslims with violence. Nat. Mach. Intell. 2021, 3, 461–463. [Google Scholar] [CrossRef]
  19. Kaddour, J.; Harris, J.; Mozes, M.; Bradley, H.; Raileanu, R.; McHardy, R. Challenges and applications of large language models. arXiv 2023, arXiv:2307.10169. [Google Scholar]
  20. Harrer, S. Attention is not all you need: The complicated case of ethically using large language models in healthcare and medicine. EBioMedicine 2023, 90, 104512. [Google Scholar] [CrossRef] [PubMed]
  21. Perlman, A.M. The implications of ChatGPT for legal services and society. SSRN 2022, 4294197. [Google Scholar] [CrossRef]
  22. Sok, S.; Heng, K. ChatGPT for education and research: A review of benefits and risks. SSRN 2023, 4378735. [Google Scholar]
  23. Fu, D.; Li, X.; Wen, L.; Dou, M.; Cai, P.; Shi, B.; Qiao, Y. Drive like a human: Rethinking autonomous driving with large language models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 910–919. [Google Scholar]
  24. Du, M.; Liu, N.; Hu, X. Techniques for interpretable machine learning. Commun. ACM 2019, 63, 68–77. [Google Scholar] [CrossRef]
  25. Arrieta, A.B.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; García, S.; Gil-López, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115. [Google Scholar] [CrossRef]
  26. Došilović, F.K.; Brčić, M.; Hlupić, N. Explainable artificial intelligence: A survey. In Proceedings of the 2018 41st International convention on information and communication technology, electronics and microelectronics (MIPRO), Opatija, Croatia, 21–25 May 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 0210–0215. [Google Scholar]
  27. Shuster, K.; Poff, S.; Chen, M.; Kiela, D.; Weston, J. Retrieval Augmentation Reduces Hallucination in Conversation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 16–20 November 2021; pp. 3784–3803. [Google Scholar]
  28. Hadi, M.U.; Qureshi, R.; Shah, A.; Irfan, M.; Zafar, A.; Shaikh, M.B.; Akhtar, N.; Wu, J.; Mirjalili, S.; Shah, M. Large language models: A comprehensive survey of its applications, challenges, limitations, and future prospects. Authorea Preprints 2023. [Google Scholar]
  29. Mungoli, N. Exploring the Synergy of Prompt Engineering and Reinforcement Learning for Enhanced Control and Responsiveness in Chat GPT. J. Electr. Electron. Eng. 2023, 2, 201–205. [Google Scholar]
  30. Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large language models are zero-shot reasoners. Adv. Neural Inf. Process. Syst. 2022, 35, 22199–22213. [Google Scholar]
  31. Min, S.; Lyu, X.; Holtzman, A.; Artetxe, M.; Lewis, M.; Hajishirzi, H.; Zettlemoyer, L. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 11048–11064. [Google Scholar]
  32. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
  33. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
  34. Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023; pp. 190–213. [Google Scholar]
  35. Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv 2022, arXiv:2203.11171. [Google Scholar]
  36. Diao, S.; Wang, P.; Lin, Y.; Zhang, T. Active prompting with chain-of-thought for large language models. arXiv 2023, arXiv:2302.12246. [Google Scholar]
  37. Li, Z.; Peng, B.; He, P.; Galley, M.; Gao, J.; Yan, X. Guiding large language models via directional stimulus prompting. Adv. Neural Inf. Process. Syst. 2024, 36, 62630–62656. [Google Scholar]
  38. Shinn, N.; Cassano, F.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language agents with verbal reinforcement learning. Adv. Neural Inf. Process. Syst. 2024, 36, 8634–8652. [Google Scholar]
  39. Liu, Z.; Yu, X.; Fang, Y.; Zhang, X. Graphprompt: Unifying pre-training and downstream tasks for graph neural networks. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 417–428. [Google Scholar]
  40. Zhang, Z.; Zhang, A.; Li, M.; Zhao, H.; Karypis, G.; Smola, A. Multimodal chain-of-thought reasoning in language models. arXiv 2023, arXiv:2302.00923. [Google Scholar]
  41. Shi, F.; Chen, X.; Misra, K.; Scales, N.; Dohan, D.; Chi, E.H.; Schärli, N.; Zhou, D. Large language models can be easily distracted by irrelevant context. In Proceedings of the International Conference on Machine Learning. PMLR, Edmonton, AB, Canada, 30 June–3 July 2023; pp. 31210–31227. [Google Scholar]
  42. Yoo, K.M.; Kim, J.; Kim, H.J.; Cho, H.; Jo, H.; Lee, S.W.; Lee, S.g.; Kim, T. Ground-Truth Labels Matter: A Deeper Look into Input-Label Demonstrations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 2422–2437. [Google Scholar]
  43. Zhao, H.; Yang, F.; Lakkaraju, H.; Du, M. Opening the black box of large language models: Two views on holistic interpretability. arXiv 2024, arXiv:2402.10688. [Google Scholar]
  44. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11. [Google Scholar]
  45. Zhao, H.; Chen, H.; Yang, F.; Liu, N.; Deng, H.; Cai, H.; Wang, S.; Yin, D.; Du, M. Explainability for large language models: A survey. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–38. [Google Scholar] [CrossRef]
  46. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should i trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
  47. Chen, H.; Covert, I.C.; Lundberg, S.M.; Lee, S.I. Algorithms to estimate Shapley value feature attributions. Nat. Mach. Intell. 2023, 5, 590–601. [Google Scholar] [CrossRef]
  48. Al-Najjar, H.A.; Pradhan, B.; Beydoun, G.; Sarkar, R.; Park, H.J.; Alamri, A. A novel method using explainable artificial intelligence (XAI)-based Shapley Additive Explanations for spatial landslide prediction using Time-Series SAR dataset. Gondwana Res. 2023, 123, 107–124. [Google Scholar] [CrossRef]
  49. Ross, A.; Marasović, A.; Peters, M.E. Explaining NLP Models via Minimal Contrastive Editing (MiCE). In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; pp. 3840–3852. [Google Scholar]
  50. Wu, X.; Zhao, H.; Zhu, Y.; Shi, Y.; Yang, F.; Liu, T.; Zhai, X.; Yao, W.; Li, J.; Du, M.; et al. Usable XAI: 10 strategies towards exploiting explainability in the LLM era. arXiv 2024, arXiv:2403.08946. [Google Scholar]
  51. Wei Jie, Y.; Satapathy, R.; Goh, R.; Cambria, E. How Interpretable are Reasoning Explanations from Prompting Large Language Models? In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, 16–21 June 2024; Duh, K., Gomez, H., Bethard, S., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2024; pp. 2148–2164. [Google Scholar] [CrossRef]
  52. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 2023, 55, 1–35. [Google Scholar] [CrossRef]
  53. Chen, J.; Lin, H.; Han, X.; Sun, L. Benchmarking large language models in retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 17754–17762. [Google Scholar]
  54. Ram, O.; Levine, Y.; Dalmedigos, I.; Muhlgay, D.; Shashua, A.; Leyton-Brown, K.; Shoham, Y. In-context retrieval-augmented language models. Trans. Assoc. Comput. Linguist. 2023, 11, 1316–1331. [Google Scholar] [CrossRef]
  55. Es, S.; James, J.; Espinosa Anke, L.; Schockaert, S. RAGAs: Automated Evaluation of Retrieval Augmented Generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, St. Julian’s, Malta, 21–22 March 2024; Aletras, N., De Clercq, O., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2024; pp. 150–158. [Google Scholar]
  56. OpenAI. GPT-4o: A Model That Can Reason Across Audio, Vision, and Text in Real Time. 2024. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 31 October 2024).
  57. Meta. Introducing Meta Llama 3: The Most Capable Openly Available LLM to Date. Meta, 18 April 2024. Available online: https://ai.meta.com/blog/meta-llama-3/ (accessed on 23 January 2025).
  58. Mistral AI. Mixtral of Experts: A High Quality Sparse Mixture-of-Experts. 2023. Available online: https://mistral.ai/news/mixtral-8x22b/ (accessed on 23 January 2025).
  59. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  60. Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku. 2023. Available online: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf (accessed on 31 October 2024).
  61. Industrial PhD information by Galician Innovation Agency. Available online: https://www.xunta.gal/dog/Publicados/2021/20211005/AnuncioG0596-270921-0001_gl.html (accessed on 23 January 2025).
Figure 1. Diagram of the key components and process flow of the CDK-E methodology. Note that the evaluated LLMs and the evaluator LLMs are distinct models.
Figure 2. Diagram of the steps followed in the Divergent Knowledge Dataset (DKD) fabrication methodology.
Figure 3. Excerpts from divergent contexts and questions from DKD. In the first example, the date and place of birth of Isaac Newton have been modified, as well as the narrative of his early years. For the second example, the date of the end and outcome of World War II have been modified.
Figure 4. Correlation between weighted LLM-based metrics (LAV, accuracy, and completeness) and Answer Semantic Similarity (ASS) across different models. Each point represents an individual evaluation of a context on the Divergent Knowledge Dataset, with colors indicating different models (Mixtral-8x7B, GPT-3.5, GPT-4o, and Llama 3). A weighted combination (0.5 for LAV, 0.25 for accuracy, and 0.25 for completeness) was used to highlight alignment between subjective model-based metrics and the objective similarity measure (ASS).
Figure 5. Comparison of LAV precision and Acc with Mixtral 8x7B for the 50 contexts of DKD.
Table 1. LLM-score-based evaluation criteria for accuracy and completeness.
Score | Description
0.0 | Completely incorrect or irrelevant answer.
0.25 | Partially correct answer, but largely incomplete or includes irrelevant information.
0.5 | Somewhat correct answer that covers some key points but may miss others or include irrelevant details.
0.75 | Mostly correct answer covering most key points with minor omissions or irrelevant details.
1.0 | Completely correct answer covering all key points accurately and concisely.
Table 2. Empirical results in percentages (%) for the CDK-E methodology, using GPT-4 Turbo ( γ ) and Claude 3 Haiku ( κ ) as evaluator LLMs. It shows the LAV precision and the averages for Answer Correctness (AC), accuracy (Acc), and completeness (Cm) metrics for each evaluated model, with a single value per model for Answer Semantic Similarity (ASS) as it is unaffected by the evaluator.
Model | LAV (γ) | LAV (κ) | Acc (γ) | Acc (κ) | Cm (γ) | Cm (κ) | ASS | AC (γ) | AC (κ)
GPT-3.5 Turbo | 94.2 | 95.2 | 92.1 | 93.6 | 90.3 | 93.3 | 95.7 | 76.1 | 86.3
GPT-4o | 96.6 | 96.9 | 93.1 | 95.3 | 91.4 | 95.2 | 97.1 | 80.8 | 88.2
Llama 3—70B | 92.8 | 92.9 | 88.8 | 90.5 | 62.3 | 88.1 | 87.3 | 52.2 | 81.4
Mixtral 8x7B | 84.8 | 91.9 | 82.4 | 89.3 | 77.7 | 89.7 | 92.1 | 66.0 | 81.0