A Comprehensive Evaluation of Embedding Models and LLMs for IR and QA Across English and Italian
Abstract
1. Introduction
1.1. Technological Context and Challenges
- Dense retrieval methods have superseded traditional term-based approaches, significantly improving IR task performance.
- The effectiveness of these models varies considerably across languages and domains, with performance patterns not yet systematically documented or understood.
- Critical questions remain about the trade-offs between model size, computational efficiency, and multilingual performance.
- The capability of models to maintain consistent performance across both language boundaries and specialized domains requires thorough investigation.
1.2. Research Focus on the English–Italian Language Pair
- Linguistic diversity: Italian represents a morphologically rich Romance language with complex verbal systems and agreement patterns, providing an excellent test case for model robustness compared to English’s relatively more straightforward morphological structure [12].
- Research gap: While English dominates NLP research, Italian, despite being spoken by approximately 68 million people (https://en.wikipedia.org/wiki/Italian_language, accessed on 15 March 2025) worldwide and being a major European language, remains under-represented in large-scale NLP evaluations.
- Industrial relevance: Italy’s significant technological sector and growing AI industry make Italian language support crucial for practical applications. The country’s diverse industrial domains (e.g., manufacturing, healthcare, finance, and tourism) present unique challenges for domain-specific IR and QA systems.
- Cross-family evaluation: The comparison between Germanic (English) and Romance (Italian) language families offers insights into the cross-linguistic transfer capabilities of modern language models.
1.3. Research Questions
- Embedding effectiveness: How do state-of-the-art embedding techniques perform across English and Italian IR tasks, and what factors influence their cross-lingual effectiveness?
- LLM impact: What are the quantitative and qualitative effects of integrating LLMs into RAG pipelines for multilingual QA tasks, particularly regarding answer accuracy and factuality?
- Cross-domain and cross-language generalization: To what extent do current models maintain performance across domains and languages in zero-shot scenarios, and what patterns emerge in their generalization capabilities?
- Evaluation methodology: How can we effectively assess multilingual IR and QA systems, and what complementary insights do traditional and LLM-based metrics provide?
1.4. Contributions
- Comprehensive performance analysis: Our systematic evaluation encompasses 12 embedding models and 4 LLMs across 7 diverse datasets, utilizing 11 distinct evaluation metrics. Our analysis distinguishes itself from previous studies like BEIR [13] and MTEB [14] in three fundamental dimensions. First, we employ a multifaceted evaluation approach that combines traditional performance metrics with reference-free LLM-based assessments, providing a more holistic view of model capabilities. Second, we evaluate groundedness and answer relevance dimensions, often overlooked in standard benchmarks, addressing critical concerns in modern retrieval-augmented systems. Third, while most existing evaluations focus primarily on high-resource languages, our study explicitly examines the English–Italian language pair. Thus, we offer valuable insights into model performance for Italian, an important European language that remains underexplored. Our methodologically diverse approach provides a practical understanding of model behavior across languages and domains, complementing existing benchmark studies.
- Cross-lingual insights: An in-depth investigation of English–Italian language pair dynamics, offering valuable insights into the challenges and opportunities in bridging high-resource and lower-resource European languages.
- Evaluation framework: Development and application of a comprehensive evaluation methodology that combines traditional IR metrics with LLM-based assessments, enabling a more nuanced understanding of model performance across languages and domains.
- RAG pipeline insights: We offer detailed insights into the effectiveness of integrating LLMs into RAG pipelines for QA tasks, highlighting both the potential and limitations of this approach.
- Practical implications: Our findings provide valuable guidance for practitioners in selecting appropriate models and techniques for specific IR and QA applications, considering factors such as language, domain, and computational resources.
1.5. Paper Organization
2. Related Work
- Evolution of IR and QA systems: Recent surveys and benchmark frameworks that have shaped our understanding of modern IR and QA systems.
- Embedding models for Information Retrieval: Specialized embedding models designed for IR tasks.
- LLM Integration in question answering: The transformation of QA systems through large language models.
- RAG architecture: The development of retrieval-augmented generation (RAG) systems.
- Evaluation methodologies: Assessment metrics and methodologies for modern IR and QA systems.
2.1. Evolution of IR and QA Systems
2.2. Embedding Models for Information Retrieval
- Comprehensive cross-lingual evaluation, particularly for morphologically rich languages like Italian.
- Systematic assessment of domain adaptation capabilities across languages.
- Comparative analysis of language-specific vs. multilingual models.
2.3. LLM Integration in Question Answering
2.4. RAG Architecture
- Advanced document-splitting mechanisms that preserve semantic coherence.
- Intelligent chunking strategies that optimize information density.
- Sophisticated retrieval mechanisms leveraging state-of-the-art embedding models.
- Integration with powerful language generation models.
- Systematic assessment of cross-lingual performance with focused attention on English–Italian language pairs.
- Comprehensive evaluation of domain adaptability across various sectors.
- Integration of cutting-edge embedding models and LLMs within RAG pipelines.
2.5. Evaluation Methodologies
- Cross-lingual effectiveness across language boundaries.
- Adaptation capabilities across diverse domains.
- Quality and relevance of generated responses.
- Retrieval precision and efficiency metrics.
2.6. Research Gaps and Our Contributions
- Insufficient cross-lingual evaluation frameworks that can assess performance across diverse language families, particularly for morphologically rich languages like Italian. This gap is especially critical as global deployment increases, yet our understanding of system behavior across different linguistic contexts remains limited.
- Limited understanding of domain adaptation challenges when moving from general to specialized contexts across languages. While effective in general domains, current systems often struggle with specialized fields like healthcare, legal, and technical domains, where terminology and reasoning patterns demand sophisticated adaptation. This complexity increases in multilingual settings, where RAG systems face significant challenges in maintaining consistency and accuracy across language barriers.
- Inadequate evaluation methodologies that can capture both technical performance and practical utility in multilingual settings. Traditional metrics may not adequately reflect real-world reliability across different languages and use cases. Ethical considerations compound this limitation regarding bias, fairness, and representation that require systematic investigation.
- A comprehensive evaluation framework spanning both English and Italian, offering insights into model performance across linguistic boundaries.
- An evaluation methodology combining traditional metrics with LLM-based assessment techniques.
3. Methodology and Evaluation Framework
3.1. Overview of Approach
- Retrieval-augmented generation: We implement a structured RAG pipeline with distinct phases for ingestion, retrieval, generation, and evaluation.
- Dataset selection: We employ diverse datasets spanning general knowledge and specialized domains in both English and Italian to assess cross-lingual and cross-domain capabilities.
- Model evaluation: We systematically evaluate 12 embedding models for IR tasks and 4 LLMs within RAG pipelines for QA tasks.
- Multifaceted assessment: We utilize a comprehensive set of evaluation metrics to capture different aspects of performance across languages and domains.
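To make the retrieval phase concrete, the following sketch embeds a toy corpus and runs a cosine-similarity search. It assumes the sentence-transformers package and the multilingual-e5-large checkpoint; the actual experiments index vectors in Milvus, for which a brute-force in-memory search stands in here.

```python
# Minimal sketch of the ingestion and retrieval phases of the RAG pipeline.
# Assumptions: sentence-transformers loads multilingual-e5-large; the paper
# indexes vectors in Milvus, while a brute-force in-memory search stands in here.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

documents = [
    "Rome is the capital of Italy.",
    "The Colosseum was completed in 80 AD.",
]

# Ingestion: embed and normalize all documents (E5 expects a "passage: " prefix).
doc_vectors = model.encode([f"passage: {d}" for d in documents], normalize_embeddings=True)

def retrieve(query: str, k: int = 10) -> list[tuple[int, float]]:
    """Return the indices and cosine scores of the top-k documents for a query."""
    q = model.encode(f"query: {query}", normalize_embeddings=True)
    scores = doc_vectors @ q          # cosine similarity (vectors are normalized)
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]

print(retrieve("What is the capital of Italy?", k=2))
```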
3.2. RAG Pipeline
3.3. Datasets for Information Retrieval and Question Answering
3.3.1. SQuAD-en
- (i) id: Unique entry identifier.
- (ii) title: Wikipedia article title.
- (iii) context: Source passage containing the answer.
- (iv) answers: Gold-standard answers with context position indices.
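For illustration, these fields can be inspected directly from a public copy of SQuAD v1.1; the HuggingFace `squad` dataset identifier used below is an assumption about the distribution, not necessarily the copy used in our experiments.

```python
# Inspect the fields of a SQuAD-en entry; the HuggingFace "squad" dataset id
# is assumed here as a convenient distribution of SQuAD v1.1.
from datasets import load_dataset

squad = load_dataset("squad", split="validation")
example = squad[0]
print(example.keys())        # id, title, context, question, answers
print(example["answers"])    # gold answers with 'text' and 'answer_start' indices
```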
3.3.2. SQuAD-it
3.3.3. DICE
- (i) id: Unique document identifier.
- (ii) url: Article URL.
- (iii) title: Article title.
- (iv) subtitle: Article subtitle.
- (v) publication date: Article publication date.
- (vi) event date: Date of the reported crime event.
- (vii) newspaper: Source newspaper name.
3.3.4. SciFact
- (i) id: Unique text identifier.
- (ii) title: Scientific article title.
- (iii) text: Article abstract.
3.3.5. ArguAna
- (i) id: Unique argument identifier.
- (ii) title: Argument title.
- (iii) text: Argument content.
3.3.6. NFCorpus
- (i) id: Unique document identifier.
- (ii) title: Document title.
- (iii) text: Document content.
3.3.7. CovidQA
- (i) category: Semantic category.
- (ii) subcategory: Specific subcategory.
- (iii) query: Keyword-based query.
- (iv) question: Natural language question form.
- (i) id: Answer identifier.
- (ii) title: Source document title.
- (iii) answer: Answer text.
3.3.8. NarrativeQA
- (i) document: Source book or movie script.
- (ii) question: Query to be answered.
- (iii) answers: List of valid answers.
3.3.9. NarrativeQA-Cross-Lingual
3.4. Models
3.4.1. Models Used for Information Retrieval
3.4.2. Large Language Models for Question Answering
3.5. Evaluation Metrics
3.5.1. IR Evaluation Metric
- Normalized Discounted Cumulative Gain (NDCG@k) [36]:
- Definition: A ranking quality metric comparing rankings to an ideal order, where the relevant items are at the top.
- Formula: $\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}$, where $\mathrm{DCG@}k$ is the Discounted Cumulative Gain at k and $\mathrm{IDCG@}k$ is the Ideal DCG at k, with k as a chosen cutoff point. $\mathrm{DCG@}k$ measures the total item relevance in a list with a logarithmic discount that addresses the diminishing value of items further down the list.
- Range: Values from 0 to 1, where 1 indicates a perfect match with the ideal order.
- Rationale: NDCG is chosen as our primary IR metric because it effectively captures ranking quality, considering both the relevance and position of retrieved items. It is particularly useful for evaluating systems where the order of the results matters, making it well suited for assessing the performance of our embedding models in retrieval tasks.
- Implementation: Available in PyTorch, TensorFlow, and the BEIR framework.
- Mean Average Precision (MAP@k) [36]:
- Definition: A retrieval quality metric that measures both the relevance of items and the system’s ability to rank the most relevant items higher.
- Formula: $\mathrm{MAP@}k = \frac{1}{|Q|}\sum_{q \in Q}\mathrm{AP@}k(q)$, where $\mathrm{AP@}k(q)$ is the Average Precision at k (a chosen cutoff point) for query q, calculated as the average of the Precision values at all relevant positions within the top k, and $|Q|$ is the total number of queries.
- Range: Values from 0 to 1, where 1 indicates perfect retrieval and ranking.
- Rationale: MAP is valuable because it provides a single score that summarizes precision at various recall levels, emphasizing the importance of retrieving relevant items early in the result list. It effectively captures both precision and recall aspects of retrieval quality.
- Limitations: MAP assumes binary relevance, so each document must be judged as either relevant or irrelevant; graded relevance cannot be expressed.
- Implementation: Available in PyTorch, TensorFlow, and the BEIR framework.
- Recall@k [36]:
- Definition: A retrieval completeness metric for measuring the proportion of relevant documents successfully retrieved within the top k results.
- Formula: $\mathrm{Recall@}k = \frac{TP_k}{TP_k + FN_k}$, where $TP_k$ is the number of true positives (relevant documents) in the top k results, and $FN_k$ is the number of false negatives (relevant documents not retrieved in the top k).
- Range: Values from 0 to 1, where 1 indicates all relevant documents have been retrieved.
- Rationale: Recall@k is valuable for assessing a system’s ability to find all relevant information, highlighting the model’s true positive recognition capability. It provides insight into how thoroughly a system captures the full set of relevant documents, making it particularly important in legal, medical, or research contexts, where missing relevant information could have significant consequences.
- Limitations: Recall@k does not consider the ranking order of retrieved documents within the top k results. The metric is undefined when there are no relevant documents in the test set (the denominator becomes zero). Although it effectively measures completeness, it provides only a partial view of system performance and is often insufficient when used alone, as it does not account for precision or ranking quality.
- Implementation: Available in Scikit-learn, PyTorch, TensorFlow, and the BEIR framework.
- Precision@k [36]:
- Definition: A retrieval accuracy metric for measuring the proportion of retrieved documents in the top k results that are relevant.
- Formula: $\mathrm{Precision@}k = \frac{TP_k}{TP_k + FP_k}$, where $TP_k$ is the number of true positives (relevant documents) in the top k results, and $FP_k$ is the number of false positives (non-relevant documents) in the top k results.
- Range: Values from 0 to 1, where 1 indicates all retrieved documents are relevant.
- Rationale: Precision@k is valuable for assessing a system’s ability to return accurate results, highlighting the model’s false positive rate. It indicates the probability that a retrieved document is truly relevant, making it particularly important in contexts where delivering relevant information is more critical than finding all relevant documents.
- Limitations: Precision@k does not consider the ranking order within the top k results. When there are no relevant documents in the test collection, the metric becomes problematic: if no documents are returned, the denominator becomes zero, and the metric is undefined; if only non-relevant documents are returned, precision is zero. The metric focuses solely on false positive generation rate, providing only a partial view of system performance.
- Implementation: Available in Scikit-learn, PyTorch, TensorFlow, and the BEIR framework.
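For reference, the sketch below implements the four metrics above (NDCG@k, MAP@k, Recall@k, and Precision@k) directly from their definitions; the experiments rely on the library implementations listed under each metric (e.g., the BEIR framework).

```python
# Minimal reference implementations of the IR metrics, written from the
# definitions in this subsection; binary relevance judgments are assumed.
import math

def dcg_at_k(relevances, k):
    """Discounted Cumulative Gain: relevance discounted by log2 of the rank."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG of the produced ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def average_precision_at_k(relevances, k):
    """Mean of the precision values at each relevant rank within the top k."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def map_at_k(relevances_per_query, k):
    """Average of AP@k over all queries."""
    return sum(average_precision_at_k(r, k) for r in relevances_per_query) / len(relevances_per_query)

def recall_at_k(relevances, k, total_relevant):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(relevances[:k]) / total_relevant if total_relevant else 0.0

def precision_at_k(relevances, k):
    """Fraction of the top k retrieved documents that are relevant."""
    return sum(relevances[:k]) / k

# Binary relevance of the retrieved ranking for one query (1 = relevant).
ranking = [1, 0, 1, 0, 0]
print(ndcg_at_k(ranking, 5))                                   # ~0.92
print(map_at_k([ranking], 5))                                  # ~0.83
print(recall_at_k(ranking, 5, total_relevant=2), precision_at_k(ranking, 5))  # 1.0, 0.4
```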
3.5.2. QA Evaluation Metrics
- Reference-based metrics:
- BERTScore [39]: This measures semantic similarity using contextual embeddings. BERTScore is a language generation evaluation metric based on pre-trained BERT contextual embeddings [8]. It computes the similarity of two sentences as a sum of cosine similarities between their tokens’ embeddings, so it can handle cases where two sentences are semantically similar but differ in surface form. It has been used in many papers, such as [64,65], and is commonly applied to question answering, summarization, and translation. It can be implemented using different libraries, including TensorFlow and HuggingFace.
- BEM (BERT-based Evaluation Metric) [66]: This uses a BERT model fine-tuned to assess answer equivalence. It receives a question, a candidate answer, and a reference answer as input and returns a score quantifying the similarity between the candidate and the reference answers. This evaluation method is used in recent papers such as [67,68]. It can be implemented using TensorFlow; the model trained on the answer equivalence task is available on TensorFlow Hub.
- ROUGE [38]: This evaluates n-gram overlap between generated and reference answers. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) evaluates the overlap of n-grams between generated and reference answers. In detail, it is a set of different metrics (ROUGE-1, ROUGE-2, and ROUGE-L) used to evaluate text summarization and machine comprehension systems:
- ROUGE-N: This is defined as an n-gram recall between a predicted text and a ground-truth text: $\mathrm{ROUGE\text{-}N} = \frac{\sum_{\mathrm{gram}_n \in S}\mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_n)}{\sum_{\mathrm{gram}_n \in S}\mathrm{Count}(\mathrm{gram}_n)}$, where $S$ is the ground-truth text and $\mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_n)$ is the maximum number of n-grams of size n co-occurring in the candidate text and the ground-truth text. The denominator is the total number of n-grams occurring in the ground-truth text.
- ROUGE-L: This calculates an F-measure using the Longest Common Subsequence (LCS); the idea is that the longer the LCS of two texts is, the more similar the two summaries are. Given two texts, the ground truth X of length m and the prediction Y of length n, the formal definition is $\mathrm{ROUGE\text{-}L} = F_{lcs} = \frac{(1+\beta^2)\,R_{lcs}\,P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$, where $R_{lcs} = \frac{LCS(X,Y)}{m}$ and $P_{lcs} = \frac{LCS(X,Y)}{n}$.
ROUGE metrics are very popular in Natural Language Processing tasks involving text generation, such as summarization and question answering [69]. The advantage of ROUGE is that it allows us to estimate the quality of a generative model’s output in common NLP tasks without language-specific dependencies. The main disadvantages are that it does not consider word semantics and is sensitive to word choice and sentence structure. ROUGE metrics are implemented in PyTorch, TensorFlow, and HuggingFace.
- F1 Score: The harmonic mean of precision and recall of word overlap. The F1 score is defined as the harmonic mean of precision and recall of the word overlap between generated and reference answers: $F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$. This score summarizes both aspects of a classification problem, precision and recall, in a single value. The F1 score is a very popular metric for evaluating the performance of Artificial Intelligence and Machine Learning systems on classification tasks [70]. In question answering, two popular benchmark datasets that use F1 as one of their evaluation metrics are SQuAD [43] and TriviaQA [71]. The advantages of the F1 score are the following:
- It can handle unbalanced classes well.
- It captures and summarizes both the aspects of Precision and Recall in a single metric.
The main disadvantage is that, taken alone, the F1 score can be harder to interpret. The F1 score can be used in both Information Extraction and question answering settings and is implemented in all the popular machine/deep learning and data analysis libraries, such as Scikit-learn, PyTorch, and TensorFlow. A minimal sketch of these reference-based scores is given at the end of this subsection.
- Reference-free metrics:
- Context relevance: Evaluates retrieved context relevance to the question. It assesses whether the passage returned is relevant for answering the given query. Therefore, this measure is useful for evaluating IR after obtaining the answer.
- Groundedness or faithfulness: Assesses the degree to which the generated answer is supported by retrieved documents obtained in a RAG pipeline. Therefore, it measures if the generated answer is faithful to the retrieved passage or if it contains hallucinated or extrapolated statements beyond the passage.
- Answer relevance: Measures the relevance of the generated answer to the query and retrieved passage.
- Metric Classification:
- Syntactic metrics evaluate formal response aspects, including BLEU [72], ROUGE [38], precision, recall, F1, and Exact Match [43]. Because they focus on the text’s formal properties rather than its content or meaning, these metrics are generally considered less indicative of the semantic value of the generated responses.
- Semantic metrics evaluate response meaning, including BERTScore [39] and the BEM score [66]. The BEM score is preferred to BERTScore because of its stronger correlation with human evaluations, as reported in the original study we refer to, and because we empirically found that BERTScore tends to concentrate its values in a narrow sub-interval of its range. The LLM-based metrics also belong to this group.
- Manual Evaluation:
- Very Poor: The generated answer is totally incorrect or irrelevant to the question. This case indicates a failure of the system to comprehend the query or retrieve pertinent information.
- Poor: The generated answer is predominantly incorrect but with glimpses of relevance, suggesting some level of understanding or appropriate retrieval.
- Neither: The generated answer mixes relevant and irrelevant information almost equally, showcasing the system’s partial success in addressing the query.
- Good: The generated answer is largely correct but includes minor inaccuracies or irrelevant details, demonstrating a strong understanding and response to the question.
- Very Good: Reserved for completely correct and fully relevant answers, reflecting an ideal outcome where the system accurately understood and responded to the query.
- Inter-metric Correlation: We assess the agreement between automated metrics and human judgments using Spearman’s rank correlation (see Section 4.2.2); a sketch of this computation, together with the reference-based scores above, follows.
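The following sketch illustrates the reference-based scores and the inter-metric correlation under stated assumptions: the token-level F1 is implemented from its definition, ROUGE and BERTScore are loaded from the HuggingFace evaluate package, and Spearman’s rank correlation comes from SciPy; the numeric values are toy examples, not results from the paper.

```python
# Sketch of the reference-based QA scores and the inter-metric correlation.
# Assumptions: the HuggingFace `evaluate` package provides ROUGE and BERTScore;
# SciPy provides Spearman's rank correlation; token F1 follows its definition.
from collections import Counter
import evaluate
from scipy.stats import spearmanr

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of precision and recall over overlapping tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

predictions = ["The Colosseum was completed in 80 AD", "<no answer>"]
references = ["It was completed in 80 AD", "Rome"]

print([token_f1(p, r) for p, r in zip(predictions, references)])
print(evaluate.load("rouge").compute(predictions=predictions, references=references))
print(evaluate.load("bertscore").compute(predictions=predictions, references=references, lang="en"))

# Inter-metric correlation: Spearman's rho between human Likert judgments (1-5)
# and an automated metric (e.g., BEM) computed over the same set of answers.
human = [5, 4, 2, 1, 3]
metric = [0.90, 0.70, 0.35, 0.20, 0.55]
rho, p_value = spearmanr(human, metric)
print(rho, p_value)
```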
3.6. Experimental Design
3.6.1. Hardware and Software Specifications
3.6.2. Experimental Procedure
- Data preparation: We processed and indexed all documents using each embedding model.
- Processing: We implemented specific processing workflows for both IR and QA tasks.
- Evaluation: We applied our comprehensive set of evaluation metrics.
- IR Pipeline.
- Data preparation:
- (a) Indexed all documents in the corpus using each embedding model.
- (b) For documents exceeding the maximum token limit, we applied single-chunk truncation, following the BEIR settings.
- Processing:
- Query Processing: Encoded each query using the corresponding embedding model.
- Retrieval:
- (a) Used Milvus for efficient similarity search.
- (b) Retrieved the top-k documents for each query, with extensive experiments reported for k = 10.
- Evaluation:
- (a) Computed nDCG@10, MAP@10, Recall@10, and Precision@10 for each model on each dataset, focusing on nDCG@10 as the primary metric.
- (b) Used existing relevance judgments where available; for datasets without explicit judgments (e.g., DICE), we considered documents relevant if matching the ground truth.
- QA Pipeline.
- Data preparation:
- (a) Indexed documents using Cohere embed-multilingual-v3.0 (the best-performing IR model based on nDCG@10).
- (b) Split documents into passages of 512 tokens without sliding windows, balancing semantic integrity with information relevance.
- Processing:
- Query processing: Encoded each query using the corresponding embedding model.
- Retrieval stage: Used Cohere embed-multilingual-v3.0 to retrieve the top-10 passages.
- Answer generation:
- (a) Constructed bilingual prompts, combining questions and retrieved passages.
- (b) Applied consistent prompt templates across all models and datasets, as shown in Table 5.
- (c) Generated answers using each LLM.
During generation, we employed the prompt structure shown in Table 5 for both English and Italian tasks. This prompt structure provides explicit instructions and context to the language model while encouraging concise and truthful answers without fabrication. A sketch of this generation step, together with the prompted evaluation described below, follows the pipeline description.
- Evaluation:
- (a) Computed reference-based metrics (BERTScore, BEM, ROUGE, BLEU, EM, F1) using generated answers and ground truth.
- (b) Used GPT-3.5-turbo to compute reference-free metrics (answer relevance, groundedness, context relevance) through prompted evaluation.
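The sketch below makes the generation and prompted-evaluation steps concrete: it fills the English prompt template of Table 5 and then asks GPT-3.5-turbo for a groundedness judgment. The judge prompt wording and its 0–1 normalization are illustrative assumptions (the exact evaluation prompts are not reproduced here), and only the API-based models are shown.

```python
# Sketch of answer generation with the Table 5 prompt and of one prompted
# reference-free metric (groundedness). The judge prompt wording and the 0-1
# normalization are assumptions; only API-based models are shown here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_EN = (
    "You are a Question Answering system that is rewarded if the response is short, "
    "concise and straight to the point, use the following pieces of context to answer "
    "the question at the end. If the context doesn't provide the required information "
    "simply respond <no answer>.\n"
    "Context: {retrieved_passages}\nQuestion: {human_question}\nAnswer:"
)

JUDGE_PROMPT = (
    "Rate from 0 to 10 how well the ANSWER is supported by the CONTEXT alone "
    "(10 = fully supported, 0 = unsupported). Reply with a single number.\n"
    "CONTEXT: {context}\nANSWER: {answer}"
)

def chat(model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}], temperature=0
    )
    return response.choices[0].message.content

def generate_answer(passages: list[str], question: str) -> str:
    prompt = PROMPT_EN.format(retrieved_passages="\n".join(passages), human_question=question)
    return chat("gpt-4o", prompt)

def groundedness(passages: list[str], answer: str) -> float:
    prompt = JUDGE_PROMPT.format(context="\n".join(passages), answer=answer)
    return float(chat("gpt-3.5-turbo", prompt).strip()) / 10.0  # normalize to [0, 1]
```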
3.6.3. Reproducibility Measures
- Randomization control: Fixed random seeds for all processes requiring randomization.
- Sampling strategy:
- − Used standard dataset splits where available.
- − Selected statistically valid representative subsets when working with large datasets:
- ∗ A total of 150 tuples from the SQuAD-en validation set (1.5% of the dev set).
- ∗ A total of 150 tuples from the SQuAD-it test set, enabling direct cross-lingual comparison (1.9% of the test set).
- ∗ A total of 100 balanced tuples from NarrativeQA (50 books, 50 movie scripts).
- Model configuration:
- − Used default pre-trained weights without fine-tuning for all models.
- − Maintained consistent parameters across experiments (e.g., 512-token chunk size).
- Implementation environment:
- − Google Colab platform.
- − Python with the Langchain framework.
- − Milvus vector store.
- − Standardized evaluation protocols and thresholds.
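As an example of the seeded sampling, the sketch below draws 150 tuples from the SQuAD-it test set with the fixed seed 433 reported in the dataset overview; the HuggingFace `squad_it` dataset identifier is an assumption.

```python
# Sketch of the seeded subset selection used for reproducibility: 150 tuples
# drawn from the SQuAD-it test set with the fixed seed 433 reported in the
# dataset overview table. The HuggingFace dataset identifier is an assumption.
import random
from datasets import load_dataset

random.seed(433)                                    # fixed seed for all sampling
squad_it_test = load_dataset("squad_it", split="test")
indices = random.sample(range(len(squad_it_test)), k=150)
subset = squad_it_test.select(sorted(indices))      # ~1.9% of the test set
print(len(subset))
```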
3.7. Ethical Considerations and Limitations
3.7.1. Ethical Considerations
- Data ethics: Ensured strict compliance with dataset licenses and usage agreements while maintaining transparency of data sources and processing methods.
- Model usage: Adhered to API providers’ usage policies and rate limits, particularly for commercial models like GPT-4o.
- Transparency: Thoroughly documented model limitations and potential output biases to ensure transparent reporting of system capabilities and constraints.
- Resource efficiency: Designed experiments to minimize computational resource usage while maintaining statistical validity.
3.7.2. Limitations and Potential Biases
- Dataset coverage limitations: Our dataset selection, though diverse, represents only a fraction of potential real-world scenarios across languages and domains. While our datasets span general knowledge (SQuAD), scientific content (SciFact), and specialized domains (NFCorpus), they cannot capture the full spectrum of linguistic variations and domain-specific applications.
- Model accessibility constraints: Access limitations to proprietary models and computational constraints prevented exhaustive experimentation with all available models. This particularly affects comparisons involving commercial models with limited architectural details transparency.
- Evaluation metric limitations: Current evaluation metrics, while diverse, may not capture all nuanced aspects of model performance, particularly for complex tasks requiring sophisticated reasoning. The challenge of quantifying dimensions like answer relevance and factual accuracy remains an active research area.
- Cross-lingual analysis constraints: Our cross-lingual analysis, while valuable, is limited to the English–Italian language pair. This specific focus means our findings may not generalize to language pairs with greater linguistic distance or to languages with significantly different morphological characteristics.
- Resource and sampling constraints: Practical resource constraints necessitated using dataset subsets rather than complete datasets. We ensured statistical validity through careful sampling strategies (fixed random seeds, balanced representation). Our consistent findings across multiple metrics suggest the patterns observed likely represent broader trends.
- Temporal considerations: Our results represent a snapshot of model capabilities at a specific point in time. Given the rapid evolution of NLP technology, future developments may shift the relative performance characteristics we observed, particularly as new models and architectures emerge.
4. Results
4.1. Information Retrieval Performance
4.1.1. Cross Domain Results
- Performance varies significantly by domain, with no single model achieving universal superiority across all tasks.
- Multilingual-e5-large achieves the highest performance on general domain tasks, with an nDCG@10 of 0.91 on SQuAD-en.
- The BGE models demonstrate particular strength in specialized content, achieving top performance on ArguAna (0.64) and SciFact (0.75).
- The GTE and BGE architectures show robust adaptability to scientific and medical domains, maintaining strong performance across SciFact and NFCorpus datasets.
4.1.2. Cross-Language Results
- Multilingual models consistently outperform Italian-specific models (e.g., BERTino) across both datasets.
- Multilingual-e5-large achieves top performance on SQuAD-it (nDCG@10: 0.86).
- Embed-multilingual-v3.0 demonstrates exceptional versatility, excelling in both SQuAD-it (0.86) and DICE (0.72).
- The performance gap between multilingual and monolingual models suggests superior domain adaptation capabilities in larger multilingual architectures.
4.1.3. Retrieval Size Impact
4.2. Question Answering Performance
4.2.1. Model Performance Across Tasks and Languages
- Model specialization: (i) Llama 3.1 8b excels in syntactic accuracy on general domain tasks. (ii) GPT-4o demonstrates superior cross-lingual capabilities. (iii) Mistral-Nemo achieves consistent performance across diverse tasks.
- Performance patterns: (i) BERTScores indicate strong semantic understanding across all models. (ii) Groundedness scores decrease in complex domains. (iii) Semantic metrics consistently outperform syntactic measures.
- Domain effects: (i) Factual domains (CovidQA) show higher groundedness scores. (ii) Narrative domains pose greater challenges for factual accuracy. (iii) Cross-lingual performance remains robust in structured tasks.
4.2.2. Metrics Effectiveness vs. Human Evaluation
5. Discussion
5.1. The Domain Specialization Challenge
5.2. Cross-Lingual Performance: A Tale of Two Languages
5.3. The Architecture vs. Scale Debate
5.4. Patterns in Model Evaluation Metrics
5.5. Practical Implications and Ethical Considerations
6. Conclusions
- The success of well-designed smaller models suggests that focused architectural innovation may be more valuable than simply scaling up existing approaches.
- The consistent pattern of domain-specific performance degradation indicates a need for more sophisticated approaches to specialized knowledge transfer across languages.
- The challenge of maintaining answer groundedness while preserving natural language generation capabilities emerges as a critical area for future work.
- Dataset diversity: Future work should expand to include a wider range of languages and domains to further validate the cross-lingual and domain adaptation capabilities of these models.
- Domain adaptation: The documented performance drop from general (0.90 nDCG@10) to specialized domains (0.36 nDCG@10) calls for more sophisticated domain adaptation mechanisms.
- Cross-lingual knowledge transfer: The varying performance gaps between English and Italian, particularly in specialized domains, suggest the need for improved cross-lingual transfer methods. There is a need to explore methods for leveraging high-resource language models to improve performance on low-resource languages, potentially through zero-shot or few-shot learning approaches.
- Improved groundedness: Develop techniques to enhance the faithfulness of LLM-generated answers to the provided context, possibly through modified training objectives or architectural changes.
- Architectural innovation: The comparable performance of models of different sizes (e.g., multilingual-E5-base and multilingual-E5-large show minimal differences in nDCG@10) indicates that architectural efficiency may be more crucial than model scale. Therefore, developing more efficient architectures that maintain performance across languages without requiring massive computational resources is necessary.
- Long-context LLMs for QA: There is a need to explore the potential of emerging long-context LLMs (e.g., Claude 3 and GPT-4 with extended context) in handling complex, multi-hop QA tasks without the need for separate retrieval steps. To address long documents, we will compare this approach with a smart selection of chunks through structural document representation (Document Object Model).
- Dynamic retrieval: Investigate adaptive retrieval methods that can dynamically adjust the number of retrieved passages based on query complexity or ambiguity.
- Multimodal IR and QA: Extend the current work to include multimodal information retrieval and question answering, incorporating text, images, and potentially other modalities.
- Evaluation methodologies: Advance our understanding of how to assess both technical performance and practical utility in real-world applications. Develop better factual accuracy metrics given the observed groundedness challenges.
- Development of specialized multilingual metrics: Evaluation metrics specifically designed for multilingual and cross-lingual scenarios are needed. Future research should focus on creating metrics that better capture the unique challenges of cross-lingual knowledge transfer and factual groundedness across languages, potentially incorporating language-specific linguistic features and cultural contexts.
- Model updates: Given the rapid pace of development in NLP, regular re-evaluations with newly released models will be necessary to keep findings current.
- Interpretability and explainability: Develop methods for better understanding and interpreting the decision-making processes of dense retrievers and LLMs in IR and QA tasks.
- Ethical AI in IR and QA: Further investigation into bias mitigation and fairness across languages and cultures in IR and QA systems. Develop frameworks for ethical deployment of AI-powered systems, including methods for bias detection and mitigation and strategies for clearly communicating model limitations to end-users.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Newry, UK, 2020; Volume 33, pp. 1877–1901. [Google Scholar]
- Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling language modeling with pathways. J. Mach. Learn. Res. 2023, 24, 11324–11436. [Google Scholar]
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; pp. 3982–3992. [Google Scholar] [CrossRef]
- Ni, J.; Qu, C.; Lu, J.; Dai, Z.; Ábrego, G.H.; Ma, J.; Zhao, V.Y.; Luan, Y.; Hall, K.B.; Chang, M.W.; et al. Large Dual Encoders Are Generalizable Retrievers. arXiv 2021, arXiv:2112.07899. [Google Scholar]
- Mistral AI. Mistral Nemo: Introducing Our New Model. 2024. Available online: https://mistral.ai/news/mistral-nemo/ (accessed on 29 March 2025).
- Gemma Team. Gemma: Open Models Based on Gemini Research and Technology. arXiv 2024, arXiv:2403.08295. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.U.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Newry, UK, 2017; Volume 30. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Volume 1 (Long and Short Papers), pp. 4171–4186. [Google Scholar] [CrossRef]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551. [Google Scholar]
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS ’20), Red Hook, NY, USA, 6–12 December 2020. [Google Scholar]
- Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Guo, Q.; Wang, M.; et al. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2024, arXiv:2312.10997. [Google Scholar]
- Cotterell, R.; Mielke, S.J.; Eisner, J.; Roark, B. Are All Languages Equally Hard to Language-Model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; Walker, M., Ji, H., Stent, A., Eds.; Volume 2 (Short Papers), pp. 536–541. [Google Scholar] [CrossRef]
- Thakur, N.; Reimers, N.; Rücklé, A.; Srivastava, A.; Gurevych, I. BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. arXiv 2021, arXiv:2104.08663. [Google Scholar]
- Muennighoff, N.; Tazi, N.; Magne, L.; Reimers, N. MTEB: Massive Text Embedding Benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, 2–6 May 2023; Vlachos, A., Augenstein, I., Eds.; pp. 2014–2037. [Google Scholar] [CrossRef]
- Hambarde, K.A.; Proença, H. Information Retrieval: Recent Advances and Beyond. IEEE Access 2023, 11, 76581–76604. [Google Scholar] [CrossRef]
- Anand, A.; Lyu, L.; Idahl, M.; Wang, Y.; Wallat, J.; Zhang, Z. Explainable Information Retrieval: A Survey. arXiv 2022, arXiv:2211.02405. [Google Scholar]
- Tang, Y.; Yang, Y. MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries. arXiv 2024, arXiv:2401.15391. [Google Scholar]
- Zhang, Z.; Fang, M.; Chen, L. RetrievalQA: Assessing Adaptive Retrieval-Augmented Generation for Short-form Open-Domain Question Answering. arXiv 2024, arXiv:2402.16457. [Google Scholar]
- Gao, M.; Hu, X.; Ruan, J.; Pu, X.; Wan, X. LLM-based NLG Evaluation: Current Status and Challenges. arXiv 2024, arXiv:2402.01383. [Google Scholar] [CrossRef]
- Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.T. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Webber, B., Cohn, T., He, Y., Liu, Y., Eds.; pp. 6769–6781. [Google Scholar] [CrossRef]
- Khattab, O.; Zaharia, M. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’20), Xi’an, China, 25–30 July 2020; pp. 39–48. [Google Scholar] [CrossRef]
- Xiong, L.; Xiong, C.; Li, Y.; Tang, K.F.; Liu, J.; Bennett, P.; Ahmed, J.; Overwijk, A. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. arXiv 2020, arXiv:2007.00808. [Google Scholar]
- Nogueira, R.; Yang, W.; Lin, J.; Cho, K. Document Expansion by Query Prediction. arXiv 2019, arXiv:1904.08375. [Google Scholar]
- Gao, L.; Callan, J. Condenser: A Pre-Training Architecture for Dense Retrieval. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, 7–11 November 2021; Moens, M.F., Huang, X., Specia, L., Yih, S.W.T., Eds.; pp. 981–993. [Google Scholar] [CrossRef]
- Wang, L.; Yang, N.; Huang, X.; Jiao, B.; Yang, L.; Jiang, D.; Majumder, R.; Wei, F. Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv 2024, arXiv:2212.03533. [Google Scholar]
- Xiao, S.; Liu, Z.; Zhang, P.; Muennighoff, N.; Lian, D.; Nie, J.Y. C-Pack: Packaged Resources To Advance General Chinese Embedding. arXiv 2024, arXiv:2309.07597. [Google Scholar]
- Xiao, S.; Liu, Z.; Shao, Y.; Cao, Z. RetroMAE: Pre-Training Retrieval-Oriented Language Models Via Masked Auto-Encoder. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; pp. 538–548. [Google Scholar] [CrossRef]
- Liu, Z.; Xiao, S.; Shao, Y.; Cao, Z. RetroMAE-2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Volume 1: Long Papers, pp. 2635–2648. [Google Scholar] [CrossRef]
- Muffo, M.; Bertino, E. BERTino: An Italian DistilBERT model. arXiv 2023, arXiv:2303.18121. [Google Scholar]
- Liu, N.F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Petroni, F.; Liang, P. Lost in the Middle: How Language Models Use Long Contexts. Trans. Assoc. Comput. Linguist. 2024, 12, 157–173. [Google Scholar] [CrossRef]
- Guu, K.; Lee, K.; Tung, Z.; Pasupat, P.; Chang, M. Retrieval Augmented Language Model Pre-Training. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Online, 13–18 July 2020; Volume 119, pp. 3929–3938. [Google Scholar]
- Khattab, O.; Potts, C.; Zaharia, M. Relevance-guided Supervision for OpenQA with ColBERT. arXiv 2021, arXiv:2007.00814. [Google Scholar] [CrossRef]
- Shuster, K.; Poff, S.; Chen, M.; Kiela, D.; Weston, J. Retrieval Augmentation Reduces Hallucination in Conversation. arXiv 2021, arXiv:2104.07567. [Google Scholar]
- Huo, S.; Arabzadeh, N.; Clarke, C. Retrieving Supporting Evidence for Generative Question Answering. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, Beijing, China, 26–28 November 2023; pp. 11–20. [Google Scholar] [CrossRef]
- Zhang, T.; Patil, S.G.; Jain, N.; Shen, S.; Zaharia, M.; Stoica, I.; Gonzalez, J.E. RAFT: Adapting Language Model to Domain Specific RAG. arXiv 2024, arXiv:2403.10131. [Google Scholar]
- Carterette, B.; Voorhees, E.M. Overview of information retrieval evaluation. In Current Challenges in Patent Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2011; pp. 69–85. [Google Scholar] [CrossRef]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; Isabelle, P., Charniak, E., Lin, D., Eds.; pp. 311–318. [Google Scholar] [CrossRef]
- Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
- Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- Es, S.; James, J.; Espinosa Anke, L.; Schockaert, S. RAGAs: Automated Evaluation of Retrieval Augmented Generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, St. Julians, Malta, 17–22 March 2024; Aletras, N., De Clercq, O., Eds.; pp. 150–158. [Google Scholar]
- Katranidis, V.; Barany, G. FaaF: Facts as a Function for the evaluation of RAG systems. arXiv 2024, arXiv:2403.03888. [Google Scholar]
- Saad-Falcon, J.; Khattab, O.; Potts, C.; Zaharia, M. ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Mexico City, Mexico, 16–21 June 2024; Duh, K., Gomez, H., Bethard, S., Eds.; Volume 1: Long Papers, pp. 338–354. [Google Scholar] [CrossRef]
- Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; Su, J., Duh, K., Carreras, X., Eds.; pp. 2383–2392. [Google Scholar] [CrossRef]
- Rajpurkar, P.; Jia, R.; Liang, P. Know What You Don’t Know: Unanswerable Questions for SQuAD. arXiv 2018, arXiv:1806.03822. [Google Scholar]
- Chen, D.; Fisch, A.; Weston, J.; Bordes, A. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; Barzilay, R., Kan, M.Y., Eds.; Volume 1: Long Papers, pp. 1870–1879. [Google Scholar] [CrossRef]
- Zhang, Y.; Nie, P.; Geng, X.; Ramamurthy, A.; Song, L.; Jiang, D. DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding. arXiv 2020, arXiv:2002.12591. [Google Scholar]
- Croce, D.; Zelenanska, A.; Basili, R. Neural Learning for Question Answering in Italian. In Proceedings of the International Conference of the Italian Association for Artificial Intelligence (AI*IA 2018), Trento, Italy, 20–23 November 2018; Ghidini, C., Magnini, B., Passerini, A., Traverso, P., Eds.; pp. 389–402. [Google Scholar]
- Bonisoli, G.; Di Buono, M.P.; Po, L.; Rollo, F. DICE: A Dataset of Italian Crime Event News. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’23), Taipei, Taiwan, 23–27 July 2023; pp. 2985–2995. [Google Scholar] [CrossRef]
- Wadden, D.; Lin, S.; Lo, K.; Wang, L.L.; van Zuylen, M.; Cohan, A.; Hajishirzi, H. Fact or Fiction: Verifying Scientific Claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Webber, B., Cohn, T., He, Y., Liu, Y., Eds.; pp. 7534–7550. [Google Scholar] [CrossRef]
- Wachsmuth, H.; Syed, S.; Stein, B. Retrieval of the Best Counterargument Without Prior Topic Knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; Gurevych, I., Miyao, Y., Eds.; Volume 1: Long Papers, pp. 241–251. [Google Scholar] [CrossRef]
- Boteva, V.; Ghalandari, D.G.; Sokolov, A.; Riezler, S. A full-text learning to rank dataset for medical information retrieval. In Advances in Information Retrieval, Proceedings of the 38th European Conference on IR Research (ECIR 2016), Padua, Italy, 20–23 March 2016; Lecture Notes in Computer Science; Ferro, N., Crestani, F., Moens, M.F., Mothe, J., Silvestri, F., Nunzio, G.M.D., Hauff, C., Silvello, G., Eds.; Springer: Cham, Switzerland, 2016; Volume 9626, pp. 716–722. [Google Scholar]
- Tang, R.; Nogueira, R.; Zhang, E.; Gupta, N.; Cam, P.; Cho, K.; Lin, J. Rapidly Bootstrapping a Question Answering Dataset for COVID-19. arXiv 2020, arXiv:2004.11339. [Google Scholar]
- Wang, L.L.; Lo, K.; Chandrasekhar, Y.; Reas, R.; Yang, J.; Burdick, D.; Eide, D.; Funk, K.; Katsis, Y.; Kinney, R.; et al. CORD-19: The COVID-19 Open Research Dataset. arXiv 2020, arXiv:2004.10706. [Google Scholar]
- Kočiský, T.; Schwarz, J.; Blunsom, P.; Dyer, C.; Hermann, K.M.; Melis, G.; Grefenstette, E. The NarrativeQA Reading Comprehension Challenge. Trans. Assoc. Comput. Linguist. 2018, 6, 317–328. [Google Scholar] [CrossRef]
- Li, Z.; Zhang, X.; Zhang, Y.; Long, D.; Xie, P.; Zhang, M. Towards General Text Embeddings with Multi-stage Contrastive Learning. arXiv 2023, arXiv:2308.03281. [Google Scholar]
- Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783. [Google Scholar]
- Gemma Team. Gemma 2: Improving Open Language Models at a Practical Size. arXiv 2024, arXiv:2408.00118. [Google Scholar]
- Wrzalik, M.; Krechel, D. CoRT: Complementary Rankings from Transformers. arXiv 2021, arXiv:2010.10252. [Google Scholar]
- Tang, H.; Sun, X.; Jin, B.; Wang, J.; Zhang, F.; Wu, W. Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval. arXiv 2021, arXiv:2105.03599. [Google Scholar]
- Lawrie, D.; Yang, E.; Oard, D.W.; Mayfield, J. Neural Approaches to Multilingual Information Retrieval. arXiv 2023, arXiv:2209.01335. [Google Scholar]
- Esteva, A.; Kale, A.; Paulus, R.; Hashimoto, K.; Yin, W.; Radev, D.; Socher, R. CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization. arXiv 2020, arXiv:2006.09595. [Google Scholar]
- Zhan, J.; Mao, J.; Liu, Y.; Zhang, M.; Ma, S. RepBERT: Contextualized Text Embeddings for First-Stage Retrieval. arXiv 2020, arXiv:2006.15498. [Google Scholar]
- Yang, Y.; Jin, N.; Lin, K.; Guo, M.; Cer, D. Neural Retrieval for Question Answering with Cross-Attention Supervised Data Augmentation. arXiv 2020, arXiv:2009.13815. [Google Scholar]
- Durmus, E.; He, H.; Diab, M. FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020. [Google Scholar] [CrossRef]
- Goyal, T.; Li, J.J.; Durrett, G. News Summarization and Evaluation in the Era of GPT-3. arXiv 2023, arXiv:2209.12356. [Google Scholar]
- Bulian, J.; Buck, C.; Gajewski, W.; Börschinger, B.; Schuster, T. Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question Answering Evaluation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; pp. 291–305. [Google Scholar] [CrossRef]
- Kamalloo, E.; Dziri, N.; Clarke, C.L.A.; Rafiei, D. Evaluating Open-Domain Question Answering in the Era of Large Language Models. arXiv 2023, arXiv:2305.06984. [Google Scholar]
- Lerner, P.; Ferret, O.; Guinaudeau, C. Cross-modal Retrieval for Knowledge-based Visual Question Answering. arXiv 2024, arXiv:2401.05736. [Google Scholar]
- Blagec, K.; Dorffner, G.; Moradi, M.; Ott, S.; Samwald, M. A global analysis of metrics used for measuring performance in natural language processing. arXiv 2022, arXiv:2204.11574. [Google Scholar]
- Blagec, K.; Dorffner, G.; Moradi, M.; Samwald, M. A critical analysis of metrics used for measuring progress in artificial intelligence. arXiv 2021, arXiv:2008.02577. [Google Scholar]
- Joshi, M.; Choi, E.; Weld, D.S.; Zettlemoyer, L. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. arXiv 2017, arXiv:1705.03551. [Google Scholar]
- Lin, C.Y.; Hovy, E. Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Edmonton, AB, Canada, 27 May–1 June 2003; pp. 150–157. [Google Scholar]
- Weaver, K.F.; Morales, V.; Dunn, S.L.; Godde, K.; Weaver, P.F. Pearson’s and Spearman’s correlation. In An Introduction to Statistical Analysis in Research; John Wiley and Sons, Ltd.: Hoboken, NJ, USA, 2017; pp. 435–471. [Google Scholar] [CrossRef]
Dataset | Task | Domain | Language | Used Samples | Retrieval Unit Granularity |
---|---|---|---|---|---|
SQuAD-en | IR and QA | Open | English | 150 of 10.6k tuples (Dev-set = sample of SQuAD-it) | Entire paragraphs |
SQuAD-it | IR and QA | Open | Italian | 150 of 7.6k tuples (Test set, random seed 433) | Entire paragraphs |
DICE | IR-News Retrieval | Crime News | Italian | All 10.3k tuples | Single chunk with truncation |
SciFact | IR-Fact checking | Scientific literature | English | 5k test set tuples (300 queries) | Single chunk with truncation |
ArguAna | IR-Argument Retrieval | Misc. Arguments | English | 1.4k queries (Test set, corpus of 8.6k docs) | Single chunk with truncation |
NFCorpus | IR | Bio-Medical | English | 323 queries (Test set) (3.6k docs) | Single chunk with truncation |
CovidQA | QA | Medical | English | All 124 tuples (27 questions, and 85 unique articles) | Chunks of 512 tokens of CORD19 documents |
NarrativeQA | QA-RC | Narrative books and Movie scripts | English | 100 queries (50 books, 50 movies) | Chunks of 512 tokens |
NarrativeQA-translated | QA-RC | Narrative books and Movie scripts | Cross-lingual (En docs, It QA) | Same as NarrativeQA | Chunks of 512 tokens |
Model | Parameters | Max Input Length | Language |
---|---|---|---|
GTE-base a [55] | 109M | 512 | English |
GTE-large b [55] | 335M | 512 | English |
BGE-base-en-v1.5 c [26] | 109M | 512 | English |
BGE-large-en-v1.5 d [26] | 335M | 512 | English |
multilingual-E5-base e [25] | 278M | 512 | Multilingual |
multilingual-E5-large f [25] | 560M | 512 | Multilingual |
text-embedding-ada-002 (OpenAI) g | Not disclosed | 8192 | Multilingual |
embed-multilingual-v2.0 (Cohere) h,i | Not disclosed | 256 | Multilingual |
embed-multilingual-v3.0 (Cohere) | Not disclosed | 512 | Multilingual |
sentence-bert-base j | 109M | 512 | Italian |
BERTino k [29] | 65M | 512 | Italian |
BERTino v2 l | 65M | 512 | Italian |
Model | Company | API/Open-Source | Param. | Context Win. | Language |
---|---|---|---|---|---|
GPT-4o a | OpenAI | API-based | >175B | 128,000 | Multilingual |
Llama 3.1 8b b [56] | Meta | Open-source | 8.03B | 8192 c | Multilingual |
Mistral-Nemo d,e | MistralAI | Open-source | 12.2B | 128,000 | Multilingual |
Gemma2b f [6,57] | Google | Open-source | 2.51B | 128,000 | English |
Metric | Type | Category | Description | Formula/Calculation | Range | Advantages | Limitations |
---|---|---|---|---|---|---|---|
Information Retrieval Metrics | |||||||
NDCG@k | IR | Ranking Quality | A ranking quality metric comparing rankings to an ideal order, where relevant items are at the top | $\mathrm{DCG@}k / \mathrm{IDCG@}k$ | 0 to 1 | Effectively captures quality of ranking, considering both relevance and position of retrieved items | Requires graded relevance judgments; sensitive to the choice of discount function |
MAP@k | IR | Retrieval Quality | A retrieval quality metric that measures both the relevance of items and the system’s ability to rank the most relevant items higher | $\frac{1}{|Q|}\sum_{q \in Q}\mathrm{AP@}k(q)$ | 0 to 1 | Provides a single score that summarizes precision at various recall levels | Binary relevance only (no graded relevance) |
Recall@k | IR | Retrieval Completeness | A retrieval completeness metric measuring the proportion of relevant documents successfully retrieved within the top k results | $TP_k / (TP_k + FN_k)$ | 0 to 1 | Assesses system’s ability to find all relevant information | Does not consider ranking order; undefined when no relevant documents; provides only partial view of system performance |
Precision@k | IR | Retrieval Accuracy | A retrieval accuracy metric measuring the proportion of retrieved documents in the top k results that are relevant | $TP_k / (TP_k + FP_k)$ | 0 to 1 | Assesses system’s ability to return accurate results | Does not consider ranking order; undefined if no documents returned; focuses only on false positive rate |
Question Answering Metrics | |||||||
Syntactic (Reference-based) | |||||||
ROUGE-L | QA | Syntactic (Reference-based) | Scores the Longest Common Subsequence shared by the generated and reference texts | $F_{lcs} = \frac{(1+\beta^2)R_{lcs}P_{lcs}}{R_{lcs}+\beta^2 P_{lcs}}$, with $R_{lcs} = LCS(X,Y)/m$ and $P_{lcs} = LCS(X,Y)/n$ | 0 to 1 | Evaluates quality without language-specific dependencies | Does not consider word semantics; sensitive to sentence structure |
F1 Score | QA | Syntactic (Reference-based) | Harmonic mean of precision and recall of word overlap | $2 \cdot \mathrm{Precision} \cdot \mathrm{Recall} / (\mathrm{Precision} + \mathrm{Recall})$ | 0 to 1 | Handles unbalanced classes; summarizes precision and recall | Harder to interpret when taken alone |
Semantic (Reference-based) | |||||||
BERTScore | QA | Semantic (Reference-based) | Measures semantic similarity using contextual embeddings | Sum of cosine similarities between token embeddings | 0 to 1 | Handles semantically similar but formally different sentences | Computationally expensive; performance depends on underlying language model |
BEM | QA | Semantic (Reference-based) | Uses a fine-tuned BERT trained to assess answer equivalence | Fine-tuned BERT model score | 0 to 1 | Better correlation with human evaluations than BERTScore | Domain-specific; limited by the training data used for fine-tuning |
Semantic (Reference-free, LLM-based) | |||||||
Context Relevance | QA | Semantic (Reference-free) | Evaluates retrieved-context relevance to the question | LLM-based evaluation | 0 to 1 | Useful for evaluating IR after obtaining answer | Depends on LLM quality; potentially high computational cost; variable output |
Groundedness | QA | Semantic (Reference-free) | Assesses the degree to which the generated answer is supported by retrieved documents | LLM-based evaluation | 0 to 1 | Measures faithfulness to retrieved passage | Subjective to LLM interpretation; may miss nuanced logical inconsistencies |
Answer Relevance | QA | Semantic (Reference-free) | Measures relevance of generated answer to query and retrieved passage | LLM-based evaluation | 0 to 1 | Evaluates overall answer quality | Inherits LLM biases; less reliable for specialized domains |
Other Evaluation Methods | |||||||
Manual Evaluation | QA | Human Assessment | 5-point Likert scale assessment by human annotators | Human judgment on 1–5 scale | 1 to 5 | Gold standard for evaluation | High costs in terms of money and time; potential subjectivity between annotators |
Spearman Rank Correlation | Meta | Statistical | Assesses correlation between automated metrics and human evaluation | Statistical calculation on ranked data | −1 to 1 | Works with ordinal and continuous variables | Sensitive to ties; does not capture non-monotonic relationships |
You are a Question Answering system that is rewarded if the response is short, concise and straight to the point, use the following pieces of context to answer the question at the end. If the context doesn’t provide the required information simply respond <no answer>. Context: {retrieved_passages} Question: {human_question} Answer: |
Sei un sistema in grado di rispondere a domande e che viene premiato se la risposta è breve, concisa e dritta al punto, utilizza i seguenti pezzi di contesto per rispondere alla domanda alla fine. Se il contesto non fornisce le informazioni richieste, rispondi semplicemente <nessuna risposta>. Context: {retrieved_passages} Question: {human_question} Answer: |
Model | SQuAD-en | SQuAD-it | DICE | SciFact | ArguAna | NFCorpus |
---|---|---|---|---|---|---|
GTE-base | 0.87 | _ | _ | 0.74 | 0.56 | 0.37 |
GTE-large | 0.87 | _ | _ | 0.74 | 0.57 | 0.38 |
bge-base-en-v1.5 | 0.86 | _ | _ | 0.74 | 0.64 | 0.37 |
bge-large-en-v1.5 | 0.89 | _ | _ | 0.75 | 0.64 | 0.38 |
multilingual-e5-base | 0.90 | 0.85 | 0.56 | 0.69 | 0.51 | 0.32 |
multilingual-e5-large | 0.91 | 0.86 | 0.64 | 0.70 | 0.54 | 0.34 |
text-embedding-ada-002 (OpenAI) | 0.86 | 0.79 | 0.54 | 0.71 | 0.55 | 0.37 |
embed-multilingual-v2.0 (Cohere) | 0.84 | 0.79 | 0.64 | 0.66 | 0.55 | 0.32 |
embed-multilingual-v3.0 (Cohere) | 0.90 | 0.86 | 0.72 | 0.70 | 0.55 | 0.36 |
sentence-bert-base | _ | 0.52 | 0.22 | _ | _ | _ |
BERTino | _ | 0.57 | 0.33 | _ | _ | _ |
BERTino v2 | _ | 0.64 | 0.40 | _ | _ | _ |
Model | SQuAD-en | SQuAD-it | ||||||
---|---|---|---|---|---|---|---|---|
nDCG@10 | MAP | R@10 | P@10 | nDCG@10 | MAP | R@10 | P@10 | |
GTE-base | 0.87 | 0.83 | 0.98 | 0.098 | — | — | — | — |
GTE-large | 0.87 | 0.83 | 0.97 | 0.097 | — | — | — | — |
bge-base-en-v1.5 | 0.86 | 0.81 | 0.95 | 0.095 | — | — | — | — |
bge-large-en-v1.5 | 0.89 | 0.85 | 0.98 | 0.098 | — | — | — | — |
multilingual-e5-base | 0.90 | 0.87 | 0.98 | 0.098 | 0.85 | 0.80 | 0.97 | 0.097 |
multilingual-e5-large | 0.91 | 0.88 | 1.00 | 0.100 | 0.86 | 0.83 | 0.95 | 0.095 |
text-embedding-ada-002 (OpenAI) | 0.86 | 0.83 | 0.94 | 0.094 | 0.79 | 0.75 | 0.92 | 0.092 |
embed-multilingual-v2.0 (Cohere) | 0.84 | 0.80 | 0.96 | 0.096 | 0.79 | 0.74 | 0.94 | 0.094 |
embed-multilingual-v3.0 (Cohere) | 0.90 | 0.86 | 1.00 | 0.10 | 0.86 | 0.81 | 0.98 | 0.098 |
sentence-bert-base | — | — | — | — | 0.52 | 0.45 | 0.72 | 0.072 |
BERTino | — | — | — | — | 0.57 | 0.50 | 0.77 | 0.077 |
BERTino v2 | — | — | — | — | 0.64 | 0.58 | 0.85 | 0.085 |
k | Recall@k |
---|---|
1 | 0.335 |
5 | 0.535 |
10 | 0.611 |
20 | 0.680 |
50 | 0.767 |
100 | 0.827 |
Model | SQuAD-en | SQuAD-it | CovidQA | NaQA-B | NaQA-M | NaQA-B-tran | NaQA-M-tran |
---|---|---|---|---|---|---|---|
GPT-4o | 0.26 0.25 | 0.21 0.18 | 0.21 0.13 | 0.14 0.12 | 0.16 0.12 | 0.13 0.13 | 0.13 0.13 |
Llama 3.1 8b | 0.72 0.69 | 0.57 0.54 | 0.22 0.15 | 0.12 0.11 | 0.13 0.11 | 0.09 0.09 | 0.09 0.09 |
Mistral-Nemo | 0.43 0.41 | 0.27 0.25 | 0.27 0.17 | 0.23 0.21 | 0.30 0.25 | 0.10 0.06 | 0.05 0.04 |
Gemma2b | 0.40 0.39 | _ | 0.24 0.16 | 0.15 0.11 | 0.17 0.13 | _ | _ |
Model | SQuAD-en | SQuAD-it | CovidQA | NaQA-B | NaQA-M | NaQA-B-tran | NaQA-M-tran |
---|---|---|---|---|---|---|---|
GPT-4o | 0.85 0.93 | 0.81 0.92 | 0.85 0.61 | 0.85 0.50 | 0.85 0.46 | 0.83 0.47 | 0.83 0.45 |
Llama 3.1 8b | 0.92 0.90 | 0.90 0.79 | 0.85 0.61 | 0.85 0.45 | 0.85 0.47 | 0.81 0.44 | 0.82 0.43 |
Mistral-Nemo | 0.88 0.94 | 0.83 0.82 | 0.86 0.62 | 0.87 0.60 | 0.88 0.51 | 0.83 0.25 | 0.82 0.18 |
Gemma2b | 0.88 0.77 | _ | 0.85 0.43 | 0.85 0.38 | 0.86 0.32 | _ | _ |
Model | SQuAD-en | SQuAD-it | CovidQA | NaQA-B | NaQA-M | NaQA-B-tran | NaQA-M-tran |
---|---|---|---|---|---|---|---|
GPT-4o | 1.0 0.90 0.79 | 0.99 0.80 0.81 | 0.89 0.82 0.61 | 0.89 0.58 0.58 | 0.91 0.59 0.39 | 0.96 0.55 0.45 | 0.94 0.49 0.31 |
Llama 3.1 8b | 1.0 0.89 0.67 | 0.99 0.80 0.71 | 0.86 0.82 0.62 | 0.95 0.58 0.53 | 0.95 0.59 0.33 | 0.93 0.56 0.40 | 0.91 0.50 0.33 |
Mistral-Nemo | 1.0 0.89 0.78 | 0.98 0.81 0.78 | 0.91 0.82 0.64 | 1.0 0.59 0.52 | 0.96 0.59 0.37 | 0.99 0.55 0.47 | 0.94 0.49 0.30 |
Gemma2b | 0.98 0.90 0.67 | _ | 0.77 0.82 0.51 | 0.91 0.57 0.56 | 0.87 0.59 0.31 | _ | _ |
NarrativeQA Books | NarrativeQA Movies | |||||
---|---|---|---|---|---|---|
Metrics | Human Judgement | BEM | AR TruLens | Human Judgement | BEM | AR TruLens |
Human Judgement | 1.000 | 0.735 | 0.436 | 1.000 | 0.704 | 0.565 |
BEM | 0.735 | 1.000 | 0.185 | 0.704 | 1.000 | 0.522 |
AR TruLens gpt-3.5-turbo | 0.436 | 0.185 | 1.000 | 0.565 | 0.522 | 1.000 |