TSQA: Integrating Text Summarization and Question Answering to Improve Information Retrieval from Documents Using Retrieval-Augmented Generation

Jaddoa, Ahmed Sami; Karimpour, Jaber; Salehpour, Pedram

doi:10.3390/info17040372


by Ahmed Sami Jaddoa ¹,*, Jaber Karimpour ¹,* and Pedram Salehpour ²

¹ Department of Computer Science, Faculty of Mathematics, Statistics, and Computer Science, University of Tabriz, Tabriz 5166616471, Iran

² Department of Computer Engineering, Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz 5166616471, Iran

* Authors to whom correspondence should be addressed.

Information 2026, 17(4), 372; https://doi.org/10.3390/info17040372

Submission received: 10 February 2026 / Revised: 1 April 2026 / Accepted: 3 April 2026 / Published: 15 April 2026

(This article belongs to the Section Artificial Intelligence)


Abstract

Here, we propose a composite system that uses text summarization (TS) and question answering (QA) to supplement the IR process of long documents. Most previous studies have used separate approaches, i.e., either TS or QA. The aim of this paper is to develop an interaction between TS and QA in three stages to enhance IR performance. First, SBERT is used for summarization. Second, an RAG method is employed to retrieve information and generate answers. In the architecture of RAG, retrieval of the document is fulfilled via all-MiniLM-L6-v2, while answer generation is performed via the T5 and BART-large-cnn models. Third, the retrieved answers are assessed and compared with a baseline system in which the documents are treated without summarization. The proposed system aims to improve the quality of retrieved information and accuracy of answers generated by TSQA in a unified pipeline. Experimental evaluation conducted on the NIPS dataset demonstrates that the proposed approach significantly enhances summary informativeness and answer accuracy compared with traditional single-task approaches. The simulation results show improvements of 20.83% in text similarity and 2.38% in BERT scores for answer generation compared with the standard RAG baseline without summarization.

Keywords:

text summarization with question answering (TSQA); Sentence BERT (SBERT); retrieval-augmented generation (RAG); text summarization (TS); information retrieval (IR); question answering (QA); T5 model; BART-large-cnn model; all-MiniLM-L6-v2 model; abbreviation

1. Introduction

Information retrieval (IR) plays a significant role in daily life because it underpins a variety of useful functions, such as internet browsing, personal assistants, question-answering systems, digital libraries, and chatbots. Its primary objective is to identify and retrieve information relevant to the user’s request. Since multiple records may be relevant, results are often ranked by their similarity to the user’s query. In the early days of the field, conventional text retrieval systems relied predominantly on matching terms between documents and queries. However, these term-based retrieval systems struggle with synonymy, polysemy, and lexical gaps, which can limit their effectiveness [1,2].

Large language model (LLM) integration has fundamentally redefined IR systems, for example, by introducing LLM-generated data as a new source of IR data, shifting from passive retrieval to proactive generation as a core paradigm, and adopting LLMs as evaluators of results for IR systems [3].

Due to the massive and continuously growing size of textual corpora existing on the internet, a significant amount of information could be lost or go unnoticed. Simultaneously, for human experts, the task of summarizing these resources is highly time-consuming and tedious, which necessitates task automation. Natural language processing (NLP) represents a multi-disciplinary field of research, merging approaches and aspects from the fields of computer science, linguistics, and artificial intelligence; it deals with developing processes that efficiently and semantically analyze massive amounts of textual data. Text summarization (TS) represents one of the fundamental NLP subtasks that have been defined as the process of automatically creating a fluent and concise summary, which captures the main topics and ideas of one or several documents [4].

Summarization significantly improves IR in question-answering (QA) systems by enabling the system to process large volumes of information faster, enhancing the relevance and conciseness of search results, and enhancing the factual accuracy of the final answers. Converting large volumes of text into understandable and concise summaries while keeping the essential information and underlying meaning is the primary objective of text summarization. This process can support efficient information retrieval (IR) and extraction of knowledge, as well as content comprehension [5].

The automated process of identifying and selecting the most important points in a document or article to produce a condensed version is known as text summarization. The increase in data overload has led to a growing interest in automatic text summarization. The process of manual summarization of large documents is a challenging task due to its time-consuming and labor-intensive nature. Text summarization, a sub-field of natural language processing (NLP), focuses on reducing long texts into brief yet informative summaries. In essence, it involves condensing documents or web content while retaining their most important information [6].

There are two main methods in text summarization: abstractive and extractive. In the latter, phrases or sentences are selected and compiled directly from the source text to form a summary. Although this method focuses on identifying the most crucial segments of a text, the final output may not be coherent or fluent. Conversely, abstractive summarization generates new sentences that express the main ideas of a text and frequently produces summaries that are more fluent and human-like. Text summarization remains a challenging NLP task: by producing shorter versions without sacrificing meaning, it seeks to improve reading efficiency and facilitate searching for information across several publications [7,8].

One of the most important aspects of natural language understanding is QA, where techniques typically employ a two-stage pipeline consisting of a retriever that selects passages relevant to a given question and a reader that generates answers from the chosen passages. These two components are often trained separately using the ground-truth context passages that are relevant to the QA pairs. However, explicitly annotated context–question–answer triplets are difficult to find in many real-world settings [9].

The goal of QA systems is to search large document collections and provide answers to natural language questions. IR and information extraction (IE) components are typically used together to perform this job. There are two types of QA systems, i.e., traditional and modern, based on the complexity of the input. The input in traditional QA systems consists of a single passage or document along with a question, and the objective of the system is to extract the answer from the given document. In contrast, modern QA systems take a question and a collection of documents as the input. Consequently, a typical modern QA system consists of two main stages: the retrieval stage, which identifies relevant documents, and the reading or comprehension stage, which extracts answers from the retrieved documents [10].

Summarization and QA are both areas of data mining; more specifically, they are part of the field of text mining which is a specialized branch within data mining. Text mining represents an interdisciplinary field that involves extracting significant and intriguing patterns for knowledge exploration from text data sources. Text mining relies on IR, data mining (DM), statistics, machine learning (ML), and computational linguistics [11].

The core scientific contribution of this research lies in presenting an integrated framework that combines summarization with question-answering systems, offering a systematic solution to the challenges associated with information density in documents. The scientific significance of this integration lies in its ability to transform the engagement with academic content from time-consuming linear reading to intelligent, context-aware information retrieval, where summarization serves as a tool to focus attention on central ideas, while the question-answering system provides a precise mechanism for interrogating texts and extracting specific methodological details. By integrating these two approaches, the research contributes to reducing cognitive and time costs and enhances the accuracy of knowledge extraction by reducing information noise and directing search engines toward the contexts most relevant to the questions. This is primarily achieved via the following.

Novel dual-model framework (TSQA): A dual-model framework that synergistically integrates text summarization and question answering to improve information retrieval and comprehension. Unlike previous methods that treat these two tasks separately, the proposed framework establishes a bidirectional interaction where summarization improves the contextual relevance of the answer, while the answer model verifies and enriches the summaries. This mechanism reduces redundancy and produces more accurate, consistent, and contextually relevant outputs. Furthermore, limited studies have explored this synergistic integration with the aim of simultaneously improving answer quality and enhancing the summary.
The SBERT transformer model is employed for text summarization because it generates high-quality sentence-level semantic representations. The practical rationale for choosing SBERT lies in its ability to generate dense embeddings for entire sentences, overcoming a fundamental limitation of the traditional BERT model, which was primarily designed to process words rather than sentences. Thanks to its architecture based on Siamese networks, SBERT maps each sentence to a point in a vector space that reflects its contextual meaning, enabling semantic similarity to be calculated with high computational efficiency. This shift from word- to concept-level processing enables the summarization system to identify the most essential sentences and eliminate redundancy with high precision.
The scientific advantage of combining summarization with a question-answering system lies in reducing the time required to access information compared with using a question-answering system on its own. This is because the standalone question-answering system takes a long time to scan documents in their entirety to find the answer, whereas summarization acts as a time filter that screens texts and narrows the search scope to only the essential parts. This integration speeds up the retrieval process.
Abbreviation expansion: Maintaining expanded forms of acronyms is an important step in information retrieval for a key reason: a user might search using a full term while the document contains the acronym, and vice versa. Therefore, we have created a glossary containing all acronyms in the database to ensure that the expanded forms appear in the text during retrieval, whether the summary is included or not.
Window size: Determining the window size for chunking is a critical step. The scientific benefit of adjusting the window size lies in balancing contextual accuracy and response speed; intelligent chunking of text ensures that the question-answering system is provided with a focused and coherent context, preventing the model from becoming overwhelmed by long texts.

2. Related Work

This section critically reviews the current literature on text summarization and question answering, providing the background against which the proposed system is positioned.

2.1. Text Summarization

Zuhair Hussein Ali et al. [12] proposed a new approach for multi-document summarization based on the harmony search algorithm, in which diversity, readability, and coverage are maximized. ROUGE was used to evaluate the effectiveness of the proposed model on the Text Analysis Conference (TAC-2011) benchmark dataset.

In another study, Hew Zi Jian et al. [13] sought to determine the most suitable text summarization technique and the best-performing ML model for news article summarization. The selected classifiers were compared on the CNN/Daily Mail dataset.

Lochan Basyal and Mihir Sanghvi [14] explored text summarization using different large language models (LLMs), namely OpenAI’s ChatGPT text-davinci-003, Falcon-7B-Instruct, and MPT-7B-Instruct. A variety of hyper-parameters were tried during experimentation, and the generated summaries were evaluated with widely used metrics, including ROUGE, Bidirectional Encoder Representations from Transformers (BERT) score, and Bilingual Evaluation Understudy (BLEU) score. The experimental results indicated that text-davinci-003 outperformed the other models. Two datasets were used in their study: CNN/Daily Mail and XSum. The primary objective of the study was to provide detailed insights into the performance of LLMs across different datasets.

Bharathi Mohan G et al. [15] evaluated the performance of several LLMs fine-tuned on the CNN/Daily Mail dataset of news articles, including Llama-2-7b, BART, T5, and Pegasus. Prompt-based fine-tuning with QLoRA was applied consistently to evaluate how well these models produce summaries that convey the essence of news articles.

Kietikul Jearanaitanakij et al. [16] aimed to create a Thai news summarization system that extracts key content and generates abstractive summaries highlighting the main ideas. Their model consisted of two components: first, the TextRank method was employed to extract a contiguous region containing significant sentences; second, an optimized mBART model was employed as an LLM to generate an abstractive summary from the extracted content. In other words, the proposed approach first identifies an important news region and then passes it to mBART for summarization. This method produced news summaries that captured essential information while maintaining a syntactic style similar to that of natural Thai.

In their study, Mohamed Bayan Kmainasi et al. [17] concentrated on evaluating news and social media content in a multilingual context. They focused on improving a specific LLM called Llama Lens. This can be regarded as the first attempt to address multilingualism as well as domain specificity, with an emphasis on social media and news.

2.2. Question Answering

Kurnia Muludi et al. [18] addressed the processing of documents in NLP problems, employing the RAG method for the enhancement of QA systems.

In their study, Ibnu Pujiono et al. [19] employed a vector database along with retrieval-augmented generation to assess the performance of the developed chatbot. The effectiveness of current LLMs was compared in this study to address regulatory issues related to public service agencies. Taking cosine similarity scores into consideration, the LLM evaluated and responded to questions using a vector database.

Binita Saha et al. [20] presented a novel architecture for retrieval-augmented generation (RAG) systems to strengthen question-answering tasks using a target corpus. LLMs have transformed human-like text generation and analysis. However, unless they are combined with live data tools, these models rely on pre-trained data and do not provide real-time updates. By incorporating databases and internet resources, RAG improves LLMs by producing contextually relevant responses.

Antonio Moreno Cediel et al. [21] primarily aimed to identify and develop improved models for Spanish answer extraction (AE) and question generation (QG) tasks using the answer-aware technique. To optimize three multilingual models (mT5-base, mT0-base, and BLOOMZ-560M), three different datasets were employed: the Spanish version of the SQuAD dataset; SQAC, a Spanish dataset; and their union (SQuAD + SQAC), which produced somewhat better results in their study.

Wenjun Meng et al. [22] focused on human health risks, developing a question-answering (QA) system based on retrieval-augmented generation (RAG). Using a newly introduced QA pair creation mechanism, 300 high-quality question–answer pairs across six sub-fields were produced. The QA system was developed with both naive and advanced RAG methods.

Xinyue Huang et al. [23] addressed issues with multi-hop reasoning and contextual understanding across long documents by introducing a new RAG framework designed for difficult QA tasks. The framework, based on LLaMA 3, combines advanced context fusion and multi-hop reasoning techniques with dense retrieval modules to provide responses that are more accurate and cohesive. The adaptability and robustness of the model are enhanced by joint optimization that combines generation cross-entropy and retrieval likelihood.

Jhon Rayo et al. [24] introduced a hybrid information retrieval (IR) system aiming to retrieve relevant information from extensive regulatory corpora. The system combines semantic and lexical search approaches: to obtain both lexical coverage and semantic precision, it integrates the conventional BM25 method with a fine-tuned sentence transformer model. LLMs were used within a RAG framework to synthesize the retrieved passages and produce comprehensive and accurate responses.

SeongKu Kang et al. [25] presented the concept of a Coverage-based Query Set Generation (CCQGen) framework for the purpose of presenting a set of queries that fully covers the concepts in a document. Among the crucial differentiators of CCQGen is its ability to adaptively modify the generation process in response to previously created queries. The study identified terms which were not accurately discussed in the earlier queries and used them as preconditions to generate new queries.

Furthermore, Thi Thu Uyen Hoang et al. [26] presented a RAG framework to improve information extraction from PDF files in question-answering (QA) systems. Existing QA systems are largely prepared for textual material and face certain challenges in recognizing the diversity and richness of the data found in PDFs, such as images, text, graphs, vector diagrams, and tables. They suggested a comprehensive RAG-based QA system capable of handling complex multimodal questions that combine multiple data types.

Finally, Chengke Wu et al. [27] introduced RAG for Construction Management (RAG4CM), a new paradigm consisting of three components: (1) a pipeline that created a knowledge pool by parsing project documents into hierarchical structures; (2) novel RAG search algorithms; and (3) a method for learning user preferences. The first two components integrated document-level hierarchical features with raw content to improve RAG results and granularity alignment. Preference learning enhances user–system interactions and leads to consistently better responses. A prototype system has been developed, and a series of experiments were carried out.

3. Materials and Methods

3.1. System Overview

Our study proposes a structured framework that combines information retrieval (IR) with retrieval-augmented generation (RAG) and summarization methods for question answering (QA). This methodology is intended to improve efficiency as well as the contextual accuracy of information extraction and response synthesis. The framework is structured into three major steps: first, extractive summarization; second, document selection with respect to a user query, facilitated by IR; and third, synthesis of the final response using the RAG approach.

3.1.1. Summarization Phase

The first stage applies an extractive summarization model to a set of documents to produce summaries that are information-rich and concise. The essential purpose of this stage is to minimize irrelevant or redundant material while preserving the core meaning of each document. The result is a set of condensed texts that contain the most essential ideas, forming a focused and effective knowledge base. By summarizing the documents prior to retrieval, the system reduces computational load and enhances the relevance of the subsequent retrieval stage.

3.1.2. Question-Guided Information Retrieval Phase

Once summarization is complete, the system proceeds to the question-driven retrieval stage. The question (prompt) is created either by a user or by the system itself. After the query is received, the IR module does not search the full documents; instead, it searches the generated summaries to retrieve the most contextually relevant information. The system retrieves and ranks the most relevant passages or summaries using a vector-based semantic similarity method. These retrieved fragments serve as the contextual evidence on which answer generation is based.

3.1.3. Answer Generation Using RAG

In the final stage, a RAG model is employed to produce the final answer. The RAG mechanism combines a retriever and a generator in a single pipeline:

The retriever dynamically obtains the top-ranked summaries from the IR stage as external knowledge sources.
The generator is a transformer-based language model that utilizes the question and the retrieved summaries as contextual input.
The model then gives a coherent, context-sensitive and fact-based answer by synthesizing information in the retrieved content.

After integration, the answers generated are not just human-like and fluent but can also be checked against verifiable content. The suggested methodology could fill the gap between traditional retrieval schemes and more sophisticated generative question-answering models through the combination of accuracy of information retrieval, focus of summarization, and contextual fluency of retrieval-augmented generation. The general workflow of the proposed TSQA approach is shown in Figure 1. It further shows the stepped sequential interaction of the summarization process, the question-directed information retrieval, and the answer-generation module based on RAG.

3.2. Dataset

NIPS: Neural Information Processing Systems (NIPS) is one of the world’s leading ML conferences. It covers topics ranging from computer vision and deep learning (DL) to reinforcement learning and cognitive science. The dataset includes the authors, titles, abstracts, and extracted text of all NIPS papers published from the inaugural 1987 conference to the 2016 conference. The paper texts were extracted from the original PDF files and released in CSV format for this study.

3.3. Preprocessing

In this step, the document or group of related documents is prepared for the summarization system. The input is converted into a list of words or phrases extracted from the documents. This pre-processing phase applies NLP procedures to clean the input text: segmenting it into sentences, tokenizing each sentence, removing stop words and noise, and lemmatizing the remaining tokens.

Tokenization:

Tokenization breaks the text into smaller units, referred to as tokens, based on delimiters such as spaces, parentheses, sentence boundaries, annotations, and other separators. A token is thus one of the units (typically a word) that results from dividing a sentence. An example set of tokens is displayed in Table 1.

Stop Words Removal:

Stop words are common words that frequently occur in language but carry minimal semantic meaning. The aim of this process is to improve the effectiveness of certain actions, such as phrase-based searches, by removing low-information words. This practice streamlines the text and emphasizes more substantive and meaningful terms.
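
The sentence segmentation, tokenization, and stop-word removal steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the splitting rule, the token pattern, and the very short `STOP_WORDS` list are assumptions chosen only to keep the example self-contained, and a real pipeline would normally rely on an NLP toolkit.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and", "for"}

def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (a naive rule)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lowercase and keep runs of letters/digits, allowing inner hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def preprocess(text):
    """Return one cleaned token list per sentence."""
    return [remove_stop_words(tokenize(s)) for s in split_sentences(text)]

result = preprocess("The retriever selects passages. The reader generates answers!")
```

Each inner list corresponds to one sentence, so later stages can still work at the sentence level.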

Lemmatization:

Lemmatization is comparable to stemming, but it better preserves the meaning of the word. A lemma is the base (dictionary) form to which a word is reduced while maintaining its meaning and semantics.

Abbreviation Expansion:

Abbreviation expansion is an important aspect of NLP tasks, particularly in text summarization and question answering (QA). Information loss or ambiguity may occur if the abbreviations are not accurately expanded into their full forms. Unresolved abbreviations within summarization can result in less informative and unclear summaries, particularly in domain-specific texts such as scientific articles. Maintaining expanded forms of acronyms is an important step in information retrieval for a key reason: a user might search using the full term while the document contains the acronym, and vice versa. Therefore, we have created a glossary containing all acronyms in the database to ensure that the expanded forms appear in the text during retrieval, whether the summary is included or not. Algorithm 1 shows abbreviation expansion based on a frequency threshold.

Algorithm 1: Context-Aware Abbreviation Expansion and Global Aggregation Framework

Objective: To automatically construct a high-quality acronym-definition dictionary from a large corpus by leveraging local context windows, heuristic validation, and global frequency filtering.
Input: D: A dataset containing N documents (papers)
τ: Frequency threshold for noise filtering (default τ = 5)
Output: R: The final ranked registry of valid abbreviation–definition pairs.

Begin
Step 1: Global Initialization:
    Initialize a global frequency map G ← ∅ storing counts of pairs, {(A, Def) → ℕ}
Step 2: Corpus Processing Loop:
    For each document d_i in D:
        1. Preprocessing:
            T ← Normalize whitespace in d_i
        2. Pattern Recognition:
            Identify set of matches S using Regex: (?<=\s)\(([A-Z0-9\-\.]{1,10}s?)\)
        3. Local Extraction Strategy:
            For each match m ∈ S containing acronym A at index idx:
                Context Windowing:
                    Extract preceding text T_prev = T[max(0, idx - 150) : idx]
                    Tokenize T_prev into a word list W_prev
                Dynamic Candidate Search:
                    Define search window size k ranging from |A| to min(|W_prev|, |A| + 5)
                    Initialize BestDef ← Null
                    For each k (iterating backwards/forwards):
                        Construct candidate string C from the last k tokens of W_prev
                        Validation Check (Heuristic):
                            Start Char: Does the first word of C start with the first letter of A?
                            Complexity: Is length(C) ≤ (|A| × 3 + 5)?
                        If validation passes:
                            BestDef ← C
                            (Optional: Update to favor longer valid matches if found)
                Local Update:
                    If BestDef ≠ Null:
                        Increment G[(A, BestDef)] ← G[(A, BestDef)] + 1
Step 3: Noise Reduction and Filtering:
    Initialize final list R ← ∅
    For each unique pair p = (A_cr, D_ef) in G:
        If G[p] > τ:
            Add tuple (A_cr, D_ef, G[p]) to R
Step 4: Finalization:
    Sort R by frequency in descending order
    Return R
End
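
As a rough illustration, Algorithm 1 can be sketched in a few lines of Python using only the standard library. Several details are assumptions rather than facts from the paper: `length(C)` is interpreted as a word count, the first-letter check is case-insensitive, the shortest valid candidate is kept, and the names `find_definition` and `build_registry` are hypothetical.

```python
import re
from collections import Counter

# Parenthesized acronym preceded by whitespace, e.g. " (SVM)" or " (RNNs)".
ACRONYM_RE = re.compile(r"(?<=\s)\(([A-Z0-9\-.]{1,10}s?)\)")

def find_definition(acronym, words_before):
    """Return the shortest trailing word span that passes the heuristic checks."""
    max_k = min(len(words_before), len(acronym) + 5)
    for k in range(len(acronym), max_k + 1):
        candidate = words_before[-k:]
        # Start Char check (case-insensitive, an assumption).
        starts_ok = candidate[0][0].lower() == acronym[0].lower()
        # Complexity check, with length(C) read as a word count (an assumption).
        short_enough = len(candidate) <= len(acronym) * 3 + 5
        if starts_ok and short_enough:
            return " ".join(candidate)
    return None

def build_registry(documents, tau=5):
    """Count (acronym, definition) pairs and keep those seen more than tau times."""
    counts = Counter()
    for doc in documents:
        text = " ".join(doc.split())  # normalize whitespace
        for match in ACRONYM_RE.finditer(text):
            acronym = match.group(1)
            # Words in the 150 characters preceding the opening parenthesis.
            window = text[max(0, match.start() - 150):match.start()]
            definition = find_definition(acronym, window.split())
            if definition is not None:
                counts[(acronym, definition)] += 1
    kept = [(a, d, n) for (a, d), n in counts.items() if n > tau]
    return sorted(kept, key=lambda row: row[2], reverse=True)

docs = ["We train a support vector machine (SVM) on images."] * 3
registry = build_registry(docs, tau=2)
```

With the default threshold (τ = 5), the same three documents yield an empty registry, which shows how the frequency filter suppresses rarely seen pairs.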

3.4. Text Summarization

Extractive summarization is one of the most widely used methods in text analysis and natural language processing (NLP). To produce an informative and concise summary, the important sentences or phrases in the source text must be selectively extracted. This requires careful textual analysis to find the most crucial facts, points, or arguments presented in the source.

SBERT Transformer:

SBERT (Sentence-BERT) can be successfully applied in text summarization, especially extractive summarization. This approach is aimed at determining and extracting the most significant sentences of a document to create a concise summary. The following steps are used in summarization with the help of the SBERT algorithm:

Sentence embedding generation: The input document is segmented into individual sentences. Each sentence is then input into a pre-trained SBERT model, which generates a high-dimensional embedding vector that captures the sentence's semantic meaning. These embeddings make it efficient to measure the similarity between sentences.
Sentence similarity calculation: After obtaining the sentence embeddings, the similarity of all pairs of sentences can be computed. This is usually performed with the cosine similarity measure, i.e., the cosine of the angle between two vectors. The larger the cosine similarity, the greater the semantic similarity between the sentences.
Sentence ranking: Sentences are then ranked according to their importance. This can be performed with procedures such as TextRank, in which a graph is constructed with sentences as nodes and edges weighted by their similarity. These procedures benefit from SBERT embeddings, which provide semantically rich similarity scores.
Summary generation: The extractive summary is formed from the top-ranked sentences, or from the most representative sentences in each cluster, depending on the results of the ranking or clustering. The desired summary length or a particular threshold can be used to decide how many sentences to include. Algorithm 2 shows the steps of SBERT for extractive summarization.

Algorithm 2: SBERT_Text_Summarization

Input: Document D, number of summary sentences K
Output: Summary S

Begin
Step 1: Tokenization
    sentences = SplitIntoSentences(D)
Step 2: Embedding Generation
    embeddings = []
    for s in sentences:
        embeddings.append(SBERT(s))
Step 3: Similarity Calculation
    similarity_matrix = zeros(len(sentences), len(sentences))
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            similarity_matrix[i][j] = cosine_similarity(embeddings[i], embeddings[j])
Step 4: Sentence Ranking (TextRank)
    scores = TextRank(similarity_matrix)
Step 5: Summary Generation
    top_sentences = SelectTopK(sentences, scores, K)
    summary = SortByOriginalOrder(top_sentences)
    return summary
End
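
Steps 3 to 5 of Algorithm 2 can be sketched without any external dependency. In the sketch below, the embeddings are hand-written two-dimensional toy vectors standing in for SBERT outputs, and the damping factor and iteration count are conventional PageRank-style defaults; all of these are assumptions, not values reported in the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def textrank(sim, damping=0.85, iterations=50):
    """Score nodes by power iteration on a weighted similarity graph."""
    n = len(sim)
    # Drop self-similarity so a sentence cannot vote for itself.
    w = [[sim[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * w[j][i] / out_sum[j]
                            for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores

def summarize(sentences, embeddings, k):
    """Return the k highest-scoring sentences in their original order."""
    # Negative similarities are clipped to zero to keep edge weights valid.
    sim = [[max(cosine(u, v), 0.0) for v in embeddings] for u in embeddings]
    scores = textrank(sim)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

# Toy 2-D vectors standing in for SBERT embeddings (illustrative values only).
sentences = ["RAG retrieves passages.", "RAG then generates answers.",
             "Retrieval grounds the generated answers.", "The weather was sunny."]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
summary = summarize(sentences, embeddings, k=2)
```

The off-topic sentence is only weakly connected to the rest of the graph, so it receives the lowest score and is left out of the summary.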

3.5. Vectorization

The process of converting sentences, paragraphs, or the whole document into fixed-size, dense vector representations (embeddings) which capture their semantic meaning is called vectorization and is achieved using Sentence-BERT. Unlike traditional methods, such as TF-IDF, or standard BERT models, SBERT is optimized for efficient and meaningful semantic similarity comparisons. SBERT uses a Siamese network composed of a pair of identical BERT models with the same architecture and weights. This network is trained on sentence pairs that are labeled for semantic similarity [28].

A Siamese neural network is an artificial neural network (ANN) that applies the same weights to two distinct input vectors to produce comparable output vectors, with the aim of learning a similarity metric. Each sentence is fed into its own copy of the same BERT model, which outputs a contextualized embedding for every token in the input text. A pooling technique (such as max-pooling, averaging, or using the CLS token) then merges these token-level outputs into a single fixed-size vector, known as the sentence embedding, that captures the meaning of the complete sentence [28].
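
The pooling step can be made concrete with a small sketch. It assumes that the token embeddings are already available as plain lists of floats and that an attention mask marks padding positions with 0; the helper names `mean_pool` and `max_pool` are hypothetical, and in practice this step is handled inside the SBERT model itself.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, ignoring padded positions (mask value 0)."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        return [0.0] * len(token_embeddings[0])
    return [sum(column) / len(kept) for column in zip(*kept)]

def max_pool(token_embeddings, attention_mask):
    """Take the per-dimension maximum over the non-padded token vectors."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        return [0.0] * len(token_embeddings[0])
    return [max(column) for column in zip(*kept)]

# Three token vectors; the last one is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_mean = mean_pool(tokens, mask)
sentence_max = max_pool(tokens, mask)
```

Whatever the number of tokens, both functions return a vector with the same dimensionality as a single token embedding, which is what makes sentence embeddings directly comparable.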

3.6. Question Answering (QA)

In NLP, QA involves creating systems that can automatically answer questions in natural language. These systems can locate relevant information in vast datasets, comprehend the context regarding a question, and offer a concise answer.

3.6.1. Information Retrieval

The retriever is essential to the efficiency of a QA system, since it is the component that searches a document repository for the sections most likely to contain the information needed to answer a given question. IR is the process of locating and retrieving information from system resources to satisfy an information need. In RAG, the retriever aims to return the most relevant knowledge from the external knowledge sources. This is done by calculating the distance between the query and the documents in those sources.

3.6.2. Retrieval-Augmented Generation (RAG)

RAG combines retrieval and generation models. It generates text in response to prompts using a large language model, while integrating information from a separate retrieval system to reinforce contextual relevance and output quality. The retrieval model obtains factual content from a knowledge base, and the generative model uses this content as additional context to produce more accurate results [29].

RAG is a cutting-edge AI paradigm combining generative models and IR to improve response accuracy and reliability. LLMs generate fluent text but are solely dependent on pre-trained information, while conventional IR systems can effectively find relevant documents but cannot produce new content. By initially retrieving relevant documents from external databases and subsequently using them to guide text generation, RAG addresses this gap in a two-step process [30].

Window size:

In question-answering systems that rely on information retrieval, retrieved documents are often too long to be passed in full to a QA model because of its input length limit (token limit). Therefore, the documents are divided into smaller segments called windows. Window size is the number of words or tokens within each text segment, and choosing the window size for chunking is a critical step. Adjusting the window size balances contextual accuracy against response speed. Algorithm 3 shows the steps of the window size procedure (text chunking).

Algorithm 3: Window Size (Adaptive Semantic Text Chunking)

Objective: To segment long academic texts into semantically coherent units by preserving paragraph boundaries where possible and strictly respecting token limits.
Input: T: Raw Document Text.
L_min: Minimum chunk size (50 words).
L_max: Maximum chunk size (500 words).
Output: C_list: Ordered list of text chunks.

Begin
Initialize C_list = []
Step 1: Semantic Separation
Split T into paragraphs P based on double newlines (\n\n)
For each paragraph p in P Do:
    p = Strip whitespace from p
    If p is empty Then:
        Continue
    End If
    Step 2: Calculate Word Count
    WordCount = Count words in p
    Case (A): Paragraph fits within limits
    If (WordCount >= L_min) And (WordCount <= L_max) Then:
        Append p to C_list
    Case (B): Paragraph is too long (Fragmentation required)
    Else If WordCount > L_max Then:
        Sentences = Split p into sentences (using NLTK)
        Buffer = ""
        For each sent in Sentences Do:
            CombinedLength = Count words in (Buffer + " " + sent)
            If CombinedLength <= L_max Then: # Adding the sentence stays within the limit
                Buffer = Buffer + " " + sent
            Else: # Limit exceeded: flush the buffer
                If Buffer is not empty Then:
                    Append Buffer to C_list
                End If
                Buffer = sent # Start new chunk with current sentence
            End If
        End For
        If Buffer is not empty Then: # Flush the last partial chunk
            Append Buffer to C_list
        End If
    Case (C): Paragraph is too short (Optional: Ignore or Merge)
    Else:
        # Current logic ignores very short paragraphs (noise)
        Continue
    End If
End For
Return C_list
End
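A runnable sketch of the chunking procedure in Algorithm 3 is given below. It makes two assumptions that should be read as such: sentences are split with a simple regular expression rather than NLTK, so that the example has no external dependencies, and any sentences still in the buffer when a long paragraph ends are emitted as a final chunk.

```python
import re
from typing import List


def chunk_text(text: str, l_min: int = 50, l_max: int = 500) -> List[str]:
    """Split text into paragraph- or sentence-aligned chunks of l_min..l_max words."""
    chunks: List[str] = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        n_words = len(paragraph.split())
        if l_min <= n_words <= l_max:
            # Case A: the paragraph is already a valid chunk.
            chunks.append(paragraph)
        elif n_words > l_max:
            # Case B: pack whole sentences into chunks of at most l_max words.
            buffer = ""
            for sent in re.split(r"(?<=[.!?])\s+", paragraph):
                candidate = f"{buffer} {sent}".strip()
                if len(candidate.split()) <= l_max:
                    buffer = candidate
                else:
                    if buffer:
                        chunks.append(buffer)
                    buffer = sent  # start a new chunk with the current sentence
            if buffer:
                chunks.append(buffer)  # flush the last partial chunk
        # Case C: paragraphs shorter than l_min are skipped as noise.
    return chunks
```

A single sentence longer than `l_max` words still becomes its own oversized chunk in this sketch; a production version would need a further word-level split for that case.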

Retrieval stage:

all-MiniLM-L6-v2 Model

The all-MiniLM-L6-v2 model is well suited to the retrieval part of the RAG system because it balances speed, efficiency, and quality. It is an embedding model that maps sentences and short paragraphs into a 384-dimensional dense vector space, enabling semantic search and information retrieval tasks. In an RAG pipeline, the all-MiniLM-L6-v2 model is typically used in two main stages, as follows.

Document indexing (ingestion): The model converts a corpus of documents into vector embeddings.

Query encoding and retrieval: The same model generates an embedding for each query a user submits. The query embedding is used to perform a similarity search against the vector database and retrieve the most relevant document chunks. The LLM is then given the retrieved text segments as context to produce a well-informed response.
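The two stages can be sketched as an in-memory index with a cosine-similarity search. This is an illustration under stated assumptions: `toy_embed` is a hypothetical hashed bag-of-words stand-in for the all-MiniLM-L6-v2 encoder, and a plain Python list plays the role of the vector database.

```python
import math
import re
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def toy_embed(text: str, dim: int = 64) -> List[float]:
    """Hypothetical stand-in for the embedding model: hashed word counts."""
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[sum(ord(c) for c in word) % dim] += 1.0
    return vec


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_index(chunks: List[str], embed: Callable[[str], Vector]) -> List[Tuple[str, Vector]]:
    """Stage 1 (ingestion): embed every chunk once and store it with its text."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query: str, index: List[Tuple[str, Vector]],
             embed: Callable[[str], Vector], k: int = 5) -> List[Tuple[float, str]]:
    """Stage 2 (query time): embed the query and return the k most similar chunks."""
    q = embed(query)
    scored = [(cosine(q, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The same `embed` function must be used for indexing and for queries, since similarity scores are only meaningful when both vectors live in the same embedding space.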

Generation stage:

BART-large-cnn Model:

BART-large-cnn is a transformer encoder–decoder (seq2seq) model with a bidirectional (BERT-like) encoder and an auto-regressive (GPT-like) decoder. During pre-training, text is corrupted with a random noising function and the model is trained to reconstruct the original text. BART is optimized for text generation and also performs strongly on comprehension tasks such as QA [31].

The BART-large-cnn model has been adopted as a core generation component within the RAG architecture because of its ability to produce high-quality, context-enhanced text. As a sequence-to-sequence transformer, it integrates its encoder and decoder so that the input text can be understood in depth, reformulated, and semantically completed. These features make it effective for tasks that require contextual reasoning and linking, such as answering questions based on external sources. Within the RAG framework, a semantic retrieval step first fetches the text segments relevant to the query using dense vector representations. The retrieved segments are then combined with the query into a single input text for the BART-large-cnn model, allowing it to build a contextual representation that takes into account both the query and the information in the retrieved documents. This grounds the answer in actual evidence rather than in the implicit knowledge of the model, which is essential for reducing unsupported generation.

T5 Model:

Google AI created the T5 (text-to-text transfer transformer) model, a transformer-based large language model that casts all NLP tasks into a unified text-to-text format. Because both the input and the output of the model are text strings, the same model can be applied to various NLP tasks, such as summarization, translation, and QA. T5 thus approaches every text-processing problem as a text-to-text task, taking the provided text as input and producing new text as output [32].

4. Evaluation

An effective set of evaluation metrics is essential for objectively measuring an IR model’s performance. The measures considered here are cosine similarity, BERT score, Precision@k, Recall@k, F1@k, MAP, MRR, and nDCG@k. Precision@k, Recall@k, and F1@k assess the top k results, with precision focusing on relevance, recall on coverage, and F1 balancing both. MAP averages precision across queries to reward systems that rank relevant items highly, while MRR focuses on the rank of the first relevant result. nDCG@k is a rank-aware metric that gives more weight to highly relevant items at the top of the list, handling graded relevance where some results are more important than others.

BERT score: The BERT score is an automatic evaluation metric for text generation that measures the semantic similarity between a candidate text and a reference text. It uses pre-trained contextual embeddings to capture semantic similarities that n-gram overlap measures may miss.
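The matching idea behind the BERT score can be sketched as follows. The sketch assumes that contextual token embeddings for the candidate and the reference have already been computed by some encoder, and it omits the optional importance weighting and baseline rescaling of the full metric. Each token is greedily matched to its most similar token on the other side.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_bert_score(cand: Sequence[Vector], ref: Sequence[Vector]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from greedy token-level matching."""
    if not cand or not ref:
        raise ValueError("both token sequences must be non-empty")
    # Precision: how well each candidate token is covered by the reference.
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Because matching is done on embeddings rather than surface strings, a paraphrase can score highly even when it shares few exact words with the reference.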

Precision@k: This metric evaluates the relevance of a list of returned items. It is the proportion of the top k returned items that are relevant to the user. Equation (1) provides the Precision@k formula:

\text{Precision@}k = \frac{\text{Number of relevant items in top } k}{k}

(1)

Recall@k: This metric is based on the number of relevant items retrieved out of all potentially relevant ones. For a given query, it is the proportion of all relevant items that appear in the top k. Equation (2) provides the Recall@k formula:

\text{Recall@}k = \frac{\text{Number of relevant items in top } k}{\text{Total number of relevant items in the dataset}}

(2)

F1@k (F1 score at k): The F1 score is the harmonic mean of precision and recall. At a particular cutoff k, it balances Precision@k and Recall@k in a single score. Equation (3) provides the F1@k formula:

\text{F1@}k = \frac{2 \times \text{Precision@}k \times \text{Recall@}k}{\text{Precision@}k + \text{Recall@}k}

(3)
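Equations (1) to (3) translate directly into code. The sketch below assumes that a ranked result list and a set of relevant document identifiers are available; the function and argument names are illustrative.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant (Equation (1))."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items found in the top k (Equation (2))."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def f1_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Harmonic mean of Precision@k and Recall@k (Equation (3))."""
    p = precision_at_k(ranked, relevant, k)
    r = recall_at_k(ranked, relevant, k)
    return 2 * p * r / (p + r) if p + r else 0.0
```

For example, with five returned documents of which two are relevant, and three relevant documents overall, Precision@5 is 2/5, Recall@5 is 2/3, and F1@5 is 0.5.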

Mean Average Precision (MAP): A quality metric called Mean Average Precision (MAP) at K is used to assess how well a recommender or ranking system can return relevant items in the top K results and prioritize more relevant items. Equations (4) and (5) provide the MAP formula:

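\text{MAP} = \frac{1}{N} \sum_{i=1}^{N} \text{AP@}K_{i}

(4)

where:

K is the chosen cutoff point.
N represents the total number of queries (in the case of information retrieval) in the evaluated dataset.
AP@K_{i} is the average precision of the ranked list returned for query i, computed for a single query as follows, where R is the number of relevant items for that query and rel(k) equals 1 if the item at rank k is relevant and 0 otherwise:

\text{AP@}K = \frac{1}{R} \sum_{k=1}^{K} \text{Precision}(k) \times \text{rel}(k)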

(5)
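A small sketch of the MAP computation follows. It assumes the common convention in which the average precision of one query is normalized by that query's number of relevant items; other normalizers are also used in practice, so this choice should be treated as an assumption of the example.

```python
from typing import List, Sequence, Set, Tuple


def average_precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Sum precision at each rank that holds a relevant item, then normalize."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average AP@k over all (ranked list, relevant set) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)
```

Because precision is only accumulated at ranks that contain a relevant item, placing relevant documents earlier in the list always increases the score.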

Mean Reciprocal Rank (MRR): MRR is a metric for evaluating recommendation or information retrieval systems that focuses on the rank of the first correct answer. The reciprocal rank of a query response is the inverse of the rank of the first correct item: if the correct item is at rank k, the reciprocal rank is 1/k. MRR is obtained by averaging these reciprocal ranks across all queries. Equation (6) illustrates the MRR formula:

\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_{i}}

(6)
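Equation (6) can be sketched as below. The example assumes that a query with no relevant item in its result list contributes a reciprocal rank of 0, which is a common convention rather than something stated in the text.

```python
from typing import List, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant item, or 0.0 if none is found."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average the reciprocal ranks over all queries."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

An MRR of 1 therefore means that the first result was relevant for every query in the set.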

Normalized Discounted Cumulative Gain (NDCG@K): NDCG is a ranking quality metric that captures the efficacy of ranking algorithms. Taking every item’s relevance into account, it assesses how well the predicted ranking of the items matches the optimal ranking. It is a rank-aware metric that handles graded relevance: it assigns greater weight to highly relevant items that occur earlier in the list and normalizes the score to the range [0, 1]. Equation (7) provides the NDCG@K formula:

\text{NDCG@}K = \frac{\text{DCG@}K}{\text{IDCG@}K}

(7)
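The following sketch computes NDCG@K from a list of graded relevance values given in ranked order. It assumes the widely used logarithmic discount, in which the gain at rank i is divided by log2(i + 1), and it derives the ideal ranking by sorting the same list of gains in descending order; both are assumptions of the example, since the text does not spell out the DCG formula.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain of the first k items."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal else 0.0
```

A ranking that already lists items in decreasing order of relevance scores 1.0, and any misplacement of a highly relevant item lowers the score.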

5. Results and Discussion

5.1. Model Details

To demonstrate how the RAG model works by inputting a question into the system, an experiment was designed and conducted. It began with the retrieval stage, where the model searched a database of documents or pre-stored texts for the identification of the documents most relevant to the question. With the use of vector embedding in the SBERT model, retrieval was based on determining the semantic similarity score between question and paragraph representations. The best source for the answer was then determined to be the text segment or paragraph with the highest score of similarity. In the next stage, generation, the model used the retrieved paragraphs with the highest similarity scores to generate a coherent and accurate linguistic answer. This was achieved by combining the retrieved information with the model’s generative capabilities to produce an answer supported by textual evidence.

Figure 2 illustrates a case study of SBERT-based summarization combined with RAG retrieval, showing how a retrieval-augmented text generation model works on the NIPS dataset. The model has two phases, which together ensure that the user’s query is answered accurately and with scientific support.

Phase 1: Passage Retrieval

The user’s query is “Using Independent Component Analysis (ICA) for artifact removal in EEG recordings.” Instead of relying solely on the model’s memory, the system retrieves relevant passages on the topic from the NIPS dataset. The passages are ranked by similarity score and the top five are chosen, with scores ranging from 77.15 to 83.12. This provides the system with accurate scientific evidence on ICA’s ability to spatially separate and filter signals without the need for clean reference channels.

Phase 2: Text Generation using a Large Language Model (LLM)

The user’s query and the five chosen passages (K passages) are combined to form a single template, referred to as a prompt. This information is then sent to a large language model, an “intelligent editor,” to review the information and create a coherent paragraph to answer the user’s query. The text generated will illustrate the efficacy of ICA technology in analyzing multi-channel EEG data and identifying anomalies with a high level of precision.
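The combination of the query and the K retrieved passages into one prompt can be sketched as below. The template wording (`question:` and `context:` labels, numbered passages) is a hypothetical choice for illustration; the text does not specify the exact template used.

```python
from typing import List


def build_prompt(question: str, passages: List[str], k: int = 5) -> str:
    """Join the question and the top-k passages into one generator input."""
    context = "\n".join(
        f"[{i}] {p.strip()}" for i, p in enumerate(passages[:k], start=1)
    )
    return f"question: {question.strip()}\ncontext:\n{context}"
```

Numbering the passages keeps the order of the retrieval ranking visible in the prompt, and truncating to `k` passages helps keep the input within the generator's length limit.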

5.2. Baselines

To analyze the effect of the TSQA model, we evaluated how integrating text summarization into the QA process affects information retrieval. The assessment used SBERT for text summarization and the retrieval-augmented generation (RAG) framework, with the all-MiniLM-L6-v2 model in the retrieval step and T5 and BART-large-cnn in the generation step.

Without summarization:

5.2.1. Generation Part

We evaluate text chunk similarity and answer generation for the given questions without summarization, using RAG with the all-MiniLM-L6-v2 model for the retrieval part and the T5 and BART-large-cnn models for the generation part. Table 2 shows the similarity scores between the questions and the relevant papers, together with the BERT scores used to evaluate the generation part.

5.2.2. Retrieval Part

We report the results of information retrieval for question answering without summarization, using RAG with the all-MiniLM-L6-v2 model for the retrieval part and the T5 and BART-large-cnn models for the generation part. Table 3 and Table 4 show the retrieval metrics for the given questions without summarization for the T5 model and the BART-large-cnn model, respectively.

With summarization:

5.2.3. Generation Part

We evaluate text chunk similarity and answer generation for the given questions with summarization, using RAG with the all-MiniLM-L6-v2 model for the retrieval part and the T5 and BART-large-cnn models for the generation part. Table 5 shows the similarity scores between the questions and the relevant papers, together with the BERT scores used to evaluate the generation part.

5.2.4. Retrieval Part

We report the results of information retrieval for question answering with summarization, using RAG with the all-MiniLM-L6-v2 model for the retrieval part and the T5 and BART-large-cnn models for the generation part. Table 6 and Table 7 show the retrieval metrics for the given questions with summarization for the T5 model and the BART-large-cnn model, respectively.

A performance comparison shows that integrating the SBERT-based summarization phase into the RAG system significantly improves retrieval effectiveness and generation quality compared with the baseline system without summarization. The similarity and BERT scores reported in Table 2, along with the standard retrieval metrics (Prec@5, Recall@5, MAP, MRR, and nDCG@5) presented in Table 3 and Table 4, clearly indicate this improvement.

Table 5 shows that when summarization is applied prior to the retrieval process, higher similarity scores are obtained between the questions and the retrieved texts. Moreover, higher BERT scores are observed for most questions, indicating an improved ability to generate coherent and suitable answers. Furthermore, the BART model performs marginally better than the T5 model in answer quality, particularly for questions Q1, Q4, and Q5, which supports its effectiveness as a generation model when it is provided with well-organized, summarized content.

Comparing Table 3, Table 4, Table 6 and Table 7 indicates the following benefits of employing SBERT to summarize documents prior to the retrieval stage. Precision@5 and Recall@5, as well as MAP and nDCG@5, improved significantly compared with the non-summarized version. An MRR of 1 for most of the questions (Q1–Q4) implies that the system returned the most relevant document in the first position for those questions. It is noteworthy that the RAG system with SBERT and BART achieved the best overall performance (Table 4), recording the highest nDCG@5 score of 0.78 in Q1 and Precision@5 scores of 0.6–0.8 across Q1 to Q5. This demonstrates that summarizing content improves the accuracy of the retrieved document ranking and increases the model’s confidence in selecting the most relevant results.

Figure 3 compares the latency from document retrieval to final response generation, a key benchmark of the proposed system’s operational efficiency. The measured processing time (in seconds) varies considerably with the context management strategy. In the non-summarized approach, where full-text documents are passed directly to the generator, latency increases markedly, primarily because of the computational overhead of processing long, high-token-count inputs. Conversely, adding summarization as an intermediate stage reduces the total execution time. Although the summarization step introduces a small time cost of its own, it shortens the final generation phase by removing redundant information and condensing the context. This modular approach therefore offers a favorable trade-off between computational throughput and the precision of the generated output.

The processing times of the tested models across five independent experiments (Q1–Q5) vary notably. The BART-large-cnn model recorded the highest time consumption, reflecting the computational demands of its deep generative architecture. In contrast, integrating SBERT as a summarization layer led to a substantial reduction in total processing time, particularly when paired with the T5 model. This time advantage of the hybrid model (SBERT + T5) is attributed to SBERT’s role in condensing the input texts, which allowed the T5 model to complete its tasks faster than models that processed the full-length texts directly. These results highlight the effectiveness of the pre-summarization strategy in improving data flow and reducing computational load in natural language processing systems.

Accordingly, we compared the search performance of the TSQA framework with the experimental results; the comparison is presented in Table 8.

Appendix A (Table A1, Table A2, Table A3, Table A4 and Table A5) presents a detailed case study of the TSQA framework implemented in this work to explain the response-generation mechanism within the retrieval-augmented generation (RAG) framework, and Appendix B (Table A6) lists all abbreviations in the dataset.

6. Conclusions

In this paper, we present a novel method to enhance information retrieval by integrating text summarization with question answering. We have experimentally confirmed that incorporating an SBERT-based summarization stage prior to retrieval is an effective way of improving the retrieval and generation performance of RAG systems. The results also show that the best configuration in terms of accuracy and comprehensiveness is SBERT summarization combined with generation using the Facebook/BART-large-cnn model. This combination yields higher-quality retrieved documents, better-structured information for answer generation, and more consistent and accurate responses. As a result, we recommend SBERT-based summarization as a core component of RAG systems designed for high-precision applications such as advanced search systems, scientific document analysis, and long-form AI applications.

7. Limitations and Future Work

Despite the promising performance of the hybrid model combining SBERT-based summarization with a RAG pipeline, certain constraints remain. The system’s effectiveness is inherently tied to the large language model (LLM)’s performance during text extraction and response generation based on question similarity. To ensure a reliable and fair assessment, a rigorous ground-truth baseline was constructed. While the evaluation focused on a representative set of five questions and their corresponding top five retrieved documents, this specific scope was methodologically chosen to allow for an in-depth, high-precision qualitative analysis of alignment and factual consistency, avoiding the excessive data expansion that often hinders detailed manual validation. However, this focused approach may limit the broader generalizability of the results across more diverse datasets. Furthermore, the extractive nature of SBERT can occasionally overlook nuanced details necessary for the RAG retrieval stage.

In future work, we will aim to scale the evaluation framework and explore abstractive summarization techniques to provide more cohesive context to the QA pipeline, alongside optimizing chunking strategies to further refine the balance between summary brevity and information density. Furthermore, a key direction will be generalizing the proposed framework across diverse specialized domains, such as the medical field. This expansion would involve fine-tuning the model on domain-specific corpora (e.g., the PubMed dataset) to evaluate its capability in handling complex terminology and providing factually critical answers where high precision is paramount. Lastly, we aim to evaluate other LLMs.

Author Contributions

Conceptualization, A.S.J., J.K. and P.S.; methodology, A.S.J. and J.K.; software, A.S.J.; formal analysis, A.S.J., J.K. and P.S.; investigation, A.S.J., J.K. and P.S.; resources, A.S.J.; data curation, A.S.J. and J.K.; writing—original draft, A.S.J. and J.K.; writing—review and editing, A.S.J. and J.K.; visualization, A.S.J. and J.K.; supervision, J.K. and P.S.; project administration, A.S.J. and J.K.; funding acquisition, A.S.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The NIPS dataset supporting the findings of this study is publicly available on the Kaggle platform at (https://www.kaggle.com/datasets/benhamner/nips-papers?select=papers.csv) (accessed on 1 September 2025). All additional data generated or analyzed during this study are included within this article.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper. All authors, including the research supervisors, have reviewed the final manuscript and approved its submission.

Abbreviations

The following abbreviations are used in this manuscript:

TSQA	Text summarization with question answering
TS	Text summarization
QA	Question answering
IR	Information retrieval
ML	Machine learning
MAP	Mean Average Precision
RAG	Retrieval-augmented generation
BERT	Bidirectional Encoder Representations from Transformers
SBERT	Sentence BERT
T5	Text-to-text transfer transformer
NLP	Natural language processing
NIPS	Neural Information Processing Systems
DM	Data mining
MRR	Mean Reciprocal Rank
NDCG	Normalized Discounted Cumulative Gain

Appendix A

Case Study:

This section presents an illustrative case study of the TSQA framework implemented in this work to explain the response-generation mechanism within the retrieval-augmented generation (RAG) framework. It relies on posing questions and retrieving semantically similar texts using the cosine similarity scale. The following tables compare two main scenarios: generation using directly retrieved texts without summarization and generation after applying a summarization stage to the retrieved texts before inputting them into the generation model. Table A1, Table A2, Table A3, Table A4 and Table A5 include the queries posed, the texts or responses generated in each scenario, and the quantitative values used to assess generation quality using the BERT score. This comparison aims to highlight the impact of summarization on improving semantic consistency and the accuracy of the generated responses and to provide a clear analysis of the differences between the two scenarios used in this case study.

Table A1. An example question and its corresponding reference answer taken from the NIPS dataset.

Questions	Reference Answer
Q1: Using Independent Component Analysis (ICA) for artifact removal in EEG recordings.	Independent Component Analysis is an approach to the identification and possible removal of artifacts from EEG records. It effectively decomposes multiple channels. It is effectively applied to remove artifacts from electroencephalographic (EEG) and magnetoencephalographic recordings and can also be used for analyzing multi-channel neuronal recordings.
Q2: Explain challenges and solutions for learning long-term dependencies with Recurrent Neural Networks.	Learning long-term dependencies is a known challenge for simple Recurrent Neural Networks. When the function f can be approximated using a Multilayer Perceptron, the resulting system is referred to as a NARX network. In the case where a NARX network is unfolded in time, the delays of output will appear as jump-ahead connections in an unfolded network. Intuitively, those jump-ahead connections provide a shorter path for the propagation of the gradient information, thereby leading to the reduction in sensitivity over the long term.
Q3: Explain methods for improving speed and accuracy of Support Vector Machines.	Several methods exist to improve Support Vector Machines, including techniques to speed up the training process using analytical QP. The number of Support Vectors (SVs) has a dramatic impact on the efficiency of Support Vector Machines during learning and prediction stages. Recent results indicate that the number of SVs linearly increases with the number n of examples of training.
Q4: How are Gaussian Processes used for regression tasks?	Gaussian Processes are a Bayesian approach used for regression. The mgp approach allows the system to self-organize by locally selecting the Gaussian process regression model with the appropriate optimal bandwidth.
Q5: What are Support Vector Machines (SVM)?	Support Vector Machines (SVM) are state-of-the-art models for many classification problems. Support Vector Machine type learning algorithms are used to produce functions f. They suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Support Vector Machines (SVMs) implement the idea they map input vectors into a high dimensional feature space.

Table A2. An example illustrating the similarity scores between chunked text and the corresponding questions, showing how the questions align with specific papers when no summarization step is applied.

Questions	Paper ID	Paper Title	Score Similarity
Q1: Using Independent Component Analysis (ICA) for artifact removal in EEG recordings.	1683	Recognizing Evoked Potentials in a Virtual Environment	0.72
	1639	Algorithms for Independent Components Analysis and Higher Order Statistics	0.70
	2777	Stimulus Evoked Independent Factor Analysis of MEG Data with Large Background Activity	0.67
	1343	Extended ICA Removes Artifacts from Electroencephalographic Recordings	0.63
	2224	A Probabilistic Approach to Single Channel Blind Signal Separation	0.59
Q2: Explain challenges and solutions for learning long-term dependencies with Recurrent Neural Networks.	1151	Learning long-term dependencies is not as difficult with NARX networks	0.72
	1102	Hierarchical Recurrent Neural Networks for Long-Term Dependencies	0.71
	964	An Input Output HMM Architecture	0.62
	987	Recurrent Networks: Second Order Properties and Pruning	0.61
	851	Learning Temporal Dependencies in Connectionist Speech Recognition	0.60
Q3: Explain methods for improving speed and accuracy of Support Vector Machines.	1577	Using Analytic QP and Sparseness to Speed Training of Support Vector Machines	0.68
	1253	Improving the Accuracy and Speed of Support Vector Machines	0.67
	1663	Model Selection for Support Vector Machines	0.64
	3594	Support Vector Machines with a Reject Option	0.60
	2580	Kernel Projection Machine: a New Tool for Pattern Recognition	0.59
Q4: How are Gaussian Processes used for regression tasks?	1497	Finite-Dimensional Approximation of Gaussian Processes	0.70
	2230	Transductive and Inductive Methods for Approximate Gaussian Process Regression	0.69
	3529	Modeling human function learning with Gaussian processes	0.68
	2230	Transductive and Inductive Methods for Approximate Gaussian Process Regression	0.67
	1048	Gaussian Processes for Regression	0.66
Q5: What are Support Vector Machines (SVM)?	1663	Model Selection for Support Vector Machines	0.68
	1577	Using Analytic QP and Sparseness to Speed Training of Support Vector Machines	0.68
	2580	Kernel Projection Machine: A New Tool for Pattern Recognition	0.65
	1949	A Parallel Mixture of SVMs for Very Large Scale Problems	0.62
	1711	Probabilistic Methods for Support Vector Machines	0.60

Table A3. An example illustrating the similarity scores between chunked text and the corresponding questions, showing how the questions align with specific papers when the SBERT-based summarization step is applied.

Questions	Paper ID	Paper Title	Score Similarity
Q1: Using Independent Component Analysis (ICA) for artifact removal in EEG recordings.	1343	Extended ICA Removes Artifacts from Electroencephalographic Recordings	0.87
	1343	Extended ICA Removes Artifacts from Electroencephalographic Recordings	0.78
	1574	Analyzing and Visualizing Single-Trial Event-Related Potentials	0.72
	2777	Stimulus Evoked Independent Factor Analysis of MEG Data with Large Background Activity	0.65
	2379	Sparse Representation and Its Applications in Blind Source Separation	0.63
Q2: Explain challenges and solutions for learning long-term dependencies with Recurrent Neural Networks.	1151	Learning long-term dependencies is not as difficult with NARX networks	0.74
	1953	Reinforcement Learning with Long Short-Term Memory	0.70
	1102	Hierarchical Recurrent Neural Networks for Long-Term Dependencies	0.67
	1151	Learning long-term dependencies is not as difficult with NARX networks	0.66
	1102	Hierarchical Recurrent Neural Networks for Long-Term Dependencies	0.65
Q3: Explain methods for improving speed and accuracy of Support Vector Machines.	1253	Improving the Accuracy and Speed of Support Vector Machines	0.74
	1663	Model Selection for Support Vector Machines	0.63
	1949	A Parallel Mixture of SVMs for Very Large-Scale Problems	0.61
	1814	Incremental and Decremental Support Vector Machine Learning	0.60
	1870	From Margin to Sparsity	0.59
Q4: How are Gaussian Processes used for regression tasks?	5089	It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals	0.67
	2561	Dependent Gaussian Processes	0.62
	3403	Local Gaussian Process Regression for Real Time Online Model Learning	0.62
	1048	Gaussian Processes for Regression	0.60
	3529	Modeling human function learning with Gaussian processes	0.60
Q5: What are Support Vector Machines (SVM)?	1253	Improving the Accuracy and Speed of Support Vector Machines	0.64
	1949	A Parallel Mixture of SVMs for Very Large-Scale Problems	0.63
	1663	Model Selection for Support Vector Machines	0.62
	3534	Relative Margin Machines	0.61
	1687	A Geometric Interpretation of ν-SVM Classifiers	0.55
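The ranking shown in Tables A2 and A3 can be sketched as a cosine-similarity search over chunk embeddings. The snippet below is a minimal illustration, not the authors' implementation: the three-dimensional vectors and the identifiers `paper_A` and `paper_B` are hypothetical stand-ins for real sentence embeddings and paper IDs. It also shows why one paper can appear more than once in a top-5 list: each chunk is scored separately.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k_chunks(question_vec, chunks, k=5):
    """Return the k (paper_id, score) pairs with the highest similarity."""
    scored = [(paper_id, cosine_similarity(question_vec, vec))
              for paper_id, vec in chunks]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Hypothetical toy embeddings; real ones would come from an embedding model.
question_vec = [1.0, 0.0, 1.0]
chunks = [
    ("paper_A", [1.0, 0.0, 1.0]),  # chunk 1 of paper_A
    ("paper_A", [1.0, 1.0, 1.0]),  # chunk 2 of paper_A
    ("paper_B", [0.0, 1.0, 0.0]),  # unrelated chunk
]
ranked = top_k_chunks(question_vec, chunks, k=2)
for paper_id, score in ranked:
    print(paper_id, round(score, 2))
```

Both top results here belong to the same hypothetical paper, mirroring the repeated paper IDs in the tables.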

Table A4. Example of generated answers evaluated with the BERT score metric for the given questions, without any summarization step.

Questions	Reference Answer	Generated Answer	BERT Score
Q1	Using Independent Component Analysis (ICA) for artifact removal in EEG recordings.	Independent Components Analysis (lCA) has generated much recent theoretical and practical interest because of its successes on a number of different signal processing problems. In this paper we present a novel technique for suppressing interference signals and separating signals from individual evoked sources. The technique is based on a new probabilistic graphical model termed stimulus evoked independent factor analysis (SEIFA)	0.84
Q2	Explain challenges and solutions for learning long-term dependencies with Recurrent Neural Networks.	On tasks with long-term dependencies, a class of architecture known as NARX networks may perform badly. According to earlier research, gradient descent learning performs better in recurrent networks.	0.83
Q3	Explain methods for improving speed and accuracy of Support Vector Machines.	Training algorithms for SVMs are slow, especially for large problems, but they exhibit classification speeds which are substantially slower than those of neural networks. Improve accuracy by incorporating knowledge about invariances of the problem at hand and increase classification speed by reducing the complexity of the decision function representation.	0.84
Q4	How are Gaussian Processes used for regression tasks?	Gaussian process regression (GPR) has demonstrated excellent performance in a number of applications. One unpleasant aspect of GPR is its scaling behavior with the size of the training data set N. The relationship between Gaussian processes and Bayesian linear regression suggests that we can define a single model that exploits both similarity and rules in forming predictions.	0.86
Q5	What are Support Vector Machines (SVM)?	Support Vector Machines (SVMs) implement the following idea: they map input vectors into a high dimensional feature space, where a maximal margin hyperplane is constructed. Training algorithms for SVMs are slow, especially for large problems.	0.87

Table A5. Example of generated answers evaluated with the BERT score metric for the given questions, with a summarization step based on the SBERT transformer model.

Questions	Reference Answer	Generated Answer	BERT Score
Q1	Using Independent Component Analysis (ICA) for artifact removal in EEG recordings.	Independent component analysis can effectively detect, separate and remove activity in EEG records from a wide variety of artifactual sources. Results compared favorably to those obtained using regression-based methods. Technique is based on a new probabilistic graphical model termed stimulus-evoked independent factor analysis.	0.86
Q2	Explain challenges and solutions for learning long-term dependencies with Recurrent Neural Networks.	Recurrent Hierarchical NNs for Long-Term Dependency Understanding. When the function f can be approximated using a Multilayer Perceptron, the resulting system is called a NARX network. When a NARX network is unfolded in time, the output delays will appear as jump-ahead connections in the unfolded network.	0.85
Q3	Explain methods for improving speed and accuracy of Support Vector Machines.	Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems. They suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. A new mixture of SVMs that can be easily implemented in parallel. Each Support Vector Machine is trained on a small subset of the whole dataset.	0.86
Q4	How are Gaussian Processes used for regression tasks?	Multi-task Gaussian process (Gaussian process) models are widely used to couple related tasks or functions for joint regression. The main limitation of Gaussian process regression is that the computational complexity scales cubically with the training examples n. A method to speed up standard Gaussian process regression with local Gaussian process models (lgp).	0.86
Q5	What are Support Vector Machines (SVM)?	Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems. They suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. The method for improving generalization performance (the "virtual support vector" method) does so by incorporating known invariances of the problem. The "reduced set" method is a way to improve the speed of Support Vector Machines.	0.89

Appendix B

Table A6 below lists the abbreviations retained from the dataset. Handling abbreviations is essential for preserving sentence meaning and context during processing and summarization. Each abbreviation was counted by the number of times it appeared within a research paper, and only abbreviations that appeared four or more times in the paper were retained, as this frequency indicates the abbreviation’s importance and role within the sentence structure.

Table A6. Abbreviation definitions.

Abbreviation	Full Form	Frequency
ANN	Artificial Neural Network	4
AP	Average Precision	4
BSCI	Blind Sparse Channel Identification	5
CCA	Canonical Correlation Analysis	8
CD	Contrastive Divergence	4
CMAC	Cerebellar Model Articulation Controller	4
CS	Compressive Sensing	4
CSP	Common Spatial Patterns	5
DDP	Differential Dynamic Programming	5
DP	Dirichlet Process	7
DTW	Dynamic Time Warping	4
EC	Expectation Consistent	4
EER	Equal Error Rate	4
EKF	Extended Kalman Filter	4
EM	Expectation-Maximization Algorithm	4
FACS	Facial Action Coding System	5
FFT	Fast Fourier Transform	4
FIR	Finite Impulse Response	4
FITC	Fully Independent Training Conditional	4
GA	Genetic Algorithm	4
GDA	Generalized Discriminant Analysis	4
GEM	Geometric Entropy Minimization	5
GIS	Generalized Iterative Scaling	5
GSM	Gaussian Scale Mixture	5
HME	Hierarchical Mixture of Experts	5
IAF	Inverse Autoregressive Flow	5
ICA	Independent Component Analysis	5
IR	Information Retrieval	4
KCCA	Kernel Canonical Correlation Analysis	5
KMM	Kernel Mean Matching	5
LDA	Latent Dirichlet Allocation	5
LP	Linear Program	7
LR	Logistic Regression	4
MAD	Mean Absolute Difference	5
MAP	Maximum a posteriori Probability	4
ME	Mixture of Experts	4
MF	Matrix Factorization	4
MF	Mean Field	5
MLE	Maximum Likelihood Estimation	5
MMD	Maximum Mean Discrepancy	4
MSE	Mean Squared Error	5
MVU	Maximum Variance Unfolding	5
NB	Naive Bayes	4
NDCG	Normalized Discounted Cumulative Gain	5
OBS	Optimal Brain Surgeon	4
PAC	Probably Approximately Correct	5
PCA	Principal Component Analysis	4
RKHS	Reproducing Kernel Hilbert Spaces	4
ROC	Receiver Operating Characteristics	4
ROI	Region of Interest	5
RSC	Restricted Strong Convexity	5
SR	Synchrony Rate	5
SSL	Structured Sparsity Learning	5
STOC	Symposium on Theory of Computing	4
SV	Support Vector	4
SVD	Singular Value Decomposition	5
SVR	Support Vector Regression	5
VB	Variational Bayes	5
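The frequency threshold described in Appendix B can be sketched as follows. This is a minimal illustration under stated assumptions: the source does not say how abbreviations were detected, so the regular expression (runs of two or more capital letters) and the function name `frequent_abbreviations` are hypothetical; only the cut-off of four occurrences comes from the text.

```python
import re
from collections import Counter

def frequent_abbreviations(text, min_count=4):
    """Count candidate abbreviations and keep those seen at least min_count times.

    Assumption: a candidate is any standalone run of two or more capital letters.
    """
    candidates = re.findall(r"\b[A-Z]{2,}\b", text)
    counts = Counter(candidates)
    return {abbr: n for abbr, n in counts.items() if n >= min_count}

# Hypothetical sample text, not taken from the dataset.
sample = ("ICA separates sources. ICA is linear. "
          "PCA differs from ICA. We apply ICA and PCA.")
kept = frequent_abbreviations(sample)
print(kept)
```

In this sample, `ICA` occurs four times and is kept, while `PCA` occurs twice and is dropped.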

References

Hambarde, K.A.; Proenca, H. Information Retrieval: Recent Advances and Beyond. IEEE Access 2023, 11, 76581–76604.
Ali, L. Improving Information Retrieval Systems’ Efficiency. Int. J. Eng. Res. Technol. (IJERT) 2022, 11, 287–292.
Xu, C. Bias and Unfairness in Information Retrieval Systems: New Challenges in the LLM Era. In KDD ’24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2024; Volume 1.
Giarelis, N.; Mastrokostas, C.; Karacapilidis, N. Abstractive vs. Extractive Summarization: An Experimental Review. Appl. Sci. 2023, 13, 7620.
Manasaveerashyva, Y.N.; Prathibha, B.S. Text Summerization using Natural Language Processing. Grenze Int. J. Eng. Technol. 2022, 8, 372–378.
Bharati, M.H. Text Summarization Using NLP. Int. J. Res. Appl. Sci. Eng. Technol. 2024, 12, 803–807.
Edress, Z.; Ortakci, Y. Optimizing Text Summarization with Sentence Clustering and Natural Language Processing. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 1123–1132.
Shafiq, N.; Hamid, I.; Asif, M.; Nawaz, Q.; Aljuaid, H.; Ali, H. Abstractive text summarization of low-resourced languages using deep learning. PeerJ Comput. Sci. 2023, 9, e1176.
Siriwardhana, S.; Weerasekera, R.; Wen, E.; Kaluarachchi, T.; Rana, R.; Nanayakkara, S. Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question Answering. Trans. Assoc. Comput. Linguist. 2023, 11, 1–17.
Le, N.K.; Nguyen, D.H.; Nguyen, T.T.T.; Nguyen, M.P.; Le, T.; Le Nguyen, M. A Novel Pipeline to Enhance Question-Answering Model by Identifying Relevant Information. In New Frontiers in Artificial Intelligence; Lecture Notes in Computer Science; Conference Paper; Springer Nature: Cham, Switzerland, 2023; Volume 13856, pp. 296–311.
Shahade, A.K.; Deshmukh, P.V. A Unified Approach to Text Summarization: Classical, Machine Learning, and Deep Learning Methods. Ing. Syst. Inf. 2025, 30, 169–179.
Ali, Z.H.; Hussein, A.K.; Abass, H.K.; Fadel, E. Extractive multi document summarization using harmony search algorithm. Telkomnika (Telecommun. Comput. Electron. Control) 2021, 19, 89–95.
Jian, H.Z.; Johnson, O.V.; Wah, K.K. Text Summarization for News Articles by Machine Learning Techniques. Appl. Math. Comput. Intell. 2022, 11, 174–196.
Basyal, L.; Sanghvi, M. Text Summarization Using Large Language Models: A Comparative Study of MPT-7b-instruct, Falcon-7b-instruct, and OpenAI Chat-GPT Models. arXiv 2023, arXiv:2310.10449.
Archanaa, N.; Shivanesh, B.; Kumar, J.D.T.S.; Mohan, G.B.; Doss, S. Comparative Analysis of News Articles Summarization using LLMs. In 2024 Asia Pacific Conference on Innovation in Technology (APCIT); IEEE: New York, NY, USA, 2024.
Jearanaitanakij, K.; Boonpong, S.; Teainnagrm, K.; Thonglor, T. Fast Hybrid Approach for Thai News Summarization. Eng. Technol. Horiz. 2024, 41, 410307.
Kmainasi, M.B.; Shahroor, A.E.; Hasanain, M.; Laskar, S.R.; Hassan, N.; Alam, F. LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content. arXiv 2024, arXiv:2410.15308.
Muludi, K.; Fitria, K.M.; Triloka, J.; Sutedi. Retrieval-Augmented Generation Approach: Document Question Answering using Large Language Model. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 776–785.
Pujiono, I.; Agtyaputra, I.M.; Ruldeviyani, Y. Implementing Retrieval-Augmented Generation and Vector Databases for Chatbots in Public Services Agencies Context. J. Ilmu Pengetah. Dan Teknol. Komput. 2024, 10, 216–223.
Saha, B.; Saha, U.; Malik, M.Z. Advancing Retrieval-Augmented Generation with Inverted Question Matching for Enhanced QA Performance. IEEE Access 2024, 12, 185401–185410.
Moreno-Cediel, A.; del-Hoyo-Gabaldon, J.-A.; Garcia-Lopez, E.; Garcia-Cabot, A.; de-Fitero-Dominguez, D. Evaluating the performance of multilingual models in answer extraction and question generation. Sci. Rep. 2024, 14, 15477.
Meng, W.; Li, Y.; Chen, L.; Dong, Z. Using the Retrieval-Augmented Generation to Improve the Question-Answering System in Human Health Risk Assessment: The Development and Application. Electronics 2025, 14, 386.
Huang, X.; Lin, Z.; Sun, F.; Zhang, W.; Tong, K.; Liu, Y. A Multi-Hop Retrieval-Augmented Generation Framework for Intelligent Document Question Answering in Financial and Compliance Contexts. 2025. Available online: https://www.researchsquare.com/article/rs-6927746/v1 (accessed on 11 February 2026).
Rayo, J.; La Rosa, R.D.; Garrido, M. A Hybrid Approach to Information Retrieval and Answer Generation for Regulatory Texts. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 31–35.
Kang, S.; Lee, D. Improving Scientific Document Retrieval with Concept Coverage-based Query Set Generation. In Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining (WSDM ’25), Hannover, Germany, 10–14 March 2025; ACM: New York, NY, USA, 2025; 10p.
Thu, T.; Hoang, U.; Anh, V. PDF Retrieval Augmented Question Answering. arXiv 2025, arXiv:2506.18027v1.
Wu, C.; Jiang, J.; Jiang, R.; Li, X. Retrieval augmented generation-driven information retrieval and question answering in construction management. Adv. Eng. Inform. 2025, 65, 103158.
Ortakci, Y. Revolutionary text clustering: Investigating transfer learning capacity of SBERT models through pooling techniques. Eng. Sci. Technol. Int. J. 2024, 55, 101730.
Yu, W. Retrieval-augmented Generation across Heterogeneous Knowledge. In NAACL 2022: 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 52–58.
Li, Z.; Wang, Z.; Wang, W.; Hung, K.; Xie, H.; Wang, F.L. Retrieval-augmented generation for educational application: A systematic survey. Comput. Educ. Artif. Intell. 2025, 8, 100417.
Pathak, P.; Rana, P.S. Comparative Analysis of Pretrained Models for Text Classification, Generation and Summarization: A Detailed Analysis. In Pattern Recognition; Lecture Notes in Computer Science LNCS; Conference paper; Springer: Cham, Switzerland, 2024; Volume 15301, pp. 151–166.
Phakmongkol, P.; Vateekul, P. Enhance text-to-text transfer transformer with generated questions for thai question answering. Appl. Sci. J. 2021, 11, 10267.

Figure 1. The general workflow of TSQA.

Figure 2. The two phases behind the RAG model of IR.

Figure 3. Comparison of the processing time of TSQA.

Table 1. An example of the tokenization process.

Text	Learning of continuous valued functions using ensembles of neural network committees can present better accuracy, reliable estimation of generalization error, and active learning.
Tokens	“Learning”, “of”, “continuous”, “valued”, “functions”, “using”, “ensembles”, “of”, “neural”, “network”, “committees”, “can”, “present”, “better”, “accuracy”, “reliable”, “estimation”, “of”, “generalization”, “error”, “and”, “active”, and “learning”.
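The tokenization in Table 1 can be reproduced with a short sketch. The source does not name the tokenizer used, so the regular-expression approach below is an assumption chosen only because it yields the same token list (words kept with their original case, punctuation dropped); the function name `tokenize` is hypothetical.

```python
import re

def tokenize(text):
    """Split text into alphabetic word tokens, discarding punctuation."""
    return re.findall(r"[A-Za-z]+", text)

text = ("Learning of continuous valued functions using ensembles of neural "
        "network committees can present better accuracy, reliable estimation "
        "of generalization error, and active learning.")
tokens = tokenize(text)
print(len(tokens), tokens)
```

A library tokenizer would typically also emit the commas and the final period as separate tokens, which Table 1 does not show.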

Table 2. Similarity scores of questions and BERT scores for the evaluation of generation without summarization, for the T5 and BART-large-cnn models.

Questions	T5 Score Similarity	T5 BERT Score	BART-large-cnn Score Similarity	BART-large-cnn BERT Score
Q1	0.72	0.84	0.72	0.84
Q2	0.74	0.86	0.74	0.86
Q3	0.67	0.84	0.67	0.85
Q4	0.70	0.83	0.70	0.86
Q5	0.67	0.83	0.67	0.84

Table 3. Retrieval metrics with the T5 model, without summarization.

Question	Precision@5	Recall@5	F1@5	MAP	MRR	nDCG@5
Q1	0.8	0.44	0.57	0.3	0.5	0.66
Q2	0.4	0.5	0.4	0.5	0.5	0.63
Q3	0.6	0.42	0.5	0.42	1	0.72
Q4	0.6	0.37	0.46	0.34	1	0.70
Q5	0.6	0.4	0.49	0.4	1	0.69
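The retrieval metrics reported in these tables can be sketched with binary relevance labels. The code below is a minimal illustration, not the authors' evaluation script. The relevance pattern `[0, 1, 1, 1, 1]` and the count of nine relevant documents are hypothetical: they were chosen because they happen to reproduce the Q1 row of Table 3 (0.8, 0.44, 0.57, 0.3, 0.5, 0.66), and the actual ranking is not given in the source. The per-question MAP column is treated here as average precision normalized by the total number of relevant documents, and the ideal DCG assumes min(k, total relevant) relevant items at the top; both are assumptions.

```python
import math

def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant."""
    return sum(rels[:k]) / k

def recall_at_k(rels, k, total_relevant):
    """Fraction of all relevant documents found in the top k."""
    return sum(rels[:k]) / total_relevant if total_relevant else 0.0

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

def average_precision(rels, k, total_relevant):
    """Sum of precision at each relevant rank, divided by total relevant."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(rels[:k], start=1):
        if rel:
            hits += 1
            score += hits / rank
    return score / total_relevant if total_relevant else 0.0

def reciprocal_rank(rels):
    """Inverse of the rank of the first relevant result."""
    for rank, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(rels, k, total_relevant):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(rel / math.log2(rank + 1)
              for rank, rel in enumerate(rels[:k], start=1))
    ideal_hits = min(k, total_relevant)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Hypothetical relevance labels for the top-5 results of one question.
rels, k, total_relevant = [0, 1, 1, 1, 1], 5, 9
p = precision_at_k(rels, k)
r = recall_at_k(rels, k, total_relevant)
f1 = f1_score(p, r)
ap = average_precision(rels, k, total_relevant)
rr = reciprocal_rank(rels)
ndcg = ndcg_at_k(rels, k, total_relevant)
print(round(p, 2), round(r, 2), round(f1, 2),
      round(ap, 2), round(rr, 2), round(ndcg, 2))
```

MRR and MAP in the usual sense are the means of the reciprocal rank and the average precision over all questions; the per-question values above are the quantities being averaged.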

Table 4. Retrieval metrics with the BART-large-cnn model, without summarization.

Question	Precision@5	Recall@5	F1@5	MAP	MRR	nDCG@5
Q1	0.8	0.44	0.57	0.35	1	0.78
Q2	0.5	0.5	0.5	0.5	0.5	0.65
Q3	0.6	0.43	0.5	0.43	1	0.72
Q4	0.6	0.37	0.46	0.34	1	0.70
Q5	0.6	0.43	0.5	0.4	1	0.69

Table 5. Similarity scores of questions and BERT scores for the evaluation of generation with summarization, for the T5 and BART-large-cnn models.

Questions	T5 Score Similarity	T5 BERT Score	BART-large-cnn Score Similarity	BART-large-cnn BERT Score
Q1	0.73	0.84	0.87	0.86
Q2	0.72	0.82	0.72	0.84
Q3	0.59	0.84	0.73	0.86
Q4	0.72	0.84	0.72	0.86
Q5	0.67	0.84	0.67	0.83
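The relative gains between the baseline (Table 2) and the summarization pipeline (Table 5) can be checked with a short calculation. The sketch below uses the Q1 values for the BART-large-cnn model from those two tables; the helper name `relative_improvement` is hypothetical. The results coincide with the 20.83% and 2.38% figures quoted in the abstract, which suggests, as an inference rather than a statement from the source, that those figures refer to this question.

```python
def relative_improvement(baseline, improved):
    """Percentage change of improved relative to baseline."""
    return (improved - baseline) / baseline * 100.0

# Q1, BART-large-cnn: similarity 0.72 -> 0.87, BERT score 0.84 -> 0.86.
sim_gain = relative_improvement(0.72, 0.87)
bert_gain = relative_improvement(0.84, 0.86)
print(round(sim_gain, 2), round(bert_gain, 2))
```

The same function applied to other rows gives smaller or even negative changes (for example, the T5 similarity for Q3 drops from 0.67 to 0.59), so the headline figures should be read as a best case rather than an average.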

Table 6. Retrieval metrics with the T5 model, with summarization.

Question	Precision@5	Recall@5	F1@5	MAP	MRR	nDCG@5
Q1	1	0.55	0.71	0.55	1	1
Q2	0.6	0.75	0.66	0.75	1	0.83
Q3	0.8	0.44	0.57	0.4	1	0.83
Q4	0.6	0.4	0.5	0.4	1	0.72
Q5	0.4	0.29	0.33	0.16	0.5	0.38

Table 7. Retrieval metrics with the BART-large-cnn model, with summarization.

Question	Precision@5	Recall@5	F1@5	MAP	MRR	nDCG@5
Q1	0.8	0.44	0.57	0.44	1	0.86
Q2	0.6	1	0.75	0.75	1	0.84
Q3	0.8	0.57	0.66	0.57	1	0.86
Q4	0.8	0.5	0.61	0.5	1	0.87
Q5	0.6	0.43	0.5	0.43	1	0.72

Table 8. System environment employed for TSQA development and evaluation.

Components	Version
CPU	Intel (R) Core (TM) Ultra 9 258H (2.90 GHz)
RAM	16 GB
GPU	Intel (R) Arc (TM) 140T GPU (8 GB)
Memory Speed	7467 MT/s
Python (PyCharm)	2025.1
