TSQA: Integrating Text Summarization and Question Answering to Improve Information Retrieval from Documents Using Retrieval-Augmented Generation
Abstract
1. Introduction
- Novel dual-model framework (TSQA): A dual-model framework that synergistically integrates text summarization and question answering to improve information retrieval and comprehension. Unlike previous methods that treat these two tasks separately, the proposed framework establishes a bidirectional interaction where summarization improves the contextual relevance of the answer, while the answer model verifies and enriches the summaries. This mechanism reduces redundancy and produces more accurate, consistent, and contextually relevant outputs. Furthermore, limited studies have explored this synergistic integration with the aim of simultaneously improving answer quality and enhancing the summary.
- The SBERT transformer model is employed for text summarization because it produces high-quality sentence-level semantic representations. The practical rationale for choosing SBERT lies in its ability to generate dense embeddings for entire sentences, overcoming a limitation of the original BERT model, which was designed primarily to produce word-level (token) representations rather than sentence representations. Built on a Siamese network architecture, SBERT maps each sentence to a point in a vector space that reflects its contextual meaning, so semantic similarity can be calculated efficiently. This shift from word-level to sentence-level processing enables the summarization system to identify the most essential sentences and reduce redundancy.
- The advantage of combining summarization with a question-answering system lies in reducing the time required to access information compared with using a question-answering system on its own. A standalone question-answering system must scan each document in its entirety to find the answer, whereas summarization acts as a filter that screens the texts and narrows the search scope to only the essential parts. This integration speeds up the retrieval process.
- Abbreviation expansion: Maintaining expanded forms of acronyms is an important step in information retrieval for a key reason: a user might search using a full term while the document contains the acronym, and vice versa. Therefore, we have created a glossary containing all acronyms in the database to ensure that the expanded forms appear in the text during retrieval, whether the summary is included or not.
- Window size: Determining the window size for chunking is a critical step. The scientific benefit of adjusting the window size lies in balancing contextual accuracy and response speed; intelligent chunking of text ensures that the question-answering system is provided with a focused and coherent context, preventing the model from becoming overwhelmed by long texts.
2. Related Work
2.1. Text Summarization
2.2. Question Answering
3. Materials and Methods
3.1. System Overview
3.1.1. Summarization Phase
3.1.2. Question-Guided Information Retrieval Phase
3.1.3. Answer Generation Using RAG
- The retriever dynamically obtains the top-ranked summaries from the IR stage as external knowledge sources.
- The generator is a transformer-based language model that utilizes the question and the retrieved summaries as contextual input.
- The model then produces a coherent, context-sensitive, and fact-based answer by synthesizing information from the retrieved content.
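
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `question:`/`context:` prompt layout, and the `generator` callable (a stand-in for a transformer model such as T5 or BART) are all hypothetical.

```python
from typing import Callable, List, Tuple


def build_rag_prompt(question: str,
                     ranked_summaries: List[Tuple[str, float]],
                     top_k: int = 3) -> str:
    """Assemble generator input from the question and the top-k summaries.

    ranked_summaries holds (summary_text, similarity_score) pairs produced
    by the IR stage. The prompt layout is an assumption for illustration.
    """
    # Keep only the k summaries with the highest similarity score.
    best = sorted(ranked_summaries, key=lambda pair: pair[1], reverse=True)[:top_k]
    context = "\n".join(f"[{i + 1}] {text}" for i, (text, _) in enumerate(best))
    return f"question: {question}\ncontext:\n{context}"


def answer(question: str,
           ranked_summaries: List[Tuple[str, float]],
           generator: Callable[[str], str],
           top_k: int = 3) -> str:
    """Run the hypothetical generator on the assembled prompt."""
    return generator(build_rag_prompt(question, ranked_summaries, top_k))
```

Keeping the generator behind a plain callable makes it possible to swap one language model for another without touching the retrieval side.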
3.2. Dataset
3.3. Preprocessing
- Tokenization:
- Stop Words Removal:
- Lemmatization:
- Abbreviation Expansion:
| Algorithm 1: Context-Aware Abbreviation Expansion and Global Aggregation Framework |
| Objective: To automatically construct a high-quality acronym-definition dictionary from a large corpus by leveraging local context windows, heuristic validation, and global frequency filtering. Input: D: A dataset containing N documents (papers) τ: Frequency threshold for noise filtering (default τ = 5) Output: R: The final ranked registry of valid abbreviation–definition pairs. |
| Begin Step 1: Global Initialization: Initialize a global frequency map G ← ∅ to store counts of pairs, (A, D) → N. Step 2: Corpus Processing Loop: For each document d_i in D: 1. Preprocessing: T ← Normalize whitespace in d_i 2. Pattern Recognition: Identify set of matches S using Regex: (? <=\s) 3. Local Extraction Strategy: For each match m ∈ S containing acronym A at index idx: Context Windowing: Extract preceding text T_prev = T[max(0, idx − 150) : idx] Tokenize T_prev into a word list W_prev Dynamic Candidate Search: Define search window size ranging from to Initialize For each k (iterating backwards/forwards): Construct candidate string C from the last k tokens of W_prev Validation Check (Heuristic): Start Char: Does the first word of C start with the first letter of A? Complexity: Is length(C) ≤ (∣A∣ × 3 + 5)? If validation passes: BestDef ← C (Optional: Update to favor longer valid matches if found) Local Update: If BestDef ≠ Null: Increment G[(A, BestDef)] ← G[(A, BestDef)] + 1 Step 3: Noise Reduction and Filtering: Initialize final list R ← ∅. For each unique pair p = (A_cr, D_ef) in G: If G[p] > τ: Add tuple (A_cr, D_ef, G[p]) to R. Step 4: Finalization: Sort R by frequency in descending order. Return R. End |
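
The following Python sketch shows one way Algorithm 1 could be realized. Several details are assumptions rather than the authors' choices: the regular expression in the algorithm is truncated, so the pattern below (an uppercase acronym in parentheses) is hypothetical; the search-window bounds are not given, so candidates of ∣A∣ to ∣A∣ + 2 words are tried and the shortest valid one is kept; and length(C) is interpreted as a word count.

```python
import re
from collections import Counter

# Assumed pattern: an uppercase acronym in parentheses, e.g. "(ICA)".
ACRONYM = re.compile(r"\(([A-Z]{2,10})\)")


def find_definition(prev_words, acronym, extra=2):
    """Return the shortest valid candidate definition, or None."""
    for k in range(len(acronym), len(acronym) + extra + 1):
        if k > len(prev_words):
            break
        candidate = prev_words[-k:]  # last k tokens before the acronym
        # Heuristic 1: first word starts with the acronym's first letter.
        starts_ok = candidate[0][:1].lower() == acronym[0].lower()
        # Heuristic 2: complexity bound, read here as a word count.
        short_enough = len(candidate) <= len(acronym) * 3 + 5
        if starts_ok and short_enough:
            return " ".join(candidate)
    return None


def build_registry(documents, tau=5):
    """Count (acronym, definition) pairs and keep those seen more than tau times."""
    counts = Counter()
    for doc in documents:
        text = " ".join(doc.split())  # normalize whitespace
        for match in ACRONYM.finditer(text):
            acronym = match.group(1)
            # 150-character context window preceding the match.
            prev_words = text[max(0, match.start() - 150):match.start()].split()
            definition = find_definition(prev_words, acronym)
            if definition is not None:
                counts[(acronym, definition)] += 1
    kept = [(a, d, n) for (a, d), n in counts.items() if n > tau]
    return sorted(kept, key=lambda row: row[2], reverse=True)
```

Starting the search at ∣A∣ words avoids matching a shorter suffix that happens to begin with the right letter, such as "Correlation Analysis" for CCA.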
3.4. Text Summarization
- SBERT Transformer:
- Sentence embedding generation: The input document is segmented into individual sentences, and each sentence is passed to a pre-trained SBERT model. SBERT generates a high-dimensional vector embedding for each sentence that captures its semantic meaning. These embeddings allow sentence similarity to be computed efficiently.
- Sentence similarity calculation: After the sentence embeddings are obtained, the similarity of every pair of sentences is computed. This is usually done with the cosine similarity measure, i.e., the cosine of the angle between two embedding vectors. The larger the cosine similarity, the greater the semantic similarity between the sentences.
- Sentence ranking: The sentences are then ranked according to their importance. This can be done with graph-based procedures such as TextRank, in which a graph is constructed whose nodes represent sentences and whose edges denote their similarity. SBERT embeddings reinforce these procedures by providing semantically rich similarity scores.
- Summary generation: The extractive summary is formed via the use of the top-ranked sentences or the most representative sentences in the clusters depending on the results of the ranking or clustering. The length of a desired summary or a particular threshold can be used to decide on the number of sentences to include. Algorithm 2 shows the steps of SBERT for extractive summarization.
| Algorithm 2: SBERT_Text_Summarization |
| Input: Document D, number of summary sentences K Output: Summary S |
| Begin Step 1: Tokenization sentences = SplitIntoSentences(D) Step 2: Embedding Generation embeddings = [] for s in sentences: embeddings.append(SBERT(s)) Step 3: Similarity Calculation similarity_matrix = zeros(len(sentences), len(sentences)) for i in range(len(sentences)): for j in range(len(sentences)): similarity_matrix[i][j] = cosine_similarity(embeddings[i], embeddings[j]) Step 4: Sentence Ranking (TextRank) scores = TextRank(similarity_matrix) Step 5: Summary Generation top_sentences = SelectTopK(sentences, scores, K) summary = SortByOriginalOrder(top_sentences) return summary End |
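
A self-contained Python sketch of the pipeline in Algorithm 2 follows. To keep it runnable without model downloads, `toy_embed` is a bag-of-words stand-in for SBERT and is purely an assumption for illustration; in practice each sentence would be encoded by a pre-trained SBERT model and the cosine would be taken over dense vectors. The regex sentence splitter, damping factor, and iteration count are likewise illustrative choices.

```python
import math
import re
from collections import Counter


def split_sentences(document):
    """Naive splitter on sentence-final punctuation (assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]


def toy_embed(sentence):
    """Stand-in for SBERT: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def textrank(sim, damping=0.85, iterations=50):
    """Power iteration over the similarity graph, ignoring self-loops."""
    n = len(sim)
    scores = [1.0 / n] * n
    # Total outgoing edge weight of each sentence node.
    out = [sum(sim[j][i] for i in range(n) if i != j) for j in range(n)]
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(sim[j][i] / out[j] * scores[j]
                            for j in range(n) if j != i and out[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize(document, k):
    """Return the k top-ranked sentences in their original order."""
    sentences = split_sentences(document)
    if not sentences:
        return ""
    vectors = [toy_embed(s) for s in sentences]
    sim = [[cosine(a, b) for b in vectors] for a in vectors]
    scores = textrank(sim)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return " ".join(sentences[i] for i in sorted(top))
```

Restoring the original order after selection keeps the summary readable, since the ranking order alone would interleave sentences from different parts of the document.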
3.5. Vectorization
3.6. Question Answering (QA)
3.6.1. Information Retrieval
3.6.2. Retrieval-Augmented Generation (RAG)
- Window size:
| Algorithm 3: Window Size (Adaptive Semantic Text Chunking) |
| Objective: To segment long academic texts into semantically coherent units by preserving paragraph boundaries where possible and strictly respecting token limits. Input: T: Raw Document Text. L_min: Minimum chunk size (50 words). L_max: Maximum chunk size (500 words). Output: C_list: Ordered list of text chunks. |
| Begin Initialize C_list = [] Step 1: Semantic Separation Split T into paragraphs P based on double newlines (\n\n) For each paragraph p in P Do: p = Strip whitespace from p If p is empty Then: Continue End If Step 2: Calculate Word Count WordCount = Count words in p Case (A): Paragraph fits within limits If (WordCount >= L_min) And (WordCount <= L_max) Then: Append p to C_list Case (B): Paragraph is too long (Fragmentation required) Else If WordCount > L_max Then: Sentences = Split p into sentences (using NLTK) Buffer = "" For each sent in Sentences Do: CombinedLength = Count words in (Buffer + sent) If CombinedLength <= L_max Then: # Adding the sentence does not exceed the limit Buffer = Buffer + " " + sent Else: # Flush the buffer If Buffer is not empty Then: Append Buffer to C_list End If Buffer = sent # Start new chunk with current sentence End If End For Case (C): Paragraph is too short (Optional: Ignore or Merge) Else: # Current logic ignores very short paragraphs (noise) Continue End If End For Return C_list End |
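
Algorithm 3 translates almost directly into Python. The sketch below makes two assumptions that are not in the pseudocode: a simple regex replaces the NLTK sentence splitter so the example has no dependencies, and the buffer is flushed once more after the sentence loop, since otherwise the final sentences of a long paragraph would be lost. The function name `chunk_text` is hypothetical.

```python
import re


def chunk_text(text, l_min=50, l_max=500):
    """Split text into paragraph- or sentence-based chunks bounded in words."""
    chunks = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        n_words = len(paragraph.split())
        if l_min <= n_words <= l_max:
            # Case A: the paragraph already fits within the limits.
            chunks.append(paragraph)
        elif n_words > l_max:
            # Case B: pack sentences greedily into chunks of at most l_max words.
            buffer = ""
            for sent in re.split(r"(?<=[.!?])\s+", paragraph):
                combined = f"{buffer} {sent}".strip()
                if len(combined.split()) <= l_max:
                    buffer = combined
                else:
                    if buffer:
                        chunks.append(buffer)
                    buffer = sent  # start a new chunk with this sentence
            if buffer:
                chunks.append(buffer)  # assumed final flush
        # Case C: paragraphs shorter than l_min are skipped as noise.
    return chunks
```

Note that a single sentence longer than `l_max` words still becomes its own oversized chunk, so a hard limit would need an additional word-level split.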
- all-MiniLM-L6-v2 Model:
- Bart-large-cnn Model:
- T5 Model:
4. Evaluation
- K is the chosen cutoff point.
- N represents the total number of queries (in the case of information retrieval) in the evaluated dataset.
- AP is the average precision for a given ranking list:
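
The quantities defined above enter the usual ranking metrics. The sketch below uses the standard textbook definitions with binary relevance, which is an assumption here; the exact normalization used in the evaluation (for example, whether AP is divided by the number of relevant documents or by the number of hits) may differ. All function names are illustrative.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def average_precision(ranked, relevant, k):
    """Mean of precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(1 / math.log2(r + 1)
              for r, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1 / math.log2(r + 1)
                for r in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0


def mean_over_queries(values):
    """MAP and MRR are the means of AP and RR over the N queries."""
    return sum(values) / len(values)
```

For example, with the ranking `["a", "b", "c", "d", "e"]` and relevant set `{"a", "c", "x"}`, two of the five results are relevant, so Precision@5 is 2/5 and Recall@5 is 2/3.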
5. Results and Discussion
5.1. Model Details
- Phase 1: Passage Retrieval
- Phase 2: Text Generation using a Large Language Model (LLM)
5.2. Baselines
- Without summarization:
5.2.1. Generation Part
5.2.2. Retrieval Part
- With summarization:
5.2.3. Generation Part
5.2.4. Retrieval Part
6. Conclusions
7. Limitations and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| TSQA | Text summarization with question answering |
| TS | Text summarization |
| QA | Question answering |
| IR | Information retrieval |
| ML | Machine learning |
| MAP | Mean Average Precision |
| RAG | Retrieval-augmented generation |
| BERT | Bidirectional Encoder Representations from Transformers |
| SBERT | Sentence BERT |
| T5 | Text-to-text transfer transformer |
| NLP | Natural language processing |
| NIPS | Neural Information Processing Systems |
| DM | Data mining |
| MRR | Mean Reciprocal Rank |
| NDCG | Normalized Discounted Cumulative Gain |
Appendix A
| Questions | Reference Answer |
|---|---|
| Q1: Using Independent Component Analysis (ICA) for artifact removal in EEG recordings. | Independent Component Analysis is an approach to the identification and possible removal of artifacts from EEG records. It effectively decomposes multiple channels. It is effectively applied to remove artifacts from electroencephalographic (EEG) and magnetoencephalographic recordings and can also be used for analyzing multi-channel neuronal recordings. |
| Q2: Explain challenges and solutions for learning long-term dependencies with Recurrent Neural Networks. | Learning long-term dependencies is a known challenge for simple Recurrent Neural Networks. When the function f can be approximated using a Multilayer Perceptron, the resulting system is referred to as a NARX network. In the case where a NARX network is unfolded in time, the delays of output will appear as jump-ahead connections in an unfolded network. Intuitively, those jump-ahead connections provide a shorter path for the propagation of the gradient information, thereby leading to the reduction in sensitivity over the long term. |
| Q3: Explain methods for improving speed and accuracy of Support Vector Machines. | Several methods exist to improve Support Vector Machines, including techniques to speed up the training process using analytical QP. The number of Support Vectors (SVs) has a dramatic impact on the efficiency of Support Vector Machines during learning and prediction stages. Recent results indicate that the number of SVs linearly increases with the number n of examples of training. |
| Q4: How are Gaussian Processes used for regression tasks? | Gaussian Processes are a Bayesian approach used for regression. The mgp approach allows the system to self-organize by locally selecting the Gaussian process regression model with the appropriate optimal bandwidth. |
| Q5: What are Support Vector Machines (SVM)? | Support Vector Machines (SVM) are state-of-the-art models for many classification problems. Support Vector Machine type learning algorithms are used to produce functions f. They suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Support Vector Machines (SVMs) implement the idea they map input vectors into a high dimensional feature space. |
| Questions | Paper ID | Paper Title | Score Similarity |
|---|---|---|---|
| Q1: Using Independent Component Analysis (ICA) for artifact removal in EEG recordings. | 1683 | Recognizing Evoked Potentials in a Virtual Environment | 0.72 |
| | 1639 | Algorithms for Independent Components Analysis and Higher Order Statistics | 0.70 |
| | 2777 | Stimulus Evoked Independent Factor Analysis of MEG Data with Large Background Activity | 0.67 |
| | 1343 | Extended ICA Removes Artifacts from Electroencephalographic Recordings | 0.63 |
| | 2224 | A Probabilistic Approach to Single Channel Blind Signal Separation | 0.59 |
| Q2: Explain challenges and solutions for learning long-term dependencies with Recurrent Neural Networks. | 1151 | Learning long-term dependencies is not as difficult with NARX networks | 0.72 |
| | 1102 | Hierarchical Recurrent Neural Networks for Long-Term Dependencies | 0.71 |
| | 964 | An Input Output HMM Architecture | 0.62 |
| | 987 | Recurrent Networks: Second Order Properties and Pruning | 0.61 |
| | 851 | Learning Temporal Dependencies in Connectionist Speech Recognition | 0.60 |
| Q3: Explain methods for improving speed and accuracy of Support Vector Machines. | 1577 | Using Analytic QP and Sparseness to Speed Training of Support Vector Machines | 0.68 |
| | 1253 | Improving the Accuracy and Speed of Support Vector Machines | 0.67 |
| | 1663 | Model Selection for Support Vector Machines | 0.64 |
| | 3594 | Support Vector Machines with a Reject Option | 0.60 |
| | 2580 | Kernel Projection Machine: a New Tool for Pattern Recognition | 0.59 |
| Q4: How are Gaussian Processes used for regression tasks? | 1497 | Finite-Dimensional Approximation of Gaussian Processes | 0.70 |
| | 2230 | Transductive and Inductive Methods for Approximate Gaussian Process Regression | 0.69 |
| | 3529 | Modeling human function learning with Gaussian processes | 0.68 |
| | 2230 | Transductive and Inductive Methods for Approximate Gaussian Process Regression | 0.67 |
| | 1048 | Gaussian Processes for Regression | 0.66 |
| Q5: What are Support Vector Machines (SVM)? | 1663 | Model Selection for Support Vector Machines | 0.68 |
| | 1577 | Using Analytic QP and Sparseness to Speed Training of Support Vector Machines | 0.68 |
| | 2580 | Kernel Projection Machine: A New Tool for Pattern Recognition | 0.65 |
| | 1949 | A Parallel Mixture of SVMs for Very Large Scale Problems | 0.62 |
| | 1711 | Probabilistic Methods for Support Vector Machines | 0.60 |
| Questions | Paper ID | Paper Title | Score Similarity |
|---|---|---|---|
| Q1: Using Independent Component Analysis (ICA) for artifact removal in EEG recordings. | 1343 | Extended ICA Removes Artifacts from Electroencephalographic Recordings | 0.87 |
| | 1343 | Extended ICA Removes Artifacts from Electroencephalographic Recordings | 0.78 |
| | 1574 | Analyzing and Visualizing Single-Trial Event-Related Potentials | 0.72 |
| | 2777 | Stimulus Evoked Independent Factor Analysis of MEG Data with Large Background Activity | 0.65 |
| | 2379 | Sparse Representation and Its Applications in Blind Source Separation | 0.63 |
| Q2: Explain challenges and solutions for learning long-term dependencies with Recurrent Neural Networks. | 1151 | Learning long-term dependencies is not as difficult with NARX networks | 0.74 |
| | 1953 | Reinforcement Learning with Long Short-Term Memory | 0.70 |
| | 1102 | Hierarchical Recurrent Neural Networks for Long-Term Dependencies | 0.67 |
| | 1151 | Learning long-term dependencies is not as difficult with NARX networks | 0.66 |
| | 1102 | Hierarchical Recurrent Neural Networks for Long-Term Dependencies | 0.65 |
| Q3: Explain methods for improving speed and accuracy of Support Vector Machines. | 1253 | Improving the Accuracy and Speed of Support Vector Machines | 0.74 |
| | 1663 | Model Selection for Support Vector Machines | 0.63 |
| | 1949 | A Parallel Mixture of SVMs for Very Large-Scale Problems | 0.61 |
| | 1814 | Incremental and Decremental Support Vector Machine Learning | 0.60 |
| | 1870 | From Margin to Sparsity | 0.59 |
| Q4: How are Gaussian Processes used for regression tasks? | 5089 | It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals | 0.67 |
| | 2561 | Dependent Gaussian Processes | 0.62 |
| | 3403 | Local Gaussian Process Regression for Real Time Online Model Learning | 0.62 |
| | 1048 | Gaussian Processes for Regression | 0.60 |
| | 3529 | Modeling human function learning with Gaussian processes | 0.60 |
| Q5: What are Support Vector Machines (SVM)? | 1253 | Improving the Accuracy and Speed of Support Vector Machines | 0.64 |
| | 1949 | A Parallel Mixture of SVMs for Very Large-Scale Problems | 0.63 |
| | 1663 | Model Selection for Support Vector Machines | 0.62 |
| | 3534 | Relative Margin Machines | 0.61 |
| | 1687 | A Geometric Interpretation of v-SVM Classifiers | 0.55 |
| Question ID | Question | Generated Answer | BERT Score |
|---|---|---|---|
| Q1 | Using Independent Component Analysis (ICA) for artifact removal in EEG recordings. | Independent Components Analysis (lCA) has generated much recent theoretical and practical interest because of its successes on a number of different signal processing problems. In this paper we present a novel technique for suppressing interference signals and separating signals from individual evoked sources. The technique is based on a new probabilistic graphical model termed stimulus evoked independent factor analysis (SEIFA) | 0.84 |
| Q2 | Explain challenges and solutions for learning long-term dependencies with Recurrent Neural Networks. | On tasks with long-term dependencies, a class of architecture known as NARX networks may perform badly. According to earlier research, gradient descent learning performs better in recurrent networks. | 0.83 |
| Q3 | Explain methods for improving speed and accuracy of Support Vector Machines. | Training algorithms for SVMs are slow, especially for large problems, but they exhibit classification speeds which are substantially slower than those of neural networks. Improve accuracy by incorporating knowledge about invariances of the problem at hand and increase classification speed by reducing the complexity of the decision function representation. | 0.84 |
| Q4 | How are Gaussian Processes used for regression tasks? | Gaussian process regression (GPR) has demonstrated excellent performance in a number of applications. One unpleasant aspect of GPR is its scaling behavior with the size of the training data set N. The relationship between Gaussian processes and Bayesian linear regression suggests that we can define a single model that exploits both similarity and rules in forming predictions. | 0.86 |
| Q5 | What are Support Vector Machines (SVM)? | Support Vector Machines (SVMs) implement the following idea: they map input vectors into a high dimensional feature space, where a maximal margin hyperplane is constructed. Training algorithms for SVMs are slow, especially for large problems. | 0.87 |
| Question ID | Question | Generated Answer | BERT Score |
|---|---|---|---|
| Q1 | Using Independent Component Analysis (ICA) for artifact removal in EEG recordings. | Independent component analysis can effectively detect, separate and remove activity in EEG records from a wide variety of artifactual sources. Results compared favorably to those obtained using regression-based methods. Technique is based on a new probabilistic graphical model termed stimulus-evoked independent factor analysis. | 0.86 |
| Q2 | Explain challenges and solutions for learning long-term dependencies with Recurrent Neural Networks. | Recurrent Hierarchical NNs for Long-Term Dependency Understanding. When the function f can be approximated using a Multilayer Perceptron, the resulting system is called a NARX network. When a NARX network is unfolded in time, the output delays will appear as jump-ahead connections in the unfolded network. | 0.85 |
| Q3 | Explain methods for improving speed and accuracy of Support Vector Machines. | Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems. They suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. A new mixture of SVMs that can be easily implemented in parallel. Each Support Vector Machine is trained on a small subset of the whole dataset. | 0.86 |
| Q4 | How are Gaussian Processes used for regression tasks? | Multi-task Gaussian process (Gaussian process) models are widely used to couple related tasks or functions for joint regression. The main limitation of Gaussian process regression is that the computational complexity scales cubically with the training examples n. A method to speed up standard Gaussian process regression with local Gaussian process models (lgp). | 0.86 |
| Q5 | What are Support Vector Machines (SVM)? | Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems. They suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. The method for improving generalization performance (the "virtual support vector" method) does so by incorporating known invariances of the problem. The "reduced set" method is a way to improve the speed of Support Vector Machines. | 0.89 |
Appendix B
| Abbreviation | Full Form | Frequency |
|---|---|---|
| ANN | Artificial Neural Network | 4 |
| AP | Average Precision | 4 |
| BSCI | Blind Sparse Channel Identification | 5 |
| CCA | Canonical Correlation Analysis | 8 |
| CD | Contrastive Divergence | 4 |
| CMAC | Control: The Cerebellar Model Articulation Controller | 4 |
| CS | Compressive Sensing | 4 |
| CSP | Common Spatial Patterns | 5 |
| DDP | Differential Dynamic Programming | 5 |
| DP | Dirichlet Process | 7 |
| DTW | Dynamic Time Warping | 4 |
| EC | Expectation Consistent | 4 |
| EER | Equal Error Rate | 4 |
| EKF | Extended Kalman Filter | 4 |
| EM | Expectation-Maximization Algorithm | 4 |
| FACS | Facial Action Coding System | 5 |
| FFT | Fast Fourier Transform | 4 |
| FIR | Finite Impulse Response | 4 |
| FITC | Fully Independent Training Conditional | 4 |
| GA | Genetic Algorithm | 4 |
| GDA | Generalized Discriminant Analysis | 4 |
| GEM | Geometric Entropy Minimization | 5 |
| GIS | Generalized Iterative Scaling | 5 |
| GSM | Gaussian Scale Mixture | 5 |
| HME | Hierarchical Mixture of Experts | 5 |
| IAF | Inverse Autoregressive Flow | 5 |
| ICA | Independent Component Analysis | 5 |
| IR | Information Retrieval | 4 |
| KCCA | Kernel Canonical Correlation Analysis | 5 |
| KMM | Kernel Mean Matching | 5 |
| LDA | Latent Dirichlet Allocation | 5 |
| LP | Linear Program | 7 |
| LR | Logistic Regression | 4 |
| MAD | Mean Absolute Difference | 5 |
| MAP | Maximum a posteriori Probability | 4 |
| ME | Mixture of Experts | 4 |
| MF | Matrix Factorization | 4 |
| MF | Mean Field | 5 |
| MLE | Maximum Likelihood Estimation | 5 |
| MMD | Maximum Mean Discrepancy | 4 |
| MSE | Mean Squared Error | 5 |
| MVU | Maximum Variance Unfolding | 5 |
| NB | Naive Bayes | 4 |
| NDCG | Normalized Discounted Cumulative Gain | 5 |
| OBS | Optimal Brain Surgeon | 4 |
| PAC | Probably Approximately Correct | 5 |
| PCA | Principal Component Analysis | 4 |
| RKHS | Reproducing Kernel Hilbert Spaces | 4 |
| ROC | Receiver Operating Characteristics | 4 |
| ROI | Region of Interest | 5 |
| RSC | Restricted Strong Convexity | 5 |
| SR | Synchrony Rate | 5 |
| SSL | Structured Sparsity Learning | 5 |
| STOC | Symposium on Theory of Computing | 4 |
| SV | Support Vector | 4 |
| SVD | Singular Value Decomposition | 5 |
| SVR | Support Vector Regression | 5 |
| VB | Variational Bayes | 5 |



| Item | Content |
|---|---|
| Text | Learning of continuous valued functions using ensembles of neural network committees can present better accuracy, reliable estimation of generalization error, and active learning. |
| Tokens | “Learning”, “of”, “continuous”, “valued”, “functions”, “using”, “ensembles”, “of”, “neural”, “network”, “committees”, “can”, “present”, “better”, “accuracy”, “reliable”, “estimation”, “of”, “generalization”, “error”, “and”, “active”, and “learning”. |
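
The Text and Tokens rows above can be reproduced with a minimal sketch, assuming a plain word-level tokenizer that discards punctuation. The source does not state which tokenizer was actually used, so the helper name `simple_tokenize` and the regular expression are illustrative assumptions only.

```python
import re


def simple_tokenize(sentence: str) -> list[str]:
    """Split a sentence into alphabetic word tokens, dropping punctuation.

    Hypothetical helper: a regex stand-in for whatever tokenizer
    the original pipeline used.
    """
    return re.findall(r"[A-Za-z]+", sentence)


text = (
    "Learning of continuous valued functions using ensembles of neural "
    "network committees can present better accuracy, reliable estimation "
    "of generalization error, and active learning."
)

tokens = simple_tokenize(text)
print(len(tokens))
print(tokens)
```

Case is preserved and the commas and the final period are dropped, which matches the token list shown in the table. A subword tokenizer such as the one bundled with a transformer model would produce a different split.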

| Question | T5 Model: Similarity Score | T5 Model: BERT Score | BART Model: Similarity Score | BART Model: BERT Score |
|---|---|---|---|---|
| Q1 | 0.72 | 0.84 | 0.72 | 0.84 |
| Q2 | 0.74 | 0.86 | 0.74 | 0.86 |
| Q3 | 0.67 | 0.84 | 0.67 | 0.85 |
| Q4 | 0.70 | 0.83 | 0.70 | 0.86 |
| Q5 | 0.67 | 0.83 | 0.67 | 0.84 |

| Question | Precision@5 | Recall@5 | F1@5 | MAP | MRR | nDCG@5 |
|---|---|---|---|---|---|---|
| Q1 | 0.8 | 0.44 | 0.57 | 0.3 | 0.5 | 0.66 |
| Q2 | 0.4 | 0.5 | 0.4 | 0.5 | 0.5 | 0.63 |
| Q3 | 0.6 | 0.42 | 0.5 | 0.42 | 1 | 0.72 |
| Q4 | 0.6 | 0.37 | 0.46 | 0.34 | 1 | 0.70 |
| Q5 | 0.6 | 0.4 | 0.49 | 0.4 | 1 | 0.69 |

| Question | Precision@5 | Recall@5 | F1@5 | MAP | MRR | nDCG@5 |
|---|---|---|---|---|---|---|
| Q1 | 0.8 | 0.44 | 0.57 | 0.35 | 1 | 0.78 |
| Q2 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.65 |
| Q3 | 0.6 | 0.43 | 0.5 | 0.43 | 1 | 0.72 |
| Q4 | 0.6 | 0.37 | 0.46 | 0.34 | 1 | 0.70 |
| Q5 | 0.6 | 0.43 | 0.5 | 0.4 | 1 | 0.69 |

| Question | T5 Model: Similarity Score | T5 Model: BERT Score | BART-large-CNN Model: Similarity Score | BART-large-CNN Model: BERT Score |
|---|---|---|---|---|
| Q1 | 0.73 | 0.84 | 0.87 | 0.86 |
| Q2 | 0.72 | 0.82 | 0.72 | 0.84 |
| Q3 | 0.59 | 0.84 | 0.73 | 0.86 |
| Q4 | 0.72 | 0.84 | 0.72 | 0.86 |
| Q5 | 0.67 | 0.84 | 0.67 | 0.83 |

| Question | Precision@5 | Recall@5 | F1@5 | MAP | MRR | nDCG@5 |
|---|---|---|---|---|---|---|
| Q1 | 1 | 0.55 | 0.71 | 0.55 | 1 | 1 |
| Q2 | 0.6 | 0.75 | 0.66 | 0.75 | 1 | 0.83 |
| Q3 | 0.8 | 0.44 | 0.57 | 0.4 | 1 | 0.83 |
| Q4 | 0.6 | 0.4 | 0.5 | 0.4 | 1 | 0.72 |
| Q5 | 0.4 | 0.29 | 0.33 | 0.16 | 0.5 | 0.38 |

| Question | Precision@5 | Recall@5 | F1@5 | MAP | MRR | nDCG@5 |
|---|---|---|---|---|---|---|
| Q1 | 0.8 | 0.44 | 0.57 | 0.44 | 1 | 0.86 |
| Q2 | 0.6 | 1 | 0.75 | 0.75 | 1 | 0.84 |
| Q3 | 0.8 | 0.57 | 0.66 | 0.57 | 1 | 0.86 |
| Q4 | 0.8 | 0.5 | 0.61 | 0.5 | 1 | 0.87 |
| Q5 | 0.6 | 0.43 | 0.5 | 0.43 | 1 | 0.72 |
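
The ranking metrics reported in the tables above (Precision@5, Recall@5, F1@5, MAP, MRR, nDCG@5) can be illustrated with the following sketch for a single question, where MAP and MRR reduce to average precision and reciprocal rank. It assumes binary relevance labels and average precision normalized by the total number of relevant passages; the exact definitions used in the paper are not given in this section, and the relevance list below is made up for illustration, not taken from the experiments.

```python
import math


def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant (rels: 0/1 flags in rank order)."""
    return sum(rels[:k]) / k


def recall_at_k(rels, k, n_relevant):
    """Fraction of all relevant passages that appear in the top-k results."""
    return sum(rels[:k]) / n_relevant if n_relevant else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def reciprocal_rank(rels):
    """1 / rank of the first relevant result, or 0 if none is relevant."""
    for rank, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def average_precision(rels, k, n_relevant):
    """Sum of precision values at each relevant hit in the top-k,
    divided by the total number of relevant passages (assumed convention)."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / n_relevant if n_relevant else 0.0


def ndcg_at_k(rels, k, n_relevant):
    """Binary-relevance nDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels[:k], start=1))
    ideal_hits = min(k, n_relevant)
    idcg = sum(1 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical judgments: ranks 1, 3 and 4 are relevant; 7 relevant passages exist.
rels, k, n_relevant = [1, 0, 1, 1, 0], 5, 7

p = precision_at_k(rels, k)
r = recall_at_k(rels, k, n_relevant)
metrics = {
    "Precision@5": p,
    "Recall@5": r,
    "F1@5": f1_score(p, r),
    "AP": average_precision(rels, k, n_relevant),
    "RR": reciprocal_rank(rels),
    "nDCG@5": ndcg_at_k(rels, k, n_relevant),
}
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Other conventions exist, for example normalizing average precision by the smaller of k and the number of relevant passages, and they would yield different values for the same ranking.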

| Component | Specification |
|---|---|
| CPU | Intel(R) Core(TM) Ultra 9 258H (2.90 GHz) |
| RAM | 16 GB |
| GPU | Intel(R) Arc(TM) 140T GPU (8 GB) |
| Memory speed | 7467 MT/s |
| Python (PyCharm) | 2025.1 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Jaddoa, A.S.; Karimpour, J.; Salehpour, P. TSQA: Integrating Text Summarization and Question Answering to Improve Information Retrieval from Documents Using Retrieval-Augmented Generation. Information 2026, 17, 372. https://doi.org/10.3390/info17040372

