Explainable Semantic Text Relations: A Question-Answering Framework for Comparing Document Content
Abstract
1. Introduction
- A formulation of explainable Semantic Text Relations (STR) grounded in Answerable Question Sets (AQS): We define the semantic relation between two texts through the set-theoretic relationship between their respective sets of answerable questions. Comparing these sets determines whether the texts are semantically equivalent, whether one is included in the other, or whether they only partially overlap. Because each difference in answerability corresponds to a specific piece of missing or added information, this approach provides a clear, operational, and explainable definition of STR based on LLM-driven answerability (a minimal sketch of the resulting decision rule appears after this contribution list).
- A first STR benchmark dataset: We construct and release a synthetic dataset of text pairs labeled with fine-grained semantic relations and corresponding question sets, which can be used to build and evaluate explainable STR models.
- Evaluation of discriminative and generative STR classifiers: We train and compare multiple approaches for directly classifying semantic text relations from text pairs, including zero-shot and few-shot prompting with large generative models, supervised transformer-based classifiers, and traditional machine-learning models.
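To make the AQS-based decision rule from the first contribution concrete, the following Python sketch classifies a text pair from already-computed answerable question sets. It is illustrative only: the function and variable names are ours, the answerability judgements are assumed to come from an upstream LLM step, and the paper's released implementation may differ.

```python
def str_relation(aqs_a: set, aqs_b: set) -> str:
    """Classify the semantic relation between two texts given their
    Answerable Question Sets (AQS): equal sets -> EQUIVALENCE, strict
    containment in either direction -> INCLUSION, a non-empty
    intersection otherwise -> SEMANTIC OVERLAP."""
    if aqs_a == aqs_b:
        return "EQUIVALENCE"
    if aqs_a < aqs_b or aqs_b < aqs_a:  # proper subset in either direction
        return "INCLUSION"
    if aqs_a & aqs_b:
        return "SEMANTIC OVERLAP"
    return "DISJOINT"  # no shared answerable questions; outside the three labels studied here

# Example using the FDNY datapoint of Appendix A: the full paraphrase answers
# Q1-Q4, while the text with the answer to Q4 removed answers only Q1-Q3.
full_paraphrase = {"Q1", "Q2", "Q3", "Q4"}
reduced_text = {"Q1", "Q2", "Q3"}
print(str_relation(full_paraphrase, full_paraphrase))  # EQUIVALENCE
print(str_relation(full_paraphrase, reduced_text))     # INCLUSION
```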
2. Prior Work
2.1. Textual Semantic Relations
2.2. Answerability
- SQuAD 2.0 (2018) extends SQuAD 1.1 with roughly 50K unanswerable questions. These negatives are adversarially written so that an answerable-looking question may have no answer in the paragraph, and systems must predict “no answer” (in BERT-style extractive models, typically by scoring a span at the [CLS] token) when appropriate. SQuAD 2.0 remains a primary testbed for extractive QA with answerability (a minimal answerability check in this style is sketched after this list).
- Natural Questions (NQ, 2019): A large-scale dataset from Google pairing real search queries with Wikipedia pages. Annotators mark long and short answers or label a question NULL if no answer is found. Kwiatkowski et al. [39] report that about 50.5% of sampled queries had no relevant passage at all, and another ~14.7% had only partial answers, which they label as unanswerable. (On the official splits, roughly one-third of training examples are answerable.) NQ’s format also includes yes/no classification for questions answerable only with a Boolean, adding another dimension to answerability.
- NewsQA (2017): QA over CNN news articles. Questions were written using only article summaries, so many questions end up unanswerable in the full article. Trischler et al. [40] collected ~100K QA pairs; later analyses [36] note that a significant fraction (~13–14% in train) are “nil” questions with no answer span. This makes NewsQA a valuable benchmark for nil-aware extraction models.
- BoolQ (2019): A crowdsourced yes/no reading-comprehension dataset. Each question has a paragraph and a binary (yes/no) answer. By design, every question is answerable with a yes-or-no answer. However, BoolQ is often viewed as requiring answer verification or inference (i.e., entailment) rather than span extraction. It demonstrates that answerability can be viewed as a classification problem (yes/no) given context, related to but distinct from “no-answer” detection.
- QuAC (2018) and CoQA (2019): Conversational QA datasets. In QuAC, a student asks free-form questions about a hidden Wikipedia paragraph, and the teacher answers with spans or “n/a” (no answer). Many QuAC questions are unanswerable or require dialog context. CoQA similarly includes some unanswerable turns (annotated as “unknown”). These tasks stress models’ ability to track context and declare “no answer” when the conversation’s current question cannot be answered from the text.
- TyDi QA (2020): A multilingual info-seeking QA dataset across 11 languages. In its primary tasks, systems must select the passage containing the answer (or NULL if none exists) and then produce an answer span or a yes/no response. Thus, TyDi explicitly trains models to output NULL or no answer when needed. Its “gold passage” variant, however, follows SQuAD’s convention by discarding questions that are unanswerable from the given passage.
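These datasets all frame answerability as a decision an extractive QA model can make. As one concrete illustration (not the setup used in this paper, which relies on LLM-driven answerability judgements), the sketch below uses a publicly available SQuAD 2.0-style model from Hugging Face; the checkpoint name and confidence threshold are illustrative assumptions.

```python
from transformers import pipeline

# Any SQuAD 2.0-style checkpoint that can predict "no answer" will do;
# deepset/roberta-base-squad2 is a commonly used public model (an assumption
# for illustration, not a model used in this paper).
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

def is_answerable(question: str, passage: str, threshold: float = 0.5) -> bool:
    """Judge answerability: True if the model extracts a non-empty answer
    span with confidence above the (illustrative) threshold."""
    result = qa(question=question, context=passage, handle_impossible_answer=True)
    return bool(result["answer"]) and result["score"] >= threshold

passage = ("The FDNY employs approximately 11,080 uniformed firefighters "
           "and over 3300 uniformed EMTs and paramedics.")
print(is_answerable("How many firefighters does the FDNY employ?", passage))  # expected: True
print(is_answerable("What is the FDNY's annual budget?", passage))            # expected: False
```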
2.3. Generation of Synthetic Data for QA and Text Relations Modeling
2.3.1. Synthetic QA Datasets
2.3.2. Synthetic Paraphrase Datasets
2.3.3. Synthetic NLI and Entailment Datasets
2.3.4. Quality Control Strategies
- Filtering by Model Confidence: Use generation probabilities as scores. For example, Shakeri et al. [59] use the likelihood of the generated QA (from BART) to filter out low-confidence pairs. Alberti et al. [48] similarly require that synthetic QA be answerable by a pretrained model in a round-trip check.
- Metric-based Filtering: Compute lexical/semantic metrics between the original and generated text. In paraphrase datasets, thresholds on PINC and BERTScore [55] ensure paraphrases are both diverse and meaning-preserving. For QA generation, answer overlap and language-quality measures (such as BLEU and ROUGE) may be used (a minimal sketch of such threshold-based filtering follows this list).
- Classifier or LLM Evaluation: He et al. [43] advocate using a strong classifier to label synthetic examples: the synthetic text is first generated (or labeled) and then passed through the “best available” task model to obtain pseudo-labels. Jin and Wang [52] train an RL-based selector with an LLM reward to pick high-quality QA pairs, outperforming naive selection. In other words, models themselves act as gatekeepers on generated data.
- Human Verification: When feasible, humans can vet a sample of synthetic data to calibrate filters. Akil et al. [49] use human evaluation to set BERTScore thresholds for Bangla paraphrases. This ensures the chosen thresholds align with actual semantic correctness.
- Mix with Real Data: Crucially, studies note that synthetic examples should supplement rather than replace human data. He et al. [43] and others caution that iterative augmentation should retain original data to avoid feedback loops. As long as the gold data remains in training, synthetic augmentation can steadily improve models without divergence.
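To make the metric-based filtering strategy concrete, the sketch below keeps only candidate paraphrases whose BERTScore F1 against their source exceeds a threshold. It is a minimal illustration assuming the bert-score package; the 0.90 threshold is a placeholder that would in practice be calibrated, for example against human judgements as in Akil et al. [49], and a diversity criterion such as a PINC lower bound could be added alongside it.

```python
from bert_score import score  # pip install bert-score

def filter_paraphrases(sources, candidates, f1_threshold=0.90):
    """Keep (source, candidate, F1) triples whose BERTScore F1 meets the
    threshold, i.e., candidates judged sufficiently meaning-preserving.
    The threshold value is illustrative and should be calibrated."""
    _, _, f1 = score(candidates, sources, lang="en", verbose=False)
    return [
        (src, cand, f.item())
        for src, cand, f in zip(sources, candidates, f1)
        if f.item() >= f1_threshold
    ]

sources = ["The FDNY is the largest municipal fire department in the United States."]
candidates = ["The New York City Fire Department is the biggest municipal fire service in the US."]
print(filter_paraphrases(sources, candidates))
```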
3. Materials and Methods
3.1. Synthetic STR Dataset Construction
3.2. STR Classification Model Training
- (1) Fine-tuning transformer-based cross-encoders, where both texts are fed jointly into a single model that performs sentence-pair classification, and
- (2) Using frozen sentence encoders, where the embeddings of both texts obtained from a pre-trained SBERT model are concatenated and used as features for classical classifiers (a minimal sketch of this second approach follows this list).
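A minimal sketch of the frozen-encoder approach, assuming the sentence-transformers and scikit-learn libraries; the checkpoint name, the plain concatenation of embeddings, and the logistic-regression settings are illustrative assumptions rather than the exact configuration used in the paper.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Frozen sentence encoder: embeddings are computed once and never fine-tuned.
# The checkpoint name is an illustrative choice.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def pair_features(texts_a, texts_b):
    """Encode both texts with the frozen encoder and concatenate the embeddings."""
    emb_a = encoder.encode(texts_a, convert_to_numpy=True)
    emb_b = encoder.encode(texts_b, convert_to_numpy=True)
    return np.concatenate([emb_a, emb_b], axis=1)

# Toy training pairs; in practice these come from the synthetic STR dataset.
train_a = ["Text A of the first pair.", "Text A of the second pair."]
train_b = ["Text B of the first pair.", "Text B of the second pair."]
train_labels = ["EQUIVALENCE", "INCLUSION"]

clf = LogisticRegression(max_iter=1000)
clf.fit(pair_features(train_a, train_b), train_labels)
print(clf.predict(pair_features(["A new text A."], ["A new text B."])))
```

A common refinement, mirroring how SBERT-style pair classifiers are usually trained, is to append the element-wise difference and product of the two embeddings to the concatenated feature vector.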
4. Results
5. Conclusions, Limitations, and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| STR | Semantic Text Relations |
| AQS | Answerable Question Set |
| QA | Question answering |
| NLP | Natural Language Processing |
| LLM | Large Language Model |
| RTE | Recognizing Textual Entailment |
| NLI | Natural Language Inference |
| SNLI | Stanford Natural Language Inference (corpus) |
| MultiNLI | Multi-Genre Natural Language Inference (corpus) |
| SICK | Sentences Involving Compositional Knowledge (dataset) |
| GLUE | General Language Understanding Evaluation (benchmark) |
| MNLI | Multi-Genre Natural Language Inference (task within GLUE) |
| QQP | Quora Question Pairs (task within GLUE) |
| MRPC | Microsoft Research Paraphrase Corpus |
| LSTM | Long Short-Term Memory |
| BiLSTM | Bidirectional Long Short-Term Memory |
| BERT | Bidirectional Encoder Representations from Transformers |
| RoBERTa | Robustly optimized BERT approach |
| ALBERT | A Lite BERT |
Appendix A. Datapoint Example
Appendix A.1. Source Paragraph
- The New York City Fire Department (FDNY) provides fire protection, technical rescue, response to various hazards, and emergency medical services throughout the five boroughs of New York City. It is the largest municipal fire department in the United States and the second largest in the world, after the Tokyo Fire Department. The FDNY employs approximately 11,080 uniformed firefighters and over 3300 uniformed EMTs and paramedics. The department is often referred to as New York’s Bravest.
Appendix A.2. Questions (AQS)
- What are the primary responsibilities and operational areas of the FDNY?
- How does the FDNY’s size compare to other fire departments in the U.S. and globally?
- What is the approximate number of uniformed personnel in the FDNY?
- What is the well-known motto or nickname associated with the department?
Appendix A.3. Full Paraphrase
- The Fire Department of New York (FDNY) is responsible for fire suppression, technical rescue operations, initial response to biological, chemical, and radiological incidents, and emergency medical care throughout all five boroughs of New York City. As the largest municipal fire department in the United States, the FDNY ranks second globally, behind only the Tokyo Fire Department. The department has a workforce of around 11,080 uniformed firefighters and more than 3300 uniformed emergency medical technicians and paramedics. The department is commonly known by its motto, “New York’s Bravest”.
Appendix A.4. Synthetically Generated Text with Answer to Q4 Removed
- Removed question: What is the well-known motto or nickname associated with the department?
- Generated text:
- The Fire Department of New York (FDNY) is responsible for fire suppression, technical rescue operations, initial response to biological, chemical, and radiological incidents, and emergency medical care throughout all five boroughs of New York City. As the largest municipal fire department in the United States, the FDNY ranks second globally, behind only the Tokyo Fire Department. The department has a workforce of around 11,080 uniformed firefighters and more than 3300 uniformed emergency medical technicians and paramedics.
Appendix B. Prompts for STR Classification Using Generative Models
Appendix B.1. Zero-Shot Prompt
- You are a language expert tasked with identifying the semantic relation between two texts. The possible relations are:
- EQUIVALENCE—Both texts express the same information. INCLUSION—one text contains all the information in the other, plus additional content. SEMANTIC OVERLAP—the texts have partial semantic overlap, but neither fully includes the other.
- Text A: “{TEXT A}” Text B: “{TEXT B}”
- What is the semantic relation between Text A and Text B?
- Answer with one of: “EQUIVALENCE”, “INCLUSION”, or “SEMANTIC OVERLAP”.
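As an illustration of how this zero-shot template can be used in practice, the sketch below fills the placeholders and queries a chat-style model through the OpenAI Python client. The model name, temperature, and output handling are assumptions for illustration; the paper's actual inference settings may differ.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = """You are a language expert tasked with identifying the semantic relation between two texts. The possible relations are:
EQUIVALENCE—Both texts express the same information.
INCLUSION—one text contains all the information in the other, plus additional content.
SEMANTIC OVERLAP—the texts have partial semantic overlap, but neither fully includes the other.

Text A: "{text_a}"
Text B: "{text_b}"

What is the semantic relation between Text A and Text B?
Answer with one of: "EQUIVALENCE", "INCLUSION", or "SEMANTIC OVERLAP"."""

def classify_str(text_a: str, text_b: str, model: str = "gpt-4o") -> str:
    """Fill the zero-shot template and return the model's label string."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(text_a=text_a, text_b=text_b)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(classify_str(
    "The FDNY serves all five boroughs of New York City.",
    "New York City's fire department operates across the five boroughs.",
))
```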
Appendix B.2. Few-Shot Prompt
- You are a language expert tasked with identifying the semantic relation between two texts. The possible relations are:
- EQUIVALENCE—Both texts express the same information. INCLUSION—one text contains all the information in the other, plus additional content. SEMANTIC OVERLAP—the texts have partial semantic overlap, but neither fully includes the other.
- Example 1: Text A: “The Eiffel Tower is located in Paris and attracts millions of tourists every year.” Text B: “Many tourists visit the Eiffel Tower in Paris annually.” Answer: INCLUSION
- Example 2:
- Text A: “Photosynthesis occurs in plant leaves using sunlight, water, and carbon dioxide.” Text B: “The process of photosynthesis in plants uses water, CO₂, and sunlight in leaves.” Answer: EQUIVALENCE
- Example 3: Text A: “The collapse of mortgage-backed securities triggered the 2008 financial crisis.” Text B: “The Great Depression was caused by a stock market crash in 1929.” Answer: SEMANTIC OVERLAP
- Now, determine the relation in the following example:
- Text A: “{TEXT A}” Text B: “{TEXT B}”
- Answer:
Appendix C. Confusion Matrices for STR Classification Models

References
- Mama, E.; Sheri, L.; Aperstein, Y.; Apartsin, A. From Fuzzy Speech to Medical Insight: Benchmarking LLMs on Noisy Patient Narratives. arXiv 2025, arXiv:2509.11803. [Google Scholar] [CrossRef]
- Aperstein, Y.; Cohen, Y.; Apartsin, A. Generative AI-Based Platform for Deliberate Teaching Practice: A Review and a Suggested Framework. Educ. Sci. 2025, 15, 405. [Google Scholar] [CrossRef]
- Zha, Y.; Yang, Y.; Li, R.; Hu, Z. AlignScore: Evaluating factual consistency with a unified alignment function. arXiv 2023, arXiv:2305.16739. [Google Scholar] [CrossRef]
- Zhang, S.; Wan, D.; Cattan, A.; Klein, A.; Dagan, I.; Bansal, M. QAPyramid: Fine-grained Evaluation of Content Selection for Text Summarization. arXiv 2024, arXiv:2412.07096. [Google Scholar] [CrossRef]
- You, Z.; Guo, Y. PlainQAFact: Retrieval-augmented Factual Consistency Evaluation Metric for Biomedical Plain Language Summarization. arXiv 2025, arXiv:2503.08890. [Google Scholar]
- Dagan, I.; Glickman, O.; Magnini, B. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop; Springer: Berlin/Heidelberg, Germany, 2006; pp. 177–190. [Google Scholar]
- Poliak, A. A survey on recognizing textual entailment as an NLP evaluation. arXiv 2020, arXiv:2010.03061. [Google Scholar]
- Bowman, S.R.; Angeli, G.; Potts, C.; Manning, C.D. A large annotated corpus for learning natural language inference. arXiv 2015, arXiv:1508.05326. [Google Scholar] [CrossRef]
- Dolan, B.; Brockett, C. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), Jeju Island, Republic of Korea, 9–16 October 2005. [Google Scholar]
- Cer, D.; Diab, M.; Agirre, E.; Lopez-Gazpio, I.; Specia, L. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. arXiv 2017, arXiv:1708.00055. [Google Scholar]
- Glickman, O.; Dagan, I.; Koppel, M. A probabilistic classification approach for lexical textual entailment. In Proceedings of the AAAI Conference on Artificial Intelligence, Marina Del Rey, CA, USA, 1–2 June 2005; pp. 1050–1055. [Google Scholar]
- Mehdad, Y.; Negri, M.; Federico, M. Towards cross-lingual textual entailment. In Proceedings of the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, CA, USA, 1–6 June 2010; pp. 321–324. [Google Scholar]
- MacCartney, B.; Manning, C.D. Modeling semantic containment and exclusion in natural language inference. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), Manchester, UK, 18–22 August 2008; pp. 521–528. [Google Scholar]
- Clark, P.; Harrison, P. An Inference-Based Approach to Recognizing Entailment. In Notebook Papers and Results, Text Analysis Conference (TAC); The Boeing Company: Seattle, WA, USA, 2009; pp. 63–72. Available online: https://tac.nist.gov/publications/2009/participant.papers/Boeing.proceedings.pdf (accessed on 25 October 2025).
- MacCartney, B. Natural Language Inference. Ph.D. Thesis, Stanford University, Stanford, CA, USA, 2009. [Google Scholar]
- Wang, S.; Jiang, J. A compare-aggregate model for matching text sequences. arXiv 2016, arXiv:1611.01747. [Google Scholar] [CrossRef]
- Rada, R.; Mili, H.; Bicknell, E.; Blettner, M. Development and application of a metric on semantic nets. IEEE Trans. Syst. Man Cybern. 1989, 19, 17–30. [Google Scholar] [CrossRef]
- Baroni, M.; Bernardi, R.; Do, N.-Q.; Shan, C.-C. Entailment above the word level in distributional semantics. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, 23–27 April 2012; pp. 23–32. [Google Scholar]
- Williams, A.; Nangia, N.; Bowman, S.R. A broad-coverage challenge corpus for sentence understanding through inference. arXiv 2017, arXiv:1704.05426. [Google Scholar]
- Marelli, M.; Bentivogli, L.; Baroni, M.; Bernardi, R.; Menini, S.; Zamparelli, R. SemEval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), Dublin, Ireland, 23–24 August 2014; pp. 1–8. [Google Scholar]
- Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv 2018, arXiv:1804.07461. [Google Scholar]
- Conneau, A.; Kiela, D.; Schwenk, H.; Barrault, L.; Bordes, A. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 670–680. [Google Scholar]
- Wang, S.; Jiang, J. Learning natural language inference with LSTM. arXiv 2015, arXiv:1512.08849. [Google Scholar]
- Parikh, A.P.; Täckström, O.; Das, D.; Uszkoreit, J. A decomposable attention model for natural language inference. arXiv 2016, arXiv:1606.01933. [Google Scholar] [CrossRef]
- Chen, Q.; Zhu, X.; Ling, Z.H.; Wei, S.; Jiang, H.; Inkpen, D. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1657–1668. [Google Scholar]
- Mueller, J.; Thyagarajan, A. Siamese recurrent architectures for learning sentence similarity. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 2786–2792. [Google Scholar]
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv 2019, arXiv:1908.10084. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. XLNet: Generalized autoregressive pretraining for language understanding. Adv. Neural Inf. Process. Syst. 2019, 32, 5753–5763. [Google Scholar]
- Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv 2019, arXiv:1909.11942. [Google Scholar]
- Rajpurkar, P.; Jia, R.; Liang, P. Know what you don’t know: Unanswerable questions for SQuAD. arXiv 2018, arXiv:1806.03822. [Google Scholar]
- Choi, E.; He, H.; Iyyer, M.; Yatskar, M.; Yih, W.-T.; Choi, Y.; Liang, P.; Zettlemoyer, L. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 2174–2184. [Google Scholar]
- Clark, C.; Lee, K.; Chang, M.-W.; Kwiatkowski, T.; Collins, M.; Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 2924–2936. [Google Scholar]
- Clark, J.H.; Choi, E.; Collins, M.; Garrette, D.; Kwiatkowski, T.; Nikolaev, V.; Palomaki, J. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Trans. Assoc. Comput. Linguist. 2020, 8, 454–470. [Google Scholar] [CrossRef]
- Kundu, S.; Ng, H.T. A NIL-aware answer extraction framework for question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4243–4252. [Google Scholar]
- Kamath, A.; Jia, R.; Liang, P. Selective question answering under domain shift. arXiv 2020, arXiv:2006.09462. [Google Scholar] [CrossRef]
- Hu, M.; Peng, Y.; Huang, Z.; Qiu, X.; Wei, F.; Zhou, M. Read+verify: Machine reading comprehension with unanswerable questions. In Proceedings of the AAAI’19: AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 6529–6537. [Google Scholar]
- Kwiatkowski, T.; Palomaki, J.; Redfield, O.; Collins, M.; Parikh, A.; Alberti, C.; Petrov, S. Natural questions: A benchmark for question answering research. Trans. Assoc. Comput. Linguist. 2019, 7, 453–466. [Google Scholar] [CrossRef]
- Trischler, A.; Wang, T.; Yuan, X.; Harris, J.; Sordoni, A.; Bachman, P.; Suleman, K. NewsQA: A machine comprehension dataset. arXiv 2016, arXiv:1611.09830. [Google Scholar]
- Yatskar, M. A qualitative comparison of CoQA, SQuAD 2.0 and QuAC. arXiv 2018, arXiv:1809.10735. [Google Scholar]
- Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S.R.; Smith, N.A. Annotation artifacts in natural language inference data. arXiv 2018, arXiv:1803.02324. [Google Scholar] [CrossRef]
- He, X.; Nassar, I.; Kiros, J.; Haffari, G.; Norouzi, M. Generate, annotate, and learn: NLP with synthetic text. Trans. Assoc. Comput. Linguist. 2022, 10, 826–842. [Google Scholar] [CrossRef]
- Takahashi, K.; Omi, T.; Arima, K.; Ishigaki, T. Training generative question-answering on synthetic data obtained from an instruct-tuned model. arXiv 2023, arXiv:2310.08072. [Google Scholar]
- Hosseini, M.J.; Petrov, A.; Fabrikant, A.; Louis, A. A synthetic data approach for domain generalization of NLI models. arXiv 2024, arXiv:2402.12368. [Google Scholar] [CrossRef]
- Namboori, A.; Mangale, S.; Rosenbaum, A.; Soltan, S. GeMQuAD: Generating multilingual question answering datasets from large language models using few shot learning. arXiv 2024, arXiv:2404.09163. [Google Scholar]
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Zettlemoyer, L. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv 2019, arXiv:1910.13461. [Google Scholar]
- Alberti, C.; Andor, D.; Pitler, E.; Devlin, J.; Collins, M. Synthetic QA corpora generation with roundtrip consistency. arXiv 2019, arXiv:1906.05416. [Google Scholar] [CrossRef]
- Akil, A.; Sultana, N.; Bhattacharjee, A.; Shahriyar, R. BanglaParaphrase: A high-quality Bangla paraphrase dataset. arXiv 2022, arXiv:2210.05109. [Google Scholar]
- Puri, R.; Spring, R.; Patwary, M.; Shoeybi, M.; Catanzaro, B. Training question answering models from synthetic data. arXiv 2020, arXiv:2002.09599. [Google Scholar] [CrossRef]
- Hemati, H.H.; Beigy, H. Consistency training by synthetic question generation for conversational question answering. arXiv 2024, arXiv:2404.11109. [Google Scholar] [CrossRef]
- Jin, J.; Wang, H. Select high-quality synthetic QA pairs to augment training data in MRC under the reward guidance of generative language models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 14543–14554. [Google Scholar]
- Moon, S.R.; Fan, J. How you ask matters: The effect of paraphrastic questions to BERT performance on a clinical SQuAD dataset. In Proceedings of the 3rd Clinical Natural Language Processing Workshop, Online, 19 November 2020; pp. 111–116. [Google Scholar]
- Wei, J.; Zou, K. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv 2019, arXiv:1901.11196. [Google Scholar] [CrossRef]
- Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating text generation with BERT. arXiv 2019, arXiv:1904.09675. [Google Scholar]
- Chen, D.L.; Dolan, W.B. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; pp. 190–200. [Google Scholar]
- McCoy, R.T.; Pavlick, E.; Linzen, T. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. arXiv 2019, arXiv:1902.01007. [Google Scholar] [CrossRef]
- Tang, L.; Laban, P.; Durrett, G. Minicheck: Efficient fact-checking of llms on grounding documents. arXiv 2024, arXiv:2404.10774. [Google Scholar]
- Shakeri, S.; dos Santos, C.; Zhu, H.; Ng, P.; Nan, F.; Wang, Z.; Nallapati, R.; Xiang, B. End-to-end synthetic data generation for domain adaptation of question answering systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 5445–5460. [Google Scholar]
- Original (A): The Apollo program was the third United States human spaceflight program carried out by NASA, which accomplished the first human landing on the Moon from 1969 to 1972. First conceived during Dwight D. Eisenhower’s administration as a follow-up to Project Mercury, which put the first Americans in space, Apollo was later dedicated to President John F. Kennedy’s national goal of “landing a man on the Moon and returning him safely to the Earth” by the end of the 1960s.

| | Question | Answer |
|---|---|---|
| Q1 | Which space agency was responsible for the Apollo program? | NASA |
| Q2 | What was the goal of the Apollo program? | Landing a man on the Moon and returning him safely to the Earth |
| Q3 | When did the Apollo Moon landings take place? | from 1969 to 1972 |
| Q4 | Who initiated the Apollo program? | Dwight D. Eisenhower |
| Q5 | Which program came before Apollo? | Project Mercury |

- Paraphrase (B): Emerging during the administration of Dwight D. Eisenhower, as the project intended to succeed Project Mercury—which had placed the first Americans in space—the Apollo program became the third U.S. human spaceflight initiative carried out by NASA. It was later aligned with President John F. Kennedy’s objective of “landing a man on the Moon and returning him safely to the Earth” before the 1960s ended, ultimately achieving the first human lunar landing between 1969 and 1972.
- Rephrased with only Q5 unanswerable (C): The Apollo program was NASA’s third crewed spaceflight initiative and achieved humanity’s first Moon landing, operating from 1969 to 1972. Initially formulated during the administration of President Dwight D. Eisenhower, it was later aligned with President John F. Kennedy’s objective of “landing a man on the Moon and returning him safely to Earth” before the close of the 1960s.
- Rephrased with only Q1 unanswerable (D): The Apollo program was the third U.S. crewed spaceflight effort and achieved the first human landing on the Moon between 1969 and 1972. It was initially formulated during President Dwight D. Eisenhower’s tenure as a successor to Project Mercury, which sent the first Americans into orbit. The program was later aligned with President John F. Kennedy’s objective of “landing a man on the Moon and returning him safely to the Earth” before the decade’s end.
- Labels: Equivalence: A = B; Inclusion: C ≤ A, C ≤ B, D ≤ A, D ≤ B; Overlap: C ⋈ D
| Model | Accuracy | Macro-F1 |
|---|---|---|
| RoBERTa-base | 0.614 | 0.446 |
| DistilBERT | 0.606 | 0.441 |
| SBERT + Logistic Regression | 0.604 | 0.478 |
| SBERT + Random Forest | 0.591 | 0.529 |
| Longformer-base | 0.555 | 0.238 |
| GPT-4.1 Zero-Shot | 0.339 | 0.341 |
| GPT-4.1 Few-Shot | 0.313 | 0.307 |
| GPT-4.1 Zero-Shot, CoT | 0.408 | 0.409 |
| GPT-4.1 Few-Shot CoT | 0.406 | 0.398 |
| GPT-4o Zero-Shot | 0.354 | 0.355 |
| GPT-4o Few-Shot | 0.311 | 0.265 |
| GPT-4o Zero-Shot CoT | 0.399 | 0.406 |
| GPT-4o Few-Shot CoT | 0.279 | 0.220 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).