Knowledge-Aware Arabic Question Generation: A Transformer-Based Framework
Abstract
1. Introduction
- A linguistically adaptive QG system based on the mT5 architecture specifically optimized for Arabic’s morphological complexity and low-resource educational contexts, combining deep learning with domain-specific knowledge representation.
- An answer-aware re-ranking mechanism that enhances semantic precision by jointly modeling contextual relationships between source text, generated questions, and target answers, improving educational relevance.
- A dual-stage generation pipeline that produces both subjective questions (via beam search) and high-quality MCQs with semantically plausible distractors (via top-k sampling), addressing the full spectrum of assessment needs.
- New Arabic NLP resources including a curated MCQ dataset with expert-validated distractors and open-source implementation, facilitating research in Arabic educational technology and adaptive learning systems.
2. Background
2.1. Linguistic and Computational Challenges in Arabic QG
- Morphological richness and ambiguity.
- Arabic words exhibit extensive inflectional variation based on gender, number, tense, and grammatical case. For instance, the verb كتب (“he wrote”) can morph into كتبت (“she wrote”), كتبوا (“they wrote”), or سيكتبون (“they will write”). This morphological richness increases the complexity of generating grammatically correct and semantically aligned questions, particularly when dealing with out-of-domain inputs or informal text. Additionally, Arabic commonly omits diacritics (short vowels), leading to lexical ambiguity. For example, the word علم can be interpreted as “flag,” “knowledge,” or “he knew,” depending on context. Such ambiguities are especially problematic for QG systems, as minor misinterpretations in token meaning can render the generated question irrelevant or misleading.
- Syntactic flexibility and typographic noise.
- Arabic allows variable word order structures such as Subject–Verb–Object (SVO), Verb–Subject–Object (VSO), and even less common forms like Verb–Object–Subject (VOS). QG models must therefore learn to generalize over multiple syntactic patterns and maintain clarity regardless of word order. Further complicating the task, common typing errors—such as mistaking the name علي (“Ali”) for the preposition على (“on”)—can confuse syntactic parsers or named entity recognition components, thereby reducing the reliability of generated output.
- Domain limitations and lack of evaluation standards.
- The field of Arabic question generation is hampered by a scarcity of dedicated datasets and resources, a recurring challenge across the Arabic NLP landscape that affects tasks from text summarization [9] to specialized applications like legal judgment support systems [10]. Existing corpora are often restricted to narrow domains, such as religious texts or news articles, and typically lack comprehensive annotations for question–answer pairs and distractors. This limitation hinders model generalizability and precludes the establishment of a unified evaluation benchmark. While metrics like BLEU and METEOR are commonly adopted, they may not adequately assess the grammatical and contextual fidelity essential for educational content. Furthermore, transformer-based models, especially in multilingual settings, are susceptible to cascading errors and semantic drift during the generation of longer or more complex questions—a problem attributed to exposure bias. Although techniques like imitation learning offer promising mitigation strategies [11], they are not explored in the present study.
2.2. Challenges in Distractor Generation
- Semantic relatedness.
- Distractors must belong to the same semantic class or domain as the correct answer. For example, in the question “What is the capital of Saudi Arabia?”, distractors like “Cairo” or “Beirut” are suitable, while “Washington, DC” may be semantically related but contextually inappropriate, as it is not an Arab city.
- Contextual appropriateness.
- Even semantically related distractors must be contextually plausible. In questions about historical events—such as the Battle of Qadisiyyah (معركة القادسية)—appropriate distractors would reference similar historical battles (e.g., معركة حطين، معركة اليرموك). Distractors unrelated to the regional or historical context would confuse learners or trivialize the assessment.
- Grammatical consistency.
- Distractors must match the grammatical features of the correct answer, including gender, number, and case. For example, in the MCQ — هو عاصمة المغرب؟ (“— is the capital of Morocco?”), the correct answer الرباط (Rabat) is masculine. Thus, valid distractors must also be masculine nouns (e.g., الخرطوم, الكويت). Using feminine options like القاهرة (Cairo) introduces grammatical inconsistency and signals the correct answer.
2.3. T5 Architecture
3. Related Work
3.1. Traditional Approaches
3.2. Deep-Learning Approaches
4. Compiling the Dataset
4.1. Dataset for Plain Question
4.2. Dataset for Multiple-Choice Questions
- Arabic Facebook QA dataset.
- This dataset, known as multilingual question answering (MLQA) [43], serves as a benchmark for evaluating cross-lingual question answering performance. It contains over 5000 extractive question–answer instances per language (12,000 in English) in the SQuAD format. MLQA covers seven languages: English, Arabic, German, Spanish, Hindi, Vietnamese, and Simplified Chinese. The dataset exhibits high parallelism, with each QA instance parallel across four languages on average.
- ARCD dataset.
- The Arabic Reading Comprehension Dataset (ARCD) consists of 1395 questions collected from Wikipedia articles through crowd-sourcing [44].
- Qudrat Arabic Reading Comprehension Test.
- Qudrat is an Arabic reading comprehension test designed for high school students, serving as a crucial requirement for acceptance into Saudi universities (https://www.qeyas2030.com/public/categories/82/show, accessed on 9 June 2024). We collected approximately 200 instances from various websites and applications.
- Mawdoo3.com.
- This is a prominent website hosting a vast collection of Arabic articles (https://mawdoo3.com/). We have collected around 700 articles from this source and constructed our dataset using them.
- Madinah Arabic.
- This is a website offering online courses and tests to aid students worldwide in learning Arabic (https://www.madinaharabic.com/). We have obtained around 100 reading comprehension tests from Madinah Arabic to contribute to our dataset.
4.3. Data Preprocessing
- Combining the datasets. We merge the datasets together into a unified form, specifically CSV format, for each model.
- Removing unwanted columns. We eliminate unnecessary columns such as “start answer,” “answers,” “id,” “QID,” and “SiteID” from the dataset.
- Discarding question–answer pairs with empty answers.
- Removing special characters, such as newline characters and unclosed brackets, from all columns within the dataset.
- Removing reference symbols within the contexts, such as those used as footnote markers.
- Unifying the Alif character into a standardized form throughout the dataset.
- Removing all diacritics from the dataset.
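A minimal sketch of these normalization steps is shown below; the function name and the exact character ranges are our illustration, not the paper's released code:

```python
import re

def preprocess(text: str) -> str:
    """Apply the cleaning steps described above to one text field."""
    text = re.sub(r"[\n\r]+", " ", text)         # remove newline characters
    text = re.sub(r"[أإآ]", "ا", text)            # unify Alif variants into bare Alif
    text = re.sub(r"[\u064B-\u0652]", "", text)  # strip diacritics (tashkeel)
    return text.strip()
```

The diacritics are removed by matching the Unicode range U+064B–U+0652, which covers the Arabic short-vowel and sukun marks.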
5. Our Proposed Approach
5.1. Proposed System Design
- Distractor 1:
- “June 28, 1914”;
- Distractor 2:
- “July 28, 1944”;
- Distractor 3:
- “November 11, 1918”.
5.2. Training the Model
5.3. Generating Optimal Questions
- Answer-aware question generation:
- We will use a specialized approach to generate answer-aware questions that relate to both the context and a given answer. This helps to avoid generic or irrelevant questions. The format for this will be Answer: <answer> Context: <context>. Algorithm 1 implements the above. It starts by loading the Arabic SQuAD 1.1 dataset and a pre-trained mT5 model for the purpose of training the model. Once trained, the model is used to generate three questions based on a given context C and target answer A, employing one of the decoding algorithms referenced later in this section. A procedure called QG_reranking then calculates a similarity score between each generated question, the input paragraph, and the expected answer, and stores the questions along with their scores in a list. The list is subsequently sorted by decreasing similarity score, and the question with the highest score is returned as the final output.
Algorithm 1: Generating and re-ranking questions using the mT5 large model.
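The generate-then-rerank loop of Algorithm 1 can be sketched as follows; `generate_questions` and `score` are placeholder parameters standing in for the fine-tuned mT5 model and the similarity scorer, so both names are our assumptions:

```python
from typing import Callable, List

def qg_rerank(context: str, answer: str,
              generate_questions: Callable[[str, str, int], List[str]],
              score: Callable[[str, str, str], float],
              n_candidates: int = 3) -> str:
    """Generate candidate questions, then return the highest-scoring one."""
    candidates = generate_questions(context, answer, n_candidates)
    # Pair each candidate with its similarity score against (context, answer).
    scored = [(score(q, context, answer), q) for q in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1]
```

In the paper's pipeline the scorer role is filled by the AraELECTRA-based validator described later; here any callable returning a float works.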
- Question and answer-aware distractor generation:
- Following the methodology of the answer-aware questions, we also aim to create valid distractors that are related to the context, question, and answer. The format will be Question: <question> Answer: <answer> Context: <context>. The following pseudocode describes how we implement the “QG with MCQs” model. Algorithm 2 begins by loading a specifically constructed dataset and a pre-trained mT5 base model, with the aim of training this model on the loaded data. After training, the model is employed to generate three distractors based on the context, the target answer, and a question, which could be the question generated by the previous model. Together, these form four choices: one correct answer and three distractors. The generation of these distractors and the question makes use of one of the decoding algorithms listed below. Finally, the algorithm returns the list of generated distractors.
Algorithm 2: Generating questions with multiple-choice questions (MCQs).
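The input templates for the two models, as specified above, can be encoded as simple formatting helpers (function names are ours):

```python
def format_qg_input(answer: str, context: str) -> str:
    """Prompt for the answer-aware question generation model."""
    return f"Answer: {answer} Context: {context}"

def format_distractor_input(question: str, answer: str, context: str) -> str:
    """Prompt for the question- and answer-aware distractor model."""
    return f"Question: {question} Answer: {answer} Context: {context}"
```
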
- Decoding algorithms:
- This involves selecting decoding algorithms that significantly impact the generated output. Through these methods, we will assess which algorithms perform best after training and using our models. We explored the following methods:
- Greedy decoding algorithm. Uses token probabilities to determine the next target word by selecting the token with the highest probability at each step. This approach is the default decoding strategy of the generate function in the T5 and mT5 architectures. However, a significant issue with greedy decoding is that it can overlook high-probability tokens that are hidden behind tokens with lower probabilities [47].
- Beam search decoding algorithm. Addresses the limitation found in the greedy algorithm, where high probability tokens may be missed. Unlike the greedy approach, beam search explores multiple potential paths in the output, reducing the risk of overlooking hidden high-probability tokens. This exploration allows it to select the best option from the various paths, thereby overcoming the deficiency in the greedy method.
- Top-k sampling decoding algorithm. A variant of the greedy decoding method. Instead of selecting just the token with the highest probability, it considers the top k tokens with the highest probabilities. The probability mass is then redistributed among these selected k tokens, providing a broader consideration of likely options.
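The greedy and top-k choices can be illustrated over a single step's next-token distribution (a toy token-to-probability mapping); beam search, which instead tracks the B best partial sequences per step, is harder to show in a few lines and is omitted:

```python
def greedy_pick(probs: dict) -> str:
    # Greedy: always take the single most probable token.
    return max(probs, key=probs.get)

def top_k_redistribute(probs: dict, k: int) -> dict:
    # Top-k: keep the k most probable tokens and renormalize
    # (redistribute) the probability mass among them.
    top = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in top)
    return {t: probs[t] / total for t in top}
```

Sampling then draws the next token from the renormalized top-k distribution rather than always taking the argmax.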
- QG re-ranking:
- To mitigate error propagation from QG to downstream multiple-choice question formation, we introduce a knowledge-driven re-ranking mechanism. Our pipeline first generates multiple question candidates using the fine-tuned Arabic QG model, then employs AraELECTRA [48], a pre-trained Arabic QA model, as a knowledge-based validator. This model evaluates each candidate’s semantic alignment with the input context and target answer, assigning a relevance score. The highest-scoring question is selected, ensuring optimal contextual coherence and educational validity. This step not only filters out low-quality outputs but also explicitly leverages external knowledge (via QA) to enhance robustness—a key advantage over end-to-end QG systems.
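One plausible instantiation of the relevance score (our assumption, not necessarily the paper's exact formula) is the SQuAD-style token-overlap F1 between the answer the QA validator extracts for a candidate question and the target answer:

```python
from collections import Counter

def token_f1(predicted: str, target: str) -> float:
    """SQuAD-style token-overlap F1 between two answer strings."""
    pred, gold = predicted.split(), target.split()
    overlap = sum((Counter(pred) & Counter(gold)).values())  # clipped matches
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A candidate question whose QA-extracted answer closely matches the target answer receives a high score and survives the re-ranking.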
5.4. Evaluation Metrics
- BLEU
- (BiLingual Evaluation Understudy) is a string-matching algorithm widely utilized for assessing the quality of text in machine translation tasks [49]. It is language independent, making it a versatile measure. Beyond translation evaluation, BLEU has also been employed in various text-to-text generation tasks such as text simplification [50] and QG [6,30,35,37,38,40,45,51]. Its application extends to distractor generation as well, as seen in [46]. To formally present the BLEU metric [52], we start with some definitions. Let $p_n$ denote the n-gram precision, defined as the ratio of accurately predicted n-grams to the total number of predicted n-grams. For example, if the target sentence is “He eats an apple,” and the predicted sentence is “He ate an apple,” then $p_1 = 3/4$, while $p_2 = 1/3$, and $p_3 = 0$. We define the geometric average precision score (GAP) over n-grams up to order $N$ as $\mathrm{GAP}(N) = \exp\left(\sum_{n=1}^{N} \frac{1}{N} \log p_n\right) = \left(\prod_{n=1}^{N} p_n\right)^{1/N}$. The BLEU metric calculation varies based on the n-gram length: BLEU-1 employs unigram precision, BLEU-2 computes the geometric mean of unigram and bigram precision, BLEU-3 incorporates the geometric mean of unigram, bigram, and trigram precision, and so on for higher n-grams. In the proposed architecture of the first model, the system is designed to map the input sequence to the expected question, so the generated question serves as the model’s prediction. Similarly, in the second model configuration, the target output is the appropriate distractor for each given prompt, with the model generating a predicted distractor accordingly.
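The n-gram precisions and their geometric average can be computed as follows (the brevity-penalty factor of full BLEU is omitted from this sketch):

```python
from collections import Counter
from math import prod

def ngram_precision(pred: list, ref: list, n: int) -> float:
    """Clipped n-gram precision of a predicted token list against a reference."""
    pred_ngrams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matched = sum((pred_ngrams & ref_ngrams).values())  # clipped counts
    return matched / max(sum(pred_ngrams.values()), 1)

def bleu_gap(pred: str, ref: str, max_n: int = 2) -> float:
    """Geometric average of n-gram precisions up to order max_n."""
    p, r = pred.split(), ref.split()
    precisions = [ngram_precision(p, r, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    return prod(precisions) ** (1 / max_n)
```

For the running example, BLEU-2's geometric average is $(3/4 \cdot 1/3)^{1/2} = 0.5$.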
- METEOR
- (Metric for Evaluation of Translation with Explicit ORdering) [53] is a metric that evaluates the quality of machine-translated text against a reference text. It primarily focuses on aligning words and phrases between the reference and the machine-translated text, and employs precision, recall, and F-score in its assessment. Importantly, it takes into account the sequence of words and phrases, as well as the use of synonyms. It also gauges both the adequacy and fluency of the machine-translated text: adequacy measures the extent to which the translation retains the original meaning, while fluency evaluates its grammaticality and readability. This metric has gained extensive usage in machine translation research and various NLP text-to-text tasks, e.g., [6,30,40,45], as in the case of our models. Its popularity stems from its language independence, making it applicable for evaluating translations in any language, and it has shown a strong correlation with human judgment in determining translation quality; for example, it achieved correlations of 0.347 and 0.331 on Arabic and Chinese datasets, respectively (https://huggingface.co/spaces/evaluate-metric/meteor, accessed on 24 October 2024). The METEOR metric is given by $\mathrm{METEOR} = F_{mean}(1 - \mathrm{Penalty})$, where $F_{mean} = \frac{10PR}{R + 9P}$ is the harmonic mean of unigram precision $P$ and recall $R$, with recall weighted nine times more heavily. The Penalty is a factor ranging from 0 to 1 that penalizes fragmented matches; it is computed as $\mathrm{Penalty} = 0.5\left(\frac{\#\text{chunks}}{\#\text{matched unigrams}}\right)^{3}$, where a chunk is a maximal run of contiguous matched unigrams shared by the candidate and the reference.
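Given the unigram match statistics, the METEOR score reduces to a few lines; this is a sketch of the standard formulation from the match counts, not a full aligner with synonym matching:

```python
def meteor_score(matches: int, pred_len: int, ref_len: int, chunks: int) -> float:
    """METEOR from unigram match statistics (matches, lengths, chunk count)."""
    if matches == 0:
        return 0.0
    precision = matches / pred_len
    recall = matches / ref_len
    # Harmonic mean with recall weighted 9x more than precision.
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Fragmentation penalty: fewer, longer chunks are penalized less.
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)
```

For instance, 3 matched unigrams in one contiguous chunk, with both sentences 4 tokens long, yields $F_{mean} = 0.75$ and a small penalty of $0.5 \cdot (1/3)^3$.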
- ROUGE-L
- (Recall-Oriented Understudy for Gisting Evaluation—Longest Common Subsequence) [54] is a widely adopted metric for evaluating the similarity between generated and reference texts based on the longest common subsequence (LCS). With a reference $X$ of length $m$ and a candidate $Y$ of length $n$, it computes precision ($P_{lcs}$), recall ($R_{lcs}$), and F-score ($F_{lcs}$) using the following formulas: $P_{lcs} = \frac{LCS(X,Y)}{n}$, $R_{lcs} = \frac{LCS(X,Y)}{m}$, and $F_{lcs} = \frac{(1+\beta^{2})\,P_{lcs}R_{lcs}}{R_{lcs} + \beta^{2}P_{lcs}}$, where $\beta$ weights recall relative to precision. ROUGE-L has been widely adopted in NLG tasks such as question generation and summarization, where it evaluates generated text quality by measuring both lexical overlap and sentence-level structural similarity [55]. Several Arabic NLP studies, including [5,30,32,34], have employed ROUGE-L for assessment. However, given the morphological richness and syntactic complexity of Arabic, ROUGE-L alone may fail to adequately capture semantic equivalence; the standard metric is sensitive to surface-form variations rather than underlying meaning [56]. Emerging metrics like LEMMA-ROUGE have been proposed specifically to address this issue by leveraging lemmatization to reduce morphological noise [56]. However, as it is a recent innovation not yet established in the Arabic NLP literature, we prioritize metrics that allow direct comparison with prior art. Consequently, to ensure both robustness and comparative fairness, we combine ROUGE-L with a suite of established complementary metrics—including BLEU for n-gram precision, METEOR for synonymy and inflection awareness, and BERTScore for contextual semantic evaluation. We identify the exploration of morphologically aware metrics like LEMMA-ROUGE as a valuable direction for future research.
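ROUGE-L as defined above can be computed directly from a standard LCS dynamic-programming table:

```python
def lcs_length(x: list, y: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            dp[i + 1][j + 1] = dp[i][j] + 1 if xi == yj else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """ROUGE-L F-score; beta weights recall relative to precision."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)
```

With `beta=1.0` this reduces to the balanced harmonic mean of LCS precision and recall.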
- BERTScore
- (Bidirectional Encoder Representations from Transformers Score) [57] is an evaluation metric that measures semantic similarity between generated and reference texts using contextual embeddings from pretrained transformer models like BERT. Unlike traditional metrics relying on exact word matches, it captures meaning through two steps:
- Embedding generation. A pretrained transformer such as BERT converts the reference text $x$ and candidate text $\hat{x}$ into contextual embedding sequences $\langle x_1, \ldots, x_k \rangle$ and $\langle \hat{x}_1, \ldots, \hat{x}_l \rangle$.
- Similarity computation. For each token pair, the cosine similarity between embeddings is calculated. The metric then computes recall, precision, and F-score via greedy matching: $R_{BERT} = \frac{1}{|x|}\sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^{\top}\hat{x}_j$, $P_{BERT} = \frac{1}{|\hat{x}|}\sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^{\top}\hat{x}_j$, and $F_{BERT} = \frac{2\,P_{BERT}R_{BERT}}{P_{BERT} + R_{BERT}}$.
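The greedy-matching computation can be illustrated with toy embedding vectors standing in for BERT outputs:

```python
def cosine(u, v):
    """Cosine similarity of two plain-list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def bertscore_f1(ref_emb, cand_emb):
    """F-score from greedy matching: each token pairs with its most similar counterpart."""
    recall = sum(max(cosine(r, c) for c in cand_emb) for r in ref_emb) / len(ref_emb)
    precision = sum(max(cosine(c, r) for r in ref_emb) for c in cand_emb) / len(cand_emb)
    return 2 * precision * recall / (precision + recall)
```

In practice the vectors come from a multilingual encoder such as mBERT; here identical toy embeddings yield a perfect score.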
While standard metrics like BLEU and METEOR provide a baseline for evaluating lexical overlap, they fall short of assessing the semantic consistency and cultural relevance of generated text, a critical limitation for morphologically rich languages like Arabic. To overcome this, we employ BERTScore, a metric that leverages contextual embeddings to evaluate semantic similarity. BERTScore offers three key advantages: (a) it captures deeper semantic relationships beyond surface-level n-gram matching; (b) it demonstrates a higher correlation with human judgment; and (c) its multilingual capability, via models like mBERT, makes it well suited for Arabic evaluation. To our knowledge, prior research in Arabic QG has not adopted BERTScore, creating a gap in robust semantic evaluation. By integrating it into our framework, we enable a direct, meaning-based comparison between our model and the baseline approaches (fine-tuned AraT5 and translated English QG models). This provides a novel and necessary dimension of assessment, ensuring that evaluations account for fluency, semantic accuracy, and contextual appropriateness, thereby offering a more holistic view of model performance.
- Human Evaluation
- Automatic metrics are based on mathematical algorithms that compare machine-generated text with a set of reference texts and compute a score based on how closely the two match. While these metrics are useful for quickly evaluating large datasets, they have limitations: they do not always capture the nuances of language that matter to human readers. For example, an automated metric might score a machine-generated summary as highly similar to a reference summary, yet a human reader might still find it difficult to understand or lacking in important details. This is where human evaluation comes in. Human evaluation involves having human judges read machine-generated text and score it against criteria relevant to the task at hand, such as accuracy, coherence, and readability. While human evaluation is more time-consuming and resource-intensive than automatic metrics, it offers a more accurate and nuanced assessment of text-to-text tasks. By having human judges evaluate the quality of machine-generated text in addition to the automatic metrics, we gain a better understanding of how well the output performs in real-world scenarios. Crowdsourcing offers cost-effectiveness, quicker feedback, and diverse perspectives, but faces challenges like quality-control issues, inconsistent judgments, and misunderstood instructions; for example, ref. [58] highlighted significant quality challenges when using platforms like Clickworker and MTurk for Arabic tasks. To facilitate a meticulous and feasible manual evaluation, we employ a randomly selected sample of 100 records from the test set. This sample size provides a robust and manageable subset for a detailed qualitative assessment by human judges, balancing statistical relevance with practical constraints.
Three independent evaluators, all native Arabic speakers with college-level proficiency, will assess the semantic similarity of the model-generated text to the ground truth. They will use a three-point scale: a score of 0 for “not similar”, 0.50 for “somewhat similar”, and 1 for “fully similar”. This approach ensures a reliable and nuanced measure of output quality that complements our automated metrics. We will report the results of the human experts using two metrics: the overall average score and inter-rater reliability (measured using Fleiss’ kappa). The overall average score is the mean score across all records. Fleiss’ kappa ($\kappa$) evaluates the level of agreement among multiple raters, accounting for agreement by chance. It is calculated as $\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$, where $\bar{P}$ is the mean observed agreement across items and $\bar{P}_e$ is the expected agreement by chance.
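Fleiss' kappa can be computed from a ratings matrix in which each row is an item and each column counts how many raters assigned that item to a given category:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i][j] = number of raters putting item i in category j.
    Assumes every item is rated by the same number of raters."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # Mean observed per-item agreement P-bar.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    # Chance agreement P-bar_e from the marginal category proportions.
    totals = [sum(row[j] for row in ratings) for j in range(len(ratings[0]))]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```

Perfect within-item agreement across varied categories yields $\kappa = 1$, while agreement at the chance level yields $\kappa = 0$.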
6. Results and Discussion
6.1. Task 1: Standalone Question Generation
6.2. Task 2: Multiple-Choice Question Generation
6.3. Task 3: Analysis of Decoding Strategies
7. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
AI | Artificial Intelligence |
ARCD | Arabic Reading Comprehension Dataset |
BLEU | BiLingual Evaluation Understudy |
JSON | JavaScript Object Notation |
LCS | Longest Common Subsequence |
MCQs | Multiple-Choice Questions |
METEOR | Metric for Evaluation of Translation with Explicit ORdering |
MLQA | Multi-lingual Question Answering |
MT | Machine Translation |
mT5 | Multi-lingual T5 |
NER | Named Entity Recognition |
NLP | Natural Language Processing |
NLU | Natural Language Understanding |
QA | Question-Answering |
QG | Question Generation |
ROUGE | Recall-Oriented Understudy for Gisting Evaluation |
SQuAD | Stanford Question Answering Dataset |
SVO | Subject-Verb-Object |
T5 | Google’s Text-to-Text Transfer Transformer |
VOS | Verb-Object-Subject |
VSO | Verb-Subject-Object |
References
- Neirotti, R.A. The importance of asking questions and doing things for a reason. Braz. J. Cardiovasc. Surg. 2021, 36, 1–2. [Google Scholar] [CrossRef]
- Thalheimer, W. Learning Benefits of Questions; A Work-Learning Research Publication: Somerville, MA, USA, 2003. [Google Scholar]
- Al-Khatib, H. Automatic Questions Generation from Arabic Content (in Arabic). Master’s Thesis, Higher Institute for Applied Sciences and Technology, Damascus, Syria, 2019. [Google Scholar]
- Al-Hasan, A. Measurement and Evaluation Course (Lectures 6th and 7th): Essay and Objective Tests. 2019. Available online: https://hama-univ.edu.sy/newsites/education/wp-content/uploads/2020/05/xxx.pdf (accessed on 7 February 2024).
- Alwaneen, T.H.; Azmi, A.M. Stacked dynamic memory-coattention network for answering why-questions in Arabic. Neural Comput. Appl. 2024, 36, 8867–8883. [Google Scholar] [CrossRef]
- Lopez, L.E.; Cruz, D.K.; Cruz, J.C.B.; Cheng, C. Transformer-based end-to-end question generation. arXiv 2020, arXiv:2005.01107. [Google Scholar]
- Mannaa, Z.M.; Azmi, A.M.; Aboalsamh, H.A. Computer-assisted i‘raab of Arabic sentences for teaching grammar to students. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 8909–8926. [Google Scholar] [CrossRef]
- Azmi, A.M.; Al-Jouie, M.F.; Hussain, M. AAEE - Automated evaluation of students’ essays in Arabic language. Inf. Process. Manag. 2019, 56, 1736–1752. [Google Scholar] [CrossRef]
- Azmi, A.; Al-Thanyyan, S. Ikhtasir—A user selected compression ratio Arabic text summarization system. In Proceedings of the 2009 International Conference on Natural Language Processing and Knowledge Engineering, Dalian, China, 24–27 September 2009; pp. 1–7. [Google Scholar]
- Almuzaini, H.A.; Azmi, A.M. TaSbeeb: A judicial decision support system based on deep learning framework. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 101695. [Google Scholar] [CrossRef]
- Pozzi, A.; Incremona, A.; Tessera, D.; Toti, D. Mitigating exposure bias in large language model distillation: An imitation learning approach. Neural Comput. Appl. 2025, 37, 12013–12029. [Google Scholar] [CrossRef]
- Gao, Y.; Bing, L.; Li, P.; King, I.; Lyu, M.R. Generating distractors for reading comprehension questions from real examinations. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6423–6430. [Google Scholar]
- Tay, Y.; Tuan, L.A.; Hui, S.C. Multi-range reasoning for machine comprehension. arXiv 2018, arXiv:1803.09074. [Google Scholar] [CrossRef]
- Ebel, R.L.; Frisbie, D.A. Essentials of Educational Measurement, 5th ed.; Prentice-Hall: Hoboken, NJ, USA, 1991. [Google Scholar]
- Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 483–498. [Google Scholar]
- Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A massively multilingual pre-trained text-to-text transformer. arXiv 2021, arXiv:2010.11934v3. [Google Scholar]
- Wolfe, J.H. Automatic question generation from text-an aid to independent study. In Proceedings of the ACM SIGCSE-SIGCUE Technical Symposium on Computer Science and Education, Anaheim, CA, USA, 12–13 February 1976; pp. 104–112. [Google Scholar]
- Rus, V.; Graesser, A.C. Workshop Report: The Question Generation Task and Evaluation Challenge; Technical report; Institute for Intelligent Systems: Stuttgart, Germany, 2009. [Google Scholar]
- Zhang, R.; Guo, J.; Chen, L.; Fan, Y.; Cheng, X. A review on question generation from natural language text. ACM Trans. Inf. Syst. (TOIS) 2021, 40, 1–43. [Google Scholar] [CrossRef]
- Belyanova, M.; Chernenkiy, V.; Kaganov, Y.; Gapanyuk, Y. Using hybrid intelligent information system approach for text question generation. In Proceedings of the CEUR Workshop Proceedings—Russian Advances in Fuzzy Systems and Soft Computing: Selected Contributions to the 8th International Conference on Fuzzy Systems, Soft Computing and Intelligent Technologies (FSSCIT 2020), Smolensk, Russia, 29 June–1 July 2020; Volume 2782, pp. 194–201. [Google Scholar]
- Mostow, J.; Chen, W. Generating Instruction Automatically for the Reading Strategy of Self-Questioning. In Proceedings of the 2009 Conference on Artificial Intelligence in Education: Building Learning Systems that Care: From Knowledge Representation to Affective Modelling, Brighton, UK, 6–10 July 2009; pp. 465–472. [Google Scholar]
- Kunichika, H.; Katayama, T.; Hirashima, T.; Takeuchi, A. Automated question generation methods for intelligent English learning systems and its evaluation. In Proceedings of the International Conference on Computers in Education (ICCE’04), Melbourne, Australia, 30 November–3 December 2004; Volume 670. [Google Scholar]
- Huang, Y.; He, L. Automatic generation of short answer questions for reading comprehension assessment. Nat. Lang. Eng. 2016, 22, 457–489. [Google Scholar] [CrossRef]
- Bousmaha, K.Z.; Chergui, N.H.; Mbarek, M.S.A.; Belguith, L.H. AQG: Arabic Question Generator. Rev. D’Intelligence Artif. 2020, 34, 721–729. [Google Scholar] [CrossRef]
- Elbasyouni, M.; Abdelrazek, E.; Saad, A. Building a system based on natural languages processing to automatic question generation from Arabic texts. Int. J. Curr. Res. 2014, 6, 7608–7613. [Google Scholar]
- Kaur, J.; Bathla, A.K. Automatic question generation from Hindi text using hybrid approach. In Proceedings of the Second International Conference on Science Technology and Management, Sliema, Malta, 17 August 2015. [Google Scholar]
- Swali, D.; Palan, J.; Shah, I. Automatic question generation from paragraph. Int. J. Adv. Eng. Res. Dev. 2016, 3, 73–78. [Google Scholar] [CrossRef]
- Kriangchaivech, K.; Wangperawong, A. Question generation by transformers. arXiv 2019, arXiv:1909.05017. [Google Scholar] [CrossRef]
- Ouahrani, L.; Bennouar, D. Attentional Seq2Seq Model for Arabic Opinion Question Generation. In Proceedings of the International Symposium on Modelling and Implementation of Complex Systems; Springer Nature: Cham, Switzerland, 2024; pp. 112–126. [Google Scholar]
- Alhashedi, S.; Suaib, N.M.; Bakri, A. Arabic automatic question generation using transformer model. In Proceedings of the AIP Conference Proceedings; AIP Publishing: Melville, NY, USA, 2024; Volume 2991. [Google Scholar]
- Bonifacio, L.; Jeronymo, V.; Abonizio, H.Q.; Campiotti, I.; Fadaee, M.; Lotufo, R.; Nogueira, R. mMARCO: A multilingual version of the MS MARCO passage ranking dataset. arXiv 2021, arXiv:2108.13897. [Google Scholar]
- Lafkiar, S.; En Nahnahi, N. An end-to-end transformer-based model for Arabic question generation. Multimed. Tools Appl. 2024, 84, 22009–22023. [Google Scholar] [CrossRef]
- Lafkiar, S.; Nahnahi, N.E. An Arabic question generation system based on a shared BERT-base encoder-decoder architecture. Math. Model. Comput. 2024, 11, 763–772. [Google Scholar] [CrossRef]
- Rahim, M.; Khoja, S.A. Sawaal: A Framework for Automatic Question Generation in Urdu. In Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024), Trento, Italy, 19–20 October 2024; pp. 139–148. [Google Scholar]
- Nagoudi, E.M.B.; Elmadany, A.; Abdul-Mageed, M. AraT5: Text-to-Text Transformers for Arabic Language Generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 628–647. [Google Scholar]
- de Fitero-Dominguez, D.; Garcia-Cabot, A.; Garcia-Lopez, E. Automated multiple-choice question generation in Spanish using neural language models. Neural Comput. Appl. 2024, 36, 18223–18235. [Google Scholar] [CrossRef]
- Vachev, K.; Hardalov, M.; Karadzhov, G.; Georgiev, G.; Koychev, I.; Nakov, P. Leaf: Multiple-choice question generation. In Advances in Information Retrieval, Proceedings of the European Conference on Information Retrieval (ECIR 2022), Stavanger, Norway, 10–14 April 2022; Springer: Cham, Switzerland, 2022; pp. 321–328. [Google Scholar]
- Abdel-Galil, H.; Mokhtar, M.; Doma, S. Automatic question generation model based on deep learning approach. Int. J. Intell. Comput. Inf. Sci. 2021, 21, 110–123. [Google Scholar] [CrossRef]
- Montgomerie, A. Generating Questions Using Transformers. 2020. Available online: https://amontgomerie.github.io/2020/07/30/question-generator.html (accessed on 24 October 2024).
- Qiu, J.; Xiong, D. Generating Highly Relevant Questions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 5983–5987. [Google Scholar]
- Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. arXiv 2016, arXiv:1606.05250. [Google Scholar]
- Mozannar, H.; Hajal, K.E.; Maamary, E.; Hajj, H. Neural Arabic question answering. arXiv 2019, arXiv:1906.05394. [Google Scholar] [CrossRef]
- Lewis, P.; Oğuz, B.; Rinott, R.; Riedel, S.; Schwenk, H. MLQA: Evaluating cross-lingual extractive question answering. arXiv 2019, arXiv:1910.07475. [Google Scholar]
- ARCD, H.F.D. Arabic Reading Comprehension Dataset. 2021. Available online: https://huggingface.co/datasets/arcd (accessed on 10 July 2023).
- Liu, B. Neural question generation based on Seq2Seq. In Proceedings of the 2020 5th International Conference on Mathematics and Artificial Intelligence, Chengdu, China, 10–13 April 2020; pp. 119–123. [Google Scholar]
- Hernandez, L.; Randall, S.; Nazeri, A. Question Generator. 2020. Available online: http://cs230.stanford.edu/projects_fall_2020/reports/55771015.pdf (accessed on 10 July 2023).
- von Platen, P. How to Generate Text: Using Different Decoding Methods for Language Generation with Transformers. 2020. Available online: https://huggingface.co/blog/how-to-generate (accessed on 5 August 2023).
- Antoun, W.; Baly, F.; Hajj, H. AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding. arXiv 2020, arXiv:cs.CL/2012.15516. [Google Scholar]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
- Al-Thanyyan, S.S.; Azmi, A.M. Simplification of Arabic text: A hybrid approach integrating machine translation and transformer-based lexical model. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 101662. [Google Scholar] [CrossRef]
- Kim, Y.; Lee, H.; Shin, J.; Jung, K. Improving neural question generation using answer separation. In Proceedings of the AAAI conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6602–6609. [Google Scholar]
- Doshi, K. Foundations of NLP Explained—Bleu Score and WER Metrics. 2021. Available online: https://towardsdatascience.com/foundations-of-nlp-explained-bleu-score-and-wer-metrics-1a5ba06d812b (accessed on 10 July 2023).
- Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
- Lin, C.Y. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
- Wang, B.; Wang, X.; Tao, T.; Zhang, Q.; Xu, J. Neural question generation with answer pivot. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 9138–9145. [Google Scholar]
- Al-Numai, A.; Azmi, A. LEMMA-ROUGE: An Evaluation Metric for Arabic Abstractive Text Summarization. Indones. J. Comput. Sci. 2023, 12, 470–481. [Google Scholar] [CrossRef]
- Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proceedings of the 8th International Conference on Learning Representations (ICLR2020), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- Almuzaini, H.A.; Azmi, A.M. An unsupervised annotation of Arabic texts using multi-label topic modeling and genetic algorithm. Expert Syst. Appl. 2022, 203, 117384. [Google Scholar] [CrossRef]
- Nakhleh, S.; Mustafa, A.M.; Najadat, H. AraT5GQA: Arabic Question Answering model using automatic generated dataset. In Proceedings of the 2024 15th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 13–15 August 2024; pp. 1–5. [Google Scholar]
- Tami, M.; Ashqar, H.I.; Elhenawy, M. Automated question generation for science tests in Arabic language using NLP techniques. In International Conference on Intelligent Systems, Blockchain, and Communication Technologies; Springer: Cham, Swtizerland, 2024; pp. 274–285. [Google Scholar]
- Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; Hon, H.W. Unified language model pre-training for natural language understanding and generation. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 13063–13075. [Google Scholar]
- Xiao, D.; Zhang, H.; Li, Y.; Sun, Y.; Tian, H.; Wu, H.; Wang, H. ERNIE-GEN: An enhanced multi-flow pre-training and fine-tuning framework for natural language generation. In Proceedings of the 29th International Conference on International Joint Conferences on Artificial Intelligence, Virtual, 19–26 August 2021; pp. 3997–4003. [Google Scholar]
- Chomphooyod, P.; Suchato, A.; Tuaycharoen, N.; Punyabukkana, P. English grammar multiple-choice question generation using Text-to-Text Transfer Transformer. Comput. Educ. Artif. Intell. 2023, 5, 100158. [Google Scholar] [CrossRef]
- Rodriguez-Torrealba, R.; Garcia-Lopez, E.; Garcia-Cabot, A. End-to-end generation of multiple-choice questions using text-to-text transfer transformer models. Expert Syst. Appl. 2022, 208, 118258. [Google Scholar] [CrossRef]
- Lai, G.; Xie, Q.; Liu, H.; Yang, Y.; Hovy, E. RACE: Large-scale ReAding Comprehension Dataset From Examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing; Palmer, M., Hwa, R., Riedel, S., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 785–794. [Google Scholar]
- Al-Ssulami, A.M.; Alsorori, R.S.; Azmi, A.M.; Aboalsamh, H. Improving coronary heart disease prediction through machine learning and an innovative data augmentation technique. Cogn. Comput. 2023, 15, 1687–1702. [Google Scholar] [CrossRef]
| Field | Arabic | English |
|---|---|---|
| Question | في أي عام ولد ألبرت أينشتاين؟ | In which year was Albert Einstein born? |
| Context | بالألمانية (Albert Einstein) (14 مارس 1979 – 18 أبريل 1955) عالم فيزياء ألماني المولد، (حيث تخلى عن الجنسية الألمانية لاحقا) سويسري وأمريكي الجنسية، من أبوين يهوديين، وهو يشتهر بأب النسبية كونه واضع النسبية الخاصة والنسبية العامة الشهيرتين اللتين كانتا اللبنة الأولى للفيزياء النظرية الحديثة، ولقد حاز في عام 1921 على جائزة نوبل في الفيزياء عن ورقة بحثية عن التأثير الكهروضوئي، ضمن ثلاثمائة ورقة علمية أخرى له في تكافؤ المادة والطاقة وميكانيكا الكم وغيرها، وأدت استنتاجاته المبرهنة إلى تفسير العديد من الظواهر العلمية التي فشلت الفيزياء الكلاسيكية في إثباتها | In German, Albert Einstein (14 March 1879–18 April 1955) was a German-born physicist (later renouncing his German citizenship), with Swiss and American citizenship. He was born to Jewish parents and is renowned as the father of relativity, having formulated both the special and general theories of relativity, which laid the foundation for modern theoretical physics. In 1921, he was awarded the Nobel Prize in Physics for his research on the photoelectric effect, among his other 300 scientific papers on the equivalence of matter and energy, quantum mechanics, and more. His groundbreaking findings led to the interpretation of numerous scientific phenomena that classical physics had failed to explain. |
| Correct answer | 1879 | |
| Distractor 1 | 1955 | |
| Distractor 2 | 1921 | |
| Distractor 3 | 1979 | |
| Task | Dataset | Source (Domain Notes) | Size | Options |
|---|---|---|---|---|
| QG w/o MCQs | Arabic SQuAD 1.1 [42] | Translated from English (Wikipedia/General) | 48,344 | Question, Context, and Answer |
| QG with MCQs | MLQA [43] | Wikipedia articles (Wikipedia/General) | 5852 | Question, Context, Answer, and Distractors |
| | ARCD [44] | Arabic Wikipedia (Wikipedia/General) | 1395 | |
| | Qudrat | Arabic reading comprehension from Qiyas (Exams) | | |
| | Mawdoo3.com | Arabic articles from https://mawdoo3.com/ (News/Articles) | | |
| | Madinah Arabic | Arabic reading comprehension test (Exams) | | |
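For illustration, a record of the kind summarized above (question, context, answer, and distractors) can be held in a small Python dataclass. This is a minimal sketch; the field names are our own illustrative choices, not the datasets' actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class MCQRecord:
    """One multiple-choice record; fields mirror the dataset columns above."""
    question: str
    context: str
    answer: str
    distractors: list[str] = field(default_factory=list)

    def options(self) -> list[str]:
        """Correct answer plus distractors, i.e., all choices shown to a student."""
        return [self.answer, *self.distractors]


# Example populated from the Einstein MCQ shown earlier.
record = MCQRecord(
    question="In which year was Albert Einstein born?",
    context="Albert Einstein (14 March 1879 - 18 April 1955) was a German-born physicist ...",
    answer="1879",
    distractors=["1955", "1921", "1979"],
)
print(len(record.options()))  # 4
```

A real pipeline would shuffle `options()` before presentation so the correct answer is not always first.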
Task 1: Generating plain questions (correct answer: 1879)

| Approach | Generated question (Arabic) | English |
|---|---|---|
| Greedy | في اي عام ولدت ابنة اينشتاين؟ | In which year was Einstein’s daughter born? |
| Top-k | في اي عام ولدت اينشتاين؟ | In which year was Einstein born (fem.)? |
| Beam search | في اي عام ولد اينشتاين؟ | In which year was Einstein born? |

Task 2: Generating multiple-choice questions (question: في اي عام ولد اينشتاين؟ “In which year was Einstein born?”; correct answer: 1879)

| Approach | Generated distractors |
|---|---|
| Greedy | 1955, 1879, 1880 |
| Top-k | 1955, 1930, 1919 |
| Beam search | 1879, 1880, ابريل (April) 1955 |
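The contrast between the decoding strategies above can be seen on a toy next-token distribution: greedy decoding always commits to the single most probable token, while top-k sampling restricts the vocabulary to the k most probable tokens and then samples among them. This is a minimal sketch with an invented probability table, not the paper's mT5 decoder.

```python
import random


def greedy(probs: dict[str, float]) -> str:
    # Deterministic: always pick the highest-probability token.
    return max(probs, key=probs.get)


def top_k(probs: dict[str, float], k: int, rng: random.Random) -> str:
    # Keep only the k most probable tokens, renormalize, then sample.
    best = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in best)
    return rng.choices(best, weights=[probs[t] / total for t in best])[0]


# Toy distribution over candidate year tokens (invented for illustration).
probs = {"1879": 0.45, "1955": 0.30, "1921": 0.15, "1979": 0.10}
rng = random.Random(0)

print(greedy(probs))  # 1879
print({top_k(probs, 3, rng) for _ in range(100)})  # some subset of the 3 best tokens
```

Beam search, used for the subjective questions, generalizes greedy decoding by keeping the top-b partial sequences at every step instead of one; top-k's extra randomness is what makes it useful for producing varied distractors.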
| Model | Language | Dataset | BLEU-4 | METEOR | ROUGE-L | BERTScore |
|---|---|---|---|---|---|---|
| Wang et al. [55] | English | SQuAD | 16.42 | 18.95 | 41.87 | – |
| Dong et al. [61] | English | SQuAD | 22.12 | 25.06 | 51.07 | – |
| Xiao et al. [62] | English | SQuAD | 25.40 | 26.92 | 52.84 | – |
| Abdel-Galil et al. [38] | English | SQuAD | 11.30 | – | – | – |
| Rahim and Khoja [34] | Urdu | UQuAD | 23.32 | 36.47 | 53.66 | – |
| Alhashedi et al. [30] | Arabic | mMACRO | 19.12 | 23.00 | 51.99 | – |
| Lafkiar and En Nahnahi [32] | Arabic | Arabic SQuAD & ARCD | 20.51 | 24.04 | 44.01 | – |
| Lafkiar and Nahnahi [33] | Arabic | ARQGData | 20.29 | 30.73 | 38.54 | – |
| Ouahrani and Bennouar [29] | Arabic | Adapted-cQA-MD | 10.03 | 11.61 | 15.08 | – |
| Nagoudi et al. [35] | Arabic | ARGEN-QG | 16.99 | – | – | – |
| AraT5-base (fine-tuned) | Arabic | Arabic SQuAD | 20.61 | 20.37 | 4.80 | 74.50 |
| Our model | Arabic | Arabic SQuAD | 27.49 | 25.18 | 4.24 | 76.34 |
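For readers unfamiliar with the BLEU-4 column above, a self-contained sketch of the metric follows: clipped n-gram precisions up to order 4, combined by a geometric mean, with a brevity penalty for short candidates. Production evaluations would normally use a standard package such as sacrebleu or NLTK rather than this simplified single-reference version.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    # Multiset of all n-grams in the token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    # Clipped n-gram precisions combined by a geometric mean, with a
    # brevity penalty when the candidate is shorter than the reference.
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        c_grams, r_grams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((c_grams & r_grams).values())  # clipped matches
        total = max(sum(c_grams.values()), 1)
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return bp * math.exp(sum(log_precisions) / max_n)


print(round(bleu("in which year was einstein born",
                 "in which year was einstein born"), 2))  # 1.0
```

Scores in the table are scaled by 100, so a BLEU-4 of 27.49 corresponds to 0.2749 from a function like this one.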
| Field | Arabic | English |
|---|---|---|
| Context | كطالب دكتوراه في جامعة ايرلانجن نورمبرج الالمانية ، بدا كارلينز براندنبورغ العمل على ضغط الموسيقى الرقمية في اوائل الثمانينات ، مع التركيز على كيفية ادراك الناس للموسيقى . اكمل اعمال دكتوراه في عام 1989 . تنحدر MP3 مباشرة من OCF و PXFM مما يمثل نتيجة تعاون براندنبورغ يعمل كخبير ما بعد دكتوراه في مختبرات AT&T Bell مع James D. Johnston (JJ) في AT&T Bell Labs مع معهد فراونهوفر للدوائر المتكاملة ، Erlangen ، مع مساهمات صغيرة نسبيا من فرع MP2 من المبرمجين الفرعيين نفسيا الصوتي . في عام 1990 ، اصبح براندنبورغ استاذا مساعدا في Erlangen Nuremberg . بينما كان هناك ، واصل العمل على ضغط الموسيقى مع العلماء في جمعية فراونهوفر في عام 1993 انضم الى موظفي معهد فراونهوفر . | As a doctoral student at the German University of Erlangen-Nuremberg, Karlheinz Brandenburg began working on digital music compression in the early 1980s, focusing on how people perceive music. He completed his doctoral work in 1989. MP3 descends directly from OCF and PXFM, the result of Brandenburg's collaboration, as a postdoctoral researcher, with James D. Johnston (JJ) at AT&T Bell Labs and with the Fraunhofer Institute for Integrated Circuits, Erlangen, with relatively small contributions from the MP2 branch of psychoacoustic sub-band coders. In 1990, Brandenburg became an assistant professor at Erlangen-Nuremberg. While there, he continued to work on music compression with scientists at the Fraunhofer Society; in 1993 he joined the staff of the Fraunhofer Institute. |
| Answer | 1993 | |
| Reference question | متى أنضم براندنبورغ إلى معهد فراونهوفر | When did Brandenburg join the Fraunhofer Institute? |
| Our model GQ | في اي عام انضم كارلينز براندنبورغ الى معهد فراونهوفر | In which year did Karlheinz Brandenburg join the Fraunhofer Institute? |
| AraT5 GQ | في اي عام كان كارلينز براندنبورج استاذا مساعدا في جامعة ساوثامبتون؟ | In which year was Karlheinz Brandenburg an assistant professor at the University of Southampton? |

Observation: The AraT5 model generates an off-topic question unrelated to the core subject matter. In contrast, our model correctly identifies the key entity and generates a question that accurately queries the relevant institution, demonstrating closer semantic alignment with the reference question.
| Field | Arabic | English |
|---|---|---|
| Context | اكثر الامراض المعروفة التي تؤثر على جهاز المناعة نفسه هو الايدز ، وهو عوز المناعة الذي يتميز بقمع الخلايا التائية CD4 المساعد والخلايا المتغصنة والبلاعم من فيروس نقص المناعة البشرية HIV. | The best-known disease that affects the immune system itself is AIDS, an immunodeficiency characterized by the suppression of CD4+ helper T cells, dendritic cells, and macrophages by the human immunodeficiency virus (HIV). |
| Answer | الايدز | AIDS |
| Reference question | ما هو اكثر امراض جهاز المناعة شهرة | What is the best-known disease of the immune system? |
| Our model GQ | ما هو اكثر الامراض معروفة التي تؤثر على جهاز المناعة | What is the best-known disease that affects the immune system? |
| AraT5 GQ | كم عدد الامراض التي تؤثر على جهاز المناعة؟ | How many diseases affect the immune system? |

Observation: The AraT5 model exhibits a failure in task adherence, incorrectly generating a “how many” question that queries quantity, whereas the reference requires a “what is” question targeting conceptual identity.
| Field | Arabic | English |
|---|---|---|
| Context | في عامي 2013 و 2014 ، اخترقت حلقة اختراق روسية اوكرانية عرفت باسم ريسكاتور اجهزة كمبيوتر Target كوربوراتيون في عام 2013 ، وسرقة ما يقرب من 40 مليون بطاقة ائتمان ، ثم اجهزة كمبيوتر هوم ديبوت في عام 2014 ، سرقت ما بين 53 و 56 مليون رقم بطاقة ائتمان . تم تسليم التحذيرات في كلتا الشركتين ، ولكن تم تجاهلها يعتقد ان خروقات الامن المادية باستخدام الات الخروج الذاتي لعبت دورا كبيرا . يقول جيم والتر ، مدير عمليات استخبارات التهديد في شركة مكافي للتقنية الامنية ان البرمجيات الخبيثة المستخدمة هي غير مبتكرة على الاطلاق وغير مثيرة للاهتمام ، مما يعني انه كان من الممكن ايقاف البرامج المضادة للفيروسات بسهولة من قبل برنامج مكافحة الفيروسات الحالي لو كان المسؤولون يردون على التحذيرات . وقد اسفر حجم السرقات عن اهتمام كبير من سلطات الولايات المتحدة وسلطات الولايات المتحدة الفيدرالية وما زال التحقيق جاريا. | In 2013 and 2014, a Russian-Ukrainian hacking ring known as Rescator broke into Target Corporation computers in 2013, stealing roughly 40 million credit cards, and then into Home Depot computers in 2014, stealing between 53 and 56 million credit card numbers. Warnings were delivered at both companies but were ignored; physical security breaches using self-checkout machines are believed to have played a large role. Jim Walter, director of threat intelligence operations at the security technology company McAfee, says the malware used was entirely unsophisticated and uninteresting, meaning the thefts could easily have been stopped by existing antivirus software had administrators responded to the warnings. The scale of the thefts drew considerable attention from US and US federal authorities, and the investigation is still ongoing. |
| Answer | بين 53 و56 مليون. | Between 53 and 56 million. |
| Reference question | كم عدد ارقام بطاقات الائتمان التي سرقت من هوم ديبوت في عام 2014؟ | How many credit card numbers were stolen from Home Depot in 2014? |
| Our model GQ | كم عدد بطاقات الائتمان التي سرقت في عام 2014؟ | How many credit cards were stolen in 2014? |
| AraT5 GQ | في عام 2014 ، ما هي الارقام التي تم العثور عليها في اجهزة كمبيوتر iPod؟ | In 2014, what numbers were found on iPod computers? |

Observation: Our model correctly identifies and extracts the relevant numeric fact, while AraT5 confuses the central entity, yielding an inaccurate and meaningless question.
| Field | Arabic | English |
|---|---|---|
| Context | بالاضافة الى ذلك ، هذه الدورة التمهيدية تعطي الطلاب المزيد من المعلومات لتكملة علم الاحياء العام او التدريب العلمي . كما ان لديها جزئين مختلفين الجزء الاول هو مقدمة للمبادئ الاساسية للمناعة والجزء الثاني هو سلسلة محاضرات موجهة سريريا . من ناحية اخرى ، فان الدورة التدريبية المتقدمة هي دورة اخرى لاولئك الذين يرغبون في توسيع او تحديث فهمهم لعلم المناعة . ينصح للطلاب الذين يرغبون في حضور دورة متقدمة للحصول على خلفية من مبادئ علم المناعة . تتطلب معظم المدارس من الطلاب اتخاذ اجراءات اختيارية في اخرى لاكمال شهاداتهم . درجة الماجستير تتطلب سنتين من الدراسة بعد الحصول على درجة البكالوريوس . بالنسبة لبرنامج الدكتوراه ، يلزم ان يستغرق عامين اضافيين من الدراسة | In addition, this introductory course gives students further information to supplement general biology or scientific training. It has two parts: the first is an introduction to the basic principles of immunology, and the second is a clinically oriented lecture series. The advanced course, on the other hand, is for those who wish to expand or update their understanding of immunology. Students wishing to attend the advanced course are advised to have a background in the principles of immunology. Most schools require students to take electives in other areas to complete their degrees. A master's degree requires two years of study after the bachelor's degree; a doctoral program requires an additional two years of study. |
| Answer | سنتين من الدراسة | Two years of study |
| Reference question | كم يستغرق الحصول على درجة الماجستير عادة؟ | How long does a master's degree usually take? |
| Generated question | كم من الوقت يحتاج الطلاب للحصول على درجة الماجستير؟ | How much time do students need to obtain a master's degree? |
| Field | Arabic | English |
|---|---|---|
| Context | يظهر الانسان ما يخفيه من افكار داخلية عن طريق ظهور بعض الحركات الجسدية، او ما يسميها علماء النفس بالايماءات والايحاءات الجسدية، والتي هي حركات لا ارادية تصدر من الشخص يمكن السيطرة على بعضها، والبعض الاخر لا يمكن اخفاؤه او تجنب ظهوره؛ اذ يدركه بسهولة من لديه علم بلغة الجسد لدى الانسان | A person reveals the inner thoughts they conceal through the appearance of certain bodily movements, or what psychologists call gestures and body cues: involuntary movements, some of which can be controlled while others cannot be hidden or prevented from appearing, since anyone familiar with human body language can easily perceive them. |
| Answer | عن طريق ظهور بعض الحركات الجسدية | Through the appearance of certain bodily movements |
| Reference question | كيف يظهر الانسان مايخفيه من افكار داخليه؟ | How does a person reveal the inner thoughts they conceal? |
| Generated question | كيف يظهر الانسان ما يخفيه من الافكار الداخلية؟ | How does a person reveal the inner thoughts they conceal? |
| Field | Arabic | English |
|---|---|---|
| Context | كرة القدم هي رياضة جماعية تُلعب بين فريقين يتكون كل منهما من أحد عشر لاعباً بكرة مُكوَّرة. يلعب كرة القدم 250 مليون لاعب في أكثر من مائتي دولة حول العالم، فلذلك تكون الرياضة الأكثر شعبية وانتشاراً في العالم. | Football is a team sport played with a spherical ball between two teams of eleven players each. Football is played by 250 million players in more than two hundred countries around the world, making it the most popular and widespread sport in the world. |
| Answer | رياضة جماعية تُلعب بين فريقين يتكون كل منهما من أحد عشر لاعباً | A team sport played between two teams of eleven players each |
| Reference question | ما هو عدد اللاعبين في فرقة كرة القدم؟ | What is the number of players in a football team? |
| Generated question | كرة القدم هي ماذا؟ | Football is what? |
| Model | Language | Dataset | BLEU-4 | METEOR | ROUGE-L | BERTScore |
|---|---|---|---|---|---|---|
| English QG (translated to Arabic) | English | SQuAD | 2.74 | 25.48 | 3.33 | 75.28 |
| Our model | Arabic | Translated SQuAD | 4.46 | 24.91 | 3.00 | 75.56 |
| Model | Dist. | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | Lang. | Dataset |
|---|---|---|---|---|---|---|---|
| Vachev et al. [37] | 1 | 46.37 | – | – | – | English | RACE |
| | 2 | 32.19 | – | – | – | | |
| | 3 | 34.47 | – | – | – | | |
| Chomphooyod et al. [63] | – | 6.5 | – | – | – | English | NAIST Lang-8 learner corpora |
| Rodriguez-Torrealba et al. [64] | – | 14.80 | 7.06 | 3.75 | 2.16 | Spanish | |
| de Fitero-Dominguez et al. [36] | – | 28.56 | 18.97 | 14.27 | 11.34 | Spanish | RACE, Cosmos QA, SciQ |
| Our model | 1 | 20.28 | 17.96 | 17.17 | 16.75 | Arabic | Our dataset |
| | 2 | 19.83 | 17.43 | 16.73 | 16.41 | | |
| | 3 | 19.84 | 17.74 | 16.96 | 16.54 | | |
| Field | Arabic | English |
|---|---|---|
| Context | كوكب الارض هو الوحيد بين كواكب المجموعة الشمسية المعروف بوجود حياة عليه، ترتيبه الثالث في النظام الشمسي ويبعد مسافة 150 مليون كم عن الشمس، يحتاج كوكب الارض الى 365,25 يوم للدوران حول الشمس، نظرا لانه يسير في الفضاء بسرعة 108 الاف كم في الساعة | Planet Earth is the only planet in the solar system known to harbor life. It is third in order in the solar system and lies 150 million km from the Sun. Earth needs 365.25 days to orbit the Sun, as it travels through space at 108,000 km per hour. |
| Answer | الثالث | Third |
| Question | ماهو ترتيب كوكب الارض في النظام الشمسي؟ | What is Earth's position in the solar system? |
| Reference distractors | الثاني | Second |
| | الرابع | Fourth |
| | الأول | First |
| Generated distractors | الثامن | Eighth |
| | الثاني | Second |
| | الرابع | Fourth |
| Field | Arabic | English |
|---|---|---|
| Context | المياه: هي عبارة عن مادة مكونة من عنصري الهيدروجين والاكسجين، قادرة على احلال العديد من المواد الاخرى، وهي من اكثر المركبات ضرورة ووفرة على كوكب الارض، حيث توجد في الطبيعة بحالاتها الغازية، والسائلة، والصلبة، كما انها تتميز بان لا لون لها ولا رائحة. وتعد خاصية استخدام المياه كمادة مذيبة اساسية للكائنات الحية، حيث يعتقد ان بداية نشاة الحياة كانت في المحاليل المائية الموجودة في محيطات العالم، فالمحاليل المائية تلعب دورا مهما في العديد من العمليات الحيوية، خاصة في الدم والعصارة الهضمية وذلك لاتمام العمليات البيولوجية. بالرغم من ان الماء لا يظهر لونا عندما تكون كميته قليلة، الا انه يمتلك لونا داخليا خفيفا يميل للزرقة وذلك بسبب امتصاص خفيف للضوء عند الاطوال الموجية الحمراء. | Water is a substance composed of the two elements hydrogen and oxygen, capable of dissolving many other substances, and is among the most essential and abundant compounds on planet Earth, occurring in nature in its gaseous, liquid, and solid states; it is also colorless and odorless. Water's property as a solvent is essential to living organisms: life is believed to have first arisen in the aqueous solutions of the world's oceans, and aqueous solutions play an important role in many vital processes, especially in blood and digestive juices, for completing biological processes. Although water shows no color in small quantities, it has a slight intrinsic bluish tint due to weak absorption of light at red wavelengths. |
| Answer | الهيدروجين والاكسجين | Hydrogen and oxygen |
| Question | مما يتركب الماء؟ | What is water composed of? |
| Reference distractors | النيتروجين والاكسجين | Nitrogen and oxygen |
| | الهيدروجين والفلوريد | Hydrogen and fluoride |
| | الاكسجين | Oxygen |
| Generated distractors | الفوسفات والحديد | Phosphate and iron |
| | النيتروجين | Nitrogen |
| | انثيلين | Anthilin |
| Field | Arabic | English |
|---|---|---|
| Context | تتكون الثلوج داخل السحب، اي عندما تكون درجات الحرارة منخفضة، من خلال التصاق بلورات الثلج الصغيرة ببعضها البعض، ثم بعد ذلك تتشكل الرقائق الثلجية التي تذوب في درجة اعلى من 0 درجة مئوية في الهواء الرطب، لتعود وتلتصق معا وتشكل رقائق ثلجية اكبر حجما، وبالتالي يصبح وزنها ثقيلا بالقدر الذي يسمح بسقوطها على الارض، ويمكن تعريف الثلج بالانجليزية: Snow بانه قطرات الماء في حالتها الصلبة والمتبلورة في الغلاف الجوي والتي تسقط على الارض. | Snow forms inside clouds, that is, when temperatures are low, through small ice crystals sticking to one another; snowflakes then form, which melt above 0 °C in humid air and then stick together again to form larger snowflakes, whose weight becomes heavy enough for them to fall to the ground. Snow can be defined as water droplets in their solid, crystallized state in the atmosphere that fall to the ground. |
| Answer | تتكون الثلوج داخل السحب، اي عندما تكون درجات الحرارة منخفضة، من خلال التصاق بلورات الثلج الصغيرة ببعضها البعض | Snow forms inside clouds, that is, when temperatures are low, through small ice crystals sticking to one another |
| Question | كيف تتكون الثلوج؟ | How does snow form? |
| Reference distractors | تتكون الثلوج خارج السحب، اي عندما تكون درجات الحرارة منخفضة، من خلال التصاق بلورات الثلج الصغيرة ببعضها البعض | Snow forms outside clouds, that is, when temperatures are low, through small ice crystals sticking to one another |
| | تتكون الثلوج داخل السحب، اي عندما تكون درجات الحرارة مرتفعة، من خلال التصاق بلورات الثلج الصغيرة ببعضها البعض | Snow forms inside clouds, that is, when temperatures are high, through small ice crystals sticking to one another |
| | تتكون الثلوج داخل السحب، اي عندما تكون درجات الحرارة منخفضة، من خلال تنافر بلورات الثلج الصغيرة ببعضها البعض | Snow forms inside clouds, that is, when temperatures are low, through small ice crystals repelling one another |
| Generated distractors | يتكون الثلوج داخل السحب وذلك عندما تكون درجات الحرارة منخفضة، من خلال التصاق بلورات الثلج الصغيرة ببعضها البعض | Snow forms inside clouds when temperatures are low, through small ice crystals sticking to one another |
| | تتكون الاملاح في الساحل، | Salts form on the coast, |
| | تتكون الرقائق الثلجية | Snowflakes form |
| Metric | Beam Search (Mean) | Top-k (Mean) | p-Value |
|---|---|---|---|
| BLEU-1 | 24.11 | 13.82 | 0.0055 |
| BLEU-2 | 15.14 | 5.24 | 0.0014 |
| BLEU-3 | 9.30 | 2.96 | 0.0270 |
| BLEU-4 | 6.90 | 2.28 | 0.0451 |
| METEOR | 21.52 | 9.33 | 0.0013 |
| BERTScore | 80.67 | 75.60 | |
| Metric | Beam (Dist. 1) | Top-k (Dist. 1) | p-Value (Dist. 1) | Beam (Dist. 2) | Top-k (Dist. 2) | p-Value (Dist. 2) | Beam (Dist. 3) | Top-k (Dist. 3) | p-Value (Dist. 3) |
|---|---|---|---|---|---|---|---|---|---|
| BLEU-1 | 11.56 | 13.78 | 0.0095 | 12.25 | 14.56 | 0.0047 | 10.34 | 14.42 | |
| BLEU-2 | 10.93 | 12.76 | 0.0197 | 11.58 | 13.58 | 0.0076 | 9.80 | 13.31 | |
| BLEU-3 | 10.75 | 12.60 | 0.0134 | 11.53 | 13.38 | 0.0108 | 9.67 | 13.11 | |
| BLEU-4 | 10.63 | 12.50 | 0.0098 | 11.44 | 13.27 | 0.0105 | 9.58 | 13.00 | |
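The p-values above come from comparing paired per-item scores of the two decoding strategies. As an illustration of one common way to run such a comparison, here is a paired sign-flip permutation test on invented per-item score differences; it is a sketch of the general technique, not necessarily the exact statistical test the authors applied.

```python
import random
import statistics


def paired_permutation_test(a: list[float], b: list[float],
                            n_resamples: int = 10000, seed: int = 0) -> float:
    # Two-sided sign-flip test on paired differences: under the null
    # hypothesis, the sign of each per-item difference is exchangeable.
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(statistics.mean(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.mean(flipped)) >= observed:
            hits += 1
    return hits / n_resamples


# Hypothetical per-item scores for two systems (invented for illustration).
beam_scores = [24.1, 15.1, 9.3, 6.9, 21.5]
topk_scores = [13.8, 5.2, 3.0, 2.3, 9.3]
print(round(paired_permutation_test(beam_scores, topk_scores), 3))
```

With only five pairs the smallest attainable two-sided p-value is 2/32 = 0.0625, which is why real evaluations use the full per-question score vectors rather than aggregate means.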
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Jabr, R.B.; Azmi, A.M. Knowledge-Aware Arabic Question Generation: A Transformer-Based Framework. Mathematics 2025, 13, 2975. https://doi.org/10.3390/math13182975