Article

Knowledge-Aware Arabic Question Generation: A Transformer-Based Framework

Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
*
Author to whom correspondence should be addressed.
Mathematics 2025, 13(18), 2975; https://doi.org/10.3390/math13182975
Submission received: 30 July 2025 / Revised: 28 August 2025 / Accepted: 3 September 2025 / Published: 14 September 2025

Abstract

In this work, we propose a knowledge-aware approach for Arabic automatic question generation (QG) that leverages the multilingual T5 (mT5) transformer augmented with a pre-trained Arabic question-answering model to address challenges posed by Arabic’s morphological richness and limited QG resources. Our system generates both subjective questions and multiple-choice questions (MCQs) with contextually relevant distractors through a dual-model pipeline that tailors the decoding strategy to each subtask: the question generator employs beam search to maximize semantic fidelity and lexical precision, while the distractor generator uses top-k sampling to enhance diversity and contextual plausibility. The QG model is fine-tuned on Arabic SQuAD, and the distractor model is trained on a curated combination of ARCD and Qudrat. Experimental results show that beam search significantly outperforms top-k sampling for fact-based question generation, achieving a BLEU-4 score of 27.49 and a METEOR score of 25.18, surpassing fine-tuned AraT5 and translated English–Arabic baselines. In contrast, top-k sampling is more effective for distractor generation, yielding higher BLEU scores and producing distractors that are more diverse yet remain pedagogically valid, with a BLEU-1 score of 20.28 establishing a strong baseline in the absence of prior Arabic benchmarks. Human evaluation further confirms the quality of the generated questions. This work advances Arabic QG by providing a scalable, knowledge-aware solution with applications in educational technology, while demonstrating the critical role of task-specific decoding strategies and setting a foundation for future research in automated assessment.

1. Introduction

Effective education relies on well-designed questions, which serve not only as assessment tools but also as catalysts for deep comprehension and critical thinking. Studies show that skillfully crafted questions can improve learning outcomes by up to 150%, enhancing memory retention, focusing attention on core concepts, and identifying misconceptions for targeted remediation [1,2]. However, manual question creation—particularly for formats requiring linguistic precision, cognitive challenge, and pedagogical relevance—remains labor-intensive and inconsistent.
Educational assessments primarily use two formats: subjective questions, which evaluate open-ended reasoning, and multiple-choice questions (MCQs), which enable scalable evaluation. Both formats depend critically on question quality; poorly designed distractors in MCQs, for example, such as semantically unrelated options (e.g., “What causes climate change?” with distractors like “the invention of the credit card” or “Newton’s first law”), can undermine validity by either misleading learners or making correct answers obvious [3,4]. To alleviate the burden of manual design, researchers have explored automatic question generation (QG), a challenging NLP task combining natural language understanding (NLU) and generation (NLG). Unlike question answering (QA), which extracts answers from text [5], QG synthesizes contextually appropriate and grammatically valid questions from unstructured content. Early QG systems employed rule-based or template-driven methods [6], but transformer-based models now enable more flexible and linguistically nuanced generation.
Despite progress in high-resource languages, Arabic QG lags due to unique linguistic and computational challenges. Arabic’s morphological richness involves significant word variation based on gender, number, tense, and grammatical case. Flexible word order (e.g., SVO, VSO) and diacritic omission introduce ambiguity, complicating syntactic and semantic interpretation. Typographical errors and homographs further degrade generation quality, necessitating advanced modeling strategies to handle these complexities [7].
Compounding these issues is Arabic’s status as a low-resource language in NLP. Annotated QG datasets are scarce and often limited in domain coverage, size, or annotation quality. Standardized evaluation protocols are also lacking, hindering robust benchmarking [8]. This gap is particularly critical given the Arab region’s educational demands: over 60% of the population is under 25, and the Middle Eastern e-learning market is projected to reach 175 billion USD by 2030 (Source: https://digitaldefynd.com/IQ/online-education-market-middle-east/, accessed on 5 August 2025). Scalable Arabic QG systems could thus transform adaptive and personalized learning.
To address these challenges, we propose a knowledge-aware Arabic QG framework capable of generating both subjective questions and MCQs with semantically plausible distractors. Our approach integrates the multilingual T5 (mT5) transformer with a pre-trained Arabic QA model to ensure contextual fidelity. A dual-model pipeline is employed: one fine-tuned on Arabic SQuAD for QG and another on a curated corpus (ARCD, Qudrat, and additional sources) for distractor generation. Beam search optimizes subjective QG, while top-k sampling enhances distractor diversity. Evaluations use BLEU and METEOR metrics, supplemented by human assessments of fluency, coherence, and pedagogical utility.
It should be noted that our model is designed for well-formed educational content in formal settings and assumes input text is free from noise such as spelling errors, grammatical mistakes, or non-standard language variants. Additionally, while our work focuses on generating assessment items, we do not address the grading of student answers—a distinct and well-established challenge in educational NLP. Our contribution is specifically centered on the automatic generation of high-quality questions and distractors from reliable and clean source material.
Our novel AI-driven framework for Arabic question generation advances the field through four key innovations:
  • A linguistically adaptive QG system based on the mT5 architecture specifically optimized for Arabic’s morphological complexity and low-resource educational contexts, combining deep learning with domain-specific knowledge representation.
  • An answer-aware re-ranking mechanism that enhances semantic precision by jointly modeling contextual relationships between source text, generated questions, and target answers, improving educational relevance.
  • A dual-stage generation pipeline that produces both subjective questions (via beam search) and high-quality MCQs with semantically plausible distractors (via top-k sampling), addressing the full spectrum of assessment needs.
  • New Arabic NLP resources including a curated MCQ dataset with expert-validated distractors and open-source implementation, facilitating research in Arabic educational technology and adaptive learning systems.
Our work bridges advanced NLP techniques with practical educational applications, offering a scalable solution for e-learning platforms, intelligent tutoring systems, and domain-specific assessment tools across the Arabic-speaking world.
This paper is organized as follows: Section 2 introduces the necessary background material. Section 3 provides a review of existing literature on question generation. The dataset compiled for this study is described in Section 4. Our proposed methodology is detailed in Section 5. Experimental results are presented and discussed in Section 6. Finally, Section 7 concludes the paper and outlines potential directions for future research.

2. Background

This section provides essential background for our study. We begin by examining two key challenges: (1) the linguistic complexities of Arabic in question generation (QG) (Section 2.1) and (2) the specific difficulties in creating effective distractors for MCQs in the Arabic language (Section 2.2). We then present a concise overview of the T5 model architecture.
Our literature review identified a significant absence of systematically documented linguistic challenges for Arabic QG. While the following analysis enumerates these challenges comprehensively—including those not directly addressed by our proposed approach—its primary purpose is twofold: to position our work within the broader Arabic NLP landscape and to highlight the pressing need for targeted research in Arabic educational technology. These findings may also prove valuable for addressing similar challenges in other low-resource language contexts.

2.1. Linguistic and Computational Challenges in Arabic QG

The development of automatic question generation (QG) systems is a complex task, particularly for Modern Standard Arabic (MSA) due to its unique linguistic, morphological, and syntactic characteristics. Our research is intentionally focused on MSA because it is the exclusive language of written formal education across the Arab world. While instruction may involve oral code-switching into a local dialect for explanation, all educational foundations—including textbooks, curricular materials, worksheets, and examinations—are authored and standardized solely in MSA. Consequently, any QG system designed for an educational context must prioritize mastery of MSA. This subsection presents a broad overview of the main challenges encountered in Arabic QG and MCQ distractor generation, with a specific focus on the written MSA domain.
Morphological richness and ambiguity. 
Arabic words exhibit extensive inflectional variation based on gender, number, tense, and grammatical case. For instance, the verb كتب (“he wrote”) can morph into كتبت (“she wrote”), كتبوا (“they wrote”), or سيكتبون (“they will write”). This morphological richness increases the complexity of generating grammatically correct and semantically aligned questions, particularly when dealing with out-of-domain inputs or informal text.
Additionally, Arabic commonly omits diacritics (short vowels), leading to lexical ambiguity. For example, the word علم can be interpreted as “flag,” “knowledge,” or “he knew,” depending on context. Such ambiguities are especially problematic for QG systems, as minor misinterpretations in token meaning can render the generated question irrelevant or misleading.
Syntactic flexibility and typographic noise. 
Arabic allows variable word order structures such as Subject–Verb–Object (SVO), Verb–Subject–Object (VSO), and even less common forms like Verb–Object–Subject (VOS). QG models must therefore learn to generalize over multiple syntactic patterns and maintain clarity regardless of word order. Further complicating the task, common typing errors—such as mistaking the name علي (“Ali”) for the preposition على (“on”)—can confuse syntactic parsers or named entity recognition components, thereby reducing the reliability of generated output.
Domain limitations and lack of evaluation standards. 
The field of Arabic question generation is hampered by a scarcity of dedicated datasets and resources, a recurring challenge across the Arabic NLP landscape that affects tasks from text summarization [9] to specialized applications like legal judgment support systems [10]. Existing corpora are often restricted to narrow domains, such as religious texts or news articles, and typically lack comprehensive annotations for question–answer pairs and distractors. This limitation hinders model generalizability and precludes the establishment of a unified evaluation benchmark. While metrics like BLEU and METEOR are commonly adopted, they may not adequately assess the grammatical and contextual fidelity essential for educational content. Furthermore, transformer-based models, especially in multilingual settings, are susceptible to cascading errors and semantic drift during the generation of longer or more complex questions—a problem attributed to exposure bias. Although techniques like imitation learning offer promising mitigation strategies [11], they are not explored in the present study.

2.2. Challenges in Distractor Generation

Generating high-quality distractors for MCQs in Arabic introduces another layer of complexity, particularly due to the need for grammatical agreement, cultural awareness, and semantic plausibility. Based on prior work in English MCQ generation [12,13,14], we adapt core distractor design criteria to the Arabic context.
Semantic relatedness. 
Distractors must belong to the same semantic class or domain as the correct answer. For example, in the question “What is the capital of Saudi Arabia?”, distractors like “Cairo” or “Beirut” are suitable, while “Washington, DC” may be semantically related but contextually inappropriate, as it is not an Arab city.
Contextual appropriateness. 
Even semantically related distractors must be contextually plausible. In questions about historical events—such as the Battle of Qadisiyyah (معركة القادسية)—appropriate distractors would reference similar historical battles (e.g., معركة حطين، معركة اليرموك). Distractors unrelated to the regional or historical context would confuse learners or trivialize the assessment.
Grammatical consistency. 
Distractors must match the grammatical features of the correct answer, including gender, number, and case. For example, in the MCQ: – هو عاصمة المغرب؟ (“– is the capital of Morocco?”), the correct answer الرباط (Rabat) is masculine. Thus, valid distractors must also be masculine nouns (e.g., الخرطوم, الكويت). Using feminine options like القاهرة (Cairo) introduces grammatical inconsistency and signals the correct answer.

2.3. T5 Architecture

Transfer learning and fine-tuning are widely employed in current NLP approaches. Instead of training a new model from scratch, these methods utilize pre-trained models, resulting in significant time, resource, and effort savings while maintaining high accuracy. An exemplary model is Google’s Text-to-Text Transfer Transformer (T5) [15], which introduces a unified format for various text-to-text tasks, ensuring that both input and output are always in textual form. Both T5 and mT5, a multilingual variant of T5, exhibit versatility in performing various tasks, as shown in Figure 1. These tasks encompass machine translation, where mT5 can proficiently translate text from one language to another; text classification, enabling the classification of text based on genre, sentiment, topic, author, or language using mT5 and T5; question answering; and text summarization.
T5 utilizes the transformer architecture, a highly effective neural network structure specifically designed for natural language processing tasks. This architecture comprises an encoder and a decoder, each consisting of a stack of self-attention layers.
The encoder receives a token sequence as input and generates a series of hidden states. Subsequently, the decoder takes the encoder’s hidden states as input and generates an output sequence of tokens. The transformer architecture’s self-attention layers enable the model to capture long-range dependencies between tokens in input and output sequences. This capability is crucial for tasks like machine translation and text summarization, where comprehending the entire input sequence is necessary to generate accurate output.
One limitation of T5 is its focus on the English language. However, multilingual T5 (mT5) addresses this limitation by incorporating 101 languages, including Arabic. Like T5, mT5 is built upon the Transformer architecture, but it is trained on a dataset containing text and code from multiple languages [16]. Consequently, mT5 is capable of generating text in multiple languages, making it valuable for tasks like machine translation and cross-lingual text summarization. Both mT5 and T5 can undergo fine-tuning to adapt to new tasks. Fine-tuning entails training the model on a specific dataset containing examples relevant to the desired task. Through this process, the model learns the statistical associations between the input and output examples, enabling it to effectively perform the new task. To carry out fine-tuning on mT5 or T5, the following steps are typically involved: (a) select a dataset containing examples relevant to the new task; (b) prepare the dataset by tokenizing the text and constructing a vocabulary; (c) train the model using the prepared dataset; and (d) evaluate the model’s performance on a separate held-out dataset. Fine-tuning mT5 or T5 can be a time-intensive process, but it is highly effective. Through fine-tuning, we can substantially enhance the model’s performance on the specific task at hand.
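For illustration, the following minimal sketch shows steps (a)–(c) of this generic fine-tuning workflow for a single text-to-text example using the Hugging Face transformers library. The model size, example strings, and hyperparameters are illustrative placeholders rather than the configuration used later in this paper.

import torch
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

# (a)-(b) one (input, target) training example in text-to-text form
source = "Answer: 1914 Context: ..."         # illustrative input string
target = "متى بدأت الحرب العالمية الأولى؟"    # illustrative target question
enc = tokenizer(source, truncation=True, max_length=512, return_tensors="pt")
labels = tokenizer(target, truncation=True, max_length=64, return_tensors="pt").input_ids
labels[labels == tokenizer.pad_token_id] = -100    # ignore any padding positions in the loss

# (c) a single optimization step; a real run loops over the whole dataset
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels).loss
loss.backward()
optimizer.step()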

3. Related Work

The Question Generation (QG) task involves creating natural questions relevant to provided input and, when specified, a targeted answer. One of the pioneering initiatives in automatic QG was AutoQuest, developed as part of a computer-based educational system [17]. This system generated questions from texts to facilitate independent study, displaying text paragraph by paragraph and generating questions from randomly selected sentences. Progression was allowed if the answers, which needed to contain a percentage of the original sentence’s words, were correct. If incorrect, the paragraph was re-displayed, and a new question was posed. However, focused research on QG began only in 2008 with the National Science Foundation-sponsored Question Generation Shared Task and Evaluation Challenge [18].
In a recent study, Zhang et al. [19] proposed a comprehensive taxonomy of QG tasks, categorizing them based on the types of input context text, target answers, and questions generated (refer to Figure 2). This highlights the breadth of the field and the diverse research avenues it encompasses. In this section, we review works related to our target tasks: generating standalone questions and multiple-choice questions. Our focus will primarily be on research conducted in languages other than English, with a particular emphasis on Arabic. Additionally, we include some relevant works in English for comparative analysis.
Researchers employ a variety of methods to address QG, including rule-based approaches, machine learning, deep learning, and hybrid methods that combine these techniques [6,20]. Among these, deep learning-based solutions have gained significant popularity due to their effectiveness [6]. Broadly, QG approaches can be categorized into two types: (a) traditional methods and (b) deep learning methods, each of which has seen significant contributions that we explore and analyze in detail.

3.1. Traditional Approaches

Traditional methods for QG primarily utilize heuristic rules to convert descriptive text into related questions. These rule-based approaches can generally be categorized into three types: template-based [21], syntax-based [22], and semantic-based [23] methods.
Al-Khatib [3] proposed a model for generating Arabic questions by utilizing two resources: a general ontology and a specialized resource specific to a particular course. They incorporated predefined templates to generate two types of questions: subjective questions and multiple-choice questions (MCQs). The author chose not to rely on automatic metrics and instead randomly selected 100 generated questions for evaluation by a human expert. The assessment considered various criteria, including the relevance of the question to the given context and its overall value.
In another study, Bousmaha et al. [24] developed a system for Arabic QG to support the learning process of children. They initially preprocessed the text using a morphological analysis tool called MADIMARA and integrated the STAR tool for segmentation purposes. Semantic role labeling, facilitated by the Arabic PropBank tool, was applied to identify and tag predicates (actions and relationships) and associated semantic entities in the given text. The question model complemented this step by generating patterns based on Arabic grammar and constructing question templates from specific text. They evaluated their system on ten documents comprising a total of 600 sentences from the Quizzito corpus, generating multiple questions from each document; the F-measure ranged from 48% to 86%. The authors stated that 77% of the incorrectly generated questions contained grammatical errors.
Elbasyouni et al. [25] developed an Arabic fill-in-the-blank question generation system with a Graphical User Interface (GUI) to assist teachers in exam preparation. The system operates in four phases; its processing pipeline normalizes the ‘Alef’ character to a unified form, removes diacritical marks, tokenizes the text into sentences, extracts named entities (such as person, location, and organization) using the Stanford NLP tool, and removes the extracted entities and any dates or times from the text. The generated questions were saved in a database, and in the final phase, the saved questions were processed to produce the final question list. Human evaluation was conducted by 20 teachers to assess the system based on criteria such as relevance, question target, syntactic structure, clarity, and variety. Statistical processing was applied to determine the appropriateness of the system’s performance, although specific details regarding the calculation of percentages were not provided.
Furthermore, Kaur and Bathla [26] employed a hybrid approach that incorporates NER as an additional step to generate MCQs based on a given Hindi text. The hybrid methodology combines rule-based, example-based, and dictionary lookup techniques. The rule-based approach comprises ten predefined rules that leverage entities such as names and locations. The example-based approach consists of a set of patterns that are compared with the input text. If there is similarity, a question is generated; otherwise, the rule-based approach is applied. This step aims to enhance the system’s performance. The dictionary lookup component includes a collection of entities with their corresponding classes, and it collaborates with both the rule-based and example-based approaches. To evaluate their solution, the researchers utilized precision, recall, and F-measure metrics. They conducted the evaluation on 300 Hindi sentences sourced from various websites and books, covering different question types. However, the specific details on how the MCQs were generated were not provided in their study.
Additionally, Swali et al. [27] introduced an English QG system designed for multiple fields. Their methodology consists of two phases: sentence selection and question generation. The system initially takes a given text and tokenizes it based on periods, treating each sentence individually. The Stanford POS Tagger is then employed to process each sentence, and feature extraction is conducted to identify important sentences. For example, the system checks if a sentence is the first sentence in the text. In the question generation phase, selected sentences undergo processing to generate questions using a rule-based approach. Simple sentences are processed using NER, while complex sentences are processed based on discourse connectives (e.g., “like,” “for instance,” “etc.”). Evaluation results were not reported.
It is worth noting that the aforementioned papers do not utilize machine learning techniques and instead rely solely on rules and templates, if any. Rules and templates are human-designed transformations that can be time-consuming, and the quality of questions heavily depends on their accuracy. As a result, these approaches may not be scalable beyond human capacity [28]. However, we may leverage certain steps, such as extracting NER, to generate questions.

3.2. Deep-Learning Approaches

Recent trends have adopted deep learning (DL) approaches, in particular seq2seq and transformer models. Ouahrani and Bennouar [29] addressed the task of generating opinion-based questions, aiming to provide users with opinion-rich examples relevant to their queries. They utilized a biomedical Arabic community question answering dataset, cQA-MD, which consists of 1,531 contexts and approximately 20K QA pairs. The dataset was split, with 85% used for training and 15% for testing. The authors reported their best performance with BLEU-4, ROUGE-L, and METEOR scores of 10.03, 15.08, and 11.61, respectively. Additionally, three human evaluators assessed 100 randomly selected QA pairs, yielding average scores of 3.75 out of 5 for syntactic correctness and 2.79 out of 5 for relevance.
Alhashedi et al. [30] proposed an innovative approach to generate long-form Arabic questions without answer awareness, leveraging BERT transformers instead of traditional methods. Their model operates in two categories: the first generates questions from short texts, while the second segments documents into paragraphs and extracts key sentences using the TextRank algorithm, iterating until the last paragraph is processed. For evaluation, the authors used the multi-lingual mMARCO dataset [31], filtering and cleaning it to obtain 242 K Arabic records. The dataset was split into 80% for training, 10% for validation, and 10% for testing. The model achieved BLEU-4, ROUGE-L, and METEOR scores of 19.12, 51.99, and 23, respectively.
Lafkiar and En Nahnahi [32] proposed an end-to-end transformer-based system for Arabic QG designed to produce natural language questions from passages with answer spans explicitly marked using <start> and <end> tags. This system utilizes the strengths of transformer models to capture long-range dependencies and contextual nuances, focusing on relevant sections of the text. The model was trained and evaluated on a dataset of 50K passages from Arabic SQuAD and ARCD, achieving BLEU, ROUGE-L, and METEOR scores of 20.51, 44.01, and 24.04, respectively. In a subsequent study, Lafkiar and Nahnahi [33] demonstrated the effectiveness of pre-trained checkpoints for Arabic QG. They proposed a transformer-based seq2seq model integrated with pre-trained AraBERT checkpoints, initializing both the encoder and decoder to address the limited resources available for Arabic QG. For training and testing, they compiled a dataset of approximately 71K triplets (passages, questions, and answers), named ARQGData, which included Arabic SQuAD, ARCD, MLQA, and TyDi QA. The model achieved a BLEU score of 20.29 and a METEOR score of 30.73.
Rahim and Khoja [34] proposed a framework for automatic QG in Urdu. The system comprises seven stages: pre-processing, tagging, anaphora resolution, word chunking, automatic dataset construction (ACD) using Urdu linguistic rules, fine-tuning mT5, and a ranking algorithm. The framework introduced custom algorithms for anaphora resolution and word chunking to handle Urdu’s complex sentence structures, generating 4497 question–answer pairs from 250 passages. mT5 was fine-tuned on Urdu SQuAD and ACD, with experiments on various encodings and embeddings. A ranking algorithm filtered irrelevant questions based on semantic representation. Evaluated using BLEU-4, METEOR, and ROUGE-L, the best-performing model achieved scores of 24.78, 37.07, and 54.99, respectively.
In their study, Nagoudi et al. [35] introduced AraT5, an Arabic transformer model built on the T5 architecture. AraT5 is tailored to handle a variety of Arabic language tasks, including QG for both MSA and prevalent dialects. To evaluate its performance, AraT5 was benchmarked against Google’s mT5 model. Specifically for the QG task, the authors created a dataset named ARGENQG, comprising 96K triplets sourced from ARCD, MLQA, TyDi QA, and XTREME. Their best results were observed for MSA, where AraT5 achieved a BLEU score of 16.99, outperforming the mT5, which scored 15.29.
Spanish-language automatic MCQ generation was explored by [36] using mT5-based models. The approach involves three tasks: candidate answer extraction, answer-aware question generation, and distractor generation. A structured pipeline integrates these tasks to transform input text into a questionnaire. The SQuAD dataset was used for fine-tuning the first two tasks, while a combination of three MCQ datasets, automatically translated into Spanish, was employed for distractor generation. The models were evaluated using BLEU, ROUGE-L, and cosine similarity metrics. Results indicated slight deviations from the baseline model in QG, with a BLEU score of 18 compared to the baseline’s 18.59. However, the model demonstrated improved performance in distractor generation, achieving a BLEU score of 11.34 versus the baseline’s 5.21.
Vachev et al. [37] conducted a study involving two distinct models. The first model, trained on the SQuAD 1.1 dataset, generated multiple question–answer pairs from a given context. The second model, trained on the RACE dataset, generated multiple distractors using an answer, question, and context. To enrich the distractor pool, the authors integrated sense2vec, enabling the inclusion of simple words not directly related to the context. Evaluation metrics included Exact Match, F-score, and BLEU scores. The first, second, and third distractors achieved BLEU-1 scores of 46.37, 32.19, and 34.47, respectively.
Abdel-Galil et al. [38] developed a GUI-based QG system using a seq2seq approach with gated recurrent units (GRUs), a copy mechanism, and an attention decoder. Built on the SQuAD 1.1 dataset, the system allows users to upload PDF documents, which are converted into text and processed through four phases: (1) data cleaning, where equations, figures, and tables are removed; (2) context selection, where important paragraphs and sentences are extracted; (3) feature extraction using part-of-speech (POS) tagging and NER; and (4) question generation, where subjective questions of varying difficulty levels (easy, medium, hard) are produced using the seq2seq approach. The system provides two output templates: “Question bank PDF” and “Exam template PDF.” The model achieved a BLEU-4 score of 11.3.
Lopez et al. [6] conducted a study on the methodologies employed for English QG. They observed that while many existing approaches are robust, they often rely on complex model architectures that pose challenges during training. To overcome this, the authors opted for a transformer-based solution utilizing the pre-trained GPT-2 model and the SQuAD dataset. In their experimentation, the researchers examined the format of the training data before fine-tuning the model. They first explored two answer-unaware formats, each employing different delimiters: All Questions per Line (AQPL) and One Question per Line (OQPL). The results indicated that OQPL with the delimiter [SEP] outperformed AQPL. Subsequently, they further investigated the format of the training data by incorporating answer awareness while utilizing the OQPL format. Surprisingly, they observed no significant change in the model’s performance. The authors used BLEU and METEOR metrics to evaluate their model.
Montgomerie [39] developed a QG system for independent learners, offering subjective questions with or without MCQs. For MCQs, the system used SpaCy’s NER tool to extract distractors sharing the same entity type. Framing QG as the inverse of QA, the study trained the system on 250K records from datasets like SQuAD, RACE, CoQA, and MSMARCO. The system was made answer-aware by formatting each input as
answer_token <answer> context_token <context>
The QG task used Google’s T5 seq2seq model, while a BERT-based model evaluated question–answer validity using Next Sentence Prediction. The approach combined modern NLP techniques to enhance question quality. Performance results were not reported.
According to [40], neural seq2seq models for QG often produce generic and undiversified questions that lack relevance to the given passage and target answer. To address this issue, the authors proposed two innovative methods. First, a partial copy mechanism prioritizes words morphologically close to those in the input passage during question generation. Second, a QA-based reranker selects questions from an n-best list of candidates, ensuring they are preferred by both the QA and QG models. A combination of the two methods with specific weights assigned to each achieved the best performance, yielding a BLEU-4 score of 15.29 and a METEOR score of 20.13.

4. Compiling the Dataset

To fine-tune our models, we utilize multiple datasets. Our approach involves two distinct models, each with its specific requirements. The first model aims to generate straightforward questions using two inputs: the context and the answer. The second model, on the other hand, is designed to generate multiple-choice questions (MCQs). We cover the dataset requirements for each model below.

4.1. Dataset for Plain Question

Here, we utilize the Arabic Stanford Question Answering Dataset (SQuAD) v1.1. This dataset is a compilation of question–answer pairs originally derived from various English Wikipedia articles, encompassing over 100,000 questions generated by crowd workers [41]. To facilitate its use in Arabic, Mozannar et al. [42] translated the dataset and reformatted it into JSON (JavaScript Object Notation). We adopt Arabic SQuAD as it represents the most widely used benchmark for Arabic question answering and, by extension, question generation. Moreover, SQuAD in its original English form is the standard benchmark for QG research, and the availability of its Arabic counterpart allows us to align with established evaluation practices. Given the scarcity of publicly available Arabic datasets for automatic QG, Arabic SQuAD provides both reliability and comparability, making it an appropriate choice for our study.

4.2. Dataset for Multiple-Choice Questions

To address the absence of pre-existing QA datasets containing distractors, we compiled our own dataset comprising 8000 Arabic question–answer pairs, including distractors. Our methodology involved utilizing two open-source QA datasets, namely the Arabic Facebook QA Dataset and the ARCD (Arabic Reading Comprehension Dataset). To ensure manageable and verifiable data, we divided these datasets into smaller subsets of approximately 300 pairs each. We then sought the assistance of five freelancers, in addition to our own efforts, to annotate each subset.
Additionally, we expanded our data collection efforts beyond these two sources. We incorporated information from other resources such as the Qudrat Arabic Reading Comprehension Test, https://mawdoo3.com/, and Madinah Arabic. By following a consistent approach, we constructed our own questions, answers, and distractors, ensuring a diverse range of question types and contextual fields. To ensure the quality of the annotations, we conducted a thorough verification process, meticulously reviewing and rectifying any issues ourselves. The manual annotation process can be summarized as follows:
To facilitate the freelancers’ understanding of the task, we provided them with a training dataset. This dataset served as a reference to familiarize them with the context, question, and correct answer structure. Clear guidelines were provided, instructing the freelancers to read the given context, question, and correct answer, and then identify three distractors. The distractors were required to adhere to the same format as the correct answer and/or be plausible answers derived from the context. Throughout the process, we closely monitored the freelancers’ work to ensure their adherence to the guidelines and maintain a high standard of annotation quality. Regular oversight allowed us to address any potential issues and maintain the integrity of the dataset.
Consider the example in Table 1 from our dataset, where all the distractors adhere to the same format as the correct answer. Notably, two of the distractors were actually plausible answers found within the context itself.
The following is a summary of the datasets we collected:
Arabic Facebook QA dataset 
[43]. This dataset, known as multilingual question answering (MLQA), serves as a benchmark for evaluating cross-lingual question answering performance. It contains over 5000 extractive question–answer instances (12,000 in English) in the SQuAD format. MLQA covers seven languages: English, Arabic, German, Spanish, Hindi, Vietnamese, and Simplified Chinese. The dataset is highly parallel, with each QA instance appearing in four different languages on average.
ARCD dataset. 
The Arabic Reading Comprehension Dataset (ARCD) consists of 1395 questions collected from Wikipedia articles through crowd-sourcing [44].
Qudrat Arabic Reading Comprehension Test. 
Qudrat is an Arabic reading comprehension test designed for high school students, serving as a crucial requirement for acceptance into Saudi universities (https://www.qeyas2030.com/public/categories/82/show, accessed on 9 June 2024). We collected approximately 200 instances from various websites and applications.
Mawdoo3.com. 
This is a prominent website hosting a vast collection of Arabic articles (https://mawdoo3.com/). We have collected around 700 articles from this source and constructed our dataset using them.
Madinah Arabic. 
This is a website offering online courses and tests to aid students worldwide in learning Arabic (https://www.madinaharabic.com/). We have obtained around 100 reading comprehension tests from Madinah Arabic to contribute to our dataset.
These diverse datasets contribute to our comprehensive collection of Arabic question–answer pairs, enabling us to train and evaluate our models effectively. Table 2 summarizes our compiled dataset.

4.3. Data Preprocessing

Once the datasets have been collected and constructed, we proceed with the data preprocessing steps for each model. These steps include the following:
  • Combining the datasets. We merge the datasets into a unified form, specifically CSV format, for each model.
  • Removing unwanted columns. We eliminate unnecessary columns such as “start answer,” “answers,” “id,” “QID,” and “SiteID” from the dataset.
  • Discarding question–answer pairs with empty answers.
  • Removing special characters, such as newlines and unclosed brackets, from all columns within the dataset.
  • Removing reference symbols within the contexts, such as those used as footnote markers.
  • Unifying the Alif character into a standardized form throughout the dataset.
  • Removing all diacritics from the dataset.
For the question-with-MCQs model, an additional preprocessing step is performed to unify the “All of them” and “None of them” options. This ensures consistency across the dataset, which was created by multiple contributors. These preprocessing steps guarantee that the data is clean, standardized, and ready for training and evaluation in each respective model.
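The following Python sketch illustrates how these preprocessing steps could be implemented; the regular expressions and column names (e.g., “answer”, “context”) are illustrative approximations rather than the exact rules used to build our dataset.

import re
import pandas as pd

DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")            # tanween, shadda, sukun, dagger alif

def normalize_text(text: str) -> str:
    text = re.sub(r"[\u0622\u0623\u0625]", "\u0627", text)   # unify Alif variants to bare Alif
    text = DIACRITICS.sub("", text)                          # remove diacritics
    text = re.sub(r"\[\d+\]", "", text)                      # drop footnote/reference markers such as [3]
    text = text.replace("\n", " ").replace("\r", " ")        # remove newline characters
    return re.sub(r"\s+", " ", text).strip()

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop(columns=["start answer", "answers", "id", "QID", "SiteID"], errors="ignore")
    df = df[df["answer"].astype(str).str.strip() != ""]      # discard pairs with empty answers
    for col in ("context", "question", "answer", "distractors"):
        if col in df.columns:
            df[col] = df[col].astype(str).map(normalize_text)
    return df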

5. Our Proposed Approach

We employed the mT5-large Google pre-trained model to train our “QG without MCQs” model. Additionally, we utilized the mT5-base model to train the “QG with MCQs” model. Our training process involved the use of Colab Pro+ and PyTorch v2.8.0. Furthermore, we incorporated several well-known QA datasets such as SQuAD 1.1, Arabic Facebook QA, and ARCD. To cater to Arabic MCQs, we also created a new dataset entirely from scratch.

5.1. Proposed System Design

Our proposed system design, as shown in Figure 3, consists of two models. The first model is responsible for generating questions by utilizing the context and the answer as inputs. The second model, on the other hand, generates distractors by incorporating the context, answer, and the previously generated question.
Figure 4 illustrates an instance of question generation with MCQs selected by the user. The original Arabic context translates to
The First World War, commonly referred to as the Great War, started in Europe on July 28, 1914, and ended on November 11, 1918. It was initially believed to be the conflict that would bring an end to all wars. With over 70 million soldiers mobilized, including 60 million Europeans, it became one of the largest wars in history.
The corresponding question and answer are “When did the First World War commence?” and “July 28, 1914,” respectively. The system generates three distractors:
Distractor 1: 
“June 28, 1914”;
Distractor 2: 
“July 28, 1944”;
Distractor 3: 
“November 11, 1918”.
All four options will be presented in a shuffled order. Notably, the generated distractors in the MCQs can either be a potential answer extracted from the same paragraph (e.g., distractor 3) or a slightly modified version of the correct answer (e.g., distractors 1 and 2 above).

5.2. Training the Model

After preprocessing the collected datasets, the next step is to train two models. The first model is trained using question–answer pairs and follows the answer-aware OQPL format, as referenced in [6,30,39,45,46]. We utilized the pre-trained mT5-large model from Google to assist us in this task. The second model focuses on generating distractors and is trained using question, answer, and context pairs from our specially constructed dataset, utilizing the mT5-base model. We divided our datasets into three parts for both models: 80% was allocated for training, 10% for validation, and the remaining 10% for testing.
We implemented the training pipeline in PyTorch, employing the Dataset and DataLoader classes to optimize memory utilization and enable efficient batch processing. Each model was trained for three epochs using the AdamW optimizer, configured with a learning rate of 0.0003, betas set to (0.9, 0.999), epsilon of 1e-8, and no weight decay. Given the GPU memory limitations of the Nvidia Tesla T4 available on Google Colab, we adopted gradient accumulation over eight steps, effectively simulating a batch size of 8 while maintaining stable gradient updates. Input sequences were standardized to a maximum length of 512 tokens through truncation and padding, ensuring consistent tensor dimensions across batches. Training stability was further supported by a linear learning rate scheduler with no warmup steps, gradient clipping with a maximum norm of 1.0, and the use of a fixed random seed (42) for reproducibility. Mixed precision training was explicitly disabled (FP16 = False) to avoid potential numerical instability under the given hardware and task constraints.
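A compact sketch of this training loop is given below; it mirrors the reported configuration (AdamW, linear schedule without warmup, gradient accumulation over eight steps, gradient clipping at 1.0, fixed seed 42, FP16 disabled), while `model` and `train_loader` are assumed to be the prepared mT5 model and a PyTorch DataLoader over the tokenized training split.

import torch
from transformers import get_linear_schedule_with_warmup

def train(model, train_loader, num_epochs=3, accum_steps=8):
    torch.manual_seed(42)                                      # fixed seed for reproducibility
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                                  betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0)
    total_steps = (len(train_loader) // accum_steps) * num_epochs
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0,
                                                num_training_steps=total_steps)
    model.train()
    for _ in range(num_epochs):
        for step, batch in enumerate(train_loader):
            loss = model(**batch).loss / accum_steps           # scale loss for accumulation
            loss.backward()
            if (step + 1) % accum_steps == 0:                  # effective batch size of 8
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()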
The question generation model converged in approximately nine hours, whereas the distractor generation model required only 16 minutes. We selected the best model for each task based on the lowest validation loss, iteratively saving checkpoints to preserve the optimal weights. This strategy ensured effective optimization within our hardware limitations.

5.3. Generating Optimal Questions

After developing and fine-tuning our models, our focus is on generating high-quality questions and distractors. This involves employing optimal decoding algorithms, a process divided into several key stages:
Answer-aware question generation: 
We will use a specialized approach to generate answer-aware questions that relate to both the context and a given answer. This helps to avoid generic or irrelevant questions. The format for this will be
Answer: <answer> Context: <context>.
Algorithm 1 implements the above. It starts by loading the Arabic SQuAD 1.1 dataset and a pre-trained mT5 model for the purpose of training the model. Once trained, the model is used to generate three questions based on a given context C and target answer A, employing one of the decoding algorithms referenced later in this Section. A procedure called QG_reranking is then used to calculate the similarity score between each of the generated questions, the input paragraph, and the expected answer, and these questions, along with their scores, are stored in a list. The list of questions is subsequently sorted by decreasing similarity score, and the question with the highest score is returned as the final output.
Algorithm 1: Generating and re-ranking questions using mT5 large model.
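A condensed Python rendering of Algorithm 1 is sketched below: the fine-tuned model generates several candidate questions with beam search for a given context and answer, each candidate is scored by the QG_reranking procedure, and the highest-scoring question is returned. Function and parameter names are illustrative, and the beam width is an assumption.

def generate_best_question(model, tokenizer, context, answer, qg_reranking,
                           num_candidates=3, max_len=64):
    prompt = f"Answer: {answer} Context: {context}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(**inputs,
                             max_length=max_len,
                             num_beams=5,                          # beam search decoding
                             num_return_sequences=num_candidates,
                             early_stopping=True)
    candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
    scored = [(qg_reranking(q, context, answer), q) for q in candidates]   # similarity scores
    scored.sort(key=lambda pair: pair[0], reverse=True)            # sort by decreasing score
    return scored[0][1]                                            # highest-scoring question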
Question and answer-aware distractor generation: 
Following the methodology from the answer-aware questions, we also aim to create valid distractors that are related to the context, question, and answer. The format will be
Question: <question> Answer: <answer> Context: <context>.
The following pseudocode describes how we implement the “QG with MCQs” model.
Algorithm 2 begins by loading a specifically constructed dataset and a pre-trained mT5 base model, with the aim of training this model on the loaded data. Following the training, the trained model is employed to generate three distractors based on both the context and the target answer, along with a question, which could possibly be the question generated from a previous model. Together, these form four choices: one correct answer and three distractors. The generation of these distractors and the question makes use of one of the decoding algorithms listed below. Ultimately, the algorithm returns the list of generated distractors.
Algorithm 2: Generating question with multiple-choice questions (MCQs).
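The corresponding sketch for Algorithm 2 follows: the distractor model receives the context, the (possibly generated) question, and the correct answer, and samples three distractors with top-k sampling. The value of k and the sequence lengths are illustrative assumptions.

def generate_distractors(model, tokenizer, context, question, answer,
                         k=50, num_distractors=3, max_len=32):
    prompt = f"Question: {question} Answer: {answer} Context: {context}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(**inputs,
                             max_length=max_len,
                             do_sample=True,                       # stochastic decoding
                             top_k=k,                              # restrict sampling to the top-k tokens
                             num_return_sequences=num_distractors)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]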
Decoding algorithms: 
This involves selecting decoding algorithms that significantly impact the generated output. Through these methods, we will assess which algorithms perform best after training and using our models. We explored the following methods:
  • Greedy decoding algorithm. Uses token probabilities to determine the next target word by selecting the token with the highest probability at each step; this is the default decoding strategy of the generation routine in the T5 and mT5 architectures. However, a significant issue with greedy decoding is that it can overlook high-probability tokens that are hidden behind tokens with lower probabilities [47].
  • Beam search decoding algorithm. Addresses the limitation found in the greedy algorithm, where high probability tokens may be missed. Unlike the greedy approach, beam search explores multiple potential paths in the output, reducing the risk of overlooking hidden high-probability tokens. This exploration allows it to select the best option from the various paths, thereby overcoming the deficiency in the greedy method.
  • Top-k sampling decoding algorithm. A variant of the greedy decoding method. Instead of selecting just the token with the highest probability, it considers the top k tokens with the highest probabilities. The probability mass is then redistributed among these selected k tokens, providing a broader consideration of likely options.
QG re-ranking: 
To mitigate error propagation from QG to downstream multiple-choice question formation, we introduce a knowledge-driven re-ranking mechanism. Our pipeline first generates multiple question candidates using the fine-tuned Arabic QG model, then employs AraELECTRA [48], a pre-trained Arabic QA model as a knowledge-based validator. This model evaluates each candidate’s semantic alignment with the input context and target answer, assigning a relevance score. The highest-scoring question is selected, ensuring optimal contextual coherence and educational validity. This step not only filters out low-quality outputs but also explicitly leverages external knowledge (via QA) to enhance robustness—a key advantage over end-to-end QG systems.
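One plausible realization of this re-ranking score, sketched below, runs an Arabic extractive QA model over the context with each candidate question and uses the token-level F1 between the predicted span and the target answer as the relevance score. The checkpoint path is a placeholder for an AraELECTRA model fine-tuned for Arabic QA, and the exact scoring function used in our system may differ.

from transformers import pipeline

def build_reranker(qa_model_path):
    # qa_model_path: placeholder for an AraELECTRA checkpoint fine-tuned on Arabic extractive QA
    qa = pipeline("question-answering", model=qa_model_path)

    def qg_reranking(question, context, target_answer):
        predicted = qa(question=question, context=context)["answer"]
        pred_tokens, gold_tokens = predicted.split(), target_answer.split()
        common = set(pred_tokens) & set(gold_tokens)
        if not common:
            return 0.0
        precision = len(common) / len(pred_tokens)
        recall = len(common) / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)   # token-level F1 as relevance score

    return qg_reranking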

5.4. Evaluation Metrics

To effectively assess our proposed Arabic QG model, we have chosen a hybrid evaluation strategy that combines the precision-focused metrics BLEU and METEOR with human-centered assessments. These metrics are preferred over traditional ones like precision, recall, and F-measure, which fail to capture the semantic richness essential for QG. For instance, while both “What is the capital of France?” and “Which city is the capital of France?” are valid queries, traditional metrics might unduly penalize them for not matching a reference exactly. Conversely, BLEU and METEOR, which evaluate n-gram overlap and semantic alignment, credit their contextual and linguistic appropriateness. This methodology ensures a comprehensive understanding of our model’s capabilities in generating contextually relevant and grammatically accurate questions.
Before exploring the intricacies of these evaluation mechanisms, it is essential to define key linguistic constructs that form the foundation of these metrics. In computational linguistics, a “unigram” refers to a single unit, which can be either a word or a character, depending on the context. A “bigram” represents a pair of consecutive units (words or characters), and an n-gram is formally defined as a contiguous sequence of n units within a given text. For metrics like BLEU and METEOR, n-grams specifically refer to words. These constructs are fundamental to understanding how such metrics operate.
The machine-based metrics typically use reference texts for evaluation. For our “QG without MCQs” model, the target question serves as the reference translation, while the predicted question is the candidate translation. For the “QG with MCQs” model, the reference translation comprises the target distractors, and the candidate translation consists of the predicted distractors. The term ‘translation’ is used because these metrics were originally designed for evaluating machine translation (MT) tasks, where reference and candidate translations are compared to assess quality.
BLEU 
(BiLingual Evaluation Understudy) is a string-matching algorithm that is widely utilized for assessing the quality of text in machine translation tasks [49]. It is not dependent on the particular language chosen, making it a versatile measure. In addition to its primary use in translation evaluation, BLEU has also been employed in various text-to-text generation tasks such as simplification of texts [50], and QG [6,30,35,37,38,40,45,51]. The application of BLEU extends to distractor generation as well, as seen in [46].
To formally present the mathematical definition of the BLEU metric [52], we start with some definitions. Let P_n denote the n-gram precision, defined as the ratio of accurately predicted n-grams to the total number of predicted n-grams. For example, if the target sentence is “He eats an apple” and the predicted sentence is “He ate an apple,” then P_1 = 3/4, P_2 = 1/3, and P_3 = 0. We define the geometric average precision score (GAP) as follows:
GAP(n) = exp( Σ_{i=1}^{n} w_i log P_i ) = Π_{i=1}^{n} P_i^{w_i},
where n represents the n-gram length, and w_i denotes the associated weight, which is typically uniform, i.e., w_i = 1/n. The Brevity Penalty (BP) is introduced in the BLEU metric computation to correct for a bias in the n-gram precision calculation. Without this correction, the unigram precision P_1 could incorrectly suggest perfect performance with a value of 1 by generating a prediction consisting of a single token, such as “He” or “an.” Such a scenario artificially inflates the performance score and could unduly encourage the model to produce minimal output lengths to artificially raise the precision metric. The BP effectively adjusts for this by penalizing overly short outputs,
BP = 1 if c > r, and BP = exp(1 − r/c) otherwise,
where c is the number of words in the predicted sentence, and r is the number of words in the target sentence. This ensures that the BP cannot exceed 1, even if the predicted sentence is longer than the target. Finally, the BLEU-n score is defined as
BLEU-n = BP · GAP(n).
The BLEU metric calculation varies based on the n-gram length. Specifically, BLEU-1 employs unigram precision, BLEU-2 computes the geometric mean of unigram and bigram precision, BLEU-3 incorporates the geometric mean of unigram, bigram, and trigram precision, and the pattern continues for higher values of n.
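For concreteness, the following snippet computes BLEU-n exactly as defined above (uniform weights and the brevity penalty) for the running example, reproducing P_1 = 3/4 and a BLEU-2 of 0.5.

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_n(reference, candidate, n):
    ref, cand = reference.split(), candidate.split()
    precisions = []
    for i in range(1, n + 1):
        ref_counts, cand_counts = Counter(ngrams(ref, i)), Counter(ngrams(cand, i))
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:                                   # any zero precision gives BLEU-n = 0
        return 0.0
    gap = math.exp(sum(math.log(p) for p in precisions) / n)   # uniform weights w_i = 1/n
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * gap

print(bleu_n("He eats an apple", "He ate an apple", 1))   # P_1 = 3/4 -> 0.75
print(bleu_n("He eats an apple", "He ate an apple", 2))   # sqrt(3/4 * 1/3) = 0.5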
In the proposed architecture of the first model, the system is designed to map the input sequence to the expected question, resulting in a generated question that serves as the model’s prediction. Similarly, in the second model configuration, the target output is defined as the appropriate distractor for each given prompt, with the model generating a predicted distractor accordingly.
METEOR 
(Metric for Evaluation of Translation with Explicit ORdering) [53] is a metric that assesses the quality of machine-translated text against a reference text. It primarily focuses on aligning words and phrases between the reference and the machine-translated text, and it employs measures such as precision, recall, and F-score in its evaluation. Importantly, this metric takes into account the sequence of words and phrases, as well as the use of synonyms. Additionally, it gauges both the adequacy and fluency of the machine-translated text: adequacy measures the extent to which the translation retains the original meaning, while fluency evaluates its grammaticality and readability. This metric has gained extensive usage in machine translation research and various NLP text-to-text tasks, e.g., [6,30,40,45], as in the case of our models. Its popularity stems from its language independence, making it applicable for evaluating translations in any language. Furthermore, it has shown a strong correlation with human judgment in determining translation quality; for example, it achieved correlations of 0.347 and 0.331 for Arabic and Chinese datasets, respectively (https://huggingface.co/spaces/evaluate-metric/meteor, accessed on 24 October 2024). The METEOR metric is given by
METEOR = F_mean · (1 − Penalty),
where F_mean = 10PR / (R + 9P) is the harmonic mean of precision (P) and recall (R). This measure is computed based on the count of matching unigrams, bigrams, trigrams, and their respective stemmed versions found in both the machine translation output and the reference translation(s). The precision is calculated as P = m / w_t, where m is the count of matching unigrams between the candidate and reference translations, and w_t is the total number of unigrams in the candidate translation. The recall is computed as R = m / w_r, with m as before, and w_r the number of unigrams in the reference translation.
The Penalty is a factor ranging from 0 to 1, which penalizes the machine translation output if it contains an excessive number of words that do not appear in the reference translation(s). This penalty is calculated by comparing the count of words in the machine translation output that are absent in the reference translation(s) to the overall word count in the machine translation output.
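The snippet below instantiates these formulas directly from unigram matches; the penalty term follows the simplified description above (unmatched candidate words over total candidate words), whereas the full metric additionally uses stemming, synonym matching, and a chunk-based fragmentation penalty.

def simple_meteor(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    m = sum(1 for tok in cand if tok in ref)        # matched unigrams
    if m == 0:
        return 0.0
    precision = m / len(cand)                       # m / w_t
    recall = m / len(ref)                           # m / w_r
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    penalty = (len(cand) - m) / len(cand)           # unmatched words over candidate length
    return f_mean * (1 - penalty)

print(simple_meteor("He eats an apple", "He ate an apple"))   # 0.75 * (1 - 0.25) = 0.5625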
ROUGE-L 
(Recall-Oriented Understudy for Gisting Evaluation—Longest Common Subsequence) [54] is a widely adopted metric for evaluating the similarity between generated and reference texts based on the longest common subsequence (LCS). It computes recall (R_LCS), precision (P_LCS), and F-score (F_LCS) using the following formulas:
R_LCS = LCS(X, Y) / |X|,
P_LCS = LCS(X, Y) / |Y|,
F_LCS = (1 + β^2) P_LCS R_LCS / (R_LCS + β^2 P_LCS),
where LCS(X, Y) denotes the length of the longest common subsequence between the reference text X and the generated text Y. The parameter β controls the relative importance of recall and precision; typically, β = 1 assigns equal weight, whereas β > 1 emphasizes recall. In this work we assume the default value of β.
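A compact implementation of these formulas is shown below, using a standard dynamic-programming computation of the LCS length over word tokens (β = 1 by default, matching the setting used in this work).

def rouge_l(reference, candidate, beta=1.0):
    x, y = reference.split(), candidate.split()
    # dynamic-programming table for the LCS length
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1]
                        else max(dp[i - 1][j], dp[i][j - 1]))
    lcs = dp[len(x)][len(y)]
    if lcs == 0:
        return 0.0
    r_lcs, p_lcs = lcs / len(x), lcs / len(y)                  # recall and precision
    return (1 + beta ** 2) * p_lcs * r_lcs / (r_lcs + beta ** 2 * p_lcs)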
ROUGE-L has been widely adopted in NLG tasks such as question generation and summarization, where it evaluates generated text quality by measuring both lexical overlap and sentence-level structural similarity [55]. Several Arabic NLP studies, including [5,30,32,34], have employed ROUGE-L for assessment. However, given the morphological richness and syntactic complexity of Arabic, ROUGE-L alone may fail to adequately capture semantic equivalence. This limitation is particularly pronounced, as the standard metric is sensitive to surface-form variations rather than underlying meaning [56].
Emerging metrics like LEMMA-ROUGE have been proposed specifically to address this issue by leveraging lemmatization to reduce morphological noise [56]. However, as it is a recent innovation not yet established in existing Arabic NLP literature, we prioritize metrics that allow for direct comparison with prior art. Consequently, to ensure both robustness and comparative fairness, we combine ROUGE-L with a suite of established complementary metrics—including BLEU for n-gram precision, METEOR for synonymy and inflection awareness, and BERTScore for contextual semantic evaluation. We identify the exploration of morphologically aware metrics like LEMMA-ROUGE as a valuable direction for future research.
BERTScore 
(Bidirectional Encoder Representations from Transformers Score) [57] is an evaluation metric that measures semantic similarity between generated and reference texts using contextual embeddings from pretrained transformer models like BERT. Unlike traditional metrics relying on exact word matches, it captures meaning through two steps:
  • Embedding generation. A pretrained transformer such as BERT converts the reference text x = ⟨x₁, x₂, …⟩ and the candidate text x̂ = ⟨x̂₁, x̂₂, …⟩ into the corresponding sequences of contextual token embeddings.
  • Similarity computation. For each token pair, cosine similarity is calculated between embeddings. The metric then computes precision, recall and F-score,
    P_BERT = (1/|x̂|) · Σ_{x̂ⱼ ∈ x̂} max_{xᵢ ∈ x} xᵢᵀ x̂ⱼ,
    R_BERT = (1/|x|) · Σ_{xᵢ ∈ x} max_{x̂ⱼ ∈ x̂} xᵢᵀ x̂ⱼ,
    F_BERT = 2 · P_BERT · R_BERT / (P_BERT + R_BERT).
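The greedy matching behind these formulas can be sketched directly from a matrix of cosine similarities. The snippet below assumes the token embeddings have already been extracted from a multilingual BERT model and L2-normalized (so the dot product equals cosine similarity); in practice a dedicated BERTScore package would be used, and the function name here is our own.

```python
import numpy as np

def bertscore_from_embeddings(ref_emb, cand_emb):
    """Greedy-matching BERTScore given L2-normalized token embeddings.
    ref_emb:  array of shape (|x|, d)      -- reference token embeddings
    cand_emb: array of shape (|x_hat|, d)  -- candidate token embeddings"""
    sim = ref_emb @ cand_emb.T                   # sim[i, j] = x_i^T x_hat_j
    r_bert = sim.max(axis=1).mean()              # each reference token -> best candidate match
    p_bert = sim.max(axis=0).mean()              # each candidate token -> best reference match
    f_bert = 2 * p_bert * r_bert / (p_bert + r_bert)
    return p_bert, r_bert, f_bert

# Toy example with random "embeddings" of dimension 8:
rng = np.random.default_rng(0)
ref = rng.normal(size=(5, 8));  ref /= np.linalg.norm(ref, axis=1, keepdims=True)
cand = rng.normal(size=(6, 8)); cand /= np.linalg.norm(cand, axis=1, keepdims=True)
print(bertscore_from_embeddings(ref, cand))
```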
While standard metrics like BLEU and METEOR provide a baseline for evaluating lexical overlap, they fall short of assessing the semantic consistency and cultural relevance of generated text, a critical limitation for morphologically rich languages like Arabic. To overcome this, we employ BERTScore, a metric that leverages contextual embeddings to evaluate semantic similarity. BERTScore offers three key advantages: (a) it captures deeper semantic relationships beyond surface-level n-gram matching; (b) it demonstrates a higher correlation with human judgment; and (c) its multilingual capability, via models like mBERT, makes it suited for Arabic evaluation.
To our knowledge, prior research in Arabic QG has not adopted BERTScore, creating a gap in robust semantic evaluation. By integrating it into our framework, we enable a direct, meaning-based comparison between our model and the baseline approaches (fine-tuned AraT5 and translated English QG models). This provides a novel and necessary dimension of assessment, ensuring that evaluations account for fluency, semantic accuracy, and contextual appropriateness, thereby offering a more holistic view of model performance.
Human Evaluation 
Automatic metrics compare machine-generated text with a set of reference texts and compute a score based on how closely the two match. While useful for quickly evaluating large datasets, they do not always capture the nuances of language that matter to human readers: an automatic metric might rate a machine-generated summary as highly similar to a reference, yet a human reader may still find it difficult to understand or lacking important details. Human evaluation addresses this gap by having judges read the machine-generated text and score it against task-relevant criteria such as accuracy, coherence, and readability. Although more time-consuming and resource-intensive than automatic metrics, human evaluation offers a more accurate and nuanced assessment, and combining the two gives a better picture of how the generated text performs in real-world scenarios.
Crowdsourcing offers cost-effectiveness, quicker feedback, and diverse perspectives but faces challenges like quality control issues, inconsistent judgments, and misunderstood instructions. For example, ref. [58] highlighted significant quality challenges when using platforms like Clickworker and MTurk for Arabic tasks.
To facilitate a meticulous and feasible manual evaluation, we use a randomly selected sample of 100 records from the test set. This sample size provides a robust and manageable subset for a detailed qualitative assessment by human judges, balancing statistical relevance with practical constraints. Three independent evaluators, all native Arabic speakers with college-level proficiency, assess the semantic similarity of the model-generated text to the ground truth using a three-point scale: 0 for “not similar”, 0.5 for “somewhat similar”, and 1 for “fully similar”. This approach gives a reliable and nuanced measure of output quality that complements our automated metrics.
We report the results of the human evaluation using two metrics: the overall average score and inter-rater reliability, measured with Fleiss’ kappa. The overall average score is the mean score across all records and raters. Fleiss’ kappa (κ) quantifies the level of agreement among multiple raters while accounting for agreement expected by chance. It is calculated as follows:
κ = (P_o − P_e) / (1 − P_e),
where P o is the observed agreement among raters, and P e is the expected agreement by chance. The value of κ ranges from −1 to 1, where κ = 1 indicates perfect agreement, and κ > 0 indicates some level of agreement, with higher values representing better agreement.
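For transparency, the following sketch computes Fleiss’ kappa from a matrix of per-item category counts, mirroring the formula above. The toy ratings are invented solely to show the input format (items in rows, the three similarity categories in columns, three raters per item); they are not our experimental data.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for a matrix of shape (n_items, n_categories), where
    counts[i, j] is how many raters assigned item i to category j.
    Every row must sum to the same (constant) number of raters."""
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # Per-item observed agreement, averaged over items -> P_o
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_o = p_i.mean()
    # Chance agreement from overall category proportions -> P_e
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.sum(p_j ** 2)
    return (p_o - p_e) / (1 - p_e)

# Toy example: 4 questions, 3 raters, categories (not / somewhat / fully similar)
ratings = [[0, 1, 2],
           [0, 0, 3],
           [1, 2, 0],
           [0, 1, 2]]
print(round(fleiss_kappa(ratings), 3))
```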

6. Results and Discussion

In this work, we considered three decoding methods for our two models but excluded greedy decoding from the quantitative experiments, as it tends to yield less varied and suboptimal outputs. Greedy decoding selects the most probable word at each step, which can lead to sequences that are not the best overall choices. Beam search and top-k sampling, in contrast, consider a range of candidates at each step and typically produce sequences that are both more varied and closer to optimal. Greedy decoding nevertheless remains a viable approach for researchers who prioritize efficiency over breadth and accuracy in their decoding strategies.
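As a point of reference, the practical difference between the two retained strategies reduces to a few generation arguments in the Hugging Face transformers API, as the hedged sketch below shows. The checkpoint path and the “answer: … context: …” input format are placeholders standing in for our fine-tuned mT5 models, not released artifacts.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder path for a fine-tuned mT5 QG checkpoint (assumption, not a released model).
tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-mt5-arabic-qg")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/finetuned-mt5-arabic-qg")

inputs = tokenizer("answer: 1879 context: ...", return_tensors="pt")

# Beam search: keeps the num_beams most probable partial sequences at each step.
beam_ids = model.generate(**inputs, num_beams=5, max_length=64, early_stopping=True)

# Top-k sampling: samples each token from the k most probable candidates,
# so repeated calls can yield different (more diverse) outputs.
topk_ids = model.generate(**inputs, do_sample=True, top_k=50, max_length=64)

print(tokenizer.decode(beam_ids[0], skip_special_tokens=True))
print(tokenizer.decode(topk_ids[0], skip_special_tokens=True))
```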
Table 3 presents the outcomes of question generation based on the context provided in Table 1, employing three distinct decoding algorithms. It includes the questions generated both with and without MCQs. In the non-MCQ category, the greedy search algorithm produced a relevant question, although it did not pertain directly to the correct answer. Meanwhile, the top-k algorithm yielded a question that was contextually accurate but suffered from a grammatical error: it used the feminine form ولدت (born) instead of the appropriate masculine form ولد. Beam search, on the other hand, generated a question that is valid and related to both the context and the correct answer.
In the MCQ-based question generation (see Table 3), the distractors produced by the greedy search algorithm are not consistently accurate, with one occasionally being the correct answer. Conversely, both the beam search and top-k algorithms generate valid distractors. However, top-k offers greater diversity in its distractors, as its outputs vary with each run, whereas the beam search yields the same distractors in every iteration.
The following subsections present a comprehensive evaluation of our models. We include quantitative results, qualitative examples of effective outputs, and an analysis of error cases. The evaluation is structured around three core tasks: Task 1, generating standalone questions; Task 2, generating multiple-choice questions with distractors; and Task 3, an ablation study comparing beam search and top-k sampling for both question types.
Given the limited amount of research on Arabic question generation in the literature, we find it appropriate to compare our results with studies conducted in both Arabic and other languages. We assess our findings using the BLEU, METEOR, and BERTScore metrics, together with human evaluation, which are the preferred methods for evaluating this task.

6.1. Task 1: Standalone Question Generation

Our evaluation employs two distinct baseline models to ensure a comprehensive comparison. The first is AraT5, a state-of-the-art model pre-trained specifically on Arabic and a strong representative of Arabic-specific language modeling. The second is the English T5-large, a powerful general-purpose model, which allows us to benchmark our system against a high-performance model adapted post hoc for Arabic (the translate-generate setup is described below).
Table 4 summarizes the performance metrics for Task 1 (question generation without MCQs), comparing our proposed model against the AraT5-base baseline and other relevant systems. To provide broader context, we also include results from models operating in other languages. Note that studies which did not employ our core set of evaluation metrics—namely BLEU, METEOR, ROUGE-L, and BERTScore—have been excluded from this comparison. For instance, works such as [59,60] utilized precision, recall, and F-score, and are therefore omitted to ensure a consistent and fair evaluation across all included systems.
For Task 1, our model outperforms the fine-tuned AraT5 across BLEU-4, METEOR, and F_BERT, reflecting gains in both lexical and semantic quality. While previous Arabic QG models [30,33] report higher scores on isolated metrics like METEOR or ROUGE-L, our approach achieves competitive or superior overall performance, particularly in BLEU-4 and METEOR. Notably, our model’s BLEU-4 and METEOR scores exceed most reported results for English SQuAD models, underscoring its effectiveness despite Arabic’s linguistic complexity. The low ROUGE-L performance highlights this metric’s limitations for Arabic and reinforces the value of F_BERT as a complementary measure. A qualitative comparison of model outputs (ours vs. AraT5) is provided in Table 5, which includes the input context, answer, and golden reference question to facilitate analysis of performance and error patterns.
To evaluate the real-world usability and quality of the generated questions, we performed a human assessment study. Three domain experts evaluated a random sample of 100 outputs based on criteria such as grammaticality and relevance. The model achieved a strong average score of 74.20%, demonstrating its effectiveness. The reliability of this human evaluation is confirmed by a Fleiss’ kappa score of κ = 0.506 , which signifies moderate inter-rater agreement and strengthens the validity of the assessment.
Next, we examine representative samples of both correctly and incorrectly generated questions produced by our model. Table 6 reports examples of correctly generated questions, which align well with the intended context and answer. Conversely, the incorrectly generated questions display a range of deficiencies, including incompleteness, grammatical inaccuracies despite semantic plausibility (see Table 7), and lack of relevance to either the contextual passage or the target answer.
Our second baseline is the English T5-large (https://huggingface.co/lmqg/t5-large-squad-qg, accessed on 20 December 2023), a powerful general-purpose model. To adapt it for Arabic question generation, we employed a translate-generate pipeline: questions generated by T5-large were translated into Arabic. This approach allows us to benchmark our native Arabic model against a high-performance model adapted post-hoc for the language. To evaluate this comparison, we conducted an experiment on a random sample of 100 records from the SQuAD validation dataset. We generated questions using the English T5-large model, translated them into Arabic, and evaluated them alongside the questions from our native model using identical metrics; the average scores for both approaches are presented in Table 8.
Further comparison with translated English–Arabic QG models highlights our model’s superiority in BLEU-4 and F BERT , demonstrating enhanced fluency and semantic accuracy, while achieving comparable performance in METEOR and ROUGE-L. These findings confirm the model’s capacity to generate natural and contextually appropriate Arabic questions, while also pointing to areas for potential improvement.

6.2. Task 2: Multiple-Choice Question Generation

Due to the absence of an Arabic system for generating MCQs, we compare our approach with systems in other languages that use a similar architecture and evaluate their performance with the same metrics. The results are reported in Table 9.
We observed that no existing models were assessed using METEOR or human evaluation metrics. In contrast, our model achieved an average METEOR score of 27.40%. Additionally, the overall average score given by three human experts was 72.20%, with Fleiss’ kappa, κ = 0.285 , indicating modest agreement among the raters. This absence of comparable metrics complicates the definitive evaluation of our model’s performance relative to others. Furthermore, while some models employed the T5 architecture (e.g., [37,63,64]), others, including ours, used mT5, the multilingual variant of T5.
Although multilingual models like mT5 offer broad applicability across languages, they come with inherent limitations. These models are designed to generalize across multiple languages, which can result in a weaker grasp of the nuanced linguistic features specific to a single language. This limitation is particularly important in tasks like question generation, where a deep understanding of grammar, syntax, and semantics is essential.
Additionally, the size of the training dataset plays a significant role. The RACE dataset [65] used in [37], for instance, is a large-scale resource containing approximately 28,000 passages and 100,000 questions created by human experts, covering a wide range of topics. In comparison, our dataset (see Table 2) is smaller, consisting of ≈7400 questions. Despite this difference in scale, we believe our model’s performance is very reasonable given these considerations.
Table 10 provides a few examples of correctly generated distractors. For incorrectly generated distractors, the issues again varied, including meaningless or incomplete distractors (see Table 11) and distractors with syntactic errors. For example, for the answer في الخمسينات من القرن التاسع عشر, one of the generated distractors is العاشر عشر في العشرينات من القرن.

6.3. Task 3: Analysis of Decoding Strategies

This analysis compares the efficacy of two decoding strategies—beam search and top-k sampling—for generating both standalone questions and multiple-choice questions (MCQs) with distractors. Due to time constraints, experiments were conducted on a randomly selected subset of the dataset. For each task, we applied a paired t test to determine whether the observed performance differences between decoding algorithms were due to random variation or reflected a systematic advantage of one approach.
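Concretely, the significance test is a paired comparison of per-item metric scores obtained from the two decoders on the same evaluation examples; a minimal sketch with SciPy follows, in which the score values are illustrative placeholders rather than figures from our experiments.

```python
from scipy import stats

# Per-example scores of the same metric under the two decoders,
# aligned so that position i refers to the same evaluation example.
beam_scores = [0.31, 0.22, 0.40, 0.18, 0.27, 0.35]
topk_scores = [0.17, 0.12, 0.29, 0.10, 0.20, 0.19]

t_stat, p_value = stats.ttest_rel(beam_scores, topk_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")   # p < 0.05 -> systematic difference
```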
For standalone question generation, Table 12 reports the results. Across all automated metrics, beam search demonstrates a statistically significant advantage over top-k sampling. The reported values differ slightly from those in Table 4, as they were computed on a smaller evaluation subset.
Beam search consistently achieved higher scores, with relative improvements of 75% in BLEU-1 and 131% in METEOR. All p-values fall below α = 0.05, confirming statistical significance. The particularly strong result for F_BERT (p < 0.001) underscores superior semantic alignment with reference questions. These findings establish beam search as the more effective decoding strategy for fact-based question generation, producing outputs that are lexically more precise, fluent, and semantically faithful.
We next evaluated the distractor generation component of our MCQ system. BLEU-1 through BLEU-4 scores were computed for each distractor ( D 1 , D 2 , D 3 ) generated under beam search and top-k sampling. Significance testing was again performed using a paired t test. To ensure reliable evaluation, scores were calculated after text normalization, stemming, and lemmatization, with zero values assigned to distractors that were either empty or overly similar to the correct answer or to another distractor (similarity threshold of 90, using the fuzzywuzzy library). The results are presented in Table 13. Again, values are different from those reported earlier (Table 9) since we used a smaller subset.
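The similarity filter applied before scoring can be sketched as follows. The helper name is ours, and the text normalization, stemming, and lemmatization steps are omitted for brevity; only the 90-point fuzzy-similarity threshold described above is shown. Distractors rejected by this check receive zero BLEU scores in the evaluation.

```python
from fuzzywuzzy import fuzz

def valid_distractor(distractor, answer, previous_distractors, threshold=90):
    """Return False for a distractor that is empty or nearly identical
    (fuzzy ratio >= threshold) to the answer or to an earlier distractor."""
    text = distractor.strip()
    if not text:
        return False
    if fuzz.ratio(text, answer) >= threshold:
        return False
    return all(fuzz.ratio(text, d) < threshold for d in previous_distractors)

# Example: the second candidate repeats the correct answer and is rejected.
answer = "1879"
candidates = ["1955", "1879", "1921"]
kept = []
for c in candidates:
    if valid_distractor(c, answer, kept):
        kept.append(c)
print(kept)   # ['1955', '1921']
```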
The results clearly indicate that top-k sampling significantly outperforms beam search across all metrics and distractor types. In every case, top-k achieved higher BLEU scores, with the associated p-values confirming statistical significance ( p < 0.05 for all comparisons). These findings support our earlier claim that top-k sampling promotes greater diversity, enabling the generation of distractors that are more varied yet remain plausible within the multiple-choice setting.
Overall, these two sets of experiments reveal an interesting divergence in the relative strengths of decoding strategies. While beam search emerges as the superior method for fact-based question generation, consistently producing outputs that are semantically faithful and lexically precise, it falls short in the context of distractor generation. In this latter task, top-k sampling provides a clear advantage by yielding distractors that are both diverse and contextually plausible, as reflected in the consistently higher BLEU scores and statistically significant differences. This contrast suggests that the optimal choice of decoding algorithm is task-dependent: beam search is preferable when accuracy and semantic fidelity are paramount, whereas top-k sampling is better suited for generating varied yet coherent alternatives in multiple-choice settings.

7. Conclusions and Future Work

This work addressed the challenging task of Arabic automatic question generation (QG) using the multilingual mT5 architecture, focusing on both standalone questions and multiple-choice questions (MCQs) derived from a given context and answer span. Our framework incorporated answer-aware generation to align questions with target answers and extended this approach to distractor creation, complemented by QG reranking to evaluate and select the highest-quality outputs.
A central finding of our study is the task-dependent efficacy of decoding strategies. Beam search consistently proved superior for fact-based QG, producing semantically faithful and lexically precise questions. By contrast, top-k sampling offered a clear advantage for distractor generation, yielding outputs that were not only higher scoring but also more diverse and contextually plausible, as confirmed by significantly different BLEU scores.
Evaluation across BLEU, METEOR, and human assessment confirmed that our system outperforms baselines in Arabic QG and establishes strong foundations for MCQ generation. These results demonstrate the value of tailoring decoding algorithms to specific subtasks, while highlighting the broader pedagogical potential of Arabic QG systems to support automated assessment and adaptive learning.
Future research in automatic Arabic QG for educational assessments could explore several directions. First, improving question quality by leveraging Arabic-specific transformers like AraBERT and comparing them with existing models is promising. Expanding to diverse assessment types (e.g., fill-in-the-blank questions) and scaling to larger models or semantic graph-based approaches could further enhance text comprehension. Additionally, adapting the model to specialized contexts—such as educational materials, news, and technical documents—would address varied domain needs.
Future work includes several promising directions. One avenue involves augmenting T5-based question generation by prioritizing high-loss or semantically misaligned samples during fine-tuning, adapting methods from [66] originally used for medical AI classifiers. This approach could help the model better capture linguistic nuances and reduce overfitting in low-resource settings. Another critical direction is improving robustness to noisy text—such as spelling errors, grammatical mistakes, or non-standard language—by exploring techniques like imitation learning or self-correction [11]. Such methods may mitigate cascading errors and improve output coherence, particularly given the scarcity of annotated noisy datasets for Arabic. These efforts would enhance the model’s applicability to real-world domains like social media or informal communication.

Author Contributions

Conceptualization, R.B.J. and A.M.A.; methodology, R.B.J.; validation, R.B.J.; formal analysis, R.B.J. and A.M.A.; investigation, R.B.J.; resources, A.M.A.; data curation, R.B.J.; writing—original draft preparation, R.B.J.; writing—review and editing, A.M.A.; supervision, A.M.A.; funding acquisition, A.M.A. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to thank the Ongoing Research Funding Program (ORFFT-2025-006-3), King Saud University, Riyadh, Saudi Arabia, for financial support.

Data Availability Statement

The dataset for the MCQs is available from the corresponding author upon email request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations (ordered alphabetically) are used in this manuscript:
AI: Artificial Intelligence
ARCD: Arabic Reading Comprehension Dataset
BLEU: BiLingual Evaluation Understudy
JSON: JavaScript Object Notation
LCS: Longest Common Subsequence
MCQs: Multiple-Choice Questions
METEOR: Metric for Evaluation of Translation with Explicit ORdering
MLQA: Multi-lingual Question Answering
MT: Machine Translation
mT5: Multi-lingual T5
NER: Named Entity Recognition
NLP: Natural Language Processing
NLU: Natural Language Understanding
QA: Question-Answering
QG: Question Generation
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
SQuAD: Stanford Question Answering Dataset
SVO: Subject-Verb-Object
T5: Google’s Text-to-Text Transfer Transformer
VOS: Verb-Object-Subject
VSO: Verb-Subject-Object

References

  1. Neirotti, R.A. The importance of asking questions and doing things for a reason. Braz. J. Cardiovasc. Surg. 2021, 36, 1–2. [Google Scholar] [CrossRef]
  2. Thalheimer, W. Learning Benefits of Questions; A Work-Learning Research Publication: Somerville, MA, USA, 2003. [Google Scholar]
  3. Al-Khatib, H. Automatic Questions Generation from Arabic Content (in Arabic). Master’s Thesis, Higher Institute for Applied Sciences and Technology, Damascus, Syria, 2019. [Google Scholar]
  4. Al-Hasan, A. Measurement and Evaluation Course (Lectures 6th and 7th): Essay and Objective Tests. 2019. Available online: https://hama-univ.edu.sy/newsites/education/wp-content/uploads/2020/05/xxx.pdf (accessed on 7 February 2024).
  5. Alwaneen, T.H.; Azmi, A.M. Stacked dynamic memory-coattention network for answering why-questions in Arabic. Neural Comput. Appl. 2024, 36, 8867–8883. [Google Scholar] [CrossRef]
  6. Lopez, L.E.; Cruz, D.K.; Cruz, J.C.B.; Cheng, C. Transformer-based end-to-end question generation. arXiv 2020, arXiv:2005.01107. [Google Scholar]
  7. Mannaa, Z.M.; Azmi, A.M.; Aboalsamh, H.A. Computer-assisted i‘raab of Arabic sentences for teaching grammar to students. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 8909–8926. [Google Scholar] [CrossRef]
  8. Azmi, A.M.; Al-Jouie, M.F.; Hussain, M. AAEE - Automated evaluation of students’ essays in Arabic language. Inf. Process. Manag. 2019, 56, 1736–1752. [Google Scholar] [CrossRef]
  9. Azmi, A.; Al-Thanyyan, S. Ikhtasir—A user selected compression ratio Arabic text summarization system. In Proceedings of the 2009 International Conference on Natural Language Processing and Knowledge Engineering, Dalian, China, 24–27 September 2009; pp. 1–7. [Google Scholar]
  10. Almuzaini, H.A.; Azmi, A.M. TaSbeeb: A judicial decision support system based on deep learning framework. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 101695. [Google Scholar] [CrossRef]
  11. Pozzi, A.; Incremona, A.; Tessera, D.; Toti, D. Mitigating exposure bias in large language model distillation: An imitation learning approach. Neural Comput. Appl. 2025, 37, 12013–12029. [Google Scholar] [CrossRef]
  12. Gao, Y.; Bing, L.; Li, P.; King, I.; Lyu, M.R. Generating distractors for reading comprehension questions from real examinations. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6423–6430. [Google Scholar]
  13. Tay, Y.; Tuan, L.A.; Hui, S.C. Multi-range reasoning for machine comprehension. arXiv 2018, arXiv:1803.09074. [Google Scholar] [CrossRef]
  14. Ebel, R.L.; Frisbie, D.A. Essentials of Educational Measurement, 5th ed.; Prentice-Hall: Hoboken, NJ, USA, 1991. [Google Scholar]
  15. Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 483–498. [Google Scholar]
  16. Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A massively multilingual pre-trained text-to-text transformer. arXiv 2021, arXiv:2010.11934v3. [Google Scholar]
  17. Wolfe, J.H. Automatic question generation from text-an aid to independent study. In Proceedings of the ACM SIGCSE-SIGCUE Technical Symposium on Computer Science and Education, Anaheim, CA, USA, 12–13 February 1976; pp. 104–112. [Google Scholar]
  18. Rus, V.; Graesser, A.C. Workshop Report: The Question Generation Task and Evaluation Challenge; Technical report; Institute for Intelligent Systems: Stuttgart, Germany, 2009. [Google Scholar]
  19. Zhang, R.; Guo, J.; Chen, L.; Fan, Y.; Cheng, X. A review on question generation from natural language text. ACM Trans. Inf. Syst. (TOIS) 2021, 40, 1–43. [Google Scholar] [CrossRef]
  20. Belyanova, M.; Chernenkiy, V.; Kaganov, Y.; Gapanyuk, Y. Using hybrid intelligent information system approach for text question generation. In Proceedings of the CEUR Workshop Proceedings—Russian Advances in Fuzzy Systems and Soft Computing: Selected Contributions to the 8th International Conference on Fuzzy Systems, Soft Computing and Intelligent Technologies (FSSCIT 2020), Smolensk, Russia, 29 June–1 July 2020; Volume 2782, pp. 194–201. [Google Scholar]
  21. Mostow, J.; Chen, W. Generating Instruction Automatically for the Reading Strategy of Self-Questioning. In Proceedings of the 2009 Conference on Artificial Intelligence in Education: Building Learning Systems that Care: From Knowledge Representation to Affective Modelling, Brighton, UK, 6–10 July 2009; pp. 465–472. [Google Scholar]
  22. Kunichika, H.; Katayama, T.; Hirashima, T.; Takeuchi, A. Automated question generation methods for intelligent English learning systems and its evaluation. In Proceedings of the International Conference on Computers in Education (ICCE’04), Melbourne, Australia, 30 November–3 December 2004; Volume 670. [Google Scholar]
  23. Huang, Y.; He, L. Automatic generation of short answer questions for reading comprehension assessment. Nat. Lang. Eng. 2016, 22, 457–489. [Google Scholar] [CrossRef]
  24. Bousmaha, K.Z.; Chergui, N.H.; Mbarek, M.S.A.; Belguith, L.H. AQG: Arabic Question Generator. Rev. D’Intelligence Artif. 2020, 34, 721–729. [Google Scholar] [CrossRef]
  25. Elbasyouni, M.; Abdelrazek, E.; Saad, A. Building a system based on natural languages processing to automatic question generation from Arabic texts. Int. J. Curr. Res. 2014, 6, 7608–7613. [Google Scholar]
  26. Kaur, J.; Bathla, A.K. Automatic question generation from Hindi text using hybrid approach. In Proceedings of the Second International Conference on Science Technology and Management, Sliema, Malta, 17 August 2015. [Google Scholar]
  27. Swali, D.; Palan, J.; Shah, I. Automatic question generation from paragraph. Int. J. Adv. Eng. Res. Dev. 2016, 3, 73–78. [Google Scholar] [CrossRef]
  28. Kriangchaivech, K.; Wangperawong, A. Question generation by transformers. arXiv 2019, arXiv:1909.05017. [Google Scholar] [CrossRef]
  29. Ouahrani, L.; Bennouar, D. Attentional Seq2Seq Model for Arabic Opinion Question Generation. In Proceedings of the International Symposium on Modelling and Implementation of Complex Systems; Springer Nature: Cham, Switzerland, 2024; pp. 112–126. [Google Scholar]
  30. Alhashedi, S.; Suaib, N.M.; Bakri, A. Arabic automatic question generation using transformer model. In Proceedings of the AIP Conference Proceedings; AIP Publishing: Melville, NY, USA, 2024; Volume 2991. [Google Scholar]
  31. Bonifacio, L.; Jeronymo, V.; Abonizio, H.Q.; Campiotti, I.; Fadaee, M.; Lotufo, R.; Nogueira, R. mMARCO: A multilingual version of the MS MARCO passage ranking dataset. arXiv 2021, arXiv:2108.13897. [Google Scholar]
  32. Lafkiar, S.; En Nahnahi, N. An end-to-end transformer-based model for Arabic question generation. Multimed. Tools Appl. 2024, 84, 22009–22023. [Google Scholar] [CrossRef]
  33. Lafkiar, S.; Nahnahi, N.E. An Arabic question generation system based on a shared BERT-base encoder-decoder architecture. Math. Model. Comput. 2024, 11, 763–772. [Google Scholar] [CrossRef]
  34. Rahim, M.; Khoja, S.A. Sawaal: A Framework for Automatic Question Generation in Urdu. In Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024), Trento, Italy, 19–20 October 2024; pp. 139–148. [Google Scholar]
  35. Nagoudi, E.M.B.; Elmadany, A.; Abdul-Mageed, M. AraT5: Text-to-Text Transformers for Arabic Language Generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 628–647. [Google Scholar]
  36. de Fitero-Dominguez, D.; Garcia-Cabot, A.; Garcia-Lopez, E. Automated multiple-choice question generation in Spanish using neural language models. Neural Comput. Appl. 2024, 36, 18223–18235. [Google Scholar] [CrossRef]
  37. Vachev, K.; Hardalov, M.; Karadzhov, G.; Georgiev, G.; Koychev, I.; Nakov, P. Leaf: Multiple-choice question generation. In Advances in Information Retrieval, Proceedings of the European Conference on Information Retrieval (ECIR 2022), Stavanger, Norway, 10–14 April 2022; Springer: Cham, Switzerland, 2022; pp. 321–328. [Google Scholar]
  38. Abdel-Galil, H.; Mokhtar, M.; Doma, S. Automatic question generation model based on deep learning approach. Int. J. Intell. Comput. Inf. Sci. 2021, 21, 110–123. [Google Scholar] [CrossRef]
  39. Montgomerie, A. Generating Questions Using Transformers. 2020. Available online: https://amontgomerie.github.io/2020/07/30/question-generator.html (accessed on 24 October 2024).
  40. Qiu, J.; Xiong, D. Generating Highly Relevant Questions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 5983–5987. [Google Scholar]
  41. Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. arXiv 2016, arXiv:1606.05250. [Google Scholar]
  42. Mozannar, H.; Hajal, K.E.; Maamary, E.; Hajj, H. Neural Arabic question answering. arXiv 2019, arXiv:1906.05394. [Google Scholar] [CrossRef]
  43. Lewis, P.; Oğuz, B.; Rinott, R.; Riedel, S.; Schwenk, H. MLQA: Evaluating cross-lingual extractive question answering. arXiv 2019, arXiv:1910.07475. [Google Scholar]
  44. ARCD, H.F.D. Arabic Reading Comprehension Dataset. 2021. Available online: https://huggingface.co/datasets/arcd (accessed on 10 July 2023).
  45. Liu, B. Neural question generation based on Seq2Seq. In Proceedings of the 2020 5th International Conference on Mathematics and Artificial Intelligence, Chengdu, China, 10–13 April 2020; pp. 119–123. [Google Scholar]
  46. Hernandez, L.; Randall, S.; Nazeri, A. Question Generator. 2020. Available online: http://cs230.stanford.edu/projects_fall_2020/reports/55771015.pdf (accessed on 10 July 2023).
  47. von Platen, P. How to Generate Text: Using Different Decoding Methods for Language Generation with Transformers. 2020. Available online: https://huggingface.co/blog/how-to-generate (accessed on 5 August 2023).
  48. Antoun, W.; Baly, F.; Hajj, H. AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding. arXiv 2020, arXiv:cs.CL/2012.15516. [Google Scholar]
  49. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  50. Al-Thanyyan, S.S.; Azmi, A.M. Simplification of Arabic text: A hybrid approach integrating machine translation and transformer-based lexical model. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 101662. [Google Scholar] [CrossRef]
  51. Kim, Y.; Lee, H.; Shin, J.; Jung, K. Improving neural question generation using answer separation. In Proceedings of the AAAI conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6602–6609. [Google Scholar]
  52. Doshi, K. Foundations of NLP Explained—Bleu Score and WER Metrics. 2021. Available online: https://towardsdatascience.com/foundations-of-nlp-explained-bleu-score-and-wer-metrics-1a5ba06d812b (accessed on 10 July 2023).
  53. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
  54. Lin, C.Y. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
  55. Wang, B.; Wang, X.; Tao, T.; Zhang, Q.; Xu, J. Neural question generation with answer pivot. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 9138–9145. [Google Scholar]
  56. Al-Numai, A.; Azmi, A. LEMMA-ROUGE: An Evaluation Metric for Arabic Abstractive Text Summarization. Indones. J. Comput. Sci. 2023, 12, 470–481. [Google Scholar] [CrossRef]
  57. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proceedings of the 8th International Conference on Learning Representations (ICLR2020), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  58. Almuzaini, H.A.; Azmi, A.M. An unsupervised annotation of Arabic texts using multi-label topic modeling and genetic algorithm. Expert Syst. Appl. 2022, 203, 117384. [Google Scholar] [CrossRef]
  59. Nakhleh, S.; Mustafa, A.M.; Najadat, H. AraT5GQA: Arabic Question Answering model using automatic generated dataset. In Proceedings of the 2024 15th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 13–15 August 2024; pp. 1–5. [Google Scholar]
  60. Tami, M.; Ashqar, H.I.; Elhenawy, M. Automated question generation for science tests in Arabic language using NLP techniques. In International Conference on Intelligent Systems, Blockchain, and Communication Technologies; Springer: Cham, Switzerland, 2024; pp. 274–285. [Google Scholar]
  61. Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; Hon, H.W. Unified language model pre-training for natural language understanding and generation. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 13063–13075. [Google Scholar]
  62. Xiao, D.; Zhang, H.; Li, Y.; Sun, Y.; Tian, H.; Wu, H.; Wang, H. ERNIE-GEN: An enhanced multi-flow pre-training and fine-tuning framework for natural language generation. In Proceedings of the 29th International Conference on International Joint Conferences on Artificial Intelligence, Virtual, 19–26 August 2021; pp. 3997–4003. [Google Scholar]
  63. Chomphooyod, P.; Suchato, A.; Tuaycharoen, N.; Punyabukkana, P. English grammar multiple-choice question generation using Text-to-Text Transfer Transformer. Comput. Educ. Artif. Intell. 2023, 5, 100158. [Google Scholar] [CrossRef]
  64. Rodriguez-Torrealba, R.; Garcia-Lopez, E.; Garcia-Cabot, A. End-to-end generation of multiple-choice questions using text-to-text transfer transformer models. Expert Syst. Appl. 2022, 208, 118258. [Google Scholar] [CrossRef]
  65. Lai, G.; Xie, Q.; Liu, H.; Yang, Y.; Hovy, E. RACE: Large-scale ReAding Comprehension Dataset From Examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing; Palmer, M., Hwa, R., Riedel, S., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 785–794. [Google Scholar]
  66. Al-Ssulami, A.M.; Alsorori, R.S.; Azmi, A.M.; Aboalsamh, H. Improving coronary heart disease prediction through machine learning and an innovative data augmentation technique. Cogn. Comput. 2023, 15, 1687–1702. [Google Scholar] [CrossRef]
Figure 1. The framework of Google’s Text-to-Text Transfer Transformer (T5).
Figure 2. The taxonomy of the question generation (QG) tasks (Source: [19]).
Figure 3. Our proposed system design for question generation, encompassing both open-ended questions and multiple-choice questions (MCQs). The top part shows the training phase, while the bottom part features the generation phase.
Figure 4. Example of QG with multiple-choice options. The output will include a question accompanied by four options: one correct answer and three distractors.
Table 1. An annotation example showing a question, context, and answer in both Arabic and its English translation. The answer set comprises a correct option accompanied by three incorrect choices, referred to as distractors.
Question
في أي عام ولد ألبرت أينشتاين؟ (In which year was Albert Einstein born?)
Context
بالألمانية (Albert Einstein) 14 مارس 1979 –18 أبريل 1955) عالم فيزياء
ألماني المولد، (حيث تخلى عن الجنسية الألمانية لاحقا) سويسري وأمريكي
الجنسية، من أبوين يهوديين، وهو يشتهر بأب النسبية كونه واضع النسبية الخاصة
والنسبية العامة الشهيرتين اللتين كانتا اللبنة الأولى للفيزياء النظرية الحديثة، ولقد
حاز في عام 1921 على جائزة نوبل في الفيزياء عن ورقة بحثية عن التأثير
الكهروضوئي، ضمن ثلاثمائة ورقة علمية أخرى له في تكافؤ المادة والطاقة
وميكانيكا الكم وغيرها، وأدت استنتاجاته المبرهنة إلى تفسير العديد من الظواهر
العلمية التي فشلت الفيزياء الكلاسيكية في إثباتها
In German, Albert Einstein (14 March 1879–18 April 1955) was
a German-born physicist (later renouncing his German
citizenship), with Swiss and American citizenship. He was
born to Jewish parents and is renowned as the father of
relativity, having formulated both the special and general
theories of relativity, which laid the foundation for modern
theoretical physics. In 1921, he was awarded the Nobel Prize in
Physics for his research on the photoelectric effect, among his
other 300 scientific papers on the equivalence of matter and
energy, quantum mechanics, and more. His groundbreaking
findings led to the interpretation of numerous scientific
phenomena that classical physics had failed to explain.
Correct answer: 1879
Distractor 1: 1955
Distractor 2: 1921
Distractor 3: 1979
Table 2. Summary of our compiled dataset. The domain notes cover the nature of the contents.
Task | Dataset | Source (Domain Notes) | Size | Option
QG w/o MCQs | Arabic SQuAD 1.1 [42] | Translated from English (Wikipedia/General) | 48,344 | Question, Context, and Answer
QG with MCQs | MLQA [43] | Wikipedia articles (Wikipedia/General) | 5852 | Question, Context, Answer, and Distractors
QG with MCQs | ARCD [44] | Arabic Wikipedia (Wikipedia/General) | 1395 | Question, Context, Answer, and Distractors
QG with MCQs | Qudrat | Arabic reading comprehension from Qiyas (Exams) | 200 | Question, Context, Answer, and Distractors
QG with MCQs | Mawdoo3.com | Arabic articles from https://mawdoo3.com/ (News/Articles) | 700 | Question, Context, Answer, and Distractors
QG with MCQs | Madinah Arabic | Arabic reading comprehension test (Exams) | 100 | Question, Context, Answer, and Distractors
Table 3. Comparative results of question generation (QG), both without and with multiple-choice questions (MCQs). This comparison is based on the context provided in Table 1, and the results are generated using three different decoding algorithms: greedy, top-k, and beam search. For convenience, we translated the questions to English.
Task 1: Generating plain questions
The correct answer: 1879
Generated question (by approach)
    Greedy: في اي عام ولدت ابنة اينشتاين؟ (In which year was Einstein’s daughter born?)
    Top-k: في اي عام ولدت اينشتاين؟ (In which year was Einstein born (fem.)?)
    Beam search: في اي عام ولد اينشتاين؟ (In which year was Einstein born?)
Task 2: Generating multiple-choice questions
Question: في اي عام ولد اينشتاين؟ (In which year was Einstein born?)
Correct answer: 1879
Generated distractors (by approach)
    Greedy: 1955, 1879, 1880
    Top-k: 1955, 1930, 1919
    Beam search: 1879, 1880, ابريل (April) 1955
Table 4. Comparing the performance of various models on Task 1: straightforward question generation. For models with multiple reported results, we include the best performing scores. Unreported values are indicated with “–”. Additionally, we report the performance of an AraT5-base model, which we fine-tuned specifically for this task to serve as a strong, directly comparable baseline.
Model | Language | Dataset | BLEU-4 | METEOR | ROUGE-L | F_BERT
Wang et al. [55] | English | SQuAD | 16.42 | 18.95 | 41.87 | –
Dong et al. [61] | English | SQuAD | 22.12 | 25.06 | 51.07 | –
Xiao et al. [62] | English | SQuAD | 25.40 | 26.92 | 52.84 | –
Abdel-Galil et al. [38] | English | SQuAD | 11.30 | – | – | –
Rahim and Khoja [34] | Urdu | UQuAD | 23.32 | 36.47 | 53.66 | –
Alhashedi et al. [30] | Arabic | mMARCO | 19.12 | 23.00 | 51.99 | –
Lafkiar and En Nahnahi [32] | Arabic | Arabic SQuAD & ARCD | 20.51 | 24.04 | 44.01 | –
Lafkiar and Nahnahi [33] | Arabic | ARQGData | 20.29 | 30.73 | 38.54 | –
Ouahrani and Bennouar [29] | Arabic | Adapted-cQA-MD | 10.03 | 11.61 | 15.08 | –
Nagoudi et al. [35] | Arabic | ARGEN-QG | 16.99 | – | – | –
AraT5-base (fine-tuned) | Arabic | Arabic SQuAD | 20.61 | 20.37 | 4.80 | 74.50
Our model | Arabic | Arabic SQuAD | 27.49 | 25.18 | 4.24 | 76.34
Table 5. Comparative examples of generated questions (GQs) by AraT5 and our model. Each example provides the source context, the target answer, and the corresponding golden reference question for comparison.
Contextكطالب دكتوراه في جامعة ايرلانجن نورمبرج الالمانية ، بدا كارلينز براندنبورغ العمل على ضغط الموسيقى الرقمية في اوائل الثمانينات ،
مع التركيز على كيفية ادراك الناس للموسيقى . اكمل اعمال دكتوراه في عام 1989 . تنحدر MP3 مباشرة من OCF و PXFM مما يمثل
نتيجة تعاون براندنبورغ يعمل كخبير ما بعد دكتوراه في مختبرات AT&T Bell مع James D. Johnston (JJ) في AT&T Bell Labs
مع معهد فراونهوفر للدوائر المتكاملة ، Erlangen ، مع مساهمات صغيرة نسبيا من فرع MP2 من المبرمجين الفرعيين نفسيا الصوتي .
في عام 1990 ، اصبح براندنبورغ استاذا مساعدا في Erlangen Nuremberg . بينما كان هناك ، واصل العمل على ضغط الموسيقى
مع العلماء في جمعية فراونهوفر في عام 1993 انضم الى موظفي معهد فراونهوفر .
Answer1993
Reference questionمتى أنضم براندنبورغ إلى معهد فراونهوفر
Our model GQفي اي عام انضم كارلينز براندنبورغ الى معهد فراونهوفر
AraT5 GQفي اي عام كان كارلينز براندنبورج استاذا مساعدا في جامعة ساوثامبتون؟
OBSERVATION: AraT5 model generates an off-topic question unrelated to the core subject matter. In contrast, our model correctly identifies the key entity and generates a question which accurately queries about the relevant institution, demonstrating a closer semantic alignment with the reference question.
Contextاكثر الامراض المعروفة التي تؤثر على جهاز المناعة نفسه هو الايدز ، وهو عوز المناعة الذي يتميز بقمع الخلايا التائية CD4 المساعد
والخلايا المتغصنة والبلاعم من فيروس نقص المناعة البشرية HIV.
Answerالايدز
Reference questionما هو اكثر امراض جهاز المناعة شهرة
Our model GQما هو اكثر الامراض معروفة التي تؤثر على جهاز المناعة
AraT5 GQكم عدد الامراض التي تؤثر على جهاز المناعة؟
OBSERVATION: AraT5 model exhibits a failure in task adherence, incorrectly generating a “how many” question that queries quantity, whereas the reference question requires a “what is” question targeting conceptual identity.
Contextفي عامي 2013 و 2014 ، اخترقت حلقة اختراق روسية اوكرانية عرفت باسم ريسكاتور اجهزة كمبيوتر Target كوربوراتيون في عام
2013 ، وسرقة ما يقرب من 40 مليون بطاقة ائتمان ، ثم اجهزة كمبيوتر هوم ديبوت في عام 2014 ، سرقت ما بين 53 و 56 مليون رقم
بطاقة ائتمان . تم تسليم التحذيرات في كلتا الشركتين ، ولكن تم تجاهلها يعتقد ان خروقات الامن المادية باستخدام الات الخروج الذاتي
لعبت دورا كبيرا . يقول جيم والتر ، مدير عمليات استخبارات التهديد في شركة مكافي للتقنية الامنية ان البرمجيات الخبيثة المستخدمة
هي غير مبتكرة على الاطلاق وغير مثيرة للاهتمام ، مما يعني انه كان من الممكن ايقاف البرامج المضادة للفيروسات بسهولة من قبل
برنامج مكافحة الفيروسات الحالي لو كان المسؤولون يردون على التحذيرات . وقد اسفر حجم السرقات عن اهتمام كبير من سلطات
الولايات المتحدة وسلطات الولايات المتحدة الفيدرالية وما زال التحقيق جاريا.
Answerبين 53 و56 مليون.
Reference questionكم عدد ارقام بطاقات الائتمان التي سرقت من هوم ديبوت في عام 2014؟
Our model GQكم عدد بطاقات الائتمان التي سرقت في عام 2014؟
AraT5 GQفي عام 2014 ، ما هي الارقام التي تم العثور عليها في اجهزة كمبيوتر iPod؟
OBSERVATION: Our model correctly identifies and extracts the relevant numeric fact, while AraT5 confuses the central entity, leading to an inaccurate and meaningless question.
Table 6. Sample examples of correctly generated questions. The input consists of a context and an answer. For reference, a golden (ideal) question is also provided.
Contextبالاضافة الى ذلك ، هذه الدورة التمهيدية تعطي الطلاب المزيد من المعلومات لتكملة علم الاحياء العام او التدريب العلمي . كما ان
لديها جزئين مختلفين الجزء الاول هو مقدمة للمبادئ الاساسية للمناعة والجزء الثاني هو سلسلة محاضرات موجهة سريريا . من ناحية
اخرى ، فان الدورة التدريبية المتقدمة هي دورة اخرى لاولئك الذين يرغبون في توسيع او تحديث فهمهم لعلم المناعة . ينصح للطلاب
الذين يرغبون في حضور دورة متقدمة للحصول على خلفية من مبادئ علم المناعة . تتطلب معظم المدارس من الطلاب اتخاذ اجراءات
اختيارية في اخرى لاكمال شهاداتهم . درجة الماجستير تتطلب سنتين من الدراسة بعد الحصول على درجة البكالوريوس . بالنسبة لبرنامج
الدكتوراه ، يلزم ان يستغرق عامين اضافيين من الدراسة
Answerسنتين من الدراسة
Reference questionكم يستغرق الحصول على درجة الماجستير عادة؟
Generated questionكم من الوقت يحتاج الطلاب للحصول على درجة الماجستير؟
Contextيظهر الانسان ما يخفيه من افكار داخلية عن طريق ظهور بعض الحركات الجسدية، او ما يسميها علماء النفس بالايماءات والايحاءات
الجسدية، والتي هي حركات لا ارادية تصدر من الشخص يمكن السيطرة على بعضها، والبعض الاخر لا يمكن اخفاؤه او تجنب ظهوره؛
اذ يدركه بسهولة من لديه علم بلغة الجسد لدى الانسان
Answerعن طريق ظهور بعض الحركات الجسدية
Reference questionكيف يظهر الانسان مايخفيه من افكار داخليه؟
Generated questionكيف يظهر الانسان ما يخفيه من الافكار الداخلية؟
Table 7. Sample example of an incorrectly generated question. In this case, the question is relevant to the provided answer and fits the context, but it contains grammatical errors.
Contextكرة القدم هي رياضة جماعية تُلعب بين فريقين يتكون كل منهما من أحد عشر لاعباً بكرة مُكوَّرة. يلعب كرة القدم 250 مليون لاعب في
أكثر من مائتي دولة حول العالم، فلذلك تكون الرياضة الأكثر شعبية وانتشاراً في العالم.
Answerرياضة جماعية تُلعب بين فريقين يتكون كل منهما من أحد عشر لاعباً
Reference questionما هو عدد اللاعبين في فرقة كرة القدم؟
Generated questionكرة القدم هي ماذا؟
Table 8. Comparative performance metrics of our Arabic QG model and an English QG model followed by translation on a 100-record sample from the SQuAD validation set.
Model | Language | Dataset | BLEU-4 | METEOR | ROUGE-L | F_BERT
English QG (translated to Arabic) | English | SQuAD | 2.74 | 25.48 | 3.33 | 75.28
Our model | Arabic | Translated SQuAD | 4.46 | 24.91 | 3.00 | 75.56
Table 9. Comparison of our method with other transformer-based approaches for distractors. All methods utilize the same transformer architecture, specifically T5 and its variants. Whenever possible, we show the performance of individually generated distractors, designated D1, D2, and D3.
ModelDist.BLEU-1BLEU-2BLEU-3BLEU-4Lang.Dataset
Vachev et al. [37] D 1 46.37 EnglishRACE
D 2 32.19
D 3 34.47
Chomphooyod et al. [63] 6.5 EnglishNAIST Lang-8 learner corpora
Rodriguez-Torrealba et al. [64] 14.807.063.752.16Spanish
de Fitero-Dominguez et al. [36] 28.5618.9714.2711.34SpanishRACE, Cosmos QA, SciQ
Our model D 1 20.2817.9617.1716.75ArabicOur dataset
D 2 19.8317.4316.7316.41
D 3 19.8417.7416.9616.54
Table 10. Sample examples of correctly generated distractors for the multiple-choice questions. The input consists of a context, a question, and an answer. For reference, golden (ideal) distractors are also provided.
Contextكوكب الارض هو الوحيد بين كواكب المجموعة الشمسية المعروف بوجود حياة عليه، ترتيبه الثالث في النظام الشمسي ويبعد مسافة 150 مليون
كم عن الشمس، يحتاج كوكب الارض الى 365,25 يوم للدوران حول الشمس، نظرا لانه يسير في الفضاء بسرعة 108 الاف كم في الساعة
Answerالثالث
Questionماهو ترتيب كوكب الارض في النظام الشمسي؟
Reference distractorsالثاني
الرابع
الأول
Generated distractorsالثامن
الثاني
الرابع
Contextالمياه: هي عبارة عن مادة مكونة من عنصري الهيدروجين والاكسجين، قادرة على احلال العديد من المواد الاخرى، وهي من اكثر المركبات
ضرورة ووفرة على كوكب الارض، حيث توجد في الطبيعة بحالاتها الغازية، والسائلة، والصلبة، كما انها تتميز بان لا لون لها ولا رائحة. وتعد
خاصية استخدام المياه كمادة مذيبة اساسية للكائنات الحية، حيث يعتقد ان بداية نشاة الحياة كانت في المحاليل المائية الموجودة في محيطات
العالم، فالمحاليل المائية تلعب دورا مهما في العديد من العمليات الحيوية، خاصة في الدم والعصارة الهضمية وذلك لاتمام العمليات البيولوجية.
بالرغم من ان الماء لا يظهر لونا عندما تكون كميته قليلة، الا انه يمتلك لونا داخليا خفيفا يميل للزرقة وذلك بسبب امتصاص خفيف للضوء عند
الاطوال الموجية الحمراء.
Answerالهيدروجين والاكسجين
Questionمما يتركب الماء؟
Reference distractorsالنيتروجين والاكسجين
الهيدروجين والفلوريد
الاكسجين
Generated distractorsالفوسفات والحديد
النيتروجين
انثيلين
Table 11. Sample example of incorrectly generated distractors. In this case, two of the distractors are incomplete.
Contextتتكون الثلوج داخل السحب، اي عندما تكون درجات الحرارة منخفضة، من خلال التصاق بلورات الثلج الصغيرة ببعضها البعض، ثم بعد ذلك
تتشكل الرقائق الثلجية التي تذوب في درجة اعلى من 0 درجة مئوية في الهواء الرطب، لتعود وتلتصق معا وتشكل رقائق ثلجية اكبر حجما،
وبالتالي يصبح وزنها ثقيلا بالقدر الذي يسمح بسقوطها على الارض، ويمكن تعريف الثلج بالانجليزية: Snow بانه قطرات الماء في حالتها
الصلبة والمتبلورة في الغلاف الجوي والتي تسقط على الارض.
Answerتتكون الثلوج داخل السحب، اي عندما تكون درجات الحرارة منخفضة، من خلال التصاق بلورات الثلج الصغيرة ببعضها البعض
Questionكيف تتكون الثلوج؟
Reference distractorsتتكون الثلوج خارج السحب، اي عندما تكون درجات الحرارة منخفضة، من خلال التصاق بلورات الثلج الصغيرة ببعضها البعض
تتكون الثلوج داخل السحب، اي عندما تكون درجات الحرارة مرتفعة، من خلال التصاق بلورات الثلج الصغيرة ببعضها البعض
تتكون الثلوج داخل السحب، اي عندما تكون درجات الحرارة منخفضة، من خلال تنافر بلورات الثلج الصغيرة ببعضها البعض
Generated distractorsيتكون الثلوج داخل السحب وذلك عندما تكون درجات الحرارة منخفضة، من خلال التصاق بلورات الثلج الصغيرة ببعضها البعض
تتكون الاملاح في الساحل،
تتكون الرقائق الثلجية
Table 12. Performance comparison of beam search and top-k for the standalone question generation task.
Metric | Beam Search (Mean) | Top-k (Mean) | p-Value
BLEU-1 | 24.11 | 13.82 | 0.0055
BLEU-2 | 15.14 | 5.24 | 0.0014
BLEU-3 | 9.30 | 2.96 | 0.0270
BLEU-4 | 6.90 | 2.28 | 0.0451
METEOR | 21.52 | 9.33 | 0.0013
F_BERT | 80.67 | 75.60 | <0.001
Table 13. Comparison of BLEU scores for generated distractors using beam search and top-k sampling. We report the mean for both algorithms and the corresponding p-values.
Metric | D1 Beam | D1 Top-k | D1 p-Value | D2 Beam | D2 Top-k | D2 p-Value | D3 Beam | D3 Top-k | D3 p-Value
BLEU-1 | 11.56 | 13.78 | 0.0095 | 12.25 | 14.56 | 0.0047 | 10.34 | 14.42 | <0.00001
BLEU-2 | 10.93 | 12.76 | 0.0197 | 11.58 | 13.58 | 0.0076 | 9.80 | 13.31 | <0.00001
BLEU-3 | 10.75 | 12.60 | 0.0134 | 11.53 | 13.38 | 0.0108 | 9.67 | 13.11 | <0.00001
BLEU-4 | 10.63 | 12.50 | 0.0098 | 11.44 | 13.27 | 0.0105 | 9.58 | 13.00 | <0.00001
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
