MAFQA: A Dataset for Benchmarking Multi-Hop Arabic Fatwa Question Answering

Al-Qahtani, Manal Ali; Alkhamees, Bader Fahad; Ykhlef, Mourad

doi:10.3390/data11030064

Open Access Article


by Manal Ali Al-Qahtani *, Bader Fahad Alkhamees and Mourad Ykhlef

Department of Information Systems, College of Computer and Information Sciences, King Saud University, Riyadh 12372, Saudi Arabia

* Author to whom correspondence should be addressed.

Data 2026, 11(3), 64; https://doi.org/10.3390/data11030064

Submission received: 4 February 2026 / Revised: 17 March 2026 / Accepted: 18 March 2026 / Published: 20 March 2026

(This article belongs to the Section Information Systems and Data Management)


Abstract

Developing reliable Arabic question answering (QA) systems for Islamic fatwas requires datasets that capture the linguistic complexity and multi-step reasoning inherent in jurisprudential inquiries. However, the existing Arabic religious QA datasets primarily focus on direct retrieval or classification, often failing to address the multi-hop reasoning necessary for complex fatwa questions. To bridge this gap, we introduce MAFQA, a benchmark dataset specifically designed for multi-hop Arabic fatwa question answering. MAFQA was constructed from an extensive corpus of authentic fatwa records sourced from authoritative Islamic institutions. The dataset was developed via a semi-automated pipeline that integrates expert-guided identification of complex inquiries with a structured decomposition framework. This framework employs automated reasoning-pattern classification, semantic feature extraction, and template-guided annotation of subquestions and subanswers, followed by rigorous validation to ensure contextual grounding, logical coherence, and structural consistency. To evaluate the utility of the dataset, we conduct an extensive benchmarking study using Arabic-specialized, multilingual, and instruction-tuned language models across two primary tasks: question decomposition (QD) and generative question answering (QA). Performance is assessed using a comprehensive suite of lexical, semantic, relevance, and faithfulness metrics. Experimental results demonstrate that Arabic-specialized models consistently outperform their multilingual counterparts, with AraT5-base and AraBART achieving the highest performance in terms of lexical similarity, semantic alignment, and answer faithfulness.

Keywords:

Arabic QA; Arabic NLP; Arabic QA datasets; Arabic fatwa

1. Introduction

Recent advances in natural language processing (NLP), particularly the emergence of deep neural architectures and transformer-based language models, have catalyzed significant progress in developing dialog systems capable of producing coherent, contextually appropriate, and factually grounded responses. However, contemporary question answering (QA) models still exhibit notable limitations, including content hallucination, contradictory statements, and occasional failures in intent recognition. These shortcomings are particularly consequential in high-stakes domains such as religion, where the dissemination of inaccurate information may have substantive ethical or social implications. In these settings, systems must not only provide accurate answers but also manage epistemic uncertainty responsibly.

Reliable Arabic QA systems for religious domains depend on high-quality datasets that capture the linguistic, contextual, and reasoning complexities of the subject matter. Within the Islamic domain, most existing Arabic QA datasets focus on direct retrieval or classification and rarely address the multi-step reasoning processes inherent in complex inquiries. This gap is particularly evident in fatwa-related datasets, which present unique challenges due to their reliance on nuanced interpretations, conditional rulings, and evidence synthesized from multiple authoritative sources—such as the Qur’an, Hadith, and scholarly consensus. Ensuring that guidance is accurate, well-grounded, and aligned with Islamic law (Shariah) is therefore paramount [1].

To address this gap, we present the Multi-hop Arabic Fatwa Question Answering (MAFQA) dataset, a curated benchmark specifically designed to capture the reasoning complexity of real-world fatwa inquiries. MAFQA is derived from the Arabic fatwa corpus described in Section 3.1 and incorporates annotated QA pairs, detailed supporting contexts, and structured reasoning chains. These components facilitate the evaluation and training of advanced QA models capable of context-aware, evidence-based reasoning. The construction process combines manual expert annotation with automated structuring techniques to ensure linguistic precision and logical coherence. As a publicly available and rigorously validated benchmark, MAFQA serves as a foundation for advancing Arabic NLP in religious contexts and supports the development of systems capable of delivering reliable answers to Islamic jurisprudential questions.

This paper offers three primary contributions. First, we introduce the MAFQA dataset, a carefully curated resource for multi-hop Arabic fatwa QA that incorporates expert-validated complex questions, structured multi-step reasoning chains, and synthesized answers, thereby filling a significant void in current Arabic NLP resources. Second, we conduct extensive benchmarking of state-of-the-art Arabic and multilingual large language models on both question decomposition and generative QA tasks, providing empirical insights into their performance regarding Arabic religious reasoning. Finally, we provide a comprehensive quantitative and qualitative analysis of the dataset—encompassing question-type distributions, answer-length patterns, and validation procedures—to establish its reliability for future research.

The remainder of this paper is organized as follows: Section 2 surveys prior work on Arabic fatwa QA datasets; Section 3 details the methodology used to construct the MAFQA dataset; Section 4 presents the analytical findings; Section 5 reports model performance on QD and QA tasks; and Section 6 concludes the paper.

2. Related Works

Research on Arabic question answering (QA) for religious domains, particularly fatwas, remains limited compared to general-purpose Arabic QA datasets such as ARCD [2], TyDiQA [3], and XQuAD [4]. The existing literature primarily focuses on developing Arabic datasets for the Holy Qur’an, with relatively less attention paid to sources such as Hadith, Tafsir, and fatwa. Nevertheless, several notable datasets have been developed to support retrieval, classification, and domain-specific QA tasks across various Islamic sources.

For the Holy Qur’an, the Quran Reading Comprehension Dataset (QRCD), proposed by Malhas et al. [5] as an extension of the AyaTEC dataset [6], is among the most prominent. QRCD comprises 1337 sets of questions, passages, and corresponding answers. It allows passages to be paired with multiple questions and vice versa, thereby increasing variability and complexity. While recently used to train and evaluate Arabic language models, QRCD exhibits limitations that restrict comprehensive evaluation, including its small size, limited question count, and partial Qur’anic coverage. Similarly, Alnefaie and Atwell [7] introduced the Quranic Question Answering corpus (QUQA), which contains 2189 questions with answers extracted from 2930 verses. Although it integrates multiple resources, it covers only approximately 47% of the Qur’an.

Regarding Hadith datasets, the Hadith Question Answering (HAQA) dataset, also developed by Alnefaie and Atwell [7], is a significant contribution. It includes 1598 records containing 1359 unique questions collected from authoritative sources, including Sahih al-Bukhari, Muslim, Al-Tirmidhi, and Ibn Majah. Spanning topics such as the life of the Prophet Muhammad and core Islamic principles, HAQA serves as a valuable resource for Hadith-oriented QA research.

In the context of fatwa datasets, the Hajj fatwa dataset (Hajj-FQA) [8] is a widely cited resource that encodes the distinct linguistic and jurisprudential characteristics of inquiries raised by pilgrims. It comprises annotated QA pairs sourced from official juristic references and centered on pilgrimage-related rulings. The construction of Hajj-FQA involved data extraction from trusted websites, manual annotation for question construction, and answer-span selection. The authors benchmarked several Arabic large language models (LLMs) on this dataset, demonstrating its utility for building precise and reliable religious QA systems. Another significant contribution is Fatwaset [9], the first publicly available Arabic dataset dedicated to Islamic fatwas. It contains 130,000 records collected from diverse reliable sources, including government agencies and scholars across different Arab regions. This diversity allows the corpus to capture a broad range of topics and linguistic styles. Each record includes rich metadata—such as publication date, category, and scholar name—enabling researchers to investigate how fatwa inquiries evolve over time. Furthermore, exploratory data analysis (EDA) has revealed valuable linguistic and topical patterns, enhancing its utility for computational Islamic studies. Similarly, Munshi et al. [10] introduced a large-scale Arabic fatwa dataset covering various jurisprudential topics with rich metadata. However, unlike Fatwaset, this dataset is not publicly available.

Beyond domain-specific religious QA, recent research has explored multi-hop QA in Arabic, though resources remain limited compared to English benchmarks. Notable examples include ACQAD [11], which contains over 118,000 questions categorized by comparison and multi-hop types, and MQA-KEAL [12], which focuses on multi-hop QA with knowledge editing capabilities. Additionally, Mintaka [13] provides an Arabic portion translated from English that supports complex reasoning tasks. Despite these developments, a recent survey [14] highlights that Arabic QA resources still lag behind English datasets in scale and reasoning-oriented benchmarking.

Most existing Arabic QA datasets either focus on domain-specific religious content—typically targeting single-step retrieval—or general-domain multi-hop reasoning. Consequently, there is a lack of resources designed to capture the complex, multi-step reasoning characteristic of fatwa inquiries. To address this gap, the MAFQA dataset introduces a publicly available benchmark tailored for reasoning-intensive fatwa QA. Unlike prior resources, MAFQA represents the multilayered jurisprudential reasoning underlying complex fatwa questions, providing a critical resource for advancing computational Islamic studies and Arabic NLP.

3. MAFQA Dataset Construction

The construction of the MAFQA dataset involves a series of manual and automated steps designed to guarantee linguistic complexity, thematic relevance, and jurisprudential accuracy. This semi-automated pipeline combines automated linguistic analysis with human expert annotation to ensure both scalability and reliability. The end-to-end workflow is illustrated in Figure 1, and the step-by-step procedure is formally summarized in Algorithm 1.

Algorithm 1: Semi-Automatic Construction of Multi-Hop Fatwa QA Instances

Input:
    $F = \{f_1, f_2, \dots, f_n\}$: set of complex fatwa records, where each record $f_i = (q_i, a_i)$ contains an original complex fatwa question $q_i$ and its corresponding answer $a_i$
    $R = \{r_1, r_2, \dots, r_k\}$: set of reasoning patterns
    $T = \{t_1, t_2, \dots, t_m\}$: set of question-generation templates associated with reasoning patterns
    $L$: LLM-based semantic extraction module
    $V$: rule-based validation and grounding module
    $A$: human annotators

Output:
    $D$: multi-hop QA dataset where each instance contains $\{q_i, a_i, SQ, SA, a_{gen}, p_a, p_b\}$

Initialize an empty dataset $D$
Preprocess the fatwa record set $F$
foreach fatwa record $f_i = (q_i, a_i)$ in $F$ do
    $r \leftarrow \mathrm{classify\_reasoning\_pattern}(q_i, a_i)$
    if $r \in R$ then
        Segment $a_i$ into candidate passages $P_i = \{p_1, p_2, \dots, p_k\}$
        foreach passage $p$ in $P_i$ do
            $E_p \leftarrow L(p)$
            $E_p^{*} \leftarrow V(E_p, p)$
        Build a local relation graph $G_i$ over $P_i$
        foreach connected pair $(p_i, p_j)$ in $G_i$ do
            $\mathrm{Score}(p_i, p_j) \leftarrow \mathrm{Sim}(E_{p_i}^{*}, E_{p_j}^{*})$
        $(p_a, p_b) \leftarrow \arg\max_{(p_i, p_j) \in G_i} \mathrm{Score}(p_i, p_j)$
        Retrieve validated semantic features:
            $E_{p_a}^{*} = \{\mathrm{concept}_a, \mathrm{ruling}_a, \mathrm{condition}_a, \dots\}$
            $E_{p_b}^{*} = \{\mathrm{concept}_b, \mathrm{ruling}_b, \mathrm{condition}_b, \dots\}$
        $T_r = \{t \in T \mid \mathrm{pattern}(t) = r\}$
        Present $p_a, p_b, r, T_r, E_{p_a}^{*}, E_{p_b}^{*}$ to annotators $A$
        Manually generate sub-questions:
            $sq_1 \leftarrow \mathrm{Generate}(p_a, T_r, E_{p_a}^{*}, A)$
            $sq_2 \leftarrow \mathrm{Generate}(p_b, T_r, E_{p_b}^{*}, A)$
        Manually extract sub-answers:
            $sa_1 \leftarrow \mathrm{extract\_subanswer}(p_a, sq_1, A)$
            $sa_2 \leftarrow \mathrm{extract\_subanswer}(p_b, sq_2, A)$
        Initialize $SQ = \{sq_1, sq_2\}$
        Initialize $SA = \{sa_1, sa_2\}$
        if $\mathrm{evidence}_a$ exists or $\mathrm{evidence}_b$ exists then
            $sq_3 \leftarrow \mathrm{Generate}(p_a, p_b, T_r, E_{p_a}^{*}, E_{p_b}^{*}, A)$
            $sa_3 \leftarrow \mathrm{extract\_subanswer}(p_a, p_b, sq_3, A)$
        $a_{gen} \leftarrow \mathrm{compose\_final\_answer}(SA, A)$
        if $\mathrm{schema\_is\_valid}(q_i, a_i, SQ, SA, a_{gen}, p_a, p_b)$
           and all required fields are non-empty
           and each sub-answer is grounded in its corresponding passage
           and the reasoning chain requires both $p_a$ and $p_b$
           and (if $sq_3$ exists then $sa_3$ is evidence-grounded)
        then
            Add $(q_i, a_i, SQ, SA, a_{gen}, p_a, p_b)$ to $D$
Return $D$

3.1. Data Collection and Preprocessing

In the initial phase, an extensive corpus of Arabic fatwa records was collected from authoritative Islamic sources, including official government portals, recognized scholarly websites, and public Islamic knowledge platforms such as Dar Al-Ifta (Egypt), the official portal of Scholar Ibn Baz, and FatwaPedia. These diverse sources were selected to ensure a broad representation of Arab cultures, linguistic variations, and region-specific jurisprudential topics. Each source contributes a unique vocabulary and culturally relevant issues, thereby enriching the corpus. Table 1 lists the primary websites utilized during this phase.

For each entry, we extracted the question and answer alongside available metadata—including category, title, date, and scholar or organization name—to enable traceability and domain-specific categorization. Following extraction, the records were cleaned and normalized. Key preprocessing steps included removing duplicates, eliminating formatting artifacts or fragmented tokens created by HTML, and standardizing whitespace. These procedures reduced noise and rendered the data structurally uniform for subsequent expert review and automated annotation.
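The cleaning steps named above (HTML remnants, whitespace, duplicates) can be illustrated with a minimal sketch. The field names `question` and `answer` and the helper names are assumptions made for this example; the paper does not describe its actual scripts.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # leftover HTML tags
SPACE_RE = re.compile(r"\s+")     # any run of whitespace, including non-breaking spaces


def clean_text(text: str) -> str:
    """Strip HTML remnants and standardize whitespace."""
    text = html.unescape(text)    # e.g. "&nbsp;" becomes a non-breaking space
    text = TAG_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def preprocess(records: list[dict]) -> list[dict]:
    """Clean each record and drop exact (question, answer) duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        q = clean_text(rec["question"])
        a = clean_text(rec["answer"])
        if (q, a) in seen:        # duplicate after cleaning
            continue
        seen.add((q, a))
        cleaned.append({**rec, "question": q, "answer": a})
    return cleaned
```

Deduplicating after cleaning, rather than before, lets records that differ only in markup or spacing collapse into one entry.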

3.2. Question Collection and Classification

From the preprocessed corpus, a subset of records was extracted for evaluation by annotators with expertise in Islamic studies. Guided by these experts, we identified questions requiring multi-step reasoning based on predefined complexity criteria. Only questions classified as “complex” were retained.

A complex fatwa question is defined as one necessitating multi-step reasoning, which involves decomposing the problem into subquestions, integrating evidence from multiple Islamic sources (e.g., the Qur’an, Hadith, or differing scholarly consensus), and handling conditional rulings or exceptions. Such inquiries typically present a specific scenario before posing the core query. An example of a complex fatwa question is shown below:

Fatwa Question in Arabic	بعض كبار السن يقرأ أول سورة الفاتحة وهو جالس، ثم يواصل بعد قيامه في القراءة، ما حكم هذا العمل؟
Fatwa Question in English	Some elderly individuals begin reciting the beginning of Surah Al-Fatiha while seated, then continue the rest after standing. What is the ruling on this practice?
Fatwa Answer in Arabic	إذا كان عاجزا ما يستطيع القيام الكلي وإنّما يستطيع بعض القيام هذا لا بأس به أن يقرأ أولها ، يخاف أن يفوته ، إن أخرها ، أما إذا كان يتمكن من قراءتها وهو قائم فيؤجلها حتى يقرأها وهو قائم ، أما إذا كان لا يستطيع وإن أجلها قد تفوته قد يركع الإمام قرأ بعضها في حال الجلوس ثم يكمل وهو قائم إذا كان عاجزا ، أما إن كان عن كسل وتساهل فلا يجوز ، يجب أن يبادر بالقيام ولا يحل له الجلوس ، فإذا جلس بطلت صلاته ، لكن إذا كان عاجزا لعذر شرعي يشق عليه القيام حالا هذا عذر له شرعي ، وإذا قرأ بعض الفاتحة لأنه لا يتمكن من قراءتها وهو قائم يركع الإمام قبله فلا بأس ، هذا عذر شرعي لأن الله يقول: َاتَّقُوا اللَّهَ مَا اسْتَطَعْتُم. ؛:ويقول النبي صلى الله عليه وسلم لمن عجز عن الصلاة قائماً: صل قائما ، فإن لم تستطع فقاعدا ، فإن لم تستطع فعلى جنب ، فإن لم تستطع فمستلقيا المقصود أنه يراعى.
Fatwa Answer in English	If a worshiper genuinely cannot stand fully, it is permissible for him to begin reciting the start of Al-Fatiha while seated to avoid missing it, but if he is physically able, he must wait and recite it standing. Sitting out of laziness invalidates prayer. Legitimate incapacity (e.g., illness or weakness) allows partial seated recitation—ideally with the imam bowing first—without penalty.
Source/Mufti Name	Official Website of Scholar Mahammad Ibn-Othaimin.

The above question involves conditional multi-step reasoning: (1) determining if the seated position is due to legitimate incapacity or negligence; (2) identifying evidence from the Qur’an or Hadith permitting seated prayer; and (3) establishing the consequence of negligence. The final answer integrates these sub-inquiries into a comprehensive ruling. In contrast, a simple fatwa question is one where the answer is retrieved directly from a single ruling without needing to chain inferences or resolve conflicting opinions. An illustrative example is depicted below.

Fatwa Question in Arabic	ما حكم من أفطر في نهار شهر رمضان بدون عذر؟
Fatwa Question in English	What is the ruling on someone who breaks his fast during the day in Ramadan without valid excuse?
Fatwa Answer in Arabic	من أفطر يوما من رمضان بغير عذر شرعي فقد أتى منكرًا عظيمًا، ومن تاب تاب الله عليه، فعليه التوبة إلى الله بصدق، بأن يندم على ما مضى، ويعزم ألا يعود، ويستغفر ربه كثيرًا، ويبادر بقضاء اليوم الذي أفطره.
Fatwa Answer in English	Whoever breaks his fast one day in Ramadan without a valid excuse has committed a great sin. He must seek forgiveness from Allah and hasten to make up for the day he broke his fast.
Source/Mufti Name	Official Website of Scholar Abdulaziz Ibn-Baz.

Annotation Guidelines for Complexity Labeling

MAFQA focuses exclusively on capturing complex inquiries that require multi-hop reasoning. The annotation procedure begins with an initial reading of each fatwa question, during which straightforward inquiries are discarded. For the remaining candidates, a decomposition check verifies whether the question can be divided into at least two subquestions, followed by a multi-source check to ensure that answering them requires multiple evidence sources or conditional reasoning. Only questions that satisfy both criteria are retained. Three independent experts in Islamic studies performed the annotation. To assess reliability, we calculated Cohen’s Kappa for pairwise agreement and Fleiss’ Kappa for overall consistency. The results indicate substantial agreement, with Cohen’s Kappa scores of 0.8017 (Annotators 1 vs. 2) and 0.7929 (Annotators 1 vs. 3). The overall Fleiss’ Kappa of 0.7595 confirms strong inter-annotator consistency. Final labels were assigned via majority voting; instances lacking a majority were resolved through consensus discussion.
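For readers unfamiliar with the two agreement statistics, the following sketch computes both from raw labels using their standard definitions, and adds the majority-vote rule. It is illustrative only: the function names are assumptions, and it does not reproduce the reported scores, since the annotation labels are not part of this text.

```python
from collections import Counter
from typing import Optional


def cohen_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n              # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss' kappa; ratings[i] holds the labels all raters gave to item i."""
    n_items, n_raters = len(ratings), len(ratings[0])
    totals = Counter()
    p_bar = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # per-item agreement: share of rater pairs that agree
        p_bar += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar /= n_items
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    return (p_bar - p_e) / (1 - p_e)


def majority_label(labels: list[str]) -> Optional[str]:
    """Label chosen by a strict majority, or None if discussion is needed."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None
```

Both kappa functions divide by `1 - p_e`, so they are undefined when every annotator assigns the same single label to all items.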

3.3. Question Decomposition

Complex fatwa questions often require multiple reasoning steps to derive a correct answer, as Islamic jurisprudential rulings frequently rely on conditions, exceptions, supporting evidence, or comparisons between cases. To model this reasoning process, complex questions in the proposed dataset are decomposed into subquestions representing individual reasoning steps. Each subquestion retrieves a specific piece of information required to answer the overall inquiry, and the final answer is obtained by synthesizing these intermediate responses. This decomposition enables the construction of multi-hop question answering (QA) instances, where addressing the primary question requires integrating information from multiple supporting passages.

The questions used in this step correspond to the subset of fatwas previously identified and classified by domain experts. These expert-labeled complex questions serve as the input for the decomposition process, ensuring that the generated multi-hop instances reflect authentic jurisprudential reasoning patterns. The process begins with automated reasoning-pattern identification and semantic element extraction, followed by manual subquestion formulation and answer annotation. The complete procedure is summarized in Algorithm 1.

3.3.1. Reasoning Pattern Classification

The decomposition framework operates on a set of complex fatwa records that were previously identified by domain experts,

$$F = \{(q_i, a_i)\}_{i=1}^{n},$$

where each record $f_i$ consists of an original complex fatwa question $q_i$ and its corresponding answer $a_i$. The first step in the decomposition process is identifying the reasoning structure underlying each fatwa instance. To accomplish this, a rule-based reasoning-pattern classifier is applied to the question–answer pair $(q_i, a_i)$. Based on linguistic and semantic cues, each fatwa instance is assigned to one of several reasoning patterns common in Islamic jurisprudence.

Let $R = \{r_1, r_2, \dots, r_k\}$ denote the set of reasoning patterns used in the dataset. Each reasoning pattern represents a common logical structure capturing how rulings are derived through conditions, exceptions, evidence, causal relationships, or comparisons. Table 2 lists the reasoning patterns together with their descriptions and example question structures. These reasoning patterns guide the subsequent stages of the decomposition process, particularly the selection of question-generation templates and the interpretation of the supporting passages.
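A cue-based classifier of this kind might look like the sketch below. The pattern names follow the five structures named in the text, but the Arabic cue lists and the scoring rule (count cue occurrences, take the highest) are hypothetical; the actual rules behind Table 2 are not given here.

```python
from typing import Optional

# Hypothetical cue phrases per reasoning pattern (illustrative only).
PATTERN_CUES = {
    "conditional": ["إذا", "إن كان", "بشرط"],
    "exception": ["إلا", "باستثناء"],
    "evidence": ["قال الله", "قال النبي", "لقوله تعالى"],
    "causal": ["لأن", "بسبب"],
    "comparison": ["الفرق بين", "أفضل من"],
}


def classify_reasoning_pattern(question: str, answer: str) -> Optional[str]:
    """Return the pattern with the most cue hits, or None if no cue matches."""
    text = f"{question} {answer}"
    scores = {
        pattern: sum(text.count(cue) for cue in cues)
        for pattern, cues in PATTERN_CUES.items()
    }
    best = max(scores, key=scores.get)   # ties resolve to the first pattern listed
    return best if scores[best] > 0 else None
```

Returning `None` corresponds to the case in Algorithm 1 where the test `r ∈ R` fails and the record is skipped. Plain substring matching is crude for Arabic, where short cues can occur inside longer words, so a real rule set would likely need tokenization.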

3.3.2. Passage Segmentation and Semantic Feature Extraction

After determining the reasoning pattern, the fatwa answer text is segmented into a set of candidate passages

$$P_i = \{p_1, p_2, \dots, p_k\}.$$

These passages typically correspond to different components of the reasoning process, such as rulings, conditions affecting the ruling, exceptions to a general rule, or supporting textual evidence. This segmentation step enables the framework to isolate the distinct pieces of information required to construct a multi-step reasoning chain.

For each candidate passage $p \in P_i$, semantic features are extracted using a hybrid approach. First, the LLM-based semantic extraction module $L$ generates candidate semantic features from each passage. Formally, the extraction process can be expressed as:

$$E_p = L(p),$$

where $E_p$ denotes the set of semantic features extracted from passage $p$. These features may include jurisprudential concepts, legal rulings, conditions affecting the ruling, exceptions, actions, outcomes, and references to supporting evidence such as Qur’anic verses, Hadith, or scholarly opinions. These elements represent the structured information required to formulate intermediate reasoning steps within the multi-hop reasoning chain.

Additionally, the extracted features are processed by the rule-based validation and grounding module $V$, which ensures each feature explicitly appears in the original passage. This validation step can be expressed as:

$$E_p^{*} = V(E_p, p),$$

where $E_p^{*}$ represents the validated semantic feature set grounded in the source passage $p$.

The combination of LLM-assisted extraction and rule-based validation improves the reliability of the extracted semantic features while ensuring that all extracted information remains grounded in the original fatwa text.
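The grounding check performed by $V$ can be sketched as a filter that keeps only features whose text occurs in the passage. The light normalization step (removing Arabic diacritics and tatweel, collapsing whitespace) and the dictionary representation of features are assumptions for this illustration, not details confirmed by the paper.

```python
import re

# Arabic short-vowel marks (U+064B to U+0652) and tatweel (U+0640)
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")


def normalize(text: str) -> str:
    """Light normalization so surface variation does not block matching."""
    text = _DIACRITICS.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def validate_features(features: dict[str, str], passage: str) -> dict[str, str]:
    """V(E_p, p): keep only features whose text appears in the passage."""
    norm_passage = normalize(passage)
    return {
        name: value
        for name, value in features.items()
        if value and normalize(value) in norm_passage
    }
```

A feature that the extraction module paraphrased or invented fails the substring test and is dropped, which is the behavior the text attributes to the validation module.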

To identify passages that support a multi-step chain, a local relation graph $G_i = (P_i, E)$ is constructed over the candidate passages, where each node corresponds to a passage $p \in P_i$ and edges $E$ represent semantic relationships between passages based on their extracted semantic features.

To determine the most relevant pair for the reasoning chain, a relevance score is computed for each candidate pair $(p_a, p_b)$:

$$\mathrm{Score}(p_a, p_b) = \mathrm{Sim}(E_{p_a}^{*}, E_{p_b}^{*}),$$

where $\mathrm{Sim}(\cdot)$ denotes a semantic similarity function. The final supporting passage pair is selected as:

$$(p_a, p_b) = \arg\max_{(p_i, p_j) \in G_i} \mathrm{Score}(p_i, p_j),$$

which identifies the pair of passages with the highest semantic compatibility for constructing a multi-hop reasoning chain. These two passages together contain the information required to answer the question through multiple reasoning steps.
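The pair selection can be sketched as follows. The text does not specify the similarity function, so Jaccard overlap between feature sets stands in for Sim here purely as an assumption; the `min_sim` threshold that decides whether two passages are connected is likewise hypothetical.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Stand-in for Sim: share of features common to both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_passage_pair(features: dict[str, set[str]], min_sim: float = 0.0):
    """features maps a passage id to its validated feature set E*_p.

    Returns (score, p_a, p_b) for the best connected pair, or None when
    no pair exceeds min_sim (the graph has no edges).
    """
    best = None
    for pa, pb in combinations(sorted(features), 2):
        score = jaccard(features[pa], features[pb])
        if score <= min_sim:          # not connected in G_i
            continue
        if best is None or score > best[0]:
            best = (score, pa, pb)    # first pair wins on ties
    return best
```

For example, two passages that share the concepts "prayer" and "standing" but differ in one condition score 0.5 and are preferred over a pair with no shared features.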

3.3.3. Question Template Construction

To support consistent subquestion construction, a predefined set of question templates $T$ was developed:

$$T = \{t_1, t_2, \dots, t_m\}.$$

Each template represents a structured natural language pattern designed to transform semantic elements extracted from fatwa passages into candidate question structures. These templates serve as controlled guidance mechanisms that help convert structured semantic features into natural language queries while preserving the logical structure of the underlying reasoning process.

The template construction process was guided by two primary design principles. First, the templates were aligned with the reasoning patterns defined in the dataset to ensure that each template reflects a specific jurisprudential reasoning structure. Second, the templates were designed to capture common formulations used in Arabic religious inquiries, thereby maintaining linguistic naturalness and domain relevance.

Let $R = \{r_1, r_2, \dots, r_k\}$ denote the set of reasoning patterns. For each reasoning pattern $r \in R$, a subset of templates $T_r \subseteq T$ is defined, where $T_r$ contains the templates associated with reasoning pattern $r$.

Table 3 lists examples of these templates.

3.3.4. Template-Guided Sub-Question Construction

During the multi-hop question construction process, the reasoning pattern identified for a fatwa instance determines the subset of templates used to guide sub-question construction. Let

$$T_r = \{t \in T \mid \mathrm{pattern}(t) = r\}$$

denote the subset of question templates associated with reasoning pattern $r$, where $\mathrm{pattern}(t)$ represents the reasoning pattern corresponding to template $t$.

Each template contains placeholder variables that are instantiated using semantic features extracted from the supporting passages. These features may include jurisprudential concepts, legal rulings, conditions, exceptions, or references to supporting evidence such as Qur’anic verses, Hadith, or scholarly opinions.

Let $E_p^{*} = \{e_1, e_2, \dots, e_l\}$ denote the set of validated semantic features extracted from passage $p$, as described in Section 3.3.2.

During the annotation stage, the selected supporting passages $(p_a, p_b)$, the reasoning pattern $r$, the template subset $T_r$, and the extracted semantic features $E_{p_a}^{*}$ and $E_{p_b}^{*}$ are provided to the annotators. Using these components as guidance, annotators construct sub-questions that correspond to the reasoning steps required to answer the original fatwa question.

Formally, the sub-questions are defined as

$$sq_1 = \mathrm{Generate}(p_a, T_r, E_{p_a}^{*}, A)$$

$$sq_2 = \mathrm{Generate}(p_b, T_r, E_{p_b}^{*}, A)$$

where $\mathrm{Generate}(\cdot)$ denotes the template-guided question construction process performed by annotators $A$.

In most cases, two sub-questions are constructed, forming the set

$$SQ = \{sq_1, sq_2\},$$

where each sub-question corresponds to a distinct reasoning step derived from one of the supporting passages.

When explicit supporting evidence appears in the passages (such as references to Qur’anic verses, Hadith, or scholarly statements), annotators may construct an additional sub-question $sq_3$. The resulting sub-question set is therefore defined as

$$SQ = \{sq_1, sq_2, [sq_3]\},$$

where the bracketed element indicates that the third sub-question is optional and is included only when explicit textual evidence is present in the supporting passages.
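To make the role of templates concrete, the sketch below selects the subset $T_r$ for a pattern and fills placeholders with validated features to produce candidate sub-questions that an annotator could then accept or rewrite. The English templates, the placeholder names (`concept`, `condition`, `ruling`), and the function names are all hypothetical; the real templates are those in Table 3, and in the paper the final wording is produced by human annotators.

```python
import string

# Hypothetical templates; each is tagged with its reasoning pattern.
TEMPLATES = [
    {"pattern": "conditional", "text": "What is the ruling on {concept} when {condition}?"},
    {"pattern": "conditional", "text": "Under which condition is {concept} {ruling}?"},
    {"pattern": "evidence", "text": "What evidence supports the ruling on {concept}?"},
]


def templates_for(pattern: str) -> list[str]:
    """T_r: the templates whose pattern equals r."""
    return [t["text"] for t in TEMPLATES if t["pattern"] == pattern]


def placeholders(template: str) -> set[str]:
    """Names of the {placeholder} fields used by a template."""
    return {name for _, name, _, _ in string.Formatter().parse(template) if name}


def candidate_subquestions(pattern: str, features: dict[str, str]) -> list[str]:
    """Fill each template of the pattern whose placeholders are all available."""
    available = set(features)
    return [
        t.format(**features)
        for t in templates_for(pattern)
        if placeholders(t) <= available
    ]
```

Templates that need a feature the passage does not provide are skipped rather than filled with an empty slot, so annotators only see well-formed candidates.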

3.3.5. Sub-Answer and Final Answer Construction

Once the sub-questions are formulated, annotators extract the corresponding answers directly from the supporting passages. Each sub-answer represents the information required to resolve a single reasoning step within the multi-hop reasoning chain.

The set of sub-questions

$$SQ = \{sq_1, sq_2, [sq_3]\}$$

and the corresponding sub-answers

$$SA = \{sa_1, sa_2, [sa_3]\}$$

together form the intermediate reasoning structure associated with a dataset instance. The optional elements $sq_3$ and $sa_3$ appear only when explicit supporting evidence is present in the passages, such as references to Qur’anic verses, Hadith, or authoritative scholarly statements.

Using these intermediate reasoning components, annotators construct the final composed answer $a_{gen}$ by integrating the information contained within the sub-answers. This composed answer reflects the complete reasoning chain required to address the original fatwa question.

3.4. Dataset Validation and Instance Construction

Following the subquestion construction and answer extraction stages, each candidate instance is automatically validated and structured into a final dataset record. This automated validation step ensures structural correctness, contextual grounding, and the logical consistency of the multi-hop reasoning chain prior to inclusion in the dataset.

The validation process is implemented via automated scripts that verify multiple facets of each instance. First, schema validation ensures that every record adheres to the predefined dataset structure and contains all required fields, including the original fatwa question, intermediate subquestions, corresponding sub-answers, supporting passages, and the composed final answer. Furthermore, the validation procedure confirms that the entries follow a well-formed data format and that mandatory fields are not empty.

To ensure factual grounding, contextual validation verifies that each sub-answer is explicitly supported by its associated passage. Specifically, the script confirms that the textual span corresponding to a sub-answer appears within the supporting passage from which it was derived. This step ensures the reasoning chain remains faithful to the original source text. Additional consistency checks verify that the intermediate reasoning structure is coherent and that the chain requires information from both supporting passages $(p_a, p_b)$, thereby preserving the multi-hop nature of the instance. Only instances satisfying these automated conditions are retained.
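The checks described in this section can be sketched as a single function that returns the list of problems found in a candidate instance. The field names and the assumed alignment (first sub-answer drawn from $p_a$, second from $p_b$, optional third from either) are illustrative choices; the paper does not publish its validation scripts or schema keys.

```python
# Hypothetical field names for one dataset record.
REQUIRED_FIELDS = (
    "question", "answer", "sub_questions", "sub_answers",
    "final_answer", "passage_a", "passage_b",
)


def validate_instance(inst: dict) -> list[str]:
    """Return a list of problems; an empty list means the instance is kept."""
    errors = [
        f"missing or empty field: {name}"
        for name in REQUIRED_FIELDS
        if not inst.get(name)
    ]
    if errors:
        return errors

    sqs, sas = inst["sub_questions"], inst["sub_answers"]
    if len(sqs) != len(sas) or len(sqs) not in (2, 3):
        return ["expected two or three aligned sub-question/sub-answer pairs"]

    pa, pb = inst["passage_a"], inst["passage_b"]
    if pa == pb:
        errors.append("chain does not span two distinct passages")
    # Assumed alignment: sub-answer 1 is a span of p_a, sub-answer 2 of p_b,
    # and the optional evidence sub-answer is a span of either passage.
    if sas[0] not in pa:
        errors.append("sub-answer 1 is not grounded in passage_a")
    if sas[1] not in pb:
        errors.append("sub-answer 2 is not grounded in passage_b")
    if len(sas) == 3 and sas[2] not in pa and sas[2] not in pb:
        errors.append("sub-answer 3 is not grounded in either passage")
    return errors
```

Returning messages instead of a bare boolean makes it easy to log why a candidate was rejected and send it back to annotators.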

Formally, the resulting dataset $D$ is represented as:

$$D = \{(q_{i}, a_{i}, SQ_{i}, SA_{i}, a_{gen,i}, p_{a,i}, p_{b,i})\}_{i=1}^{N}$$

where $N$ denotes the total number of dataset instances. For each instance $i$, $q_{i}$ represents the original complex fatwa question, $a_{i}$ denotes the original fatwa answer used as contextual information, $SQ_{i}$ is the set of intermediate sub-questions, $SA_{i}$ represents the corresponding sub-answers, $a_{gen,i}$ is the composed final answer, and $(p_{a,i}, p_{b,i})$ represents the pair of supporting passages. Records are stored in JSON format with UTF-8 encoding to ensure the correct representation of Arabic characters and cross-environment compatibility.
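As a small illustration of the storage choice, the sketch below serializes records to UTF-8 JSON without escaping Arabic characters. The record layout and placeholder strings are hypothetical; only the use of JSON and UTF-8 comes from the text.

```python
import json


def serialize(records):
    """Encode records as UTF-8 JSON, keeping Arabic characters unescaped."""
    # ensure_ascii=False writes the characters themselves instead of \uXXXX escapes.
    return json.dumps(records, ensure_ascii=False, indent=2).encode("utf-8")


def deserialize(payload):
    """Decode a UTF-8 JSON payload back into Python objects."""
    return json.loads(payload.decode("utf-8"))


# Placeholder record; keys and values are illustrative only.
sample = [{"question": "نص السؤال", "final_answer": "نص الجواب"}]
payload = serialize(sample)
restored = deserialize(payload)
```

Keeping the characters unescaped makes the files directly readable in any UTF-8 aware editor, while the round trip stays lossless.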

4. Dataset Analysis

We conducted a quantitative analysis of the MAFQA dataset, which comprises 1326 questions: 388 original fatwa inquiries and 938 decomposed subquestions. The distributions of question types for the original fatwa questions and their corresponding subquestions are summarized in Table 4 and Table 5, respectively. Furthermore, Figure 2 provides a visual comparison of these distributions with those observed in the Hajj-FQA [8] dataset.

As shown in Table 4, “What”-type questions constitute the majority of original fatwa inquiries, accounting for 70.10% of all instances. The second most frequent category is “Which” questions (13.40%), indicating that a noticeable portion of the questions involve selecting between alternatives or identifying the most appropriate ruling among multiple possibilities. Other interrogative forms appear considerably less frequently: “How” and “When” questions account for 5.93% and 5.67%, respectively, while “How much” (3.09%) reflects limited quantitative inquiries. “Why” questions are the least frequent at 1.80%, indicating that explicit causal or justificatory questions are relatively rare.

The dominance of “What”-type questions reflects the natural formulation of fatwa inquiries, where users typically seek explicit rulings regarding specific situations (e.g., “What is the ruling on…?”). However, the reasoning complexity in MAFQA does not stem from the interrogative form itself, but from the multi-step jurisprudential reasoning required to derive the final ruling. Many “What”-type questions still require decomposing the problem into multiple sub-questions, integrating evidence from different Islamic sources, and handling conditional rulings or exceptions. Thus, despite the structural simplicity of the interrogative form, the underlying process remains inherently multi-hop. Nevertheless, the prevalence of “What” questions introduces a degree of dataset imbalance, which may influence model evaluation by encouraging implicit specialization in this dominant pattern.

A similar pattern is observed in Table 5 for the subquestions generated during the multi-hop decomposition process. “What”-type subquestions dominate across all decomposition levels, accounting for 75.69% of the total 938 subquestions. “Which” questions represent 18.98%, suggesting that a considerable portion of the reasoning involves determining the most appropriate ruling among multiple possibilities. In contrast, other interrogative types appear only marginally, with “How,” “When,” “How much,” and “Why” each constituting less than 2% of the total.

Compared with Hajj-FQA [8], as illustrated in Figure 2, both datasets exhibit a strong dominance of “What”-type questions, though this tendency is more pronounced in Hajj-FQA (87.95%) than in MAFQA (74.06%). In contrast, MAFQA demonstrates a higher proportion of “Which” questions (17.35%) than Hajj-FQA (5.74%), reflecting the dataset’s emphasis on selecting among alternative rulings or conditions during multi-hop reasoning. Other interrogative forms occur at much lower frequencies in both datasets. “How” questions account for 2.56% in MAFQA and 3.44% in Hajj-FQA, while “How much”, “When”, and “Why” remain relatively rare. The limited presence of these interrogative forms underlines the shared closed-domain character of these datasets, where questions are primarily formulated to request specific religious rulings rather than general knowledge, in contrast to open-domain datasets where such question types are more common.
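The pooled MAFQA shares quoted in this comparison can be recovered from the two group-level distributions. The following sketch assumes the pooled figure is obtained by converting each percentage back to an integer question count and dividing by the total of 1326 questions; that procedure is an inference from the reported numbers, not something the text states.

```python
N_ORIGINAL, N_SUB = 388, 938  # counts reported in the text


def pooled_share(pct_original, pct_sub):
    """Pool two percentage shares, weighting each by its group size."""
    # Round back to whole questions, since the percentages were derived from counts.
    count = round(pct_original / 100 * N_ORIGINAL) + round(pct_sub / 100 * N_SUB)
    return 100 * count / (N_ORIGINAL + N_SUB)


what_share = pooled_share(70.10, 75.69)   # "What" in Table 4 and Table 5
which_share = pooled_share(13.40, 18.98)  # "Which" in Table 4 and Table 5
print(f"What: {what_share:.2f}%  Which: {which_share:.2f}%")
```

Under this assumption the pooled values agree with the 74.06% and 17.35% figures used in the comparison with Hajj-FQA.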

We also performed a token-length assessment for the final answers and subanswer categories using the NLTK toolkit. As shown in Table 6, answer types exhibit distinct token distributions. Final answers are the longest on average (mean = 40.31 tokens, range = 10–109), reflecting their comprehensive nature as they synthesize intermediate reasoning steps into complete explanations. In contrast, subanswers are more concise, averaging approximately 20 tokens each. Specifically, Subanswer 1 averages 19.94 tokens, Subanswer 2 averages 21.19 tokens, and Subanswer 3 is the most compact at 19.01 tokens. Overall, these results indicate that intermediate subanswers are brief and focused, providing concise reasoning steps that collectively support the detailed final answers.
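A token-length summary of this kind can be sketched as below. The default whitespace tokenizer is an assumption made to keep the example dependency-free; a tokenizer such as NLTK's `word_tokenize` could be passed in to mirror the toolkit named in the text.

```python
from statistics import mean


def length_stats(texts, tokenize=str.split):
    """Summarize token counts (mean, min, max) for a list of answer strings."""
    lengths = [len(tokenize(text)) for text in texts]
    return {"mean": round(mean(lengths), 2), "min": min(lengths), "max": max(lengths)}


# Toy inputs for illustration; real usage would pass the final answers or sub-answers.
stats = length_stats(["a b c", "d e"])
```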

5. Experimental Evaluation of the MAFQA Dataset

To assess the effectiveness and applicability of the Multi-Hop Arabic Fatwa Question Answering (MAFQA) dataset, we conducted a series of experiments in which several pretrained language models were fine-tuned and evaluated on two primary tasks: question decomposition (QD) and generative question answering (QA). The experimental setup includes models from three categories: Arabic-specialized models, multilingual encoder–decoder models, and instruction-tuned large language models (LLMs).

5.1. Dataset Preparation and Splitting

The MAFQA dataset was divided into training, validation, and test subsets to facilitate reliable model development and unbiased evaluation. To ensure reproducibility, the dataset was randomly shuffled using a fixed seed and partitioned according to an 80/10/10 split. The training set was utilized for model fine-tuning, the validation set for hyperparameter optimization and performance monitoring, and the test set exclusively for final evaluation. Prior to training, several preprocessing steps were applied to the textual data to enhance input consistency and reduce noise. Specifically, Arabic normalization was performed to standardize character variants (for example, reducing the hamza variants ؤ and ئ to their base forms), and redundant whitespace and formatting artifacts were removed. These procedures reduce sparsity in the textual representation and improve model generalization. Following normalization, the text was tokenized using the tokenizer corresponding to each selected pretrained model (e.g., AraBART). Input sequences were truncated to a maximum length of 448 tokens, while target decomposition sequences were limited to 256 tokens to ensure computational efficiency. Furthermore, padding tokens in the target sequences were masked with the special value −100 to exclude them from loss computation during training.
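The preparation steps in this subsection can be sketched as follows. Several details are assumptions made for illustration: the seed value 42, the specific base forms chosen for the hamza variants, and the integer rounding used for the split sizes are not given in the text.

```python
import random
import re

# Assumed base forms for the two hamza variants mentioned in the text.
HAMZA_MAP = str.maketrans({"ؤ": "و", "ئ": "ي"})


def normalize_arabic(text):
    """Map hamza variants to base letters and collapse redundant whitespace."""
    text = text.translate(HAMZA_MAP)
    return re.sub(r"\s+", " ", text).strip()


def split_dataset(records, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Shuffle with a fixed seed, then cut into train/validation/test subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(ratios[0] * len(shuffled))
    n_val = int(ratios[1] * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])  # the remainder goes to the test set


def mask_padding(label_ids, pad_token_id):
    """Replace padding ids with -100 so the loss function ignores them."""
    return [-100 if token == pad_token_id else token for token in label_ids]
```

Using a dedicated `random.Random(seed)` instance keeps the shuffle reproducible without touching the global random state.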

5.2. Evaluated Models

5.2.1. Multilingual Sequence-to-Sequence Models

The multilingual baselines consist of mT5-small [15] and mT5-base [16]. These models belong to the multilingual T5 (mT5) family, which was pretrained on the large-scale mC4 corpus covering 101 languages. They are designed to support multilingual text generation and serve as strong baselines for cross-lingual transfer learning.

5.2.2. Arabic Sequence-to-Sequence Models

Several Arabic-specialized models were included to evaluate the benefits of language-specific pretraining. AraBART [17] is an Arabic encoder–decoder model based on the BART-base architecture (six encoder and six decoder layers), pretrained end-to-end on extensive Arabic corpora. The AraT5 family includes multiple variants tailored for Arabic tasks: AraT5-base-msa [18], trained on a mixture of Modern Standard Arabic (MSA) and social media data; AraT5-large [19], a higher-capacity model designed to capture richer semantic representations; and Arabic-T5-small [20], an Arabic-adapted variant trained on the Arabic Billion Words corpus and the Arabic portions of the mC4 and OSCAR datasets.

5.2.3. Instruction-Tuned LLMs

The large language models evaluated include Qwen-7B [21] and Mistral-7B-Instruct [22], which are instruction-tuned, decoder-only transformer models. Qwen-7B demonstrates strong multilingual generation and reasoning capabilities, while Mistral-7B-Instruct is optimized for high performance in contextual understanding and text generation.

5.3. Experimental Setup and Hyperparameter Selection

To ensure rigorous and reproducible benchmarking of the MAFQA dataset, we adopted two complementary evaluation settings. First, encoder–decoder sequence-to-sequence (seq2seq) models were fine-tuned via supervised learning for both question decomposition (QD) and generative question answering (QA) using the dataset’s training split. Second, instruction-tuned large language models (LLMs), specifically Qwen-7B and Mistral-7B-Instruct, were evaluated in a zero-shot setting to assess their ability to address complex fatwa inquiries without task-specific parameter updates.

The seq2seq models were trained using the Hugging Face Seq2SeqTrainer framework. Given that fatwa inquiries often involve extensive context, the maximum source sequence length was set to 512 tokens for generative QA. The maximum target sequence length was defined as 192 tokens for answer generation and 256 tokens for the QD task to accommodate multi-part subquestion outputs.

Hyperparameters were determined through pilot experiments and established sequence-to-sequence fine-tuning practices. Preliminary trials explored learning rates ranging from $1 \times 10^{-5}$ to $5 \times 10^{-5}$. For most seq2seq experiments, a learning rate of $5 \times 10^{-5}$ ensured stable optimization; however, for multilingual models such as mT5-base, a lower rate of $2 \times 10^{-5}$ was utilized to maintain stability during full fine-tuning. Batch sizes of 2, 4, and 8 were tested based on model scale and GPU memory constraints. Specifically, LoRA-based experiments utilized a batch size of 4, while mT5-base employed a per-device batch size of 2 with gradient accumulation over four steps.

All supervised models were trained for up to five epochs, which proved sufficient for convergence based on validation performance. For Arabic-specific models, optimization was performed using AdamW with a weight decay coefficient of 0.01. Mixed-precision training (FP16) was employed to enhance computational efficiency. For multilingual T5-based models, we utilized the Adafactor optimizer with a linear warmup schedule covering 10% of the total training steps and gradient clipping with a maximum norm of 1.0.
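To make the interaction between batch size, gradient accumulation, and warmup concrete, the sketch below derives an approximate step budget for the mT5-base setting. It assumes a single device and a training split of 310 instances (80% of the 388 original questions); neither figure is stated in the text, and the step count is an approximation of what a training framework would report.

```python
import math
from dataclasses import dataclass


@dataclass
class FineTuneConfig:
    learning_rate: float
    per_device_batch_size: int
    grad_accum_steps: int = 1
    epochs: int = 5
    warmup_ratio: float = 0.0

    @property
    def effective_batch_size(self):
        # Gradients are accumulated over several small batches before each update.
        return self.per_device_batch_size * self.grad_accum_steps

    def schedule(self, n_train):
        """Return (total optimizer steps, warmup steps) for n_train examples."""
        steps_per_epoch = math.ceil(n_train / self.effective_batch_size)
        total = steps_per_epoch * self.epochs
        return total, math.ceil(self.warmup_ratio * total)


# Values for mT5-base as reported in the text; 310 is an assumed training-set size.
mt5_base = FineTuneConfig(learning_rate=2e-5, per_device_batch_size=2,
                          grad_accum_steps=4, warmup_ratio=0.10)
total_steps, warmup_steps = mt5_base.schedule(n_train=310)
```

With a per-device batch of 2 and four accumulation steps, each optimizer update sees 8 examples, which matches the largest batch size tested in the pilot runs.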

For larger architectures, including AraBART and AraT5-large, parameter-efficient fine-tuning (PEFT) was implemented via Low-Rank Adaptation (LoRA) [23]. Rather than updating all pretrained parameters, LoRA injects trainable low-rank matrices into selected attention projections while freezing the backbone weights. Our configuration utilized a rank $r = 8$, a scaling factor $\alpha = 16$, a dropout of 0.1, and adaptation of the q_proj and v_proj layers.
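The effect of this configuration on the number of trainable weights can be illustrated with a short calculation. The projection width of 768 is an assumption chosen for illustration and is not taken from the text; the rank and scaling factor are the reported values.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


RANK, ALPHA = 8, 16   # values reported in the text
HIDDEN = 768          # assumed projection width, for illustration only

full_params = HIDDEN * HIDDEN                        # one frozen projection matrix
lora_params = lora_trainable_params(HIDDEN, HIDDEN, RANK)
scaling = ALPHA / RANK                               # factor applied to the update B @ A
fraction = lora_params / full_params
print(f"{lora_params} of {full_params} weights trainable ({fraction:.2%}), scaling {scaling}")
```

In the Hugging Face `peft` library these settings would typically map to the `r`, `lora_alpha`, `lora_dropout`, and `target_modules` arguments of `LoraConfig`, although the text does not name the library used.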

To mitigate overfitting, validation was performed at the end of each epoch. In full fine-tuning experiments, early stopping was enabled with a patience of two evaluations. The optimal checkpoint was automatically selected based on the lowest validation loss. These measures, combined with weight decay, enhanced generalization and training stability.
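The early-stopping and checkpoint-selection rule can be sketched as a small function. The validation losses in the example are invented for illustration, and treating a non-improving epoch as any loss that is not strictly lower than the best so far is an assumed convention.

```python
def select_checkpoint(val_losses, patience=2):
    """Return (best_epoch, last_epoch_run), both 1-indexed.

    Training stops once `patience` consecutive evaluations fail to improve
    on the best validation loss seen so far.
    """
    best_loss, best_epoch, stale = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)


# Hypothetical per-epoch validation losses.
best, stopped = select_checkpoint([1.0, 0.8, 0.85, 0.9, 0.7])
```

In this example the run stops after the fourth epoch and the second-epoch checkpoint is kept, so the later improvement at epoch five is never observed; this is the usual trade-off of a short patience window.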

Finally, we evaluated instruction-tuned decoder-only LLMs (Qwen-7B and Mistral-7B-Instruct-v0.2) in a zero-shot setting. In this configuration, no parameters were updated on the MAFQA dataset; instead, models generated outputs directly from task-specific prompts. To ensure deterministic results and facilitate fair comparison, decoding was performed with sampling disabled (temperature = 0.0, top-p = 1.0) and a maximum limit of 220 generated tokens.

While some publicly available fatwa texts may overlap with the models’ pretraining data, the MAFQA dataset was constructed through a structured annotation process that transforms raw records into multi-hop reasoning instances. This transformation—comprising decomposed subquestions, supporting contexts, and synthesized answers—encourages models to perform multi-step reasoning rather than simply reproducing memorized responses.

5.4. Evaluation Metrics

5.4.1. Lexical and Semantic Similarity Metrics

The evaluation protocol encompasses generative question answering (QA) and question decomposition (QD), while accounting for the morphological complexities of Arabic. In generative QA, models must produce coherent and contextually faithful answers, necessitating metrics that capture both lexical overlap and semantic equivalence. Conversely, QD requires decomposing a complex inquiry into simpler, logically dependent subquestions; this necessitates evaluating not only surface similarity to reference decompositions but also structural fidelity and semantic adequacy. Consequently, we employed a hybrid metric suite: BLEU and ROUGE to assess $n$-gram overlap and lexical precision, and BERTScore for semantic similarity. These metrics are widely adopted in natural language generation (NLG) evaluation, ensuring that both lexical precision and semantic correctness are rigorously assessed.

Bilingual Evaluation Understudy (BLEU-$n$) is a precision-oriented metric that quantifies the number of $n$-grams in the model output that also occur in the reference text, applying a brevity penalty to prevent disproportionately short predictions. In contrast, Recall-Oriented Understudy for Gisting Evaluation (ROUGE-$n$) focuses on recall, measuring the proportion of the reference content covered by the system output. While valuable for summarization and QA, both BLEU and ROUGE are limited by their reliance on surface-level overlap. To address this, BERTScore leverages contextual embeddings from pretrained models to compute token-level similarity based on meaning rather than surface form [24]. The effectiveness of these metrics has been established in diverse Arabic NLP studies, including research on paraphrasing [25], text simplification [26], and legal QA [27].

The BLEU metric is computed as follows:

$$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_{n} \ln p_{n}\right)$$

where $N$ is the maximum $n$-gram size (typically 4), $w_{n}$ represents the weight for each $n$-gram order (uniform weights $w_{n} = 1/N$ in the standard setting), and $p_{n}$ refers to the precision value for $n$-grams of order $n$. The brevity penalty $(BP)$ penalizes predictions shorter than the reference and is calculated as:

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \leq r \end{cases}$$

where $r$ denotes the token count of the reference answer and $c$ represents the token count of the predicted answer. The $n$-gram precision $p_{n}$ is defined as:

$$p_{n} = \frac{\sum_{C \in \{Candidates\}} \sum_{n\text{-gram} \in C} Count_{clip}(n\text{-gram})}{\sum_{C' \in \{Candidates\}} \sum_{n\text{-gram}' \in C'} Count(n\text{-gram}')}$$

The numerator represents the count of $n$-grams occurring in both the prediction and the reference, clipped to the maximum frequency found in the reference to avoid inflation by repetition. ROUGE-L determines similarity via the Longest Common Subsequence (LCS), capturing the longest ordered series of tokens shared by the prediction and reference:

$$\text{ROUGE-L} = \frac{LCS(X, Y)}{m}$$

where $X$ is the predicted answer, $Y$ is the reference answer, and $m$ is the length of the reference.
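The lexical metrics defined above can be sketched in a few lines of Python. This is a simplified, sentence-level illustration with a single reference and uniform weighting over $n$-gram orders; it applies no smoothing, and it is not the evaluation code used in the study.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision: matches are capped by the reference frequency."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


def bleu(candidate, reference, max_n=4):
    """Brevity penalty times the geometric mean of the clipped precisions."""
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # the geometric mean vanishes when any precision is zero
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


def lcs_length(x, y):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, start=1):
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(x)][len(y)]


def rouge_l(candidate, reference):
    """LCS length divided by the reference length, as in the definition above."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0
```

For example, a three-token prediction that is a subsequence of a four-token reference yields `rouge_l` of 0.75, while its BLEU score is additionally reduced by the brevity penalty.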

Furthermore, BERTScore evaluates semantic similarity by aligning tokens based on cosine similarity in the embedding space. Let $x$ be the embeddings of the predicted answer and $y$ be the embeddings of the reference. The metric is defined through precision (BERT-P), recall (BERT-R), and F1 score (BERT-F):

$$\text{BERT-P} = \frac{1}{|x|} \sum_{x_{i} \in x} \max_{y_{j} \in y} x_{i}^{T} y_{j}$$

$$\text{BERT-R} = \frac{1}{|y|} \sum_{y_{j} \in y} \max_{x_{i} \in x} x_{i}^{T} y_{j}$$

$$\text{BERT-F} = \frac{2 \, (\text{BERT-P} \cdot \text{BERT-R})}{\text{BERT-P} + \text{BERT-R}}$$

where $x_{i}^{T} y_{j}$ denotes the cosine similarity between token-level embeddings.
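The greedy matching behind these three quantities can be illustrated with toy vectors. The two-dimensional embeddings below are invented for the example; in practice they would come from a pretrained contextual encoder, which this sketch does not load.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bert_score(pred_emb, ref_emb):
    """Greedy-matching precision, recall, and F1 over token embeddings."""
    # Precision: each predicted token is matched to its most similar reference token.
    precision = sum(max(cosine(x, y) for y in ref_emb) for x in pred_emb) / len(pred_emb)
    # Recall: each reference token is matched to its most similar predicted token.
    recall = sum(max(cosine(x, y) for x in pred_emb) for y in ref_emb) / len(ref_emb)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy embeddings: two predicted tokens, one reference token.
p, r, f = bert_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
```

Here the extra predicted token lowers precision to 0.5 while recall stays at 1.0, which shows how the two directions penalize over-generation and omission differently.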

5.4.2. Faithfulness and Relevance Metrics

Beyond lexical and semantic similarity, we evaluate relevance and faithfulness for the QA task. Relevance quantifies the semantic proximity between the predicted and reference answers using cosine similarity of sentence embeddings. Faithfulness, conversely, assesses whether the predicted answer is supported by the associated evidence passage, ensuring the generated response is free of unsupported or hallucinated information. These metrics provide a complementary evaluation of answer quality by assessing both semantic alignment with the reference and consistency with the supporting context.

Relevance Metric

The relevance score is computed as the average cosine similarity between the embeddings of the predicted answer and the gold reference answer. Let $A_{i}'$ be the predicted answer and $A_{i}$ be the gold reference answer for the $i$-th QA instance. The overall relevance score over $N$ instances is defined as:

$$Relevance = \frac{1}{N} \sum_{i=1}^{N} \cos(E(A_{i}'), E(A_{i}))$$

where $E(\cdot)$ denotes the sentence embedding function and $N$ represents the total number of evaluated QA instances. A higher relevance value indicates stronger semantic alignment with the reference answer.

Faithfulness Metric

Faithfulness is assessed using a multilingual natural language inference (NLI) model, where the supporting context $C$ serves as the premise and the predicted answer $A'$ as the hypothesis. The model outputs probabilities for entailment, neutrality, and contradiction. Based on these outputs, the average entailment $(Score_{Ent})$ and contradiction $(Score_{Cont})$ scores are computed as follows:

$$Score_{Ent} = \frac{1}{N} \sum_{i=1}^{N} P_{ent}(C_{i}, A_{i}')$$

$$Score_{Cont} = \frac{1}{N} \sum_{i=1}^{N} P_{cont}(C_{i}, A_{i}')$$

where $A_{i}'$, $A_{i}$, and $C_{i}$ denote the predicted answer, gold answer, and supporting context for the $i$-th instance, respectively. $P_{ent}(C_{i}, A_{i}')$ and $P_{cont}(C_{i}, A_{i}')$ represent the entailment and contradiction probabilities returned by the NLI model. A higher entailment score indicates that the generated answer is well-supported by the evidence passage, whereas a higher contradiction score suggests that the answer is unfaithful and may contain hallucinations.
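Both the relevance score and the faithfulness scores reduce to simple averages once the model outputs are available. The sketch below assumes the sentence embeddings and NLI probabilities have already been computed; the vectors, probabilities, and dictionary keys are placeholders, and no embedding or NLI model is loaded.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relevance(pred_vecs, gold_vecs):
    """Mean cosine similarity between predicted and gold answer embeddings."""
    pairs = list(zip(pred_vecs, gold_vecs))
    return sum(cosine(p, g) for p, g in pairs) / len(pairs)


def faithfulness(nli_outputs):
    """Mean entailment and contradiction probabilities over all instances.

    Each item is a dict with 'entailment', 'neutral', and 'contradiction'
    probabilities for (premise = context, hypothesis = predicted answer).
    """
    n = len(nli_outputs)
    score_ent = sum(out["entailment"] for out in nli_outputs) / n
    score_cont = sum(out["contradiction"] for out in nli_outputs) / n
    return score_ent, score_cont


# Placeholder inputs standing in for real model outputs.
rel = relevance([[1.0, 0.0], [1.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]])
ent, cont = faithfulness([
    {"entailment": 0.8, "neutral": 0.1, "contradiction": 0.1},
    {"entailment": 0.4, "neutral": 0.2, "contradiction": 0.4},
])
```

Reporting entailment and contradiction separately, rather than a single combined score, keeps neutral predictions from being mistaken for either supported or hallucinated answers.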

5.5. Results and Discussion

5.5.1. Question Decomposition (QD) Task

The results summarized in Table 7 reveal distinct performance variations among Arabic-specialized models, multilingual encoder–decoder architectures, and instruction-tuned large language models (LLMs) on the question decomposition (QD) task. Among the Arabic-specific models, Arabic-T5-small achieved the best overall lexical performance, yielding the highest Token-F1 (31.0%), BLEU-1 (26.0%), ROUGE-1 (31.0%), ROUGE-2 (16.0%), and ROUGE-L (29.0%) scores. These results suggest that Arabic-T5-small is particularly effective at maintaining lexical overlap and structural similarity with the reference decompositions. AraT5-base produced comparable results; while its lexical scores were slightly lower, it achieved the highest semantic alignment among Arabic models with a BERT-F of 91.0%. In contrast, AraT5-base-msa performed substantially worse across all lexical metrics, obtaining only 8.0% for both Token-F1 and ROUGE-1 and 80.0% for BERT-F, which indicates a limited capacity for generating accurate decomposition steps. This performance gap may be attributed to the model’s pretraining focus on general Modern Standard Arabic (MSA) corpora, which may not adequately capture the specialized religious terminology and complex conditional structures inherent in fatwa inquiries.

Furthermore, AraBART demonstrated moderate performance with a Token-F1 and ROUGE-1 of 10.0%, while maintaining relatively strong semantic similarity (BERT-F: 85.0%). This suggests that although lexical overlap is limited, the generated subquestions remain semantically aligned with the gold decompositions. Comparatively, the multilingual mT5 models performed considerably worse than their Arabic-specialized counterparts. Notably, mT5-small yielded the lowest overall scores—Token-F1 (2.0%), BLEU-1 (0.3%), and ROUGE-L (1.6%)—reflecting a limited ability to generate accurate Arabic decomposition structures. While mT5-base showed improvement (Token-F1: 11.0%, ROUGE-L: 10.0%, BERT-F: 83.0%), it still trailed most Arabic-specific models.

The instruction-tuned LLMs demonstrated competitive semantic performance. Qwen-7B achieved one of the highest semantic scores (BERT-F: 91.0%) and strong lexical metrics (Token-F1: 29.0%), performing nearly as well as Arabic-T5-small. Similarly, Mistral-7B attained high semantic similarity (BERT-F: 90.0%) and moderate lexical scores (Token-F1: 25.0%). Overall, these results indicate that Arabic-specialized models, specifically Arabic-T5-small and AraT5-base, provide superior lexical alignment for decomposition generation. Meanwhile, instruction-tuned LLMs such as Qwen-7B and Mistral-7B demonstrate robust semantic understanding. Conversely, the significantly weaker performance of multilingual mT5 models underscores the necessity of Arabic-specific pretraining and domain adaptation for multi-hop question decomposition.

5.5.2. Question-Answering (QA) Task

The results reported in Table 8 and Table 9, alongside the training and validation dynamics illustrated in Figure 3 and Figure 4, reveal distinct performance variations among Arabic-specialized models, multilingual encoder–decoder architectures, and instruction-tuned large language models (LLMs) on the question answering (QA) task. Among the Arabic-specific models, AraT5-base exhibited the best lexical performance, yielding the highest Token-F1 (25.0%), BLEU-1 (14.0%), BLEU-2 (9.0%), ROUGE-2 (13.0%), and ROUGE-L (21.0%) scores, while maintaining robust semantic alignment with a BERT-F of 89.0%. These findings indicate that AraT5-base effectively preserves lexical overlap and generates answers that closely mirror the reference responses. Arabic-T5-small produced comparable semantic results (BERT-P: 92.0%, BERT-R: 86.0%, and BERT-F: 89.0%), although its lexical scores were marginally lower (Token-F1: 22.0% and BLEU-1: 10.0%). AraBART also demonstrated competitive performance, attaining a Token-F1 of 21.0% and a BLEU-1 of 13.0% while achieving strong semantic similarity (BERT-F: 88.0%), confirming its capacity to generate contextually coherent answers.

In contrast, AraT5-base-msa performed substantially worse across all lexical metrics, achieving only 4.0% for Token-F1 and 2.0% for both BLEU-1 and ROUGE-L, with a lower BERT-F of 77.0%. This underscores its limited effectiveness in generating accurate responses for complex fatwa inquiries. This performance gap may be attributed to the model’s pretraining focus on general Modern Standard Arabic (MSA) corpora, which may not adequately capture the specialized religious terminology and jurisprudential expressions inherent in fatwa texts. Moreover, the MAFQA dataset requires multi-hop reasoning and conditional rule interpretation, involving complex logical structures such as exceptions and evidence-based reasoning derived from primary religious sources. Models lacking exposure to such jurisprudential and scholarly reasoning structures likely struggle to produce semantically aligned answers or accurately interpret the reasoning chains required for this domain.

The relevance and faithfulness evaluations further reinforce these observations. AraBART achieved the highest relevance score (75.0%) and the strongest faithfulness in terms of contextual entailment (65.0%), while maintaining a low contradiction rate (6.0%). AraT5-base also demonstrated strong reliability, with 70.0% relevance, 57.0% contextual entailment, and a low contradiction rate (7.0%). Similarly, Arabic-T5-small showed solid performance (Relevance: 68.0%, Entailment: 55.0%), although its contradiction rate was considerably higher (26.0%). These results suggest that Arabic-specialized models not only produce answers with strong lexical and semantic similarity, but also generally maintain factual consistency with the supporting context.

Conversely, the multilingual mT5 models performed considerably worse in both settings. mT5-small yielded negligible lexical scores (Token-F1: 1.0%, BLEU-1: 0.04%) and weak relevance (19.0%) and faithfulness, highlighting its limited utility for Arabic QA. mT5-base showed modest improvements (Token-F1: 4.0%, BERT-F: 83.0%), but its relevance (32.0%) and faithfulness remained low compared to Arabic-specialized models. Finally, the instruction-tuned LLMs demonstrated more competitive performance. Qwen-7B achieved strong semantic alignment (BERT-F: 87.0%), while Mistral-7B produced comparable results (BERT-F: 88.0%, Token-F1: 21.0%). Regarding reliability, Mistral-7B achieved a relevance of 70.0% and a contextual entailment of 56.0%, confirming its ability to generate semantically coherent answers. Overall, these findings indicate that Arabic-specialized models, specifically AraT5-base and AraBART, provide the most reliable performance across all metrics, whereas multilingual models fail to capture the domain-specific nuances required for accurate fatwa question answering.

The training and validation dynamics illustrated in Figure 3 and Figure 4, respectively, demonstrate the convergence behavior of the evaluated models across epochs for the generative QA task. As shown in Figure 3, AraT5-base achieves the lowest final average training loss, followed closely by Arabic-T5-small, with both models demonstrating rapid convergence during the initial training epochs. AraBART also exhibits stable optimization, although it maintains slightly higher loss values throughout the training process. In contrast, the multilingual models, mT5-small and mT5-base, present substantially higher average training losses and slower convergence patterns; notably, mT5-base reaches the highest final loss among all evaluated models.

A similar pattern is observed in the validation dynamics in Figure 4, where AraT5-base maintains the lowest validation loss across epochs, followed by Arabic-T5-small and AraBART, suggesting superior generalization performance. Conversely, the multilingual mT5-small and mT5-base models consistently yield higher validation losses, reflecting weaker adaptation to the domain-specific characteristics of Arabic fatwa question answering. Overall, these results underscore the advantages of Arabic-specialized sequence-to-sequence models over multilingual alternatives for the MAFQA QA task.

6. Conclusions

This paper introduced MAFQA, a benchmark dataset designed to advance research on multi-hop Arabic fatwa question answering. Unlike the existing Arabic religious QA datasets, which primarily focus on single-step retrieval or classification, MAFQA explicitly captures the reasoning complexity inherent in fatwa inquiries. The dataset was constructed via a semi-automated pipeline that integrates expert-guided identification of complex inquiries with automated reasoning-pattern analysis and structured decomposition into subquestions and subanswers. Rigorous validation procedures were subsequently employed to ensure contextual grounding and logical consistency across the dataset.

To evaluate the utility of MAFQA, we benchmarked a suite of Arabic-specific, multilingual, and instruction-tuned language models on two primary tasks: question decomposition and generative question answering. Experimental results demonstrate that Arabic-specialized models consistently outperform multilingual counterparts across lexical, semantic, relevance, and faithfulness metrics. Notably, AraT5-base and AraBART achieved the most reliable performance, exhibiting strong lexical alignment with reference outputs while maintaining high semantic consistency with the supporting evidence. These findings underscore the necessity of language-specific and domain-adapted pretraining for developing robust QA systems in specialized Arabic domains. Overall, MAFQA provides a significant resource for advancing Arabic NLP in religious contexts, enabling systematic investigation into multi-hop reasoning, answer faithfulness, and evidence-grounded generation.

Future work could expand this research by incorporating additional Islamic domains, such as Hadith and Tafsir, and investigating advanced reasoning techniques using larger instruction-tuned language models. Additionally, future studies should analyze model performance across individual question categories to better assess robustness and generalization across diverse fatwa inquiries. Given the critical role of lightweight models in practical deployment, exploring efficient adaptation and compression strategies, including recent advances in LLM quantization, remains a priority. Specifically, approaches such as LoTA-QAF (Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning) [28] offer promising directions for maintaining performance during low-bit deployment. Furthermore, incorporating improved evaluation frameworks, such as MDEval [29] for enhancing markdown awareness, may support more robust and structured assessment of generated outputs in complex reasoning tasks.

Author Contributions

Conceptualization, M.A.A.-Q. and B.F.A.; methodology, M.A.A.-Q. and M.Y.; software, M.A.A.-Q.; validation, M.A.A.-Q., B.F.A. and M.Y.; formal analysis, M.A.A.-Q.; investigation, M.Y.; resources, B.F.A.; data curation, M.A.A.-Q.; writing—original draft preparation, M.A.A.-Q.; writing—review and editing, B.F.A. and M.Y.; visualization, M.A.A.-Q.; supervision, B.F.A. and M.Y.; project administration, M.A.A.-Q., B.F.A. and M.Y.; funding acquisition, M.A.A.-Q., B.F.A. and M.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by the authors.

Data Availability Statement

The MAFQA dataset presented in this study is publicly available in the Zenodo repository at https://zenodo.org/records/18965740 (accessed on 17 March 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Al-Yahya, M. Towards automated fiqh school authorship attribution. In Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing, Hanoi, Vietnam, 18–24 March 2018.
Mozannar, H.; Maamouri, M.; El-Haj, M.; Habash, N. Neural Arabic question answering. In Proceedings of the 4th Arabic Natural Language Processing Workshop, Florence, Italy, 1 August 2019.
Adelani, D.I.; Abbott, J.; Neubig, G.; Derczynski, L.; Rijhwani, S.; Ruder, S.; Sachan, M.; Setiawan, H.; Tejani, A. MasakhaNER: Named entity recognition for African languages. Trans. Assoc. Comput. Linguist. 2021, 9, 1116–1131.
Artetxe, M.; Ruder, S.; Yogatama, D. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 4623–4637.
Malhas, R.; Mansour, W.; Elsayed, T. Qur’an QA 2022: Overview of the first shared task on question answering over the Holy Qur’an. In Proceedings of the Qur’an QA Workshop, Gyeongju, Republic of Korea, 17 October 2022.
Malhas, R.; Elsayed, T. AyaTEC: Building a reusable verse-based test collection for Arabic question answering on the Holy Qur’an. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2020, 19, 1–21.
Alnefaie, S.; Atwell, E.; Alsalka, M.A. HAQA and QUQA: Constructing two Arabic question-answering corpora for the Quran and Hadith. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, Varna, Bulgaria, 4–6 September 2023; pp. 90–97.
Aleid, H.A.; Azmi, A.M. Hajj-FQA: A benchmark Arabic dataset for developing question-answering systems on Hajj fatwas. J. King Saud Univ. Comput. Inf. Sci. 2025, 37, 135.
Alyemny, O.; Al-Khalifa, H.; Mirza, A. A data-driven exploration of a new Islamic fatwas dataset for Arabic NLP tasks. Data 2023, 8, 155.
Munshi, A.A.; Al-Khalifa, H.; Alharbi, M.; Mirza, A. Towards an automated Islamic fatwa system: Survey, dataset and benchmarks. Int. J. Comput. Sci. Mobile Comput. 2021, 10, 118–131.
Sidhoum, A.H.; Mataoui, M.H.; Sebbak, F.; Smaïli, K. ACQAD: A dataset for Arabic complex question answering. In Proceedings of the International Conference on Cyber Security, Artificial Intelligence and Theoretical Computer Science, Boumerdes, Algeria, 27–28 November 2022.
Ali, M.A.; Daftardar, N.; Waheed, M.; Qin, J.; Wang, D. MQA-KEAL: Multi-hop question answering under knowledge editing for Arabic language. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 5629–5644.
Sen, P.; Aji, A.F.; Saffari, A. Mintaka: A complex, natural, and multilingual dataset for end-to-end question answering. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 1604–1619.
Saoudi, Y.; Gammoudi, M.M. A comprehensive review of Arabic question answering datasets. In Proceedings of the International Conference on Neural Information Processing; Springer Nature: Singapore, 2023; pp. 278–289.
Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. arXiv 2020, arXiv:2010.11934. Available online: https://arxiv.org/abs/2010.11934 (accessed on 19 August 2025).
Hugging Face. mT5-Base. Available online: https://huggingface.co/google/mt5-base (accessed on 19 August 2025).
Hugging Face. AraBART. Available online: https://huggingface.co/moussaKam/AraBART (accessed on 19 August 2025).
Hugging Face. AraT5-MSA-Base. Available online: https://huggingface.co/UBC-NLP/AraT5-msa-base (accessed on 19 August 2025).
Nagoudi, E.M.B.; Elmadany, A.; Abdul-Mageed, M. AraT5: Text-to-Text Transformers for Arabic Language Generation. arXiv 2021, arXiv:2109.12068. Available online: https://arxiv.org/abs/2109.12068 (accessed on 19 August 2025).
Hugging Face. Arabic-T5-Small. Available online: https://huggingface.co/flax-community/arabic-t5-small (accessed on 19 August 2025).
Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen Technical Report. arXiv 2023, arXiv:2309.16609.
Mistral AI. Mistral-7B-Instruct-v0.2. Hugging Face Model Card. 2023. Available online: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2 (accessed on 7 March 2026).
Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations, Virtual Event, 25–29 April 2022.
Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. arXiv 2019, arXiv:1904.09675. Available online: https://arxiv.org/abs/1904.09675 (accessed on 19 August 2025).
Al-Shameri, N.; Al-Khalifa, H. Arabic paraphrased parallel synthetic dataset. Data Brief 2024, 57, 111004.
Khallaf, N.; Sharoff, S. Towards Arabic Sentence Simplification via Classification and Generative Approaches. arXiv 2022, arXiv:2204.09292. Available online: https://arxiv.org/abs/2204.09292 (accessed on 19 August 2025).
Kmainasi, M.B.; Shahroor, A.E.; Al-Ghraibah, A. Can Large Language Models Predict the Outcome of Judicial Decisions? arXiv 2025, arXiv:2501.09768. Available online: https://arxiv.org/abs/2501.09768 (accessed on 19 August 2025).
Chen, J.; Li, J.; Peng, Z.; Wang, W.; Ren, Y.; Shi, L.; Hu, X. LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning. arXiv 2025, arXiv:2505.18724.
Chen, Z.; Liu, Y.; Shi, L.; Wang, Z.J.; Chen, X.; Zhao, Y.; Ren, F. MDEval: Evaluating and enhancing markdown awareness in large language models. In Proceedings of the ACM Web Conference 2025; ACM: New York, NY, USA, 2025; pp. 2981–2991.

Figure 1. Workflow Pipeline for Constructing the MAFQA Multi-Hop Arabic Fatwa QA Dataset.

Figure 2. Comparison of Question Type Distributions between the MAFQA and Hajj-FQA [8] datasets.

Figure 3. Training dynamics for QA task.

Figure 4. Validation dynamics for QA task.

Table 1. List of websites used for collecting fatwas.

Website Name	URL Links
Dar Al-Ifta in Saudi Arabia	https://www.alifta.gov.sa/ (accessed on 20 August 2025)
Dar Al-Ifta in Jordan	https://aliftaa.jo (accessed on 20 August 2025)
Dar Al-Ifta in Egypt	https://www.dar-alifta.org/ (accessed on 20 August 2025)
Scholar Abdul Aziz Ibn Baz	https://binbaz.org.sa/ (accessed on 20 August 2025)
Scholar Mohammad Ibn Othaimin	https://binothaimeen.net/site (accessed on 20 August 2025)
Scholar Saleh Al-Fawzan	https://www.alfawzan.af.org.sa/ (accessed on 20 August 2025)
Fatwa Pedia	https://fatawapedia.com/ (accessed on 20 August 2025)

Table 2. Types of reasoning patterns used in the MAFQA dataset.

Type of Reasoning	Description	Example Question
Rule–Condition	A legal ruling depends on a specific condition that must be satisfied.	What is the ruling on fasting if the patient suffers from a chronic illness?
Rule–Exception	A specific case is exempted from a general ruling based on defined criteria.	Is fasting obligatory for all patients, or are there exceptions for some cases?
Evidence–Ruling	A legal ruling is supported by textual evidence such as a Qur’anic verse, Hadith, or scholarly opinion.	What is the legal evidence for paying expiation for a sick person unable to fast?
Cause–Consequence	A ruling focuses on the outcomes or penalties that follow from an action or state.	What are the consequences when a patient who is unable to fast breaks their fast?
Comparison	Two alternative viewpoints or actions are compared to determine preference.	Which is preferable: performing Umrah or paying off a debt?

Table 3. Example question templates for different reasoning patterns.

Reasoning Pattern	Example Question Templates
Rule–Condition	What is the ruling of {action} if {condition} occurs?
	Is it permissible to perform {action} under {circumstance}?
	What is the Islamic ruling on {action} when {condition} exists?
Rule–Exception	What is the general ruling of {action}, and are there any {exception}?
	In which situations is {action} exempted due to {exception}?
	Is {action} permissible in all cases, or are there {exception}?
Evidence–Ruling	What is the Islamic evidence supporting the ruling of {concept}?
	Is there evidence from the Qur’an or Sunnah regarding {action}?
	Which prophetic hadith supports the ruling of {action}?
	Which Qur’anic verse is used as evidence for the ruling of {concept}?
Cause–Consequence	What is the effect of {cause} on the ruling of {action}?
	What is the legal consequence resulting from {cause}?
	Does {cause} lead to a change in the ruling of {action}?
Comparison	What are the differences between {concept1} and {concept2}?
	What is the difference between {action1} and {action2} in Islamic ruling?
	Does the ruling of {action1} differ from the ruling of {action2}?
	Which is preferable: {action1} or {action2}?
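
The templates in Table 3 use named slots such as {action} and {condition}. As a minimal sketch of how such a template might be instantiated from extracted semantic features, the Python snippet below fills two of the templates and rejects a fill when a slot has no matching feature. The function names, the dictionary layout, and the feature values are illustrative assumptions, not the pipeline used to build MAFQA.

```python
from string import Formatter

# Two templates copied from Table 3; the dictionary keys are hypothetical labels.
TEMPLATES = {
    "Rule-Condition": "What is the ruling of {action} if {condition} occurs?",
    "Comparison": "What are the differences between {concept1} and {concept2}?",
}


def required_slots(template):
    """Return the placeholder names that appear in a template, in order."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]


def fill_template(pattern, features):
    """Fill the template for `pattern`; raise KeyError if a slot has no feature."""
    template = TEMPLATES[pattern]
    missing = [slot for slot in required_slots(template) if slot not in features]
    if missing:
        raise KeyError(f"missing features for slots: {missing}")
    return template.format(**features)


# Illustrative feature values (assumed, not taken from the dataset).
question = fill_template(
    "Rule-Condition",
    {"action": "fasting", "condition": "a chronic illness"},
)
print(question)
```

Checking the slots before formatting keeps a partially extracted feature set from silently producing a malformed subquestion.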

Table 4. Statistical distribution of question types for the original fatwa questions.

Question Type	Counts	Percentage %
What	272	70.10%
Which	52	13.40%
How	23	5.93%
How much	12	3.09%
When	22	5.67%
Why	7	1.80%
Total	388	100%

Table 5. Statistical distribution of question types for subquestions.

Question Type	Subquestion 1	Subquestion 2	Subquestion 3	Total	Percentage %
What	275	284	151	710	75.69%
Which	91	76	11	178	18.98%
How	5	6	0	11	1.17%
How much	12	4	0	16	1.71%
When	3	8	0	11	1.17%
Why	2	10	0	12	1.28%
Total	388	388	162	938	100.00%
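
The totals and percentages in Table 5 follow directly from the per-position counts. The short Python sketch below recomputes them as a consistency check; the counts are copied from the table, while the variable and function names are illustrative.

```python
# Subquestion counts per question type, copied from Table 5
# (order: Subquestion 1, Subquestion 2, Subquestion 3).
COUNTS = {
    "What": (275, 284, 151),
    "Which": (91, 76, 11),
    "How": (5, 6, 0),
    "How much": (12, 4, 0),
    "When": (3, 8, 0),
    "Why": (2, 10, 0),
}


def distribution(counts):
    """Return ({type: (total, percentage)}, grand_total) for the given counts."""
    totals = {qtype: sum(row) for qtype, row in counts.items()}
    grand_total = sum(totals.values())
    shares = {
        qtype: (total, round(100 * total / grand_total, 2))
        for qtype, total in totals.items()
    }
    return shares, grand_total


shares, grand_total = distribution(COUNTS)
for qtype, (total, pct) in shares.items():
    print(f"{qtype}\t{total}\t{pct:.2f}%")
print(f"Total\t{grand_total}")
```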

Table 6. Token count statistics for final and subanswers in the MAFQA dataset.

Answer Type	Min Tokens	Max Tokens	Avg Tokens
Final Answer	10	109	40.31
Subanswer 1	3	55	19.94
Subanswer 2	4	54	21.19
Subanswer 3	8	36	19.01

Table 7. Evaluation results of Arabic and multilingual models on the question decomposition (QD) task based on lexical and semantic similarity metrics.

Model	F1	BLEU-1	ROUGE-1	ROUGE-2	ROUGE-L	BERT-P	BERT-R	BERT-F
Arabic-T5-small	31.0	26.0	31.0	16.0	29.0	91.0	89.0	90.0
AraT5-base	29.0	21.0	29.0	13.0	25.0	92.0	90.0	91.0
AraT5-base-msa	8.0	5.0	8.0	3.0	8.0	78.0	81.0	80.0
AraBART	10.0	6.0	10.0	4.0	7.0	82.0	87.0	85.0
mT5-small	2.0	0.3	1.7	0.7	1.6	79.0	76.0	78.0
mT5-base	11.0	1.0	11.0	1.0	10.0	83.0	82.0	83.0
Qwen-7B	29.0	24.0	29.0	14.0	25.0	90.0	91.0	91.0
Mistral-7B	25.0	21.0	25.0	10.0	21.0	89.0	91.0	90.0

Table 8. Evaluation results of Arabic and multilingual models on the question-answering (QA) task based on lexical and semantic similarity metrics.

Model	Token-F1	BLEU-1	BLEU-2	ROUGE-2	ROUGE-L	BERT-P	BERT-R	BERT-F
Arabic-T5-small	22.0	10.0	7.0	12.0	20.0	92.0	86.0	89.0
AraT5-base	25.0	14.0	9.0	13.0	21.0	91.0	87.0	89.0
AraT5-base-msa	4.0	2.0	0.4	0.05	2.0	74.0	79.0	77.0
AraBART	21.0	13.0	8.0	8.0	13.0	85.0	91.0	88.0
mT5-small	1.0	0.04	0.03	0.4	1.0	79.0	77.0	78.0
mT5-base	4.0	1.0	0.6	0.7	5.0	83.0	84.0	83.0
Qwen-7B	20.0	11.0	8.0	12.0	18.0	86.0	87.0	87.0
Mistral-7B	21.0	11.0	7.0	9.0	19.0	90.0	86.0	88.0
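
For readers unfamiliar with the Token-F1 column, the Python sketch below shows one common way to compute a token-level F1 score between a generated answer and a reference answer. It assumes plain whitespace tokenization and no Arabic-specific normalization; the evaluation behind Table 8 may tokenize or normalize text differently, so this is an illustration of the metric rather than a reproduction of the reported scores.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between two strings, using whitespace tokens (assumption)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Toy example with placeholder tokens: two of three tokens are shared.
score = token_f1("a b c", "a b d")
print(round(score, 4))
```

A corpus-level score would typically be the mean of this value over all test instances, scaled to the 0–100 range used in the table.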

Table 9. Evaluation results of Arabic and multilingual models on the question-answering (QA) task based on relevance and faithfulness metrics.

Model	Relevance	Faith_entail_mean	Faith_contra_mean
Arabic-T5-small	68.0	55.0	26.0
AraT5-base	70.0	57.0	7.0
AraT5-base-msa	18.0	28.0	21.0
AraBART	75.0	65.0	6.0
mT5-small	19.0	30.0	24.0
mT5-base	32.0	34.0	19.0
Qwen-7B	59.0	62.0	13.0
Mistral-7B	70.0	56.0	14.0

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Al-Qahtani, M.A.; Alkhamees, B.F.; Ykhlef, M. MAFQA: A Dataset for Benchmarking Multi-Hop Arabic Fatwa Question Answering. Data 2026, 11, 64. https://doi.org/10.3390/data11030064

