MAFQA: A Dataset for Benchmarking Multi-Hop Arabic Fatwa Question Answering
Abstract
1. Introduction
2. Related Works
3. MAFQA Dataset Construction
| Algorithm 1: Semi-Automatic Construction of Multi-Hop Fatwa QA Instances |
| Input: set of complex fatwa records, where each record contains an original complex fatwa question and its corresponding answer |
| set of reasoning patterns |
| set of question-generation templates associated with reasoning patterns |
| LLM-based semantic extraction module |
| rule-based validation and grounding module |
| human annotators |
| Output: multi-hop QA dataset where each instance contains |
| Initialize an empty dataset |
| Preprocess the fatwa record set |
| foreach fatwa record in the record set do |
| if then |
| Segment into candidate passages |
| foreach passage p in the candidate passages do |
| Build a local relation graph over |
| foreach connected pair in the graph do |
| Retrieve validated semantic features: |
| present to annotators |
| Manually generate sub-questions |
| Manually extract sub-answers |
| Initialize |
| Initialize |
| if then |
| if schema_is_valid then |
| Add the instance to the dataset |
| Return the dataset |
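
Algorithm 1 names a `schema_is_valid` check, but the listing does not show what it verifies. The following is a minimal sketch under stated assumptions, not the authors' implementation: the field names (`question`, `sub_questions`, `sub_answers`, `final_answer`) are hypothetical, and the limit of two to three sub-questions is inferred from the sub-question statistics reported for the dataset.

```python
from typing import Any

# Hypothetical field names; the actual MAFQA schema is not shown in the listing.
REQUIRED_FIELDS = ("question", "sub_questions", "sub_answers", "final_answer")


def schema_is_valid(instance: dict[str, Any]) -> bool:
    """Return True if a candidate multi-hop instance is structurally complete."""
    if any(field not in instance for field in REQUIRED_FIELDS):
        return False
    sub_qs, sub_as = instance["sub_questions"], instance["sub_answers"]
    if not isinstance(sub_qs, list) or not isinstance(sub_as, list):
        return False
    # Assumption: each instance has 2 or 3 hops, one sub-answer per sub-question.
    if not 2 <= len(sub_qs) <= 3 or len(sub_qs) != len(sub_as):
        return False
    # Every text field must be a non-empty string.
    texts = [instance["question"], instance["final_answer"], *sub_qs, *sub_as]
    return all(isinstance(t, str) and bool(t.strip()) for t in texts)
```

An instance that fails this check would simply not be added to the dataset, which matches the final conditional step of the algorithm.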
3.1. Data Collection and Preprocessing
3.2. Questions Collections and Classification
| Fatwa Question in Arabic | بعض كبار السن يقرأ أول سورة الفاتحة وهو جالس، ثم يواصل بعد قيامه في القراءة، ما حكم هذا العمل؟ |
| Fatwa Question in English | Some elderly individuals recite the beginning of Surah Al-Fatiha while seated, then continue the recitation after standing up. What is the ruling on this practice? |
| Fatwa Answer in Arabic | إذا كان عاجزا ما يستطيع القيام الكلي وإنّما يستطيع بعض القيام هذا لا بأس به أن يقرأ أولها ، يخاف أن يفوته ، إن أخرها ، أما إذا كان يتمكن من قراءتها وهو قائم فيؤجلها حتى يقرأها وهو قائم ، أما إذا كان لا يستطيع وإن أجلها قد تفوته قد يركع الإمام قرأ بعضها في حال الجلوس ثم يكمل وهو قائم إذا كان عاجزا ، أما إن كان عن كسل وتساهل فلا يجوز ، يجب أن يبادر بالقيام ولا يحل له الجلوس ، فإذا جلس بطلت صلاته ، لكن إذا كان عاجزا لعذر شرعي يشق عليه القيام حالا هذا عذر له شرعي ، وإذا قرأ بعض الفاتحة لأنه لا يتمكن من قراءتها وهو قائم يركع الإمام قبله فلا بأس ، هذا عذر شرعي لأن الله يقول: َاتَّقُوا اللَّهَ مَا اسْتَطَعْتُم. ؛:ويقول النبي صلى الله عليه وسلم لمن عجز عن الصلاة قائماً: صل قائما ، فإن لم تستطع فقاعدا ، فإن لم تستطع فعلى جنب ، فإن لم تستطع فمستلقيا المقصود أنه يراعى. |
| Fatwa Answer in English | If a worshiper genuinely cannot stand for the full recitation, it is permissible for him to begin reciting Al-Fatiha while seated so that he does not miss it; if he is physically able, he must wait and recite it standing. Sitting out of laziness or negligence is not permitted and invalidates the prayer. Legitimate incapacity (e.g., illness or weakness) permits partial seated recitation without penalty, for example when the imam would otherwise bow before he could finish it standing. |
| Source/Mufti Name | Official Website of Scholar Mohammad Ibn-Othaimin. |
| Fatwa Question in Arabic | ما حكم من أفطر في نهار شهر رمضان بدون عذر؟ |
| Fatwa Question in English | What is the ruling on someone who breaks his fast during the day in Ramadan without a valid excuse? |
| Fatwa Answer in Arabic | من أفطر يوما من رمضان بغير عذر شرعي فقد أتى منكرًا عظيمًا، ومن تاب تاب الله عليه، فعليه التوبة إلى الله بصدق، بأن يندم على ما مضى، ويعزم ألا يعود، ويستغفر ربه كثيرًا، ويبادر بقضاء اليوم الذي أفطره. |
| Fatwa Answer in English | Whoever breaks his fast one day in Ramadan without a valid excuse has committed a great sin. He must seek forgiveness from Allah and hasten to make up for the day he broke his fast. |
| Source/Mufti Name | Official Website of Scholar Abdulaziz Ibn-Baz. |
Annotation Guidelines for Complexity Labeling
3.3. Questions Decomposition
3.3.1. Reasoning Pattern Classification
3.3.2. Passage Segmentation and Semantic Feature Extraction
3.3.3. Question Template Construction
3.3.4. Template-Guided Sub-Question Construction
3.3.5. Sub-Answer and Final Answer Construction
3.4. Dataset Validation and Instance Construction
4. Dataset Analysis
5. Experimental Evaluation of the MAFQA Dataset
5.1. Dataset Preparation and Splitting
5.2. Evaluated Models
5.2.1. Multilingual Sequence-to-Sequence Models
5.2.2. Arabic Sequence-to-Sequence Models
5.2.3. Instruction-Tuned LLMs
5.3. Experimental Setup and Hyperparameter Selection
5.4. Evaluation Metrics
5.4.1. Lexical and Semantic Similarity Metrics
5.4.2. Faithfulness and Relevance Metrics
- Relevance Metric
- Faithfulness Metric
5.5. Results and Discussion
5.5.1. Question Decomposition (QD) Task
5.5.2. Question-Answering (QA) Task
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest




| Website Name | URL Links |
|---|---|
| Dar Al-Ifta in Saudi Arabia | https://www.alifta.gov.sa/ (accessed on 20 August 2025) |
| Dar Al-Ifta in Jordan | https://aliftaa.jo (accessed on 20 August 2025) |
| Dar Al-Ifta in Egypt | https://www.dar-alifta.org/ (accessed on 20 August 2025) |
| Scholar Abdul Aziz Ibn Baz | https://binbaz.org.sa/ (accessed on 20 August 2025) |
| Scholar Mohammad Ibn Othaimin | https://binothaimeen.net/site (accessed on 20 August 2025) |
| Scholar Saleh Al-Fawzan | https://www.alfawzan.af.org.sa/ (accessed on 20 August 2025) |
| Fatwa Pedia | https://fatawapedia.com/ (accessed on 20 August 2025) |
| Type of Reasoning | Description | Example Question |
|---|---|---|
| Rule–Condition | A legal ruling depends on a specific condition that must be satisfied. | What is the ruling on fasting if the patient suffers from a chronic illness? |
| Rule–Exception | A specific case is exempted from a general ruling based on defined criteria. | Is fasting obligatory for all patients, or are there exceptions for some cases? |
| Evidence–Ruling | A legal ruling is supported by textual evidence such as a Qur’anic verse, Hadith, or scholarly opinion. | What is the legal evidence for paying expiation for a sick person unable to fast? |
| Cause–Consequence | A ruling focuses on the results or penalties arising from an action or state. | What are the consequences when a patient who is unable to fast breaks their fast? |
| Comparison | Two alternative viewpoints or actions are compared to determine preference. | Which is preferable: performing Umrah or paying off a debt? |
| Reasoning Pattern | Example Question Templates |
|---|---|
| Rule–Condition | What is the ruling of {action} if {condition} occurs? |
| | Is it permissible to perform {action} under {circumstance}? |
| | What is the Islamic ruling on {action} when {condition} exists? |
| Rule–Exception | What is the general ruling of {action}, and are there any {exception}? |
| | In which situations is {action} exempted due to {exception}? |
| | Is {action} permissible in all cases, or are there {exception}? |
| Evidence–Ruling | What is the Islamic evidence supporting the ruling of {concept}? |
| | Is there evidence from the Qur’an or Sunnah regarding {action}? |
| | Which prophetic hadith supports the ruling of {action}? |
| | Which Qur’anic verse is used as evidence for the ruling of {concept}? |
| Cause–Consequence | What is the effect of {cause} on the ruling of {action}? |
| | What is the legal consequence resulting from {cause}? |
| | Does {cause} lead to a change in the ruling of {action}? |
| Comparison | What are the differences between {concept1} and {concept2}? |
| | What is the difference between {action1} and {action2} in Islamic ruling? |
| | Does the ruling of {action1} differ from the ruling of {action2}? |
| | Which is preferable: {action1} or {action2}? |
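
The templates above use named slots such as `{action}` and `{condition}`, so they can be instantiated with ordinary string formatting. The sketch below is illustrative only: the helper names `template_slots` and `fill_template` and the sample feature values are assumptions, and the paper's own sub-question construction also involves human annotators.

```python
import string


def template_slots(template: str) -> set[str]:
    """Return the names of all {slot} placeholders in a template."""
    return {name for _, name, _, _ in string.Formatter().parse(template) if name}


def fill_template(template: str, features: dict[str, str]) -> str:
    """Instantiate a question template from extracted semantic features.

    Raises KeyError if a slot required by the template has no feature value,
    so that incomplete questions are never emitted silently.
    """
    missing = template_slots(template) - set(features)
    if missing:
        raise KeyError(f"missing slot values: {sorted(missing)}")
    return template.format(**features)


# Hypothetical feature values for a Rule–Condition pattern.
question = fill_template(
    "What is the ruling of {action} if {condition} occurs?",
    {"action": "fasting", "condition": "chronic illness"},
)
print(question)
```

Checking for missing slots before formatting keeps the failure explicit, which is useful when the extracted features for a passage do not cover every slot of the chosen template.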
| Question Type | Count | Percentage |
|---|---|---|
| What | 272 | 70.10% |
| Which | 52 | 13.40% |
| How | 23 | 5.93% |
| How much | 12 | 3.09% |
| When | 22 | 5.67% |
| Why | 7 | 1.80% |
| Total | 388 | 100% |
| Question Type | Subquestion 1 | Subquestion 2 | Subquestion 3 | Total | Percentage |
|---|---|---|---|---|---|
| What | 275 | 284 | 151 | 710 | 75.69% |
| Which | 91 | 76 | 11 | 178 | 18.98% |
| How | 5 | 6 | 0 | 11 | 1.17% |
| How much | 12 | 4 | 0 | 16 | 1.71% |
| When | 3 | 8 | 0 | 11 | 1.17% |
| Why | 2 | 10 | 0 | 12 | 1.28% |
| Total | 388 | 388 | 162 | 938 | 100.00% |
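
The totals and percentages in the sub-question table follow directly from the per-position counts. As a quick consistency check, the sketch below recomputes them; the counts are copied from the table, while the variable and function names are illustrative.

```python
# Per-question-type counts for sub-questions 1, 2, and 3, copied from the table.
COUNTS = {
    "What": (275, 284, 151),
    "Which": (91, 76, 11),
    "How": (5, 6, 0),
    "How much": (12, 4, 0),
    "When": (3, 8, 0),
    "Why": (2, 10, 0),
}


def summarize(counts: dict[str, tuple[int, ...]]) -> dict[str, tuple[int, float]]:
    """Map each question type to (row total, percentage of all sub-questions)."""
    grand_total = sum(sum(row) for row in counts.values())
    return {
        qtype: (sum(row), round(100 * sum(row) / grand_total, 2))
        for qtype, row in counts.items()
    }


# Column totals: how many instances have a first, second, and third sub-question.
column_totals = tuple(sum(col) for col in zip(*COUNTS.values()))

for qtype, (total, pct) in summarize(COUNTS).items():
    print(f"{qtype}: {total} ({pct:.2f}%)")
print("Column totals:", column_totals)
```

The column totals show that every one of the 388 questions has at least two sub-questions and 162 of them have a third.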
| | Min (tokens) | Max (tokens) | Avg (tokens) |
|---|---|---|---|
| Final Answer | 10 | 109 | 40.31 |
| Subanswer 1 | 3 | 55 | 19.94 |
| Subanswer 2 | 4 | 54 | 21.19 |
| Subanswer 3 | 8 | 36 | 19.01 |
| Model | F1 | BLEU-1 | ROUGE-1 | ROUGE-2 | ROUGE-L | BERT-P | BERT-R | BERT-F |
|---|---|---|---|---|---|---|---|---|
| Arabic-T5-small | 31.0 | 26.0 | 31.0 | 16.0 | 29.0 | 91.0 | 89.0 | 90.0 |
| AraT5-base | 29.0 | 21.0 | 29.0 | 13.0 | 25.0 | 92.0 | 90.0 | 91.0 |
| AraT5-base-msa | 8.0 | 5.0 | 8.0 | 3.0 | 8.0 | 78.0 | 81.0 | 80.0 |
| AraBART | 10.0 | 6.0 | 10.0 | 4.0 | 7.0 | 82.0 | 87.0 | 85.0 |
| mT5-small | 2.0 | 0.3 | 1.7 | 0.7 | 1.6 | 79.0 | 76.0 | 78.0 |
| mT5-base | 11.0 | 1.0 | 11.0 | 1.0 | 10.0 | 83.0 | 82.0 | 83.0 |
| Qwen-7B | 29.0 | 24.0 | 29.0 | 14.0 | 25.0 | 90.0 | 91.0 | 91.0 |
| Mistral-7B | 25.0 | 21.0 | 25.0 | 10.0 | 21.0 | 89.0 | 91.0 | 90.0 |
| Model | Token-F1 | BLEU-1 | BLEU-2 | ROUGE-2 | ROUGE-L | BERT-P | BERT-R | BERT-F |
|---|---|---|---|---|---|---|---|---|
| Arabic-T5-small | 22.0 | 10.0 | 7.0 | 12.0 | 20.0 | 92.0 | 86.0 | 89.0 |
| AraT5-base | 25.0 | 14.0 | 9.0 | 13.0 | 21.0 | 91.0 | 87.0 | 89.0 |
| AraT5-base-msa | 4.0 | 2.0 | 0.4 | 0.05 | 2.0 | 74.0 | 79.0 | 77.0 |
| AraBART | 21.0 | 13.0 | 8.0 | 8.0 | 13.0 | 85.0 | 91.0 | 88.0 |
| mT5-small | 1.0 | 0.04 | 0.03 | 0.4 | 1.0 | 79.0 | 77.0 | 78.0 |
| mT5-base | 4.0 | 1.0 | 0.6 | 0.7 | 5.0 | 83.0 | 84.0 | 83.0 |
| Qwen-7B | 20.0 | 11.0 | 8.0 | 12.0 | 18.0 | 86.0 | 87.0 | 87.0 |
| Mistral-7B | 21.0 | 11.0 | 7.0 | 9.0 | 19.0 | 90.0 | 86.0 | 88.0 |
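
The Token-F1 column reports overlap between generated and reference text at the token level. The sketch below shows one common way to compute such a score, the harmonic mean of token precision and recall over a multiset overlap. It assumes plain whitespace tokenization with no Arabic-specific normalization; the tokenization and normalization actually used for the reported scores are not shown in the tables, so this is an illustration rather than the evaluation script.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a prediction and a reference, in the range [0, 1].

    Assumption: tokens are whitespace-separated; no diacritic or
    punctuation normalization is applied.
    """
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each token is matched at most as often as it
    # occurs in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Multiplying the result by 100 gives values on the same percentage scale as the table.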
| Model | Relevance | Faith_entail_mean | Faith_contra_mean |
|---|---|---|---|
| Arabic-T5-small | 68.0 | 55.0 | 26.0 |
| AraT5-base | 70.0 | 57.0 | 7.0 |
| AraT5-base-msa | 18.0 | 28.0 | 21.0 |
| AraBART | 75.0 | 65.0 | 6.0 |
| mT5-small | 19.0 | 30.0 | 24.0 |
| mT5-base | 32.0 | 34.0 | 19.0 |
| Qwen-7B | 59.0 | 62.0 | 13.0 |
| Mistral-7B | 70.0 | 56.0 | 14.0 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Al-Qahtani, M.A.; Alkhamees, B.F.; Ykhlef, M. MAFQA: A Dataset for Benchmarking Multi-Hop Arabic Fatwa Question Answering. Data 2026, 11, 64. https://doi.org/10.3390/data11030064

