Article

Arabic WikiTableQA: Benchmarking Question Answering over Arabic Tables Using Large Language Models

1 Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
2 Department of Information Systems, College of Computer Science and Information Systems, Najran University, Najran 55461, Saudi Arabia
* Author to whom correspondence should be addressed.
Electronics 2025, 14(19), 3829; https://doi.org/10.3390/electronics14193829
Submission received: 28 August 2025 / Revised: 21 September 2025 / Accepted: 26 September 2025 / Published: 27 September 2025
(This article belongs to the Special Issue Deep Learning Approaches for Natural Language Processing)

Abstract

Table-based question answering (TableQA) has made significant progress in recent years; however, most advancements have focused on English datasets and SQL-based techniques, leaving Arabic TableQA largely unexplored. This gap is especially critical given the widespread use of structured Arabic content in domains such as government, education, and media. The main challenge lies in the absence of benchmark datasets and the difficulty that large language models (LLMs) face when reasoning over long, complex tables in Arabic, due to token limitations and morphological complexity. To address this, we introduce Arabic WikiTableQA, the first large-scale dataset for non-SQL Arabic TableQA, constructed from the WikiTableQuestions dataset and enriched with natural questions and gold-standard answers. We developed three methods to evaluate this dataset: a direct input approach, a sub-table selection strategy using SQL-like filtering, and a knowledge-guided framework that filters the table using semantic graphs. Experimental results with an LLM show that the graph-guided approach outperforms the others, achieving 74% accuracy, compared to 64% for sub-table selection and 45% for direct input, demonstrating its effectiveness in handling long and complex Arabic tables.

1. Introduction

Tabular data is one of the most common formats for representing structured information in real-world applications. It supports many tasks in business analytics, scientific documentation, and governmental reporting. Web tables, particularly those hosted on Wikipedia, range from historical data to comparative statistics and are frequently queried by users seeking answers grounded in structured knowledge [1].
Despite rapid progress in natural language understanding, current large language models (LLMs) remain poorly equipped to process tabular data effectively. LLMs are trained on linear sequences of text and do not naturally capture the structure or reasoning patterns needed to understand two-dimensional tables [2]. As a result, LLMs struggle with understanding schema-based relationships, aligning rows and columns, and performing operations such as filtering, grouping, and aggregating values. Furthermore, LLMs are fundamentally limited by their input length constraints. For long tables—those exceeding a few dozen rows or containing multiple attributes—standard models are forced to truncate content, often omitting essential rows, column headers, or metadata. This leads to a degradation in reasoning accuracy and introduces wrong or incomplete answers [3].
While English-centric approaches have introduced solutions such as TAPAS [4], TaBERT [5], and TabSQLify [6], these methods either rely on resource-heavy pretraining or assume access to short, clean tables. Recent approaches address input length limitations by first applying symbolic decomposition methods before passing the reduced content to LLMs for answer generation [6]. However, none of these methods generalise to morphologically rich, under-resourced languages like Arabic. The absence of Arabic datasets for table question answering further compounds the issue, making it impossible to benchmark performance or train models effectively in this language.
On the other hand, the exponential growth of structured data in domains such as finance, healthcare, and scientific research has necessitated robust methods for automated querying and analysis of large tabular datasets [7]. Traditional natural language processing (NLP) models, primarily designed for unstructured text, have shown limitations when handling the relational and multi-dimensional characteristics of structured tables [7]. LLMs, despite their significant advancements in natural language understanding, exhibit difficulties in reasoning over complex tabular data due to their linear processing nature [8]. These limitations are particularly evident when the data span multiple rows and columns with intricate relationships, which are common in real-world applications.
In this work, we identify the main gap in the current TableQA landscape: there is no benchmark dataset for table-based question answering in Arabic, despite the widespread presence of Arabic tabular content on Wikipedia and government portals. This leaves Arabic QA systems unable to benefit from structured data or support real-world scenarios where answers are embedded in tables.
To address these limitations, we introduce Arabic WikiTableQA, the first large-scale dataset for question answering over Arabic tables. Sourced from structured tables on Arabic Wikipedia, the dataset includes thousands of question-answer pairs aligned with their corresponding tables and enriched with metadata such as table domain, answer type, and reasoning category. This resource enables direct answer extraction without requiring Structured Query Language (SQL) annotations, bridging a critical gap in Arabic TableQA research.
In addition to the dataset, we propose and evaluate three approaches for answering questions over Arabic tables: (1) a direct input method where the full table and question are passed to the LLM; (2) a sub-table selection approach that decomposes the input using SQL-like filtering; and (3) a knowledge-guided filtering framework that leverages semantic graphs to extract relevant table segments. Together, these contributions lay the groundwork for future research in Arabic TableQA and structured reasoning with LLMs in low-resource settings.
Our contributions are as follows:
  • We construct Arabic WikiTableQA, the first Arabic benchmark for table-based question answering, sourced from Wikipedia tables and curated with natural questions.
  • We propose and evaluate three different approaches for answering questions over Arabic tables: a direct LLM-based method, a decomposition-based sub-table selection model (ArTabSQL), and a knowledge-guided filtering approach using semantic graphs.
  • We provide a comprehensive experimental evaluation, demonstrating that the knowledge-guided approach outperforms the other methods and effectively addresses LLM token limitations.
This paper is organised as follows: Section 2 reviews related work on Arabic TableQA and knowledge graph integration. Section 3 details the methodology and proposed frameworks. Section 4 presents the experimental setup, including dataset construction and implementation. Section 5 discusses the results and evaluation. Finally, Section 6 concludes the paper and outlines future research directions.

2. Related Work

2.1. Arabic Tabular Data with LLM

Table-based question answering (TableQA) is a growing research area focused on extracting answers from structured tabular data using natural language queries. While early studies primarily targeted English and relied heavily on semantic parsing or SQL generation, recent advances have introduced novel architectures and strategies capable of operating over small tables without explicit query generation. These developments are especially relevant for Arabic TableQA, where challenges include linguistic variation, limited datasets, and schema misalignment.
One solution involves decomposition-based approaches. The TabSQLify [6] framework decomposes large tables into smaller, question-relevant sub-tables by generating text-to-SQL queries, thereby reducing input length and computational load for LLMs during reasoning. Complementarily, the NormTab [3] framework addresses data inconsistency in web tables by applying a one-time normalisation preprocessing step. This enhances the symbolic reasoning capabilities of LLMs by aligning the structural semantics of tabular inputs.
In Arabic NLP, text-to-SQL approaches have become central to bridging natural language queries and structured data. Ar-Spider [9] is an effort to build Arabic text-to-SQL benchmarks based on translations of the Spider dataset. Ar-Spider proposes a democratised platform for Arabic table interaction, addressing schema-linguistic and SQL-structural challenges using cross-lingual models and context-similarity relationships (CSR), achieving 66.63% accuracy by combining LGESQL, XLM-R, and CSR models.
Beyond SQL-based QA, integrating LLMs with knowledge graphs (KGs) has shown effectiveness. For instance, biomedical and medical QA systems have used KG-enhanced LLMs to ground reasoning in verifiable facts, reduce hallucinations, and increase answer fidelity [10]. Such hybrid methods are promising for domains like Arabic TableQA, where structured information and linguistic nuances coexist.
Another foundational study [11] introduced the WikiTableQuestions dataset for open-domain TableQA, demonstrating the potential of logical-form-based semantic parsing on semi-structured tables and setting benchmarks for reasoning over unseen schemas.
Despite notable advances in English-centric TableQA systems, Arabic remains vastly underrepresented in this domain. Existing Arabic resources such as Ar-Spider [9] focus specifically on text-to-SQL tasks, translating natural language queries into SQL commands over relational databases. While valuable, these datasets are tightly bound to SQL-based systems and assume access to structured relational schemas. Consequently, they are unsuitable for models aiming to extract answers directly from raw tabular data without relying on executable queries.
Furthermore, there is currently no Arabic benchmark or dataset for non-SQL-based table question answering, such as open-domain QA over semi-structured web tables (e.g., HTML tables). No Arabic equivalent exists for English datasets like WikiTableQuestions or TabFact, which support reasoning without SQL. This creates a significant gap in the development of Arabic TableQA models, particularly those based on neural architectures or few-shot prompting that bypass formal logical forms.
Additionally, no Arabic models have been specifically adapted to reason over tables using chunking, decomposition, or agent-based reasoning techniques that have proven effective in English TableQA systems such as TabSQLify or NormTab.
This lack of resources highlights a critical bottleneck: Arabic QA over tables remains largely unexplored beyond SQL-based semantic parsing. Building datasets to support direct answer extraction from Arabic tables, especially for QA without SQL generation, is an essential next step to democratise access to structured knowledge in the Arabic language.

2.2. Knowledge Graph Integration in TableQA

Knowledge graphs (KGs) have emerged as a powerful tool to enhance reasoning and factual consistency in table-based question answering. By representing structured knowledge as entities and relationships, KGs provide a semantic backbone that helps LLMs ground their outputs, reduce hallucinations, and support symbolic reasoning over table content.
Recent frameworks, such as Knowledge Graph-based Thought (KGT), demonstrate the benefits of integrating KGs with LLMs. One study [12] introduced KGT for biomedical QA, where the system verifies LLM-generated answers using facts retrieved from domain-specific KGs. This approach significantly improved answer correctness and robustness against factual errors. The framework also proved effective across different LLMs and use cases, including drug repositioning and resistance prediction.
Similarly, another study [10] applied a KG-augmented LLM to a traditional Chinese medicine QA system. The LLM extracted entities from unstructured text and populated a case-based knowledge graph. The resulting KG-LLM pipeline outperformed vanilla LLMs in accuracy, relevancy, and interpretability when answering clinical questions over patient records.
In the context of tabular data, KGs can serve as an external source of truth that complements the inherent structure of tables. For example, ambiguous column headers or entity mentions within cells can be disambiguated using linked concepts from KGs [13]. This is especially beneficial when dealing with semi-structured or noisy tables, such as those commonly found on the web.
Although these systems are promising, no such KG-augmented TableQA framework currently exists for Arabic. The absence of Arabic-specific tabular KGs and entity-linking tools further compounds the challenge. To bridge this gap, we considered leveraging Arabic KGs aligned with common tabular schemas.

2.3. Applications of LLM-Based QA

Recent studies highlight that LLMs have advanced QA not only in table reasoning but also across broader domains such as fact-checking, cross-lingual QA, and multimodal reasoning. For instance, ref. [14] proposed X-Fact, a framework for automatic fact-checking using LLMs that integrates contextual and discourse information to improve claim verification. Their results demonstrate the value of LLMs in reasoning beyond surface-level retrieval, showing strong potential in ensuring factual consistency across diverse domains.
Similarly, ref. [15] introduced an explainable automated fact-checking approach for public health claims, emphasising that reliability in QA extends beyond correctness to include interpretable justifications. Their work illustrates how domain-specific reasoning and explanation generation can enhance user trust and applicability in high-stakes areas such as healthcare.
These research directions also align with advances in cross-lingual QA and multimodal reasoning, where LLMs are increasingly adapted to handle multiple languages and diverse input modalities.

3. Materials and Methods

To develop and evaluate the proposed frameworks for answering questions over long Arabic tables, we first constructed an Arabic tabular dataset, a translated version of the English WikiTableQuestions (WikiTQ) benchmark [11]. Using this dataset, we then evaluated three approaches: direct input, sub-table selection, and knowledge graph filtering. Details of each approach are provided in the following subsections.

3.1. Direct Approach

In this approach, the Arabic WikiTQ dataset is directly provided to the LLM without any intermediate processing or decomposition. As illustrated in Figure 1, the full table along with the natural language question is input to the LLM, which is responsible for both interpreting the table and extracting the final answer. This end-to-end setup relies entirely on the model’s internal reasoning capabilities to handle the structural complexity of tabular data and produce accurate responses.
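For illustration, the following minimal sketch shows one way the full table and question can be combined into a single prompt. The pipe-separated serialisation and the Arabic prompt wording here are illustrative assumptions; the exact templates used in our experiments are those in the repository linked in Section 4.3.

```python
# Sketch of the direct approach: serialise the whole table and pass it,
# together with the question, to the LLM in one prompt. The serialisation
# format and prompt wording are illustrative, not the exact templates.

def serialize_table(headers: list[str], rows: list[list[str]]) -> str:
    """Render a table as pipe-separated text, one line per row."""
    lines = [" | ".join(headers)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)

def build_direct_prompt(question: str, headers: list[str], rows: list[list[str]]) -> str:
    """Combine the serialised table and the question into a single prompt."""
    return (
        "أجب عن السؤال التالي بالاعتماد على الجدول فقط.\n\n"  # "Answer using only the table."
        f"{serialize_table(headers, rows)}\n\n"
        f"السؤال: {question}\nالإجابة:"
    )
```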

3.2. Sub-Table Selection with SQL Approach

In this approach, we developed an Arabic model named ArTabSQL, based on a decomposition approach [6]. Figure 2 presents the overall structure of the sub-table selection approach, which decomposes the QA process into two main stages: sub-table selection and answer generation. Given a full table and a natural language question, the LLM first performs sub-table selection by identifying the rows and columns that align with the question intent. This selection is typically expressed as an SQL query that extracts a focused, question-relevant view of the data. The resulting sub-table, containing only the essential information, is then passed to an LLM for reasoning and answer generation. This pipeline mitigates the limitations of LLMs regarding input length and irrelevant content by reducing table size and, consequently, the number of tokens.
Figure 3 illustrates an example using a sample from the Arabic WikiTQ dataset. The full table provides a timeline of sports events, and the question posed is: “ماهي المسابقة الوحيدة في عام 2001؟” (What is the only competition in 2001?). In the first stage, the model identifies the most relevant columns, “سنة” (Year) and “منافسة” (Competition), and decomposes the table using the condition WHERE سنة = 2001. This yields a sub-table with a single row corresponding to the year 2001. In the second stage, the model takes this sub-table and the original question to generate the final answer: “نهائي الجائزة الكبرى للاتحاد الدولي لألعاب القوى” (IAAF Grand Prix Final). This example demonstrates the effectiveness of the decomposition strategy in narrowing down the input and enhancing answer generation.
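A minimal sketch of this two-stage pipeline follows, assuming a hypothetical ask_llm helper and an in-memory SQLite database for executing the generated query; ArTabSQL's actual prompts and execution details are in the repository linked in Section 4.3.

```python
import sqlite3
import pandas as pd

def ask_llm(prompt: str) -> str:
    """Hypothetical helper that sends a prompt to the LLM (see Section 4.3)."""
    raise NotImplementedError

def answer_with_subtable(question: str, table: pd.DataFrame) -> str:
    # Stage 1: ask the model for an SQL query selecting only the
    # question-relevant rows and columns of table "T".
    sql = ask_llm(
        "اكتب استعلام SQL على الجدول 'T' يختار الأعمدة والصفوف "
        f"المتعلقة بالسؤال فقط.\nالأعمدة: {list(table.columns)}\n"
        f"السؤال: {question}\nSQL:"
    )
    # Execute the generated query against an in-memory copy of the table.
    conn = sqlite3.connect(":memory:")
    try:
        table.to_sql("T", conn, index=False)
        subtable = pd.read_sql_query(sql, conn)
    finally:
        conn.close()
    # Stage 2: generate the final answer from the reduced sub-table.
    return ask_llm(
        f"الجدول:\n{subtable.to_string(index=False)}\n"
        f"السؤال: {question}\nالإجابة:"
    )
```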
However, both the direct input and sub-table selection approaches encounter significant challenges when dealing with long tables, primarily due to the context length limitations of current LLMs [6]. These models can only process a fixed number of tokens per input, and when that limit is exceeded, any content beyond the threshold is truncated and ignored during inference.
In the direct input approach, the full table and question are fed into the model without preprocessing. While this is effective for short and moderately sized tables, it quickly becomes unreliable as the table grows. For large tables, relevant information may lie in the truncated portion that the model never sees. This results in incomplete context and often leads to incorrect or partially correct answers, especially when the answer depends on aggregating or comparing data across distant rows or columns.
The sub-table selection approach attempts to mitigate this issue by first extracting a smaller, relevant portion of the table before passing it to the LLM. However, identifying the correct sub-table still requires scanning the full table. This decomposition step can itself be constrained by token limits or may fail to capture all relevant entries when information is distributed across the table. Furthermore, if the filtering logic is too shallow or approximate, it may exclude rows containing the correct answer, leading to false negatives.
In both cases, the inability of the LLM to process long inputs holistically limits the model’s capacity to perform deep reasoning or answer complex queries that require global understanding of the entire table. This bottleneck highlights the need for more efficient table summarisation, hierarchical filtering, or hybrid systems that can handle large structured data beyond the current context window limitations of LLMs.

3.3. Knowledge Graph for Table Filtering Approach

The third approach introduces a novel knowledge-guided framework that leverages structured graph representations to enhance question answering over Arabic tables. Unlike previous methods that rely on flat table input or shallow SQL-style selection, this approach builds a semantic graph from the full table to capture deeper relationships between table elements, enabling more accurate and interpretable reasoning, especially for long and complex tables.
As illustrated in Figure 4, the process begins with the relationship (operations) extraction stage. The full table is analysed by an LLM-based module that identifies semantic links between cells, headers, and contextual attributes. These relationships are then used to construct a full graph, where each node represents a table cell, column, or header, and edges reflect the identified relationships.
Simultaneously, the system parses the natural language question using a question parser, also powered by an LLM. The parser identifies key entities, target columns, and any operations implied by the question (e.g., filtering by year or aggregating values). The result is a parsed representation of the question, which is then used to generate targeted subgraphs from the full graph. These include subgraphs for entities (e.g., “2001”), columns (e.g., “سنة”, “منافسة”), and operations (e.g., “WHERE”).
In the next step, the extracted subgraphs are combined into a single graph, capturing only the most relevant segments of the full table graph for the question at hand. This combined graph is then converted back into a filtered table using a graph-to-table mapping module. By focusing only on necessary parts of the data, this step ensures that irrelevant information is excluded, reducing input length and improving efficiency.
Finally, the filtered table, along with the original question, is passed to the LLM for answer generation. This graph-guided filtering enables the model to reason over a more compact and semantically aligned view of the long table, resulting in more accurate answers.
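The sketch below illustrates the subgraph-selection step with networkx, assuming the LLM has already extracted (node, relation, node) triples from the table and that the question parser has returned the target entities and columns. The triples, helper names, and the simple one-hop neighbourhood rule are illustrative assumptions, and the graph-to-table mapping is omitted for brevity.

```python
import networkx as nx

def build_table_graph(triples: list[tuple[str, str, str]]) -> nx.Graph:
    """Build the full table graph from LLM-extracted (head, relation, tail) triples."""
    g = nx.Graph()
    for head, relation, tail in triples:
        g.add_edge(head, tail, relation=relation)
    return g

def select_subgraph(g: nx.Graph, entities: list[str], columns: list[str]) -> nx.Graph:
    """Keep the parsed targets plus their one-hop neighbours (the row context)."""
    keep = set()
    for node in entities + columns:
        if node in g:
            keep.add(node)
            keep.update(g.neighbors(node))
    return g.subgraph(keep).copy()

# Toy triples for the Figure 5 example (illustrative, not real extractor output).
triples = [
    ("2001", "in_column", "سنة"),
    ("نهائي الجائزة الكبرى للاتحاد الدولي لألعاب القوى", "in_column", "منافسة"),
    ("2001", "same_row", "نهائي الجائزة الكبرى للاتحاد الدولي لألعاب القوى"),
]
sub = select_subgraph(build_table_graph(triples), entities=["2001"], columns=["منافسة"])
print(sorted(sub.nodes()))  # nodes that feed the filtered table passed to the LLM
```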
Figure 5 provides a detailed example of this process using an Arabic sports event table. The relationship extractor identifies links between years, locations, and competitions, while the question parser detects that the query targets the year “2001” and the column “منافسة”. The corresponding subgraphs are combined, converted back into a filtered table, and used by the LLM to produce the correct answer: “نهائي الجائزة الكبرى للاتحاد الدولي لألعاب القوى”.
This approach is particularly effective in overcoming the limitations of token-based truncation by shifting the reasoning process from raw tabular input to a structured semantic representation. It supports a more scalable and interpretable QA pipeline, which is especially valuable when handling large or complex Arabic tables.

4. Experimental Setup

4.1. Dataset

In this research, we introduce the first Arabic benchmark dataset for non-SQL table question answering (TableQA) by translating and adapting the widely used WikiTableQuestions (WikiTQ) dataset, originally developed by Pasupat and Liang [11]. WikiTQ contains 2108 semi-structured HTML tables paired with over 22,000 natural language questions, sourced from Wikipedia. The dataset emphasises compositional reasoning, making it a strong candidate for evaluating QA models beyond simple retrieval or lookup.
To develop the Arabic version, we implemented a multi-stage translation and preprocessing pipeline. After experimenting with various translation systems, we found that GPT-4o provided the highest-quality translations, preserving contextual accuracy, domain-specific terminology, and syntactic clarity. We chose GPT-4o because many studies have shown its strong translation quality [16,17], including for Arabic [18,19]. To further ensure accuracy, we checked random samples several times and confirmed semantic correctness and consistency. As human translation was cost-prohibitive, we used GPT-4o to translate both table content and natural language questions into Modern Standard Arabic. Post-processing involved minor corrections and consistency checks, but no full human re-translation.
The process begins with the normalisation of Arabic text and table structures. Given the diverse formatting of web tables and the morphological complexity of Arabic, we applied a series of cleaning steps to ensure consistency across the dataset. This included removing extraneous symbols such as Tatweel and diacritics, standardising numeral formats, and unifying header and cell content.
The normalisation steps we applied were not intended to affect linguistic richness, but rather to standardise technical formatting for consistency and LLM compatibility. For example, we removed Tatweel (هكــــذا → هكذا, comparable to “heeeellooo” → “hello”), excessive character repetition (نعممم → نعم, “yesss” → “yes”), and diacritics (ذهّب → ذهب, “went”). We also unified numbers (0012 → 12), times (04:00 pm → 16:00), dates (17/09/2025 or 17 September 2025 → ISO format 2025-09-17T16:00:00Z), and web links.
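The following regular-expression sketch reproduces the character-level steps above (Tatweel, diacritics, repetition, leading zeros); the patterns are simplified illustrations of our pipeline, and the time, date, and link unification steps are omitted.

```python
import re

TATWEEL = "\u0640"                           # ـ (Arabic Tatweel)
DIACRITICS = re.compile(r"[\u064B-\u0652]")  # Arabic diacritic marks
REPEATS = re.compile(r"(.)\1{2,}")           # 3+ repeated characters
LEADING_ZEROS = re.compile(r"\b0+(\d)")      # 0012 -> 12

def normalize_cell(text: str) -> str:
    """Simplified character-level normalisation of a table cell or question."""
    text = text.replace(TATWEEL, "")         # هكــــذا -> هكذا
    text = DIACRITICS.sub("", text)          # remove short vowels, Shadda, Sukun
    text = REPEATS.sub(r"\1", text)          # نعممم -> نعم
    text = LEADING_ZEROS.sub(r"\1", text)    # unify numeral formats
    return text.strip()

print(normalize_cell("نعممم 0012"))  # -> "نعم 12"
```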
For each table-question pair, we constructed natural language prompts that combined the Arabic question with a serialised representation of the corresponding table. These prompts were formatted to guide the model towards identifying relevant information in the table and inferring the correct answer.
We encountered several challenges during adaptation. First, linguistic divergence between Arabic and English—including differences in word order, inflexion, and ambiguity—made direct translation non-trivial. Complex questions with nested reasoning components often required restructuring to remain grammatically and semantically sound in Arabic. Additionally, schema alignment proved difficult, as table headers often had to be expanded or paraphrased in Arabic to preserve clarity and maintain linkage to the questions.
A major focus was also placed on structural normalisation. Web tables often include irregularities such as merged cells, implicit row headers, inconsistent date formats, and heterogeneous column types. To address these issues, we applied normalisation strategies inspired by recent frameworks like NormTab [3], including unifying numerical formats, standardising date strings, and flattening table hierarchies. This was crucial to ensure downstream models could effectively process and reason over the data.
Unlike prior Arabic datasets, which are primarily text-based or focus on SQL-oriented parsing (e.g., Ar-Spider) [9], our benchmark enables direct answer extraction from Arabic tables without requiring SQL annotations. This distinguishes it from previous work and aligns it with modern trends in open-domain and few-shot TableQA, such as those explored in TabSQLify. The resulting dataset provides over 22,000 QA pairs linked to realistic Arabic tables, preserving the original question complexity and table diversity of the English WikiTQ benchmark.
Moreover, the knowledge graph used in our approach is newly constructed for each table in the dataset rather than drawn from a pre-existing resource. Our method selects three focused subgraphs from the full graph, capturing nodes, entities, and relationships respectively, and combines them into a compact, question-relevant representation. This approach is dataset-agnostic and can be applied to any other table-based QA resource.

4.2. Baselines and Evaluation Metrics

Due to the absence of prior work on Arabic table-based question answering, no existing baselines are directly applicable to our task. Existing Arabic datasets, such as Ar-Spider, focus exclusively on SQL-oriented semantic parsing and assume access to well-structured relational databases. These settings are incompatible with our open-domain, non-SQL TableQA framework, which relies on answering natural language questions directly from semi-structured Arabic tables without formal query generation.
As a result, we evaluate all approaches under a unified experimental setting using the Arabic WikiTQ dataset. This allows benchmarking of model performance across different reasoning strategies on the same dataset.
For evaluation, we adopt Exact Match (EM) as the primary metric, which measures whether the predicted answer exactly matches the gold-standard answer. Given the morphological richness and syntactic variability of Arabic, we implement a flexible EM scoring mechanism that accounts for common token variations, numeral formats, and minor morphological inflexions.
In addition to EM, we used several complementary metrics to capture both lexical and semantic similarity. At the lexical level, we include normalised EM and token-level F1, which better reflect partial overlaps and morphological variations in Arabic. At the semantic level, we evaluate using BERTScore F1, sentence embedding cosine similarity, and semantic accuracy, allowing us to measure whether predicted answers convey the same meaning as the gold reference even when surface forms differ. These metrics provide a more comprehensive assessment of model performance beyond the strictness of EM.
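A minimal sketch of the lexical metrics is given below; the normalisation shown is a simplified stand-in (our full scorer reuses the dataset's Arabic normalisation), and the semantic metrics in Table 3 are computed with off-the-shelf embedding-based scorers.

```python
from collections import Counter

def normalize_answer(s: str) -> str:
    """Simplified stand-in for the flexible Arabic normalisation described above."""
    return " ".join(s.strip().split())

def exact_match(pred: str, gold: str) -> bool:
    """Normalised EM: exact string equality after normalisation."""
    return normalize_answer(pred) == normalize_answer(gold)

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 over the overlap between predicted and gold tokens."""
    p, g = normalize_answer(pred).split(), normalize_answer(gold).split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

# BERTScore F1 (Table 3) can be computed with the bert-score package, e.g.:
#   from bert_score import score
#   _, _, f1 = score(predictions, references, lang="ar")
```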
Since no comparable Arabic baselines exist, our reported results serve as the first benchmark for non-SQL TableQA in Arabic and provide a foundation for future comparisons.

4.3. Implementation

In our experiments, we employ GPT-4o as the primary language model; DeepSeek-v3.1 and Llama-3.3 are additionally evaluated in Section 5. The prompt format is newly designed and does not follow prior approaches. The main hyperparameters used for the LLM are model = “gpt-4o” and temperature = 0. The full code, along with the prompts, is available at https://github.com/Asmaa-Alrayzah/KG-for-Arabic-tables (accessed on 1 September 2025).
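A minimal sketch of the model call, using the OpenAI Python client with the hyperparameters stated above, is shown below; the message framing is illustrative, and the exact prompts are in the repository.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_llm(prompt: str) -> str:
    """Deterministic GPT-4o call with the hyperparameters reported above."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```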

5. Results and Discussion

5.1. Comparative Analysis of Language Models with KG

Table 1 presents the accuracy results of the three proposed approaches evaluated on the Arabic WikiTQ dataset, using EM accuracy as the primary metric. Each approach was tested under consistent experimental settings with a fixed temperature to ensure a fair comparison.
The direct approach achieved an accuracy of 45.2%, reflecting the limitations of relying solely on end-to-end reasoning over long Arabic tables. The low accuracy is attributed to the token constraints of the language model, which often lead to truncation of important parts of the table. As no preprocessing or filtering is applied, the LLM struggles to identify relevant rows or columns amid lengthy or complex tabular structures. This confirms that direct prompting is insufficient for deep reasoning over structured Arabic data.
The sub-table selection with SQL approach significantly outperformed the direct method, reaching an accuracy of 64.8%. This improvement stems from the use of decomposition strategies, where the model first identifies a relevant subset of the table using SQL-like selection before generating the answer. By narrowing the context, the model can focus on a reduced number of tokens directly relevant to the question. However, this method still depends on the LLM’s ability to accurately parse the question and retrieve the correct sub-table. In cases where relevant information spans multiple table regions or the SQL logic fails to capture implicit relations, the model may still produce incorrect or incomplete answers.
The highest performance was obtained using the KG-based approach, which achieved an accuracy of 75.3%. This method constructs a semantic graph from the full table and identifies relevant subgraphs guided by the question, effectively overcoming token limitations and structural misalignment. By combining subgraphs based on entities, columns, and operations, then converting the result into a filtered table, the LLM receives only the essential content for answer generation. This structured semantic reasoning enhances the model’s precision and robustness, particularly in handling complex questions requiring relational understanding across multiple table elements. Additionally, the interpretability of the graph-based filtering offers greater control and transparency over the reasoning process.
The results in Table 1 demonstrate the limitations of using LLMs directly on raw tables and highlight the effectiveness of intermediate representations. While the sub-table selection approach offers a practical compromise, the graph-guided method provides a more scalable and accurate solution for Arabic table question answering, especially when dealing with long and heterogeneous tables. These findings underscore the importance of integrating symbolic structure into LLM workflows to address input length constraints and improve reasoning over structured data.
In addition to EM, we extended our evaluation to include both lexical and semantic similarity metrics to provide a more comprehensive assessment of model performance.
Table 2 reports normalised EM and token-level F1 scores, which account for morphological and tokenisation variations in Arabic. The results show that GPT-4o achieved the highest performance across these metrics (normalised EM = 0.75, F1 = 0.77), with DeepSeek-v3.1 performing comparably and Llama-3.3 slightly lower. Notably, GPT-4o and DeepSeek-v3.1 obtained the same token-level F1 score (0.77), indicating that both models are equally capable of capturing partial lexical overlaps and handling morphological variation in Arabic answers.
These findings show that while EM may underestimate performance due to surface-level strictness, token-level evaluations highlight the models’ ability to capture partial matches and morphological variations effectively.
Table 3 presents semantic similarity metrics, including BERTScore F1, sentence embedding cosine similarity, and semantic accuracy. GPT-4o again achieved the highest results (BERTScore F1 = 0.94, cosine similarity = 0.88, semantic accuracy = 0.76), followed closely by DeepSeek-v3.1, with Llama-3.3 trailing. These results indicate that the models, particularly GPT-4o, can produce semantically faithful answers even when surface forms differ from the gold reference.
GPT-4o’s superior performance can be attributed to a combination of scale, architecture, training, and reasoning. Its large parameter count and extensively curated training data provide stronger factual grounding [20,21], as reflected in its leading exact-match results (EM = 0.74; normalised EM = 0.75). The model’s native multimodal architecture, trained jointly on text, audio, and vision [21], enhances semantic representation and underpins its highest BERTScore F1 (0.94). Moreover, advanced alignment techniques, including Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), improve reliability and user-preferred outputs [22], leading to the strongest semantic accuracy (0.76). Finally, GPT-4o demonstrates more robust reasoning through optimised chain-of-thought processes, resulting in fewer logical errors and more coherent synthesis across complex tasks. These combined advantages explain its consistent lead across evaluation metrics.

5.2. Scalability Analysis

Scalability is a critical concern when applying graph-based approaches to thousands of large tables. To address this, we conducted a scalability analysis, shown in Figure 6, examining model performance as input size increases.
Both BERTScore F1 and Exact Match scores decrease as the token count grows, reflecting the computational and representational challenges of large graphs.
Specifically, BERTScore F1 shows only a modest decline (from 0.943 to 0.850, drop of 0.093), indicating that semantic similarity is relatively well preserved even for large graphs. In contrast, Exact Match exhibits more substantial degradation (from 0.742 to 0.432, drop of 0.310), suggesting that while the model maintains comprehension of meaning, it struggles with precise reproduction when context length and graph size increase. These results demonstrate that the architecture is more robust in preserving semantic understanding than exact string matching under scalability stress.
Furthermore, we report the approximate node counts associated with each token threshold: ≥0 tokens (all nodes), ≥2000 tokens (~70–350 nodes), ≥4000 tokens (~150–350 nodes), and ≥8000 tokens (~250–350 nodes). This analysis highlights the direct relationship between large tables, token expansion, and performance drop.
These results show that while graph-based reasoning is computationally expensive, its semantic robustness makes it suitable for large-scale TableQA.

6. Conclusions

In this study, we introduced Arabic WikiTQ, the first benchmark dataset for table-based question answering in Arabic, addressing a critical gap in Arabic NLP resources. We evaluated three approaches—direct input, sub-table selection using SQL decomposition, and knowledge-guided graph filtering—using an LLM as the underlying language model. The results demonstrate the limitations of direct prompting when handling long or complex tables and highlight the effectiveness of structured intermediate representations. The knowledge graph-based approach achieves the highest accuracy, proving its ability to overcome token constraints and improve reasoning quality. Our work establishes a foundation for future research on Arabic TableQA and encourages further exploration of LLMs capable of effective reasoning over structured Arabic data.

Author Contributions

Methodology, A.A.; Validation, F.A.; Resources, A.A.; Data curation, A.A.; Writing—original draft, A.A.; Writing—review & editing, F.A.; Supervision, F.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

We would like to express our sincere appreciation to the anonymous reviewers and the editor for their valuable feedback and constructive suggestions. Their insights greatly contributed to improving the quality and clarity of this work.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LLM: Large Language Model
NLP: Natural Language Processing
SQL: Structured Query Language
TableQA: Table-based Question Answering
KGs: Knowledge Graphs

References

  1. Lu, W.; Zhang, J.; Fan, J.; Fu, Z.; Chen, Y.; Du, X. Large Language Model for Table Processing: A Survey. Front. Comput. Sci. 2025, 19, 192350. [Google Scholar] [CrossRef]
  2. Nguyen, G.; Brugere, I.; Sharma, S.; Kariyappa, S.; Nguyen, A.T.; Lecue, F. Interpretable LLM-based Table Question Answering. arXiv 2024, arXiv:2412.12386. Available online: http://arxiv.org/abs/2412.12386 (accessed on 1 August 2025). [CrossRef]
  3. Nahid, M.M.H.; Rafiei, D. NormTab: Improving Symbolic Reasoning in LLMs Through Tabular Data Normalization. arXiv 2024, arXiv:2406.17961, 3569–3585. [Google Scholar] [CrossRef]
  4. Herzig, J.; Nowak, P.K.; Müller, T.; Piccinno, F.; Eisenschlos, J. TaPas: Weakly Supervised Table Parsing via Pre-Training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 4320–4333. [Google Scholar] [CrossRef]
  5. Yin, P.; Neubig, G.; Yih, W.T.; Riedel, S. TABERT: Pretraining for joint understanding of textual and tabular data. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8413–8426. [Google Scholar] [CrossRef]
  6. Nahid, M.M.H.; Rafiei, D. TabSQLify: Enhancing Reasoning Capabilities of LLMs Through Table Decomposition. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2024, Mexico City, Mexico, 16–21 June 2024; Volume 1, pp. 5725–5737. [Google Scholar] [CrossRef]
  7. Wang, B.; Ren, C.; Yang, J.; Liang, X.; Bai, J.; Chai, L.; Yan, Z.; Zhang, Q.-W.; Yin, D.; Sun, X.; et al. MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL. arXiv 2023, arXiv:2312.11242. [Google Scholar]
  8. Mathur, P.; Siu, A.; Lipka, N.; Sun, T. MATSA: Multi-Agent Table Structure Attribution. In Proceedings of the EMNLP 2024-2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Miami, FL, USA, 12–16 November 2024; pp. 250–258. [Google Scholar]
  9. Almohaimeed, S.; Almohaimeed, S.; Al Ghanim, M.; Wang, L. Ar-Spider: Text-to-SQL in Arabic. In Proceedings of the ACM Symposium on Applied Computing, Avila, Spain, 8–12 April 2024; pp. 1024–1030. [Google Scholar] [CrossRef]
  10. Duan, Y.; Zhou, Q.; Li, Y.; Qin, C.; Wang, Z.; Kan, H.; Hu, J. Research on a traditional Chinese medicine case-based question-answering system integrating large language models and knowledge graphs. Front. Med. 2024, 11, 1512329. [Google Scholar] [CrossRef] [PubMed]
  11. Pasupat, P.; Liang, P. Compositional semantic parsing on semi-structured tables. In Proceedings of the ACL-IJCNLP 2015-53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Beijing, China, 26–31 July 2015; Volume 1, pp. 1470–1480. [Google Scholar] [CrossRef]
  12. Feng, Y.; Zhou, L.; Ma, C.; Zheng, Y.; He, R.; Li, Y. Knowledge graph–based thought: A knowledge graph–enhanced LLM framework for pan-cancer question answering. Gigascience 2025, 14, giae082. [Google Scholar] [CrossRef] [PubMed]
  13. Zong, C.; Yan, Y.; Lu, W.; Huang, E.; Shao, J.; Zhuang, Y. Triad: A Framework Leveraging a Multi-Role LLM-based Agent to Solve Knowledge Base Question Answering. arXiv 2024, arXiv:2402.14320. Available online: http://arxiv.org/abs/2402.14320 (accessed on 1 August 2025).
  14. Hang, C.N.; Yu, P.D.; Tan, C.W. TrumorGPT: Graph-Based Retrieval-Augmented Large Language Model for Fact-Checking. IEEE Trans. Artif. Intell. 2025, 1–15. [Google Scholar] [CrossRef]
  15. Shami, F.; Marchesin, S.; Silvello, G. Fact Verification in Knowledge Graphs Using LLMs. In Proceedings of the SIGIR 2025-Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, Padua, Italy, 13–17 July 2025; pp. 3985–3989. [Google Scholar] [CrossRef]
  16. Yan, J.; Yan, P.; Chen, Y.; Li, J.; Zhu, X.; Zhang, Y. Benchmarking GPT-4 against Human Translators: A Comprehensive Evaluation Across Languages, Domains, and Expertise Levels. arXiv 2024, arXiv:2411.13775. Available online: http://arxiv.org/abs/2411.13775 (accessed on 1 August 2025). [CrossRef]
  17. Yan, J.; Yan, P.; Chen, Y.; Li, J.; Zhu, X.; Zhang, Y. GPT-4 vs. Human Translators: A Comprehensive Evaluation of Translation Quality Across Languages, Domains, and Expertise Levels. arXiv 2024, arXiv:2407.03658. [Google Scholar] [CrossRef]
  18. Khair, M.M.; Sawalha, M. Automated Translation of Islamic Literature Using Large Language Models: Al-Shamela Library Application. In Proceedings of the New Horizons in Computational Linguistics for Religious Texts (Coling-Rel), Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 53–58. [Google Scholar]
  19. Banat, M. Investigating the Linguistic Fingerprint of GPT-4o in Arabic-to-English Translation Using Stylometry. J. Transl. Lang. Stud. 2024, 5, 65–83. [Google Scholar] [CrossRef]
  20. Shahriar, S.; Lund, B.D.; Mannuru, N.R.; Arshad, M.A.; Hayawi, K.; Bevara, R.V.K.; Mannuru, A.; Batool, L. Putting GPT-4o to the Sword: A Comprehensive Evaluation of Language, Vision, Speech, and Multimodal Proficiency. Appl. Sci. 2024, 14, 7782. [Google Scholar] [CrossRef]
  21. Rafailov, R.; Sharma, A.; Mitchell, E.; Ermon, S.; Manning, C.D.; Finn, C. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  22. Wang, Z.; Bi, B.; Huang, C.; Pentyala, S.K.; Zhu, Z.J.; Asur, S.; Cheng, N.C. UNA: Unifying Alignments of RLHF/PPO, DPO and KTO by a Generalized Implicit Reward Function. arXiv 2025, arXiv:2408.15339. Available online: http://arxiv.org/abs/2408.15339 (accessed on 1 August 2025).
Figure 1. Direct approach for extracting answers from Arabic long tables.
Figure 2. Sub-table selection approach for extracting answers from Arabic long tables.
Figure 3. Sub-table selection example.
Figure 4. Knowledge graph table filtering approach for extracting answers from Arabic long tables.
Figure 5. Knowledge graph table filtering example.
Figure 6. Performance degradation with increasing token input (graph size).
Table 1. Experimental results on Arabic WikiTQ.

Approach | EM Accuracy (%)
Direct | 45
Sub-table selection with SQL | 64
Knowledge graph filtering approach | 74
Table 2. Lexical Similarity Metrics Performance.

Metric (score, 0–1) | GPT-4o | DeepSeek-v3.1 | Llama-3.3
EM | 0.74 | 0.73 | 0.64
Normalised EM | 0.75 | 0.74 | 0.70
Token-level F1 Score | 0.77 | 0.77 | 0.75
Table 3. Semantic Similarity Metrics Performance.

Metric (score, 0–1) | GPT-4o | DeepSeek-v3.1 | Llama-3.3
BERTScore F1 | 0.94 | 0.93 | 0.89
Sentence Embedding Cosine Similarity | 0.88 | 0.87 | 0.82
Semantic Accuracy | 0.76 | 0.75 | 0.72