Explainable Bilingual Medical-Question-Answering Model Using Ensemble Learning Technique

Sait, Abdul Rahaman Wahab; Alkhurayyif, Yazeed

doi:10.3390/electronics14204128

Open AccessArticle

Explainable Bilingual Medical-Question-Answering Model Using Ensemble Learning Technique

by

Abdul Rahaman Wahab Sait

^1,*

and

Yazeed Alkhurayyif

²

¹

Department of Archives and Communication, Center of Documentation and Administrative Communication, King Faisal University, P.O. Box 400, Hofuf 31982, Al-Ahsa, Saudi Arabia

²

Department of Computer Science, College of Computer Science, Shaqra University, Shaqra 11961, Riyadh, Saudi Arabia

^*

Author to whom correspondence should be addressed.

Electronics 2025, 14(20), 4128; https://doi.org/10.3390/electronics14204128

Submission received: 15 September 2025 / Revised: 15 October 2025 / Accepted: 20 October 2025 / Published: 21 October 2025

(This article belongs to the Special Issue The Future of AI-Generated Content（AIGC）)

Download

Browse Figures

Versions Notes

Abstract

Accessing reliable medical information is a major challenge for healthcare professionals due to limited accessibility to real-time medical data sources. The study’s objectives are maximization of response accuracy with minimal latency and enhancement of the model’s interpretability. An explainable bilingual medical-question-answering system (MQAS) is introduced to improve accessibility and trust in healthcare information retrieval. Using knowledge-aware networks (KANs), retrieval augmented generation (RAG), and linked open data (LOD), a synthetic bilingual dataset is generated. Through the application of a synthetic dataset and Bayesian optimization HyperBand (BOHB)-based hyperparameter optimization, the performance of GPT-Neo and RoBERTa models is fine-tuned. The outcomes of GPT-Neo and RoBERTa are ensembled using the weighted majority voting approach, while Shapley Additive ExPlanation (SHAP) value provides interpretability and transparency. The proposed model is trained and evaluated using diverse medical-question-answering datasets, demonstrating superior performance over baseline models. It achieves a generalization accuracy of 90.58%, an F1-score of 89.62%, and a BLEU score of 0.80 with a low inference time of 3.4 s per query. The findings underscore the model’s potential in delivering accurate, bilingual, and explainable medical responses. This study establishes a foundation for building multilingual healthcare information systems, promoting inclusive and equitable access to medical information.

Keywords:

large language models; natural language processing; transformers; medical-question-answering models; knowledge-aware networks; context-aware response generation

1. Introduction

Healthcare is one of the critical sectors that significantly influences an individual’s well-being and quality of life, necessitating novel research, therapies, and guidelines [1]. Artificial Intelligence (AI)-driven applications play a crucial role in transforming healthcare services across the globe [2]. Natural language processing (NLP) is the foundation of the medical-question-answering system (MQAS) [2,3,4]. It employs various techniques, including tokenization, part-of-speech tagging, named entity recognition, and syntactic parsing, to provide reliable medical information in response to a user query. MQAS supports patients and healthcare professionals in retrieving personalized responses [5]. Medical queries are complex, requiring a deep understanding of medical terminologies, disease relationships, and treatment methods [5]. The MQAS may face challenges in adequately addressing complex queries. The limited domain knowledge in existing MQASs limits their performance, reliability, and effectiveness in real-time settings. Inadequate or insufficient domain knowledge may influence an MQAS to provide inaccurate or misleading user responses [5]. Consequently, the MQAS may generate inappropriate recommendations on a medical condition, compromising patient safety and worsening health outcomes.

The introduction of transformers has revolutionized the development of MQASs [6]. Transformers leverage self-attention mechanisms to handle complex data. The parallelized processing feature enables the transformers to process long text sequences simultaneously [7]. It reduces the training time of transformers in learning complex patterns from extensive datasets, leading to improved performance. There is significant attention on large language models (LLMs) for analyzing and generating medical responses to a user query [8]. Transformers determine the significance of each term in a user query, enabling the development of practical NLP applications [9]. Transformers, including the Generative Pre-trained Transformer (GPT), Bidirectional Encoder Representations from Transformers (BERT), and T5, have demonstrated remarkable capabilities in addressing the limitations of traditional QASs [9]. For instance, ChatGPT (GPT-4) generates human-like text responses for a user query. However, these models lack the specialized domain knowledge to develop an appropriate response to a medical query [10,11]. An MQAS with static knowledge may not respond rapidly to these changes, preventing users from accessing the latest medical advances. The lack of adaptability may affect the timely and accurate responses, which are crucial for effective patient care [12]. A large number of existing QASs rely on statistical models or superficial text matching to offer responses. These systems may overlook significant information or context in medical queries. Domain-specific knowledge, including medical ontologies and structured knowledge graphs, is crucial for a QAS to comprehend the complex relationships between disease, symptoms, and treatment [13,14,15]. The lack of interpretability in offering insights into the QAS responses may reduce trust and confidence, which are essential in the healthcare environment [15].

Advanced techniques, including knowledge-aware networks (KANs), synthetic datasets, and linked open data (LOD), can overcome the limitations of existing transformers and improve the MQAS response accuracy [16]. The integration of LOD is essential. With the assistance of LOD, an MQAS can access a frequently updated source of medical information, enabling it to produce reliable responses to user queries [16]. It can guarantee that its responses reflect the latest advances in medical knowledge by establishing access to several databases, scholarly articles, and medical guidelines. By integrating context and background information, LOD improves the system’s ability to respond to complicated queries [17].

LLMs transform NLP techniques, leading to the development of highly effective MQASs [18]. These models utilize massive amounts of textual data to provide contextually appropriate and syntactically reliable responses. Using subtle language patterns, LLMs can translate complicated medical queries and generate context-aware responses [18]. Retrieval-augmented generation (RAG) enhances LLM-based systems with an additional layer of functionality [18]. RAG supplements these models with external information to guarantee syntactically and factually correct responses. Real-time access to the recent medical literature improves the trustworthiness of MQAS responses [18,19]. RAG-enhanced LLMs can generate linguistically accurate and medically sound responses in bilingual QAS systems, providing solutions for low-resource languages, such as Arabic [19]. The integration of LLMs and RAG allows bilingual MQASs to generate medical information [19]. User satisfaction and trust can be improved by providing accurate, relevant, and personalized responses [20]. Recent studies highlighted the performance of LLMs in answering medical queries across various domains and cultural contexts. Chen et al. [21] evaluate ChatGPT and Bard on medical licensing examinations across diverse regions, indicating significant variations in accuracy and reasoning depth. Their findings indicate linguistic nuance and contextual bias as persistent challenges in multilingual medical examination. Chau et al. [22] examine the performance of generative AI models in dental licensing examinations. The generative AI models struggle with clinically applied reasoning and terminology specific to the regional dental curriculum. These studies exhibit the importance of culturally adapted, domain-specific datasets and evaluation protocols for developing MQASs.

The development of bilingual MQASs holds great potential to improve global healthcare accessibility [23]. In a bilingual setting, English-based medical QASs may not elicit meaningful responses from non-English-speaking users [24]. In Middle Eastern countries, English-based MQASs are ineffective in meeting the demands of healthcare professionals and patients who primarily communicate in Arabic [25]. The limited accessibility and understanding of medical responses may affect Arabic-speaking individuals. English-based medical QASs face challenges in handling cultural and linguistic differences, resulting in the generation of contextually inaccurate responses [26]. The critical medical insights are associated with regional practices and health concerns [25,26]. As a result, English-based MQASs may not provide adequate services to Arabic-speaking individuals. These limitations underscore the need to develop a robust, contextually aware, and interpretable bilingual medical QAS, providing trustworthy, accurate, and culturally appropriate responses. Thus, we develop an interpretable bilingual MQAS to accurately and effectively respond to user queries. The study’s novelty lies in its synergistic integration of generative, discriminative, and knowledge-aware paradigms. The proposed model establishes a retrieval-conditioned and knowledge-grounded reasoning through the harmonization of GPT-Neo’s linguistic fluency and RoBERTa’s factual discrimination. KANs-mediated semantic alignment and RAG-based evidence retrieval improve the model’s ability to deliver reliable bilingual medical responses. The proposed approach enables intermediate and trustworthy cross-lingual medical information access, reducing hallucination and promoting culturally aligned, evidence-based responsegned, evidence-based repsoting culturally aligned, evidence-based repsoion access, ccess to medical informations. Furthermore, the SHAP-based explainability supports clinical auditability. The study contributions are listed as follows:

An enhanced query optimization technique using KANs and a linked open data (LOD) model to improve the MQAS response accuracy.

We employ query optimization techniques, including semantic parsing, language detection, and query reformation, to retrieve the most relevant and contextually appropriate information.

2.: Domain-specific data enrichment using RAG.

We integrate the GPT-Neo and RoBERTa models into a single framework to generate responses in Arabic and English. RAG enhances the model’s understanding of complex medical terminology and contextual relationships. It enhances the bilingual performance of the GPT-Neo and RoBERTs models.

3.: Ensemble-learning-based bilingual MQAS.

We use the weighted majority voting approach to ensemble the responses of the GPT-Neo and RoBERTa models. They enhance the model’s performance by integrating KANs and SHAP values. This improvement enables the proposed model to present a foundation for interpretable and trustworthy AI in healthcare.

The study is organized as follows: Section 2 presents the features and limitations of the existing MQAS. The proposed methodology for developing the bilingual MQAS is outlined in Section 3. Section 4 highlights the findings of the experimental validation. The study’s implications and limitations are presented in Section 5. Section 6 concludes the study with its future directions.

2. Related Works

Many technological advancements have been established in MQASs, transforming the healthcare domain. Traditional MQASs, including rule-based systems, employ rules and patterns to respond to user inquiries [27]. Medical ontologies are utilized to retrieve appropriate information. These systems are effective for straightforward and well-structured queries. However, they struggle with the complexities of natural language, ambiguity, and medical queries. In addition, rule-based systems require continuous manual updates to incorporate current healthcare data, which can lower their performance in real-time settings [27]. Recently, machine learning techniques have been widely applied in healthcare to provide reliable services to individuals. These techniques generate responses using a supervised learning approach based on labeled medical datasets. Machine learning models outperformed rule-based systems in handling complex and diverse queries due to their larger datasets and access to high computational resources. Despite these advancements, recent MQASs encounter challenges in comprehending context and addressing queries that comprise multiple levels of interpretation.

Yasunaga et al. (2021) [28] propose a GNN-based QAS. They address the limitations of traditional GNN models with relevance scoring and joint reasoning strategies. By leveraging RoBERTa and structured graph nodes, the model encodes text. An iterative message passing mechanism supports it to learn linguistic and relational representations. The findings of the performance evaluation (accuracy: 82.14%) reveal the importance of graph-based reasoning in generating meaningful responses. Zhang et al. (2022) [29] employ a graphical neural network (GNN) for QAS development, achieving an accuracy of 84.7%. A cross-modal fusion component provides information from the knowledge graph to the language representation. The transformer is employed to update tokens and node embedding. An integrated GNN and transformer improve the model’s contextual understanding. The concept of co-training is applied to enable graph context to inform token-level embedding. The integration of joint gradient updates assists the model to outperform graph–text fusion systems. This architecture motivates the introduction of the proposed KANs module, bridging LOD with RAG-retrieved textual evidence. Alzubi et al. (2023) [30] develop a COVID-19 QAS using BERT. This model obtained an overall accuracy of 58.91%. A retriever–reader dual algorithmic system is used to answer the complex queries. A TF-IDF vectorizer is used to capture the top 15 documents with optimal scores. A preprocessing technique is applied to fine-tune the model’s performance. Attention visualization is used for the model’s interpretability. The application of contextual embedding supports the model in achieving an F1-score of 0.89. However, the training and inference were limited to the English language.

Singhal et al. (2025) [31] proposed an LLMs-based MQAS. Instruction tuning and reinforcement learning from human feedback are used to align outputs with professional medical standards. The findings demonstrate that the MultiMedQA benchmark shows near-clinician accuracy and improved factual grounding. The model is generalized on multiple clinical topic datasets. Wang et al. (2023) [32] introduce augmenting black-box LLMs with medical textbooks in order to generate a response. The retrieval-augmented pipeline constructs a dense retriever of textbook passages using user queries. Using this strategy, the factual grounding of this model is improved, achieving a significant improvement in responding to a query. The model was evaluated on three open-domain datasets. However, the model is monolingual and non-interpretable. Van Sonsbeek et al. (2023) [33] propose a medical visual QAS using pre-trained language models, achieving an accuracy of 84.3%. They map the visual features to a set of learnable tokens. Using a frozen GPT-2 decoder, image embeddings are fused with text prompts. The parameter-efficient fine-tuning strategy optimized the model’s parameters to achieve an optimal outcome. The study’s findings emphasize multimodal reasoning that leads us to introduce bilingual reasoning through pre-trained models. Lin et al. (2023) [34] build a visual MQAS using a pre-trained vision encoder. They conduct internal and external validations using public repositories. They categorize MQAS into image–text attention, graph-fusion transformers, and multimodal LLMs. The challenges in interpretability and dataset imbalance are highlighted, underscoring the importance of enriching the dataset with structured medical information. Similarly, Kamel et al. (2023) [35] develop a visual Arabic QAS using LLM techniques, achieving a lower accuracy of 57.62%. They use a question-tokenization approach and three-word embedding algorithms. Transformer backbones are integrated with the Arabic-BERT encoder. However, it lacks cross-lingual grounding or retrieval augmentation.

Sengupta et al. (2023) [36] build an MQAS using the GPT-2 model. They enhance the model’s performance using a joint artificial intelligence system (jais) tokenizer with Adam optimization. Morphological richness and right-to-left tokenization support the model to produce reliable responses. However, the model obtains an accuracy of 28.8%. Compared to the existing multilingual baseline models, the Arabic-centric model demonstrates significant improvements. However, the model remains general-purpose. Jiang et al. (2023) [37] introduce a model, Mistral 7B, using sliding windows and grouped-query attention. The grouped-query attention and sliding window mechanisms are used to reduce inference latency. This model lacks domain adaptation and explainability, restricting its application to specific domains (accuracy: 40.6%). Zeming et al. (2023) [38] propose an MQAS using NVIDIA’s Megatron-LM distributed trainer. Multi-stage instruction tuning is employed to acquire clinical reasoning. A set of 4.6 trillion tokens from PubMed and clinical guidelines is used to train the model, achieving an accuracy of 60.7%. Yu et al. (2024) [39] proposed a QAS model for the healthcare domain. They generated a dataset [40] to train and validate the QAS model. Multiple transformers were used to generate an answer. The models, including Sentence-T5 and Mistral 7B, are integrated to enhance the model’s generative capabilities (precision of 76.2%). The synthetic minority over-sampling technique is employed to handle class imbalance. Preri et al. (2024) [41] introduce a bilingual medical QAS. A bilingual medical mixture of expert LLMs is used to respond to user queries with an accuracy of 47.3%. Bilingual corpora and domain alignment are used to train the model. However, there is a lack of approaches to address limited interpretability and hallucination. The parameter-efficient fine-tuning (PEFT) technique is used to optimize the hyperparameters. Wu et al. (2024) [42] develop an MQAS using a general-purpose foundation language model, achieving an accuracy of 47.2%. The factual grounding and domain understanding are improved through the fine-tuning process using the PubMed corpora. Table 1 outlines the features and limitations of the existing models.

Visual QAS encounters challenges in managing the complex interdependencies between visual and textual medical information. It relies on image–text correlation. The lack of a comprehensive understanding of text-based context may lead to inaccurate responses. The limited Arabic visual datasets reduce the performance of visual QAS. GNN faces challenges in capturing long-range contextual information. Due to the limitations in handling language-specific requirements, GNN-based QASs may not produce better performance. Open-domain QASs retrieve and synthesize data from multiple sources in order to provide a proper response. They may misinterpret user intent, leading to irrelevant outcomes. The lack of reliability in the numerous sources may result in biased or incorrect responses. Multiple-choice QASs depend on multiple-choice patterns. They retrieve long or comprehensive responses rather than understanding the context. A significant limitation of multiple-choice QASs is the existence of distractor alternatives, misleading users with inappropriate responses. The existing models have difficulties in interpreting complex queries with limited resources. These limitations motivate us to develop a contextually grounded knowledge architecture to address the limitations of existing MQASs, handling the complexities in bilingual queries. Employing ensemble learning techniques with advanced evaluation metrics supports the model’s reliability and interpretability. The combination of high-performance NLP with domain-aware reasoning overcomes the shortcomings of existing MQASs. This strategy promotes future scalability across languages and specialties, leading to advancements in patient-centered healthcare technologies.

3. Materials and Methods

GPT-Neo can generate fluent and contextually rich responses for open-ended medical queries [43]. RoBERTa excels in handling the structure and context of a query [44]. It is commonly applied for query classification, context extraction, and ranking. The development of the proposed MQAS is motivated by the complementary strengths of GPT-Neo and RoBERTa in processing user queries and generating optimal medical responses in English and Arabic. Fine-tuning and facilitating external knowledge through RAG can improve the ability of these transformers to generate contextually accurate responses for bilingual queries. To improve the proposed model’s robustness and generalization, we employ an ensemble learning approach. The approach ensembles the outcomes of GPT-Neo and RoBERTa using a voting approach. In addition, it applies SHAP values to facilitate the model interpretability. In addition, value-aware quantization is used to minimize the computational resources [45].

Figure 1 shows the proposed approach for developing bilingual MQAS. It highlights the process of integrating GPT-Neo and RoBERTa within a unified hybrid framework, leveraging the complementary strengths of generative and discriminative modeling. GPT-Neo functions as the generative reasoning engine, while RoBERTa serves as the discriminative verifier. Using this approach, factuality and linguistic coherence are evaluated. A user query is optimized and enriched using KANs and LOD, yielding a semantically expanded query representation. The introduction of a synthetic enrichment layer augments the training corpus with diverse, medically valid QA pairs in Arabic and English. The enrichment addresses data sparsity in underrepresented categories, improving the model’s generalization. RAG + T5 generates Arabic and English QA pairs from biomedical corpora and verified knowledge sources, producing linguistically natural and semantically diverse QA pairs. The enriched corpus is used to train the model. During inference, RAG retrieves top-k evidence documents corresponding to the user query, guiding GPT-Neo’s generation of candidate responses. These outputs are re-ranked by the RoBERTa model. The final response is selected using a weighted ensemble approach. SHAP-based interpretability reveals the linguistic and biomedical features that influence the model’s decision.

3.1. Data Acquisition

We utilize four public datasets in order to train and test the proposed model. The datasets used in this study are carefully selected and curated from reputable public repositories and trusted medical sources. Dataset 1 is an Arabic healthcare dataset containing 808,000 questions and answers across 90 categories. It offers a substantial sample size to capture the linguistic and conceptual features required for the model to understand the complex terms in Arabic. It is available in a public repository [46]. Dataset 2 [47] contains 87,930 Arabic medical questions and answers. It is divided into training, validation, and test sets. It enables structured performance assessment. Dataset 3 [40] includes 247,000 English QA pairs collected from medical experts. The high clinical reliability and coverage of domain-specific terminology enhance the model’s interpretability. Dataset 4 [48] is a specialized dataset covering 47,000 English questions and answers across 12 categories. It is collected from trusted medical websites. It provides high-quality and reliable medical questions and answers. It supports the proposed model to deliver reliable responses. We utilize datasets 1 and 3 for synthetic data generation. We split the datasets into a training set (80%) and a test set (20%). The large dataset sizes provide a robust foundation for training GPT-Neo and RoBERTa in enriching datasets 1 and 3. The balanced coverage across medical categories improves the proposed model’s stability, bilingual capability, and medical relevance. The dataset providers have indicated that the data were acquired through appropriate methods, including informed consent from the participants. They have anonymized the datasets, protecting sensitive user data privacy. As a result, Institutional Review Board approval or additional ethical clearance are not essential for this study. The use of public repositories is in line with the ethical and legal frameworks defined by the original dataset providers.

For the purpose of assessing the model’s generalization and preventing overfitting, we utilize 20% of datasets 2 and 4. This strategy determines the model’s generalization ability on unseen data. We limit the evaluation set to 20% to ensure statistically meaningful generalization testing. The 20% test ratio provides sufficient statistical power (

>

95% confidence) for F1-score, bilingual evaluation understudy (BLEU), and recall-oriented understudy for gisting evaluation (ROUGE), reflecting a trade-off between representativeness and model’s generalization efficiency. In addition, it aligns with established NLP evaluation standards and ensures that Arabic and English models are tested on unseen and balanced subsets. A stratified random sampling technique is used for the selection process. Using this technique, the original class and category distribution are preserved. In order to guarantee extensive and comprehensive coverage of a wide range of medical conditions, each subset is designed to reflect a proportionate representation of all question categories. This strategy enables a fair comparison of Arabic-only, English-only, and combined bilingual MQAS models. It provides a dependable and representative foundation for quantitative and qualitative performance evaluation.

3.2. Synthetic Dataset

To improve the performance of GPT-Neo and RoBERTa, data enrichment is essential. RAG enriches the train set (Datasets 1 and 3) to enhance the performance of the proposed model. It involves a structured approach that integrates the retrieval and generating processes. The knowledge bases, including PubMed [49] for English and e-Marefa [50] for Arabic, are used in this study. PubMed is a comprehensive repository of the biomedical literature. It provides rich contextual information for English queries. e-Marefa is a specialized resource for Arabic terms. A dense passage retrieval system is used to index the two knowledge bases. Algorithm 1 outlines the procedure for generating the synthetic QA (enriched datasets 1 and 3) dataset. For instance, the query “What is the treatment for hypertension?”, retrieves relevant documents from PubMed, identifies the entity “hypertension” through KANs, and maps it to DBPedia and MeSH. The T5 model [51] generates a response such as “Antihypertensive therapy with inhibitors or lifestyle modification”. Furthermore, the bilingual experts validate the generated QA pairs before including them in the final dataset.

Algorithm 1: Synthetic dataset generation

Input: Query Q (Arabic or English)
Output: Synthetic QA pair (Q, A_syn)
1. Language Detection:
if Q ∈ Arabic → L ← “AR”
else L ← “EN”
2. Template Generation:
Retrieve relevant question templates T_L
based on query intent (diagnosis, treatment, prevention)
3. Document Retrieval (RAG step):
if L = "EN" → Search PubMed/Medline
if L = “AR” → Search e-Marefa/AraBase
Retrieve top-k documents D = {d1, d2, …, dk}
4. Entity and Context Enrichment:
Use KAN to extract medical entities E from D
Map E ↔ Linked Open Data (DBpedia, MeSH, AraWiki)
Append semantically aligned terms to query context
5. Synthetic Response Generation:
Feed (Q + enriched context) into fine-tuned T5 model
A_syn ← T5(Q, context)
6. Expert Validation:
Two bilingual medical experts review A_syn
if disagreement → resolve via consensus (third expert)
7. Finalization:
Store (Q, A_syn) into bilingual QA dataset

We developed question templates covering medical queries in Arabic and English. The templates, including “What is [X]?”, “How does [Y] work?”, “ما هو [X]?” (What is [X]?) or “كيف يعمل [Y]?” (How does [Y] work?) are employed. These templates offer a fundamental structure for developing synthetic queries associated with the medical domain. By generating a diverse question–answer pair across multiple medical domains, the proposed synthetic dataset generation improves the MQAS’s robustness. We follow the common structure of question templates in existing templates, frequently asked questions in health organizations, and user forums. They categorize the templates based on medical specialties, enabling the model to reflect realistic and domain-relevant inquiries. Using publicly available ontologies, each template is instantiated with medical terms, conditions, treatments, and symptoms. In order to maintain linguistic balance, we pair each question with an equivalent Arabic version using the combination of expert-crafted translations and back-translation techniques. Grammatical accuracy and cultural relevance are maintained for each template to ensure that the synthetically generated questions mimic natural linguistic patterns. This strategy enhances the model’s generalizability and guarantees fair performance across diverse user queries. User queries are standardized using these templates, facilitating management and categorization of the generated data. By leveraging these approaches, we employ RAG to retrieve relevant documents for each template using PubMed and e-Marefa repositories. In addition, a retrieval model is used to analyze the templates and rank documents based on semantic similarity to the query. Equation (1) shows the mathematical expression of the retrieval process.

D_{k} = R e t r e i v e_t o p_K_d o c u m e n t s (q, K B)

(1)

where

D_{k}

is the top K documents retrieved from the knowledge base (KB) for a query (q).

In order to generate responses from the retrieved top k documents, we employ the T5 [51] transformer. The potential of the T5 generator synthesizes coherent and contextually appropriate responses. Equation (2) presents the response generation for the Arabic and English queries.

R = P ((r| q), D_{k})

(2)

where R is a response, and

P ((r| q), D_{k})

represents that the generation of the response (r) is conditioned on the query (q) and the retrieved documents (

D_{k}

).

A language detection module is used to identify English and Arabic queries. We use a lightweight model, FastTest [52], to detect the query’s language. The model uses character n-grams and text embeddings for language identification. Based on the query, the system utilizes either the PubMed or e-Marefa knowledge bases. In the context of bilingual MQAS development, the quality, relevance, and authenticity of training data are crucial.

To ensure consistency and reliability in the expert validation phase, a structured calibration and an inter-annotator agreement assessment protocol is implemented. The calibration phase comprises two bilingual experts and a medical language specialist (mediator). Two medical experts are actively engaged to ensure the clinical and linguistic validity of the synthetic dataset. They guide the process of answer pair generation, reflecting common real-world patient queries, terminology, and phrasing that align with clinical practice. Their bilingual medical expertise supports us in verifying cultural sensitivities and local healthcare practices, strengthening the dataset’s applicability to Arabic-speaking populations. Using the calibration phase, an initial random sample of 500 Arabic and 500 English QA pairs is reviewed. Calibration outcomes are consolidated into a shared annotation guide, ensuring clinical appropriateness and linguistic clarity. The inter-annotator agreement is quantified through Cohen’s kappa (k) for pairwise evaluations. As per Landis and Koch’s interpretation scale, a kappa coefficient of 0.86 is considered perfect agreement. A structured consensus-building approach is employed in order to address the disagreements or differences between the two experts. We isolate the items related to disagreements for discussion. Each expert is requested to independently annotate the rationale behind his or her recommendations. We facilitate a moderated session with a third expert to review the recommendations, enabling transparent discussion based on clinical standards and mutual understanding. In addition, a resolution protocol is applied to cross-validate the disputed items against medical guidelines. Using this multi-layered review protocol, the credibility and trustworthiness of the final synthetic dataset is strengthened.

Using authoritative medical repositories and consensus discussion, the resolution protocol ensures the clinical fidelity and linguistic equivalence of the bilingual QA pairs. To guarantee cross-lingual consistency between Arabic and English QA pairs, we employ a back-translation validation protocol using the MarianMT and mBART50 models. The original and translated versions are compared using cosine similarity over sentence embedding and reviewed by bilingual experts. Using this process, we ensure the preservation of linguistic nuances, idiomatic expressions, and medical terminologies, preventing potential misalignment between Arabic and English QA pairs during model training and evaluation. With the systematic validation, the credibility of the synthetic dataset is strengthened, and the reproducibility of the proposed MQAS pipeline is supported.

Normalization and control strategies are used to ensure the reliability of the cross-dataset comparisons. Datasets are standardized to comparable topic distributions, covering overlapping medical domains and preventive medicine. We exclude the queries with domain mismatches or ambiguous semantic categories through expert validation, maintaining a consistent conceptual scope. The sampling strategy of datasets 1 and 3 for evaluation guarantees the representation of similar semantic and complexity levels within its linguistic domains. The quantitative comparisons are interpreted within datasets after normalizing scoring metrics, reflecting genuine model-level generalization rather than dataset-driven variability. The semantic parallelism and controlled sampling of the dataset indicate that the datasets are sufficiently comparable for bilingual evaluation.

In Table 2, the findings of the expert validation are outlined. Although the primary training set (datasets 1 and 3) collectively contains over one million QA pairs, only 49,000 English and 48,000 Arabic QA pairs are utilized for training, ensuring data quality, bilingual balance, and medical relevance. This approach prevents redundancy and linguistic noise present in the raw corpora, prioritizing data quality over quantity. The low modification ratio of 14.8% and high Cohen’s k of 0.86 indicate the quality of the synthetically generated QA pairs and reliability between independent annotators.

Table 3 outlines the importance of features of the synthetic datasets, contributing to the model’s outcomes. The SHAP value represents a numeric value indicating the role of each feature in generated responses. It provides a positive or negative numeric value. A positive values indicate the positive influence of a feature, whereas a negative value highlights the feature’s negative contribution.

3.3. Query Optimization

The query optimization process improves the accuracy and relevance of the proposed bilingual MQAS through multiple preprocessing techniques. The techniques, including tokenization, stemming, stop-word removal, and lemmatization, are used to optimize the queries. Equation (3) shows the mathematical form of query optimization.

Q_{o p t} = T o k e n i z e (S t e m (R e m o v e S t o p w o r d s (l e m m a t i z e (Q))))

(3)

where

Q

is the actual query, and

Q_{o p t}

is the optimized query.

Furthermore, we employ KANs and LOD to enhance the query. The query enhancement allows users to use prompts and complex queries to retrieve a response. KANs can interpret the user query by identifying their relationship without explicit named entity recognition. It can map query terms to entities in the knowledge bases. LOD is a repository of linked datasets. It facilitates data interoperability by integrating diverse datasets. We apply synonym enrichment using KANs to expand the medical terms in the query. The enrichment allows the proposed model to capture broader information. Equation (4) highlights the query enrichment process.

Q_{e n r i c h e d} = Q_{o p t} \cup S (x)

(4)

where

Q_{e n r i c h e d}

is the enriched query, and

S (x)

is the set of synonyms for term (x).

The integration of LOD and KANs enhances the proposed MQAS’s ability to interpret and respond to user queries. For instance, a query related to “diabetes” can be enriched with related medical terms, “insulin resistance” or “glucose monitoring”. Equation (5) outlines the query enrichment process using KANs and LOD. We employ DBPedia for English and AraWiki for Arabic queries, respectively.

Q_{f i n a l} = K A N (Q_{e n r i c h e d} \cup L (Q_{o p t}))

(5)

where

L (Q_{o p t})

is the set of related terms from the LOD sources, and KAN is the KANs model for the comprehensive interpretation of the query.

To illustrate the query optimization process, consider a query, “What are the treatment options for controlling high blood sugar?”. The preprocessing step transforms the query into a standardized query representation: treat, option, control, blood, and sugar. Through semantic mapping, KANs identify contextual equivalents such as “therapy”, “management”, and “medication” for the term “treat” and associate “glucose” with “insulin” and “blood sugar”. The enriched query terms, i.e., treat, therapy, management, glucose, insulin, diabetes, and medication, are extended through LOD, resulting in the final optimized query terms: treat, therapy, diabetes, insulin, metformin, HbA1C, and insulin resistance. This richer semantic structure enhances the retrieval process within the RAG framework.

3.4. GPT-Neo and RoBERTa-Based Response Generation

In this section, we present the methodologies for response generation using the enhanced GPT-Neo and RoBERTa models. They highlight the recommended approach to enhance the overall effectiveness and reliability of the proposed MQAS in generating appropriate responses for English and Arabic queries.

GPT-Neo is an open-source alternative to OpenAI’s GPT-3. It has been trained on diverse datasets to comprehend and compose cohesive descriptions, conversations, and generative human-like narratives on a variety of themes. It is widely used for developing chatbots and producing content. It guarantees smooth interactions by handling multi-turn discussions. However, it may generate factually incorrect or nonsensical information and lacks fine-grained control over the generated results. The unique grammatical structure may limit GPT-Neo’s capability in tokenizing Arabic text. These limitations restrict the potential of GPT-Neo to produce precise outcomes for context-sensitive information.

RoBERTa is an enhanced version of the BERT model. It is used to understand and analyze the specific text. It demonstrates reliable performance on a wide range of NLP tasks, including text classification, sentiment analysis, and named entity recognition. A masked language modeling approach is used to capture intricate word–context relationships. RoBERTa demands high computational costs during training and inference. The effectiveness of the RoBERTa model depends on the selection of hyperparameters.

To overcome the limitations of GPT-Neo and RoBERTa, we employ domain-specific data (synthetic datasets (enriched datasets 1 and 3)), confidence-based dynamic weighting, and fine-tuning using BOHB optimization. Using these strategies, we improve the generative capability of GPT-Neo and the context understanding of RoBERTa. The objective of training the GPT-Neo and RoBERTa models is to reduce the loss function for the synthetic dataset.

Let

S = \{s_{1}, s_{2}, \dots s_{N}\}

be the synthetic dataset, where s is a sequence of tokens. Let

T = \{t_{1}, t_{2}, \dots t_{N}\}

be the corresponding target tokens for the individual sequence. GPT-Neo and RoBERTa predict the subsequent token in a sequence. These models compute the probability distribution, P

(t_{i}| s_{i}, θ)

, for each token, (

t_{i})

, where

θ

is the model’s parameters. Equation (6) presents the computation of a SoftMax function for each token over the logits (

z_{i}) .

P (t_{i}| s_{i}, θ) = \frac{e x p (z_{i}, t_{i})}{\sum_{j = 1}^{|v|} e x p (z_{i}, j)}

(6)

where

|v|

is the vocabulary, and

z_{i}, t_{i}

is the output of the model prior to the SoftMax for the token (

t_{i}

).

A cross-entropy loss function L for GPT-Neo and RoBERTa is used to measure the difference between the predicted and the true classes. Equation (7) shows the mathematical form of the loss function.

L (θ^{°}, S, T) = - \sum_{i = 1}^{N} \log P (t_{i}| s_{i}, θ)

(7)

where N is the number of input–output pairs, and

P (t_{i}| s_{i}, θ)

is the predicted probability for the true class (

t_{i}) .

Equation (8) outlines the back propagation and optimization processes through the computation of gradients of the loss function.

θ^{(t + 1)} = θ^{t} - α \nabla_{θ} L (θ^{°}, S, T)

(8)

where

θ^{t}

is the model’s parameter at iteration t,

α

is the learning rate, and

\nabla_{θ} L (θ^{°}, S, T)

is the gradients of the loss function.

3.5. BOHB-Based Hyperparameter Optimization

In order to fine-tune the parameters of the transformers, we employ the BOHB algorithm. Compared to conventional techniques, such as grid search, random search, and standalone Bayesian optimization, BOHB offers superior efficiency, scalability, and adaptability in large, high-dimensional search spaces. Bayesian models guide the search operation, and HyperBand discards underperforming configurations without exhaustively exploring every possibility. During the fine-tuning process using the synthetic bilingual datasets, BOHB optimizes the hyperparameters. The parameters, including learning rate, batch size, number of epochs, and weight decay, are optimized to achieve exceptional accuracy. Additionally, BOHB manages the knowledge-injection parameters used in the KANs module, including the degree of graph attention and knowledge integration weights. These benefits render BOHB a practical and strategically superior choice for complex, bilingual, and knowledge-enriched NLP applications.

GPT-Neo and RoBERTa cover a traditional gradient-based optimization. However, the recommended BOHB guides the fine-tuning process, leading to obtaining accurate outcomes with limited computational resources. Equation (9) outlines the fine-tuning process using the synthetic bilingual dataset. The optimizations are initialized by balancing exploration and exploitation.

H_{o p t} = \arg {}_{H \in H_{s}}^{m i n}{E_{θ ~ H} [L (θ, D_{s y n t h e t i c}]}

(9)

where

H_{s}

is the hyperparameter space,

E

represents the expectation over the loss function (

L

), and

H_{o p t}

is the optimal set of hyperparameters. A random selection of hyperparameters with multiple trials is used to identify the key parameters. We apply quantization to reduce computational costs. This maintains the trade-off between memory usage and acceptable accuracy levels, making the model suitable for deployment in low-resource environments. Equation (10) outlines the quantization process.

\hat{θ} = Q u a n t i z e (θ, q)

(10)

where

q

is the precision level,

θ

is the model weights, and

\hat{θ}

is the quantized model weights.

3.6. Ensemble-Learning-Based Response Generation

To facilitate the ensemble-learning-based outcome generation and model interpretability, we compute the confidence score using the SoftMax probabilities. Equations (11) and (12) present the confidence scores for predicting outputs.

P_{G} (C) = \frac{e^{z_{c}}}{\sum_{j = 1}^{c} e^{z_{j}}}; P_{R} (C) = \frac{e^{z_{c}}}{\sum_{j = 1}^{c} e^{z_{j}}}

(11)

w_{G} (q) = \frac{P_{G} (q)}{\sum_{j = 1}^{N} p_{i} (q)}; w_{R} (q) = \frac{P_{G} (q)}{\sum_{j = 1}^{N} p_{i} (q)}

(12)

where

z_{c}

is the logit for class C, c is the possible answer classes,

w_{G} (q)

and

w_{R} (q)

are the model weights for query (q), and

P_{G} (C)

and

P_{R} (C)

are the confidence scores for the query (q).

We employ the weighted majority voting mechanism to combine the GPT-Neo and RoBERTa models’ outcomes based on the weights and confidence scores. This approach uses the transformers’ unique strengths, contributing to producing reliable results for each query. Each model’s vote is scaled by a weight based on its performance and reliability. The confidence-based dynamic weighting mechanisms maintain the individual model’s contribution based on its performance. This enables the proposed model to ensure that models operating with lower confidence are assigned with lower weights. Equation (13) computes the outcome using the ensemble-learning approach.

P_{f} (C) = \sum_{i = 1}^{N} W_{i} \cdot P_{i}

(13)

where

P_{f} (C)

is the final response for class C,

P_{i}

is the confidence score, and

W_{i}

is the weight associated with ith model with GPT-Neo and RoBERTa.

3.7. Performance Evaluation

We employ the key evaluation metrics to determine the proposed model’s effectiveness and reliability. This uses a top-k document retrieval strategy for the performance evaluation. This strategy focuses on selecting the top-k relevant documents. The potential of the model in understanding and matching query context can be determined using the top-k document retrieval strategy. The standard evaluation metrics, including accuracy, precision, recall, and F1-score, are used to evaluate the model’s performance. Accuracy guarantees the model’s ability to retrieve accurate responses to user queries. The precision value ensures that the model generates less incorrect medical information. Recall measures the potential of the proposed model in extracting relevant responses. F1-score provides a comprehensive view of the performance of the proposed MQAS by balancing precision and recall. In order to evaluate the bilingual response generation capability, we use the BLEU metric. BLEU evaluates the proximity of the model’s outcomes to the actual responses. ROUGE assesses the extent to which the model’s response reflects the exact response. Mean reciprocal rank (MRR) evaluates the consistency of the model in returning appropriate responses at the top of the list. In addition, we compute the average number of floating-point operations per second (FLOPs) to identify the impact of quantization on the proposed model. The statistical evaluation metrics are used to ensure the model’s reliability in responding to real-time medical queries. These metrics provide insights into the MQAS’s performance and guide the iterative process of model refinement. We identify areas of improvement, validate enhancements, and guarantee the accuracy and efficiency of the final system in the bilingual healthcare environment.

4. Results

We implement the proposed MQAS using Windows 11 Pro, Intel (Santa Clara, CA, USA) Xeon Sliver-4309Y@ 3.6 GHz, and NVIDIA (Santa Clara, CA, USA) H100 GPU. PyTorch v2.3.0 and Huggingface frameworks are used for building the T5 V1.1, GPT-Neo (v2.7B), and RoBERTa models. The libraries, including Sentencepiece v0.2.1, Elasticsearch V9.0. (for document retrieval using RAG), NumPy v2.3.4, SpaCy V2.0, CAMeL tools v1.5.7, PyArabic v0.6.15, and Pandas v2.3.1, are used in the model implementation. Table 4 outlines the computational configurations for developing the proposed MQAS.

Figure 2 offers insights into the training phase using the enriched training sets (datasets 1 and 3). The progressive improvement in the training accuracy indicates the model’s ability to capture the intricate patterns in the dataset. The decline in the training loss shows the error minimization in the model’s prediction. It indicates the convergence of the proposed model, suggesting an optimal point to end the training and avoid overfitting. The exceptional performance during the training phase reflects the effectiveness of the model in handling the complexities in the Arabic and English datasets, leading to better performance on novel data.

Figure 3 highlights the model’s performance on the test set of dataset 1. It illustrates the trade-offs between accuracy and recall at varying document retrieval depths. For instance, at K = 5, the model achieves the highest accuracy of 94.5% with a slightly lower recall of 91.9%. This indicates that the proposed model produced precise predictions on fewer documents. The outcomes for top K = 10 and K = 20 reveal that the model maintained high accuracy with enhanced recall and F1-scores, achieving comprehensive and reliable answers.

Figure 4 reveals the exceptional outcome of the proposed MQAS on the test set of dataset 3. It demonstrates the ability of the model to prioritize relevant documents and generate a precise outcome for the user. The model’s robust design, comprising advanced query optimization, RAG-T5 integration, and a fine-tuning approach, enhanced its understanding of the context of a user query. The findings of the top-k strategy highlight the model’s attention on small and high-quality subsets of data. It indicates that the model retrieved and ranked the relevant documents. In addition, the weighted majority voting approach supported the model in delivering an optimal response. The integration of KANs and LOD enables the model to comprehend the complex query and retrieve meaningful response.

Table 5 outlines the findings of generalization. The model achieved a remarkable accuracy with optimal BLEU and ROUGE scores. These scores represent the model’s ability to generate responses with high linguistic quality. The outcomes suggest that the model’s architecture maintains effective generalization on datasets 2 and 4, supporting robust and quality output. The recommended BOHB-based hyperparameter optimization plays a crucial role in enhancing the model’s information retrieval capabilities. The significant improvement in recall and F1-score can be attributed to RAG’s ability to retrieve external documents to add context and depth to the generated responses.

Table 6 presents an overview of the computational efficiency of the proposed MQAS. The low inference time signifies a prompt response time, suitable for healthcare applications. The enriched datasets 1 and 3 contain semantically expanded and normalized QA pairs, enabling the model to generate contextually richer and meaningful responses to the user queries. Although the training set comprises a limited set of enriched QA pairs, the proposed model delivers strong performance on unseen data (datasets 2 and 4). The data quality, diversity, and enrichment mechanisms improve the model’s performance. For dataset 3, the proposed model achieved the lowest inference time of 1.5 s, indicating its potential to deliver reliable responses in a limited time. The token throughput remains consistent throughout the datasets, representing better processing efficiency and response time. Across the datasets, hallucination rates remained below 5% and bias variance under 3%, ensuring robustness and reliability. These values underscore the combined effect of RAG-based factual grounding, KAN-driven semantic normalization, and SHAP-guided interpretability. The proposed MQAS’s efficiency and computational feasibility in terms of latency, token throughput, and FLOPs requirements render it ideal for real-world healthcare applications.

Table 7 demonstrates the performance and robustness of the proposed model. By achieving a significant improvement of 3–5% in F1-score, the proposed model surpasses the strongest multilingual baseline models. The BLEU and ROUGE scores are improved by an average of 8–10%, indicating substantial semantic alignment and fluency in generated responses. In the English-only test, the F1-score of the proposed model increases by 4.3%, which is higher than BioBERT. The findings demonstrate consistent and significant improvements across the evaluation metrics. The proposed model outperforms biomedical-specific models such as BioBERT and ClinicalBERT. The proposed hybrid integration of RAG-enhanced retrieval, KANs-based semantic expansion, and BOHB-optimized ensemble weighting enables the model to balance contextual precision and language-adaptive fluency. The model achieved an F1-score of 90.00% and BLEU score of 0.810, which reflects the Arabic-only model’s ability to handle rich morphological variations. The potential of cross-lingual transfer learning of the bilingual model is confirmed by the F1-score, BLEU score, and ROUGE score of 89.62%, 0.806, and 0.632, respectively. The validation of the statistical significance through the McNemar χ² test reveals that the proposed MQAS’s improvements are significant at the 95% confidence level. The hallucination reduction from 12.4 (GPT-Neo) to 3.4 (English-only model) shows the reliability of the proposed model. The bias variance of 2.3% indicates the contribution of the KANs-LOD pipeline in mitigating the over-representation of high-frequency entities. The low mean uncertainty value of 0.18 affirms the generation of reliable and consistent responses.

The time complexity of the proposed model is (

O (k . n . d^{2} + r . \log r)),

where

k

denotes the number of integrated sub-models, r is the retrieved documents, n is the number of tokens, and d is the embedding size. The inference time of 3.4 s per query incurred by the bilingual model reflects its ability to maintain the trade-off between response generation and computational efficiency. The quantitative metrics and inferential statistics reveal that the proposed model achieves statistically reliable, interpretable, and cross-lingually consistent performance. The dynamic re-scaling of model’s contribution using dropout variance and SHAP-based interpretability presents transparent reasoning, quantifying the rationale contribution of clinical, linguistic, and contextual features. The low hallucination and bias variance address the ethical concerns of misinformation propagation in automated MQAS. The bilingual capability of the proposed model allows Arabic-speaking populations to access evidence-based insights.

Table 8 highlights the uncertainty analysis of the proposed model. It outlines the key responses of the proposed model for the queries of datasets 2 and 4. Each query batch corresponds to a semantically grouped set of medical questions in order to evaluate the model’s performance across diverse contexts. These batches represent distinct medical subdomains and reasoning types. The diversity across batches reflects real-world user interactions in clinical and patient-education settings. The higher confidence score suggests reliable predictions. For instance, “type 2 diabetes”, “ارتفاع ضغط الدم” (hypertension), “الربو” (asthma), and “flu vaccination” predictions can be trusted without substantial verification. The low dropout and ensemble variances affirm the model’s stability. In addition, the SHAP values of +0.12 represent the influence of “age” and “risk factors” in predicting “type 2 diabetes”. These values enhance the model’s transparency, allowing the user to understand the model’s reasoning. The higher confidence score of 0.82 for “flu vaccination” indicates the effectiveness of the proposed ensemble-learning approach. The action “recommended human review” highlights the high-uncertainty predictions. It demonstrates the necessity of human review to ensure the accuracy of the response.

Figure 5 provides a comprehensive view of the global feature contributions that support the model’s interpretability. The features are categorized into three domains: clinical, linguistic, and contextual, highlighting the reasoning process of the proposed model. The highest SHAP values of the clinical features, such as age, blood pressure history, and cholesterol, indicate the influence of patient-specific medical attributes on response generation. The potential of integrating preventive and lifestyle-based determinants is underscored by environmental exposure and vaccination history. Linguistic features, such as language domain specialists and answer length, reflect the model’s sensitivity to bilingual query structure and domain phrasing. Contextual factors, including retrieval count and prior response, exhibit the crucial role of RAG in providing meaningful information. The findings of the SHAP analysis show that the model’s decision aligns with domain-plausible medical reasoning while maintaining linguistic fairness across languages.

Table 9 reports the outcomes of the proposed model on the test sets of datasets 1–4, underscoring its strength and efficiency. The superior performance showcases the model’s potential in handling bilingual queries. The balance between computational efficiency and high BLEU and ROUGE scores renders the feasibility of the model’s deployment in a real-time setting. The existing bilingual models, including Sengupta et al. (2023) [36] and Preri et al. (2024) [41], produced low-accuracy results due to their lack of contextual understanding of the user query. The recommended language identification model supported the proposed model in achieving a better ROUGE score. It assisted the model in identifying the user intent and delivering an appropriate response. The proposed model required low computational costs compared to the Wu et al. [42] model, enabling the proposed model to be ideal for fast-paced and low-resource environments.

5. Discussion

In this study, we offered an effective bilingual MQAS to generate a reliable response for a query. The process of accessing trustworthy medical information is streamlined through the integration of advanced NLP techniques. The model’s interpretability is enabled in order to generate a reliable response. The proposed model supports Arabic-speaking individuals who experience difficulties due to the limited Arabic medical resources and databases. It facilitates users to obtain precise medical responses by addressing the existing challenges. It provides an efficient mechanism for retrieving medical information in a limited time, minimizing the demand for manual consultations. Individuals with limited medical expertise can understand complex medical terms in their preferred language. This promotes awareness of health issues and allows individuals to improve their quality of life.

The innovative integration of a generative, discriminative, and knowledge-aware neural architecture distinguishes the proposed MQAS from the existing models. By leveraging the complementary strengths of generative language models and extractive architectures, the proposed model effectively handles syntactic complexity and linguistic variations. KANs harmonize GPT-Neo and RoBERTa, enabling the model to align its outputs with medically validated concepts, reducing hallucination and improving trustworthiness. Furthermore, the deployment of an ensembling strategy adds a critical layer of intelligence to the decision-making process. This layered architecture generates adaptability across different query types, languages, and complexity levels.

The inability of traditional models to straightforwardly integrate external information into their design renders them inadequate for the healthcare domain. Compared to Zhang et al. (2022) [29] (accuracy of 84.7%) and Van Sonsbeek et al. (2023) [33] (accuracy of 84.3%), the proposed MQAS offered a remarkable accuracy in a bilingual setting. The application of a top-k retrieval approach enabled it to focus on the relevant information. The limited adaptability to bilingual datasets and limited query optimization reduced the performance of the Sengupta et al. (2023) [36] and Preri et al. (2024) [41] models. The integration of transformers facilitated the proposed model to generate responses with minimal latency, outperforming the Zeming et al. [38] model. Unlike Jiang et al. (2023) [37] (accuracy of 40.6%) and Zeming et al. (2023) [38] (accuracy of 60.7%), the proposed model addressed healthcare queries in diverse linguistic contexts. The complexities of medical terminologies affected the Van Sonsbeek et al. (2023) [33] and Preri et al. (2024) [41] models’ ability, leading to suboptimal performance. The lack of advanced retrieval strategies affected the performance of the Wu et al. (2024) [42] model.

The higher BLEU and F1-score values underscore the practical strengths of the proposed model in generating high-quality, contextually relevant responses. The generated responses match human-written responses in terms of structure and phrasing. This is essential in applications, including patient portals, digital health assistants, and triage chatbots. The improved BLEU scores show that the model is approachable and trustworthy to end-users. Similarly, the high F1-score reflects the model’s potential to consistently return reliable and comprehensive medical information, reducing the risk of misinformation and assisting clinicians as a decision-support tool. Healthcare providers in telemedicine settings may benefit from the proposed MQAS’s ability to pre-answer routine queries, triage symptom descriptions, and deliver contextually appropriate medical advice. This guarantees that patients receive prompt, evidence-based advice before a live consultation, improving efficiency and minimizing physician burden. The MQAS can be integrated into digital front-desk systems or patient-facing kiosks in healthcare facilities such as hospitals and clinics, facilitating bilingual assistance to patients who may not be confident in their ability to communicate in English. With the model’s dual-language understanding and response generation capabilities, healthcare practitioners and their various patient groups may communicate effectively, fostering inclusion. Utilizing natural-language queries, the system may additionally offer clinical decision assistance in settings where electronic health records (EHR) are linked. This could be accomplished by accessing essential clinical information, medical procedures, or summarizing patient conditions.

Unlike the existing models, the proposed MQAS extracts context from an external medical knowledge source to guarantee proper and suitable responses. RAG allows the model to access and retrieve current medical information, assisting the proposed model in generating a meaningful response. This synergy considerably enhances the model’s performance, enabling the proposed model to be suitable for the healthcare industry. The recommended approach addresses a crucial gap in bilingual QAS systems by providing high-quality responses with minimal computational resources. In cross-language contexts, existing models face challenges in handling the structure of specific languages and the nature of medical queries. The proposed model leveraged KANs, RAG, and effective fine-tuning strategies, rendering it capable of answering a wide range of medical queries, including straightforward and complex diagnostic queries, outperforming traditional models. In terms of BLEU and ROUGE, the model showed a significant improvement in response quality and relevance. The use of synthetic datasets supported the proposed model in achieving an exceptional outcome. The integration of the capabilities of GPT-Neo and RoBERTa reduced the response time, facilitating the user to retrieve faster responses. The BOHB optimization assisted the model in maintaining a trade-off between response time and reliable outcome.

The model’s interpretation ability is enhanced through the integration of LOD and KANs, responding to medical queries with greater semantic precision. For instance, in the query “ما أسباب أمراض القلب التاجية؟” (“What are the causes of coronary heart disease?”), the term “أمراض القلب التاجية” (“coronary heart disease”) is mapped to the standard medical concept using LOD resources, enriching the response through the incorporation of information, including underlying causes, clinical terminology, and risk factors. Similarly, the model distinguishes between the broader category “الخرف” (“dementia”) and the specific condition “الزهايمر” (“Alzheimer’s disease”), for the query “هل الزهايمر هو نفسه الخرف؟” (is Alzheimer’s the same as dementia?”). Using LOD resources, such as DBpedia and AraWiki, the model accesses hierarchical and relational knowledge, enabling it to explain that Alzheimer’s is a subtype of dementia. This feature enhances the accuracy, clarity, and educational value of the response.

The proposed context-aware bilingual MQAS exhibits strong performance across complex Arabic and English queries. However, we encounter challenges and limitations in implementing the model. The lack of accessible, high-quality Arabic medical literature is a significant challenge. While the proposed model generated a reliable outcome for Arabic and English queries, the lack of substantial, domain-specific Arabic datasets may influence precision and reliability. The integration of RAG into the proposed model may address this limitation by incorporating relevant external documents. However, this approach depends significantly on the variety and quality of the retrieved literature. A closer analysis suggests opportunities for further refinement, particularly in processing nuanced aspects of Arabic queries, highlighting edge cases where additional linguistic and domain tuning could further performance. For instance, the Arabic query, “ما هو أفضل علاج طبيعي لضغط الدم المرتفع؟” (“What is the best natural remedy for high blood pressure?”) obtains a clinically accurate and relevant response from the proposed model. However, the model is unable to emphasize the “natural remedy” aspect as strongly as intended. This suggests the requirement of additional training on alternative medical terminology in the Arabic language, assisting the model to better align with user-specific intent in lifestyle-oriented queries.

Within the context of multilingual healthcare settings, the proposed bilingual MQAS has the potential to revolutionize the accessibility of medical information, the level of patient interaction, and the clinical decision support. The limited availability of trustworthy medical content in Arabic is one of the persistent challenges in the Middle East and other multilingual regions. The proposed model addresses the challenges by integrating generative, discriminative, and knowledge-aware architectures, delivering contextually accurate and linguistically adaptive responses. By bridging the gap between Arabic-speaking individuals and medical information expressed in English, healthcare facilities, patient education platforms, and telemedicine systems are able to provide culturally and linguistically appropriate medical guidance. The proposed MQAS has the ability to serve as a decision-support system in clinical practice, assisting clinicians in retrieving evidence-based information. The interpretability layer, which is driven by SHAP values and uncertainty estimates, fosters clinician confidence by offering transparent justification for each generated response. In the context of AI-assisted diagnostics and education, this interpretability guarantees that responses can be inspected, confirmed, and modified, contributing to enhanced accountability. With credible, source-based, and multilingual responses, the proposed model can combat online medical misinformation for public health. By providing medical students and practitioners with reliable Arabic translations of complex English medical literature, the proposed MQAS provides assistance for cross-lingual learning in educational contexts.

We reduced the proposed model’s computational requirement in responding to a user query. However, healthcare centers may require substantial computational resources for model implementation. The specificity of the training data may limit the model’s generalization capability to rare medical conditions. The SHAP value interpretation may vary significantly in different medical contexts. This may affect the model’s consistency across English and Arabic languages. Periodic retraining with updated data is required to guarantee the model’s reliability and accuracy. We follow privacy and ethical guidelines in the model development. The implementation of the proposed multilingual MQAS poses substantial ethical challenges relating to medical misinformation, algorithmic bias, and user trust. Large language models typically generate probabilistic outputs, which may lead to hallucinations. The integration of RAG and KANs in generating responses has mitigated the hallucination rate, guaranteeing the generation of contextually relevant and evidence-based responses. SHAP-based interpretability and uncertainty quantification improve transparency, enabling end-users and physicians to understand prediction rationales and flag instances with high epistemic uncertainty that require human supervision. Moreover, bias mitigation is one of the ethical responsibilities in developing MQASs. The system balances Arabic and English datasets and uses KAN-based synonym enrichment to minimize semantic biases that might disfavor one language group over another.

Furthermore, the bias variance under 3% reveals that the model retains comparable reliability across language contexts, reducing the likelihood of biased performance or misunderstandings of medical terminology. From an ethical standpoint, the system follows the rules of transparency and human-in-the-loop decision-making. The proposed model is intended to promote informed patient education and clinician decision support. As a result of the proposed framework’s emphasis on interpretability, factual foundation, and human review for high-uncertainty scenarios, it guarantees that AI-assisted healthcare will continue to be trustworthy and ethically compliant. However, the proposed model requires careful handling to adhere to healthcare data regulations.

In the future, the proposed model can be extended to include multiple languages. The model’s adaptability can be improved by integrating various datasets and dialect variations. By leveraging regional Arabic dialects and rare medical terms, the model’s understanding of the context of the user query can be enhanced. Future work should explore other interpretability approaches for improving the model’s trustworthiness. Incorporation of patient history or demographic information could enable the model to provide personalized responses.

6. Conclusions

In this study, we introduce an innovative bilingual MQAS, generating reliable and contextually relevant responses to Arabic and English medical queries. This study addresses critical gaps in existing general-purpose MQASs, particularly the lack of support for non-English speakers and the limited interpretability of NLP models. The novel contribution of this study lies in its ability to bridge linguistic and cultural barriers in healthcare settings. The generalization accuracy of 90.58% with a BLEU score of 0.80 reflects the potential of the proposed model to provide accurate and meaningful responses. The proposed model outperformed the baseline models in terms of accuracy, F1-score, BLEU, and latency. The reduction in latency and improvement in token processing speech underscore the suitability of the proposed MQAS in real-time healthcare settings. The findings emphasize the model’s potential in generating medically accurate and linguistically appropriate responses to complex medical queries. The incorporation of the BOHB optimization ensures efficient parameter tuning, enabling effective performance with minimal computational overhead. The multi-layered architecture facilitates the model to respond effectively to diverse medical queries. By presenting a bilingual, interpretable, and scalable MQAS, this study contributes to broader efforts in democratizing access to healthcare information. However, the proposed model has certain limitations. The regional dialects and linguistic nuances in Arabic may affect the model’s performance. The SHAP values may differ significantly in different medical contexts. Future research should focus on multiple directions. Expanding the model to support additional languages will enhance its global applicability. Incorporating rare disease data and patient-generated health content can improve the model’s ability to handle uncommon queries. By implementing a continuous learning framework, the model will adapt to new medical guidelines, terminologies, and treatments.

Author Contributions

Conceptualization, A.R.W.S. and Y.A.; Methodology, A.R.W.S. and Y.A.; Software, A.R.W.S. and Y.A.; Validation, A.R.W.S. and Y.A.; Formal analysis, A.R.W.S. and Y.A.; Investigation, A.R.W.S.; Resources, Y.A.; Writing—original draft, A.R.W.S.; Writing—review & editing, A.R.W.S. and Y.A.; Visualization, A.R.W.S.; Project administration, Y.A.; Funding acquisition, A.R.W.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Grant No. KFU253678].

Data Availability Statement

The data presented in this study are openly available in Huggingface at https://huggingface.co/datasets/keivalya/MedQuad-MedicalQnADataset, accessed on 18 January 2024, Kaggle at https://www.kaggle.com/datasets/yassinabdulmahdi/arabic-medical-q-and-a-dataset/data, accessed on 16 February 2024, Huggingface at https://huggingface.co/datasets/Malikeh1375/medical-question-answering-datasets, accessed on 21 May 2024.

Acknowledgments

This work was supported by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Grant No. KFU253678]. The authors would like to thank the Deanship of Scientific Research at Shaqra University for supporting this work.

Conflicts of Interest

The authors declare no conflict of interest.

References

Roy, P.K.; Saumya, S.; Singh, J.P.; Banerjee, S.; Gutub, A. Analysis of community question-answering issues via machine learning and deep learning: State-of-the-art review. CAAI Trans. Intell. Technol. 2023, 8, 95–117. [Google Scholar] [CrossRef]
Zafar, A.; Sahoo, S.K.; Bhardawaj, H.; Das, A.; Ekbal, A. KI-MAG: A knowledge-infused abstractive question answering system in medical domain. Neurocomputing 2024, 571, 127141. [Google Scholar] [CrossRef]
Alwaneen, T.H.; Azmi, A.M.; Aboalsamh, H.A.; Cambria, E.; Hussain, A. Arabic question answering system: A survey. Artif. Intell. Rev. 2022, 55, 207–253. [Google Scholar] [CrossRef]
Demirhan, H.; Zadrozny, W. Survey of multimodal medical question answering. BioMedInformatics 2024, 4, 50–74. [Google Scholar] [CrossRef]
Budler, L.C.; Gosak, L.; Stiglic, G. Review of artificial intelligence-based question-answering systems in healthcare. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2023, 13, e1487. [Google Scholar] [CrossRef]
Jin, Q.; Yuan, Z.; Xiong, G.; Yu, Q.; Ying, H.; Tan, C.; Chen, M.; Huang, S.; Liu, X.; Yu, S. Biomedical question answering: A survey of approaches and challenges. ACM Comput. Surv. (CSUR) 2022, 55, 1–36. [Google Scholar] [CrossRef]
Wahdan, A.; Al-Emran, M.; Shaalan, K. A systematic review of Arabic text classification: Areas, applications, and future directions. Soft Comput. 2024, 28, 1545–1566. [Google Scholar] [CrossRef]
Mutabazi, E.; Ni, J.; Tang, G.; Cao, W. A review on medical textual question answering systems based on deep learning approaches. Appl. Sci. 2021, 11, 5456. [Google Scholar] [CrossRef]
Dimitrakis, E.; Sgontzos, K.; Tzitzikas, Y. A survey on question answering systems over linked data and documents. J. Intell. Inf. Syst. 2020, 55, 233–259. [Google Scholar] [CrossRef]
Savery, M.; Abacha, A.B.; Gayen, S.; Demner-Fushman, D. Question-driven summarization of answers to consumer health questions. Sci. Data 2020, 7, 322. [Google Scholar] [CrossRef]
Nassiri, K.; Akhloufi, M. Transformer models used for text-based question answering systems. Appl. Intell. 2023, 53, 10602–10635. [Google Scholar] [CrossRef]
Jati, B.S.; Widyawan, S.T.; Rizal, S.M.N. Multilingual named entity recognition model for Indonesian health insurance question answering system. In Proceedings of the 2020 3rd International Conference on Information and Communications Technology (ICOIACT), Yogyakarta, Indonesia, 24–25 November 2020; IEEE: Piscataway, NI, USA, 2020; pp. 180–184. [Google Scholar]
Hu, Z.; Ma, X. A novel neural network model fusion approach for improving medical named entity recognition in online health expert question-answering services. Expert Syst. Appl. 2023, 223, 119880. [Google Scholar] [CrossRef]
Sarrouti, M.; El Alaoui, S.O. SemBioNLQA: A semantic biomedical question answering system for retrieving exact and ideal answers to natural language questions. Artif. Intell. Med. 2020, 102, 101767. [Google Scholar] [CrossRef] [PubMed]
Boudjellal, N.; Zhang, H.; Khan, A.; Ahmad, A.; Naseem, R.; Shang, J.; Dai, L. ABioNER: A BERT-Based Model for Arabic Biomedical Named-Entity Recognition. Complexity 2021, 2021, 6633213. [Google Scholar] [CrossRef]
Bizer, C.; Heath, T.; Berners-Lee, T. Linked data-the story so far. In Linking the World’s Information: Essays on Tim Berners-Lee’s Invention of the World Wide Web; Association for Computing Machinery: New York, NY, USA, 2023; pp. 115–143. [Google Scholar]
Fan, W.; Ding, Y.; Ning, L.; Wang, S.; Li, H.; Yin, D.; Chua, T.S.; Li, Q. A survey on rag meeting llms: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 6491–6501. [Google Scholar]
Cuconasu, F.; Trappolini, G.; Siciliano, F.; Filice, S.; Campagnano, C.; Maarek, Y.; Tonellotto, N.; Silvestri, F. The power of noise: Redefining retrieval for rag systems. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 719–729. [Google Scholar]
Mahboub, A.; Za’ter, M.E.; Al-Rfooh, B.; Estaitia, Y.; Jaljuli, A.; Hakouz, A. Evaluation of semantic search and its role in retrieved-augmented-generation (rag) for arabic language. arXiv 2024, arXiv:2403.18350. [Google Scholar]
Vladika, J.; Matthes, F. Improving Health Question Answering with Reliable and Time-Aware Evidence Retrieval. arXiv 2024, arXiv:2404.08359. [Google Scholar] [CrossRef]
Chen, Y.; Huang, X.; Yang, F.; Lin, H.; Lin, H.; Zheng, Z.; Liang, Q.; Zhang, J.; Li, X. Performance of ChatGPT and Bard on the medical licensing examinations varies across different cultures: A comparison study. BMC Med. Educ. 2024, 24, 1372. [Google Scholar] [CrossRef]
Chau, R.C.; Thu, K.M.; Yu, O.Y.; Hsung, R.T.; Lo, E.C.; Lam, W.Y. Performance of generative artificial intelligence in dental licensing examinations. Int. Dent. J. 2024, 74, 616–621. [Google Scholar] [CrossRef]
Ackerman, R.; Balyan, R. Automatic Multilingual Question Generation for Health Data Using LLMs. In Proceedings of the International Conference on AI-generated Content, Shanghai, China, 25–26 August 2023; Springer Nature: Singapore, 2023; pp. 1–11. [Google Scholar]
Nentidis, A.; Katsimpras, G.; Krithara, A.; Krallinger, M.; Rodríguez-Ortega, M.; Rodriguez-López, E.; Loukachevitch, N.; Sakhovskiy, A.; Tutubalina, E.; Dimitriadis, D.; et al. Overview of BioASQ 2025: The thirteenth BioASQ challenge on large-scale biomedical semantic indexing and question answering. In Proceedings of the International Conference of the Cross-Language Evaluation Forum for European Languages, Madrid, Spain, 3 September 2024; pp. 173–198. [Google Scholar]
Jin, Y.; Chandra, M.; Verma, G.; Hu, Y.; De Choudhury, M.; Kumar, S. Ask Me in English Instead: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries. In Proceedings of the Web Conference 2024, Singapore, 13–17 May 2024. [Google Scholar]
Vladika, J.; Schneider, P.; Matthes, F. MedREQAL: Examining Medical Knowledge Recall of Large Language Models via Question Answering. arXiv 2024, arXiv:2406.05845. [Google Scholar] [CrossRef]
Das, D.; Kumar, N.; Longjam, L.A.; Sinha, R.; Roy, A.D.; Mondal, H.; Gupta, P. Assessing the capability of ChatGPT in answering first-and second-order knowledge questions on microbiology as per competency-based medical education curriculum. Cureus 2023, 15, e36034. [Google Scholar] [CrossRef]
Yasunaga, M.; Ren, H.; Bosselut, A.; Liang, P.; Leskovec, J. QA-GNN: Reasoning with language models and knowledge graphs for question answering. arXiv 2021, arXiv:2104.06378. [Google Scholar]
Zhang, X.; Bosselut, A.; Yasunaga, M.; Ren, H.; Liang, P.; Manning, C.D.; Leskovec, J. Greaselm: Graph reasoning enhanced language models for question answering. arXiv 2022, arXiv:2201.08860. [Google Scholar] [CrossRef]
Alzubi, J.A.; Jain, R.; Singh, A.; Parwekar, P.; Gupta, M. COBERT: COVID-19 question answering system using BERT. Arab. J. Sci. Eng. 2023, 48, 11003–11013. [Google Scholar] [CrossRef] [PubMed]
Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Amin, M.; Hou, L.; Clark, K.; Pfohl, S.R.; Cole-Lewis, H.; et al. Toward expert-level medical question answering with large language models. Nat. Med. 2025, 31, 943–950. [Google Scholar] [CrossRef] [PubMed]
Wang, Y.; Ma, X.; Chen, W. Augmenting black-box LLMs with medical textbooks for clinical question answering. arXiv 2023, arXiv:2309.02233. [Google Scholar]
Van Sonsbeek, T.; Derakhshani, M.M.; Najdenkoska, I.; Snoek, C.G.; Worring, M. Open-ended medical visual question answering through prefix tuning of language models. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2023; Springer Nature: Cham, Switzerland, 2023; pp. 726–736. [Google Scholar]
Lin, Z.; Zhang, D.; Tao, Q.; Shi, D.; Haffari, G.; Wu, Q.; He, M.; Ge, Z. Medical visual question answering: A survey. Artif. Intell. Med. 2023, 143, 102611. [Google Scholar] [CrossRef]
Kamel, S.M.; Hassan, S.I.; Elrefaei, L. Vaqa: Visual arabic question answering. Arab. J. Sci. Eng. 2023, 48, 10803–10823. [Google Scholar] [CrossRef]
Sengupta, N.; Sahu, S.K.; Jia, B.; Katipomu, S.; Li, H.; Koto, F.; Marshall, W.; Gosal, G.; Liu, C.; Chen, Z.; et al. Jais and jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models. arXiv 2023, arXiv:2308.16149. [Google Scholar]
Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; Casas, D.D.L.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar] [CrossRef]
Chen, Z.; Cano, A.H.; Romanou, A.; Bonnet, A.; Matoba, K.; Salvi, F.; Pagliardini, M.; Fan, S.; Köpf, A.; Mohtashami, A.; et al. Meditron-70b: Scaling medical pretraining for large language models. arXiv 2023, arXiv:2311.16079. [Google Scholar] [CrossRef]
Yu, H.; Yu, C.; Wang, Z.; Zou, D.; Qin, H. Enhancing healthcare through large language models: A study on medical question answering. In Proceedings of the 2024 IEEE 6th International Conference on Power, Intelligent Computing and Systems (ICPICS), Shenyang, China, 26–28 July 2024; IEEE: Piscataway, NI, USA, 2024; pp. 895–900. [Google Scholar]
English Medical Q&A Dataset. Available online: https://huggingface.co/datasets/keivalya/MedQuad-MedicalQnADataset (accessed on 18 January 2024).
Pieri, S.; Mullappilly, S.S.; Khan, F.S.; Anwer, R.M.; Khan, S.; Baldwin, T.; Cholakkal, H. Bimedix: Bilingual medical mixture of experts llm. arXiv 2024, arXiv:2402.13253. [Google Scholar]
Wu, C.; Lin, W.; Zhang, X.; Zhang, Y.; Xie, W.; Wang, Y. PMC-LLaMA: Toward building open-source language models for medicine. J. Am. Med. Inform. Assoc. 2024, 31, 1833–1843. [Google Scholar]
Black, S.; Biderman, S.; Hallahan, E.; Anthony, Q.; Gao, L.; Golding, L.; He, H.; Leahy, C.; McDonell, K.; Phang, J.; et al. Gpt-neox-20b: An open-source autoregressive language model. arXiv 2022, arXiv:2204.06745. [Google Scholar]
Antaki, F.; Milad, D.; Chia, M.A.; Giguère, C.-É.; Touma, S.; El-Khoury, J.; Keane, P.A.; Duva, R.l. Capabilities of GPT-4 in ophthalmology: An analysis of model entropy and progress towards human-level medical question answering. Br. J. Ophthalmol. 2024, 108, 1371–1378. [Google Scholar]
Park, E.; Yoo, S.; Vajda, P. Value-aware quantization for training and inference of neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 580–595. [Google Scholar]
Gawbah, H. AHQAD: Arabic Healthcare Question and Answer Dataset, Mendeley Data, V1. Available online: https://data.mendeley.com/datasets/mgj29ndgrk/1 (accessed on 19 April 2024).
Arabic Medical Q&A Dataset. Available online: https://www.kaggle.com/datasets/yassinabdulmahdi/arabic-medical-q-and-a-dataset/data (accessed on 16 February 2024).
English Medical Question Answering Dataset. Available online: https://huggingface.co/datasets/Malikeh1375/medical-question-answering-datasets (accessed on 21 May 2024).
PubMed Knowledge Base. Available online: https://pubmed.ncbi.nlm.nih.gov/ (accessed on 16 February 2024).
Emarefa Knowledge Base. Available online: https://emarefa.net/ (accessed on 18 April 2024).
T5 Transformer. Available online: https://huggingface.co/docs/transformers/en/model_doc/t5 (accessed on 18 January 2024).
FastText Model. Available online: https://github.com/facebookresearch/fastText (accessed on 18 January 2024).

Figure 1. The structure of the proposed MQAS.

Figure 2. Training (a) accuracy and (b) loss.

Figure 3. Findings of performance analysis—dataset 1.

Figure 4. Findings performance analysis—dataset 3.

Figure 5. Global SHAP summary: clinical, linguistic, and contextual feature contributions.

Table 1. Characteristics of existing models.

Authors	Datasets	Features	Limitations
Yasunaga et al. (2021) [28]	Commonsense QA, Openbook QA, and MEDQA-USMLE	Joint reasoning and relevance scoring techniques	Lack of model interpretability.
Zhang et al. (2022) [29]	Commonsense QA, Openbook QA, and MEDQA-USMLE	Graphical neural network (GNN) with transformers for language representation	Limited contextual understanding due to the shortcomings of GNN.
Alzubi et al. (2023) [30]	Coronavirus Open Research Dataset Challenge	BERT and ranker with logits score	Limited generalization on novel datasets.
Singhal et al. (2023) [31]	MEDQA	Medical domain fine-tuning and prompting strategies	Lack of effective query optimization and model interpretability.
Wang et al. (2023) [32]	MEDQA and MEDQA-USMLE	Knowledge of self-refinement and query augmentation techniques	High computational costs.
Van Sonsbeek et al. (2023) [33]	SLAKE, OVQA, and PATHVQA	PEFT strategies	Lack of external medical information may reduce the model’s performance.
Lin et al. (2023) [34]	MEDQA-USMLE	Knowledge graphs and modality interaction operation	Lack of generalization.
Kamel et al. (2023) [35]	Private visual Arabic QAS dataset	AraVec 2.0 skipGram model and one-layer unidirectional long short-term memory-based feature extraction	Demand substantial computational costs.
Sengupta et al. (2023) [36]	The private dataset contains QA pairs collected from multiple sources	jais tokenizer with Swish activation function	jais tokenizer frequently struggles with Arabic-specific named entities.
Jiang et al. (2023) [37]	Mistral 7B dataset contains QA pairs collected from multiple sources	Sliding window attention and even a billion parameters	High computational costs.
Zeming et al. (2023) [38]	MEDQA-USMLE and PubMed	Supervised fine-tuning approach	Lack of model interpretability.
Yu et al. (2024) [39]	MedQuad dataset [40]	Integration of Sentence-T5 and Mistral 7B models	Demand substantial computational resources.
Preri et al. (2024) [41]	MedQuad, PubMedQA, MedMCQA, icliniq, UMLS, and LiveQA	PEFT approach with a mixture of experts	The limitations, including hallucination, toxicity, and stereotypes, may reduce the effectiveness of the model.
Wu et al. (2024) [42]	4.8 million biomedical papers and 30,000 medical textbooks	Data-centric knowledge injection and tokenizer for understanding the user query	The data-centric model demands continual updates with novel information.

Table 2. Expert validation and inter-annotator agreement statistics.

Term	Description	Value
Total QA Pairs	Number of bilingual QA pairs	97,000 (49,000 English + 48,000 Arabic)
Experts	Two bilingual medical specialists	2 Primary and 1 Mediator
Modified QA Pairs	Percentage of pairs edited or refined after expert review for medical accuracy	14.8% (14,356 QA Pairs)
Major Disagreement Categories	Common cases of annotation divergence	Symptom–diagnosis mapping and translation of culturally specific terms
Resolution Protocol	Structured consensus meeting by a senior expert	Applied to all flagged items
Example of Disagreement	“كيف يتم تشخيص فقر الدم؟” (How is anemia diagnosed?) Expert 1: CBC and ferritin tests and Expert 2: Hemoglobin level and reticulocyte count	Unified by merging them into comprehensive diagnostic tests.
Example of Disagreement	Arabic: “هل يجب أخذ لقاح الإنفلونزا كل عام؟” (should the flu vaccine be taken every year?) Expert 1: yes, recommended and Expert 2: Only for high-risk groups	Both preserved and annotated with conditional metadata (general population vs. risk-group guidelines.
Inter-Annotator Agreement (Cohen’s k)	Statistical measure of annotation reliability between two experts	k = 0.86 (high agreement)
Final Consensus Dataset	Total bilingual QA pairs retained after quality verification	97,000 validated QA pairs

Table 3. Significance of features.

Features	Description	Use in SHAP Values
Domain Specificity	Represents the medical subdomain	Highlights the model’s performance on a specific domain
Answer Length	Length of the generated responses	Indicates the effect of the response length on model’s confidence and accuracy
Language	Arabic or English	Demonstrates language-specific interpretability in order to guide bilingual model tuning
Medical Terminology	Identification of specific medical terms	Measures the model’s understanding of medical terminology
Contextual Relevance	Determines the proximity of responses with relevant medical context	Indicates the quality of the model’s prediction
Prior Response Relevance Score	Evaluates the relevance of previous model responses to similar queries	Highlights the significance of previous model responses regarding the proposed model’s accuracy
Retrieval Count	Retrieval of multiple documents as potential reference scores	Indicates the model’s ability to provide additional context for a relevant response
User Demographic Data	Metadata, including age, gender, and language, for personalized answers (based on availability)	Indicates the impact of specific demographic data on the generated response

Table 4. Computational configurations.

Model	Parameters	Values
GPT-Neo	Learning Rate	2 × 10⁻⁵
	Batch Size	16
	Maximum Sequence Length	512
	Dropout Rate	0.1
	Optimizer	BOHB
	Embedding Dimension	2048
	Gradient Clipping	1.0
RoBERTa	Maximum Sequence Length	384
	Learning Rate Schedule	3 × 10⁻⁵
	Attention Heads	12
	Layer Count	12
	Dropout Rate	0.1
	Batch Size	32
	Gradient Clipping	0.5
Weighted Majority Voting	GPT-Neo Weight	0.6
	RoBERTa Weight	0.4
	Minimum Confidence Threshold	0.7
	Weight Decay for Stability	0.01
	Score Aggregation Method	Mean of top-k
	Aggregation Function	Weighted Sum
	Normalization	Min–Max Scaling
	Maximum Ensemble Iteration	50
T5	Learning Rate	1 × 10⁻⁴
	Batch Size	8
	Maximum Sequence Length	512
	Warmup Steps	500
	Parameter Count	770 Million
RAG	Retrieval Embedding Size	768
	Batch Size	32
	Frequency	Every 50 steps
	Parameter Count	1.2 Billion
KANs	Learning Rate	1 × 10⁻⁴
	Knowledge Embedding Size	300
	Number of Layers	3
	Batch Size	16
	Neighbor Sampling	5 Neighbors per Node
	Regularization	L2 Regularization with 0.01

Table 5. Findings of generalization—datasets 2 and 4.

K	Accuracy	Precision	Recall	F1-Score	BLEU	ROUGE
5	92.6	89.5	88.7	89.1	0.82	0.64
10	91.4	88.9	89.1	89	0.77	0.59
15	90.8	90.1	90.5	90.3	0.84	0.67
20	89.5	90.6	89.6	90.1	0.79	0.69
25	88.6	89.7	89.5	89.6	0.81	0.57

Table 6. Computational and validation performance of the proposed MQAS.

Model	Inference Time/Query (s)	Tokens/Second	Hallucination Rate (%)	Bias Variance (%)	Mean Uncertainty (Variance Threshold > 0.25)
Proposed MQAS (Dataset 1)	2.2	199.7	3.7	2.8	0.21
Proposed MQAS (Dataset 2)	2.7	203.8	3.3	2.4	0.19
Proposed MQAS (Dataset 3)	1.5	192.8	3.9	2.5	0.19
Proposed MQAS (Dataset 4)	2.1	190.6	3.4	2.2	0.18
Proposed MQAS (Datasets 2 and 4)	3.4	202.1	3.6	2.3	0.18

Table 7. Comparative evaluation of baseline and proposed models (datasets 2 and 4).

Model	Language Scope	F1-Score (%)	BLEU	ROUGE	Hallucination Rate (%)	Bias Variance (%)	Mean Uncertainty (Variance Threshold > 0.25)	Inference Time/Query (s)	McNemar χ² (p) vs. Proposed
GPT-Neo	English only (Dataset 4)	83.00	0.701	0.541	12.4	4.8	0.29	4.5	12.45 (p < 0.01)
RoBERTa	English only (Dataset 4)	84.21	0.726	0.557	10.8	4.5	0.27	5.1	11.32 (p < 0.01)
BioBERT	English only (Dataset 4)	86.51	0.747	0.574	9.2	3.9	0.26	6.3	9.28 (p < 0.01)
ClinicalBERT	English only (Dataset 4)	85.77	0.731	0.561	9.5	3.8	0.25	4.4	10.06 (p < 0.01)
mBERT	Arabic + English (Datasets 2 and 4)	81.02	0.698	0.514	8.6	3.4	0.24	5.8	13.57 (p < 0.001)
XLM-RoBERTa	Arabic + English (Datasets 2 and 4)	86.97	0.750	0.582	6.8	2.9	0.22	6.7	6.00 (p < 0.001)
Proposed MQAS	Arabic Only (Dataset 2)	90.00	0.810	0.640	3.3	2.4	0.19	2.7	—
Proposed MQAS	English only (Dataset 4)	91.00	0.860	0.698	3.4	2.2	0.18	2.1	—
Proposed MQAS	Arabic + English (Datasets 2 and 4)	89.62	0.806	0.632	3.6	2.3	0.18	3.4	—

Note: McNemar’s χ² test values are reported for the baseline models compared against the proposed MQAS. The proposed model serves as the statistical reference. Thus, its cell is denoted as “—”. All p-values < 0.05 indicate statistically significant differences.

Table 8. Findings of uncertainty analysis (datasets 2 and 4).

Query Batch	Predicted Answer	Prediction Confidence	Dropout Variance	Ensemble Variance	SHAP Value	Uncertainty Rating (Low and High)	Action
1	“Type 2 Diabetes”	0.85	0.05	0.03	+0.12 (Age, Risk Factors)	Low	Acceptable
2	“ارتفاع ضغط الدم” (Hypertension)	0.77	0.07	0.05	+0.18 (Blood Pressure History)	Low	Acceptable
3	“الربو” (Asthma)	0.83	0.08	0.05	+0.09 (Environmental Factors)	Low	Acceptable
4	“Stroke”	0.65	0.25	0.15	+0.30 (Smoking History)	High	Recommend Human Review
5	“Flu Vaccination”	0.82	0.07	0.04	+0.13 (Vaccination History)	Low	Acceptable
6	“Heat Disease”	0.68	0.18	0.11	+0.22 (Cholesterol and Age)	High	Recommend Human Review

Table 9. Findings of comparative analysis.

Model	Type	Performance
Proposed MQAS	Arabic (Dataset 1)	Accuracy: 93.04% Precision: 93.06% Recall: 92.84% F1-Score: 92.92% Inference time/query: 2.2 s Tokens/Second: 199.7
Proposed MQAS	English (Dataset 3)	Accuracy: 87.4% Precision: 90.26% Recall: 89.26% F1-Score: 89.72% Inference time/query: 1.5 s Tokens/Second: 192.8
Proposed MQAS	Bilingual (Datasets 2 and 4)	Accuracy: 90.58% Precision: 89.76% Recall: 89.48% F1-Score: 89.62% BLEU: 0.806 ROUGE: 0.632 Inference time/query: 3.4 s Tokens/Second: 202.1 FLOPs: 2.9 Giga
Zhang et al. (2022) [29]	Single (English)	Commonsense QA Accuracy: 84.7% Generalization Accuracy: 38.5%
Van Sonsbeek et al. (2023) [33]	Single (English)	Accuracy: 84.3% F1-Score: 78.1% BLEU: 0.78
Sengupta et al. (2023) [36]	Bilingual	Accuracy: 28.8% Latency: 14 s Tokens/Second: 36.5
Jiang et al. (2023) [37]	Single (English)	Accuracy: 40.6%
Zeming et al. (2023) [38]	Single (English)	Accuracy: 60.7% Latency Score: 24 s Tokens/Second: 20.9
Yu et al. (2024) [39]	Single (English)	Precision: 76.2%
Preri et al. (2024) [41]	Bilingual	Accuracy: 47.3% Latency: 2.8 s Tokens/Second: 180.6
Wu et al. (2024) [42]	Single (English)	Accuracy: 47.2% Latency Score: 4.1 s Tokens/Second: 124.1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Sait, A.R.W.; Alkhurayyif, Y. Explainable Bilingual Medical-Question-Answering Model Using Ensemble Learning Technique. Electronics 2025, 14, 4128. https://doi.org/10.3390/electronics14204128

AMA Style

Sait ARW, Alkhurayyif Y. Explainable Bilingual Medical-Question-Answering Model Using Ensemble Learning Technique. Electronics. 2025; 14(20):4128. https://doi.org/10.3390/electronics14204128

Chicago/Turabian Style

Sait, Abdul Rahaman Wahab, and Yazeed Alkhurayyif. 2025. "Explainable Bilingual Medical-Question-Answering Model Using Ensemble Learning Technique" Electronics 14, no. 20: 4128. https://doi.org/10.3390/electronics14204128

APA Style

Sait, A. R. W., & Alkhurayyif, Y. (2025). Explainable Bilingual Medical-Question-Answering Model Using Ensemble Learning Technique. Electronics, 14(20), 4128. https://doi.org/10.3390/electronics14204128

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Explainable Bilingual Medical-Question-Answering Model Using Ensemble Learning Technique

Abstract

1. Introduction

2. Related Works

3. Materials and Methods

3.1. Data Acquisition

3.2. Synthetic Dataset

3.3. Query Optimization

3.4. GPT-Neo and RoBERTa-Based Response Generation

3.5. BOHB-Based Hyperparameter Optimization

3.6. Ensemble-Learning-Based Response Generation

3.7. Performance Evaluation

4. Results

5. Discussion

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI