1. Introduction
In recent years, Artificial Intelligence (AI) has had a powerful influence on human lives. It has achieved remarkable results in many fields of technology and computing and now interacts with various branches of science. Among these, natural language processing (NLP) [1] stands out, referring to the ability of computers to understand and process human language, enabling seamless human–computer interaction. Numerous applications of NLP have emerged, especially chatbots [2], which serve as virtual assistants capable of understanding user needs and providing relevant information or responses based on their knowledge.
Most existing work on chatbot development has focused on the English language, while Arabic chatbots remain scarce due to the complexity and variability of the Arabic language [3]. Spoken Arabic differs from one country to another, giving rise to several dialects, each with its own characteristics. Additionally, the development of Arabic chatbots is hindered by a lack of resources, particularly the scarcity of annotated datasets needed to train NLP models. Although interest in Arabic dialects has grown in recent years, these efforts still face many challenges, and the results remain far from those achieved for English.
In this paper, we present a chatbot that supports Moroccan Arabic (Darija). This dialect, like others across the Arab world, remains one of the most effective means of communicating with citizens in their daily lives. We therefore propose a chatbot architecture that can be adapted to any Arabic dialect to provide domain-specific responses. Our system was tested on the Moroccan fiscal system, where users often need assistance with procedures such as making payments or finding documents related to tax regulations. To evaluate the chatbot’s performance, we used a dataset of 500 question–answer pairs.
The remainder of this paper is organized as follows:
Section 2 presents related work and existing Arabic chatbots.
Section 3 describes the methodology used to develop our chatbot, including the system architecture, data collection, and the integration of the Darija dialect.
Section 4 details the experimental setup, including data labeling, the AI models, and the results. Finally,
Section 5 concludes this paper.
2. Related Works
The growing demand for chatbot systems in fields such as healthcare, education, and public services has driven significant advances in natural language processing (NLP) and conversational AI. Large language models (LLMs) such as ChatGPT (GPT-4 Turbo), Gemini (Gemini 2.5), and LLaMA (LLaMA 2) now demonstrate strong performance across various domains and in multiple languages. Despite this, support for Arabic chatbots, especially for dialectal Arabic, is still limited. In this section, we describe studies, research, and applications related to our work.
2.1. Multilingual Language Models and Chatbots
Multilingual models like XLM-R [4] and mBERT [5] have enabled cross-lingual understanding. However, the majority of Arabic chatbots handle only Standard Arabic, a limitation that excludes a significant portion of native speakers who use regional dialects in daily conversation.
2.2. NLP Challenges for Moroccan Darija
Arabic dialects, including Moroccan Darija, present a unique linguistic challenge because they are informal, highly oral, and non-standardized. Recent research by Gaanoun et al. [6] presents DarijaBERT, a transformer-based language model optimized and fine-tuned on written Moroccan Arabic. In parallel, Shang et al. [3] proposed Atlas Chat, adapting LLM architectures for Darija processing and highlighting both the promise and the challenges of integrating dialect-specific capabilities into modern language models. This work offers a crucial basis for NLP applications in this under-resourced language.
2.3. Semantic Similarity Models for Dialogue Systems
To improve chatbot response accuracy, it is important to combine text generation with retrieval mechanisms. Sentence-BERT [7] and MiniLM [8] show strong performance in semantic similarity search, enabling efficient information retrieval. These approaches are often paired with vector databases and scalable vector search libraries such as FAISS [9] to enhance chatbot accuracy by matching user input to semantically similar responses stored in a predefined dataset.
2.4. Chatbots in Government and Public Sector Applications
In the public sector, conversational agents play a crucial role by reducing workload and improving service accessibility. Many countries, such as the USA, the UAE, and the Scandinavian countries, have deployed AI-powered chatbots to answer citizen inquiries. In Arab countries, however, and especially in North Africa, public-sector chatbots are limited to Standard Arabic and French, which does not reflect daily communication patterns. Our work addresses this gap with an approach that matches Moroccan-dialect queries at a similarity level comparable to that achieved for French, enhancing access for Moroccan citizens.
3. Methodology
In this section, we detail the methodology used for the development of the chatbot, which consists of five stages: collecting data from online sources within the context of the tax domain, structuring the data into a JSON file, creating equivalent data in the Moroccan dialect, designing and developing the chatbot algorithm, and testing the chatbot to ensure accuracy and relevance.
3.1. Dataset
To collect and prepare the data, we analyzed the official websites of Moroccan public entities operating in the fiscal domain. We extracted frequently asked questions (FAQs), citizen guides, and various documents to cover the most frequently asked questions related to public services.
The data collected was mainly in French. We gathered a total of 500 question–answer pairs, each associated with a specific class or intent. To structure the data, we used a JSON file (Figure 1) consisting of a list of intents. Each intent (or tag) is associated with several elements:
tag: A unique identifier representing the intent or category of the question.
patterns: A set of example user questions that represent different ways of asking the same thing. These are used to train the chatbot to recognize various formulations.
responses: The predefined response(s) that the chatbot returns when this intent is detected. In the French version, this field contains responses in French. In the Darija dataset, this field holds responses in the Moroccan dialect.
context_set: An optional context label that helps place the user’s question within a specific conversation flow or situation to manage the conversation more effectively.
Figure 1. Structure of a JSON intent element in the dataset.
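Based on the field descriptions above, a single intent element can be sketched as follows. The tag, patterns, and responses shown here are invented illustrations, not entries from the actual dataset:

```python
import json

# Illustrative intent element following the structure described above;
# the tag, patterns, and responses are invented examples, not entries
# from the actual fiscal-domain dataset.
intent = {
    "tag": "paiement_taxe",
    "patterns": [
        "Comment payer ma taxe ?",
        "Quelles sont les étapes pour payer la taxe ?",
    ],
    "responses": [
        "Vous pouvez payer votre taxe en ligne via le portail officiel.",
    ],
    "context_set": "",
}

# The dataset file holds a list of such intents.
dataset = {"intents": [intent]}
print(json.dumps(dataset, ensure_ascii=False, indent=2))
```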
To enhance the dataset, we supplemented each question with multiple rephrasings to improve the robustness of the model. These rephrasings simulate the variety of ways users may express the same intent (Table 1).
After structuring the French dataset, we created an equivalent dataset in the Moroccan dialect (Darija) (Figure 2), using the same structure and the same number of question–answer pairs. This Darija dataset was also formatted as a JSON file.
3.2. System Design
Our system supports multiple languages: French, English, Spanish, Standard Arabic, and Moroccan dialect (Darija) (Table 2). The design is based on a pivot architecture in which French serves as the main language for intent classification and response generation. The system relies on two similarity techniques to identify the most relevant intent and response: a static method based on lexical matching, and a semantic similarity model, specifically all-MiniLM-L6-v2 [10] (Table 3), used to compute deep semantic similarity between user queries and dataset patterns. Depending on the selected language, a specific processing flow combines these methods to return the correct response.
3.2.1. Similarity Computation Approaches
Static Similarity: This method compares the input query $q$ with each pattern $p$ using a lexical similarity measure based on the normalized Levenshtein distance:

$$\mathrm{sim}_{\text{static}}(q, p) = 1 - \frac{\mathrm{lev}(q, p)}{\max(|q|, |p|)}$$

Here, $\mathrm{lev}(q, p)$ represents the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform $q$ into $p$, and $|\cdot|$ denotes the length of a string. If the similarity score exceeds a threshold $\tau$, the system considers $p$ a close match and directly returns the associated response.
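A minimal sketch of this static check; the threshold value below is an illustrative assumption, since the paper does not specify $\tau$:

```python
def levenshtein(q: str, p: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn q into p (two-row DP)."""
    if len(q) < len(p):
        q, p = p, q
    prev = list(range(len(p) + 1))
    for i, cq in enumerate(q, start=1):
        curr = [i]
        for j, cp in enumerate(p, start=1):
            cost = 0 if cq == cp else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def static_similarity(q: str, p: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    if not q and not p:
        return 1.0
    return 1.0 - levenshtein(q, p) / max(len(q), len(p))

THRESHOLD = 0.9  # illustrative value for tau; not specified in the paper

print(static_similarity("comment payer la taxe", "comment payer la taxe ?"))
```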
Semantic Similarity: If no adequate static match is found, the system resorts to semantic similarity, which captures meaning beyond exact lexical matches. Both the user query and the patterns are embedded into vector representations using a sentence embedding model. The similarity between vectors $u$ and $v$ is then computed using cosine similarity:

$$\mathrm{sim}_{\text{cos}}(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$$

where $u \cdot v$ is the dot product of the vectors and $\|\cdot\|$ denotes the vector norm. The pattern with the highest semantic similarity score determines the best-matching intent and response.
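The cosine computation itself is straightforward; the toy 3-dimensional vectors below stand in for real all-MiniLM-L6-v2 embeddings, which have 384 dimensions:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors: the dot
    product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings"; real sentence embeddings come from the model.
u = [1.0, 0.0, 1.0]
v = [1.0, 1.0, 0.0]
print(cosine_similarity(u, v))  # 0.5 for these toy vectors
```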
3.2.2. French Input
When the user submits a question in French, the processing flow follows the structure illustrated in Figure 3. First, we check for an exact match between the user input and the dataset using static similarity, matching the input against stored patterns character by character. Although the probability of the user entering a question identical to one stored in the dataset is low, our system includes an intelligent suggestion mechanism (Figure 4). This module compares the current user input to stored questions and displays, in real time, a list of suggested queries. These suggestions are exact questions already present in the dataset, encouraging users to select or refine their query accordingly; in such cases, only static similarity is required. If no exact match is found, the system computes the semantic similarity between the input and all available intent patterns using the semantic model [10]. A ranked list of semantically similar questions from various classes is then presented to the user, who can confirm the intended question, after which the corresponding response is returned immediately.
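The suggestion mechanism can be sketched with Python's difflib; the stored questions, the similarity cutoff, and the limit of three suggestions below are illustrative assumptions, not the system's actual configuration:

```python
import difflib

# Stored dataset questions (illustrative examples, not actual entries).
stored_questions = [
    "Comment payer ma taxe d'habitation ?",
    "Comment obtenir une attestation fiscale ?",
    "Quels documents faut-il pour une réclamation ?",
]

def suggest(partial_input, n=3):
    """Return up to n stored questions closest to the user's current
    input, using difflib's ratio as a lightweight lexical similarity."""
    return difflib.get_close_matches(partial_input, stored_questions,
                                     n=n, cutoff=0.3)

print(suggest("comment payer la taxe"))
```

In the real system, this function would be called on every keystroke so suggestions update as the user types.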
3.2.3. Other Languages: English, Standard Arabic, and Spanish
For these standard languages, we use a translation-based strategy built on Google Translate (Figure 5). The user input is translated into French to retrieve the response as described in Section 3.2.2; the response is then translated back into the original language, again using Google Translate.
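A minimal sketch of this pivot flow; translate() and answer_in_french() are toy placeholders standing in for the Google Translate calls and the French matching pipeline of Section 3.2.2, and the lookup tables cover only the demo input:

```python
def translate(text, src, dst):
    # Placeholder: a real implementation would call Google Translate.
    # These toy tables only cover the demonstration strings below.
    forward = {("en", "fr", "How do I pay my tax?"): "Comment payer ma taxe ?"}
    backward = {("fr", "en", "Payez en ligne sur le portail officiel."):
                "Pay online on the official portal."}
    return forward.get((src, dst, text)) or backward.get((src, dst, text), text)

def answer_in_french(question_fr):
    # Placeholder for the French intent-matching pipeline.
    faq = {"Comment payer ma taxe ?": "Payez en ligne sur le portail officiel."}
    return faq.get(question_fr, "Intention non reconnue.")

def answer(user_input, lang):
    question_fr = translate(user_input, src=lang, dst="fr")  # pivot to French
    response_fr = answer_in_french(question_fr)              # match intent
    return translate(response_fr, src="fr", dst=lang)        # back-translate

print(answer("How do I pay my tax?", "en"))
```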
3.2.4. Moroccan Dialect (Darija) Handling
For the Darija dialect, we use the Gemini model, specifically Gemini-1.5-flash [11], to translate the input into French. After testing several LLMs for Darija-to-French translation, we found that Gemini provided the highest accuracy. Once the input is translated into French, it becomes easier to identify the corresponding pattern and retrieve the match from the aligned Darija dataset, as detailed in Algorithm 1.
Algorithm 1: Darija query handling algorithm with static and semantic matching.
1:  Start
2:  Input:
3:      q_d – user input in Moroccan Darija
4:      D_fr – French dataset (intents, patterns, responses)
5:      D_da – aligned Darija dataset (intents, patterns, responses)
6:      M_sem – semantic similarity model (all-MiniLM-L6-v2)
7:      M_tr – translation model from Darija to French (Gemini)
8:  Output: response in Darija
9:  Step 1: Static Matching in Darija
10:     for each intent i in D_da do
11:         for each pattern p in i do
12:             if StaticMatch(q_d, p) then    /* static match found */
13:                 retrieve the response r associated with p
14:                 Return r; End
15:             end if
16:         end for
17:     end for
18:     /* else: no static match found */
19: Step 2: Translate Input
20:     q_fr ← M_tr(q_d)
21: Step 3: Semantic Similarity Matching
22:     for each intent i in D_fr do
23:         for each pattern p in i do
24:             compute the similarity score s(q_fr, p) using M_sem
25:         end for
26:     end for
27:     select the intent i* with the highest average similarity
28: Step 4: Retrieve Response
29:     retrieve the response r from D_da using intent i*
30: Return r
31: End
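Algorithm 1 can be sketched in Python as follows. The matching, translation, and scoring components are injected as placeholders for the real static matcher, the Gemini translator, and the all-MiniLM-L6-v2 model, and the toy datasets are illustrative:

```python
def handle_darija_query(input_darija, darija_data, french_data,
                        static_match, translate_darija_to_french,
                        semantic_score):
    """Sketch of Algorithm 1: static matching on the Darija dataset,
    then Darija-to-French translation and semantic matching as a
    fallback. All callables are placeholders for real components."""
    # Step 1: static matching directly against Darija patterns.
    for intent in darija_data:
        for pattern in intent["patterns"]:
            if static_match(input_darija, pattern):
                return intent["responses"][0]

    # Step 2: no static match; translate to French (Gemini in the paper).
    input_fr = translate_darija_to_french(input_darija)

    # Step 3: semantic similarity against French patterns; keep the
    # intent whose patterns score highest on average.
    best_tag, best_avg = None, -1.0
    for intent in french_data:
        scores = [semantic_score(input_fr, p) for p in intent["patterns"]]
        avg = sum(scores) / len(scores)
        if avg > best_avg:
            best_tag, best_avg = intent["tag"], avg

    # Step 4: return the aligned Darija response for the best intent.
    for intent in darija_data:
        if intent["tag"] == best_tag:
            return intent["responses"][0]
    return None

# Toy data and stand-in components for demonstration.
french = [{"tag": "pay", "patterns": ["comment payer la taxe"],
           "responses": ["Payez en ligne."]}]
darija = [{"tag": "pay", "patterns": ["kifach nkhelles dariba"],
           "responses": ["Khelles online."]}]

resp = handle_darija_query(
    "kifach ndir bach nkhelles", darija, french,
    static_match=lambda q, p: q == p,                  # exact match only
    translate_darija_to_french=lambda q: "comment payer la taxe",
    semantic_score=lambda q, p: 1.0 if q == p else 0.0)
print(resp)
```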
3.3. Audio Input and Output Handling
To enhance the accessibility of our chatbot, we added audio input and output functionality for all supported languages.
Speech Recognition (Input): For French, English, Spanish, and Standard Arabic, we use WebKit Speech Recognition to transcribe speech to text and retrieve the answer, as described in Section 3.2.2 and Section 3.2.3. We handle Darija with the same API, but treat the speech-to-text output as Standard Arabic.
Speech Synthesis (Output) For spoken responses, we integrate ResponsiveVoice to ensure that the chatbot can deliver answers audibly and fluently.
Darija Audio Processing
For Darija TTS, we created a JSON file (Listing 1) with the same architecture as the Darija and French datasets, matching the number of responses and patterns, but storing the path to a recorded audio file instead of text. When the Darija answer is retrieved, its corresponding audio path is retrieved as well, so the response can be played back (Figure 6).
Listing 1. Darija Audio JSON Structure.
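Since this file mirrors the text datasets, an audio-mapping entry might be sketched as follows; the file path, tag, and lookup helper are assumptions for illustration and may differ from the paper's actual Listing 1:

```python
# Hypothetical audio-mapping intent: same tag/patterns layout as the
# text datasets, but responses hold paths to pre-recorded audio files.
# The path and tag are invented for illustration.
audio_intent = {
    "tag": "paiement_taxe",
    "patterns": ["kifach nkhelles dariba"],
    "responses": ["audio/darija/paiement_taxe_01.mp3"],
}

def audio_path_for(tag, audio_data):
    """Look up the recorded-audio path aligned with a Darija answer."""
    for intent in audio_data:
        if intent["tag"] == tag:
            return intent["responses"][0]
    return None

print(audio_path_for("paiement_taxe", [audio_intent]))
```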
4. Experiments and Evaluation
This section presents the evaluation methodology and performance metrics used to assess our multilingual chatbot system. We focus on two key aspects: response accuracy and computational efficiency.
4.1. Evaluation Metrics
We employ the following quantitative metrics to evaluate system performance.
4.1.1. Accuracy Metrics
Top-1 Accuracy: Measures the percentage of queries where the correct answer appears as the first suggestion returned by the system. This reflects the system’s precision in immediate response generation.
Top-3 Accuracy: Measures the percentage of queries where the correct answer appears among the top three suggestions. This indicates the system’s robustness when multiple candidate responses are considered.
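Both metrics can be computed directly from ranked predictions; the toy ranked lists and gold labels below are illustrative:

```python
def top_k_accuracy(predictions, gold, k):
    """Fraction of queries whose gold intent appears among the first k
    ranked predictions. predictions: one ranked intent list per query."""
    hits = sum(1 for ranked, g in zip(predictions, gold) if g in ranked[:k])
    return hits / len(gold)

# Toy example: 4 queries with ranked intent predictions.
ranked = [["pay", "docs"], ["docs", "pay"], ["pay", "claim"], ["claim", "pay"]]
gold = ["pay", "pay", "claim", "pay"]

print(top_k_accuracy(ranked, gold, k=1))  # 0.25: only the first query hits at rank 1
print(top_k_accuracy(ranked, gold, k=2))  # 1.0: every gold intent is in the top two
```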
4.1.2. Performance Metrics
Average Response Time: The mean duration (in seconds) required for:
- Speech-to-text conversion (when applicable);
- Language translation (for non-French queries);
- Semantic similarity computation;
- Response retrieval.
Standard Deviation of Response Time: Measures the variability in processing times across different queries. A low standard deviation indicates consistent performance regardless of input complexity.
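Both timing statistics follow directly from the per-query measurements; the times listed below are invented illustrative values, not measurements from our experiments:

```python
import statistics

# Hypothetical per-query response times in seconds (illustrative only).
times = [0.42, 0.51, 0.39, 0.47, 0.44]

mean_t = statistics.mean(times)
std_t = statistics.stdev(times)  # sample standard deviation

print(f"avg = {mean_t:.3f}s, std = {std_t:.3f}s")
```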
4.2. Experimental Setup
We conducted our experiments using a pre-trained sentence transformer model for semantic similarity, combined with a FAISS index to retrieve the most similar patterns. All code was written in Python 3.11, using the sentence-transformers, scikit-learn, and faiss libraries.
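The retrieval step can be illustrated with a pure-Python stand-in for the FAISS search; the toy 2-dimensional vectors below replace real all-MiniLM-L6-v2 embeddings (384 dimensions), and FAISS's IndexFlatIP performs the same exhaustive inner-product search at scale:

```python
import math

def normalize(v):
    """L2-normalize a vector so inner product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Toy patterns and embeddings; real embeddings come from the model.
patterns = ["comment payer la taxe", "obtenir une attestation",
            "faire une réclamation"]
embeddings = [normalize(v) for v in ([0.9, 0.1], [0.1, 0.9], [0.5, 0.5])]

def search(query_emb, k=3):
    """Exhaustive inner-product search over all pattern embeddings,
    as FAISS's IndexFlatIP does."""
    q = normalize(query_emb)
    scores = [sum(a * b for a, b in zip(q, e)) for e in embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(patterns[i], scores[i]) for i in ranked[:k]]

print(search([1.0, 0.0], k=3)[0][0])
```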
The test sets consist of manually curated utterances in French, Moroccan dialect (Darija), English, Spanish, and modern standard Arabic, each associated with a predefined intent tag. Some non-French inputs were translated into French using Google Translate or a custom Darija-to-French module before similarity computation. For each input, we measured whether the expected intent was returned in the Top-1 or Top-3 predictions based on semantic similarity. We also recorded the average response time and its standard deviation across the test set.
4.3. Results
Table 4 summarizes the performance of our chatbot across the five languages. The system shows the highest Top-1 and Top-3 accuracy in French, which is expected since it is the primary language used during training. Notably, despite the lack of standardized orthography and the linguistic complexity of Moroccan Darija, the system achieves a respectable Top-1 accuracy of 70% and a Top-3 accuracy of 90%. These results outperform those of other languages, such as Standard Arabic, demonstrating the effectiveness of our approach in low-resource, dialectal settings. Furthermore, the average response time remains within acceptable limits, confirming the efficiency of our lightweight method.
5. Conclusions
In this work, we presented a lightweight and efficient chatbot system designed specifically for the Darija dialect, leveraging a similarity-based retrieval approach combined with light translation techniques. Our experiments show that, despite using only around 500 question–answer pairs, the system achieves strong accuracy and low latency, outperforming many heavier multilingual models that require significantly more computational resources. This makes our solution particularly suitable for real-time deployment in resource-constrained environments, such as public service applications.
Future work will focus on expanding the dataset to cover more diverse intents and dialects, improving translation quality, and integrating the chatbot into practical platforms to evaluate user experience in real-world scenarios.
Author Contributions
Conceptualization, O.E., B.E.B., Y.B.M.; methodology, O.E., B.E.B.; software, O.E.; validation, O.E., B.E.B., Y.B.M.; formal analysis, O.E.; investigation, O.E.; resources, B.E.B.; data curation, O.E.; writing—original draft preparation, O.E.; writing—review and editing, O.E., B.E.B., Y.B.M.; visualization, O.E.; supervision, B.E.B., Y.B.M.; project administration, O.E. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding. The APC was funded by Harmony Technology.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The dataset supporting the results reported in this study is not publicly available due to privacy constraints, but can be provided upon reasonable request.
Conflicts of Interest
Oumaima Ennasri conducted this research as an intern at Harmony Technology under the supervision of Brahim El Bhiri, with academic guidance from Yann Ben Maissa at INPT. The research was carried out in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
References
- Kang, Y.; Cai, Z.; Tan, C.W.; Huang, Q.; Liu, H. Natural language processing (NLP) in management research: A literature review. J. Manag. Anal. 2020, 7, 139–172.
- Lalwani, T.; Bhalotia, S.; Pal, A.; Rathod, V.; Bisen, S. Implementation of a Chatbot System Using AI and NLP. Int. J. Innov. Res. Comput. Sci. Technol. (IJIRCST) 2018, 6, 26–30.
- Saoudi, Y.; Gammoudi, M.M. Trends and challenges of Arabic Chatbots: Literature review. Jordanian J. Comput. Inf. Technol. (JJCIT) 2023, 9, 1.
- Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised cross-lingual representation learning at scale. arXiv 2019, arXiv:1911.02116.
- Pires, T.; Schlinger, E.; Garrette, D. How multilingual is multilingual BERT? arXiv 2019, arXiv:1906.01502.
- Gaanoun, K.; Naira, A.M.; Allak, A.; Benelallam, I. DarijaBERT: A step forward in NLP for the written Moroccan dialect. Int. J. Data Sci. Anal. 2024, 20, 917–929.
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv 2019, arXiv:1908.10084.
- Wang, W.; Wei, F.; Dong, L.; Bao, H.; Yang, N.; Zhou, M. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Adv. Neural Inf. Process. Syst. 2020, 33, 5776–5788.
- Douze, M.; Guzhva, A.; Deng, C.; Johnson, J.; Szilvasy, G.; Mazaré, P.E.; Lomeli, M.; Hosseini, L.; Jégou, H. The Faiss library. arXiv 2024, arXiv:2401.08281.
- Hugging Face. sentence-transformers/all-MiniLM-L6-v2. April 2023. Available online: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 (accessed on 1 March 2025).
- Gemini Team; Georgiev, P.; Lei, V.I.; Burnell, R.; Bai, L.; Gulati, A.; Tanzer, G.; Vincent, D.; Pan, Z.; Wang, S.; et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv 2024, arXiv:2403.05530.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).