AraConv: Developing an Arabic Task-Oriented Dialogue System Using Multi-Lingual Transformer Model mT5

Abstract: Task-oriented dialogue systems (DS) are designed to help users perform daily activities using natural language. Task-oriented DS for the English language have demonstrated promising performance; however, developing such systems to support Arabic remains a challenge, mainly because of the lack of Arabic dialogue datasets. This study introduces the first Arabic end-to-end generative model for task-oriented DS (AraConv), which uses the multilingual transformer model mT5 with different settings. We also present an Arabic dialogue dataset (Arabic-TOD) and use it to train and test the proposed AraConv model. The results obtained are reasonable compared to those reported in the studies of English and Chinese using the same mono-lingual settings. To avoid problems associated with a small training dataset and to improve the AraConv model's results, we suggest joint-training, in which the model is jointly trained on Arabic dialogue data and data from one or two high-resource languages such as English and Chinese. The findings indicate that the AraConv model performed better in the joint-training setting than in the mono-lingual setting. The results obtained from AraConv on the Arabic dialogue dataset provide a baseline for other researchers to build robust end-to-end Arabic task-oriented DS that can engage with complex scenarios.


Introduction
Task-oriented dialogue systems (DS) are a type of conversational system designed to help users achieve pre-defined tasks. These systems help humans perform routine activities, such as making restaurant or hotel reservations, searching for attractions, booking flights, enquiring about the weather forecast, and shopping online. Task-oriented DS are considered the core modules of virtual assistants such as Google Assistant, Amazon Alexa, and Apple Siri, which utilize natural language interfaces for various online services [1]. Task-oriented DS allow users to ask questions using natural language and provide answers to those questions in the form of a conversation.
Despite the current progress of state-of-the-art English-based task-oriented DS, it remains a substantial challenge to build systems that can achieve coherent, sustained conversation on diverse topics [2]. Notably, task-oriented DS for Arabic lag behind [3], largely because the shortage of Arabic dialogue datasets has so far precluded the application of advanced data-intensive deep-learning models to the language [4]. Therefore, this study aimed to investigate the effectiveness of the multi-lingual pre-trained language model mT5 [5] for building end-to-end Arabic task-oriented DS. These end-to-end DS must be capable of handling both the dialogue state tracking (DST) task and the response generation task; in this context, DST is mainly responsible for helping to extract the goals and slot-value pairs from the conversation. As such, this work aimed to answer the following major research questions:
RQ1: To what extent can mT5, a multi-lingual pre-trained language model, produce satisfactory results for Arabic end-to-end task-oriented DS?
RQ2: To what extent can joint-training the mT5 model on Arabic dialogue data and data for one or two high-resource languages (namely, English or English and Chinese) improve the performance of Arabic task-oriented DS?
The paper comprises five sections. The next section explores related works in the area of task-oriented DS for both English and Arabic. The third section presents the methodology used in this research, including the data collection process and the model architecture. Next, we detail our experiments, discussing the tasks and evaluation metrics, experimental setup, and findings. Finally, the fifth section summarizes our work and the significance of AraConv before considering possible future research directions.

Related Works
There are two approaches in applying DS: traditional DS and end-to-end DS. Traditional DS use a pipeline that connects, trains, and evaluates each module separately. End-to-end DS are designed to train all modules as a single unit directly on both knowledge-based information and text transcripts [6]. This section discusses the evaluation of task-oriented DS for the English language before surveying the landscape of Arabic task-oriented DS.

English Task Oriented Dialogue Systems
Given the availability of multi-domain English task-oriented dialogue datasets, work on task-oriented DS in the language has progressed from modularized modeling to generative and end-to-end modeling. Given the fact that the traditional DS design complicates tracking the module responsible for interaction failure [6], some studies have built DS using the end-to-end paradigm [7][8][9][10][11][12][13][14][15]. However, building powerful task-oriented DS still engenders many challenges due to the system design complexity and the limited availability of human-annotated data. Therefore, the research community has focused on working with the pre-trained language models to reduce human supervision to the extent possible. This approach involves fine-tuning these models and helping to transfer the prior knowledge to improve various NLP tasks, including task-oriented DS. Large pre-trained language models, such as GPT2 and T5, have been used for various NLP tasks, especially language generation tasks. These new approaches model the dialogue pipeline in an end-to-end manner [6].
Given the high costs associated with data collection and annotation, researchers tend to train their models with as few samples as possible using transfer learning. Transfer learning represents one of the most successful few-shot learning approaches for task-oriented DS. It refers to pre-training large language models on text or task-related data and then fine-tuning them on a few samples. Such systems have proved successful in task-oriented DS, as demonstrated by the work presented in [12][13][14][15][16][17][18][19][20][21][22].
The task-oriented DS literature includes two study categories: studies targeting only DST and studies targeting both DST and response generation. Dialogue state tracking mainly helps to extract the goals (intents) and slot-value pairs from the conversation to maintain the dialogue belief state (BS) and the summary of the dialogue history. The BS contains information about the dialogue from the system perspective [6]. At each user turn during the conversation, the input to the DST comprises the previous BS, the outputs of the intent classification (the goal), and slot filling information; thus, the DST output is the new/updated BS. For end-to-end dialogue generation, the system indicates the correct required information and generates the appropriate response.
Appl. Sci. 2022, 12, 1881
For the first category, studies targeting only DST, some studies focus on handling the DST task to guarantee building a good base for the whole dialogue system [23][24][25][26][27][28][29]. Meanwhile, other studies have targeted both DST and response generation in an end-to-end manner [11,12]. Table 1 summarizes the available models for task-oriented DS in English, including datasets and performance measures. Although the models have achieved promising results, they have been designed for English-language task-oriented DS, and, to the best of our knowledge, no research exists concerning Arabic-language task-oriented DS.
Nonetheless, the promising performance of pre-trained language models for English-language task-oriented DS has prompted efforts to produce multi-lingual models for task-oriented DS in other languages. Many of these languages are considered low-resource languages due to the absence of high-quality data in the language, and most existing task-oriented DS do not support low-resource languages, creating a gap between the performance of low-resource language systems and high-resource systems. Therefore, providing datasets for low-resource languages is critical to driving the development of efficient end-to-end task-oriented DS for these languages. Several existing studies have built task-oriented DS for low-resource languages using cross-lingual transfer learning [1,30,31]. This involves transferring knowledge from high-resource to low-resource languages, enabling the satisfactory performance of end-to-end task-oriented DS.

Arabic Task-Oriented Dialogue Systems
Considering the maturity of research concerning English-based task-oriented DS, we find that task-oriented DS research remains in its infancy for Arabic. This is due to a lack of fundamental NLP resources and a scarcity of datasets for Arabic task-oriented DS. Most of the research on Arabic task-oriented DS focuses on achieving specific tasks, such as intent classification [34][35][36] and entity classification [34]. However, some attempts to build task-oriented DS have investigated specific domains, including home automation [34], flight bookings [37], education [38][39][40], hotel reservations [41], and Islamic knowledge enquiries [42]. Some Arabic task-oriented DS have been designed specifically to serve the Arabic dialects (e.g., OlloBot [43] and Nabiha [44]). However, this review excludes some of these studies [39,40,42,43,44] because their system design does not follow a task-oriented DS structure, so they are categorized as chatbots rather than task-oriented DS.
Notably, Bashir et al. [34] used deep learning approaches to build a natural language understanding module for Arabic task-oriented DS for home automation. The module manages both intent classification and entity extraction tasks. For intent classification, it uses LSTMs and CNNs; for entity extraction, BiLSTM and character-based word embeddings are used. The study used data collected via an online survey and the AQMAR dataset. The data were filtered and labeled according to the CoNLL-2003 NER format. The findings for the intent classification demonstrated that CNNs performed better than LSTMs (F-score = 94%). For entity extraction, the model obtained comparable results to the named entity recognition benchmarks in English (F-score = 94%).
Meanwhile, Elmadany et al. [35] used a multi-class hierarchical model to solve the dialogue act classification problem associated with Arabic dialects. They used a manually collected and annotated dataset from multi-genre Egyptian call centers to evaluate their system performance. Using an SVM classifier produced an average F-score of 91.2%, indicating an improvement of 20% compared to the state-of-the-art approach. Elsewhere, Joukhadar et al. [36] examined different machine learning approaches to recognizing user acts in a text-based DS for the Levantine Arabic dialect. They manually produced 873 sentences for both restaurant orders and flight booking, reporting an accuracy of 86% using the SVM model. However, their small dataset was insufficient to build an efficient dialogue system, suggesting an imperative to develop large multi-domain datasets or more efficient techniques.
For Arabic user-based DS, several studies [37,38,41] have applied pattern matching, rule-based, or hybrid rule-based and data-driven approaches to task-oriented DS. Most Arabic task-oriented DS use either rule-based or pattern matching approaches, with very few using a hybrid approach. This preference is understandable given the challenges associated with building task-oriented DS in Arabic [3], among which is the lack of Arabic task-oriented dialogue datasets. Therefore, this study aimed to address this challenge by leveraging pre-trained language models to build an Arabic task-oriented DS. Multi-lingual language models are among the most popular language models and have been observed to perform well on task-oriented DS for many languages. Accordingly, we explored the extent to which mT5 can be useful for building an Arabic task-oriented dialogue system. To the best of our knowledge, this work represents the first attempt at pre-training a large transformer-based language representation model on an Arabic task-oriented dialogue dataset (Arabic-TOD).

Method
A pre-trained language model is a deep learning model that has been trained on a large amount of data to perform particular NLP tasks [45]. Figure 1 shows a high-level view of the approach adopted. We began with the English BiToD dataset [1], translating the dialogues into Arabic to produce the Arabic-TOD dataset. The dataset was then preprocessed and prepared for the training step. Subsequently, we trained the models on the training Arabic-TOD dataset using different settings. Finally, we used the testing Arabic-TOD dataset to test the models and obtain the results for the AraConv model.

Arabic Task-Oriented DS Dataset
Because Arabic is a low-resource language, no human-annotated Arabic dataset for task-oriented DS has been produced (to the best of our knowledge). To obtain a good-quality dataset, we decided to use an existing dataset, translating a benchmark dataset for task-oriented DS (BiToD [1]) to develop a suitable training dataset for Arabic task-oriented DS.
Translating existing datasets is a practice frequently observed in the literature for low-resource languages, with examples including [46][47][48]. Recent translation techniques for crowd-sourced annotated datasets have produced reasonable results on training data for different languages, enabling many studies to address the lack of datasets by translating existing datasets for many downstream tasks in NLP. For example, for question answering (QA), the SQuAD dataset has been translated into Arabic [46] and Bengali [47], and for conversation generation, the EmpatheticDialogues dataset has been translated into Arabic [48].
Still, it is imperative for the research community to develop multi-lingual benchmarks to evaluate the cross-lingual transferability of end-to-end systems in general and task-oriented DS in particular [49]. For task-oriented DS, many multi-lingual datasets can be obtained by translating the English datasets. Table 2 presents some of these alongside their corresponding tasks and domains. Translation represents a good choice for low-resource languages to support the reuse of resources and save time spent creating and annotating long dialogues. Additionally, this enables the development of multi-lingual benchmarks for the research community to use.

Structure and Organization of Arabic-TOD Dataset
The Arabic-TOD dataset is based on the BiToD dataset, the first large bilingual task-oriented dialogue dataset created for training and evaluating end-to-end task-oriented DS. It contains annotated English and Chinese dialogues and features a total of 7232 dialogues with 144,798 utterances (3689 dialogues in English and 3543 dialogues in Chinese). The dialogues range between 10 and more than 50 turns with an average length of 19.98 turns. Each turn can be defined as one or more utterances from one speaker [56]. The BiToD dataset includes dialogues in five domains: Hotels, Restaurants, Weather, Attractions, and Metro.
Although there are many other common multi-domain task-oriented dialogue datasets, including MultiWOZ, we chose to translate the BiToD dataset to leverage certain useful features that distinguish it from other datasets [1]. Notably, the BiToD dataset supports mixed-language contexts, also known as code-switching. Some items in the knowledge base (and in daily life) feature mixed-language information, meaning English and Arabic texts appear in the same utterance. For example, some restaurant names in English cannot be translated into Arabic, such as Chom Chom, which keeps its English name even when the conversation is in Arabic (i.e., "Chom Chom"). Another advantageous feature of the BiToD dataset is its use of a deterministic API, which simplifies model evaluation. A deterministic API recommends the query-matched items on the basis of certain criteria (e.g., user rating). This differs from other API evaluation methods, in which the API randomly returns only one or two matched items. Another important aspect of the BiToD dataset is the diversity of user tasks, meaning users might want to book hotels and restaurants within the same dialogue, as they might in real human interactions. As such, we decided to contribute to enriching and augmenting the BiToD dataset by translating the English dialogues into Arabic, producing a multi-lingual dataset enabling the combined use of English, Chinese, and Arabic. Table 3 summarizes the different common multi-domain task-oriented dialogue datasets, indicating the features that we have tried to utilize.
For the translation task, three bilingual speakers of Arabic and English were paid to manually translate the English BiToD dataset into Arabic over 2.5 months, translating the utterances and slot-values in the Hotels, Restaurants, Weather, and Attractions domains. We defined the translation strategy and the lexicons to be used in advance, and we gave the translators some examples of the target translated dialogues. Of the 3689 English dialogues, 1500 dialogues (30,000 utterances) were translated into Arabic. The translated utterances and slot-values were reviewed to verify the quality of translation and correctness of slot-value pairs on the basis of the English BiToD dataset.
The Arabic-TOD dataset contains dialogues of different lengths, some with a single task and others with between 2 and 4 tasks. For instance, a single dialogue can involve enquiring about the weather, finding a restaurant to eat at, and finding an attraction to visit.
To the best of our knowledge, Arabic-TOD is the first Arabic dataset supporting a mixed-language context for task-oriented DS that has been annotated following the BiToD dataset's structure [1].
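The deterministic API behaviour described above for BiToD can be illustrated with a minimal Python sketch. It filters a toy knowledge base by the user's constraints and always returns the best-rated match, so the same query yields the same result regardless of storage order. The function name `deterministic_search`, the field names, and every value in `EXAMPLE_KB` are hypothetical and are not taken from the BiToD code base or data.

```python
from typing import Any, Dict, List

# Hypothetical toy knowledge base; names, fields, and ratings are illustrative only.
EXAMPLE_KB: List[Dict[str, Any]] = [
    {"name": "Restaurant B", "cuisine": "vietnamese", "rating": 7},
    {"name": "Chom Chom", "cuisine": "vietnamese", "rating": 9},
    {"name": "Restaurant C", "cuisine": "italian", "rating": 10},
]


def deterministic_search(
    kb: List[Dict[str, Any]],
    constraints: Dict[str, Any],
    sort_key: str = "rating",
    top_k: int = 1,
) -> List[Dict[str, Any]]:
    """Return the top_k items matching all constraints, ranked by sort_key.

    Ties are broken by name so that the result never depends on KB order.
    """
    matches = [
        item for item in kb
        if all(item.get(slot) == value for slot, value in constraints.items())
    ]
    matches.sort(key=lambda item: (-item[sort_key], item["name"]))
    return matches[:top_k]


best = deterministic_search(EXAMPLE_KB, {"cuisine": "vietnamese"})
print(best[0]["name"])
```

Because the ranking is fixed, an evaluation script can compare the entity mentioned in a generated response with the single expected entity, which is harder to do when an API returns a random subset of matches.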

Model Architecture
The AraConv model's generation process is based on a single multi-lingual Seq2Seq (mSeq2Seq) model that uses the pre-trained model mT5 [5], a multilingual variant of T5 [57]. It can be formally defined as follows. Assume the dialogue $D$ is a sequence of user utterances ($U_t$) and system utterances ($S_t$) up to turn $t$, where $D = \{U_1, S_1, \ldots, U_t, S_t\}$.
The dialogue history ($H_t$) holds the current user utterance and the preceding user and system utterances of turn $t$, as specified by the context window size ($w$), where $H_t = \{U_{t-w}, S_{t-w}, \ldots, S_{t-1}; U_t\}$. For turn $t$, the dialogue state is represented as $B_t$, and the knowledge state is represented as $K_t$. Figure 2 illustrates the proposed workflow for response generation using the mSeq2Seq model based on the BiToD dataset [1].
Initially, we set the dialogue state and knowledge state to empty strings as $B_0$ and $K_0$. Then, we considered the current dialogue history ($H_t$), previous dialogue state ($B_{t-1}$), and previous knowledge state ($K_{t-1}$) as input at turn $t$. We added the prompt $P_B$ = "TrackDialogueState:" to indicate the generation task [57]. The mSeq2Seq model then produces the Levenshtein belief span at turn $t$ ($Lev_t$), a text span that contains the information for updating the dialogue state from $B_{t-1}$ to $B_t$. $Lev_t$ can be represented by the following equation:
$$Lev_t = \mathrm{mSeq2Seq}(P_B, H_t, B_{t-1}, K_{t-1}) \tag{1}$$
Then, the model generates an output based on a new input that comprises the updated dialogue state ($B_t$) and the response generation prompt, referred to as $P_R$ = "Response:", at the current turn $t$. If there is a need for an API call, the model generates an API name according to the following:
$$API = \mathrm{mSeq2Seq}(P_R, H_t, B_t, K_{t-1}) \tag{2}$$
In this case, the system queries the API with the particular constraints in the dialogue state and updates the knowledge state from $K_{t-1}$ to $K_t$. The updated knowledge state ($K_t$) and API name ($API$) are subsequently combined to generate the next turn response. Otherwise, the model generates a textual response ($R$) that is returned directly to the user:
$$R = \mathrm{mSeq2Seq}(P_R, H_t, B_t, K_t) \tag{3}$$
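The turn-level workflow described above (state tracking first, then either an API call or a direct response) can be sketched in Python as follows. This is a minimal illustration rather than the AraConv implementation: the model is passed in as a plain `generate` callable, and the input layout (`user:`, `system:`, `state:`, `knowledge:` labels), the `API:` output marker, and the helpers `apply_lev` and `call_api` are all hypothetical. Only the two prompts and the window size `w` come from the text.

```python
from typing import Callable, List, Tuple

P_B = "TrackDialogueState:"  # state-tracking prompt named in the text
P_R = "Response:"            # response-generation prompt named in the text
API_MARKER = "API:"          # hypothetical marker for "the model asked for an API call"


def dialogue_history(turns: List[Tuple[str, str]], user_utt: str, w: int) -> str:
    """Flatten the last w (user, system) exchanges plus the current user utterance."""
    recent = turns[-w:] if w > 0 else []
    parts: List[str] = []
    for user, system in recent:
        parts.append(f"user: {user}")
        parts.append(f"system: {system}")
    parts.append(f"user: {user_utt}")
    return " ".join(parts)


def build_input(prompt: str, history: str, state: str, knowledge: str) -> str:
    """Serialize prompt, history, dialogue state, and knowledge state (assumed layout)."""
    return f"{prompt} {history} state: {state} knowledge: {knowledge}"


def run_turn(
    generate: Callable[[str], str],
    apply_lev: Callable[[str, str], str],
    call_api: Callable[[str, str], str],
    turns: List[Tuple[str, str]],
    user_utt: str,
    state: str = "",
    knowledge: str = "",
    w: int = 2,
) -> Tuple[str, str, str]:
    """Run one dialogue turn and return (response, new_state, new_knowledge)."""
    history = dialogue_history(turns, user_utt, w)
    # Pass 1: generate the belief span and update the dialogue state.
    lev = generate(build_input(P_B, history, state, knowledge))
    new_state = apply_lev(state, lev)
    # Pass 2: generate either an API name or a textual response.
    output = generate(build_input(P_R, history, new_state, knowledge))
    if output.startswith(API_MARKER):
        # Query the API, refresh the knowledge state, and generate the response.
        api_name = output[len(API_MARKER):].strip()
        knowledge = call_api(api_name, new_state)
        output = generate(build_input(P_R, history, new_state, knowledge))
    return output, new_state, knowledge
```

In practice `generate` would wrap a fine-tuned mT5 checkpoint; keeping it as a parameter lets the control flow be exercised with a stub before any model is loaded.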


Experiments
This section first explains the evaluation metrics used to measure the performance of the AraConv model. Next, we describe the experimental setup and detail the experiments performed to test our hypothesis. Finally, we discuss the results of each experiment.

Evaluation Metrics
This study addresses two main tasks: DST and end-to-end dialogue generation, which includes both DST and response generation. To evaluate the DST performance of the AraConv model, we used the joint goal accuracy (JGA) metric to compare the predicted dialogue state to the ground truth for each dialogue turn. If all predicted slot values exactly match the ground-truth values, the model's output is considered correct. To evaluate the performance of the AraConv model on the end-to-end generation task, we used four metrics:
• the BLEU metric to assess the fluency of the generated response;
• the API call accuracy ($\mathrm{API}_{\mathrm{Acc}}$) metric to assess whether the system generates the correct API call;
• the task success rate (TSR) metric to assess whether the system finds the correct entity and provides all of the requested information for a particular task. TSR can be defined as
$$\mathrm{TSR} = \frac{\sum \text{successful tasks}}{\text{total number of tasks}} \tag{4}$$
where the tasks comprise the search and booking tasks for the hotel and restaurant domains and the search task for the attraction and weather domains;
• the dialogue success rate (DSR) metric to evaluate whether the system accomplishes all of the dialogue tasks. DSR can be defined as
$$\mathrm{DSR} = \frac{\sum \text{successful dialogues}}{\text{total number of dialogues}} \tag{5}$$
The evaluation method's main goal is to obtain an automated and repeatable evaluation procedure that enables efficient comparisons of the quality of different dialogue strategies. This involves focusing on the automatic evaluation metrics. However, further measurement of the quality of the generated responses also requires human review. Thus, following the literature [15], we evaluated the AraConv model's performance on end-to-end generation tasks according to two metrics:
• the language understanding score to indicate the extent to which the system understands user inputs; and
• the response appropriateness score to indicate whether the response is appropriate and human-like.
We performed a small-scale human review to measure these scores. The literature indicates two other common metrics used in human evaluation: TSR and DSR [56]. Given the cost and time-intensiveness of human evaluation, we measured TSR and DSR automatically.
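The automatic metrics defined in this section (JGA, TSR, and DSR) can be computed with a few lines of Python. The sketch below assumes that dialogue states are flat slot-to-value dictionaries and that task outcomes are already available as booleans grouped per dialogue; the function names, the slot names, and the example data are illustrative assumptions, not the evaluation script used in the study.

```python
from typing import Dict, List

State = Dict[str, str]


def joint_goal_accuracy(predicted: List[State], gold: List[State]) -> float:
    """Fraction of turns whose predicted state matches the gold state exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


def task_success_rate(dialogues: List[List[bool]]) -> float:
    """Successful tasks divided by the total number of tasks over all dialogues."""
    tasks = [ok for dialogue in dialogues for ok in dialogue]
    return sum(tasks) / len(tasks) if tasks else 0.0


def dialogue_success_rate(dialogues: List[List[bool]]) -> float:
    """Fraction of dialogues in which every task succeeded."""
    if not dialogues:
        return 0.0
    return sum(all(dialogue) for dialogue in dialogues) / len(dialogues)


# Illustrative data: two turns for JGA, three dialogues with 2, 2, and 1 tasks.
pred = [{"hotel.rating": "5"}, {"hotel.rating": "5", "hotel.stars": "4"}]
gold = [{"hotel.rating": "5"}, {"hotel.rating": "5", "hotel.stars": "3"}]
outcomes = [[True, True], [True, False], [True]]

print(joint_goal_accuracy(pred, gold))   # one of two turns matches exactly
print(task_success_rate(outcomes))       # 4 of 5 tasks succeed
print(dialogue_success_rate(outcomes))   # 2 of 3 dialogues fully succeed
```

The example also shows why DSR tends to be lower than TSR for multi-task dialogues: a single failed task makes the whole dialogue count as unsuccessful.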

Experimental Setup
Our framework uses the pre-trained multi-lingual model mT5-small. All of our experiments used the Transformers library [58] and the deep learning framework PyTorch [59]. We trained all of the models using an AdamW optimizer [60] (with an initial learning rate of 0.0005). We set the dialogue context window size (w) at 2 and the batch size at 128 in accordance with the approach observed to obtain the best results in the extant literature.
We split our Arabic-TOD dataset into 67%, 7%, and 26% for training, validation, and testing, resulting in 1000, 100, and 400 training, validation, and testing dialogues, respectively. For the mono-lingual setting, we trained the model for 20 epochs; for the bi-lingual and multi-lingual settings, we trained the models for 8 epochs. Training using Google Colab required approximately 22 hours.
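For reference, the hyperparameters reported in this subsection can be gathered into a small configuration object. The sketch below is only a convenient way to record them: the class name `TrainConfig` and its field names are hypothetical, and `google/mt5-small` is assumed to be the Hugging Face Hub identifier of the mT5-small checkpoint used; the numeric values are those stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters reported in the experimental setup (field names are illustrative)."""
    model_name: str = "google/mt5-small"  # assumed Hub id for mT5-small
    optimizer: str = "AdamW"
    learning_rate: float = 5e-4           # initial learning rate of 0.0005
    batch_size: int = 128
    context_window: int = 2               # dialogue context window size w
    epochs: int = 20                      # mono-lingual setting


# Mono-lingual training uses the defaults; joint training uses fewer epochs.
MONO_LINGUAL = TrainConfig()
JOINT_TRAINING = TrainConfig(epochs=8)    # bi-lingual and multi-lingual settings

# Dialogue counts of the Arabic-TOD split stated in the text.
SPLIT = {"train": 1000, "validation": 100, "test": 400}

print(MONO_LINGUAL)
print(JOINT_TRAINING.epochs, sum(SPLIT.values()))
```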

Baseline
As this is the first work to build an Arabic end-to-end generative model for task-oriented DS, there is no directly comparable approach in previous Arabic studies. Therefore, we experimented with several initial baselines using the zero-shot setting, that is, transferring a model trained to solve task-oriented DS in English to solve the same task in Arabic. We trained the mT5 model on the English BiToD dataset and then tested its performance directly on the Arabic-TOD dataset. This approach is common practice in many downstream tasks such as QA [61,62] and task-oriented DS [63]. The performance of these initial baselines was very low; therefore, we set our baseline using the same zero-shot concept, but with the mT5 model trained on mixed-language training data obtained by replacing the most task-related keyword entities in the English BiToD dataset with their Arabic counterparts from a parallel dictionary.
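The mixed-language baseline data described above can be approximated with a simple dictionary substitution. The following Python sketch replaces whole-word English keywords with Arabic counterparts; the function name `mix_language`, the three-entry dictionary, and the example sentence are hypothetical illustrations, since the actual parallel dictionary and keyword selection used for the baseline are not given in the text.

```python
import re
from typing import Dict

# Hypothetical parallel dictionary (lowercase English keyword -> Arabic counterpart).
EXAMPLE_DICTIONARY: Dict[str, str] = {
    "hotel": "فندق",
    "restaurant": "مطعم",
    "weather": "الطقس",
}


def mix_language(utterance: str, dictionary: Dict[str, str]) -> str:
    """Replace whole-word dictionary keywords in an English utterance."""
    if not dictionary:
        return utterance
    # Longest keys first so that longer keywords win over shorter overlapping ones.
    keys = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        flags=re.IGNORECASE,
    )
    # Dictionary keys are lowercase, so normalize the matched text before lookup.
    return pattern.sub(lambda m: dictionary[m.group(0).lower()], utterance)


mixed = mix_language("I need a hotel near a good restaurant.", EXAMPLE_DICTIONARY)
print(mixed)
```

Whole-word matching keeps inflected forms such as "hotels" untouched in this sketch; a real pipeline would need a policy for plurals and multi-word entities.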

Experiments
RQ1: To what extent can mT5, a multi-lingual pre-trained language model, produce satisfactory results for Arabic end-to-end task-oriented DS?
This experiment aimed to investigate the performance of an end-to-end Arabic task-oriented dialogue system using an mSeq2Seq model for Arabic. This mono-lingual setting requires only one language to train and test the model. Thus, we trained and tested the proposed mT5 model (AraConv) using the Arabic-TOD dataset. AraConv differs from the baseline in its training setting: AraConv was trained on Arabic dialogues, whereas the baseline was not (zero-shot learning). Table 4 shows the results (in terms of BLEU, $\mathrm{API}_{\mathrm{Acc}}$, TSR, DSR, and JGA) of the AraConv model in the mono-lingual setting in comparison to the English and Chinese experiments on the BiToD dataset [1]. The English results reported in [1] outperformed the AraConv results. This is unsurprising because more data are available for English and Chinese: the models trained and tested on English or Chinese data performed better than the model tested on the Arabic-TOD dataset, which represents only 27% of the BiToD dataset [1]. Moreover, in the multilingual corpus on which the original mT5 model was trained, English data represented 5.67% of the whole corpus, whereas Chinese and Arabic represented 1.67% and 1.66% of the total data, respectively [5], which helps explain the superior performance for English dialogue. Additionally, Arabic is a language with extensive grammatical case marking [5], which leads to lower evaluation scores than for English. Meanwhile, despite the comparable shares of Arabic and Chinese in the mT5 training data, the mono-lingual model trained on the Chinese BiToD dataset outperformed the AraConv model. This may have been due to the small size of the Arabic-TOD dataset compared to the Chinese BiToD dataset. Nonetheless, the AraConv model achieved a better BLEU value (by approximately 63%) than the Chinese model, meaning that the AraConv model can generate more fluent responses than the Chinese model.
Still, the AraConv model did not achieve perfect results, potentially due to the complicated nature of the Arabic-TOD dataset, its complex ontology, and its diversity of user goals. Moreover, the DSR result was lower than the TSR result, likely because of the multiple tasks included in the dialogue (2-4 tasks). For instance, some dialogues included multiple tasks in a single dialogue (e.g., a single dialogue can involve the tasks of finding a hotel to stay at, a restaurant to eat at, an attraction to visit, and information about the weather).
RQ2: To what extent can joint-training the mT5 model on Arabic dialogue data and data for one or two high-resource languages (namely, English or English and Chinese) improve the performance of Arabic task-oriented DS?
Answering this research question requires performing two experiments to investigate the performance of building an end-to-end Arabic task-oriented dialogue system using an mSeq2Seq model in bi-lingual and multi-lingual settings. Because two languages are used to train and test the model in the bi-lingual setting, we trained the proposed model mT5 on both the Arabic-TOD and English-BiToD datasets [1].
In the experiments described in [1], the models were trained on almost the same number of English and Chinese dialogues (2952 and 2835, respectively). However, our Arabic-TOD dataset includes only 27% of the data included in the BiToD datasets. Accordingly, we investigated two cases: an equipollent case ($\text{AraConv}_{\text{Bi-Q}}$) and a non-equipollent case ($\text{AraConv}_{\text{Bi-NQ}}$), the latter trained on more English data.
Because three languages were used to train and test the model for the multi-lingual experiment, the mT5 model was trained on the Arabic-TOD, the English BiToD, and the Chinese BiToD datasets [1]. As in the previous experiment, we investigated two cases: an equipollent case ($\text{AraConv}_{\text{M-Q}}$) and a non-equipollent case ($\text{AraConv}_{\text{M-NQ}}$).
For the bi-lingual setting, Table 5 compares the AraConv results (in terms of BLEU, $\mathrm{API}_{\mathrm{Acc}}$, TSR, DSR, and JGA) to the experiments reported in [1] regarding English and Chinese dialogues with the same settings. We observed that the non-equipollent bi-lingual AraConv model ($\text{AraConv}_{\text{Bi-NQ}}$) outperformed the equipollent bi-lingual AraConv model ($\text{AraConv}_{\text{Bi-Q}}$), demonstrating the impact of training dialogue dataset size on the final model given that the $\text{AraConv}_{\text{Bi-NQ}}$ model is trained on more data. Therefore, using more English data in training with Arabic helps to improve the result because the semantics of the English conversations are very similar to those of the Arabic ones, especially for the task-related words. However, the model in [1], which was trained on both English and Chinese data and then tested on English, outperformed all models, assuming the dialogues in the two datasets were almost the same. As discussed, the superior performance of the English model could have been due to the amount of English data used to train the mT5 model. Nonetheless, we observed that the AraConv model performed better according to the BLEU metric than the Chinese model, despite training on the same English dataset (as a second dataset for joint-training), confirming the greater fluency of the AraConv model.
For the multi-lingual setting, Table 6 presents the AraConv results in terms of BLEU, APIACC, TSR, DSR, and JGA. Our findings reinforce the results of the bi-lingual experiment: the non-equipollent multi-lingual AraConv model (AraConv M-NQ) performed better than the equipollent multi-lingual AraConv model (AraConv M-Q). Accordingly, we conclude that joint-training on multiple languages that include the target language (in this case, Arabic) improves the results on the target language, which aligns with the results reported in [30]. In general, for both the bi-lingual and multi-lingual experiments, the trained models can simultaneously handle dialogues in multiple languages (whether English, Chinese, or Arabic) without being given any language identifiers during testing.
For the human review, we aimed to rate dialogues or utterances on the basis of certain metrics identified in the literature [56]. Five expert researchers (who are independent of the authors of this paper) were chosen for this task. We randomly selected 20 complete dialogue sessions from the dialogues generated by the AraConv model. The researchers were asked to rate these dialogues by providing language understanding and response appropriateness scores. The scores ranged from 0 (extremely bad) to 5 (extremely good), depending on the system's response. Subsequently, we evaluated the reliability of their ratings using Fleiss' kappa [64]. The overall Fleiss' kappa values for the language understanding and appropriateness scores were 0.253 and 0.229, respectively, indicating "fair agreement".
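For readers unfamiliar with the reliability statistic, the following is a minimal sketch of how Fleiss' kappa can be computed from a table of rating counts. It uses the standard textbook formula; the function name and the toy ratings are hypothetical and are not the ratings collected in this study.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Every item must be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement among rater pairs, per item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical example: 4 dialogues, 5 raters, scores 0..5 (6 categories).
toy_counts = [
    [0, 0, 0, 1, 2, 2],
    [0, 0, 1, 2, 2, 0],
    [0, 0, 0, 0, 3, 2],
    [0, 1, 1, 2, 1, 0],
]
print(round(fleiss_kappa(toy_counts), 3))
```

A value of 1 corresponds to perfect agreement and a value near 0 to chance-level agreement. The "fair agreement" label quoted for 0.253 and 0.229 matches the commonly used 0.21 to 0.40 band.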

Conclusions and Future Work
To the best of our knowledge, this work represents the first attempt to build an end-to-end Arabic task-oriented dialogue system (AraConv) using a pre-trained transformer-based multi-lingual language model. We utilized the highly regarded multi-lingual model mT5 to build an end-to-end Arabic task-oriented dialogue system with different settings and presented the Arabic-TOD dataset, which is based on translating 27% of the BiToD dataset's English dialogue data into Arabic. The Arabic-TOD dataset is considered the first dialogue dataset for Arabic task-oriented DS that supports code-switching. Although training and testing the model on the Arabic-TOD dataset in a mono-lingual setting yields reasonable performance for the AraConv model compared to the results observed for the English and Chinese BiToD datasets in the same settings, the performance is undermined by the small size of the Arabic-TOD dataset. To overcome this problem, we considered joint-training the model on Arabic dialogue data and data from one or two high-resource languages (English, or both English and Chinese). The findings reveal that the AraConv model in the multi-lingual setting outperformed the AraConv model in the mono-lingual setting, and that multi-lingual training on English, Chinese, and Arabic data was better than bi-lingual training on only English and Arabic data. Thus, the AraConv model can be considered a good baseline for building robust end-to-end Arabic task-oriented DS that can engage with complex scenarios.
The main limitation of this work is the small size of the Arabic-TOD dataset. A related limitation is that the Arabic-TOD dataset contains non-Arabic entities: the dataset is code-switched because the entities of the original BiToD dataset were retained. However, we leveraged this property to align the model with the routine usage of such entities in conversation. In the future, we aim to extend the Arabic-TOD dataset to equal the BiToD dataset in terms of the number of dialogues. Additionally, we plan to examine cross-lingual models, especially those involving the Arabic-TOD dataset. Furthermore, we plan to develop Arabic task-oriented DS using other multilingual language models (e.g., mBART [65]). Another possible avenue for future work is using a pre-trained Arabic model such as AraT5 [66], which had not yet been released at the time of working on this paper, for Arabic task-oriented DS.