4.5.1. Comparative Experiments
To comprehensively compare the cross-lingual transfer ability of the Bi-XTM model, this experiment uses three training methods: (1) Zero-shot training: only the source language is used during training, and the model is never exposed to the target language. (2) Single-language training: the model is trained and evaluated separately on each language corpus, without distinguishing between source and target languages. (3) Mixed-language training: both the source language and the target language are used during training. In the zero-shot setting, the main focus is on the model’s cross-lingual transfer ability to Kazakh; since the source languages are Chinese and English, the zero-shot evaluation results on the Chinese and English corpora are the same as those of single-language training. The comparison results of the zero-shot training experiment on the MTOD dataset are shown in Table 2.
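For clarity, the three regimes differ only in how the training pool is assembled. The following minimal sketch (Python, with hypothetical helper and variable names that are not taken from the Bi-XTM implementation) illustrates the assumed data setup.

```python
def build_training_pool(corpora, regime, source_langs, target_lang):
    """Assemble the training pool for one of the three training regimes.

    corpora: dict mapping a language code to a list of
             (utterance, intent_label, slot_labels) examples.
    regime:  "zero_shot" | "single_language" | "mixed_language".
    For "single_language", a dict of per-language pools is returned, since the
    model is trained and evaluated on each language separately.
    """
    if regime == "zero_shot":
        # Only source-language data; the target language is never seen in training.
        return [ex for lang in source_langs for ex in corpora[lang]]
    if regime == "single_language":
        # One training pool per language, trained and evaluated independently.
        return {lang: list(examples) for lang, examples in corpora.items()}
    if regime == "mixed_language":
        # Source-language data plus the target-language data in one pool.
        pool = [ex for lang in source_langs for ex in corpora[lang]]
        return pool + list(corpora[target_lang])
    raise ValueError(f"unknown training regime: {regime!r}")
```

For instance, a zero-shot run on MTOD would call build_training_pool(corpora, "zero_shot", ["en"], "th"), so that the target language only appears at evaluation time.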
Table 2, Table 3 and Table 4 show the comparative experimental results of the Bi-XTM model on the MTOD dataset. From the experimental results, it can be seen that since mBERT did not include Spanish and Thai training data during the pre-training phase, the model performed poorly on the target language data. Compared with mBERT, the XLM-R model used more parallel corpora and monolingual data during the pre-training phase, but this does not mean that it can be directly applied to all languages.
The coverage of Thai in the pre-training data is relatively low, so the XLM-R model may not fully learn and capture the linguistic features of Thai, leading to poor performance on Thai data. The CINO model introduced more low-resource corpora for secondary pre-training, addressing the shortage of data in low-resource languages and improving the model’s adaptability to them, and therefore achieved a significant performance improvement on the target language data. The Co-Transformer model, which uses CINO as its shared encoder, incorporates an explicit bidirectional interaction module that not only shares lower-level representations but also controls the interaction and information transfer between subtasks more finely. Co-Transformer effectively improved intent accuracy on the target language data; however, because errors in the sentence-level intent vector representations propagate when they are transferred to the slot vector representations, its slot scores were negatively affected, with a slight decline in performance on Thai data.
Compared with the baseline models, the Bi-XTM model achieved significant improvements in intent classification accuracy across all languages (English, Spanish, and Thai). Furthermore, on the English and Spanish corpora, the Bi-XTM model effectively alleviated the error propagation caused by bidirectional information transfer, resulting in notable improvements in slot scores. However, its slot filling performance on Thai did not reach the optimal level: the shared encoder introduces semantic vector representations of minority languages, and the information shared across different language families may interfere with one another, leading to data bias. Future research could address this issue by expanding the training samples and labels to include more low-resource languages, such as Thai, to improve the model’s generalization capability in low-resource scenarios. On the English corpus, although the Bi-XTM model showed a slight decrease in sentence-level accuracy, its advantage lies in controlling the bidirectional relationships and information transfer between subtasks more precisely in cross-lingual scenarios, yielding good overall generalization. The slight accuracy drop on certain languages is therefore a trade-off for better model generalization.
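The metrics reported in these tables (intent accuracy, slot F1, and sentence-level accuracy, abbreviated below as Acc, F1, and Sent.Acc) are assumed here to follow the standard conventions for joint intent detection and slot filling; a brief reference formulation is given below.

```latex
% Assumed standard definitions over N evaluation utterances,
% where P and R are span-level slot precision and recall:
\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[\hat{y}_i^{\text{int}} = y_i^{\text{int}}\right], \qquad
\mathrm{F1} = \frac{2PR}{P + R}, \qquad
\mathrm{Sent.Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[\hat{y}_i^{\text{int}} = y_i^{\text{int}} \,\wedge\, \hat{\mathbf{s}}_i = \mathbf{s}_i\right]
```

Here $\hat{\mathbf{s}}_i$ and $\mathbf{s}_i$ denote the predicted and gold slot label sequences of the $i$-th utterance, so Sent.Acc only counts utterances whose intent and entire slot sequence are both correct.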
The comparative experimental results of the Bi-XTM model on the MTOD dataset demonstrate that introducing the subtask interaction module in cross-lingual pre-training models helps alleviate the error propagation issues caused by bidirectional information transfer, positively impacting the joint tasks of intent recognition and slot filling. However, the Bi-XTM model shows performance gaps of 17.8%, 36.6%, and 37.8% in Acc, F1, and Sent.Acc scores between Spanish and Thai, respectively, indicating that cross-lingual transfer between languages from different language families remains challenging.
Specifically, the model’s recognition accuracy on Thai is low. This is primarily because Spanish and English, as widely spoken global languages, have abundant publicly available data resources, allowing the model to fully learn their grammatical and semantic features from large-scale corpora and thus form more accurate representations. In contrast, Thai and minority languages suffer from a lack of data, making it difficult for the model to capture their unique linguistic patterns, which limits classification performance. Spanish and English both belong to the Indo-European language family and share high similarity in grammatical structures and vocabulary composition, enabling the model to transfer and share linguistic knowledge through transfer learning. However, Thai belongs to the Kra–Dai language family, and minority languages such as Tibetan and Mongolian have unique writing systems and grammatical rules that differ significantly from mainstream languages. As a result, it is difficult for the model to directly reuse existing knowledge. These languages require additional adaptation to their specific linguistic characteristics, but due to limited data availability, the model is unable to effectively learn these features, which negatively affects classification performance.
Table 5, Table 6 and Table 7 show the comparative experimental results of the Bi-XTM model on the JISD dataset. During the pre-training phase, both the mBERT and XLM-R models used a large amount of multi-lingual data; however, these datasets lacked sufficient Kazakh data, preventing the models from fully learning the linguistic features and semantic information of Kazakh, leading to poor performance in cross-lingual scenarios.
The CINO model uses a large amount of data from Chinese minority languages, including a significant amount of Kazakh, enabling it to better understand and learn Kazakh grammatical structures and semantic rules. This data volume and diversity give the model a more comprehensive linguistic background and knowledge, significantly improving its recognition performance on Kazakh. The Co-Transformer model still suffers from the error propagation caused by sentence-level intent information transfer: compared with CINO, it shows a slight improvement in intent accuracy but lower slot recognition performance. The Bi-XTM model outperforms the baseline models on all three metrics, demonstrating its excellent cross-lingual transfer ability. However, its overall accuracy has not yet reached a level sufficient for practical application, which motivates a further comparison experiment using the mixed-language training method.
To compare with the mixed-language training method, this experiment also included single-language training, using the Chinese and Kazakh corpora separately as training data to observe the model’s performance on each language. Compared with the zero-shot training method, the mBERT and XLM-R models show significant performance improvements on the Kazakh corpus. However, because they lack specialized training and optimization for low-resource languages, these two models still exhibit a significant performance gap relative to the CINO and Co-Transformer models. Although the Bi-XTM model achieves the best performance on the Kazakh corpus across all three metrics, data scarcity makes it difficult for the model to reach system-level application quality when trained solely on a low-resource single-language corpus.
Mixed-language training allows the model to share knowledge between the source and target languages. There may be certain linguistic similarities and shared features between different languages, and mixed-language training enables the model to encounter multiple languages during the training phase, allowing it to learn these shared features. From the results of the mixed-language training experiment, it can be seen that the Bi-XTM model achieves optimal performance on the Kazakh corpus across all three metrics. However, the model shows an overall decrease in performance on the Chinese corpus compared to the other two training methods. This is because factors such as linguistic differences between languages, domain-specific differences, and lexical ambiguity affect the model’s performance on the source language. Therefore, in practical applications, it is important to consider the proportion of mixed-language training samples based on the specific situation to achieve a performance balance across languages and enhance the model’s generalization ability.
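As a concrete illustration of this trade-off, the sketch below (a hypothetical helper; the actual mixing proportion is not reported in this section and would need to be tuned) subsamples source-language data so that the target language occupies a chosen share of the mixed pool.

```python
import random

def mix_corpora(source_pool, target_pool, target_ratio=0.3, seed=0):
    """Build a mixed-language training pool in which the target language
    accounts for roughly `target_ratio` of the examples (illustrative only)."""
    assert 0.0 < target_ratio < 1.0, "target_ratio must lie in (0, 1)"
    rng = random.Random(seed)
    # Number of source examples implied by the requested mixing ratio.
    n_source = int(len(target_pool) * (1.0 - target_ratio) / target_ratio)
    sampled_source = rng.sample(source_pool, min(n_source, len(source_pool)))
    mixed = sampled_source + list(target_pool)
    rng.shuffle(mixed)
    return mixed
```

Raising target_ratio would tend to help the target language at the cost of the source language, mirroring the Chinese/Kazakh behaviour described above.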
The zero-shot, single-language, and mixed-language comparative experiments conducted on the JISD dataset show that introducing semantic vector representations of Chinese minority languages into cross-lingual pre-trained models, combined with the subtask interaction module, can effectively alleviate the error propagation caused by sentence-level intent information transfer and significantly enhance the model’s cross-lingual transfer ability from Chinese to Kazakh.
4.5.2. Ablation Experiments
The comparative experiment analyzed the cross-lingual transfer performance of the Bi-XTM model on the joint task of question understanding. To further investigate the positive impact of the subtask interaction module on the model, the following ablation experiments were conducted: (1) Encoder: The probability outputs generated by the cross-lingual shared encoder are directly fed into the intent and slot decoders without adding any explicit interaction layers. (2) w/o BP: The subtask interaction layer is removed to explore the effect of collaborative interaction attention on the model’s performance. (3) w/o CA: The collaborative interaction attention layer is removed, keeping the subtask interaction layer, and the word-level encoding method is used to understand the positive impact brought by the subtask interaction layer. (4) Intent: Information transmission from intent vectors to slot vectors in the interaction layer is canceled. (5) Slot: Information transmission from slot vectors to intent vectors in the interaction layer is canceled.
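The five settings can be viewed as switches over the interaction components. The configuration sketch below is illustrative only; the flag names are hypothetical and are not taken from the Bi-XTM code.

```python
from dataclasses import dataclass

@dataclass
class AblationConfig:
    """Hypothetical switches mirroring the five ablation settings."""
    use_coattention: bool = True        # collaborative interaction attention layer
    use_interaction_layer: bool = True  # subtask interaction layer (word-level)
    intent_to_slot: bool = True         # pass intent vectors to the slot side
    slot_to_intent: bool = True         # pass slot vectors to the intent side

ABLATIONS = {
    "Encoder": AblationConfig(use_coattention=False, use_interaction_layer=False,
                              intent_to_slot=False, slot_to_intent=False),
    "w/o BP":  AblationConfig(use_interaction_layer=False),
    "w/o CA":  AblationConfig(use_coattention=False),
    "Intent":  AblationConfig(intent_to_slot=False),   # cancel intent -> slot flow
    "Slot":    AblationConfig(slot_to_intent=False),   # cancel slot -> intent flow
}
```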
Table 8 and Table 9 show the ablation experiment results of the Bi-XTM model on the target language data of the MTOD and JISD datasets, respectively. After removing the subtask interaction module, the model degrades into an implicit joint model (Encoder), exhibiting the worst cross-lingual transfer performance across the three target language corpora. This is because relying solely on shared low-level representations does not enable the model to learn the direct correlation between intent vectors and slot vectors.
When the subtask interaction layer is removed (w/o BP), the model shares local information between the intermediate-layer intent and slot vectors only through the collaborative interaction attention mechanism. Under this setting, intent accuracy improves slightly, but the slot score drops significantly. This indicates that the attention layer captures key local information from the characters, allowing the model to predict intents while also capturing the slot information associated with those intents; this contextual association helps the model understand user input more accurately and build better semantic representations. However, when information is transmitted through sentence-level encoding, a wrongly recognized intent label injects noisy intent information into the slot vectors, causing the slot information to be misinterpreted and thus producing slot recognition errors. When the collaborative interaction attention layer is removed (w/o CA), the improvement in intent accuracy is not significant, but the error propagation caused by sentence-level encoding is effectively mitigated. In the one-way transmission settings (Intent and Slot), cutting off either the intent-to-slot or the slot-to-intent transmission results in only slight performance differences, and the receiving side of the one-way propagation generally performs better; cutting the interaction loop acts as a form of error correction. Although this reduces error propagation, it also blocks the beneficial transmission of intent or slot information. After integrating the collaborative interaction attention layer and the subtask interaction layer (Bi-XTM), the model not only exploits the interdependence between the two subtasks but also filters out erroneous information during recognition through word-level encoding, thus achieving the best performance.
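To make the word-level versus sentence-level distinction concrete, the following schematic (PyTorch, illustrative only; the layer choices and dimensions are assumptions, not the exact Bi-XTM architecture) shows a bidirectional, token-level interaction in which each token attends over the other task’s features instead of receiving one broadcast sentence-level intent vector, so a single wrong sentence-level prediction is not copied into every slot position.

```python
import torch
import torch.nn as nn

class WordLevelInteraction(nn.Module):
    """Schematic bidirectional word-level interaction (not the Bi-XTM code)."""

    def __init__(self, hidden: int = 768, num_heads: int = 8):
        super().__init__()
        self.intent_to_slot = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        self.slot_to_intent = nn.MultiheadAttention(hidden, num_heads, batch_first=True)

    def forward(self, intent_feats: torch.Tensor, slot_feats: torch.Tensor):
        # intent_feats, slot_feats: (batch, seq_len, hidden) word-level features.
        slot_enriched, _ = self.intent_to_slot(slot_feats, intent_feats, intent_feats)
        intent_enriched, _ = self.slot_to_intent(intent_feats, slot_feats, slot_feats)
        # Residual connections keep each task's original information.
        return intent_feats + intent_enriched, slot_feats + slot_enriched
```

Disabling one attention direction (and returning the unmodified features on that side) would correspond to the one-way Intent and Slot settings discussed above.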
The ablation experiment results confirm that introducing the subtask interaction method based on the collaborative interaction attention mechanism, along with label-aware word-level intent and slot vector representations, indeed enhances the interaction between the two tasks, leading to better overall performance.