3.1. Multi-Task Learning Model Structure
In our work, we focused on Kazakh query understanding and utilized a deep learning model that integrates Kazakh linguistic features.
The multi-task learning model (MTQU) for the QC and NER tasks presented in this paper is composed of five layers: a feature extraction layer, a BiLSTM layer, an attention layer, a pooling layer, and an output layer.
Figure 1 shows the detailed model structure, illustrated with the example sentence ‘Тыйаншан үлкен шатқалы қай жерге орналасқан?’. In Figure 1, each word corresponds to one named entity label, and a single question class is assigned to the whole sentence.
3.2. Feature Extraction Layer
Kazakh is a typical agglutinative language, meaning that adding prefixes or suffixes to the same root can generate hundreds or even thousands of words. This characteristic is likely to cause data sparseness problems in natural language processing. To alleviate data sparseness effectively, words must be broken down into stems and morphemes through morphological analysis. To illustrate this, consider the English phrase ‘People who are currently traveling’, which can be translated into Kazakh with the single word ‘саяхаттағылардың’. This word can be broken down into a root and additional suffixes, ‘саяхат+та+ғы+лар+дың’, where the first segment is the stem and the four segments spliced behind it are suffixes: two inflectional and two derivational. Notably, each time a suffix is added, the part of speech of the stem changes. This example illustrates the complex morphological features of Kazakh.
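To make the segmentation concrete, the following minimal Python sketch represents the example word as a shared stem plus its suffix chain; the toy lookup stands in for a real morphological analysis system and only encodes the segmentation given above:

```python
# Hypothetical output of a Kazakh morphological analyzer for the example word.
# Only the segmentation 'саяхат+та+ғы+лар+дың' from the text is assumed here.
def analyze(word: str) -> dict:
    """Toy lookup standing in for a real morphological analysis system."""
    toy_lexicon = {
        "саяхаттағылардың": ("саяхат", ["та", "ғы", "лар", "дың"]),
    }
    stem, suffixes = toy_lexicon[word]
    return {"surface": word, "stem": stem, "suffixes": suffixes}

print(analyze("саяхаттағылардың"))
# {'surface': 'саяхаттағылардың', 'stem': 'саяхат', 'suffixes': ['та', 'ғы', 'лар', 'дың']}
```

Representing each surface form through its stem and suffixes lets many inflected variants share the same stem entry, which is the sparsity reduction the morphological analysis is meant to achieve.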
The feature representation layer maps each word into a high-dimensional vector space. The vector representation of the word $w_i$ is $x_i$, which is calculated by looking up the word embedding matrix, with $x_i \in \mathbb{R}^{d}$, where $d$ represents the dimension of the word embedding. This article uses pre-trained token vectors $x^{token}$ and stem vectors $x^{stem}$ as the fixed-size vectors for each word. Through the research of [17,23], it was found that in Kazakh QC and NER studies, lexical features such as stems and affixes effectively avoid data sparsity and improve recognition accuracy; therefore, this study also used these two features as important input information. Several experiments and data analyses have also shown that syntactic features, such as phrase markers, whether the current word begins a sentence, and whether the current word is written in Latin script, can further improve the model’s accuracy in identifying named entities and questions. Finally, based on previous research, this paper combines morphological features, word-level features, and sentence-level features as the final input of the model.
Tokens: the original text is divided into tokens using spaces and punctuation marks as separators. Many natural language processing tasks use this feature.
Stems (roots): obtained from previous research work, in which stem and affix information was produced by a morphological analysis system. These features are widely used in processing agglutinative languages.
Suffixes: Kazakh, like other agglutinative languages, has inflectional and derivational affixes. The main distinction between the two types is that inflectional affixes usually add only a subtle grammatical meaning to the stem and do not change the word class of the word to which they attach, whereas derivational affixes often change the lexical meaning. The nominal suffix is also important for NER. There were 39 types of non-transitional suffixes and 4 types of transitional suffixes.
Gazetteers: these were obtained from the Kazakh NER task [17] by researchers working on the Kazakh and Kirghiz languages at the National Language Resource Monitoring and Research Center of Minority Languages.
Phrase tagging: as mentioned above, we used an automated phrase tagging system to tag the token information. Two types of Kazakh phrases were used here: noun phrases and verbal phrases.
In this article, the rich features discussed above serve as the input layer of the neural network. The overall embedding can be expressed as:

$$x_i = x_i^{token} \oplus x_i^{stem} \oplus x_i^{suffix} \oplus x_i^{phrase} \oplus x_i^{dict} \oplus x_i^{begin} \oplus x_i^{latin}$$

where $\oplus$ represents a concatenation operation linking the various feature vectors, $x_i^{token}$ is the token, $x_i^{stem}$ is the stem, $x_i^{suffix}$ is the suffixes, $x_i^{phrase}$ is the phrase feature, $x_i^{dict}$ is the named entity dictionary (gazetteer) feature, $x_i^{begin}$ indicates whether the current word begins a sentence, and $x_i^{latin}$ indicates whether the current word is a Latin-script word.
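The following minimal PyTorch sketch shows one way such a feature concatenation layer can be implemented; the vocabulary sizes, embedding dimensions, and module names are illustrative assumptions, not the paper’s settings:

```python
import torch
import torch.nn as nn

class FeatureEmbedding(nn.Module):
    """Concatenates token, stem, suffix, phrase, gazetteer, and binary flag features."""
    def __init__(self, n_tokens=10000, n_stems=8000, n_suffixes=50,
                 n_phrases=3, n_dict=5, d_token=100, d_small=20):
        super().__init__()
        # In the paper, token and stem embeddings would be initialized from
        # pre-trained vectors; here they are randomly initialized placeholders.
        self.token = nn.Embedding(n_tokens, d_token)     # x_token
        self.stem = nn.Embedding(n_stems, d_token)       # x_stem
        self.suffix = nn.Embedding(n_suffixes, d_small)  # x_suffix
        self.phrase = nn.Embedding(n_phrases, d_small)   # x_phrase
        self.gaz = nn.Embedding(n_dict, d_small)         # x_dict (gazetteer tag)
        # Two binary indicators: sentence-initial flag and Latin-script flag.
        self.out_dim = 2 * d_token + 3 * d_small + 2     # 262 with the defaults

    def forward(self, tok, stem, suf, phr, gaz, is_first, is_latin):
        # All index inputs are (batch, seq_len) LongTensors; the flags are 0/1 tensors.
        feats = [self.token(tok), self.stem(stem), self.suffix(suf),
                 self.phrase(phr), self.gaz(gaz),
                 is_first.unsqueeze(-1).float(), is_latin.unsqueeze(-1).float()]
        return torch.cat(feats, dim=-1)  # the concatenation (⊕) of the equation above
```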
3.3. LSTM Layer
LSTM (long short-term memory) has strong sequence modeling capabilities and can capture contextual information over long distances. LSTM controls the flow of input and output information through three special gate structures. To obtain the sequential characteristics and context-dependent information of words, the model uses a weight-sharing mechanism at the BiLSTM layer, sharing the weight parameters of the QC and NER tasks to improve the feature representation. Specifically, the word vectors output by the feature representation layer, $(x_1, x_2, \ldots, x_n)$, are fed into the bidirectional LSTM model to generate a hidden vector sequence $(h_1, h_2, \ldots, h_n)$ that encodes the context of each word of the entire question $q$ in $h_i$, finally mapping $x_i$ into the context representation space:

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}\big(x_i, \overrightarrow{h_{i-1}}; \theta\big), \qquad \overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}\big(x_i, \overleftarrow{h_{i+1}}; \theta\big), \qquad h_i = \big[\overrightarrow{h_i}; \overleftarrow{h_i}\big]$$

where $\overrightarrow{h_i}$ is the forward hidden layer, $\overleftarrow{h_i}$ is the backward hidden layer, $\theta$ represents the model parameters of the LSTM, and $h_i$ is the output of the BiLSTM layer.
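A minimal PyTorch sketch of the shared BiLSTM encoder follows; the input dimension continues the previous sketch, and using a single bidirectional LSTM whose parameters serve both tasks is one way to realize the weight sharing described above:

```python
import torch
import torch.nn as nn

class SharedBiLSTM(nn.Module):
    """BiLSTM encoder whose parameters are shared between the QC and NER tasks."""
    def __init__(self, input_dim=262, hidden_dim=128):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, x):          # x: (batch, seq_len, input_dim)
        h, _ = self.bilstm(x)      # h_i = [forward h_i ; backward h_i]
        return h                   # (batch, seq_len, 2 * hidden_dim)
```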
3.4. Attention Layer
In a question sentence, not all words are necessary for identifying the named entities and classifying the intent; therefore, an attention mechanism is introduced to extract the words important to the QU task and aggregate the importance representations of each word into an attention representation. The attention weight of each word is obtained through the attention mechanism, and the text sequence representation $r$ is obtained by combining the outputs of all hidden layers:

$$r = \sum_{i=1}^{n} \alpha_i h_i$$

where $h_i$ represents the LSTM hidden layer state of the encoder at the $i$-th time step, $n$ represents the length of the input sentence, and $\alpha_i$ represents the attention distribution probability of the output at the $i$-th time step, calculated using the softmax function as follows:

$$\alpha_i = \frac{\exp(e_i)}{\sum_{j=1}^{n} \exp(e_j)}, \qquad e_i = U \tanh(W h_i)$$

where $e_i$ represents the score evaluating the influence on the output at the $i$-th time step, and $W$ and $U$ are weight matrices.
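The following minimal PyTorch sketch implements an additive-style attention over the BiLSTM states; the exact scoring function used in the paper is not reproduced here, and the matrices W and U below merely play the role of the two weight matrices mentioned in the text:

```python
import torch
import torch.nn as nn

class SentenceAttention(nn.Module):
    """Computes alpha_i = softmax(e_i) over time steps and returns r = sum_i alpha_i * h_i."""
    def __init__(self, hidden_dim=256, attn_dim=128):
        super().__init__()
        self.W = nn.Linear(hidden_dim, attn_dim, bias=False)  # weight matrix W
        self.U = nn.Linear(attn_dim, 1, bias=False)           # weight matrix U

    def forward(self, h):                      # h: (batch, seq_len, hidden_dim)
        e = self.U(torch.tanh(self.W(h)))      # scores e_i: (batch, seq_len, 1)
        alpha = torch.softmax(e, dim=1)        # attention distribution over time steps
        r = (alpha * h).sum(dim=1)             # weighted sum r: (batch, hidden_dim)
        return r, alpha.squeeze(-1)
```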
3.6. Output Layer
The output layer of the model feeds the results of the pooling layer into two different representations, namely, $r^{QC}$ as the text representation for the QC task and $r^{NER}$ as the text representation for the NER task.
This paper uses the softmax function for the text classification to obtain the final classification result. The final result of the QC task is predicted to be $\hat{y}^{QC}$, and the outcome of the NER task is predicted to be $\hat{y}^{NER}$.
Compared with a classification problem, the predicted label at the current position in a sequence labeling problem is related not only to the features of the current input but also to the previously predicted labels; that is, the predicted labels are mutually dependent. CRFs (conditional random fields) are a conditional probability distribution model over input and output random variables. Consequently, we add a CRF layer above $r^{NER}$ to jointly decode the best chain of labels for the question.
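A minimal sketch of the two prediction heads follows, assuming the third-party pytorch-crf package for the CRF layer; the wiring to sentence-level and token-level representations, and all dimensions, are illustrative assumptions continuing the earlier sketches:

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # third-party package: pip install pytorch-crf

class OutputHeads(nn.Module):
    """Softmax classifier for QC and a CRF-decoded tagger for NER."""
    def __init__(self, sent_dim=256, tok_dim=256, n_classes=10, n_tags=9):
        super().__init__()
        self.qc_fc = nn.Linear(sent_dim, n_classes)   # question classification head
        self.ner_fc = nn.Linear(tok_dim, n_tags)      # per-token emission scores
        self.crf = CRF(n_tags, batch_first=True)      # CRF over the label chain

    def qc_loss(self, r_qc, y_qc):
        # r_qc: (batch, sent_dim), y_qc: (batch,) gold class indices.
        return nn.functional.cross_entropy(self.qc_fc(r_qc), y_qc)

    def ner_loss(self, h, y_ner, mask):
        # h: (batch, seq_len, tok_dim), y_ner: gold tags, mask: bool tensor of real tokens.
        emissions = self.ner_fc(h)
        return -self.crf(emissions, y_ner, mask=mask)  # negative log-likelihood

    def decode(self, h, mask):
        return self.crf.decode(self.ner_fc(h), mask=mask)  # best label chain per sentence
```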
For the multi-task learning QU model, $L_{QC}$ and $L_{NER}$ are used as the loss functions for the QC and NER tasks, respectively.
This article combines $L_{QC}$ and $L_{NER}$ into the final objective function of multi-task learning, and the final joint objective is formulated as:

$$L = \lambda_1 L_{QC} + \lambda_2 L_{NER}$$

where $\lambda_1$ and $\lambda_2$ are tunable parameters that measure the impact of the two tasks.
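A minimal sketch of how the joint objective can be assembled during training; $\lambda_1$ and $\lambda_2$ are hyperparameters (the values below are placeholders), and the loss tensors are assumed to come from the heads sketched earlier:

```python
import torch

# Joint objective L = lambda1 * L_QC + lambda2 * L_NER, with tunable task weights.
lambda1, lambda2 = 0.5, 0.5   # placeholder values; tuned on development data

def joint_loss(l_qc: torch.Tensor, l_ner: torch.Tensor) -> torch.Tensor:
    """Weighted combination of the two task losses for one training step."""
    return lambda1 * l_qc + lambda2 * l_ner
```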