Intent Detection Problem Solving via Automatic DNN Hyperparameter Optimization

Accurate intent detection-based chatbots are usually trained on larger datasets that are not available for some languages. Seeking the most accurate models, three English benchmark datasets that were human-translated into four morphologically complex languages (i.e., Estonian, Latvian, Lithuanian, Russian) were used. Two types of word embeddings (fastText and BERT), three types of deep neural network (DNN) classifiers (convolutional neural network (CNN), long short-term memory (LSTM), and bidirectional LSTM (BiLSTM)), different DNN architectures (shallower and deeper), and various DNN hyperparameter values were investigated. The DNN architecture and hyperparameter values were optimized automatically using the Bayesian method and random search. On the three datasets of 2/5/8 intents, accuracies of 0.991/0.890/0.712, 0.972/0.890/0.644, 1.000/0.890/0.644, 0.981/0.872/0.712, and 0.972/0.881/0.661 were achieved for the English, Estonian, Latvian, Lithuanian, and Russian languages, respectively. BERT multilingual vectorization with the CNN classifier proved to be a good choice for all datasets and all languages. Moreover, in the majority of models, the same set of optimal hyperparameter values was determined. The results obtained in this research were also compared with previously reported values (where the hyperparameter values of the DNN models were selected by an expert). This comparison revealed that automatically optimized models are competitive or even more accurate when created with larger training datasets.


Introduction
Our society is hard to imagine without virtual assistants and chatbots such as Siri, Alexa, and Cortana. The AI technology in chatbots is responsible for intelligent human-computer interaction [1]; chatbots can answer vital questions 24/7 [2,3] and even assist in learning [4].
Usually, chatbots are composed of the following components: natural language understanding (NLU; responsible for comprehension of users' questions), dialog management (responsible for a fluent conversation), content (responsible for the chatbot's properly selected answers), and custom data (which helps to personalize conversations). The focus of this research is on the NLU module (specifically, on intent detection) because comprehension of the structure and meaning of user questions is the core of smooth operation in any dialog system.
The intent detection task, a typical example of text classification, can be solved with rule-based or machine learning (ML) approaches. However, the creation of rules usually requires a lot of manual effort, and rules cannot cover unseen but domain-related questions (more details about various intent detection techniques can be found in [25]). Supervised classifiers are trained with traditional ML or deep learning (DL) approaches. Traditional ML methods are usually applied to discrete feature representations, such as textual (e.g., word and character n-grams) or syntactic (e.g., part-of-speech tags) features (e.g., research with support vector machines in [26]).
Nevertheless, major progress in addressing intent detection problems has been made due to DL. Effective DL-based research presented knowledge distillation and a posterior regularization method to detect a user's intent of leaving a service for another service provider (known as a churn detection problem) on an English microblog dataset [27]: the applied CNN method learns simultaneously from logic rules and supervised data (represented as random, skip-gram, CBOW, and GloVe embeddings). The churn detection problem was also successfully tackled in [28]: the CNN method, with a bidirectional GRU and bilingual German and English fastText embeddings, was applied to a conversational English and German Twitter dataset. Another research direction covers topic-based intent detection problems. Comparative topic-based intent detection experiments in the English, Estonian, Latvian, Lithuanian, and Russian languages, performed with two methods (i.e., the feed forward neural network and fastText embeddings with CNN), demonstrated the superiority of CNN [29]. The authors used rather small datasets (three English benchmark datasets that were also machine translated into the Estonian, Latvian, Lithuanian, and Russian languages) but claimed to achieve state-of-the-art performance.
The previously summarized research works focused on closed-set intent detection problems. However, there have been some attempts to detect even those intents that have no training data, as in, e.g., [30]. The authors tackled this problem for the English and Chinese languages by applying a two-fold architecture based on BiLSTM, with multiple self-attention heads to discriminate existing intents. If no existing intent can be determined, emerging intents are detected from the existing ones (by specifying or generalizing them) using a knowledge transfer method based on a similarity evaluation. Despite the fact that the majority of intent detection research relies on the assumption that any intent can be predicted solely from a user's question, some researchers have offered additional measures to help clarify the meaning of some questions in further conversation. Such a problem (called a multiturn response problem) is tackled in [31]. The authors use a deep attention matching network, with stacked attention on text segments of different granularities, and then extract matched sentence pairs from the conversational context and the candidate responses. The authors successfully applied their method to an English corpus containing conversations about system troubleshooting and a Chinese social networking corpus.
In this research, a topic-based intent detection problem is tackled for the English, Estonian, Latvian, Lithuanian and Russian languages. This work is a continuation of the research presented in [29,32]. In [32], similar DNN hyperparameter tuning was performed; however, it was done on one Lithuanian dataset only. In contrast, in this research, three different datasets for five different languages are used. Compared to [29], the purpose of this research is to test more types of DNNs, more architectures, more options of DNN hyperparameter values, and more word-embedding types. Contrary to [29], the hyperparameters in our research are tuned automatically by using two hyperparameter optimization strategies. In comparison to [32], a goal of this work is to determine (1) which choices of methods (embedding types, classifier types, DNN architectures, and hyperparameter values) boost accuracy the most for different datasets and languages; (2) whether those choices are valid on a dataset level and/or a language level; (3) whether there are choices that are equally good for all languages. Compared to [29], our goal is to determine (1) whether intent detection benefits from automatic hyperparameter optimization and (2) whether the achieved accuracies exceed previously reported ones.

Datasets
Tilde's (www.tilde.com) research interests cover morphologically complex languages, e.g., Estonian, Latvian, Lithuanian, and Russian. Unfortunately, labeled datasets for intent detection problems are not publicly available or may not even exist for some of these languages. The problem was overcome by taking English benchmark datasets and manually translating them into the target languages. Similar datasets (having the same number of instances and intents and the same distribution of instances among training/testing subsets) contribute to the equalization of experimental conditions, which, in turn, makes comparative analysis possible for different languages. A detailed description of the English benchmark datasets can be found in [33]: (1) the chatbot dataset (presented in Table 1) contains real users' questions about public transport connections; (2) the askUbuntu (Table 2) and (3) webapps (Table 3) datasets are based on questions from the StackExchange (https://www.stackexchange.com) platform. It is important to note that the training/testing splits for all these benchmark datasets were kept the same as in [33], which is the main reason why cross-validation has not been performed. Moreover, with several folds of the same dataset, it would be much more difficult to come up with summarized recommendations.

Some language differences can already be seen directly from the tables. For example, the English language usually has the largest average number of words per instance. English is followed by Russian, Lithuanian, and Latvian, whereas Estonian has the smallest (e.g., 510 and 393 words covering FindConnection in the training dataset for EN and ET, respectively). When analyzing distinct words, Estonian comes first as having the largest number, and English last. All these observations can be explained linguistically: the English language has the least complex morphology, and different morphological forms are expressed with the help of functional words.
Estonian and the other three languages are morphologically complex. Estonian is an agglutinative language (using prefixes, suffixes, and infixes to express inflections); Lithuanian, Latvian, and Russian are fusional languages (allowing the word ending to encode several grammatical categories depending on the inflection form); all this contributes to a larger number of different words and their forms.

Formal Description of the Task
The intent detection problem is a typical example of a supervised text classification task. Formally, such a task is determined as follows. Let D = {d1, d2, ..., dn} be a set of documents (questions/statements, i.e., input from a user). Let C = {c1, c2, ..., cm} be a set of intents (classes). In this research, a closed-set and single-label classification problem is tackled because each di ∈ D can be labeled with only one cj ∈ C. Let η be a classification function that maps documents from the determined domain into their correct classes: D → C. Let DL ⊂ D be a set of documents for which intents are known. Thus, (di, cj) pairs are labeled instances used to train a model. Let Γ be a classification method (i.e., a classifier, its architecture, and parameter set) that, from labeled instances, can learn an accurate model (which is the approximation of η).
The aim of solving our intent detection task is to offer a classification method Γ that finds the best approximation of η, achieving as high an intent detection accuracy as possible on unseen instances, i.e., on instances D \ DL that were not used for training.
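As a minimal illustration of this formal setup, the sketch below implements a toy Γ (a word-overlap nearest-neighbour rule, purely illustrative and not one of the DNN classifiers used in this research) that learns from labeled pairs (di, cj) and approximates η on an unseen question. The example documents and the DepartureTime intent name are hypothetical.

```python
def train(labeled_pairs):
    """Γ's learning step; this toy Γ simply memorizes the labeled pairs (d_i, c_j)."""
    return list(labeled_pairs)

def predict(model, document):
    """Approximation of η: return the intent of the training document
    sharing the most words with the unseen document."""
    words = set(document.lower().split())
    best_intent, best_overlap = None, -1
    for doc, intent in model:
        overlap = len(words & set(doc.lower().split()))
        if overlap > best_overlap:
            best_intent, best_overlap = intent, overlap
    return best_intent

# D_L: labeled instances (documents with known intents)
D_L = [("when is the next bus to the airport", "FindConnection"),
       ("how long does the tram ride take", "DepartureTime")]
model = train(D_L)
print(predict(model, "next bus to the city center"))  # FindConnection
```

The unseen question shares four words with the first training document and only one with the second, so the toy model assigns it the FindConnection intent.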

DNN-Based Classifiers
Training datasets (presented in Section 3) contain labeled instances and, therefore, can be used to train classifiers in a supervised manner. A binary classification problem will be tackled with the chatbot dataset (containing only two classes) and multiclass classification with the askUbuntu and webapps datasets (containing more than two classes).
DL approaches used in various NLP tasks outperform traditional ML and, therefore, allow us to expect higher accuracy for our intent detection problems as well. From the whole group of ML methods, the following DL approaches (considered the most suitable for text classification problems) were selected:
• CNN (convolutional neural network; introduced by LeCun [34]) is a DNN used to seek fixed-size patterns, so-called convolutions. Text has a one-dimensional structure in which sequences of tokens matter because convolutions are expressed with a sliding window function over these tokens. By resizing filters and linking their output to different sizes of patterns (consisting of 2, 3, or more adjacent tokens, so-called n-grams), important token patterns can be detected and generalized. The main advantage of the CNN method is that it learns to detect important patterns regardless of their position in the text. In our experiments, a CNN architecture similar to [35] has been explored.

• The LSTM (long short-term memory) method (presented by Hochreiter and Schmidhuber [36]) is an improved version of the recurrent neural network (RNN). An advantage of RNNs over, e.g., feed forward neural networks is that RNNs have memory and, therefore, can be effectively applied to sequential data (i.e., text). Sometimes, the presence/absence of some patterns (as in the case of CNN) does not play the major role, but rather the order of tokens in sequences. However, RNNs confront a vanishing gradient problem and, therefore, cannot solve tasks that require learning long-term dependencies. In contrast, the LSTM method contains a "memory cell" that is able to maintain memory for a longer period of time; integrated gates control what information entering the "memory cell" is important, to which hidden state it has to be outputted, and when it has to be forgotten. Hence, LSTM methods are superior to RNNs when applied to longer sequences.

• The BiLSTM (bidirectional LSTM) method (introduced by Graves and Schmidhuber [37]) is, like the LSTM classifier, suitable for tasks where the learning problem is sequential. While LSTMs run an input forward, preserving information only from the past, BiLSTMs analyze sequences in both directions, i.e., forward and backward; thus, in any hidden state, they preserve information from the-past-to-the-future and from the-future-to-the-past, respectively.
Experiments with CNN, LSTM, and BiLSTM methods were performed using our implementations in the Python programming language with the Keras library (Keras: the Python DL library; available online: https://keras.io/; adjusted to create DL architectures) and the internal TensorFlow engine (an end-to-end open source ML platform; available online: https://www.tensorflow.org/; used for developing ML methods, managing large data flows, and performing mathematical operations).
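To make the sliding-window intuition behind the CNN concrete, here is a stdlib-only toy sketch (not the actual Keras implementation used in the experiments): a bigram filter is dotted with every window of adjacent token features, and global max-pooling keeps the strongest response regardless of its position in the sequence. All numbers are illustrative stand-ins for word-embedding features.

```python
def conv1d(sequence, kernel):
    """Slide the kernel over the sequence (stride 1) and dot each window."""
    k = len(kernel)
    return [sum(w * f for w, f in zip(kernel, sequence[i:i + k]))
            for i in range(len(sequence) - k + 1)]

def global_max_pool(feature_map):
    """Keep only the strongest filter response, wherever it occurred."""
    return max(feature_map)

tokens = [0, 3, 4, 1, 0]     # scalar stand-ins for word-embedding features
bigram_filter = [1, 1]       # "fires" on two strong adjacent tokens (a 2-gram)
feature_map = conv1d(tokens, bigram_filter)
print(feature_map)                    # [3, 7, 5, 1]
print(global_max_pool(feature_map))   # 7
```

The pooled value 7 is the same whether the strong bigram appears at the start, middle, or end of the sentence, which is exactly the position-invariance property noted above.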

Vectorization Types
The input/output of DNNs must be numerical. Calculated output neuron values (linked to separate classes/intents) can indicate how likely each predicted class/intent is. Input neurons linked to incoming text elements must be numerical in order to apply DNN classifiers (described in Section 4.2) on top of them. For this reason, word embeddings (also called word vectors) that project words into N-dimensional space are used. In our experiments, the highly recommended types of distributional word embeddings that are able to catch semantic similarities between words were chosen:

Hyperparameter Tuning
The NLU problem is considered an AI-hard problem (meaning that the created software should be as intelligent as a human), and a lot of effort has been put into the optimization of DNN methods (as, e.g., in [41]). DNN methods have many hyperparameters, and each hyperparameter may have several determined choices (e.g., several types of activation functions) and discrete numeric (e.g., number of neurons) or real numeric (e.g., dropout from the interval [0, 1]) values. However, choosing optimal hyperparameter values manually is a difficult task, even for human experts. To overcome this problem, the open-source Python library Hyperas (https://github.com/maxpumperla/hyperas), implemented to optimize hyperparameters in Keras models automatically, has been used. Tuning of the DNN models (i.e., their hyperparameters) was performed automatically: each model was trained on the training split (containing 80% of instances from the shuffled training set), and validation was done on the rest (20% of instances from the training set). The hyperparameter optimization was done to increase the validation accuracy, and, for this reason, the following two optimization algorithms were used:

• Random.suggest performs a random search over a set of hyperparameter values in 100 iterations (the experimental investigation revealed that 100 iterations are enough to find the optimal hyperparameter value set that gives maximum accuracy on the validation dataset).
• Tpe.suggest (tree-structured Parzen estimator, TPE) [42] performs a Bayesian-based iterative search (for the schematic representation of TPE, see Figure 1). The search strategy of TPE contains two phases. During an initial warm-up phase, it randomly explores the hyperparameter value space. These hyperparameter values can be conditional (e.g., an additional layer in the architecture), sampled from an interval (e.g., a dropout rate), or chosen from a determined list of values (e.g., activation functions). The chosen hyperparameter value combinations are used to train a model (on the training split), which is then evaluated on the validation split to see each combination's impact on the accuracy. The warm-up phase lasts for n_init iterations (n_init = 20 in our experiments) and builds a function based on the Bayes rule presented in Equation (1).
P(acc|param) is the probability of some validation accuracy (acc) being achieved with a determined set of hyperparameter values (param). Based on this accuracy, hyperparameter value combinations are distributed into good and bad splits. The parameter γ determines the size of the good split. In our experiments, γ = 0.25, which means that 25% of all hyperparameter value combinations belong to the good split, and the rest (75%) belong to the bad split. Based on how the hyperparameter value combinations are distributed, the accuracy threshold (denoted as acc′) is calculated. Thus, P(acc|param) is expected to give an improvement only if acc ≥ acc′. P(param|acc) is presented in Equation (2).
The goal of the second TPE phase is to maximize the expected improvement (EI) ratio presented in Equation (3).
The maximization of EI is done by choosing the hyperparameter values param with high probability under P_good(param) and low probability under P_bad(param): n_EI hyperparameter value combinations are sampled (n_EI = 24 in our experiments), and the one with the largest EI is memorized and used in the next iteration. In the next iteration, TPE again calculates the validation accuracy and redistributes the hyperparameter value combinations into good and bad splits, this time using all previous combinations together with the recent one. The process is repeated until the determined number of trials n_trials is reached (n_trials = 100 in our experiments). In our experiments, the default TPE parameters (n_init = 20, γ = 0.25, n_EI = 24), together with n_trials set to 100, have been used. The reason for not experimenting with other values is that these parameter values already allowed the trained models to reach 100% accuracy on the validation dataset, i.e., to find an optimal set of hyperparameter values.
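The two-phase procedure described above can be sketched as follows. This is a simplified, stdlib-only illustration of the TPE idea (a random warm-up, a γ-quantile good/bad split, and candidates scored by an approximate P_good/P_bad ratio), with a toy one-dimensional "validation accuracy" function standing in for actual model training; it is not the Hyperas/hyperopt implementation.

```python
import random

def val_accuracy(dropout):
    """Toy stand-in for 'train a model, measure validation accuracy';
    the (hypothetical) optimum is dropout = 0.3."""
    return 1.0 - abs(dropout - 0.3)

def kde(x, sample, bw=0.1):
    """Crude Parzen-window density estimate over a sample of parameter values."""
    return sum(1.0 for s in sample if abs(s - x) < bw) / (len(sample) * bw) + 1e-9

def tpe_like_search(objective, n_init=20, n_trials=100, n_ei=24, gamma=0.25):
    history = []
    for _ in range(n_init):                 # warm-up: purely random exploration
        p = random.random()
        history.append((objective(p), p))
    for _ in range(n_trials - n_init):
        history.sort(reverse=True)                  # best accuracy first
        n_good = max(1, int(gamma * len(history)))  # top-γ = "good" split
        good = [p for _, p in history[:n_good]]
        bad = [p for _, p in history[n_good:]]
        # Sample n_EI candidates near good points; keep the best P_good/P_bad ratio.
        cands = [min(1.0, max(0.0, random.gauss(random.choice(good), 0.1)))
                 for _ in range(n_ei)]
        best = max(cands, key=lambda c: kde(c, good) / kde(c, bad))
        history.append((objective(best), best))
    return max(history)                     # (best accuracy, best hyperparameter)

random.seed(0)
acc, dropout = tpe_like_search(val_accuracy)
print(round(acc, 3), round(dropout, 3))
```

With 100 trials, the search reliably lands near the toy optimum of 0.3; in the real setting each `objective` call is a full Keras training/validation run, which is why keeping n_trials modest matters.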

Experiments and Results
Experiments were carried out with the datasets, DNN methods (i.e., CNN, LSTM, and BiLSTM), and vectorization techniques (i.e., fastText and BERT) described in Section 3, Section 4.2, and Section 4.3, respectively. Moreover, the DNN Keras models were optimized with the tpe.suggest and random.suggest algorithms presented in Section 4.4. Models were tuned to achieve as high an accuracy on the validation dataset as possible; each combination of language, dataset, classifier, and word embedding type was tuned separately and later evaluated in the testing phase.
The performance of each model was evaluated with the accuracy metric, as presented in Equation (4).
where Ncorrect and Nall stand for the numbers of correctly predicted and all tested instances, respectively. A model is considered reasonable if its accuracy on the testing dataset is above the random baseline in Equation (5) (assigning labels to instances according to their probabilities in the testing set) and the majority baseline in Equation (6) (assigning all instances to the class having the largest probability in the training set) (see Table 4).
where P(cj) is the probability of a class (intent).
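Under one common reading of Equations (4)-(6), the metric and both baselines can be computed as below; the class labels and counts are illustrative, not taken from the actual datasets.

```python
from collections import Counter

def accuracy(predicted, actual):
    """Equation (4): N_correct / N_all."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def random_baseline(test_labels):
    """One reading of Equation (5): expected accuracy when each label is drawn
    at random with its class probability in the testing set, i.e. sum of P(c_j)^2."""
    n = len(test_labels)
    return sum((count / n) ** 2 for count in Counter(test_labels).values())

def majority_baseline(train_labels, test_labels):
    """Equation (6): accuracy of always predicting the most frequent training class."""
    majority_class = Counter(train_labels).most_common(1)[0][0]
    return sum(label == majority_class for label in test_labels) / len(test_labels)

train = ["FindConnection"] * 6 + ["DepartureTime"] * 4   # illustrative label counts
test = ["FindConnection"] * 7 + ["DepartureTime"] * 3
print(random_baseline(test))           # 0.7^2 + 0.3^2 ≈ 0.58
print(majority_baseline(train, test))  # 0.7
```

A trained model whose test accuracy does not beat both printed values would be considered unreasonable by the criterion above.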
The testing results for English, Estonian, Latvian, Lithuanian, and Russian are summarized in Tables 5-9 (e.g., Table 6 reports the evaluated accuracies on the Estonian testing datasets with the optimized models, and Table 9 on the Russian testing datasets; for the notation, see Table 5). When comparing different evaluation results, it is important to determine whether the differences are statistically significant. For this purpose, the McNemar test [43], with 95% confidence (α = 0.05), has been used. Differences were considered statistically significant if the calculated p-value was below α = 0.05.
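For two models evaluated on the same testing instances, the McNemar statistic depends only on the discordant predictions. A stdlib-only sketch (comparing the continuity-corrected chi-square statistic with the 3.841 critical value for one degree of freedom, which for this test is equivalent to checking p < 0.05) might look like this; the toy predictions are illustrative.

```python
def mcnemar(y_true, pred_a, pred_b, critical=3.841):
    """Continuity-corrected McNemar chi-square statistic on the discordant pairs:
    b = instances only model A got right, c = instances only model B got right.
    chi2 > 3.841 (1 df) corresponds to p < 0.05, i.e., a significant difference."""
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    if b + c == 0:
        return 0.0, False
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    return chi2, chi2 > critical

# Toy example: model A alone is correct on 9 instances, model B alone on 1.
y_true = [1] * 20
pred_a = [1] * 9 + [0] + [1] * 10
pred_b = [0] * 9 + [1] + [1] * 10
print(mcnemar(y_true, pred_a, pred_b))  # (4.9, True)
```

Note that the 10 instances both models classify correctly contribute nothing to the statistic; only the 9-vs-1 disagreement drives the significance verdict.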

Discussion
This research assumes that DNN hyperparameter optimization can be done without manual intervention. However, there are a few things that set this process in the right direction: the usage of the most promising word embedding types (i.e., fastText and BERT) and the most suitable classifiers (CNN, LSTM, and BiLSTM), adjusted to deal with the text. Moreover, two hyperparameter optimization algorithms have been applied: random search (rand.suggest) and TPE (tpe.suggest) that combines exploration (reaching new regions of hyperparameter values) and exploitation (searching for optimal solutions in a given region of hyperparameter values) strategies.
Zooming into Tables 5-9 allows us to make the following statements. With some rare exceptions, all obtained results are reasonable because they exceed random and majority baselines (presented in Table 4).
The best results on testing splits for each dataset and each language (English (EN), Estonian (ET), Latvian (LV), Lithuanian (LT), and Russian (RU)) are summarized in Figure 2. As can be seen from Figure 2, the best and most stable results across different languages are achieved with fewer intents but more training instances (i.e., the chatbot dataset), and the worst results with more intents and fewer training instances (webapps). The validation split (20% of the shuffled training dataset) of the webapps dataset is extremely small (only six instances); moreover, these randomly selected instances do not necessarily overlap for different languages. This means that for some languages, the validation split happened to be less representative (and less consistent with the testing dataset) than for the others. Despite the fact that the DNN hyperparameter optimization algorithm was able to find very accurate models on the validation splits, these models performed poorly on the testing datasets. From this experimental investigation, it can be concluded that automatic hyperparameter optimization is suitable only for larger and more representative datasets (as, in our case, chatbot or askUbuntu).
When comparing our best results with the results reported in [29], it can be concluded that our DNN hyperparameter optimization approach, unfortunately, underperforms [29] on webapps for all languages; the reason is that DNN hyperparameter optimization is not suitable for smaller datasets. Our approach is competitive (giving the same or very similar accuracy) on the askUbuntu dataset for all languages, and it is better on the chatbot dataset. Hence, automatic hyperparameter optimization was able to surpass methods whose DNN hyperparameters were selected by experts; therefore, automatic DNN hyperparameter optimization is the right way to seek the most accurate DNN models for intent detection problems.
As can be seen from Tables 5-9, it is hard to make a conclusion on which of the hyperparameter optimization algorithms (rand.suggest or tpe.suggest) is more suitable for our tasks. Thus, both methods are equally good if a large enough number of iterations (100 in our experiments) is selected for the optimization.
In the past, text classification tasks with morphologically complex languages could not reach the same accuracy levels as with English, but the results for English in Figure 2 do not stand apart from the rest. Since the experimental conditions are equalized for all languages (the same number of intents, the same distribution of instances among different classes, the same classifiers, the same optimization algorithms), the only difference lies in the language processing, i.e., vectorization. With neural vectorization, none of these languages has an advantage due to a smaller vocabulary (because the vector space dimensionality is the same for all languages) or less variety in inflection forms (because word embeddings are not discrete but distributional). However, some differences between the choices of word embedding types still exist. As seen from Figure 2, BERT vectorization is a better choice than fastText for all morphologically complex languages and all datasets, and this is not surprising. Morphologically complex languages (especially fusional languages) suffer from disambiguation problems, but BERT has mechanisms that can vectorize differently even words that are written the same but have different context-dependent meanings. Although fastText embeddings are also trained to consider a context around the target word, that context is restricted to only a few words. Nevertheless, fastText is a suitable vectorization solution for languages (such as English) with a strict word order in a sentence. In contrast, BERT is able to consider a much broader context (words, sentences, their order) and is, therefore, more suitable for languages that have a relatively free word order in a sentence (such as Latvian, Lithuanian, and Russian).
Although LSTM and BiLSTM classifiers can sometimes be very accurate (especially on the chatbot dataset, which has only two intents and enough representative training data), the domination of the CNN classifier is obvious (see Figure 2). Since we are solving a topic-based intent detection problem (where different intents are related to different topics), the contextual words or their n-grams seem to play a more important role than the sequential nature of the text.
Furthermore, our focus is on the DNN architectures (the most accurate DNN architectures are visualized using the plot_model utility function in Keras) and the hyperparameter values of the most accurate models. The architecture of the most accurate CNN method happens to be the same for all datasets and languages (see Figure 3). Here, the notation WE defines the dimensionality of the word embeddings (WE = 300 with fastText and 768 with BERT), and C stands for the number of classes (i.e., 2, 5, and 8 in the chatbot, askUbuntu, and webapps datasets, respectively). The None dimension in the shape tuples refers to the batch size, which, in our case, is variable (because it is among the optimized parameters); the None value is inserted automatically by plot_model and means that the layer can accept input of any size.
Next to the CNN method architecture, architectures of LSTM and BiLSTM methods, which happen to be equally accurate, are presented: i.e., BiLSTM on the chatbot and webapps datasets with English (see Figure 4); LSTM on the chatbot dataset with Latvian ( Figure 5); BiLSTM on the webapps dataset with Lithuanian ( Figure 6).
Since the plot_model function presents only partial information about hyperparameter values, the missing values are summarized in Appendix A. For many datasets and languages, not only the same CNN classifier and the same CNN architecture (in Figure 3) but also the same set of hyperparameter values allows us to reach the best performance. This set is presented in Appendix A with the English language, the askUbuntu dataset, and BERT vectorization. Since this set happened to be optimal in almost half of the best-determined models, it is recommended for various intent detection problems.

Conclusions
In this research, we were solving a supervised intent detection problem for the English language and four morphologically complex languages, i.e., Estonian, Latvian, Lithuanian, and Russian. This problem has been tackled by seeking the most accurate models via automatic DNN hyperparameter optimization. In our research, two types of word embeddings (fastText, BERT), three types of DNNs (CNN, LSTM, BiLSTM), their different architectures (shallower and deeper), and various hyperparameter values (e.g., activation functions, numbers of neurons, dropouts) have been explored. The optimization was performed on three English benchmark datasets (containing 2, 5, and 8 intents) that were also manually translated into other languages.
Despite the fact that very strict conclusions cannot be drawn due to a lack of statistical significance in the result differences, some trends are apparent: (1) DNN hyperparameter optimization is the right solution when seeking accurate models for the larger training datasets; (2) the BERT embeddings type is an especially good vectorization choice for morphologically complex languages, whereas English can benefit from fastText as well; (3) the CNN classifier allows us to reach high accuracy levels despite the dataset or language; the other classification techniques are equally good only with enough training data.
This research is important from the scientific perspective due to (1) the automatic hyperparameter optimization of DNN models for various intent detection problems and (2) the comparison of obtained results for different languages, i.e., for English and for morphologically complex languages from the Finno-Ugric (Estonian), Baltic (Latvian and Lithuanian), and Slavic (Russian) branches. Moreover, some solutions work across datasets and even languages. This allows us to anticipate that similar results can also be expected for other languages of the same branches.
The research is also important for practical reasons. The optimal parameters revealed here can be used to train other intent detection-based chatbots for the English, Estonian, Latvian, Lithuanian, and Russian languages. However, higher accuracy can be expected only with larger and more representative training datasets.
In future research, it would be useful to experiment with larger datasets, try other classification methods such as BERT fine-tuning, and even go beyond intent detection problems (by focusing on generative chatbots).