Open Access Article

A Domain-Specific Generative Chatbot Trained from Little Data

J. Kapočiūtė-Dzikienė 1,2
1 JSC Tilde Information Technology, Naugarduko str. 100, LT-03160 Vilnius, Lithuania
2 Department of Applied Informatics, Vytautas Magnus University, Vileikos str. 8, LT-44404 Kaunas, Lithuania
Appl. Sci. 2020, 10(7), 2221; https://doi.org/10.3390/app10072221
Received: 20 February 2020 / Revised: 9 March 2020 / Accepted: 16 March 2020 / Published: 25 March 2020
Accurate generative chatbots are usually trained on large datasets of question–answer pairs. Although such datasets do not exist for some languages, this does not reduce companies' need for chatbot technology on their websites. However, companies usually own small domain-specific datasets (at least in the form of an FAQ) about their products, services, or technologies. In this research, we seek effective solutions for creating generative seq2seq-based chatbots from very small data. Since the experiments are carried out on English and on the morphologically complex Lithuanian language, we have the opportunity to compare results for languages with very different characteristics. We experimentally explore three encoder–decoder LSTM-based approaches (simple LSTM, stacked LSTM, and BiLSTM), three word-embedding types (one-hot encoding, fastText, and BERT embeddings), and five encoder–decoder architectures based on different encoder and decoder vectorization units. Furthermore, all of the proposed approaches are applied to two pre-processed variants of the datasets: one with punctuation removed and one with punctuation separated from words. The experimental investigation revealed the advantages of the stacked LSTM and BiLSTM encoder architectures and of BERT embedding vectorization (especially for the encoder). The best BLEU scores achieved on the English/Lithuanian datasets were ~0.513/~0.505 with removed punctuation and ~0.488/~0.439 with separated punctuation. Better results were achieved for English, because generating the different inflection forms of morphologically complex Lithuanian is a harder task. For both languages, the BLEU scores fell into the range that defines the quality of the generated answers as good or very good. This research was performed with very small datasets covering little topical variety, which makes it not only more difficult but also more interesting. Moreover, to our knowledge, it is the first attempt to train generative chatbots for a morphologically complex language.
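
To make one of the architectures named in the abstract concrete, here is a minimal sketch, assuming Keras: a BiLSTM encoder whose final states initialize an LSTM decoder over answer tokens. The vocabulary size, embedding dimension, state size, and all variable names are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch of a BiLSTM-encoder / LSTM-decoder seq2seq model (illustrative;
# dimensions and names are assumptions, not the paper's setup).
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 5000   # assumed vocabulary size
embed_dim = 300     # e.g., fastText dimensionality
latent_dim = 256    # assumed LSTM state size

# --- Encoder: embedding + bidirectional LSTM ---
enc_inputs = keras.Input(shape=(None,), name="question_tokens")
enc_emb = layers.Embedding(vocab_size, embed_dim)(enc_inputs)
enc_bilstm = layers.Bidirectional(layers.LSTM(latent_dim, return_state=True))
_, fwd_h, fwd_c, bwd_h, bwd_c = enc_bilstm(enc_emb)
# Concatenate forward/backward states to initialize the decoder.
state_h = layers.Concatenate()([fwd_h, bwd_h])
state_c = layers.Concatenate()([fwd_c, bwd_c])

# --- Decoder: embedding + LSTM conditioned on the encoder states ---
dec_inputs = keras.Input(shape=(None,), name="answer_tokens")
dec_emb = layers.Embedding(vocab_size, embed_dim)(dec_inputs)
dec_lstm = layers.LSTM(2 * latent_dim, return_sequences=True)
dec_outputs = dec_lstm(dec_emb, initial_state=[state_h, state_c])
outputs = layers.Dense(vocab_size, activation="softmax")(dec_outputs)

model = keras.Model([enc_inputs, dec_inputs], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```

Training such a model would pair question token sequences with answer sequences shifted by one position (teacher forcing); at inference time, the decoder would generate the answer token by token from the encoder's final states.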
Keywords: generative chatbot; encoder–decoder architecture; LSTM; BiLSTM; fastText and BERT word embeddings
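
Since answer quality above is reported in BLEU, a rough sentence-level check is easy to reproduce; the sketch below assumes NLTK, and the example strings are invented rather than taken from the paper's datasets.

```python
# Illustrative only: sentence-level BLEU for one generated answer.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "our service supports lithuanian and english".split()
generated = "our service supports lithuanian".split()

# Smoothing avoids zero scores on short answers with missing n-grams.
score = sentence_bleu([reference], generated,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.3f}")
```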
MDPI and ACS Style

Kapočiūtė-Dzikienė, J. A Domain-Specific Generative Chatbot Trained from Little Data. Appl. Sci. 2020, 10, 2221.

