Designing an Open Source Virtual Assistant

Abstract: A chatbot is a type of agent that allows people to interact with an information repository using natural language. Nowadays, chatbots are incorporated as conversational assistants on the major mobile and desktop platforms. In this article, we present our design of an assistant built from open-source and widely used components. Our proposal covers the process end-to-end, from information gathering and processing to visual and speech-based interaction. We have deployed a proof of concept over the website of our Computer Science Faculty.


Introduction
Nowadays, conversational systems are part of our daily routines [1]. Tech giants are aware of their relevance, and they are incorporating these assistants into their platforms. Microsoft Cortana and Apple Siri are popular examples. Some companies, such as Google with the Assistant or Amazon with Alexa, are even manufacturing dedicated devices. Moreover, these services offer great opportunities for customer support [2]. Many companies' websites are gradually enabling conversational capacities to help users discover products and information [3,4]. On the other hand, voice commands have gained much traction for user interaction [5]. There is a clear tendency to move towards audio controls over tactile interfaces [6]. The inclusion of voice capabilities was, therefore, a straightforward improvement for conversational agents. Apart from the business value of the technology, voice-enabled assistants are truly useful for people with functional diversity [7].
In this article, we propose an architecture for a conversational assistant. As we mentioned, our proposal covers the process end-to-end: we apply information retrieval, natural language processing, machine learning, and speech technologies to cover everything from data acquisition to answering user questions with text and audio responses.

Proposal
As mentioned previously, our architectural design involves all stages of the process. The system covers everything from information gathering and processing to visual- and speech-based interaction. For that, we have used models and techniques from different information processing fields. In this section, we explain the process we followed and describe the technologies used in the development.
A web crawler is in charge of the first step of the information-processing pipeline. In this phase, we retrieve the information from the domain webpages and keep those documents up to date. For this task, we used Scrapy (https://scrapy.org/), a popular web scraper. It downloads the data from the Internet and creates a repository containing the files to be indexed. The website (https://www.fic.udc.es/) contains documents in both Spanish and Galician. The second phase corresponds to text processing, which includes sentence splitting and indexing. We used Elasticsearch (https://www.elastic.co/), a distributed search engine based on Lucene, for both indexing and searching. We propose the use of the Elasticsearch language identification component for tagging the language of the documents. As we were building a conversational system, we indexed the data at both the document and sentence levels. Indexing isolated sentences allowed us to answer many of the user's questions concisely and directly.
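The document- and sentence-level indexing step can be sketched as follows. The index names (`documents`, `sentences`), the naive regex splitter, and the helper names are illustrative assumptions rather than our actual implementation; in deployment, the resulting actions would be fed to the Elasticsearch bulk API.

```python
import re

def split_sentences(text):
    """Naive sentence splitter on terminal punctuation (a stand-in for a
    proper NLP splitter in the real pipeline)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def build_index_actions(doc_id, text, lang):
    """Build bulk-index actions: one document-level entry plus one entry
    per sentence, so many questions can be answered concisely."""
    actions = [{"_index": "documents", "_id": doc_id,
                "_source": {"text": text, "lang": lang}}]
    for i, sent in enumerate(split_sentences(text)):
        actions.append({"_index": "sentences", "_id": f"{doc_id}-{i}",
                        "_source": {"text": sent, "doc_id": doc_id,
                                    "lang": lang}})
    return actions
```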
Our design accepts user input as both written and spoken queries. For processing voice queries, we used Kaldi (https://kaldi-asr.org/) for Automatic Speech Recognition (ASR). Finally, we provided the system response in text and audio format using Cotovia (http://gtm.uvigo.es/cotovia). Here, we again used language identification models to process user inputs and outputs correctly. In the case of voice interaction, we trained the automatic speech recognition language models with specific domain lexica [8]. On the voice response side, the system reproduces the responses, selecting the language according to the user input. In this case, we used Cotovia pre-trained models to perform speech synthesis [9].
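The language selection on the response side can be sketched as a small dispatch. The function names, language codes, and fallback default below are hypothetical placeholders, not the actual interface of our system.

```python
def select_tts_voice(detected_lang, supported=("es", "gl"), default="gl"):
    """Mirror the user's input language in the spoken response, falling
    back to a default when identification is inconclusive."""
    return detected_lang if detected_lang in supported else default

def respond(text, detected_lang):
    """Pair the textual answer with the voice the TTS engine should use."""
    return {"text": text, "voice": select_tts_voice(detected_lang)}
```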
One crucial problem of spoken document retrieval is term misrecognition, which prevents the system from processing the information need correctly: ASR errors produce a term mismatch between the user input and the document content. We used efficient state-of-the-art retrieval models [10] based on n-gram decomposition to deal with this issue. The system processes both the searchable content and the user input in this way, allowing fast and robust query matching. These models achieve state-of-the-art effectiveness figures while also being quite efficient [11].
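As an illustration of why n-gram decomposition is robust to ASR errors, the following sketch scores term similarity with a Dice coefficient over character trigrams; a misrecognized term still shares most of its n-grams with the intended one. The helper names and the choice of coefficient are our assumptions here, not the exact models of [10].

```python
def char_ngrams(text, n=3):
    """Decompose a term into overlapping character n-grams."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(term_a, term_b, n=3):
    """Dice coefficient over n-gram sets: small recognition errors only
    remove a few shared n-grams, so the score degrades gracefully."""
    a, b = char_ngrams(term_a, n), char_ngrams(term_b, n)
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))
```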
For answering information needs, we designed a four-level cascade system. First, the system tries to classify the user's intent into one of several predefined structured tasks (e.g., the timetable for a subject or the date of an exam). If the input falls into one of those categories, the answer is processed according to the defined pipelines. Second, if that is not the case, the system attempts to provide a direct answer to the specific user question. For that, we propose two approaches: best-sentence matching and BERT-based question answering [12]. Third, if there is no satisfactory direct answer, the system tries to provide the best document answer. Finally, if the system does not rank any satisfactory documents, it asks the user to reformulate the question.
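The four-level cascade can be outlined as follows. The callables, return conventions, and threshold are placeholders for the actual intent classifier, sentence matcher, and document ranker, which the sketch treats as black boxes.

```python
def answer(query, classify_intent, sentence_match, doc_match, threshold=0.5):
    """Four-level cascade: structured intent -> direct sentence answer ->
    best document -> request for reformulation."""
    # Level 1: predefined structured tasks (timetables, exam dates, ...).
    intent = classify_intent(query)
    if intent is not None:
        return ("intent", intent)
    # Level 2: direct answer via sentence matching / question answering.
    sent, score = sentence_match(query)
    if score >= threshold:
        return ("sentence", sent)
    # Level 3: fall back to the best-ranked document.
    doc, score = doc_match(query)
    if score >= threshold:
        return ("document", doc)
    # Level 4: nothing satisfactory was found.
    return ("reformulate", "Could you rephrase your question?")
```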
Architecturally speaking, we are thus using a basic client-server application. The web client communicates with the backend through REST services and WebSocket APIs. For the web interface, we used BotUI (https://github.com/botui/botui), a very intuitive JavaScript library for conversational interfaces. The server contains the implementation of the different REST endpoints and WebSocket APIs for processing audio streams.

Conclusions and Future Work
In this article, we presented our design of a conversational assistant based on open-source and well-known technologies. Although we exemplified the design with a specific web domain for its implementation, the architecture introduced here can consume other information repositories, such as enterprise data sources or databases.
There are many avenues for future work. We propose to improve the architecture with advanced Natural Language Generation (NLG) capacities. The fine-tuning of acoustic models for specific language variants is another interesting research direction. When not constrained to low-resource languages such as Galician, we would favor the use of Tacotron as the TTS engine [13].

Conflicts of Interest:
The authors declare no conflict of interest.