Developing Amaia: A Conversational Agent for Helping Portuguese Entrepreneurs—An Extensive Exploration of Question-Matching Approaches for Portuguese

This paper describes how we tackled the development of Amaia, a conversational agent for Portuguese entrepreneurs. After introducing the domain corpus used as Amaia’s Knowledge Base (KB), we make an extensive comparison of approaches for automatically matching user requests with Frequently Asked Questions (FAQs) in the KB, covering Information Retrieval (IR), approaches based on static and contextual word embeddings, and a model of Semantic Textual Similarity (STS) trained for Portuguese, which achieved the best performance. We further describe how we decreased the model’s complexity and improved scalability, with minimal impact on performance. In the end, Amaia combines an IR library and an STS model with reduced features. Towards a more human-like behavior, Amaia can also answer out-of-domain questions, based on a second corpus integrated in the KB. Such interactions are identified with a text classifier, also described in the paper.


Introduction
ePortugal (https://eportugal.gov.pt/) is a web portal managed by the Portuguese Administrative Modernization Agency (AMA), which "aims to facilitate the interactions between citizens, companies and the Portuguese State, making them clearer and simpler". Among others, it provides information on public administration services, which may, indirectly, answer a broad range of questions. However, due to the huge amount of information on different services, involving significantly different procedures, and thus also organized differently, some answers can be hard to get or take too much time to find.
In order to make the process of finding answers easier for entrepreneurs, about two years ago, we were challenged to develop an alternative interface to Balcão do Empreendedor (BDE, in English, Entrepreneur's Desk), now incorporated in ePortugal. Beyond a search interface, it would enable interested users to ask questions in natural language and have them answered automatically, also in natural language, thus avoiding the need to explore the site and spend time on navigation and on reading long documents. In fact, the challenge was to develop a computational agent that, among other conversational skills, would be able to help entrepreneurs willing to develop an economic activity in Portugal, by providing answers to their questions.
Given the limitations of end-to-end conversational agents, and once we noticed that lists of Frequently Asked Questions (FAQs) were available for some of the target services, we decided to
• We compare the supervised STS model with a broader range of unsupervised approaches for STS and make a more thorough selection of features, also considering the complexity of the model;
• Amaia uses and is assessed on a new version of the AIA-BDE corpus;
• Amaia relies on a more flexible strategy for identifying OOD interactions, based on a classifier, and provides answers to such questions based on a smaller and more controlled corpus.
The remainder of the paper is organized as follows: Section 2 overviews related work on conversational agents and IR-based natural language interfaces to FAQs; Section 3 describes the corpora used in this work, namely the AIA-BDE corpus, used both as Amaia's KB and for evaluation purposes, and the Chitchat corpus, which Amaia resorts to for handling OOD interactions; Section 4 discusses the performance of several unsupervised approaches for STS in the AIA-BDE corpus; Section 5 describes how a model can be trained for STS in Portuguese and then applied to AIA-BDE, also including a discussion on the selection of the most relevant features; Section 6 is on the approach for dealing with OOD interactions, which includes training a classifier for discriminating between such interactions and domain questions. Before concluding in Section 8, Section 7 wraps everything with the integration of the IR library, the STS model, and the OOD classifier, as well as the created corpora, in Amaia, illustrated with an example of a conversation.

Related Work
Dialogue systems typically exploit large collections of text, often including conversations. End-to-end generative systems model conversations with a neural network that learns to decode sequences of text (e.g., interactions) and translate them to other sequences (e.g., responses) [9]. Such systems are generally scalable and versatile, always generate a response, but have limitations for performing specific tasks. As they make few assumptions on the domain and generally have no access to external sources of knowledge, they can rarely handle factual content. They also tend to be too repetitive and provide inconsistent, trivial or meaningless answers.
Domain-oriented dialogue systems tend to follow other strategies and integrate Information Retrieval (IR) and Question Answering (QA) techniques to find the most relevant response for natural language requests. In traditional IR [1], a query represents an information need, typically in the form of keywords, to be answered with a list of documents. Relevant documents are generally selected because they mention the keywords, or are about the topics they convey. Automatic QA [10], in contrast, finds answers to natural language questions. Answers can be retrieved from a structured KB [11] or from a collection of documents [12]. This has similarities to IR, but queries have to be further interpreted and possibly reasoned upon (where Natural Language Understanding (NLU) capabilities may be necessary), while answers are expected to go beyond a mere list of documents.
Given a user input, IR-based conversational agents search for the most similar request in the corpus and output its response (e.g., [13]). They rely on an IR system for efficiently indexing the documents of the corpus. In order to identify similar texts and compute their relevance, a common approach is to rely on the cosine between vector representations of the query and of the indexed texts, where words can be weighted according to their relevance, with techniques such as TF-IDF. Instead of relying exclusively on the cosine, an alternative function can be learned specifically for computing the relevance or relatedness of a document for a query. This can be achieved, for instance, with a regression model that considers several lexical or semantic features to measure Semantic Textual Similarity (STS, [14]). This is also a common approach of systems participating in STS shared tasks (e.g., [3]), some of which cover pairs of questions and their similarity [15]. A related shared task is Community Question Answering [16,17], where similarity between questions and comments or other questions is computed, for ranking purposes.
STS can also be useful in the development of natural language interfaces for lists of FAQs. Due to their nature and structure, the latter should be seen as valuable resources for exploitation. In this context, there has been interest in SMS-based interfaces for FAQs [18], work on QA from FAQs in Croatian [19], and a shared task on this topic, in Italian [20]. FAQ-based QA agents often pre-process text in questions, answers, and user requests, applying tokenization and stopword removal operations. For retrieving suitable answers, the similarity between user queries and available FAQs is computed by exploiting word overlap [19], the presence of synonyms [18,21], or distributional semantic features [19,22].
In opposition to generative systems, IR-based dialogue systems do not handle very well requests for which there is no similar text in the corpus. However, an alternative IR-based strategy can still be followed, in this case, for finding similar texts in a more general corpus, such as movie subtitles [23]. Either with an IR or generative approach, an important challenge is to give consistent responses. For this purpose, there are different approaches for developing conversational agents with a persona. In the generative domain, persona embeddings can be incorporated [24], while in the IR-domain, this issue has been tackled by including a smaller corpus of personal questions and answers [25].

Corpora
Our main goal was to develop a conversational agent that would answer questions related to entrepreneurship and to carrying out economic activity in Portugal. To some extent, it would be an alternative channel to searching for the necessary information in the former Balcão do Empreendedor (BDE, in English, Entrepreneur's Desk) and related services, now incorporated in the ePortugal website.
However, towards a more human behavior, we also wanted the agent to enable basic open-domain conversations. Therefore, we have also compiled the Chitchat corpus, to be used instead of AIA-BDE for answering out-of-domain (OOD) interactions. This section describes both corpora and, later, Section 6 explains how Amaia discriminates between domain and OOD interactions.

The AIA-BDE Corpus
In order to create what we later called the agent's KB, we asked AMA for data available in BDE that would be valuable to such an agent and, at the same time, easy to integrate and exploit. Once we found that several services had FAQs, we decided to focus the development of the KB around them, and get some inspiration from related work on FAQ-based agents [18][19][20]. At the same time, we aimed to develop an agent with a flexible architecture that would be easily adapted to other domains, and this also seemed like a good option from that perspective.
FAQs were collected and compiled in the KB, so that the agent would access questions and their answers. The agent would thus try to match natural language user requests with questions for which it had an answer, i.e., they were answered by a FAQ in the KB. Once it identifies the most similar FAQ to the request, it may provide its answer to the user.
We baptized the corpus of FAQs as AIA-BDE, and it is now in the second version, the one used in this paper, with more questions and more variations than its first version [6]. More precisely, it contains 855 FAQs from four different sources: Espaço Empresa (EE, Business Spot, 625 FAQs), Apoios Sociais (AS, Social Support, 56 FAQs), Regime de Acesso a Atividades de Comércio, Serviços e Restauração (RJACSR, Access Regime to Commerce, Services and Catering Activities, 118 FAQs), and Alojamento Local (AL, Local Accommodation, 56 FAQs) (For those interested, both versions of the AIA-BDE corpus are available from https://github.com/NLP-CISUC/AIA-BDE).
However, in addition to the FAQs, AIA-BDE also contains their variations, which are paraphrases or related questions using other words, sometimes omitting information. To some extent, such variations simulate user requests for which answers are available. Therefore, we may use them for assessing our agents on this domain, i.e., how well they can match variations with the original questions.
There are at least five variations for each question, and some have up to 12. However, as there is no perfect way of creating variations, and because their manual creation is time-consuming, variations were produced over time, following significantly different approaches, and by different people. Therefore, since matching variations created differently may pose different challenges, we marked variations according to their creation process, namely: •

The Chitchat Corpus
The Chitchat corpus has the same format as AIA-BDE, but includes OOD interactions and their responses, acquired from two different sources:

• Handcrafted set of 22 personal questions, i.e., questions that are commonly asked in chats, and their responses;
• About 1500 interaction-response pairs obtained from the Portuguese part of the Subtle [7] corpus of movie subtitles.
Subtitles are indeed a great source of material for chitchat. However, we soon noticed that, when using too many subtitles with no selection criteria, conversations could easily become impractical. Therefore, we selected a subset of interactions that occur 50 or more times in Subtle, as well as their most frequent response (For those interested, the Chitchat corpus is available from https://github.com/NLP-CISUC/AIA-BDE). In this process, interactions and responses with strange characters and proper nouns were ignored. Table 2 illustrates the Chitchat corpus with some examples of its entries.

Answering AIA-BDE with Unsupervised Approaches
As the AIA-BDE corpus allows for the assessment of different approaches when matching variations (i.e., simulations of user requests) with actual questions, we used it as a benchmark in this task. In this section, we look at the performance of several unsupervised approaches for STS, in the sense that they rely exclusively on the existing data and, in some cases, on pre-trained embeddings. The first is a traditional IR approach, based on a full text search library, used for indexing and searching the text, according to different configurable parameters. The second group of approaches is based on vector representations of text, which can be created directly from the data, or based on pre-trained models of word embeddings. To some extent, these approaches could be seen as baselines. However, as we show throughout the paper, some rely on very powerful language models that lead to high performances.
In both cases, performance was measured by computing the accuracy of each approach in all the variations of the AIA-BDE corpus. Moreover, having in mind that, in many scenarios, it is better to return a smaller set of answers that include the correct one, than to give no answer or return one that is incorrect, accuracy was also measured for the presence of the correct answer in the top-3 or top-5 best-ranked candidates.
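The top-n accuracy described above can be sketched in a few lines of Python. This is only an illustration: the function name, the FAQ identifiers, and the toy rankings are hypothetical and are not taken from AIA-BDE.

```python
def top_k_accuracy(ranked_candidates, gold, k):
    """Fraction of requests whose correct question is among the k best-ranked candidates."""
    hits = sum(1 for ranking, target in zip(ranked_candidates, gold)
               if target in ranking[:k])
    return hits / len(gold)


# Hypothetical rankings (FAQ identifiers, best first) for three variations.
rankings = [["q1", "q7", "q3"], ["q4", "q2", "q9"], ["q5", "q6", "q8"]]
gold = ["q1", "q2", "q0"]  # identifier of the original question of each variation

top1 = top_k_accuracy(rankings, gold, 1)  # only the first variation is matched
top3 = top_k_accuracy(rankings, gold, 3)  # the second one is recovered in the top-3
```

With k = 1 this reduces to plain accuracy, so the same function covers the Top1, top-3, and top-5 figures.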

Traditional IR
For testing a traditional IR approach, we relied on Whoosh (https://whoosh.readthedocs.io), a Python full text search library, which builds an index for a corpus and enables efficient text-based searches on it. More precisely, Whoosh was used for indexing the AIA-BDE corpus, such that each FAQ was represented by two fields, the question and the answer, with searches made on the question. For the same corpus, Whoosh provides different ranking functions and analyzers that may be used, some of which are specific to Portuguese. We tested both the BM25F and the Frequency scoring functions, opting for the former due to the poor performance of the latter, whose accuracy remained below 15% in all our tests. The group parameter value of the query parser was changed to OrGroup, which makes the terms in the query optional. When compared to the default setting (see our previous paper [8]), this improves the matching performance significantly.
In addition to the default indexation, Whoosh also allows for the application of a set of filters, possibly included in an analyzer, which may differ in how text is tokenized, or how tokens are normalized. In this work, the following configurations were compared:
• Default + Fuzzy, the default configuration with Fuzzy Search, which enables partial matches (e.g., spelling mistakes).
• LanguageAnalyzer, which converts words to lower-case, removes Portuguese stopwords, and converts words to their stem, following Portuguese rules.
• Stemming Analyzer, a simplification of the previous that does not remove stopwords.
• Stemming Analyzer + Charset Filter, the Stemming Analyzer followed by a filter that removes graphical accents.
• N-gram Filter (2-3), which tokenizes text and indexes it according to character n-grams of sizes 2 and 3.
• N-gram Filter (2-4), which tokenizes text and indexes it according to character n-grams of sizes 2, 3 and 4.
Table 3 shows the accuracy of the previous configurations when matching the original questions with themselves, for sanity check, and with the set of all available variations in AIA-BDE. Since Whoosh may return more than a single result, i.e., a ranked list with the most relevant results, we can also look for the presence of the correct question in the top-n results. Thus, in addition to the first in the rank (Top1), the table presents the proportion of questions for which the correct match was in the top-3 and top-5. This also takes into account that, even in a real application scenario, the risk of missing the correct match can be reduced by presenting the top-n matches, hoping that one of them will be correct.
As expected, the great majority of questions are correctly matched with themselves, which shows that the traditional IR approach is doing its job well. The few questions that are not matched are short and share most of their tokens with other questions. For instance, with the Default configuration, these are mostly questions that differ in a single word, such as: O que é um certificado digital? and O que é um certificado digital qualificado?, or Quem é o franchisador? and Quem é o franchisado?. As for the variations, the Stemming Analyzer leads to the best results, especially with the Charset Filter. We recall that the only difference between the Language and the Stemming Analyzer is that the latter does not remove stopwords, which shows that, unlike in other tasks, stopwords are important here. With the best configuration, the proportion of correct matches is close to 80%, with almost 90% in the top-3 and more than 92% in the top-5. This confirms that, given a user request, considering more than a single question may significantly increase the chance of giving the right answer.
Table 4 shows the accuracy for the variations of each type. As expected, the highest performance is for the VG1 and VG2 variations because, in terms of surface text, they are closer to the original questions. Nevertheless, accuracy is significantly lower than for the original questions (about 10 points for the top-1, and 3 for the top-5, considering the best configuration in both). Manually-created variations are the most difficult to match correctly, especially VUC and VMT. With the best configuration, about 62% of the VUC variations are matched correctly, and 83% in the top-5. VMT variations are also those for which the best performance, 60% correct matches, is achieved without the Charset Filter, and for which the best performance for the top-3 (75%) and top-5 (86%) is achieved with the Language Analyzer.
Nevertheless, from these figures, we would decide to use Whoosh with the Stemming Analyzer and the Charset Filter. With the N-gram filter, performance decreases significantly for all variations, especially when 4-grams are not included, so it would not be a viable option.
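To make the effect of two of the compared normalizations more concrete, the sketch below approximates accent removal and character n-gram tokenization in plain Python. It is not Whoosh's implementation: the function names are ours, and the exact behavior of the library's filters may differ in details.

```python
import unicodedata


def strip_accents(text):
    """Remove graphical accents (e.g., 'ã' -> 'a'), roughly what the charset filter does."""
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks (category 'Mn') carry the accents after decomposition.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def char_ngrams(token, min_size, max_size):
    """All character n-grams of a token with sizes between min_size and max_size."""
    return [token[i:i + n]
            for n in range(min_size, max_size + 1)
            for i in range(len(token) - n + 1)]


normalized = strip_accents("Balcão").lower()  # "balcao"
grams = char_ngrams("quem", 2, 3)             # ["qu", "ue", "em", "que", "uem"]
```

Short n-grams such as "qu" or "em" are shared by many unrelated words, which is one plausible reason why the configurations without 4-grams match variations less precisely.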

Word Vector Approaches
In the second group of approaches, each sentence was represented by a fixed-length vector of numbers, and similarity was computed with the cosine between the vector representation of each variation and the vector representation of all original questions. Different methods were used for representing the sentence as a vector, including traditional approaches, where the vector representation considers only the vocabulary of our data and the surface text, but also approaches based on pre-trained models of static word embeddings and state-of-the-art contextual embeddings. The traditional methods tested were based on the following scikit-learn [26] implementations:
• Count Vectorizer, which converts each sentence to a vector of token counts.
• TFIDF Vectorizer, which converts each sentence to a vector of TF-IDF features, i.e., the weight of each token increases proportionally to count, but is inversely proportional to its frequency in the corpus, in this case, the original questions of AIA-BDE.
Both were used with default parameters, meaning that sentences were represented by sparse vectors with a fixed-size equal to the size of the vocabulary.
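As an illustration of matching with count vectors and the cosine, the following sketch uses only the Python standard library. It assumes plain whitespace tokenization, which differs from scikit-learn's default tokenizer, and the user request is invented; the two questions are examples quoted earlier in the paper.

```python
import math
from collections import Counter


def count_vector(sentence):
    """Sparse vector of token counts (lower-cased, whitespace tokenization)."""
    return Counter(sentence.lower().split())


def cosine(u, v):
    """Cosine between two sparse vectors stored as token -> weight mappings."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


questions = ["O que é um certificado digital?", "Quem é o franchisador?"]
request = "o que é certificado digital"  # hypothetical user request

scores = [cosine(count_vector(request), count_vector(q)) for q in questions]
best = questions[scores.index(max(scores))]
```

Replacing the raw counts with TF-IDF weights only changes how the sparse vectors are built; the cosine step stays the same.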
In approaches based on static word embeddings, the sentence vector is computed from the vector of each of its words, according to a pre-trained model. In this process, tokens without alpha-numeric characters (e.g., punctuation signs) and tokens not covered by the model are ignored. Moreover, all words may have the same weight, resulting in the average embedding, or they can be weighted by the relevance of each word, given by the TF-IDF, again computed in the original questions. Four different pre-trained models of this kind were tested in this experiment, learned with the following algorithms:
• word2vec [27], namely its two common variations of CBOW and SKIP-GRAM;
• GloVe [28], a common alternative to word2vec;
• FastText [29], as an attempt to better deal with the Portuguese morphology, given that it considers character n-grams.
The word2vec and GloVe models pre-trained for Portuguese were obtained from the NILC word embeddings repository [30]. For FastText, we used a different source, trained by the creators of this algorithm (https://fasttext.cc/). All of them had vectors with 300 dimensions and were loaded with the Gensim Python library [31].
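The two ways of building a sentence vector from static embeddings can be sketched as follows. The toy two-dimensional vectors and the weights are invented for illustration, and dividing by the sum of weights is our assumption; since the cosine ignores vector length, that normalization does not affect the resulting similarity.

```python
def sentence_vector(tokens, embeddings, weights=None):
    """Weighted average of the vectors of the tokens covered by the model.

    With weights=None all words weigh the same (average embedding);
    otherwise each word vector is scaled by its weight (e.g., its TF-IDF).
    Tokens missing from the model, such as punctuation, are ignored.
    """
    covered = [t for t in tokens if t in embeddings]
    if not covered:
        return None
    dim = len(embeddings[covered[0]])
    total, weight_sum = [0.0] * dim, 0.0
    for t in covered:
        w = 1.0 if weights is None else weights.get(t, 0.0)
        weight_sum += w
        for i, x in enumerate(embeddings[t]):
            total[i] += w * x
    return [x / weight_sum for x in total] if weight_sum else None


# Toy 2-dimensional model; the models used in the paper have 300 dimensions.
toy = {"certificado": [1.0, 0.0], "digital": [0.0, 1.0]}
avg = sentence_vector(["certificado", "digital", "?"], toy)
weighted = sentence_vector(["certificado", "digital"], toy,
                           {"certificado": 3.0, "digital": 1.0})
```

Here avg is [0.5, 0.5], while the weighted vector is pulled towards the word with the higher weight.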
Approaches based on contextual embeddings relied on BERT [32], a recent model that encodes words and longer sequences based on a Transformer neural network. In this case, full sentences were encoded directly by BERT, which resulted in their vector representation. Two pre-trained BERT models were used for this purpose: bert-large-portuguese-cased (Portuguese BERT) [33], trained specifically for Portuguese, which encodes given text in 1024-sized vectors.
BERT models were loaded with the bert-as-a-service tool (https://github.com/hanxiao/bert-as-service), with default options, except for the maximum length of sequences, set to NONE for dynamically using the longest sequence in the batch.
Similarly to Table 3, Table 5 shows the accuracy of the approaches based on the previous models when matching the original questions, for sanity check, and for the set of all variations in AIA-BDE. As in the previous section, accuracy is obtained from the number of variations for which the correct question was the most similar. For the top-3 and top-5, the correct question must be in the top-3 and top-5 most similar, respectively.
The first observation is that word vectors that are learned from external sources of text lead to better performances than the Count and the TF-IDF, which are computed from the questions of AIA-BDE and rely only on the surface text. Another observation is that, with the pre-trained word embeddings, performance decreases with TF-IDF. Although TF-IDF would give more weight to more relevant words, this is only based on the questions of AIA-BDE, which are probably not enough for computing proper weights. Another reason for this may be related to the role of stopwords. TF-IDF should give them less weight, but the previous experiments with Whoosh suggested that removing stopwords had a negative impact on performance. Different performances are achieved by different models, with the best achieved by the word2vec-CBOW, without TF-IDF. Its figures are comparable to the best achieved with Whoosh. Surprisingly, none of the state-of-the-art BERT models could outperform word2vec. Out of the two, the best was the Portuguese BERT, which makes sense because it was trained exclusively for Portuguese. Its performance is comparable to word2vec-SKIP, which is the second best model.
Table 6 shows the accuracy for the variations of each type. Again, performance is higher for VG1 and VG2 and lower for the manually-created variations. However, these figures show that the selection of the best model is not as straightforward as it was for Whoosh, with different models having the best performance for different variations. For instance, Multilingual BERT achieved the best performance for VG1, considering only the first result, and VG2, possibly because these variations are generated with the help of machine translation and this BERT model is multilingual. Moreover, since this model was trained by Google, it is also possible that it is somehow used by Google Translate.
The best performance in the VIN variations (83.5%, about 1 point higher than the best with Whoosh) is by the Portuguese BERT, the same model that achieves the best performance in the VUC (60.4%, about 2 points lower than the best with Whoosh). However, this happened only for the first result, with word2vec-CBOW doing slightly better in the top-3 and top-5. The latter was also the best model for the VMT variations and, when considering the top-3 and top-5, for VG1, which is why it was the best model overall.
Considering also that word2vec is less complex than BERT, out of the tested models, it would be our choice. However, we believed that these figures could be further improved if several models were combined, and possibly combined with other features. Therefore, in the next section, we describe how different features can be exploited for learning a model of Portuguese STS that suits our purpose.

A Model for Portuguese STS
After testing several unsupervised approaches, which, due to their easy implementation, can be seen as baselines, we leveraged available data for training an STS model for Portuguese, which would hopefully improve on the performance of the baselines. The goal was to develop a model as broad as possible that would exploit many potentially useful features. However, at the same time, we did not want it to become too complex, which is why we tried to use only a fraction of all the features that we could extract. The development of the STS model followed a supervised learning approach. It was validated, trained, and tested on sentence pairs from the collections of ASSIN [4] and ASSIN 2 [5], which comprise a total of ≈20,000 pairs with annotated similarity scores, based on human opinions, ranging between 1 (completely different) and 5 (equivalent).
Before concluding the section, we describe how the IR approach can be combined with the STS model for reducing the number of necessary computations.

Training a Model for Portuguese STS
To compute the STS between sentence pairs, a broad set of 64 features was initially extracted, covering different types of features, namely lexical, syntactic, semantic, and distributional. Features were extracted with the help of the following Python libraries: NLTK [34], for getting token and character n-grams; NLPyPort [35] (i.e., NLTK with some improvements for Portuguese), for getting Part-of-Speech (PoS) tags, named entities and lemmas; Gensim [31] and scikit-learn [26], for extracting distributional features, which included the word embeddings in Section 4.2 and others.
However, as mentioned earlier, we wanted to avoid a very complex model. Therefore, even before training and testing any model, we tried to reduce the dimensionality of the feature set. For this purpose, we ran Recursive Feature Elimination (RFE), available in scikit-learn, to select the most relevant features out of the initial 64. This method requires an external estimator for assigning weights to features according to their respective importance. In this case, we chose a Random Forest Regressor (RFR) model as the estimator. Even though other algorithms could have been used for this purpose, we had previous experience with the RFR in a similar context. Starting with the initial set of features, the estimator is repeatedly trained until the desired number of features is reached, by removing the least important feature from the group at each iteration. We tested different thresholds for the number of features to be selected, ranging from 20% (top-13 features) to 80% (top-51 features) of the original set, and evaluated the performance of each test with the coefficient of determination R² of the prediction, which allowed us to select the threshold value. To avoid overfitting, this process was run on a validation set comprising 10% of the sentence pairs in the ASSIN and ASSIN-2 training collections, selected randomly.
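The elimination loop described above can be sketched in plain Python. This is a simplified illustration and not scikit-learn's RFE: the estimator is abstracted as a function that returns importances for the currently selected features, the feature names and importance values are hypothetical, and rounding the threshold to the nearest integer is our assumption (it reproduces the 13, 27, and 51 features mentioned in the text).

```python
def recursive_feature_elimination(features, importance_fn, n_keep):
    """Repeatedly drop the least important feature until n_keep remain.

    importance_fn(selected) returns a mapping feature -> importance, re-estimated
    on the currently selected features (a Random Forest Regressor plays this
    role in the paper; here it is left abstract).
    """
    selected = list(features)
    while len(selected) > n_keep:
        importances = importance_fn(selected)
        weakest = min(selected, key=lambda f: importances[f])
        selected.remove(weakest)
    return selected


def n_features_for(threshold, total=64):
    """Number of features kept for a threshold given as a fraction of the full set."""
    return round(threshold * total)


# Hypothetical fixed importances standing in for a trained estimator.
toy_importance = {"jaccard_1gram": 0.5, "cosine_cbow": 0.3,
                  "n_adverbs": 0.05, "cosine_glove": 0.15}
kept = recursive_feature_elimination(
    toy_importance,
    lambda selected: {f: toy_importance[f] for f in selected},
    n_keep=2,
)
```

In a real run, importance_fn would retrain the estimator on the remaining features at every iteration, which is what makes the procedure recursive rather than a single ranking.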
The best performance was achieved with a threshold of 42%, meaning that the initial set of 64 features could be reduced to 27. This includes the following features:
• Jaccard coefficient computed between the sets of token 1-grams (1).
These features are also summarized in Table 7. We note that most models of static word embeddings tested in Section 4.2 were in the set of 64. BERT contextual embeddings, on the other hand, were not included, due to the large memory requirements of these models. Curiously, even though fastText was one of the worst-performing methods back then, it was selected, possibly due to its complementary nature.
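As an example of one of the lexical features, the Jaccard coefficient between the token sets of two sentences can be computed as below. Whitespace tokenization is assumed here for simplicity, whereas the paper obtains token n-grams with NLTK; the sentence pair is adapted from questions quoted earlier.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard coefficient between two token sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


s1 = "quem é o franchisador".split()
s2 = "quem é o franchisado".split()
score = jaccard(s1, s2)  # 3 shared tokens out of 5 distinct ones
```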

Cosine (token vectors) | Models
Average | word2vec-CBOW, GloVe, fastText.cc, Numberbatch, PT-LKB
TF-IDF weighted | word2vec-CBOW, GloVe, fastText.cc, Numberbatch, PT-LKB

With this feature set, we explored different regression algorithms available in scikit-learn. They were trained on the remaining training pairs of ASSIN and ASSIN-2 (90%, after removing the 10% used for feature selection), and tested on each of the three test collections available. In those experiments, a Support Vector Regressor (SVR) and a Random Forest Regressor (RFR), both using default parameters, stood out, with comparable results. However, we decided to stick with the SVR because we had already used it in our previous work, with both ASSIN [38] and AIA-BDE [8].

Further Reducing the Size of the Model
Even though we were able to reduce the feature set considerably with RFE, we had an intuition that this set could be further reduced. Our intuition mainly relied on the fact that the reduced set includes five models of word embeddings that, although learned with different algorithms, should overlap to some extent. Moreover, we wanted to analyze whether we could get rid of two features that require the use of two external libraries, namely the syntactic dependencies, which require spaCy, and the adverbs, which require NLPyPort. Those features not only increase the complexity of the model, but also of Amaia's installation, which will depend on additional software packages. In fact, even though the STS model only requires the dependency parsing and PoS tagging, in order to compute such features, those external libraries end up performing additional analyses that take time, with no direct benefits for the model.
We thus decided to test the impact of removing the aforementioned features from the 27-feature model. However, before this, we analyzed the impact of reducing the size of the largest word embeddings, namely word2vec-CBOW, GloVe, and fastText.cc. This was motivated by the fact that, in the embeddings matrix, words of the vocabulary are ordered according to their frequency, i.e., the most frequent words are in the initial lines and the final lines include rare words, frequently typos. Thus, we removed everything except the first 300,000 lines of each of the three aforementioned models and repeated the experiments of Section 4.2 with the smaller versions.
The conclusion was that such a reduction did not impact the performance in AIA-BDE. The highest drop of performance was 0.2 percentage points in fastText.cc, while word2vec-CBOW and GloVe had exactly the same performance and an increase of 0.1 points in the top-5. Therefore, we decided to start using the reduced embeddings, thus decreasing the memory required for the STS model.
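The reduction of the embedding models can be sketched as follows, assuming that they are stored in the word2vec text format, i.e., a header line with the vocabulary size and the number of dimensions, followed by one word vector per line in decreasing order of frequency. The file layout and the function name are our assumptions and are not taken from the paper.

```python
import io
from itertools import islice


def truncate_embeddings(src, dst, max_rows):
    """Copy only the first max_rows vectors of a word2vec-style text file.

    src and dst are open text streams; the header is rewritten so that it
    matches the number of vectors actually kept.
    """
    _, dim = src.readline().split()
    rows = list(islice(src, max_rows))
    dst.write(f"{len(rows)} {dim}\n")
    dst.writelines(rows)
    return len(rows)


# Tiny in-memory example; with real files, pass open(...) handles and max_rows=300_000.
source = io.StringIO("3 2\nde 0.1 0.2\nempresa 0.3 0.4\nempreza 0.5 0.6\n")
target = io.StringIO()
kept_rows = truncate_embeddings(source, target, max_rows=2)
```

When models are loaded with Gensim, a similar effect can be obtained at load time with the limit argument of KeyedVectors.load_word2vec_format, without rewriting the files.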
After that, we moved on to what can be seen as an ablation study. Table 8 shows the performance of each model when tested on both ASSIN and ASSIN-2 test collections. More precisely, it has the Pearson correlation (ρ) and the Mean Squared Error (MSE) between the automatically-assigned similarity scores and those in the collection, which are based on human opinions. In the first line, REDUCED-27 corresponds to the model that uses the 27 features, with subsequent lines corresponding to models where features were manually removed, namely: ADV for the adverbs, DP for dependency parsing, CBOW for word2vec-CBOW, FT for fastText.cc, NB for Numberbatch, and PTLKB for the PT-LKB embeddings. Removing a feature based on embeddings, in fact, entails the removal of both features computed from them, namely, the average embeddings vector and the one weighted with TF-IDF.
The REDUCED-27 model achieved the best performance in ASSIN-2 (ρ = 0.75), but not in ASSIN, where several other models achieved the best Pearson correlation (ρ = 0.72). Performance differences are not substantial. However, we can say that Numberbatch or PT-LKB embeddings do not contribute enough to good performance, and when each of them is the only model of embeddings (lines 6 and 8), performance is generally low. Unlike the others, which have been learned from large quantities of text, these were learned from semantic networks, and thus have lower vocabulary coverage. At the same time, fastText seems to be essential for a good performance in ASSIN-2. Out of the tested models, we selected three for comparison in the AIA-BDE corpus, namely those that we see as having a good balance between performance and number of used features. None of the selected models uses the dependencies or the adverbs feature, and all use fastText.
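For reference, the two evaluation measures reported in Table 8 can be computed as in the following sketch. The gold and predicted scores are invented for illustration and do not come from ASSIN.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mse(xs, ys):
    """Mean squared error between two equally long lists of scores."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


gold_scores = [1.0, 2.5, 4.0, 5.0]       # hypothetical human scores in [1, 5]
predicted_scores = [1.5, 2.5, 3.5, 4.5]  # hypothetical model outputs
rho = pearson(gold_scores, predicted_scores)
error = mse(gold_scores, predicted_scores)
```

The two measures are complementary: the correlation rewards predictions that rank pairs in the same order as humans, while the MSE also penalizes scores that are systematically shifted or scaled.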
The test in AIA-BDE allows for an analysis of the models' behavior in a scenario closer to what we expect from Amaia, especially when considering the manually-created variations (VIN, VUC, VMT). As we did in other tables concerning tests in AIA-BDE, Table 9 has the performance of REDUCED-27 and of the selected models when used for matching the original questions and the set of all variations. It shows that, compared to REDUCED-27, the performance of the selected models is not harmed by the reduction of features. Even if by a low margin, the best performing model overall uses GloVe instead of word2vec-CBOW. Its accuracy is 81.3% and 94.1%, respectively, for the first result and for the top-5, which, though not substantial, is still more than one point higher than the unsupervised approaches, based on IR (Table 3) and word vectors (Table 5).
Table 10 has the performance for each variation. The best model overall, in the second row, is also the best for all variations, except for VMT, where the best performance is by REDUCED-27, with all the others tied. When looking at the performance in the top-5, the best model is also not always the same, but differences are low.

Combining IR with STS
Even with a reduced model of 19 features, relying on an STS model implies that STS is computed between each user interaction and all the questions in the agent's KB. For a large KB, this might result in higher response times.
In the IR alternative, however, this problem is minimized, due to the index. Therefore, in a final experiment, we aimed at combining the IR approach with the STS model. More precisely, we create an index with the best Whoosh configuration (Stemming + CharsetFilter, see Table 4) and then, for each user interaction, we use Whoosh for retrieving the 30 most relevant questions in the KB, and only apply the STS model to the questions in this subset. While this definitely makes the system more scalable, we had to test whether it could harm performance. Table 11 has the overall performance of REDUCED-27 and of the best model in the previous section (Table 9). Surprisingly, when considering only the first result, performance is not harmed at all. Together with the results of Section 4.1, this supports that traditional IR is already a good baseline for matching user interactions with questions. Though not always in the first position, the best candidate is often among the top retrieved candidates. Moreover, STS is better for discriminating the single best candidate out of the top retrieved. On the other hand, considering the presence in the top-5, performance has a small drop of 0.4 points for the best model and 0.3 for REDUCED-27. However, given that scalability can be significantly increased with the initial selection by the IR approach, with a still negligible loss of performance, we opted for this combination as the question-matching approach of Amaia.
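The two-stage idea can be sketched as follows. This is only an illustration in pure Python: `retrieve_candidates` is a token-overlap stand-in for the Whoosh index and `sts_score` is a Jaccard stand-in for the trained STS model, so neither reflects the actual components; the toy KB is also hypothetical.

```python
def tokens(text):
    """Lowercased set of whitespace-separated tokens."""
    return set(text.lower().split())

def retrieve_candidates(query, questions, k=30):
    """Stage 1 (stand-in for the Whoosh index): keep the k questions that
    share the most tokens with the query, dropping those that share none."""
    q = tokens(query)
    scored = [(len(q & tokens(c)), c) for c in questions]
    scored = [(s, c) for s, c in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]

def sts_score(a, b):
    """Stage 2 (stand-in for the trained STS model): Jaccard similarity."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def match(query, questions, k=30, top=5):
    """Re-rank only the k pre-selected candidates with the costlier scorer."""
    candidates = retrieve_candidates(query, questions, k)
    ranked = sorted(candidates, key=lambda c: sts_score(query, c), reverse=True)
    return ranked[:top]

# Hypothetical toy KB, in English for readability.
kb = [
    "how do I open a restaurant",
    "how do I register a company",
    "what licence is needed to open a restaurant",
    "where can I pay taxes",
]
print(match("open a restaurant", kb))
```

The point of the design is that the expensive scorer runs on at most k questions per interaction, regardless of the size of the KB, at the cost of missing any relevant question that the first stage fails to retrieve.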

Identifying Out-of-Domain Interactions
A common limitation of IR-based conversational agents is in handling Out-Of-Domain (OOD) interactions. Though not always required, to give the agent a more human-like behavior, it would be interesting to have responses for virtually any question. Therefore, to complement Amaia's capability of answering entrepreneurship questions (domain), we have compiled the Chitchat corpus (see Section 3.2), to be used instead of AIA-BDE, when OOD interactions are identified.
In order to identify OOD interactions and decide whether to search for matching questions in the AIA-BDE or in the Chitchat corpus, a text classifier was trained with all 855 original questions of AIA-BDE (domain) and the same number of randomly selected interactions from the original Subtle corpus (OOD), to keep the classes balanced. The Chitchat corpus was not used directly because it is too small to provide both training and test data. The performance of the classifier was computed for the classification of OOD interactions in four test sets, each one with all the available variations of each type, namely VG1, VG2, VIN, and VUC. A new random selection of 855 questions from the Subtle corpus was added to each test set. This selection was the same for the four datasets. For an overall performance, a fifth dataset contained all of the 4805 question variations (except VMT, which were not available at the time), and the same number of randomly selected OOD interactions, again obtained from Subtle.
Three classification algorithms available in the scikit-learn library were tested for this, namely a Linear SVM, a Random Forest classifier (RF) and a Naïve Bayes (NB) classifier, all used with default parameters. For all, questions were represented by their TF-IDF-weighted vectors. Table 12 shows the performance of each classifier, measured with the precision, recall and F1-score of correctly identifying OOD interactions. Results show that the classifiers perform well in this task. This happens mostly due to the significant differences between the questions in AIA-BDE and in Subtle (see, e.g., Tables 1 and 2). Nevertheless, performance is slightly lower for the VIN and VUC variations, created manually, than for VG1 and VG2, which have more similarities with the original questions. In any case, the SVM classifier performed better than the other two, with a precision of 96% and a recall of 86% overall. This still means that 14 out of 100 OOD interactions will be incorrectly classified, i.e., will be considered domain questions, and thus their response will be retrieved from AIA-BDE. This should not harm the system too much, at least not as much as the 4 out of every 100 interactions labelled as OOD that are incorrectly classified. Given that this is a binary classification problem, this means that such interactions were domain questions and that, unless they are rephrased, users will never get their answer. This is why we decided to use the SVM classifier for classifying OOD interactions but, at the same time, included a flag that enables the programmer to turn it off easily.
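To make the reading of these measures concrete, the sketch below computes precision, recall and F1 for the OOD class from gold and predicted labels. The function name and the ten labels are hypothetical and much smaller than the actual test sets; they only illustrate that a recall error sends an OOD interaction to AIA-BDE, while a precision error sends a domain question to the Chitchat corpus.

```python
def ood_metrics(gold, predicted, positive="OOD"):
    """Precision, recall and F1 of identifying the positive (OOD) class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical labels: 5 OOD and 5 domain interactions.
gold = ["OOD"] * 5 + ["domain"] * 5
predicted = [
    "OOD", "OOD", "OOD", "OOD", "domain",           # one OOD missed (hurts recall)
    "OOD", "domain", "domain", "domain", "domain",  # one domain lost (hurts precision)
]

precision, recall, f1 = ood_metrics(gold, predicted)
print(precision, recall, f1)
```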

Amaia: A Portuguese Conversational System
Amaia is a Portuguese conversational agent that results from the combination of the previous components. Besides the two corpora used (AIA-BDE and Chitchat), each indexed in a different Whoosh index, it includes a reduced version of the SVR model for STS, and an SVM classifier of OOD interactions.
In order to get a suitable response R, any interaction I with Amaia goes through the workflow in Algorithm 1. Two parameters are configurable, namely the maximum number of returned questions and answers (n) and a threshold for including a question that is similar to the top one (θ). We empirically set these parameters to n = 3 and θ = 0.1, but, depending on the desired behavior, they can be changed when launching Amaia. The same happens for other options. For instance, handling OOD interactions may be turned off, which makes Amaia always search for the most similar domain question. Whoosh may also be turned off, which implies that the STS model is used for computing STS against all questions in the KB, and not just a subset. This may also be the option for those cases when Whoosh does not retrieve any question for an interaction. However, currently, in this case, Amaia just gives the default response: "Desculpe, não percebi, pode colocar a sua questão de outra forma?" (I'm sorry, I didn't understand, could you rephrase your question?). While it is unlikely to happen with the current configuration (in a Whoosh index of AIA-BDE with the configuration selected in Section 4.1, this happens for four out of the 4973 variations), this behavior works as a fallback mechanism.
The algorithm is complemented with the diagram in Figure 1, which shows the different paths taken by interactions, depending on their classification as OOD or domain, then resulting in different responses. Depending on the classification, a different retriever is used. Moreover, before returning a response, the questions retrieved for domain interactions are re-ranked according to the STS model.

Algorithm 1: Amaia's workflow.
Given an interaction I, use the classifier to label it as OOD or domain;
if I is labelled as OOD then
    Search for I in a Whoosh index of the Chitchat corpus;
    Use the response of the first retrieved interaction as R;
else
    Search for I in a Whoosh index of AIA-BDE;
    Get the top-30 questions retrieved;
    Compute the STS with each of those questions;
    Build R with the following template: "Se a sua pergunta foi: <P> R: <R>" ("If your question was <P> R: <R>"), with <P> replaced by the question with the highest STS and <R> replaced by its answer;
    For each additional retrieved question in the top-n whose STS is at most θ lower than the best, concatenate the following text to R: "Também poderá estar interessado em: <P> R: <R>" ("You may also be interested in <P> R: <R>")
end
Give R as the answer.
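The workflow of Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not Amaia's actual code: the classifier, the two retrievers and the STS scorer are passed in as hypothetical callables (`is_ood`, `search_chitchat`, `search_domain`, `sts`), the threshold is read as "STS at most θ below the best", and the default response is also used when the Chitchat search returns nothing.

```python
FALLBACK = "Desculpe, não percebi, pode colocar a sua questão de outra forma?"

def build_response(ranked, n=3, theta=0.1):
    """ranked: list of (question, answer, sts), sorted by decreasing sts."""
    if not ranked:
        return FALLBACK
    best_q, best_a, best_sts = ranked[0]
    response = f"Se a sua pergunta foi: {best_q} R: {best_a}"
    # Add up to n - 1 further questions whose STS is within theta of the best.
    for q, a, score in ranked[1:n]:
        if best_sts - score <= theta:
            response += f" Também poderá estar interessado em: {q} R: {a}"
    return response

def answer(interaction, is_ood, search_chitchat, search_domain, sts,
           n=3, theta=0.1, k=30):
    """All callables are hypothetical stand-ins for the real components.
    Both search functions return a list of (question, answer) pairs."""
    if is_ood(interaction):
        hits = search_chitchat(interaction)
        return hits[0][1] if hits else FALLBACK
    hits = search_domain(interaction)[:k]
    ranked = sorted(
        ((q, a, sts(interaction, q)) for q, a in hits),
        key=lambda item: item[2],
        reverse=True,
    )
    return build_response(ranked, n, theta)

# Made-up ranked candidates: the second is within theta of the best, the third is not.
example = [("Q1", "A1", 0.9), ("Q2", "A2", 0.85), ("Q3", "A3", 0.5)]
print(build_response(example))
```

Passing the components as arguments mirrors the configurable options mentioned above: turning off OOD handling amounts to an `is_ood` that always returns False, and turning off the index amounts to a `search_domain` that returns the whole KB with a large `k`.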
As it is, Amaia can be easily integrated in Slack (https://slack.com/) or any other communication platform with an API that allows for the integration of bots. Figure 2 shows a brief real conversation with Amaia that illustrates its capabilities. We highlight how it switches between domain and OOD interactions. The user starts by greeting Amaia ('Good evening') and Amaia says hello, meaning that the interaction was correctly labeled as OOD. In the second interaction, the user asks what Amaia can do to help them, which is again answered with a question in the Chitchat corpus, this time a personal question where Amaia describes its goal. After this, the user asks several domain questions, for which Amaia provides good answers. For the third question (a quem tenho de pedir autorização...), the best answer is not the first given, but the second, which supports our option for also returning questions with a close STS. For the fourth question (e o RNAL é mesmo necessário?), Amaia's answer is simply "Sim" (Yes). This happens to be a good answer, but only by chance. In fact, the interaction was labeled as OOD. When, in the fifth question, the acronym RNAL is replaced by its full version, the correct answer is given, in the first position. In the final interactions, the user thanks Amaia and says goodbye, with Amaia giving suitable responses (roughly, 'You're welcome' and 'Goodbye').

Conclusions
We have described the steps towards the development of Amaia, a conversational agent for helping Portuguese entrepreneurs. After presenting AIA-BDE, the corpus used both as Amaia's KB and as our benchmark, we made an extensive comparison of approaches for matching user requests with existing questions. Those included IR-based approaches, unsupervised STS approaches, and supervised STS models. In the end, we adopted the STS model with reduced features, which had achieved the best performance, but only apply it to a subset of the available questions, pre-selected with the best IR approach. Furthermore, we presented how Amaia uses a text classifier for labeling interactions as domain or OOD, and thus looks for matching questions either in AIA-BDE or in a chitchat corpus. Having responses for OOD interactions gives Amaia a more human-like behavior, even if the same interaction always has the same response.
For more variation in the answers, in the future, we may improve how OOD interactions are handled. While learning a generative model could have a negative impact on coherence, we may always define different possible answers for the same question. Moreover, we aim to study how an agent like Amaia may deal with context, and thus avoid giving the same answer several times, while also increasing its performance. This should involve some kind of history, or memory that is updated with each interaction.
The current version of Amaia can be easily integrated in communication platforms, like Slack. In the future, its KB will be extended with more FAQs, which, given our previous options, should not pose scalability challenges. New FAQs will come from new lists and, ideally, some will be generated automatically, either from structured documents, or from raw text. However, the latter poses a difficult challenge due to the complex language used in most documents we have looked at so far, so additional work is required.
We can say that interesting results were achieved, but there is still much room for improving accuracy. Several improvements may come from alternative ways of combining all the features and/or approaches tested here. For instance, we have not tested promising approaches for STS, namely those based on fine-tuning Transformer neural networks like BERT, which recently achieved high performance for Portuguese [39,40]. Thus far, we just used pre-trained BERT models directly. We also aim to test different combinations of approaches in a voting system, to see whether it is capable of outperforming supervised STS models or not. Finally, some of the approaches could possibly benefit from considering the answers when matching questions. However, from our preliminary experiments, some of the answers in AIA-BDE are too long and are thus an additional source of noise that harms performance.