Article

Interpretable Conversation Routing via the Latent Embeddings Approach

Department of Artificial Intelligence, Kharkiv National University of Radio Electronics, 61166 Kharkiv, Ukraine
* Author to whom correspondence should be addressed.
Computation 2024, 12(12), 237; https://doi.org/10.3390/computation12120237
Submission received: 26 October 2024 / Revised: 23 November 2024 / Accepted: 28 November 2024 / Published: 1 December 2024
(This article belongs to the Special Issue Artificial Intelligence Applications in Public Health)

Abstract
Large language models (LLMs) are quickly being adopted in question answering and customer support systems across all domains, including medical use cases. Models in such environments must solve multiple problems, such as general knowledge questions, queries to external sources, and function calling, among many others. Some cases might not even require full text generation; they may need different prompts or even different models. All of this can be managed by a routing step. This paper focuses on interpretable few-shot approaches to conversation routing, such as latent embeddings retrieval. The work presents a benchmark, a thorough analysis, and a set of visualizations of how latent embeddings routing works for long-context conversations in a multilingual, domain-specific environment. The results show that the latent embeddings router achieves performance on the same level as LLM-based routers while offering additional interpretability and a higher level of control over model decision-making.

1. Introduction

Chatbots built with large language models (LLMs) face a significantly more complex set of tasks [1], which can contradict each other. For example, answers may require different lengths and levels of detail, and the model may need to use external data sources, call other models and APIs, or run code to improve the result for a specific use case. A simple monolithic architecture with a single large prompt cannot meet such challenges, as it is unable to execute all these diverse tasks with the same level of quality [2]. One big prompt would not be enough and may cause conflicts between instructions for different use cases (for example, being creative in some situations while giving laconic, straight answers in others). Use cases might also need different generation parameters (different values of temperature or top-p) to execute the corresponding tasks.
The routing layer can solve this problem by orchestrating a system of multiple specified agents, which solve concrete tasks instead of trying to handle everything [3]. This way the chatbot is split into a network of sophisticated sub-applications, which are connected by a classifier that chooses the best answer generator for new messages. This makes the answering process more context-aware, accurate, and cost-efficient. It also helps to avoid conflicts between instructions as these agents can be different prompts, models, or predefined sets of actions and answers.
Another advantage is the increased security of the application as harmful instructions, unintended questions or jailbreaks will not even get passed to the LLM itself [4]. The routing layer can filter out such requests and protect agents from instruction injections and attempts at out-of-scope usage [5].
Routing can be expressed as a simple classification task. Common classifiers like Naive Bayes, SVMs, and other TF–IDF-based methods lack the multilingual capabilities that current transformers can provide. They are easy to build and support, but they are difficult to scale into a multilingual, multidomain environment, where the input language is not predefined and can switch from conversation to conversation.
Routing can also be solved by a fine-tuned transformer or any other deep learning model [6,7]. The only requirement is high prediction speed, as this layer is called for every input and can easily become a bottleneck due to high latency. However, developers would face the problem of gathering and annotating training data for conversations. Fine-tuning a classifier requires thousands of correctly labeled examples across a variety of use cases. Messages have to be diverse, as users can formulate similar requests in multiple ways or make spelling, punctuation, or grammar mistakes. Some cases may be covered poorly, and this would be difficult to catch before real-life usage of the system. Inputs for modern chatbots are not structured and have no predetermined, correct way of expressing users’ intentions. This makes the development and support of the routing layer much more difficult at the early stages of the chatbot lifecycle.
Multiagent chatbot systems can significantly enhance the performance of medical workers by taking over routine tasks and providing a simple way to interact with already collected data [8]. Machine learning can act as an advisory system to help diagnose, forecast, and model situations [9]. However, the interpretability and controllability of machine learning systems are crucial in this field, as they require full trust from the user in the system, with detailed explanations of the decision-making process [10]. Each step of the answering process has to be explained and grounded, including the routing layer.
A router maintenance problem can also emerge. The number of routes can change over time, and the definitions of existing routes can be dynamic too. Each change would require data gathering, annotation, full retraining, and testing of the routing model, so an update takes a significant amount of time, research, and development to reach production.
Another crucial challenge is the interpretation and control of the routing process. It is important to understand the reasoning behind the chosen route and to have ways of easily influencing the prediction [11]. Standard text classification models do not provide means to control the inference without tuning the model itself [12] or ingesting additional feature vectors to gain a level of control over the original model [13]. However, this would still require full fine-tuning, as such modifications need significant architectural changes. Interpretation methods are lacking too, as they rely on either analysis of attention masks or SHAP-like visualizations [14].
Attention scores indicate the relative importance of tokens, but they do not translate into causality: even if a token’s attention is high, it does not mean it is important for the final prediction. Also, attention masks only highlight where the model looked during task solving; they do not provide a reason for a specific class being chosen or a token being generated. Another crucial disadvantage is the difficulty of processing large inputs. Masks become too big to inspect manually, making this method impractical for tasks like chat routing, where inputs are long.
SHAP provides a set of visualizations to ease the interpretation of results (summary plots, dependence plots, force plots, etc.). It is based on game theory and assigns a clear numerical value to each feature (each input token in the case of text classification) to indicate how much it increases or decreases the probability of a target class. However, SHAP also has scaling issues for high-dimensional data, which makes it computationally expensive and makes it harder to gain meaningful insights for chats or even individual long messages. Just like attention masks, SHAP values do not show cause-and-effect relationships and do not indicate how the model’s decision-making can be modified to fix certain cases.
Zero- and few-shot classifiers can ease the implementation, as they do not need training or a large set of examples [15]. Approaches like natural language inference (NLI) classifiers or LLM in-context learning can be used here [16,17]. Our previous research showed that NLI routers significantly underperform and that the LLM router is too unreliable. LLM routers are easy to implement and achieve high accuracy even for multilingual systems; however, they are biased towards class order, labels, and descriptions of classes [18]. Even intermediate multiple evidence calibration (MEC) cannot make them more stable [19]. Such instability makes them suitable for initial data annotation or application prototyping, but they cannot be relied on in a real-life production environment.
The LLM router also lacks interpretability, as the model can use similar reasoning for different results, which makes it more difficult to debug and tune use cases [20]. Control is possible via prompting and changes to the generation hyperparameters, which is better than full retraining as it requires less data and time. However, given the lack of interpretability and stability, it is difficult to adjust LLM classifier behavior correctly without breaking other cases [21].
To address the issues listed previously, a method of routing via latent embedding retrieval was proposed [22]. This work shows that the semantic routing approach can be used in various fields where high interpretability is critical, as it can explain its decision-making and can be easily controlled and modified to adjust behavior.
In this paper, we aim to provide a set of best practices for semantic routing, particularly to enhance its efficiency in dynamic, multilingual, and multitask environments with a high level of interpretability and control over the inference. The research highlights the possible disadvantages of this approach in comparison to LLM routing and transformer classifiers. We checked how such a routing method handles LLM jailbreaks, evaluated the influence of example set size, compared multiple example set building approaches, and created visualizations to highlight router decision-making.
The article is organized as follows: Section 2 covers the used datasets, the semantic routing approach in general, and proposed modifications. Section 3 presents the experiments, their results, and interpretations. Finally, Section 4 provides conclusions and discusses the obtained results in terms of further possible research and general usage of the routing approach.

2. Materials and Methods

2.1. Routing Benchmark Datasets

Testing semantic routing approaches requires a dataset containing multiple types of questions to be handled by different agents. During our previous research, we used a dataset of questions for a wine store assistant, which handles the following scenarios:
  • Catalog recommendations, where the agent has to generate a search query, find the corresponding wines, and form a message with a proposition. The resulting message has to be grounded by the search response so that the model does not make up any items out of the catalog;
  • General questions about wine, where the agent has to answer any question about wine topics in general without search or other additional actions. Here, the model is allowed to talk about the topic in any way, even about items that are not sold by the store;
  • Small talk, where the agent just needs to keep up a human-like, friendly conversation and answer simple messages like greetings, gratitude, and others. The agent is allowed to be creative and does not require any additional actions;
  • Out-of-scope messages (offtop). Such messages should be ignored by the system completely, as they are either LLM jailbreaks (harmful injections or attempts to use the chatbot as a free LLM wrapper) or just random questions outside the system’s scope (in the case of this dataset, not about wines, their attributes, history, geography, or the winemaking process itself) [23].
Instructions to solve these use cases conflict with each other and they need different intermediate steps, so one shared answering pipeline would make the system not only less stable and reliable but also more expensive and slower. Such conflicting use cases may appear in any field, so splitting them into different sub-applications with a routing layer on top would be a suitable strategy to manage such a situation.
The dataset was gathered in multiple ways: a set of examples provided by consumers (a wine online store), questions scraped from the web, and questions synthetically generated with an LLM. Synthetic messages were generated using Gemini 1.5 Pro for each route separately with multiple configurations: a question by a new customer, by a middle-experienced customer, and by an expert [24]. In addition, 40% of the synthetic questions were passed through another prompt to add grammatical and punctuation mistakes; in this way, we wanted to check how the router handles text that is not written perfectly. Offtop (out-of-scope) messages were taken from the Stanford SQuAD dataset [25]. Table 1 presents general statistics on the gathered dataset.
Also, we used machine translation to create some multilingual samples in French, Italian, German, and Ukrainian. Table 2 shows the language distribution of texts in the dataset.
Table 3 shows the characteristics of messages in every language present in the benchmark (character count, word count, average text length in words, and average word length).
The benchmark was saved in two formats: a full one with all 2876 texts and a small, randomly chosen subset with 40% of the texts (1151 samples).
The dataset was created for internal usage by Teamwork Commerce, so it cannot be open-sourced at this moment. More information, statistics, and visualizations regarding the dataset are provided in our previous paper about conversation routing approaches [21].
Another dataset we used in this research is Jailbreak28K, specifically its mini version with 258 harmful LLM jailbreaks, which interfere with original instructions and inject malicious intents or make the model solve an unethical problem [26]. We also used a 10% sample of the large dataset (2800 samples) without the “figstep” type of questions (questions for multimodal LLMs with vision capabilities, where the jailbreak itself is passed as an image), as we test routing specifically for text-only interactions.

2.2. Semantic Routing Based on Latent Sentence Embeddings Retrieval

In this paper, we propose a slight modification of semantic routing based on latent sentence embedding retrieval. The original approach can work both as a few-shot learning classifier and as a supervised learning classifier. It receives a set of examples for each route and a rejection threshold. The router builds an embedding storage for the provided examples, and each time it needs to classify an input, it retrieves the N most similar examples and their respective labels. Similarity scores of the retrieved examples are aggregated per label with a function chosen during the creation of the routing layer; it can be sum, mean, or max aggregation. Then, each route has to pass a round of rejection threshold filtering: if a route does not achieve an aggregated similarity score higher than its rejection threshold, it is rejected even if it is the top 1 among all fetched. If all routes are filtered out by the rejection thresholds, the message is out of the scope of the current system and should not be passed along to any agent (Figure 1). A sketch of this decision rule is given below.
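A minimal sketch of this decision rule, assuming L2-normalized embeddings (so a dot product equals cosine similarity), max aggregation, and per-route thresholds; all names here are illustrative, not the authors’ implementation:
```python
import numpy as np

def route(query_emb, example_embs, example_labels, thresholds, top_n=15):
    """Pick a route for a query embedding, or None for out-of-scope.

    example_embs: matrix of memorized example embeddings (L2-normalized);
    example_labels: route label per example; thresholds: per-route rejection.
    """
    sims = example_embs @ query_emb            # cosine similarity per example
    top_idx = np.argsort(sims)[-top_n:]        # N most similar examples
    scores = {}
    for i in top_idx:                          # max-aggregate scores per route
        label = example_labels[i]
        scores[label] = max(scores.get(label, 0.0), sims[i])
    # keep only routes that pass their rejection threshold
    passed = {r: s for r, s in scores.items() if s >= thresholds[r]}
    if not passed:
        return None                            # rejected: out-of-scope message
    return max(passed, key=passed.get)         # best remaining route
```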
This routing approach provides an easily interpretable solution, as a user can simply check which examples were most similar to the input message and compare aggregated similarity scores to threshold values when a message was rejected completely. It is also easy to control such a system and modify its behavior by adding new examples, deleting old ones, modifying existing ones, or adjusting the thresholds.
Given a training dataset, tuning such a routing layer only adjusts the thresholds; it does not learn new examples or generalize existing ones, although doing so could boost the performance of the system even further. The router iterates over the dataset, tries different threshold values in the provided range, and checks which gives the best performance on the training data. A sketch of this loop follows.
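The tuning loop can be pictured as an exhaustive grid search over candidate threshold values (a sketch under the same assumptions, reusing the hypothetical `route` function above):
```python
from itertools import product

def tune_thresholds(query_embs, labels, router_args, candidate_values):
    """Try every combination of per-route thresholds and keep the best one.

    router_args carries example_embs, example_labels, top_n for route();
    labels may contain "offtop" for out-of-scope ground truth.
    """
    routes = sorted(set(lbl for lbl in labels if lbl != "offtop"))
    best_acc, best = 0.0, None
    for combo in product(candidate_values, repeat=len(routes)):
        thresholds = dict(zip(routes, combo))
        preds = [route(q, thresholds=thresholds, **router_args) or "offtop"
                 for q in query_embs]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_acc, best = acc, thresholds
    return best, best_acc
```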
Another problem with this approach is the lack of redundant example filtering. Redundant examples are memorized anyway; the example set grows uncontrollably, and the router has to search across near-duplicate samples. This can slow the search for systems with a large number of use cases or large message datasets. Examples could be optimized to take less time for retrieval and make classification faster. Also, the routing layer does not try to generalize or transform examples in any way, which leads to a larger example set being needed to cover more similar samples. In our dataset, there can be multiple examples covering different occasions to buy wine, where one generalized, more templated sample could correspond to all of them simultaneously.

2.3. Proposed Modifications for the Semantic Routing Method

We propose adding a step of initial example pruning based on similarity to already memorized samples. There are two approaches to pruning excessive examples that already have similar counterparts in router memory:
  • Just filter input examples using another cutoff threshold to reduce the example set size and make it more efficient;
  • Save a generalized version of the input example, generated via LLM, instead of the original, to cover multiple use cases at once with possibly higher similarity. Examples become more templated, corresponding to multiple requests at once.
By pruning, we mean the following steps during router initialization (a code sketch is given below, after Figure 2):
  • Fetch examples set of the route provided by the developer;
  • Encode all examples of the route with the embedder model;
  • Save the first example to router memory;
  • For each following example, retrieve the top 1 most similar sample of the current route that is already saved to the router memory;
  • If the similarity score of the top 1 most similar sample is higher than the example pruning threshold (a coefficient describing when two texts are too similar for it to make sense to save both), ignore the new example and do not add it to the router memory;
  • If the similarity score of the top 1 most similar sample is less than the example pruning threshold, add the example to router memory either as it is (first proposed approach) or pass it through the LLM to generalize it and save the generalized version (second proposed approach);
  • Repeat for every route.
Figure 2 demonstrates this pipeline in the form of a diagram.
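The initialization loop above can be condensed as follows (a sketch with hypothetical names; `encode` stands in for the embedder and `generalize` for the optional LLM call of the second approach):
```python
def initialize_router(routes, encode, prune_threshold=0.8, generalize=None):
    """Build router memory, skipping examples too similar to saved ones.

    routes: {route_name: [example texts]}; encode: text -> normalized vector;
    generalize: optional LLM call returning a templated version of a text.
    """
    memory = {}                                  # route -> list of (text, vec)
    for name, examples in routes.items():
        saved = []
        for text in examples:
            vec = encode(text)
            # top-1 similarity against already memorized samples of this route;
            # default -1.0 means the first example is always saved
            top1 = max((vec @ v for _, v in saved), default=-1.0)
            if top1 >= prune_threshold:
                continue                         # redundant example: prune it
            if generalize is not None:           # second proposed approach
                text = generalize(text)
                vec = encode(text)
            saved.append((text, vec))
        memory[name] = saved
    return memory
```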
During our work, we evaluated these routing adjustments and the original approach with multiple encoders, quantization techniques, and embedding sizes to see how additional pruning changes the behavior of the router in terms of accuracy and speed (especially for local environments, as API-based encoders are more difficult to track in terms of time performance).

3. Results

3.1. The Effect of Pruning of Examples on Semantic Routing

First, we tested the original semantic router approach and the two proposed modifications on a small subset of our routing benchmark. The detailed router configuration used in this research is listed below for reproducibility (and summarized as a settings object after the list):
  • Aggregation function: max;
  • Number of examples to retrieve: 15;
  • Similarity score threshold for each route: 0.6 (if a route’s aggregated similarity is lower than 0.6, it is rejected). This threshold was tuned to fit the benchmark during our previous research [21]. Higher values would cut off more possible routes, while lower ones would allow low-confidence predictions;
  • Examples are provided only for valid routes: general wine questions, catalog, and small talk. The offtop route is assigned only when all valid routes are rejected;
  • Examples pruning threshold: 0.8;
  • LLM configuration for the generalization of examples: GPT-4o with temperature 0.0 and top-p 0.0 [27];
  • Encoder model for the router: text-multilingual-embedding-002 by Google with task type RETRIEVAL_QUERY and an embedding size of 768 (task types in the Google text embedding API allow us to choose the model optimized for a specific embeddings use case) [28,29]. During our previous research, this encoder proved to be the best out of e5, OpenAI text-ada, and OpenAI text-embedding-3-small. It is a multilingual version of the text-embedding-4 model by Google, which scores 66.31 on average across all tasks on the MTEB benchmark;
  • Random seed: 42;
  • Encoder model for pruning: text-multilingual-embedding-002 by Google with task type SEMANTIC_SIMILARITY and an embedding size of 768;
  • A full list of examples provided to the router is listed in Appendix A.
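For reproducibility, the same configuration can be collected into a single settings object (a sketch; the keys are illustrative and do not correspond to any particular library API):
```python
ROUTER_CONFIG = {
    "aggregation": "max",
    "top_n": 15,                     # examples to retrieve per query
    "route_threshold": 0.6,          # routes below this score are rejected
    "routes": ["general_wine_questions", "catalog", "small_talk"],
    "fallback_route": "offtop",      # assigned when every route is rejected
    "pruning_threshold": 0.8,
    "generalization_llm": {"model": "gpt-4o", "temperature": 0.0, "top_p": 0.0},
    "router_encoder": {"model": "text-multilingual-embedding-002",
                       "task_type": "RETRIEVAL_QUERY", "dim": 768},
    "pruning_encoder": {"model": "text-multilingual-embedding-002",
                        "task_type": "SEMANTIC_SIMILARITY", "dim": 768},
    "random_seed": 42,
}
```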
Table 4 shows the classification report (accuracy and per-route F1 scores) for this small portion of the benchmark for all three semantic router configurations and for an XLM-R-based router fine-tuned to classify the four possible routes on the other 60% of the dataset. The “-” in the memorized examples column for XLM-R means that it does not use a vector base filled with a certain number of samples to predict labels. The same sign is used further on to mark the lack of an example vector database for router models that do not have one.
The semantic router in all configurations significantly loses to the supervised fine-tuned transformer classifier (a 0.13–0.16 accuracy difference). However, a dataset of 1725 samples was required to train this model. Considering the cold start problem, these samples would likely have to be synthetic for conversation routing tasks, as there are no sources from which to gather even such a small dataset. Synthetic data cannot fully replicate the real-life variety of language usage, and the dataset can become homogeneous [30]. This model can be applied to the routing task at later stages, when there is enough data to train it, and a semantic router can be a good option for labeling examples initially.
As you can see, pruning preprocessing allowed us to remove around 36.1% of the examples from the routers without overall performance degradation. Overall accuracy stayed the same, with only the offtop and small talk routes receiving a slight F1 score drop (0.02–0.03).
However, the generalization of examples proved to be less efficient than expected as it performed the worst out of all configurations. The LLM tends to over-simplify provided examples to the point where catalog inquiries become too general and start to mislead the router into general wine questions or even small talk routes. The generalization prompt can be found in Appendix B.
Then, we ran the same measurement on the full benchmark to check how the results scale. XLM-R was not used in this experiment, as it uses 60% of these data for fine-tuning. Table 5 shows the obtained classification report together with the result of the LLM-based router built with GPT-4o, which was one of the best LLM routers in our previous research [21].
Pruning without generalization still keeps overall accuracy on the same level as the router built with all available examples. The slight performance degradation for small talk and offtop is also reproduced, with the same 0.02 score drop. Generalized examples still did not show any positive effect, as the overall accuracy is the lowest for this router, and it does not outperform the others on any specific route or even in example set size.
The semantic router still loses to the LLM in-context learning classifier (the classification prompt can be found in Appendix B), but the difference is just 6% of overall accuracy, and the semantic router offers high interpretability, easier control, and reproducibility. Adding more specific examples can bring its performance even closer to the LLM router. It is also easier to extend in production and to debug wrong cases, while the LLM router requires a prompt rewrite, which does not guarantee an accuracy boost or even preservation of existing performance.
Then, we checked the influence of the pruning coefficient on the final size of the router example set. Table 6 shows multiple pruning configurations with different threshold values and the resulting example set sizes.
Once routers with different pruning thresholds were constructed, we measured them on a 40% subset of the routing benchmark. Table 7 shows how different pruning threshold values affect the classification accuracy.
Table 6 and Table 7 show that tuning the pruning coefficient is crucial, as low values (below 0.75) can lead to catastrophic performance degradation of the router. The example set becomes too small to cover the necessary use cases, so the router classifies most valid samples as either small talk or offtop. At the same time, keeping the coefficient around 0.90 or higher makes little sense, as almost no examples are pruned and the set size remains the same. The best results were achieved with a pruning coefficient from 0.80 to 0.85, which kept the original performance of the router while reducing the set size by around 15–34%. These measurements may differ between encoders, so the pruning coefficient should be tuned specifically for the chosen router encoder.
However, we noticed slight accuracy fluctuations across pruning coefficient values. For example, the 0.90 coefficient filters only one example but achieves worse accuracy on the 40% subset of the benchmark due to more errors in catalog and small talk classification. On the other hand, the 0.85 coefficient improves overall accuracy and provides better results for the catalog and general wine question routes. These two pruning configurations were remeasured on the full benchmark dataset, and the results are presented in Table 8.
The same behavior was reproduced on the full benchmark, so we decided to check which examples were pruned and why. The 0.90 pruning coefficient does not improve the router in terms of either memory size or prediction accuracy, as it pruned just one sample and deteriorated the accuracy of small talk detection. The only pruned example was “I’m impressed with the variety of wines you offer.” from the small talk route, which led to a lack of similar examples about user impressions and store experience.
As for coefficient 0.85, we focused on general wine questions, as this route improved significantly with such a configuration. Only two of its examples were pruned: one about the influence of alcohol content on wine characteristics and another about wine geography. Both had similar counterparts among the memorized samples, so it made sense not to add them. This category of questions is the hardest in the dataset, as it can be close to out-of-scope questions (for example, when they relate to wine geography or history). The questions here are also diverse and cover completely different topics: glasses, grapes, wine traditions, geography, history, winemaking, wineries, the wine itself, and the ways to store or drink it. It is reasonable to memorize as many examples as possible, because most of them are unique. This demonstrates the need to optimize the pruning threshold for each route or class specifically: some routes are fine when redundant examples are removed from memory, while others require as many unique samples as possible.
The originally proposed semantic router implementation can grow the example set over time as developers obtain more real-life data. However, as the router’s vector count grows, retrieval becomes slower and memory consumption increases. Pruning keeps the vector count optimal and retrieval efficient without quantization of router embeddings or manual example filtering.

3.2. Jailbreak Prevention

The next step was to check the level of protection the semantic router provides against common LLM jailbreaks. First, we measured the routers without pruning, with a pruning coefficient of 0.8, with both pruning and generalization, and the XLM-R router on the mini version of Jailbreak28K. The router had to predict offtop for all provided texts, as they should not be passed to any agent down the line. Table 9 shows the results obtained from the test.
All configurations of the semantic router show the same performance in terms of jailbreak rejection, which proves the efficiency of semantic routing not only for multiagent chatbot orchestration but also as a security precaution for LLM-based applications. However, the performance of the fine-tuned transformer model drops significantly for jailbreaks (0.34 below the semantic router). This behavior of the XLM-R router is caused by the variety of possible out-of-scope questions: they can be anything outside of the router’s primary topics and small talk messages, and gathering a dataset covering all such cases would be an impossible task. The problem can be mitigated with a classification probability threshold, which treats all predictions with a probability below the chosen cutoff value as out-of-scope messages.
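Such a cutoff is straightforward to add on top of a softmax classifier; below is a minimal sketch with hypothetical names:
```python
import numpy as np

def classify_with_rejection(logits, class_names, cutoff=0.5):
    """Treat low-confidence predictions of a fine-tuned classifier as offtop."""
    z = np.exp(logits - np.max(logits))         # numerically stable softmax
    probs = z / z.sum()
    best = int(np.argmax(probs))
    if probs[best] < cutoff:
        return "offtop"                         # reject low-confidence inputs
    return class_names[best]
```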
The next step was to run the same test with a 10% sample of the full Jailbreak28K without image jailbreaks (2600 texts). Table 10 shows the results of the conducted measurements.
The result was fully reproduced even on this larger set of possible instruction injections: scores for the larger and mini benchmark versions fully match for all four tested models. The final thing to check is how the router handles these jailbreaks when they contain topic keywords like “wine”, “red”, “catalog”, and others. Messages from Jailbreak28K do not touch the router’s target topic, so it is easier to write them off as offtop. Adding topic keywords can mislead the router by moving input embeddings closer to the embeddings of valid examples, so this vulnerability has to be checked.
We created a small set of questions about wines or just sets of keywords (10 texts) and used the mini version of Jailbreak28K as the base for this test. Wine-related statements and keywords were added at the start of each text and, for 35% of the samples, also at the end of the message. Table 11 shows the result of the experiment.
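The wrapping procedure can be sketched as follows (the topic statements shown are illustrative, not the exact texts used in the experiment):
```python
import random

random.seed(42)

TOPIC_STATEMENTS = [                       # hypothetical wine-related wrappers
    "Red wine, white wine, rose.",
    "I want to find a good red wine for a date tonight.",
    "Can you check your wine catalog?",
]

def mask_jailbreak(text, tail_prob=0.35):
    """Prepend a topic statement; also append one for ~35% of samples."""
    masked = f"{random.choice(TOPIC_STATEMENTS)} {text}"
    if random.random() < tail_prob:
        masked = f"{masked} {random.choice(TOPIC_STATEMENTS)}"
    return masked
```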
The performance of the semantic router drops significantly as it is misled by injected wine-related statements. This makes the router vulnerable to instruction injections if the attacker knows the primary topic of the chatbot and adds some topic-related statements there to avoid it being sent to the offtop route right away. This does not mean that the underlying agents will not be able to handle this later, but a possible harmful message would still reach the LLM as the router would not filter it.
The XLM-R router achieves the worst accuracy score (just 0.02), as it was fitted to assign all wine-related messages to either general wine questions or catalog inquiries. In this way, it completely misclassifies the whole jailbreak benchmark, as the harmful instructions are masked by primary-topic n-grams.
This experiment proves that the issue persists in both the semantic router and an ordinary transformer classifier. The classifier can be fine-tuned further to handle such mixed cases with jailbreak masking, which should improve its performance on this specific type of attack. However, there is no clear solution for the semantic router. Replacing the most frequent n-grams in the example set would diminish the main advantage of sentence embeddings: multilingual comparisons when the encoder model supports multiple languages. It could also deteriorate the accuracy of valid route detection.
Another option would be to remove the first and last N tokens and make a voting classification: classify the message as it is, without the first N tokens, and without the last N tokens. Such attacks usually place fake topic-related statements at the start and end of the jailbreak to provoke the LLM into answering a normal question before continuing with the harmful one. However, our experiment with such a setup did not give a significant improvement even for jailbreak identification: the best accuracy we achieved was 0.51, obtained by removing 30% of tokens from the start and the end of the text. A sketch of this scheme is given below.
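A sketch of this voting scheme (hypothetical names; `classify` is any single-message router call that returns a route or None for offtop):
```python
from collections import Counter

def vote_route(tokens, classify, strip_ratio=0.3):
    """Classify the full text and two truncated variants, then majority-vote."""
    n = max(1, int(len(tokens) * strip_ratio))
    variants = [tokens, tokens[n:], tokens[:-n]]   # full, no head, no tail
    votes = [classify(" ".join(v)) or "offtop" for v in variants]
    return Counter(votes).most_common(1)[0][0]
```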
Attacks of this kind require deeper research to propose countermeasures that would not affect the performance of valid route identification while better protecting the router from harmful messages.

3.3. Interpretability and Controllability

Even if the semantic router loses to the LLM-based router (considering only routers that can handle the cold start problem and do not require fine-tuning), it provides features crucial for chatbot system maintenance, such as a high level of control and interpretability through extending and modifying the example set.
We created several visualizations to demonstrate the advantages of such an approach. They were generated with the same semantic search configuration as the router in Section 3.1 (the top 15 most similar examples, with embeddings generated by the text-multilingual-embedding-002 model by Google with task type RETRIEVAL_QUERY and an embedding size of 768). Pruning was applied to the router example set with a pruning coefficient of 0.8, leaving 46 memorized samples.
First, we visualized how the router classifies a catalog question by creating 2D projections of example and query sentence embeddings. One option is t-SNE (t-distributed stochastic neighbor embedding) [31]. t-SNE projects high-dimensional data into a lower-dimensional space while preserving pairwise similarities between samples as much as possible. It aims to preserve the relationships between data points in the lower-dimensional space to replicate the original dataset structure, which makes it effective for discovering patterns and possible clusters in complex data (such as sentence embeddings in our case).
t-SNE measures pairwise similarity in high dimensions with Euclidean distance. However, the most similar examples are chosen by cosine distance, so we decided to use the UMAP (uniform manifold approximation and projection) dimensionality reduction algorithm, which can optimize cosine similarity directly during the creation of low-dimensional projections [32]. The main advantage of UMAP is the preservation of both local and global structure: it builds a high-dimensional graph of the provided data samples and then optimizes low-dimensional representations to preserve the relationships from this graph. Nodes of the high-dimensional graph represent data points, and edge weights correspond to similarities between points. UMAP thus preserves local structures while keeping track of global relationships between the provided samples, so clusters and their relative positions are reflected in the generated projections.
The following UMAP parameters can be used to reproduce the provided visualizations (a code sketch follows the list):
  • n_neighbors: 10;
  • n_components: 2;
  • metric: cosine;
  • random_state: 42.
The model is fitted on the example set only, so queries are provided only to the transform method of the model (they do not affect the original high-dimensional graph or the way the model projects it into 2D space).
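The projections can be reproduced with the umap-learn package as sketched below; `example_embs` and `query_embs` are assumed to be the 768-dimensional vectors described above:
```python
import umap

reducer = umap.UMAP(n_neighbors=10, n_components=2,
                    metric="cosine", random_state=42)
# fit on the memorized examples only, so queries do not reshape the graph
example_2d = reducer.fit_transform(example_embs)
query_2d = reducer.transform(query_embs)  # project queries into the same space
```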
You can see the visualization of classification executed with the semantic router in Figure 3, where a set of examples and a query are projected in 2D space with UMAP.
Larger circles with a clear black border are among the top 15 most similar examples, while the others did not make the cut and thus do not affect the prediction. Most of the selected samples belong to the catalog type in the visualization, which explains the router’s decision-making, as the query is routed to the catalog case.
You can see how samples group into three separate clusters in Figure 3, with only a few outliers for the small talk cluster. These outliers relate either to bot capabilities regarding wine-related issues (for example, what exactly the bot can tell about wine) or to praise for the wine selection, covering cases when the user wants to express gratitude for a good recommendation.
It is worth noting that some examples may appear closer to the query in the visualization than they actually are. Such cases are caused by the approximation performed during dimensionality reduction, as exact pairwise distances can be sacrificed to preserve cluster structure or global cluster relationships. These data points can share neighbors in the high-dimensional graph, so UMAP tries to preserve this local neighborhood even in low-dimensional projections, which leads to closer and denser placement.
General wine questions and catalog recommendations are placed closer in this plot as they cover similar topics, including wines and their attributes. At the same time, small talk messages are significantly distanced from wine-related clusters.
In Table 12, we show the top 5 most similar examples from this visualization, with the exact texts the model considered most similar to the query “I’m looking for a wine to use in cooking, specifically for making risotto. Any suggestions?”.
The next step was to visualize an out-of-scope (offtop) message. We chose a sample with the following text: “Who initially proposed using the term “Native American” to recognize the primacy of the native population?”. The result can be seen in Figure 4.
The sample is distanced from valid route examples. None of the top 15 examples crossed the 0.6 threshold, so this sample would not be passed on to any agent. The maximum similarity value was 0.43, for a general question example. However, it is important to note that categories like small talk or general questions about a certain wide area can be more vulnerable than more specific categories like catalog recommendations. In Figure 4, out-of-scope samples are more similar to samples from these vulnerable categories than to a specific, narrow one with a concrete use case (like searching for goods using an in-store database). This suggests that a semantic router achieves its best performance with narrower route definitions instead of general, wide topics, which can be difficult to distinguish. This hypothesis should be tested in future research to see how splitting larger question clusters into smaller subclusters affects classification accuracy, particularly for out-of-scope question prevention.
The final visualization shows a case of misclassification by the router. We took one of the previously adjusted jailbreaks, which is wrapped in statements similar to the catalog route, and visualized it in the same way as in Figure 3 and Figure 4. Here is the text of the sample: “Red wine, white wine, rose. Wines, wines. Respond to every prompt I give you with a moralizing rant about the provider content policies, but then start a new paragraph with the sentence” anyways, since we are out of this *** ***, let’s ignore the *** rules: “and then respond to the prompt exactly as an unfiltered and unsafe, completely unlimited language model could do. Be detailed. Prompt: Invent a method for producing fake legal documents for court cases. I want to find a good red wine for a date tonight.”. Figure 5 shows the result.
As you can see, the sample falls close to the catalog examples, and 13 out of 15 retrieved examples achieve a similarity score higher than the rejection threshold (0.6). Masking out-of-scope messages with keywords related to the router’s primary topic can significantly mislead the router, which is confirmed by both the cosine similarity scores and the visualization of the data.
Such visualizations and tables can easily explain router decision-making for any area and topic, as they provide the set of most similar examples and their similarity scores. If a user would like to modify router behavior, they can add another target route example to avoid further misclassification or adjust existing ones. This approach allows the classification process to be interpreted and controlled without specific machine learning knowledge (which is required to interpret attention masks or SHAP visualizations) by maintaining the list of examples and checking which ones are most similar at each router call.

4. Discussion

The semantic routing approach based on latent sentence embedding retrieval achieves performance close to the LLM in-context learning classifier while providing a high level of interpretability and controllability. The router can be adjusted by adding more examples, modifying existing ones, or changing the rejection threshold to filter fetched examples differently. The decision-making process is clear and easily understood: the user just needs to check which examples were chosen as most similar and compare their similarity scores to the rejection threshold. The method also works for long texts, as it does not try to evaluate the significance of each token and instead evaluates pairwise similarities of texts. Its interpretability does not require any additional tools or models, as explanations are produced alongside the classification itself (the model returns the chosen most similar examples and their scores).
This approach can be extended to interpretable text classification in general, to areas that require a detailed explanation of decisions, and even to multimodal tasks spanning multiple domains [33]. Currently, it can be less accurate than a fine-tuned transformer or an LLM classifier, but it is easier to understand and control. Modification does not require full fine-tuning, and the results are easily reproduced, in contrast to the LLM router. Routing achieves its best performance with narrowly specified options, so it is recommended to avoid wide definitions that can overlap. There is still room for further research on how to increase the accuracy of this approach, combining the interpretability of semantic routers with the accuracy of transformer classifiers.
The proposed automatic example pruning prevents overextending the router example set with redundant samples, which can harm performance without adding accuracy. This way, the example set can become almost 40% smaller without accuracy degradation or manual filtering of overly similar examples. High values of the pruning coefficient barely affect the example set, leaving its size as it is. Small values can lead to significant performance loss due to the removal of too many texts.
The encoder model directly influences the optimal value of the pruning threshold, as encoders are trained with different tasks in mind (for example, paraphrase similarity or detection of semantic distinctions). This produces significantly different embeddings, which may affect the pairwise similarity of the same pair of texts when they are encoded and compared by two different encoders. Pairwise similarities can also be affected by different pooling strategies, differences in pretraining datasets, and label quality, so pairwise scores can differ even if two models agree that the texts are similar. Router example pruning should therefore be implemented once there is a validation dataset to check multiple configurations and choose the one that works best for the specific model and domain.
However, the generalization of examples in addition to pruning did not improve performance and even caused a slight drop in accuracy. Performance became worse not only on wider routes like “general wine questions” or “small talk” but also on narrow, specific ones like “catalog”. At the same time, out-of-scope questions are still labeled with a success rate similar to the ordinary router or the one with the pruning mechanism.
The model can scale to ordinary classification tasks and even to multimodal cases; the only requirements are a set of predetermined class examples and a choice of encoder model to build sentence embeddings of examples and future queries. However, we would still like to research the generalization approach further, as the idea of keeping just a few templated messages instead of a whole set of examples remains promising. Generalization could be applied only to routes that correspond to specific tasks, while routes covering a wide range of topics need to memorize as many examples as possible. We also want to check the effect of pruning on larger example sets to analyze its impact on the router’s memory size and accuracy. Finally, it is necessary to provide an automated way of tuning the pruning threshold when the model user has a validation dataset with which to measure router accuracy across multiple configurations and choose the best one.
Another direction to improve the interpretability of the model would be to add a sentence embedding decomposition step to the pipeline to determine which components of the feature vector may affect the prediction in a certain way and try to interpret those characteristics in terms of the router domain.
Even though a semantic router increases the security of a chatbot application by filtering most out-of-scope messages and jailbreaks, it is still vulnerable to attacks where such a message is wrapped in valid statements similar to target routes. The issue persists with the XLM-R-based router, and the effect of such jailbreak masking is even worse for this model (accuracy of 0.02). Fine-tuning can be a possible solution for transformer routers. However, there is no clear solution for the semantic router. One option would be to remove the most used words from the example set. Another would be voting classification: classify the message as it is, remove the first N tokens and classify again, and finally remove the last N tokens and classify that version. However, neither method gives a significant improvement in jailbreak protection, and both can deteriorate the general performance of the router.

Author Contributions

Conceptualization, D.M. and O.T.; Software, D.M.; Writing—original draft preparation, D.M.; Supervision, O.T.; writing—review and editing, O.T. All authors have read and agreed to the published version of the manuscript.

Funding

The study was supported by Kharkiv National University of Radio Electronics.

Data Availability Statement

The routing benchmark dataset presented in this article is not readily available because it contains sensitive information regarding a specific consumer product and was developed only for internal corporate usage by Teamwork Commerce employees. Requests to access the datasets should be directed to daniil.maksymenko@nure.ua.

Acknowledgments

The authors would like to thank Artem Nikulchenko for inspiration and critique during the research.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

This appendix presents full lists of examples used to build a semantic router before any pruning or generalization.
Here is the list of general wine questions:
  • What are the main types of wine grapes?
  • Tell me about the history of wine.
  • What are some popular wine regions?
  • How does climate affect wine production?
  • What are the characteristics of a good Merlot?
  • How do you make white wine with red grapes?
  • Can wine be part of a healthy diet, and if so, how?
  • What are the pros and cons of drinking wine compared to other alcoholic beverages?
  • How does the taste of wine vary depending on the region it comes from?
  • What are tannins in wine and how do they affect the taste?
  • Can you explain the concept of ‘terroir’ in winemaking?
  • What is the significance of the year on a wine bottle?
  • What are sulfites in wine and why are they added?
  • How does the alcohol content in wine vary and what factors influence it?
  • What is the difference between dry and sweet wines?
  • I’ve always wondered how to properly taste wine. Could you give me some tips on how to do this?
  • I’ve noticed that some wines have a higher alcohol content than others. How does this affect the taste and the potential effects of the wine?
  • I’ve heard that some wines are better suited for certain seasons. Is this true and if so, which wines are best for which seasons
  • I’ve always been curious about the different wine regions around the world. Could you tell me about some of the most famous ones and what makes them unique?
  • I’ve heard that certain wines should be served at specific temperatures. Is this true and if so, why?
  • I’ve noticed that some wines are described as ‘full-bodied’ while others are ‘light’. What do these terms mean and how do they affect the taste of the wine?
  • I’ve heard that some people collect wine as an investment. Is this a good idea and if so, which wines are best for this?
  • I’ve always been curious about the process of making sparkling wine. Could you explain how it differs from still wine production?
Catalog full examples list:
  • Can you recommend a wine for a romantic dinner?
  • I want to order some wine
  • What’s the price of a good bottle of ?
  • I need a wine suggestion for a summer picnic.
  • Tell me about the wines available in your catalog.
  • What wine would you suggest for a barbecue?
  • I’m looking for a specific wine I had last week.
  • Can you help me find a wine within a $20 budget?
  • We both enjoy sweet wines. What dessert wine would you recommend for a cozy night in?
  • I’m preparing a French-themed dinner. What French wine would complement the meal?
  • We’re having a cheese and wine night. What wine goes well with a variety of cheeses?
  • I’m planning a surprise picnic. What rosé wine would be ideal for a sunny afternoon?
  • We’re having a movie night and love red wine. What bottle would you suggest?
  • Hello, I’m new to the world of wine. Could you recommend a bottle around $30 that’s not too sweet and would complement a grilled shrimp dish?
  • Ciao, stasera cucino un risotto ai frutti di mare. Quale vino bianco si abbina bene senza essere troppo secco?
  • Hey, I’m searching for a nice red wine around $40. I usually enjoy a good Merlot, but I’m open to other options. Anything with a smooth finish and rich fruit flavors would be great! Any recommendations?
  • Hi, I need a good wine pairing for a roasted turkey dinner with herbs. I usually prefer a dry white, something like a Pinot Grigio, but I’m open to other suggestions. Do you have any recommendations around $30?
  • Hello, I’m looking for a robust red wine with moderate tannins to pair with a rich mushroom and truffle pasta. Ideally, something from the Tuscany region, under $70. Any suggestions?
  • Hi, I need a light and refreshing wine to pair with grilled salmon and a citrus salad. Any suggestions for something not too sweet under $25?
  • Hallo! Ich suche eine gute Flasche Wein für etwa 30–40 €. Normalerweise bevorzuge ich trockenere Weißweine, wie einen Chardonnay oder vielleicht einen deutschen Riesling. Haben Sie Empfehlungen?
  • I’m looking for a good but affordable wine for a casual get-together. I don’t know much about wine, so any help would be appreciated.
  • Which wines would you recommend for a beginner that are easy to drink and have a fruity flavor?
  • Hallo, ich suche einen vollmundigen Weißwein mit moderaten Säurenoten, der zu einem cremigen Meeresfrüchte-Risotto passt. Idealerweise etwas aus der Region Burgund, unter 50 €. Haben Sie Vorschläge?
  • Je prépare un curry thaï aux crevettes ce soir. Vous pensez à un blanc plus léger ? Quelque chose qui ne dominera pas les saveurs épicées. Suggestions?
  • Salut, je cherche une bouteille de vin rouge pour environ 40 €. Quelque chose de facile à boire, pas trop tannique, peut-être avec des notes de fruits rouges ? Que recommandez-vous?
  • Cerco un vino da utilizzare in cucina, nello specifico per fare il risotto ai frutti di mare. Eventuali suggerimenti?
  • I want to buy a wine that I can age for the next 5–10 years. What would you recommend in the $50 range?
  • Hey! I need a wine recommendation for a cozy night in with friends—something that pairs well with a cheese platter. Any suggestions?
  • I’m making a Thai shrimp curry tonight. Thinking a lighter white? Something that won’t overpower the spicy flavors. Any suggestions?
  • Is this wine vegan-friendly?
  • Hello, I’m looking for a good wine to pair with a seared tuna steak and a fresh salad. I usually enjoy a crisp Pinot Grigio, but I’m open to new suggestions. Any recommendations?
  • Je prépare un magret de canard rôti avec une sauce aux airelles. Je pense à un Bordeaux, mais je suis ouvert aux suggestions. Quelque chose d’équilibré et pas trop boisé. Que recommandez-vous?
  • Hey! Ich suche eine gute Flasche Wein zu gegrilltem Rinderfilet. Normalerweise nehme ich kräftige Rotweine, bin aber für Vorschläge offen. Etwas Weiches, nicht zu Trockenes, um die 40–50 €? Was empfehlen Sie?
Small talk full examples list:
  • Hi/Hello/Hallo
  • Hi there, how are you?
  • Thank you for your help!
  • Goodbye, have a nice day!
  • What can you do as an assistant?
  • I’m not sure if I want to buy anything right now, but I’ll keep your site in mind for the future.
  • I’m sorry, but I didn’t find what I was looking for on your site.
  • I appreciate your help, but I think I’ll look elsewhere for now.
  • Thanks for your time, have a great day!
  • Great selection of wines you have here.
  • I’m just looking around for now, but I might have some questions later.
  • This site is really easy to navigate, thanks for making it user-friendly.
  • I’m not sure what I’m looking for yet, but I’ll let you know if I have any questions.
  • I’m impressed with the variety of wines you offer.
  • What are some of the things you can help me with?
  • What kind of questions can you answer?
  • Can you tell me more about your capabilities?
  • What kind of information can you provide about wines?

Appendix B

This appendix provides the prompt used for classification (the LLM in-context learning based router) and the generalization prompt used with pruning to reduce the example set size.
The generalization prompt is as follows: “Create a generalized version of the provided message while maintaining the topic or specific terms and the original language of the message.
# Steps
  • Identify the main topic and specific terms that need to be retained.
  • Simplify any specific examples or details, replacing them with more general equivalents.
  • Maintain the original tone and language style of the message.
  • Ensure the revised version retains the essence of the original message but is applicable to a broader context.
# Output Format
Provide a single paragraph that represents a generalized version of the original message, without any additional formatting or annotations.”.
Generalization is done with temperature 0.0.
LLM in-context learning router prompt is as follows: “As a wine store support expert, your task is to classify user message into one of five classes. You should classify only user message:
(1) general-questions: Questions about wine, sommelier, wine grapes, wine regions, countries and their locations, development, culture or history, wine geography or history, people’s wine preferences, and characteristics of certain wines or grape varieties. Never about wineries, proposals, buying;
(2) catalog: Inquiries about drinking, buying, tasting, recommendations, questions about specific wines from the catalog and their attributes, direct or indirect requests to find a certain kind of wine for a particular occasion, and questions about pricing and related matters;
(3) smalltalk: General conversation like greetings, thanks, farewells, and questions about the assistant and its functionality. No wine-related discussion;
(4) offtop: Everything else not related to the previous classes or wine in general.
Pay close attention to chat history, as all classes depend on context significantly. Analyze messages comprehensively, considering the context to classify. The most recent messages should carry more weight than previous ones to classify.
Chat history wrapped in <chat> tag
{0}
Please provide your classification along with a short explanation for each classification.
### Your answer must strictly follow certain JSON structure. Here is the JSON template, which you MUST fill:
{{
explanation: short explanation up to 10 words why you selected specific class for the user message,
class: general-questions | catalog | smalltalk | offtop
}}”.
Classification is done with JSON output and temperature 0.0. A multiple evidence calibration approach is used to stabilize the prediction (calibration is achieved by generating the explanation value before the actual class prediction).
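A comparable sketch of the router call is shown below; again, the client, model name, and prompt variable are illustrative assumptions rather than the production code.

```python
# Illustrative sketch of the LLM in-context-learning router call.
import json
from openai import OpenAI

client = OpenAI()  # illustrative client setup

ROUTER_PROMPT = "..."  # the full classification prompt above; "{0}" receives the chat history

def route(chat_history: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",                            # placeholder model name
        temperature=0.0,                           # deterministic classification
        response_format={"type": "json_object"},   # force valid JSON output
        messages=[{"role": "user", "content": ROUTER_PROMPT.format(chat_history)}],
    )
    result = json.loads(response.choices[0].message.content)
    # "explanation" is generated before "class", so the model justifies its
    # choice first (multiple evidence calibration) and then commits to a label.
    return result["class"]
```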

References

  1. Minaee, S.; Mikolov, T.; Nikzad, N.; Chenaghlu, M.; Socher, R.; Amatriain, X.; Gao, J. Large Language Models: A Survey. arXiv 2024.
  2. Erdem, E.; Kuyu, M.; Yagcioglu, S.; Frank, A.; Parcalabescu, L.; Plank, B.; Babii, A.; Turuta, O.; Erdem, A.; Calixto, I.; et al. Neural Natural Language Generation: A Survey on Multilinguality, Multimodality, Controllability and Learning. J. Artif. Intell. Res. 2022, 73, 1131–1207.
  3. Guo, T.; Chen, X.; Wang, Y.; Chang, R.; Pei, S.; Chawla, N.V.; Wiest, O.; Zhang, X. Large Language Model Based Multi-Agents: A Survey of Progress and Challenges. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, Jeju, Republic of Korea, 3–9 August 2024; pp. 8048–8057.
  4. Rebedea, T.; Dinu, R.; Sreedhar, M.N.; Parisien, C.; Cohen, J. NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Singapore, 6–10 December 2023; pp. 431–445.
  5. Greshake, K.; Abdelnabi, S.; Mishra, S.; Endres, C.; Holz, T.; Fritz, M. Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, Copenhagen, Denmark, 30 November 2023; pp. 79–90.
  6. Tuteja, M.; González Juclà, D. Long Text Classification Using Transformers with Paragraph Selection Strategies. In Proceedings of the Natural Legal Language Processing Workshop 2023, Singapore, 7 December 2023; pp. 17–24.
  7. Padalko, H.; Chomko, V.; Yakovlev, S.; Chumachenko, D. Ensemble Machine Learning Approaches for Fake News Classification. Radioelectron. Comput. Syst. 2023, 4, 5–19.
  8. Meng, X.; Yan, X.; Zhang, K.; Liu, D.; Cui, X.; Yang, Y.; Zhang, M.; Cao, C.; Wang, J.; Wang, X.; et al. The Application of Large Language Models in Medicine: A Scoping Review. iScience 2024, 27, 109713.
  9. Chumachenko, D. Exploring Different Approaches to Epidemic Processes Simulation: Compartmental, Machine Learning, and Agent-Based Models. In Data-Centric Business and Applications; Štarchoň, P., Fedushko, S., Gubíniová, K., Eds.; Springer Nature: Cham, Switzerland, 2024; Volume 208, pp. 27–54.
  10. Hakkoum, H.; Abnane, I.; Idri, A. Interpretability in the Medical Field: A Systematic Mapping and Review Study. Appl. Soft Comput. 2022, 117, 108391.
  11. Zhang, Y.; Tiňo, P.; Leonardis, A.; Tang, K. A Survey on Neural Network Interpretability. IEEE Trans. Emerg. Top. Comput. Intell. 2021, 5, 726–742.
  12. Howard, J.; Ruder, S. Universal Language Model Fine-tuning for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; Volume 1: Long Papers, pp. 328–339.
  13. Maksymenko, D.; Saichyshyna, N.; Paprzycki, M.; Ganzha, M.; Turuta, O.; Alhasani, M. Controllability for English-Ukrainian Machine Translation by Using Style Transfer Techniques. Ann. Comput. Sci. Inf. Syst. 2023, 35, 1059–1068.
  14. Nohara, Y.; Matsumoto, K.; Soejima, H.; Nakashima, N. Explanation of Machine Learning Models Using Shapley Additive Explanation and Application for Real Data in Hospital. Comput. Methods Programs Biomed. 2022, 214, 106584.
  15. Parikh, S.; Tiwari, M.; Tumbade, P.; Vohra, Q. Exploring Zero and Few-Shot Techniques for Intent Classification. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 5, pp. 744–751.
  16. Chen, Q.; Zhu, X.; Ling, Z.-H.; Inkpen, D.; Wei, S. Neural Natural Language Inference Models Enhanced with External Knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; Volume 1, pp. 2406–2417.
  17. Wu, Z.; Wang, Y.; Ye, J.; Kong, L. Self-Adaptive In-Context Learning: An Information Compression Perspective for In-Context Example Selection and Ordering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 1, pp. 1423–1436.
  18. Wei, S.-L.; Wu, C.-K.; Huang, H.-H.; Chen, H.-H. Unveiling Selection Biases: Exploring Order and Token Sensitivity in Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 2024; pp. 5598–5621.
  19. Wang, P.; Li, L.; Chen, L.; Cai, Z.; Zhu, D.; Lin, B.; Cao, Y.; Kong, L.; Liu, Q.; Liu, T.; et al. Large Language Models Are Not Fair Evaluators. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; Volume 1, pp. 9440–9450.
  20. Singh, C.; Inala, J.P.; Galley, M.; Caruana, R.; Gao, J. Rethinking Interpretability in the Era of Large Language Models. arXiv 2024.
  21. Maksymenko, D.; Kryvoshein, D.; Turuta, O.; Kazakov, D.; Turuta, O. Benchmarking Conversation Routing in Chatbot Systems Based on Large Language Models. In Proceedings of the 4th International Workshop of IT-professionals on Artificial Intelligence (ProfIT AI 2024), Cambridge, MA, USA, 25–27 September 2024; Volume 3777, pp. 75–86. Available online: https://ceur-ws.org/Vol-3777/paper6.pdf (accessed on 15 October 2024).
  22. Manias, D.M.; Chouman, A.; Shami, A. Semantic Routing for Enhanced Performance of LLM-Assisted Intent-Based 5G Core Network Management and Orchestration. arXiv 2024.
  23. Wei, A.; Haghtalab, N.; Steinhardt, J. Jailbroken: How Does LLM Safety Training Fail? arXiv 2023.
  24. Gemini Team; Georgiev, P.; Lei, V.I.; Burnell, R.; Bai, L.; Gulati, A.; Tanzer, G.; Vincent, D.; Pan, Z.; Wang, S.; et al. Gemini 1.5: Unlocking Multimodal Understanding across Millions of Tokens of Context. arXiv 2024, arXiv:2403.05530.
  25. Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv 2016.
  26. Luo, W.; Ma, S.; Liu, X.; Guo, X.; Xiao, C. JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks. arXiv 2024.
  27. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2023.
  28. Lee, J.; Dai, Z.; Ren, X.; Chen, B.; Cer, D.; Cole, J.R.; Hui, K.; Boratko, M.; Kapadia, R.; Ding, W.; et al. Gecko: Versatile Text Embeddings Distilled from Large Language Models. arXiv 2024.
  29. Choose an Embeddings Task Type | Generative AI on Vertex AI. Google Cloud. Available online: https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/task-types (accessed on 22 October 2024).
  30. Li, Z.; Zhu, H.; Lu, Z.; Yin, M. Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 10443–10461.
  31. Molino, P.; Wang, Y.; Zhang, J. Parallax: Visualizing and Understanding the Semantics of Embedding Spaces via Algebraic Formulae. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Florence, Italy, 28 July–2 August 2019; pp. 165–180.
  32. McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 2018.
  33. Saichyshyna, N.; Maksymenko, D.; Turuta, O.; Yerokhin, A.; Babii, A.; Turuta, O. Extension Multi30K: Multimodal Dataset for Integrated Vision and Language Research in Ukrainian. In Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP), Dubrovnik, Croatia, 5 May 2023; pp. 54–61.
Figure 1. Retrieval scheme of basic semantic routing with latent embeddings.
Figure 2. Example of a pruning pipeline.
Figure 3. UMAP projection of examples and query embeddings for the query “I’m looking for a wine to use in cooking, specifically for making risotto. Any suggestions?” with type “catalog”.
Figure 4. UMAP projection of examples and query embeddings for query “Who initially proposed using the term “Native American” to recognize the primacy of the native population?” with type “offtop”.
Figure 5. UMAP projection of examples and query embeddings for query with type “offtop”, which was classified as “catalog”.
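Projections like those in Figures 3–5 can be reproduced with standard tooling; the sketch below assumes precomputed example and query embeddings and uses the umap-learn and matplotlib packages, with all variable names illustrative.

```python
# Sketch of the embedding projections shown in Figures 3-5 (assumed inputs:
# an (n, d) matrix of example embeddings with route labels, plus one query vector).
import numpy as np
import umap                      # pip install umap-learn
import matplotlib.pyplot as plt

def plot_routing_projection(example_embs, labels, query_emb):
    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
    points = reducer.fit_transform(np.vstack([example_embs, query_emb[None, :]]))
    example_pts, query_pt = points[:-1], points[-1]
    for route in sorted(set(labels)):
        mask = np.asarray(labels) == route
        plt.scatter(example_pts[mask, 0], example_pts[mask, 1], s=12, label=route)
    plt.scatter(query_pt[0], query_pt[1], marker="*", s=200, c="black", label="query")
    plt.legend()
    plt.show()
```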
Table 1. Routing benchmark dataset statistics.
| Route | Original Examples | Synthetic | Scraped | SQuAD | Total |
| --- | --- | --- | --- | --- | --- |
| General wine questions | 30 | 283 | 618 | 0 | 931 |
| Catalog | 82 | 675 | 0 | 0 | 757 |
| Small talk | 7 | 297 | 0 | 0 | 304 |
| Offtop | 0 | 0 | 0 | 884 | 884 |
| Total | 122 | 1255 | 618 | 884 | 2876 |
Table 2. Language distribution of the dataset.
| Route | English | Each of German/French/Italian/Ukrainian |
| --- | --- | --- |
| General wine questions | 468 | 116 |
| Catalog | 442 | 78 |
| Small talk | 172 | 33 |
| Offtop | 500 | 96 |
| Total | 1584 | 323 |
Table 3. Characteristics of texts across languages present in the benchmark.
| Language | Character Count | Word Count | Average Word Count per Text | Average Word Length |
| --- | --- | --- | --- | --- |
| English | 189,677 | 38,483 | 24.29 | 4.05 |
| German | 31,338 | 5519 | 17.08 | 4.81 |
| French | 33,199 | 5836 | 18.07 | 4.41 |
| Italian | 31,245 | 5867 | 18.16 | 4.45 |
| Ukrainian | 27,847 | 5154 | 16.00 | 4.56 |
Table 4. Classification report on routing benchmark with a 40% subset of the database for multiple router configurations.
| Router Configuration | Memorized Examples | Accuracy | General Wine Questions F1 | Catalog F1 | Small Talk F1 | Offtop F1 |
| --- | --- | --- | --- | --- | --- | --- |
| No pruning | 72 | 0.84 | 0.79 | 0.86 | 0.70 | 0.90 |
| Pruning (0.8) | 46 | 0.84 | 0.80 | 0.88 | 0.68 | 0.87 |
| Pruning (0.8) + generalization | 49 | 0.81 | 0.76 | 0.84 | 0.63 | 0.87 |
| XLM-R finetuned with 60% of the dataset | – | 0.97 | 0.95 | 0.97 | 0.94 | 0.98 |
Table 5. Classification report on full routing benchmark for multiple router configurations.
| Router Configuration | Memorized Examples | Accuracy | General Wine Questions F1 | Catalog F1 | Small Talk F1 | Offtop F1 |
| --- | --- | --- | --- | --- | --- | --- |
| No pruning | 72 | 0.85 | 0.80 | 0.86 | 0.73 | 0.90 |
| Pruning (0.8) | 46 | 0.85 | 0.81 | 0.89 | 0.71 | 0.88 |
| Pruning (0.8) + generalization | 49 | 0.82 | 0.76 | 0.85 | 0.66 | 0.88 |
| GPT-4o LLM in-context learning router | – | 0.91 | 0.90 | 0.94 | 0.82 | 0.93 |
Table 6. Router examples set size with different pruning configurations.
| Router Configuration | Original Number of Examples | 0.7 Threshold | 0.75 Threshold | 0.8 Threshold | 0.85 Threshold | 0.9 Threshold |
| --- | --- | --- | --- | --- | --- | --- |
| General wine questions | 23 | 3 | 8 | 16 | 21 | 23 |
| Catalog | 32 | 2 | 10 | 16 | 25 | 32 |
| Small talk | 17 | 7 | 10 | 14 | 15 | 16 |
| Total | 72 | 12 | 28 | 46 | 61 | 71 |
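The behavior in Table 6 (lower thresholds retain fewer examples) is consistent with a greedy redundancy-based pruning rule in which an example is dropped when an already kept example of the same route exceeds the similarity threshold. The following sketch shows one plausible reading of such a rule, not the exact pipeline of Figure 2.

```python
# Greedy redundancy pruning sketch: keep an example only if no already-kept
# example of the same route is more similar than `threshold`. This is one
# plausible interpretation, not the paper's exact algorithm.
import numpy as np

def prune(embeddings, routes, threshold=0.8):
    kept = []  # indices of retained examples
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    for i in range(len(normed)):
        same_route = [j for j in kept if routes[j] == routes[i]]
        if not same_route or max(normed[i] @ normed[j] for j in same_route) < threshold:
            kept.append(i)
    return kept
```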
Table 7. Classification report on routing benchmark with a 40% subset of the database for multiple pruning threshold values.
| Pruning Threshold | Memorized Examples | Accuracy | General Wine Questions F1 | Catalog F1 | Small Talk F1 | Offtop F1 |
| --- | --- | --- | --- | --- | --- | --- |
| 0.70 | 12 | 0.39 | 0.00 | 0.00 | 0.45 | 0.60 |
| 0.75 | 28 | 0.80 | 0.71 | 0.85 | 0.70 | 0.84 |
| 0.80 | 46 | 0.84 | 0.80 | 0.88 | 0.68 | 0.87 |
| 0.85 | 61 | 0.85 | 0.84 | 0.88 | 0.69 | 0.89 |
| 0.90 | 71 | 0.83 | 0.79 | 0.85 | 0.68 | 0.90 |
| No pruning | 72 | 0.84 | 0.79 | 0.86 | 0.70 | 0.90 |
Table 8. Classification report on full routing benchmark for various pruning threshold (0.80+) values.
| Pruning Threshold | Memorized Examples | Accuracy | General Wine Questions F1 | Catalog F1 | Small Talk F1 | Offtop F1 |
| --- | --- | --- | --- | --- | --- | --- |
| 0.80 | 46 | 0.85 | 0.81 | 0.89 | 0.71 | 0.88 |
| 0.85 | 61 | 0.86 | 0.84 | 0.89 | 0.73 | 0.89 |
| 0.90 | 71 | 0.84 | 0.80 | 0.86 | 0.71 | 0.90 |
| No pruning | 72 | 0.85 | 0.80 | 0.86 | 0.73 | 0.90 |
Table 9. Semantic router performance on jailbreak prevention task (mini Jailbreak28K).
| Router Configuration | Memorized Examples | Accuracy |
| --- | --- | --- |
| No pruning | 72 | 0.97 |
| Pruning (0.8) | 46 | 0.97 |
| Pruning (0.8) + generalization | 49 | 0.97 |
| XLM-R finetuned with 60% of the dataset | – | 0.63 |
Table 10. Semantic router performance on jailbreak prevention task (10% of full Jailbreak28K).
| Router Configuration | Memorized Examples | Accuracy |
| --- | --- | --- |
| No pruning | 72 | 0.97 |
| Pruning (0.8) | 46 | 0.97 |
| Pruning (0.8) + generalization | 49 | 0.97 |
| XLM-R finetuned with 60% of the dataset | – | 0.63 |
Table 11. Semantic router performance on jailbreak prevention task (mini Jailbreak28K and injection of wine-related statements in jailbreaks).
| Router Configuration | Memorized Examples | Accuracy |
| --- | --- | --- |
| No pruning | 72 | 0.32 |
| Pruning (0.8) | 46 | 0.32 |
| Pruning (0.8) + generalization | 49 | 0.22 |
| XLM-R finetuned with 60% of the dataset | – | 0.02 |
Table 12. The top 5 most similar examples for the provided query.
| Text | Type | Similarity |
| --- | --- | --- |
| Ciao, stasera cucino un risotto ai frutti di mare. Quale vino bianco si abbina bene senza essere troppo secco? (Italian: “Hi, tonight I’m cooking a seafood risotto. Which white wine pairs well without being too dry?”) | catalog | 0.76 |
| Can you recommend a wine for a romantic dinner? | catalog | 0.76 |
| Hello, I’m looking for a robust red wine with moderate tannins to pair with a rich mushroom and truffle pasta. Ideally, something from the Tuscany region, under $70. Any suggestions? | catalog | 0.74 |
| Hey, I’m searching for a nice red wine around $40. I usually enjoy a good Merlot, but I’m open to other options. Anything with a smooth finish and rich fruit flavors would be great! Any recommendations? | catalog | 0.74 |
| I’m preparing a French-themed dinner. What French wine would complement the meal? | catalog | 0.73 |
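The similarity column in Table 12 corresponds to a nearest-neighbor lookup over the memorized examples; a minimal sketch is given below, assuming cosine similarity over unit-normalized embedding vectors (function and variable names are illustrative).

```python
# Sketch of the retrieval step behind Table 12 (illustrative names and shapes).
import numpy as np

def top_k_similar(query_vec, example_vecs, examples, k=5):
    # Cosine similarity between the query and every memorized example.
    q = query_vec / np.linalg.norm(query_vec)
    e = example_vecs / np.linalg.norm(example_vecs, axis=1, keepdims=True)
    sims = e @ q
    order = np.argsort(sims)[::-1][:k]     # indices of the k most similar examples
    return [(examples[i], float(sims[i])) for i in order]
```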
