Interpretable Conversation Routing via the Latent Embeddings Approach
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper presents an interpretable approach for conversation routing using latent embedding retrieval. It focuses on the challenge of effectively routing queries to multiple agents in a chatbot system that must handle various tasks, ranging from simple questions to more complex, domain-specific inquiries. The paper introduces a semantic routing method, proposes benchmark datasets, and compares it against traditional LLM-based routers. The results indicate that latent embedding routing can achieve performance comparable to LLM-based routing, with the added benefits of interpretability and easier control.
Strengths:
1. The proposed approach adds interpretability and control to routing systems, particularly relevant for real-world applications where accountability is essential.
2. The authors present a thorough benchmark demonstrating how semantic routing performs under multiple conditions, including multilingual contexts and various application scenarios.
3. The paper also proposes an innovative pruning mechanism to reduce redundant examples, making the routing system more efficient without losing accuracy.
Drawbacks:
1. The paper focuses heavily on comparing latent embedding routing with LLM-based routers, but it lacks a comprehensive evaluation against other baseline models, such as traditional text classifiers (e.g., SVMs or Random Forests). These methods might serve as a simpler alternative for specific use cases and provide valuable insights into the practical trade-offs.
2. While the paper presents F1 scores and accuracy for different routing methods, no statistical analysis is provided to validate whether the observed differences in performance are statistically significant. Given that some differences are quite small, ensuring that the results are not just due to chance is important.
Recommendations:
1. Add a comparison of the proposed routing mechanism with traditional text classification models, such as SVM, Naive Bayes, or other simpler ML techniques. This would help highlight the practical advantages of using latent embeddings over these established approaches.
2. Including statistical tests (e.g., t-tests or ANOVA) to validate the significance of performance differences between different routing approaches will improve the quality of the paper and help strengthen the claims made about the performance improvements.
3. The paper has some grammatical errors and awkward phrasing, particularly in the introduction and methodology sections. A thorough proofreading would enhance readability.
4. The visualizations are helpful but could be improved by adding legends and clearer labels to make them more interpretable, especially for readers who may not be familiar with TSNE plots.
Author Response
Comments 1: Add a comparison of the proposed routing mechanism with traditional text classification models, such as SVM, Naive Bayes, or other simpler ML techniques. This would help highlight the practical advantages of using latent embeddings over these established approaches.
Response 1: We conducted additional experiments with an XLM-R-based router (trained on 60% of the dataset, with performance measured on the remaining 40% of the benchmark and on the jailbreak detection task). The XLM-R router results can be seen in Sections 3.1 and 3.2. The experiments showed that the semantic router underperforms a fine-tuned transformer on valid route classification. However, XLM-R performs significantly worse on the jailbreak detection task, which reflects one of the crucial advantages of the semantic router (additional protection of LLM agents from instruction injections). We also elaborated on why classic TF-IDF-based classifiers would not work for the routing task (difficulty with multilingual and multidomain scalability). This discussion was added to the Introduction section of the paper.
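For readers unfamiliar with the mechanism discussed in these responses, the semantic router's decision rule can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function names and toy data are hypothetical, and only the nearest-example rule with a rejection threshold (the reviews mention a value of 0.6) reflects the described method.

```python
import numpy as np

def route(query_emb: np.ndarray,
          memory_embs: np.ndarray,
          memory_routes: list[str],
          rejection_threshold: float = 0.6) -> str:
    """Return the route of the most similar stored example,
    or 'rejected' if no example is similar enough."""
    # Cosine similarity between the query and every example in router memory.
    norms = np.linalg.norm(memory_embs, axis=1) * np.linalg.norm(query_emb)
    sims = (memory_embs @ query_emb) / norms
    best = int(np.argmax(sims))
    if sims[best] < rejection_threshold:
        return "rejected"  # out-of-scope query or possible jailbreak attempt
    return memory_routes[best]

# Toy 2-D "embeddings" standing in for real encoder outputs:
mem = np.array([[1.0, 0.0], [0.0, 1.0]])
routes = ["billing", "support"]
```

Because each decision is grounded in a concrete nearest example, the routing outcome can be inspected and changed by editing router memory, which is the interpretability property the responses emphasize.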
Comments 2: Including statistical tests (e.g., t-tests or ANOVA) to validate the significance of performance differences between different routing approaches will improve the quality of the paper and help strengthen the claims made about the performance improvements.
Response 2: We appreciate the suggestion to include statistical tests, as they can provide additional insight. However, they would require a broader dataset and additional experiments. Instead, we focused on qualitative and quantitative results showing that introducing example pruning does not cause a performance drop, and we demonstrated a clear advantage of the semantic router over a fine-tuned transformer on the jailbreak detection task.
Comments 3: The paper has some grammatical errors and awkward phrasing, particularly in the introduction and methodology sections. A thorough proofreading would enhance readability.
Response 3: The whole text was proofread and edited to fix old mistakes and to avoid introducing new ones in the extended sections. The text was additionally checked with external grammar correction tools.
Comments 4: The visualizations are helpful but could be improved by adding legends and clearer labels to make them more interpretable, especially for readers who may not be familiar with TSNE plots.
Response 4: We replaced the TSNE visualizations with UMAP projections, as they better balance the local and global structure of the high-dimensional data. Both dimensionality reduction methods are now described in Section 3.3, along with comments on why we switched from TSNE to UMAP. All visualizations were regenerated, and the legends were extended to improve readability (in particular, the large circles with a black border, which mark the most similar examples from router memory). We also expanded the explanations of the plots to aid their interpretation.
Author Response File: Author Response.docx
Reviewer 2 Report
Comments and Suggestions for Authors
1. The paper addresses the challenges of interpretability and control in large language model (LLM) routers, specifically in conversation routing systems, and introduces a semantic routing method based on latent embeddings. While the introductory sections effectively establish the motivation, adding more detailed context about alternative routing techniques (e.g., traditional classifiers or other few-shot methods) would strengthen the case for the proposed method.
2. The paper briefly mentions related methods, but expanding this section to include detailed comparisons with other routing strategies and explainability techniques (e.g., zero-shot and few-shot classification, SHAP, attention masks) could help contextualize the contributions. For example, explaining why alternative interpretability methods, such as SHAP or attention-based models, may be less effective for dynamic routing could clarify the advantages of using latent embeddings.
3. The paper presents the routing approach with clarity but could benefit from a structured breakdown of its components. For example, separating sections on embedding generation, similarity scoring, and threshold optimization would clarify the method for readers. More specifics on parameter choices, such as the rationale for using a similarity score threshold of 0.6 or an example pruning threshold of 0.8, would improve the method’s reproducibility. Explaining why specific encoder models (like Google’s text-multilingual-embedding-002) were chosen over others would also help readers understand the model's generalizability across contexts. The explanation on example pruning is a strong contribution, but adding pseudo-code or a flowchart could make the step-by-step pruning method more accessible. For example, showing a sample input, how it is filtered, and what is retained would better illustrate the process.
4. As the dataset is proprietary and currently unavailable for public use, this restricts the methodology's applicability. It would be beneficial to discuss this limitation in terms of generalizability and address how the method might perform with other datasets or in different domains. Additionally, consider elaborating on the characteristics of the dataset (e.g., the diversity in language and question types) to guide other researchers who might want to replicate the experiments with open-source datasets. Discussing potential domain-specific adjustments (e.g., how similar or different pruning thresholds might be needed in medical or retail domains) would also add valuable insight for readers considering this method for different use cases.
5. The results section effectively reports accuracy, F1 scores, and classification performance across routes. However, including statistical significance tests (e.g., p-values) for performance differences between routers could add rigor. For instance, is the observed performance difference between the pruning and no-pruning configurations statistically significant? The breakdown of performance on jailbreak datasets is highly relevant. However, more interpretation of these findings—such as a deeper discussion on the potential vulnerability of the routing method to domain-specific keyword injections—could strengthen the analysis. Consider proposing potential solutions or mitigations for these vulnerabilities to enhance the practical application of the method.
6. The use of TSNE projections to visualize embeddings is helpful. However, Figure captions could be expanded to clarify all elements, particularly the significance of larger circles in the decision-making process. Color-coding the routes or adding labels within the figures would make them more readable. The interpretation of the TSNE projections could also be expanded. For instance, explain why certain out-of-scope messages cluster closely with valid routes, especially for categories that are “closer” in vector space, and what implications this has for the robustness of the router.
7. The discussion section should elaborate on how this approach could be adapted or scaled. For instance, would increasing the diversity of training examples or using larger datasets affect the pruning thresholds or similarity thresholds? The paper could also suggest directions for future work to address keyword-injection vulnerabilities or investigate methods for filtering out misleading examples in the embedding space. Additionally, exploring the integration of the proposed method with other explainability techniques, like counterfactual explanations, could offer avenues for extending the research.
Comments on the Quality of English Language
1. The paper generally maintains an acceptable standard of technical language. However, some complex sentences could benefit from simplification. For example, phrases like “reproducibility in contrast to LLM router” could be rephrased for clarity, e.g., “in contrast to the less interpretable LLM router, this approach offers improved reproducibility and ease of modification.” Certain terms and concepts, such as “pruning preprocessing” and “rejection threshold filtering,” are technical and could be briefly defined upon first use to ensure accessibility for a broader audience.
2. Minor grammatical inconsistencies are present, particularly in noun-verb agreements (e.g., “instructions conflict each other”, which should read “instructions conflict with each other”). A detailed proofread to catch these inconsistencies would improve readability. The overall sentence structure is sound, but in some cases, separating long sentences into shorter ones would enhance clarity, particularly in technical explanations.
3. The technical vocabulary is largely appropriate for the target audience, but more precision in terms like “interpretability” and “control” could help readers understand exactly what is being controlled or interpreted (e.g., decision paths, model outputs). Consistent terminology would help reinforce the methodology's coherence. For instance, if "semantic routing" is used to describe a specific process, this term should be maintained consistently rather than occasionally substituting terms like "latent embeddings routing".
Author Response
Comment 1: The paper addresses the challenges of interpretability and control in large language model (LLM) routers, specifically in conversation routing systems, and introduces a semantic routing method based on latent embeddings. While the introductory sections effectively establish the motivation, adding more detailed context about alternative routing techniques (e.g., traditional classifiers or other few-shot methods) would strengthen the case for the proposed method.
Response 1: The Introduction section was extended with additional explanations of why traditional classifiers based on, for example, TF-IDF would not handle the semantic routing task (lack of multilingual and multidomain scalability). We also conducted additional experiments to measure the performance of an XLM-R-based router, fine-tuned on 60% of the dataset and evaluated on the remaining 40%. We highlighted how it outperforms both the semantic router and LLM-based classifiers on valid route classification, and how its performance drops on the jailbreak detection task, which is one of the main tasks of a conversation router. The results of these experiments and the accompanying explanations can be found in Sections 3.1 and 3.2.
Comment 2: The paper briefly mentions related methods, but expanding this section to include detailed comparisons with other routing strategies and explainability techniques (e.g., zero-shot and few-shot classification, SHAP, attention masks) could help contextualize the contributions. For example, explaining why alternative interpretability methods, such as SHAP or attention-based models, may be less effective for dynamic routing could clarify the advantages of using latent embeddings.
Response 2: The Introduction section was extended with explanations of why SHAP and attention masks would be less effective for explaining conversation routing: the curse of dimensionality, since input texts tend to be long, and a lack of causality, since these methods can highlight significant features of the input but do not provide a direct reason for why a certain decision was made or how the model's behavior can be modified. We also added further explanations in Section 3.3 and the Discussion to highlight that the semantic router provides a clear reason for each decision (the set of most similar examples chosen from router memory), which allows users to modify the decision-making by introducing new examples, removing redundant or conflicting ones, or editing existing ones.
Comment 3: The paper presents the routing approach with clarity but could benefit from a structured breakdown of its components. For example, separating sections on embedding generation, similarity scoring, and threshold optimization would clarify the method for readers. More specifics on parameter choices, such as the rationale for using a similarity score threshold of 0.6 or an example pruning threshold of 0.8, would improve the method’s reproducibility. Explaining why specific encoder models (like Google’s text-multilingual-embedding-002) were chosen over others would also help readers understand the model's generalizability across contexts. The explanation on example pruning is a strong contribution, but adding pseudo-code or a flowchart could make the step-by-step pruning method more accessible. For example, showing a sample input, how it is filtered, and what is retained would better illustrate the process.
Response 3: A diagram illustrating the pruning approach, including a possible generalization of examples, was added to Section 2.3. Regarding the choice of encoder model, we referred to our previous research, where text-multilingual-embedding-002 was compared to e5, Google's text-embedding-003, and OpenAI's embedding models (large and ada embeddings); text-multilingual-embedding-002 gave the best accuracy on the routing task. This paper continues the research we started in the paper "Benchmarking Conversation Routing in Chatbot Systems Based on Large Language Models", so we also refer to it to explain the choice of the rejection threshold value. The choice of pruning threshold is justified in Section 3.1, where we show that lower values reject too many samples and cause a significant accuracy degradation, while higher values barely change the number of examples in router memory, so using them makes little sense. We also extended the Discussion section to explain that the pruning threshold requires tuning for each specific encoder model and cannot be reused between encoders, since their embeddings for the same pair of texts may have different pairwise similarities. This can be caused by the encoders' pretraining datasets, their target tasks, and the pooling mechanisms used to reduce dimensionality.
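The pruning step discussed above can be sketched as a greedy, per-route deduplication over embedding similarities. This is a hedged illustration under assumptions, not the authors' exact procedure (which is described in Section 2.3 of the paper): the function name and toy data are hypothetical; only the idea of discarding an example when an already-kept example of the same route exceeds the pruning threshold (0.8 per the response above) follows the described method.

```python
import numpy as np

def prune_examples(embs: np.ndarray,
                   routes: list[str],
                   pruning_threshold: float = 0.8) -> list[int]:
    """Return indices of examples to keep: an example is dropped when an
    already-kept example of the same route is more similar than the threshold."""
    # Normalize rows once so dot products become cosine similarities.
    unit = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(routes)):
        redundant = any(
            routes[j] == routes[i] and float(unit[i] @ unit[j]) >= pruning_threshold
            for j in kept
        )
        if not redundant:
            kept.append(i)
    return kept
```

As the response notes, the right threshold depends on the encoder: two encoders can assign quite different pairwise similarities to the same pair of texts, so the 0.8 value would need retuning per model.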
Comment 4: As the dataset is proprietary and currently unavailable for public use, this restricts the methodology's applicability. It would be beneficial to discuss this limitation in terms of generalizability and address how the method might perform with other datasets or in different domains. Additionally, consider elaborating on the characteristics of the dataset (e.g., the diversity in language and question types) to guide other researchers who might want to replicate the experiments with open-source datasets. Discussing potential domain-specific adjustments (e.g., how similar or different pruning thresholds might be needed in medical or retail domains) would also add valuable insight for readers considering this method for different use cases.
Response 4: We added a table describing the number of samples for each language present in the dataset. We also added a reference to our previous paper, "Benchmarking Conversation Routing in Chatbot Systems Based on Large Language Models", where this benchmark dataset was first presented; it contains more visualizations of samples (TSNE and scatter plots), additional statistics, and extended route descriptions. As this paper continues and extends our previous research, we refer to its results and findings to give more information on the currently proprietary dataset. The dataset is expected to be open-sourced later to allow full reproducibility of the results.
Comment 5: The results section effectively reports accuracy, F1 scores, and classification performance across routes. However, including statistical significance tests (e.g., p-values) for performance differences between routers could add rigor. For instance, is the observed performance difference between the pruning and no-pruning configurations statistically significant? The breakdown of performance on jailbreak datasets is highly relevant. However, more interpretation of these findings—such as a deeper discussion on the potential vulnerability of the routing method to domain-specific keyword injections—could strengthen the analysis. Consider proposing potential solutions or mitigations for these vulnerabilities to enhance the practical application of the method.
Response 5: The paper now highlights the possibility of such similarity attacks on semantic routers, as they were not mentioned in the original method proposal. We tried two simple ways to counter them: removing the most frequent words from the router's example set, and voting classification with the removal of the first and last N tokens. However, neither method gives a significant improvement in jailbreak detection, and both can degrade the performance of valid route classification. We also showed that the same attack works on the XLM-R-based router, whose jailbreak detection accuracy drops to 0.02, although the transformer's accuracy could likely be improved by fine-tuning on examples of such attacks. The semantic router based on sentence embeddings still needs more research to find a way of countering such attacks without accuracy drops on ordinary sample classification, so we plan to propose more solutions in future publications. We appreciate the suggestion to include statistical tests, but it would require a broader dataset with more real-life examples to yield real insight into the statistical significance of the pruning approach. In this paper we focused on qualitative and quantitative results demonstrating that pruning does not lead to a significant performance drop, and we highlighted the advantage of the semantic router over classic transformer classifiers for jailbreak detection. The semantic router does not need significant fine-tuning to handle jailbreaks with an accuracy higher than 0.95, while the XLM-R router struggles with ordinary jailbreaks unless they are included in the training set, and struggles even more once jailbreaks are masked with router-topic-related statements or keywords.
Comment 6: The use of TSNE projections to visualize embeddings is helpful. However, Figure captions could be expanded to clarify all elements, particularly the significance of larger circles in the decision-making process. Color-coding the routes or adding labels within the figures would make them more readable. The interpretation of the TSNE projections could also be expanded. For instance, explain why certain out-of-scope messages cluster closely with valid routes, especially for categories that are “closer” in vector space, and what implications this has for the robustness of the router.
Response 6: The TSNE projections were replaced with UMAP to achieve a better balance between local and global structure preservation of the original high-dimensional data. Both methods are explained at the start of Section 3.3, and the plot legends were extended to describe the larger circles with a black border (the most similar examples from router memory). Color coding was already present: on all plots, each sample is marked with the color corresponding to its route, and the query is color-coded with the color of its target route and marked with an X symbol. The plot explanations were expanded for better interpretation. We also explained why some points may appear close to each other in the projection and still not be the most similar ones in reality (due to the inconsistencies of 2D projections and the trade-offs the algorithm makes while preserving relationships between points during dimensionality reduction). In the Discussion section, we also mention a plan to research decomposition techniques to better explain embeddings and their most significant features. This could help determine which features specifically lead to the closeness of certain points in the 2D projections, and how to counter similarity attacks that inject topic-related statements.
Comment 7: The discussion section should elaborate on how this approach could be adapted or scaled. For instance, would increasing the diversity of training examples or using larger datasets affect the pruning thresholds or similarity thresholds? The paper could also suggest directions for future work to address keyword-injection vulnerabilities or investigate methods for filtering out misleading examples in the embedding space. Additionally, exploring the integration of the proposed method with other explainability techniques, like counterfactual explanations, could offer avenues for extending the research.
Response 7: The Discussion section was extended to explain which factors may affect the tuning of the pruning threshold. We also proposed more ideas for further research and improvements of the findings presented in the paper: searching for effective ways to counter similarity attacks on the semantic router, automatically tuning the pruning threshold, extending the dataset, and checking the effectiveness of pruning at larger router memory sizes. We further extended the explanation of the semantic router's interpretability and controllability advantage: it provides a direct cause for each decision, which makes it easier to modify and control further classifications. The approach can be scaled to multimodal tasks, but it requires parameter tuning (pruning and rejection thresholds) for each specific encoder model, since different encoders can give different pairwise similarities for the same pair of texts.
Comment 8: 1. The paper generally maintains an acceptable standard of technical language. However, some complex sentences could benefit from simplification. For example, phrases like “reproducibility in contrast to LLM router” could be rephrased for clarity, e.g., “in contrast to the less interpretable LLM router, this approach offers improved reproducibility and ease of modification.” Certain terms and concepts, such as “pruning preprocessing” and “rejection threshold filtering,” are technical and could be briefly defined upon first use to ensure accessibility for a broader audience.
Response 8: The paper was proofread, and we made stylistic changes and grammatical corrections. The text was also passed through external grammar correction tools to minimize the number of awkward sentences and errors.
Comment 9: Minor grammatical inconsistencies are present, particularly in noun-verb agreements (e.g., “instructions conflict each other”, which should read “instructions conflict with each other”). A detailed proofread to catch these inconsistencies would improve readability. The overall sentence structure is sound, but in some cases, separating long sentences into shorter ones would enhance clarity, particularly in technical explanations.
Response 9: The paper text was corrected to minimize the number of mistakes. The structure was also improved by adding another subsection about the modification of the semantic router approach, and some paragraphs were rearranged for better readability. Some mistakes in sentence ordering were also fixed during proofreading.
Comment 10: The technical vocabulary is largely appropriate for the target audience, but more precision in terms like “interpretability” and “control” could help readers understand exactly what is being controlled or interpreted (e.g., decision paths, model outputs). Consistent terminology would help reinforce the methodology's coherence. For instance, if "semantic routing" is used to describe a specific process, this term should be maintained consistently rather than occasionally substituting terms like "latent embeddings routing".
Response 10: We fixed as many terminological inconsistencies as possible, so there should be fewer such cases in the new revision of the paper. Terms like interpretability and controllability were also clarified in the Introduction to explain that they refer to the model's decision-making process.
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
All good now.