Electronics
  • Editor’s Choice
  • Article
  • Open Access

24 February 2025

How the Choice of LLM and Prompt Engineering Affects Chatbot Effectiveness

Department of Information Systems, Kielce University of Technology, 7 Tysiąclecia Państwa Polskiego Ave., 25-314 Kielce, Poland
This article belongs to the Special Issue New Trends in Artificial Neural Networks and Its Applications

Abstract

Modern businesses increasingly rely on chatbots to enhance customer communication and automate routine tasks. The research aimed to determine the optimal configurations of a telecommunications chatbot on the Rasa Pro platform, including the selection of large language models (LLMs), prompt formats, and command structures. The impact of various LLMs, prompt formats, and command precision on response quality was analyzed. Smaller models, such as Gemini-1.5-Flash-8B and Gemma2-9B-IT, can achieve results comparable to larger models, offering a cost-effective solution. Specifically, the Gemini-1.5-Flash-8B model achieved an accuracy improvement of 21.62 percentage points when using the JSON prompt format. This emphasizes the importance of prompt engineering techniques, such as using structured formats (YAML, JSON) and precise commands. The study utilized a dataset of 400 sample test phrases created from real customer service conversations with a mobile phone operator’s customers. The results suggest that optimizing chatbot performance does not always require the most powerful models; proper prompt preparation and data format choice are crucial. The theoretical framework focuses on the interaction between model size, prompt format, and command precision. Findings provide insights for chatbot designers to optimize performance through LLM selection and prompt construction. These findings have practical implications for businesses seeking cost-effective and efficient chatbot solutions.

1. Introduction

Modern businesses increasingly use chatbots to improve communication with customers. Their role as the first line of contact in customer service is steadily growing, increasingly automating the handling of routine tasks. This dynamic development drives intensive research on AI-based dialog systems, aiming to enhance their capabilities, improve user experience, and address the challenges of human-like interaction.
The growing demand for and success of conversational agents have spurred the development of various technologies to create these systems. Tech giants like Google (Dialogflow), IBM (Watson Assistant), Microsoft (Azure AI Bot Service), and Amazon (Lex) have released their own chatbot creation platforms. Smaller companies like Rasa, ManyChat, FlowXO, and Pandorabots have also proposed their tools, offering a wider range of solutions for businesses of different sizes [1,2].
Given the diversity of terminology across these systems, this work primarily adopted the terminology from Rasa Pro, a widely adopted open-source platform with a large and active community [3]. Its highly modular and configurable architecture, combined with the active involvement of a global community of over six hundred developers and ten thousand forum members, allows for continuous innovation and the integration of cutting-edge AI research and techniques [4].
In the initial phase of development, dialog systems were primarily based on Natural Language Understanding (NLU) models. These models function by defining intents, which represent the user’s goal or purpose, and training them on a set of example phrases that express each intent. This allows the NLU model to recognize user intents based on the provided phrases. These defined intents are then combined into conversation scenarios, forming the foundation for more complex interactions [5,6]. Chatbots built this way need to be continuously retrained for both new intents and those that are not effectively recognized. A significant advantage of NLU models is their relatively low computational requirements. They can be trained quickly and efficiently, and their small size minimizes the need for extensive hardware resources, making them cost-effective to deploy and maintain [7]. In practice, this meant that companies could deploy such chatbots without investing in expensive infrastructure. NLU models are also more understandable and easier to manage for developers, speeding up the iterative learning process and improving response quality.
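As a minimal illustration of this intent-based approach, an NLU model is trained on a mapping of intents to example utterances; the intent names and phrases below are hypothetical and are not taken from this study.

# Hypothetical NLU training data: each intent is defined by example phrases.
nlu_training_data = {
    "report_device_damage": [
        "my phone screen is cracked",
        "the device I bought from you stopped working",
    ],
    "payment_inquiry": [
        "how much is my bill this month",
        "when is my payment due",
    ],
}
# A classifier trained on such data maps a new utterance to one of the intents,
# which is then routed into a predefined conversation scenario.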
The emergence of large language models (LLMs) has revolutionized AI dialog platforms, enabling more sophisticated conversation flows. Each flow is designed to not only recognize the customer’s initial intent but also to manage the entire conversation. This is possible because conversation flows leverage the powerful capabilities of LLMs, which allow for continuous understanding of the conversation context and can effectively process subsequent customer statements without the need for explicit definition of additional intents [8].
The aim of this research was to analyze the performance of various large language models (LLMs) on the Rasa Pro platform, highlighting the effectiveness of smaller models. This paper presents several key contributions to the field of AI-based dialog systems:
  • A comprehensive analysis of the performance of various large language models (LLMs) on the Rasa Pro platform, highlighting the effectiveness of smaller models such as Gemini-1.5-Flash-8B and Gemma2-9B-IT.
  • Demonstration of the significant impact of prompt engineering techniques, such as using structured formats like YAML and JSON, on the accuracy and efficiency of chatbot responses.
  • Presentation of practical insights for chatbot designers, emphasizing the importance of model selection and prompt construction in optimizing chatbot performance.
The structure of this paper is as follows: Section 2 presents a review of related work, Section 3 describes the methodology used in this study, Section 4 discusses the results, and Section 5 concludes with insights and future research directions.

3. Research Methodology

This study employs a mixed-methods approach, combining quantitative and qualitative research methods to investigate how the choice of large language model (LLM) and prompt engineering techniques influence the quality of responses generated by chatbots.

3.1. Quantitative Methods

The quantitative part of the study involved the analysis of historical customer service conversations conducted by a mobile phone provider. A total of 1054 real phone conversations and 545 real chat conversations were analyzed; these are described in more detail in my previous studies [22,23]. Based on this analysis, 10 main conversation flows were identified, and a total of 400 sample test phrases were developed for these flows (Table 1). The dataset was carefully constructed to ensure accurate and reliable evaluation: an expert reviewed each phrase to confirm its assignment to the appropriate conversation flow, and lexical diversity was deliberately introduced. For example, in the context of reporting device damage, synonyms such as phone, device, gadget, and screen were used.
Table 1. Defined conversation flows and example phrases.
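To illustrate the structure of the resulting test set, the sketch below maps conversation flows to lists of test phrases. The flow names purchase_phone_number and payment_inquiry appear elsewhere in this article; the phrases themselves are hypothetical stand-ins, not entries from the actual dataset.

# Illustrative test set structure: 10 conversation flows, 400 phrases in total.
test_phrases = {
    "report_device_damage": [
        "my phone fell and the screen is broken",
        "the gadget I got from you no longer turns on",
    ],
    "purchase_phone_number": [
        "I would like to buy an additional phone number",
    ],
    "payment_inquiry": [
        "why is my latest invoice higher than usual",
    ],
    # ... remaining flows and phrases, up to 400 test phrases in total
}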

3.2. Qualitative Methods

The qualitative part of the study involved an in-depth analysis of the literature and a detailed study of the Rasa Pro platform. This analysis led to the formulation of the following research hypotheses:
  • Smaller language models (LLMs) can achieve accuracy comparable to larger models in recognizing user intents and selecting the appropriate conversation flows [11].
  • Converting the bot description within the prompt (i.e., information about the chatbot’s purpose, capabilities, and intended use cases) from plain text to a structured format (Markdown, YAML, JSON) can increase the accuracy of responses generated by LLM models [19].
  • Precisely specifying the expected output within the prompt should translate into greater accuracy of the responses generated by LLM models [21].

3.3. Inter-Rater Agreement

To ensure the reliability of the qualitative analysis, the inter-rater agreement was assessed. Several invited experts independently reviewed the assignment of phrases to conversation flows. The level of agreement among the experts was high. In cases where disagreements occurred, they were resolved through discussion and consensus among the experts.
To ensure the reliability and repeatability of studies comparing the accuracy of various language models (LLMs) in the context of chatbots serving telecommunications customers, a research environment was designed to meet the following criteria:
  • Availability and scalability: the chatbot system based on the Rasa Pro platform and the tested language models were configured to operate in a cloud environment, allowing for the easy scaling of computational resources and availability for other researchers.
  • Openness and reproducibility: all system components were selected to be available for free, at least in a limited version.
This allows other researchers to easily replicate the experiments conducted, both for the same and different conversation flows and datasets.
In this study, a telecommunications chatbot was developed using the Rasa Pro platform. The solution architecture, shown in Figure 1, includes the Rasa Inspector web application [24] designed for user interaction. The core of the system is the Rasa Pro platform, deployed in the GitHub Codespaces cloud environment [25], which has been integrated with other cloud environments such as Gemini API on Google Cloud [26] and Groq Cloud [27]. This integration provides the chatbot with access to a wide range of advanced language models, enhancing its capabilities and enabling the exploration of different AI/ML models.
Figure 1. Chatbot solution architecture.
Table 2 presents an overview of the LLMs used in the experiments. For each model, the following information was provided:
Table 2. Overview of LLMs used in the research.
  • Working name: adopted in this study for easier identification.
  • Cloud platform: where the model is available.
  • Full model name: according to the provider’s naming convention.
  • Number of parameters: characterizing the size and complexity of the model.
  • Reference to the literature: allowing for a detailed review of the model description.
To compare the efficiency of different language models (LLMs) and various prompt formulations, a series of experiments was conducted. Each experiment consisted of 400 iterations, during which each of the analyzed phrases was tested against the 10 defined conversation flows. Before each experiment, the Rasa Pro configuration was modified, changing the prompt template as well as the LLM provider and model. Sample configurations are listed in Table 3, and a sketch of the resulting experiment loop is given below. The models were selected based on several key factors: performance, scalability, customization capabilities, and availability on popular cloud platforms. They represent a range of sizes and complexities, allowing for a comprehensive evaluation of their effectiveness in different scenarios, and they are widely used in the research community, providing a solid foundation for comparison and validation of results [17,18,30].
Table 3. Selected configurations of prompt templates, LLM provider, and models.
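The experiment loop described above can be outlined as follows. This is a hypothetical sketch rather than the actual test harness: render_prompt, query_llm, and score_response are placeholder names standing in for the Rasa Pro prompt template, the cloud LLM call, and the accuracy metric defined later in this section.

# Hypothetical outline of the 64 experiments (8 models x 4 formats x 2 commands),
# each running all 400 test phrases against one configuration.
from itertools import product

MODELS = ["Llama-1B", "Llama-3B", "Llama-8B", "Gemini1.5F-8B",
          "Gemma-9B", "Llama-70B", "Gemini1.5F", "Gemini2.0F"]
PROMPT_FORMATS = ["plain text", "Markdown", "YAML", "JSON"]
COMMANDS = ["Concise", "Precise"]

def run_experiments(test_phrases, render_prompt, query_llm, score_response):
    results = {}
    for model, fmt, cmd in product(MODELS, PROMPT_FORMATS, COMMANDS):
        scores = []
        for expected_flow, phrases in test_phrases.items():
            for phrase in phrases:
                prompt = render_prompt(phrase, fmt=fmt, command=cmd)
                response = query_llm(model, prompt)
                scores.append(score_response(response, expected_flow))
        # accuracy for this configuration, expressed as a percentage
        results[(model, fmt, cmd)] = 100 * sum(scores) / len(scores)
    return results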
For a detailed analysis of interactions, Rasa Inspector was run in debug mode. This made it possible to track the prompts sent to the LLM models, the LLMs’ responses, and the chatbot’s actions. Table 4 presents the abbreviated prompt structure, including the potential conversation flows, an example input phrase, the possible actions, and the final command that instructs the LLM on how to generate the chatbot’s response. These fragments allow the full prompts used in the study to be reconstructed from the Rasa Pro documentation and from Rasa Inspector running in debug mode. The full prompts were not included in the article due to their length: depending on the format, each prompt contained between 58 and 116 lines.
Table 4. Abbreviated prompt structure.
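For orientation, a simplified, hypothetical reconstruction of this prompt structure is sketched below; it is not the verbatim Rasa Pro template, which is longer (between 58 and 116 lines, depending on the format) and worded differently.

# Hypothetical, abbreviated prompt structure modeled on Table 4.
def build_prompt(user_message: str, flow_descriptions: str, final_command: str) -> str:
    return (
        "You are assisting customers of a mobile phone operator.\n\n"
        "Available conversation flows:\n"
        f"{flow_descriptions}\n\n"   # rendered as plain text, Markdown, YAML, or JSON
        f"Latest user message: {user_message}\n\n"
        "Possible actions: StartFlow(flow_name), "
        "Clarify(flow_name_1, ..., flow_name_n)\n\n"
        f"{final_command}"           # the "Concise" or "Precise" final command
    )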
Despite the expectation that the LLM would consistently generate actions in the required format, instances were observed where the model provided more descriptive responses. This inconsistency with expectations affected the further course of the conversation, as the Rasa Pro system was unable to interpret the model’s responses. Since the purpose of the phrases was to unambiguously initiate a specific flow, only the actions StartFlow(flow_name) and Clarify(flow_name[1..n]) were considered correct. Other types of responses were treated as incorrect.
To quantitatively assess the model’s effectiveness in generating correct actions, the commonly used metric “Accuracy” was applied [33]. For each LLM response, accuracy was calculated as follows:
  • Actions StartFlow(flow_name): If the given flow_name matched the expected flow, the accuracy was 1; otherwise, it was 0.
  • Actions Clarify(flow_name[1..n]): If the expected flow name was among those provided, the accuracy was the reciprocal of the number of provided flow names; if none of the provided flow names was correct, the accuracy was 0.
Sample accuracy values for different model configurations are presented in Table 5. The accuracy for each row is explained as follows:
Table 5. Sample accuracy results for different LLM responses.
  • The LLM response correctly identified the flow name, resulting in an accuracy of 100.00%.
  • The LLM response did not match the correct flow name, resulting in an accuracy of 0.00%.
  • The LLM response included the correct flow name purchase_phone_number among the provided options. Since one out of two provided flow names was correct, the accuracy was calculated as one half, resulting in an accuracy of 50.00%.
  • The LLM response included the correct flow name payment_inquiry among the provided options. Since one out of three provided flow names was correct, the accuracy was calculated as one-third, resulting in an accuracy of 33.33%.
  • The LLM response did not include the correct flow name, resulting in an accuracy of 0.00%.
In the remainder of the study, accuracy, expressed as a percentage, serves as the primary measure of the quality of the responses generated by the models.
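A minimal sketch of this scoring rule is given below, matching the score_response placeholder used in the experiment-loop sketch above. It assumes, for simplicity, that each LLM response contains a single action of the form StartFlow(...) or Clarify(...); responses with action lists or additional text are treated as incorrect here.

import re

def score_response(response: str, expected_flow: str) -> float:
    """Accuracy of one LLM response against the expected conversation flow."""
    match = re.fullmatch(r"(StartFlow|Clarify)\((.*)\)", response.strip())
    if not match:
        return 0.0  # descriptive or otherwise unparsable responses are incorrect
    action, args = match.groups()
    flow_names = [name.strip() for name in args.split(",") if name.strip()]
    if action == "StartFlow":
        # the single suggested flow must be the expected one
        return 1.0 if flow_names == [expected_flow] else 0.0
    # Clarify: the expected flow among n suggestions scores 1/n, otherwise 0
    return 1.0 / len(flow_names) if expected_flow in flow_names else 0.0

# Examples matching Table 5: a correct StartFlow scores 100.00%; a Clarify with
# the expected flow among two suggestions scores 50.00%, among three 33.33%.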
To verify the hypothesis that modifying the bot description in the prompt from plain text to a structured format affects the accuracy of LLM responses, experiments were conducted with different prompt formats. The bot descriptions and available actions were presented in the following formats: plain text, Markdown, YAML and JSON (Table 6).
Table 6. Prompt fragments in different formats.
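As an illustration of how the same bot description can be rendered in the compared formats, the sketch below uses Python’s json module and PyYAML; the description content is a hypothetical stand-in for the actual prompt fragments in Table 6.

# Rendering one hypothetical bot description in the four compared formats.
import json
import yaml  # PyYAML

bot_description = {
    "name": "telecom_assistant",
    "purpose": "customer service chatbot for a mobile phone operator",
    "capabilities": ["report_device_damage", "purchase_phone_number",
                     "payment_inquiry"],
}

as_plain_text = ("The assistant is a customer service chatbot for a mobile "
                 "phone operator; it handles device damage reports, phone "
                 "number purchases, and payment inquiries.")
as_markdown = ("## telecom_assistant\n"
               "- purpose: customer service chatbot for a mobile phone operator\n"
               "- capabilities: report_device_damage, purchase_phone_number, "
               "payment_inquiry")
as_yaml = yaml.safe_dump(bot_description, sort_keys=False)
as_json = json.dumps(bot_description, indent=2)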
The default Rasa Pro prompts are highly detailed, both in the flow definitions and in the action descriptions, which include extensive explanations of the chatbot’s capabilities and potential user interactions. Combined with the observation that smaller LLMs tend to generate extensive descriptions instead of specific actions, this motivated tests with modified final commands. The default command, “Your action list”, was named the “Concise” command, and the more precise command, “Return only the action or only the list of actions, no additional descriptions”, was named the “Precise” command (see the sketch below). The purpose of these tests was to examine the impact of this change, particularly on the ability of smaller language models to generate the desired response.
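For reference, the two final-command variants quoted above can be captured in a simple mapping; the keys are the labels used in this study and the values are the quoted command texts.

# The two final-command variants compared in the experiments.
FINAL_COMMANDS = {
    "Concise": "Your action list",
    "Precise": "Return only the action or only the list of actions, "
               "no additional descriptions",
}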

4. Results and Analysis

The aim of the study was to determine the impact of different language models and prompt formats on the accuracy and efficiency of the chatbot. To this end, 64 experiments were conducted, each consisting of 400 iterations, in which various input phrases were tested on 10 defined conversation flows. A wide range of combinations of different language models, prompt formats (plain text, Markdown, YAML, JSON), and final commands (“Concise”, “Precise”) were used. Each iteration corresponded to a single user interaction with the chatbot.
Table 7 and Figure 2 present the results for all models using the different prompt formats. For the plain text format, the analysis indicated a positive correlation between model size and performance. The smallest models, Llama-1B and Llama-3B, achieved very low scores of 7.46 and 32.33, respectively. The models with 8B and 9B parameters (Llama-8B, Gemini1.5F-8B, Gemma-9B) showed significantly better results: 68.99, 64.65, and 69.11. The largest models, Llama-70B, Gemini1.5F, and Gemini2.0F, achieved the best results of 84.03, 90.25, and 83.27, respectively, which is not surprising given their sizes.
Table 7. Accuracy of different LLM models across various prompt formats.
Figure 2. Bar chart comparing the accuracy of different LLM models across various prompt formats.
Subsequent experiments investigated the impact of different prompt formats on model performance. The accuracy analysis for structured formats indicated that they can improve results. The JSON format improved the performance of the Llama-3B, Gemini1.5F-8B, Gemma-9B, and Llama-70B models, while the YAML format improved the performance of the Gemini1.5F and Gemini2.0F models. The largest improvement was shown by the Gemini1.5F-8B model, whose accuracy increased from 64.65 for the plain text format to 86.27 for the JSON format, an improvement of 21.62 percentage points.
These findings are consistent with previous studies that have demonstrated the significant impact of prompt engineering on LLM performance [19]. Similar to my results, He et al. (2024) found that structured prompt formats, such as JSON and YAML, could significantly improve the accuracy of LLM responses.
The analysis of experimental results demonstrated that while larger language models generally exhibited higher performance, the use of structured prompt formats, such as JSON and YAML, enabled smaller LLM models to achieve comparable accuracy, highlighting the crucial role of prompt engineering in optimizing model performance.
These findings also align with the work of Bocklisch et al. (2024), who emphasized the potential of smaller, more resource-efficient language models. They suggested that smaller models could achieve comparable performance to larger models, which is supported by the results showing that models like Gemini1.5F-8B and Gemma-9B can perform nearly as well as larger models when using optimized prompts.
In Table 8 and Figure 3, the results of the experiments after changing the prompt command from “Concise” to “Precise” are presented. The analysis of the results shows that the prompt modification proved particularly effective for models that initially exhibited lower baseline accuracy, likely due to their tendency to generate more verbose responses. The most significant improvements were observed in the plain text format for Llama-1B (+3.04) and Llama-3B (+14.48). In the Markdown format, improvements were noted for Llama-1B (+6.68), Llama-3B (+7.11), and Llama-8B (+6.35). The YAML format showed varied responses, with significant improvements for Llama-3B (+11.07), Gemma-9B (+7.20), and Gemini1.5F-8B (+4.00), while Llama-1B (−4.73) and Gemini1.5F (−1.78) experienced a decrease in performance. In the JSON format, improvements were observed for Llama-1B (+5.83), Llama-3B (+5.36), Llama-8B (+3.63), and Gemini1.5F-8B (+1.83), while Gemini2.0F (−1.82) showed negative changes. Notably, Gemini1.5F-8B and Gemma-9B showed improvements with the “Precise” command in some formats.
Table 8. Accuracy improvement with “Precise” command compared to “Concise” command for different LLMs.
Figure 3. Bar chart comparing accuracy improvements for different LLMs with “Precise” vs. “Concise” command.
Table 9 and Figure 4 compare the highest-performing prompt for each LLM model against the baseline “Plain text/Concise” configuration. The smallest models, Llama-1B (13.50), Llama-3B (51.90), and Llama-8B (73.17), improved but ultimately achieved low accuracy. In contrast, Gemini1.5F-8B (88.10) and Gemma-9B (83.28), despite their relatively small size, demonstrated surprisingly good results with the “JSON/Precise” and “YAML/Precise” configurations, respectively. For the larger models, Llama-70B (89.64) and Gemini2.0F (89.92), changing the prompt format was beneficial, but changing the command did not help. The most mature model, Gemini1.5F (91.15), achieved the best results, but neither the format nor the final command had a significant impact on its performance.
Table 9. The highest accuracy achieved by different LLM models with the best prompt formats and commands.
Figure 4. Bar chart illustrating the highest accuracy achieved by different LLM models with the best prompt formats and commands.
The analysis of the results shows that the YAML format often outperformed plain text, especially for smaller models, resulting in higher accuracy scores. The JSON format was effective for some models, particularly those of medium size. Significantly, Gemini1.5F-8B and Gemma-9B, despite their relatively small size, demonstrated strong performance after changing the prompt format and command precision. The “Precise” command was beneficial mainly for models that initially achieved lower results, but it did not always bring improvement for high-performing models. The largest models, such as Llama-70B, Gemini1.5F and Gemini2.0F, achieved the best results in the YAML and JSON formats, but changing the command to “Precise” was not always beneficial. The analysis of the results highlights the crucial role of prompt engineering in optimizing LLM performance. The choice of the appropriate prompt format and command can significantly impact the accuracy of smaller models, enabling them to achieve results comparable to larger models.

5. Conclusions

The aim of the conducted research was to determine the optimal operating conditions for a telecommunications chatbot based on the Rasa Pro platform. To this end, a series of experiments were conducted using various language models (LLMs) and diverse prompts. The research focused on the impact of model size, prompt format, and command precision on the quality of the chatbot’s responses.
The obtained results confirmed several significant hypotheses, as summarized in Table 10. Firstly, it was found that smaller language models, such as Gemini1.5F-8B and Gemma-9B, could achieve results only slightly worse than more complex models, such as Gemini1.5F. This finding suggests that in some cases, it is not necessary to use the most complex and computationally expensive models to achieve satisfactory results.
Table 10. Summary of main findings.
Secondly, the study confirmed the significant impact of prompt format on response quality. Using structured formats, such as YAML or JSON, brought a clear improvement in response accuracy for many of the tested models. Particularly beneficial effects were observed for the Llama-3B, Gemini1.5F-8B, and Gemma-9B models, where the difference in results was most noticeable.
Thirdly, the hypothesis assuming a direct relationship between command precision in the prompt and response quality was confirmed for the smaller models (Llama-1B, Llama-3B, Llama-8B, Gemini1.5F-8B, and Gemma-9B), while for the largest models (Llama-70B, Gemini1.5F, and Gemini2.0F) the effect was not significant.
The most important discovery is that relatively small models such as Gemini1.5F-8B and Gemma-9B, after applying prompt engineering (formats and the “Precise” command), improved their results and did not significantly differ from the largest models. This finding suggests that these models are suitable for use in chatbots, offering satisfactory performance at lower computational costs.
The methodology and experimental results presented in this study, although tested on the Rasa Pro system, can be generalized to other dialogue platforms based on language models (LLMs). Since the methodology focuses on constructing prompts for LLMs, the principles of prompt engineering and the evaluation metrics used are universal and can be applied to any system utilizing LLMs for generating responses. This makes the findings of this study broadly applicable beyond the specific context of Rasa Pro. In particular, these principles can be effectively applied to other platforms where there is the possibility of replacing the LLM, such as ManyChat, FlowXO, and Pandorabots, as well as modifying prompts. For larger platforms like Google Dialogflow, IBM Watson Assistant, and Microsoft Azure AI Bot Service, while there may not always be the possibility to modify the LLM, prompt modifications are still feasible [1,2]. The flexibility and modularity of these platforms allow for the adaptation of the prompt engineering techniques discussed in this study, ensuring that the insights gained can enhance the performance and reliability of various conversational AI systems. By leveraging the capabilities of these platforms, researchers and developers can experiment with different LLMs and prompt structures, further validating and extending the applicability of the findings presented here.
Limitations of this study include the restriction to specific language models available in the cloud environment, as listed in Table 2. These models were chosen due to their availability and the possibility of free testing. As mentioned in the methodology, the research environment was designed to ensure availability, scalability, openness, and reproducibility, allowing other researchers to easily replicate the experiments conducted.
Future research should focus on more complex conversation scenarios, such as emotion recognition [23,34], contextual language understanding, and multitasking. Additionally, the impact of various machine learning techniques on improving chatbot performance is worth investigating [35]. Future studies should also include additional LLM models for which licenses can be obtained [36]. An interesting research direction is to evaluate how LLMs handle non-obvious phrases, such as those containing sarcasm, to better understand their capabilities and limitations in real-world applications [37].
The results of the conducted research open new perspectives in the field of LLM-based chatbots. They suggest that optimizing chatbot performance does not always require the use of the most powerful available models. Equally important is the proper preparation of the prompt and the choice of the appropriate data format.
The conducted research provided valuable insights into the impact of various factors on the quality of LLM-based chatbots. The research results can contribute to the development of more advanced and efficient solutions in the field of customer service.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Benaddi, L.; Ouaddi, C.; Khriss, I.; Ouchao, B. Analysis of Tools for the Development of Conversational Agents. Comput. Sci. Math. Forum 2023, 6, 5. [Google Scholar] [CrossRef]
  2. Dagkoulis, I.; Moussiades, L. A Comparative Evaluation of Chatbot Development Platforms. In Proceedings of the 26th Pan-Hellenic Conference on Informatics, Athens, Greece, 25–27 November 2022. [Google Scholar] [CrossRef]
  3. Introduction to Rasa Pro. 2025. Available online: https://rasa.com/docs/rasa-pro/ (accessed on 22 January 2025).
  4. Costa, L.A.L.F.d.; Melchiades, M.B.; Girelli, V.S.; Colombelli, F.; Araujo, D.A.d.; Rigo, S.J.; Ramos, G.d.O.; Costa, C.A.d.; Righi, R.d.R.; Barbosa, J.L.V. Advancing Chatbot Conversations: A Review of Knowledge Update Approaches. J. Braz. Comput. Soc. 2024, 30, 55–68. [Google Scholar] [CrossRef]
  5. Tamrakar, R.; Wani, N. Design and Development of CHATBOT: A Review. In Proceedings of the International Conference on “Latest Trends in Civil, Mechanical and Electrical Engineering”, Online, 12–13 April 2021. [Google Scholar]
  6. Brabra, H.; Baez, M.; Benatallah, B.; Gaaloul, W.; Bouguelia, S.; Zamanirad, S. Dialogue Management in Conversational Systems: A Review of Approaches, Challenges, and Opportunities. IEEE Trans. Cogn. Dev. Syst. 2022, 14, 783–798. [Google Scholar] [CrossRef]
  7. Matic, R.; Kabiljo, M.; Zivkovic, M.; Cabarkapa, M. Extensible Chatbot Architecture Using Metamodels of Natural Language Understanding. Electronics 2021, 10, 2300. [Google Scholar] [CrossRef]
  8. Sanchez Cuadrado, J.; Perez-Soler, S.; Guerra, E.; De Lara, J. Automating the Development of Task-oriented LLM-based Chatbots. In Proceedings of the 6th ACM Conference on Conversational User Interfaces, Luxembourg, 8–10 July 2024; CUI ’24. Association for Computing Machinery: New York, NY, USA, 2024; pp. 1–10. [Google Scholar] [CrossRef]
  9. Marvin, G.; Hellen, N.; Jjingo, D.; Nakatumba-Nabende, J. Prompt Engineering in Large Language Models. In Proceedings of the Data Intelligence and Cognitive Informatics, Tirunelveli, India, 27–28 June 2023; Jacob, I.J., Piramuthu, S., Falkowski-Gilski, P., Eds.; IEEE: Piscataway, NJ, USA, 2024; pp. 387–402. [Google Scholar] [CrossRef]
  10. Benram, G. Understanding the Cost of Large Language Models (LLMs). 2024. Available online: https://www.tensorops.ai/post/understanding-the-cost-of-large-language-models-llms (accessed on 20 January 2025).
  11. Bocklisch, T.; Werkmeister, T.; Varshneya, D.; Nichol, A. Task-Oriented Dialogue with In-Context Learning. arXiv 2024. [Google Scholar] [CrossRef]
  12. Nadeau, D.; Kroutikov, M.; McNeil, K.; Baribeau, S. Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations. arXiv 2024. [Google Scholar] [CrossRef]
  13. Yi, J.; Ye, R.; Chen, Q.; Zhu, B.; Chen, S.; Lian, D.; Sun, G.; Xie, X.; Wu, F. On the Vulnerability of Safety Alignment in Open-Access LLMs. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; IEEE: Piscataway, NJ, USA, 2024; pp. 9236–9260. [Google Scholar] [CrossRef]
  14. Zhao, S.; Tuan, L.A.; Fu, J.; Wen, J.; Luo, W. Exploring Clean Label Backdoor Attacks and Defense in Language Models. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 3014–3024. [Google Scholar] [CrossRef]
  15. Niu, Z.; Ren, H.; Gao, X.; Hua, G.; Jin, R. Jailbreaking Attack against Multimodal Large Language Model. arXiv 2024, arXiv:2402.02309. [Google Scholar]
  16. Amujo, O.E.; Yang, S.J. Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making. arXiv 2024, arXiv:2407.11006. [Google Scholar]
  17. Gemma Team. Gemma: Open Models Based on Gemini Research and Technology. arXiv 2024. [Google Scholar] [CrossRef]
  18. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Roziere, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2024. [Google Scholar] [CrossRef]
  19. He, J.; Rungta, M.; Koleczek, D.; Sekhon, A.; Wang, F.X.; Hasan, S. Does Prompt Formatting Have Any Impact on LLM Performance? arXiv 2024. [Google Scholar] [CrossRef]
  20. Arora, G.; Jain, S.; Merugu, S. Intent Detection in the Age of LLMs. arXiv 2024. [Google Scholar] [CrossRef]
  21. Cao, B.; Cai, D.; Zhang, Z.; Zou, Y.; Lam, W. On the Worst Prompt Performance of Large Language Models. arXiv 2024. [Google Scholar] [CrossRef]
  22. Płaza, M.; Pawlik, Ł.; Deniziak, S. Call Transcription Methodology for Contact Center Systems. IEEE Access 2021, 9, 110975–110988. [Google Scholar] [CrossRef]
  23. Pawlik, L.; Plaza, M.; Deniziak, S.; Boksa, E. A method for improving bot effectiveness by recognising implicit customer intent in contact centre conversations. Speech Commun. 2022, 143, 33–45. [Google Scholar] [CrossRef]
  24. Rasa Inspector. 2025. Available online: https://rasa.com/docs/rasa-pro/production/inspect-assistant/ (accessed on 21 January 2025).
  25. Codespaces Documentation. Available online: https://docs.github.com/en/codespaces (accessed on 21 January 2025).
  26. Gemini API. Available online: https://ai.google.dev/gemini-api/docs (accessed on 21 January 2025).
  27. GroqCloud. Available online: https://groq.com/groqcloud/ (accessed on 21 January 2025).
  28. llama-models/models/llama3_2/MODEL_CARD.md at main · meta-llama/llama-models. Available online: https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD.md (accessed on 22 January 2025).
  29. llama-models/models/llama3_1/MODEL_CARD.md at main · meta-llama/llama-models. Available online: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md (accessed on 22 January 2025).
  30. Gemini Team Google. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv 2024. [Google Scholar] [CrossRef]
  31. google/gemma-2-9b-it · Hugging Face. 2024. Available online: https://huggingface.co/google/gemma-2-9b-it (accessed on 22 January 2025).
  32. Gemini 2.0 Flash (Experimental)|Gemini API. Available online: https://ai.google.dev/gemini-api/docs/models/gemini-v2 (accessed on 22 January 2025).
  33. Banerjee, D.; Singh, P.; Avadhanam, A.; Srivastava, S. Benchmarking LLM powered Chatbots: Methods and Metrics. arXiv 2023. [Google Scholar] [CrossRef]
  34. Kossack, P.; Unger, H. Emotion-Aware Chatbots: Understanding, Reacting and Adapting to Human Emotions in Text Conversations. In Proceedings of the Advances in Real-Time and Autonomous Systems; Unger, H., Schaible, M., Eds.; Springer: Cham, Switzerland, 2024; pp. 158–175. [Google Scholar]
  35. Vishal, M.; Vishalakshi Prabhu, H. A Comprehensive Review of Conversational AI-Based Chatbots: Types, Applications, and Future Trends. In Internet of Things (IoT): Key Digital Trends Shaping the Future; Misra, R., Rajarajan, M., Veeravalli, B., Kesswani, N., Patel, A., Eds.; Springer: Singapore, 2023; pp. 293–303. [Google Scholar]
  36. Dam, S.K.; Hong, C.S.; Qiao, Y.; Zhang, C. A Complete Survey on LLM-based AI Chatbots. arXiv 2024, arXiv:2406.16937. [Google Scholar]
  37. Wake, N.; Kanehira, A.; Sasabuchi, K.; Takamatsu, J.; Ikeuchi, K. Bias in Emotion Recognition with ChatGPT. arXiv 2023, arXiv:2310.11753. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
