1. Introduction
In the modern digital era, usability is not merely a desirable feature but a critical determinant of an application’s success or failure. Empirical data indicate that 88% of users are less likely to return after a poor user experience, underscoring the high stakes of usability failure [1]. Additionally, approximately 70% of app uninstallations occur due to performance issues [2]. These figures underscore the growing demand for developers to prioritize usability at every stage of the design and deployment lifecycle.
The new version of ISO 9241-11 [3] preserves and elaborates upon the concepts established in the 1998 edition of the standard. It extends the original definition of usability to encompass both systems and services: “the extent to which a system, product, or service can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use.” Additionally, the content has been expanded to incorporate contemporary approaches to usability and user experience [3].
Traditional usability evaluation methods, such as heuristic evaluation, user testing, and survey-based feedback, though effective, are often resource-intensive, time-consuming, and difficult to scale in rapidly evolving software environments. In contrast, user-generated content—particularly negative app reviews—offers a continuous, real-time stream of usability insights directly from end-users; however, manually analyzing such content at scale is impractical, and existing automated sentiment analysis tools often lack the semantic depth and contextual awareness needed to translate feedback into actionable design improvements. This gap is significant because usability is a key determinant of software quality: it directly influences user satisfaction, reduces errors and frustration, and enhances effectiveness, efficiency, adoption, and retention. In [4], the author further emphasizes that customer satisfaction arises from meeting or exceeding expectations, a principle particularly relevant in contexts such as logistics, where delays or errors often generate negative emotions and dissatisfaction. App reviews therefore serve as a valuable resource for eliciting user requirements, shaping design decisions, and guiding future releases. By systematically incorporating insights from app reviews into the development process, organizations can better align applications with user expectations, strengthen engagement, and improve long-term retention, making reviews one of the most important sources of feedback during the maintenance phase [5].
Recent advancements in large language models (LLMs)—such as OpenAI’s GPT-4 and Google’s Gemini—offer a new frontier in bridging this gap. These models possess powerful capabilities in natural language understanding and generation, enabling them not only to analyze unstructured text but also to generate human-like recommendations grounded in real user feedback.
This paper proposes a novel, intelligent framework that leverages LLMs to extract and generate usability enhancement recommendations from negative app reviews. By integrating prompt engineering and fine-tuning strategies, the proposed system addresses critical usability dimensions including correctness, completeness, and satisfaction.
Through the construction of a manually labeled dataset of 5000 Instagram app reviews, this study provides a benchmark to evaluate the comparative performance of GPT, Gemini, and BLOOM. Evaluation is conducted using BLEU, ROUGE, and METEOR scores, which are widely recognized metrics in natural language generation research.
The key contributions of this study are as follows:
A novel and scalable framework that leverages large language models (LLMs) to transform negative user feedback into precise, usability-driven recommendations across core usability dimensions: correctness, completeness, and satisfaction.
A manually annotated dataset of negative app reviews, curated by software engineering experts, which serves as a benchmark resource for future research in automated usability analysis and AI-assisted quality assurance.
A comparative evaluation of three prominent LLMs—GPT-4, Gemini, and BLOOM—under prompting and fine-tuning settings, highlighting performance trade-offs, model-specific strengths, and practical implications for usability recommendation tasks.
A fully integrated, web-based tool that operationalizes the framework, enabling practitioners to analyze user reviews and receive real-time, AI-generated usability recommendations—facilitating faster decision-making in the software development lifecycle.
A practical contribution to user-centered software engineering, demonstrating how generative AI can augment traditional usability engineering by enabling scalable, context-aware analysis of user sentiment throughout the design, testing, and maintenance phases.
The remainder of this paper is structured as follows:
Section 2 reviews related work on usability analysis and the application of large language models (LLMs) in software engineering.
Section 3 outlines the methodology for developing the proposed LLM-based framework, including dataset collection, manual annotation, prompting, and fine-tuning strategies.
Section 4 introduces the Review Analyzer tool, which operationalizes the framework in a web-based environment.
Section 5 presents the experimental results, comparing the performance of GPT-4, Gemini, and BLOOM, and discusses the findings and limitations.
Section 6 outlines the technical and methodological challenges encountered. Finally,
Section 7 concludes the paper and highlights future research directions for advancing AI-driven usability engineering.
2. Related Work
The analysis of user reviews for usability insights has evolved into two primary paradigms: non-tuning approaches and tuning-based approaches. Non-tuning approaches rely on leveraging pre-trained models without further adaptation to specific datasets, while tuning-based approaches involve the fine-tuning of such models to improve performance in context-specific tasks [6]. This section reviews both paradigms, highlighting their strengths, limitations, and relevance to our proposed framework.
2.1. Non-Tuning Paradigm
LLMs have demonstrated remarkable zero-shot and few-shot capabilities across a wide range of previously unseen tasks [7]. As a result, recent studies suggest that LLMs possess inherent recommendation capabilities that can be activated through carefully crafted prompts. These studies employ contemporary paradigms such as instruction learning and in-context learning to adapt LLMs for recommendation tasks without the need for parameter tuning. Depending on whether the prompt includes demonstration examples, research within this paradigm can be broadly classified into two main categories: prompting and in-context learning. The following subsection provides a detailed review of the prompting-based approaches under this paradigm.
Prompting
This line of research explores how carefully designed instructions and prompts can improve LLMs’ ability to perform recommendation tasks. Liu et al. [8] evaluated ChatGPT on five common tasks: rating prediction, sequential recommendation, direct recommendation, explanation generation, and review summarization. Similarly, Dai et al. [9] examined ChatGPT’s performance on three information retrieval tasks—point-wise, pair-wise, and list-wise ranking—using task-specific prompts and role instructions (e.g., “You are a news recommendation system”).
To test prompt effectiveness, Sanner et al. [10] created three template types: one based on item attributes, one on user preferences, and one combining both. Their results showed that zero-shot and few-shot prompting can perform competitively with traditional collaborative filtering, especially in cold-start cases. Other researchers have developed task-specific strategies. For example, Sileo et al. [11] built movie recommendation prompts from GPT-2’s pretraining corpus, while Hou et al. [12] proposed recency-focused sequential prompting to capture temporal patterns and bootstrapping to reduce ranking bias.
Since LLMs have token limits that make it difficult to handle long candidate lists, Sun et al. [13] introduced a sliding window technique to iteratively rank smaller segments. Yang et al. [14] developed a sequence-residual prompt that improves the interpretability of sequential recommenders by making recommendations clearer.
Beyond using LLMs as recommenders, several works explored their use for enriching model features in traditional systems. Yang et al. [15] and Lyu et al. [16] used prompting for content augmentation, improving item-side representations. On the user side, Wang and Lim [17] generated user preference descriptions, while Wang et al. [18] built reasoning graphs with ChatGPT and a knowledge base to enrich user representations.
Wei et al. [19] and Ren et al. [20] created textual features for users and items using ChatGPT, enhancing ID-based learning. Likewise, Liu et al. [21] proposed GENRE, which uses three prompting techniques to improve news recommendations: refining titles, extracting keywords from user reading histories, and generating synthetic articles to simulate user interactions.
Recommendation systems typically comprise ranking models, content repositories, and candidate retrieval modules. Some studies position LLMs as controllers across this pipeline. Frameworks such as Gao et al.’s ChatREC [22], Shu et al.’s RAH [23], Shi et al.’s BiLLP [24], and Huang et al.’s InteRecAgent [25] facilitate interactive recommendations through multi-turn conversations, enabling dynamic tool integration and user support, as further discussed by Huang et al. [26] in the context of trust and satisfaction.
Personalized recommendation services are also advancing through LLMs. Zhang et al. [27] explored agent–user–item collaboration, while Petrov et al. [28] developed GeneRec to determine whether to recommend existing or newly generated items. Simulation-based recommender environments have also emerged, such as RecAgent [29] and Agent4Rec [30], which model real-world user behaviors and enable reinforcement learning and feedback tracking.
As an alternative to using LLMs as system controllers, Zhang et al. [31] proposed UniLLMRec, an end-to-end recommendation framework that connects recall, ranking, and re-ranking using natural language prompts. This method utilizes the zero-shot power of LLMs, offering a practical and low-cost solution for real-world recommendation problems.
In summary, this line of research emphasizes the design of instructional prompts that allow LLMs to perform recommendation tasks effectively. Studies such as those by Wang and Lim [17] and Dai et al. [9] introduced tailored prompt structures and role instructions to improve task alignment and domain adaptation. Others, like Sanner et al. [10], focused on template evaluation and cold-start performance. Innovative strategies—like sliding window prompts and prompt-based feature generation—demonstrate the expanding role of LLMs in both standalone and hybrid recommender architectures. Frameworks like ChatREC and GeneRec showcase how interactive and generative paradigms are shaping the next generation of LLM-driven recommendation systems.
As evidenced by Table 1, the majority of prompting-based studies succeed in generating basic recommendations but fall short in delivering complete, correct, and satisfying outputs for end users. Most models achieve only partial effectiveness, particularly lacking in handling negative app reviews and meeting usability standards such as completeness and user satisfaction. While frameworks like BiLLP and Rec4Agentverse demonstrate isolated strengths (e.g., in satisfaction), they still do not provide holistic usability support. In contrast, our proposed approach—leveraging GPT, Gemini, and BLOOM—demonstrates a consistent and comprehensive usability advantage across all evaluated dimensions. This suggests that the design of our prompting strategy more effectively aligns with real-world user needs and highlights the importance of usability-centric evaluation in recommendation research.
2.2. Tuning Paradigm
As previously noted, LLMs demonstrate strong zero-shot and few-shot capabilities, enabling them to generate recommendations that significantly surpass random guessing when guided by well-designed prompts. However, it is not surprising that such recommendation systems may still fall short of the performance achieved by models explicitly trained on task-specific datasets. Consequently, a growing body of research focuses on enhancing the recommendation performance of LLMs through tuning techniques. These tuning approaches can be broadly categorized into two main types: fine-tuning and prompt tuning.
In fine-tuning, both discriminative and generative LLMs are typically used as encoders to extract user or item representations. The model parameters are then updated using task-specific loss functions to optimize performance for downstream recommendation tasks.
Fine-Tuning
Fine-tuning refers to the additional training of a pre-trained model on a task-specific dataset to tailor its capabilities to a particular domain. This process adjusts the model’s internal parameters, enhancing its performance on new data, especially in cases where pre-trained knowledge alone is insufficient. The “pre-train, fine-tune” paradigm has gained traction in recommendation systems for several reasons: it improves initialization for better generalization, accelerates convergence, and introduces universal knowledge that benefits downstream tasks. Additionally, fine-tuning helps mitigate overfitting in low-resource environments by serving as a form of regularization [32].
The holistic model approach pre-trains an LLM on large datasets and fine-tunes all parameters using varied downstream data sources. Kang et al. [33], for example, pre-trained a GPT model on segmented API code and fine-tuned it using code from another library to support cross-library recommendations. Similarly, Wang et al. [34] fine-tuned DialoGPT on domain-specific conversations for recommendation dialogues, while Xiao et al. [35] adapted a transformer model to learn user-news embeddings. Their results showed that tuning the entire model yielded better performance, underscoring the trade-off between accuracy and training cost.
In contrast, the partial model approach aims to fine-tune only a subset of the model’s parameters, offering a balanced solution between efficiency and effectiveness. Many Language Model Recommendation Systems (LMRSs) adopt this strategy. For instance, Hou et al. [30] applied a linear transformation layer to correct item embeddings across multiple domains, addressing domain-specific biases. They also proposed contrastive multi-task objectives to align domain patterns. Their findings suggest that fine-tuning a small fraction of parameters enables rapid adaptation, including in cold-start scenarios.
Further contributions include GPTRec [28], a generative sequential recommender built on GPT-2 that employs SVD tokenization for memory efficiency and supports flexible next-K predictions, distinguishing it from the discriminative BERT4Rec. Another study [36] formatted user–item interactions as prompts for rating prediction and found that FLAN-T5-XXL, when fine-tuned, outperformed other models. Moreover, [37] investigated GPT-3’s collaborative filtering performance, while [21] leveraged ChatGPT to enhance item information before transitioning to LLaMA for feature embedding. Lastly, [21] explored embedding-based fine-tuning for recommendation, finding that while LLMs can improve accuracy, traditional ID-based collaborative filtering still performs strongly in warm-start settings.
In summary, the “pre-train, fine-tune” paradigm is becoming increasingly influential in recommendation systems due to its strengths in generalization, convergence, and adaptability. Holistic models optimize the full parameter set for cross-domain recommendation, while partial models selectively update parameters to reduce computational cost and enhance domain robustness. Cutting-edge systems like GPTRec exemplify the use of generative strategies, while broader comparative analyses highlight the relative strengths of both LLM-based and traditional methods.
Table 2 summarizes the usability evaluation of these approaches in recommendation tasks.
As shown in Table 2, most fine-tuned language models demonstrate limited usability performance, typically excelling in recommendation generation while failing to meet criteria such as completeness, correctness, and user satisfaction. Notably, our proposed approach using GPT, Gemini, and BLOOM outperforms all prior works by achieving full usability across all evaluation metrics. This highlights the effectiveness of our fine-tuning strategy in addressing the usability gaps commonly found in existing LLM-based recommendation models.
3. Methodology
This study aims to evaluate the usability metrics of LLMs in generating actionable usability recommendations from negative app reviews. The proposed methodology consists of five key stages. First, negative user reviews are collected and annotated based on usability issues related to three core factors: correctness, completeness, and satisfaction. Second, the annotated reviews are analyzed to identify the causes of usability issues using both prompt-based and fine-tuned LLM approaches. Third, the system generates usability improvement recommendations to assist developers in addressing the identified issues across the specified factors. Fourth, the quality and relevance of the generated recommendations are assessed using standard automatic evaluation metrics alongside expert human validation. Finally, an integrated tool is developed to streamline the entire process within a unified interface.
Figure 1 presents the overall framework of the methodology.
3.1. Negative Reviews Collection and Annotation
In the authors’ previous study [38], the Instagram App Reviews dataset from Kaggle was utilized to develop an intelligent aspect-based sentiment analysis model capable of classifying user reviews by sentiment polarity (positive or negative) across three core usability factors: correctness, completeness, and satisfaction. Building on that work, the present study utilizes the same model to extract negative user reviews specifically linked to these usability factors, resulting in a dataset of 309 negative reviews.
Annotation of Negative Reviews by Identifying Usability Issue Causes and Corresponding Improvement Recommendations
To ensure a reliable evaluation of the usability issue causes identified by LLMs and the corresponding recommendations they generate, the negative app reviews were manually annotated by software engineering experts. These experts identified the root causes of usability problems and mapped them to appropriate, actionable recommendations. The resulting expert annotations serve as ground truth references against which the outputs of both prompt-based and fine-tuned LLMs can be systematically compared.
Three software engineering experts systematically analyzed each negative review to identify the root causes of user dissatisfaction and map them to the relevant usability factors—correctness, completeness, or satisfaction. Based on these identified causes, the experts formulated actionable recommendations aimed at effectively resolving the underlying usability issues. To ensure a consistent and systematic annotation of negative reviews, the following guidelines were provided to the experts. First, the experts were given the usability factor definitions and examples presented in Table 3.
Second, each expert was instructed to follow these steps independently:
Carefully read the entire review to understand the user’s complaint in context.
Identify the root cause of the issue—whether it reflects a lack of functionality (negative completeness), functional failure (negative correctness), or a poor user experience (negative satisfaction).
Formulate a clear, actionable recommendation that a developer or designer could feasibly implement to resolve the issue.
Avoid vague, non-specific suggestions and ensure all recommendations are user-centered and aligned with usability best practices.
Third, to ensure annotation reliability:
All reviews were independently analyzed by the three experts.
In cases of disagreement, annotations were discussed and resolved through consensus.
Recommendations were reviewed to ensure they are aligned with user-centered design principles and actionable from a development perspective.
Table 4 presents annotation examples for each negative usability factor, along with the identified cause of the issue and the corresponding recommendation to address it.
3.2. Prompting and Fine-Tuning LLMs to Identify the Causes of Usability-Related Issues
Three distinct LLMs—GPT-4, Gemini, and BLOOM—were deployed to identify the underlying causes of usability issues from negative reviews. Each model offers unique technical advantages. GPT, developed by OpenAI, is recognized for its advanced language understanding and ability to generate coherent, contextually relevant responses. Gemini, introduced by Google DeepMind, excels in long-context reasoning and multimodal integration, enabling more nuanced interpretation of user feedback. BLOOM, developed by the BigScience collaboration and distributed through Hugging Face, stands out for its open-source nature and strong multilingual capabilities, making it well suited for research and diverse language support.
3.2.1. Prompting Approach
Two prompting strategies were employed: Zero-Shot Prompting and Few-Shot Prompting. Table 5 presents both approaches, which aim to guide the LLMs in identifying the causes of usability issues from negative reviews [7].
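To illustrate how such prompts can be issued programmatically, the following minimal sketch shows one zero-shot and one few-shot prompt sent through the OpenAI Python SDK. The prompt wording, example reviews, and model identifier are illustrative assumptions rather than the exact templates of Table 5, and Gemini and BLOOM would be accessed through their respective interfaces.

```python
# Minimal sketch: zero-shot vs. few-shot prompting for usability-cause identification.
# Prompt wording and examples are illustrative, not the exact Table 5 templates.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ZERO_SHOT = (
    "You are a usability analyst. Identify the root cause of the usability issue "
    "in the negative app review below and classify it as correctness, completeness, "
    "or satisfaction.\n\nReview: {review}"
)

FEW_SHOT = (
    "You are a usability analyst.\n"
    "Review: 'The app crashes when I upload a photo.'\n"
    "Cause: photo upload fails (negative correctness)\n"
    "Review: 'There is still no dark mode.'\n"
    "Cause: missing dark-mode feature (negative completeness)\n"
    "Analyze the next review in the same format.\n"
    "Review: {review}"
)

def identify_cause(review: str, few_shot: bool = False) -> str:
    prompt = (FEW_SHOT if few_shot else ZERO_SHOT).format(review=review)
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model name; Gemini and BLOOM use their own APIs
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for reproducible analysis
    )
    return response.choices[0].message.content

print(identify_cause("The app takes too long to load my feed.", few_shot=True))
```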
3.2.2. Fine-Tuning Approach
The three LLMs (GPT-4, Gemini, and BLOOM) were fine-tuned to accurately identify the root causes of usability issues in negative app reviews. The fine-tuning process involved further training each model on the negative-review dataset annotated by software engineering experts, in which each review was labeled with the underlying cause of its usability issue.
The fine-tuning process begins by installing essential libraries and importing dependencies for data preprocessing, model training, and evaluation [30]. A CSV file containing negative reviews and expert-annotated causes of usability issues is then uploaded, cleaned, and prepared for training. Next, appropriate hyperparameters are configured, including 10 training epochs, a batch size of 32, a learning rate of 1 × 10⁻⁵, and the AdamW optimizer [34]. The model’s tokenizer is then used to encode the input–output pairs for training. The LLM is fine-tuned using the Hugging Face Transformers framework on the annotated dataset, with model weights saved at each epoch for checkpointing [33]. Once fine-tuning is complete, the trained model is employed to identify the causes of usability issues from negative reviews. Finally, the model’s outputs are evaluated using standard metrics and validated by domain experts to assess accuracy and reliability.
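As a concrete illustration of these steps, the sketch below applies the described recipe to the open-source model (BLOOM) with the Hugging Face Transformers Trainer, using the hyperparameters reported above. The checkpoint size, file name, and column names are assumptions; GPT-4 and Gemini are fine-tuned through their vendors’ managed services rather than a script of this form.

```python
# Minimal fine-tuning sketch for the open-source path (BLOOM) with Hugging Face
# Transformers. File and column names, and the checkpoint size, are assumptions.
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "bigscience/bloom-560m"  # assumed BLOOM checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Expert-annotated data: one row per negative review with its labelled cause.
df = pd.read_csv("annotated_negative_reviews.csv")  # columns: review, cause
df["text"] = "Review: " + df["review"] + "\nCause: " + df["cause"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

train_data = Dataset.from_pandas(df[["text"]]).map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bloom-usability-causes",
    num_train_epochs=10,              # hyperparameters reported in the paper
    per_device_train_batch_size=32,
    learning_rate=1e-5,
    optim="adamw_torch",              # AdamW optimizer
    save_strategy="epoch",            # checkpoint the weights at every epoch
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```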
3.3. Prompting and Fine-Tuning LLMs to Generate Actionable Recommendations
The same three LLMs—GPT-4, Gemini, and BLOOM—used to identify the causes of usability issues were also employed to generate actionable recommendations aimed at addressing the identified causes.
3.3.1. Prompting Approach
Similarly, two prompting strategies—Zero-Shot and Few-Shot Prompting—were employed with the three LLMs: GPT-4, Gemini, and BLOOM. As shown in Table 6, these prompts were carefully designed to guide the models in generating clear, developer-relevant recommendations based on the identified causes of usability issues related to the core factors of correctness, completeness, and satisfaction.
3.3.2. Fine-Tuning Approach
The fine-tuning process begins by installing the necessary libraries and importing dependencies required for data preprocessing, model training, and evaluation [26]. A CSV file containing negative app reviews and corresponding expert-authored recommendations is uploaded, cleaned, and formatted for training. The models are then fine-tuned using a standard supervised learning approach, with hyperparameters set to 10 training epochs, a batch size of 32, a learning rate of 1 × 10⁻⁵, and the AdamW optimizer. Each model’s native tokenizer is used to encode the data, ensuring compatibility and preserving semantic integrity. Once fine-tuned, the LLMs are employed to generate usability improvement recommendations for new, previously unseen reviews. These outputs are evaluated through a combination of automated metrics and expert human validation to assess their clarity, practicality, and relevance to the identified usability issues.
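Once a model has been fine-tuned on cause–recommendation pairs, generating a recommendation for an unseen review reduces to conditioning the model on a formatted prompt. The sketch below shows this inference step under the same assumptions as the training sketch above; the checkpoint path and prompt format are illustrative.

```python
# Inference sketch with a fine-tuned open-source checkpoint; the path and the
# prompt format are assumptions mirroring the training sketch above.
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINT = "bloom-usability-recommendations"  # assumed local output directory
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForCausalLM.from_pretrained(CHECKPOINT)

def recommend(review: str, cause: str) -> str:
    prompt = f"Review: {review}\nCause: {cause}\nRecommendation:"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=80, do_sample=False)
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return text[len(prompt):].strip()  # keep only the newly generated part

print(recommend("The app crashes when I open my messages.",
                "Message screen failure (negative correctness)"))
```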
3.4. Results Evaluation Strategy
The recommendations generated by LLMs were evaluated using a combination of automated metrics and expert human validation to assess their quality, clarity, and relevance.
3.4.1. Automated Evaluation Using Standard Metrics
To evaluate the quality of the outputs generated by the LLMs, three widely adopted metrics were used: BLEU, ROUGE, and METEOR. These metrics provide a robust framework for evaluating lexical similarity, grammatical quality, and semantic adequacy of the generated usability recommendations.
1. BLEU Score (Bilingual Evaluation Understudy): Measures the n-gram precision overlap between the generated and reference texts, applying a brevity penalty to discourage overly short outputs.
2. ROUGE Score (Recall-Oriented Understudy for Gisting Evaluation): Evaluates the overlap of n-grams and sequences between the generated and reference texts, particularly useful for summarization-style outputs.
3. METEOR Score (Metric for Evaluation of Translation with Explicit Ordering): Considers both precision and recall, while incorporating synonymy and stemming to evaluate semantic similarity.
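The standard formulations of these three metrics are reproduced below for reference; the exact variants used in the study (n-gram orders, weights, and implementation details) are not stated here and may differ.

```latex
% Standard definitions; specific parameter choices are assumptions.
\text{BLEU} = \mathrm{BP}\cdot\exp\!\Big(\textstyle\sum_{n=1}^{N} w_n \log p_n\Big),
\qquad \mathrm{BP} = \min\!\big(1,\; e^{\,1-r/c}\big)

\text{ROUGE-N} =
\frac{\sum_{S\in\mathrm{References}}\;\sum_{\mathrm{gram}_n\in S}\mathrm{Count_{match}}(\mathrm{gram}_n)}
     {\sum_{S\in\mathrm{References}}\;\sum_{\mathrm{gram}_n\in S}\mathrm{Count}(\mathrm{gram}_n)}

\text{METEOR} = F_{\mathrm{mean}}\,(1-\mathrm{Penalty}),\quad
F_{\mathrm{mean}} = \frac{10\,P\,R}{R+9P},\quad
\mathrm{Penalty} = 0.5\left(\frac{\#\mathrm{chunks}}{\#\mathrm{matched\ unigrams}}\right)^{3}
```

Here, p_n is the modified n-gram precision with weights w_n, r and c are the reference and candidate lengths, and P and R denote unigram precision and recall.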
3.4.2. Expert Human Validation
In parallel with automated metrics, a panel of software engineering experts independently reviewed a sample of outputs. Each output was evaluated based on the following qualitative criteria:
Clarity: Is the cause or recommendation clearly articulated and easy to understand?
Usefulness: Does the output offer actionable insight that would be meaningful to a developer?
Relevance: Is the output accurately aligned with the underlying usability factor (correctness, completeness, or satisfaction)?
4. Tool Implementation: The Review Analyzer System
A web-based tool named Review Analyzer was developed using Streamlit to automate the end-to-end process of usability feedback analysis. Designed with a clean and intuitive interface, the tool allows users to input negative app reviews, securely enter their OpenAI API key, and select their preferred language for analysis. A dedicated text box supports direct review entry, with built-in examples to guide formatting, while an “Analyze Review” button initiates real-time processing. The tool then detects negative sentiment, identifies the root causes of usability issues, and generates structured, AI-driven recommendations aligned with the core usability dimensions. By bridging raw user sentiment with expert-like structured feedback, Review Analyzer provides actionable insights that enable software teams to prioritize fixes and enhancements more effectively, thereby strengthening the development feedback loop and improving overall user experience.
Figure 2 illustrates the tool interface and highlights these primary functionalities.
Figure 3 presents an example of a negative user review entered through the interface, demonstrating the format and nature of typical input data.
Figure 4 displays the tool’s output: detecting negative sentiment, extracting root causes of usability issues, and generating structured, actionable recommendations.
This example highlights the tool’s ability to bridge raw user sentiment and structured usability feedback, effectively mimicking expert analysis in real time. The generated suggestions align with core usability dimensions (e.g., completeness and satisfaction), demonstrating the model’s contextual understanding and practical value. Such automated triage enables software teams to prioritize design fixes and enhancements based on actual user pain points, significantly improving the development feedback loop.
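For reference, a minimal Streamlit sketch of the Review Analyzer front end is shown below; the widget labels follow the interface description that follows, while analyze_review() is a hypothetical stub standing in for the sentiment detection, cause identification, and recommendation pipeline of Section 3.

```python
# Minimal Streamlit sketch of the Review Analyzer front end. Widget labels follow
# the interface description; analyze_review() is a hypothetical placeholder.
import streamlit as st

def analyze_review(review: str, api_key: str, language: str) -> dict:
    # Placeholder: the deployed tool would call the prompting / fine-tuned LLMs
    # here to detect negative sentiment, identify the cause, and recommend a fix.
    return {
        "cause": "Frequent crashes and slow loading (negative correctness).",
        "recommendation": "Profile startup and crash logs, then optimize load "
                          "time and fix the crash-prone code paths.",
    }

st.title("Review Analyzer")

language = st.sidebar.selectbox("Language", ["English"], index=0)  # assumed options
api_key = st.sidebar.text_input("Enter your OpenAI API key", type="password")

review = st.text_area(
    "Enter a negative app review",
    placeholder="The app crashes frequently and takes too long to load.",
)

if st.button("Analyze Review") and review and api_key:
    result = analyze_review(review, api_key=api_key, language=language)
    st.subheader("Identified usability issue cause")
    st.write(result["cause"])
    st.subheader("Recommendation")
    st.write(result["recommendation"])
```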
Review Analyzer Interface:
This figure displays the user interface of a web-based application called Review Analyzer, developed by Nahed Al-Saleh and deployed using the Streamlit framework. The purpose of this tool is to analyze user-submitted app reviews and identify potential usability issues.
Step-by-Step Description:
1. Interface Title and Author: The application title, Review Analyzer, is prominently displayed at the top of the interface, with the developer’s name, Nahed Al-Saleh, listed below it to indicate authorship.
2. Language and API Key Inputs: The left sidebar contains a language selection dropdown, defaulted to English, and a field labeled “Enter your OpenAI API key”, indicating that the application relies on OpenAI’s services for review analysis and requires a valid API key to function.
3. Review Input Field: In the center of the interface, a large text input box prompts users to enter a review. A sample review is provided as a placeholder example: “The app crashes frequently and takes too long to load.”
4. Action Button: An “Analyze Review” button initiates real-time processing of the entered review.
5. Footer and Interface Elements: A red heart icon appears in the bottom-right corner, likely indicating an appreciation or support feature typical of Streamlit apps. Standard browser elements, tabs, and file path indicators are visible, suggesting that the app is accessed locally or through a deployed Streamlit link.
Purpose of the Application: The tool transforms negative app reviews into structured usability feedback, identifying the causes of usability issues and generating actionable recommendations for development teams.
5. Experimental Results and Discussion
5.1. Experimental Setup
This section outlines the experimental design and implementation details used to evaluate the effectiveness of large language models (LLMs) in identifying the causes of usability issues and generating actionable recommendations from negative app reviews. The experiment was structured around five key stages: dataset preparation and annotation, LLM-based usability issue identification, recommendation generation, evaluation using automated and expert methods, and tool integration.
The following subsections present the experimental results assessing the usability metrics of three advanced LLMs—GPT, Gemini, and BLOOM—in generating actionable recommendations aligned with the identified causes of usability issues across the core usability factors. The analysis emphasizes key usability dimensions and employs established evaluation metrics—BLEU, ROUGE, and METEOR—to assess the clarity, relevance, and quality of the generated outputs.
5.2. Results of Prompting LLMs
The performance of three LLMs—GPT, Gemini, and BLOOM—in generating usability improvement recommendations using a prompt-based approach was evaluated across three types of usability issues: completeness, correctness, and satisfaction. The results were evaluated using standard metrics: BLEU, ROUGE, and METEOR. These metrics assess lexical similarity (BLEU), content overlap (ROUGE), and semantic alignment and fluency (METEOR) against expert-labeled references.
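Scores of this kind can be reproduced with off-the-shelf implementations. The sketch below uses the Hugging Face evaluate library on a single made-up prediction–reference pair; the exact scoring configuration used in the study is not specified, so library defaults are assumed.

```python
# Hedged sketch: computing BLEU, ROUGE, and METEOR with the `evaluate` library
# (default settings assumed). The prediction/reference pair is a made-up example.
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")  # downloads required nltk resources on first use

predictions = ["Add a recycle bin so users can restore deleted posts."]
references = ["Provide a recycle bin feature that lets users recover deleted posts."]

print("BLEU:   ", bleu.compute(predictions=predictions, references=references)["bleu"])
print("ROUGE-L:", rouge.compute(predictions=predictions, references=references)["rougeL"])
print("METEOR: ", meteor.compute(predictions=predictions, references=references)["meteor"])
```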
The experimental results demonstrate that GPT consistently outperforms Gemini and BLOOM in generating usability recommendations across all three factors—completeness, correctness, and satisfaction. For completeness-related issues, such as missing features (Figure 5A), GPT achieves the highest scores (BLEU 0.62, ROUGE 0.80, METEOR 0.90), indicating strong linguistic and semantic alignment with expert references, while Gemini (0.06, 0.38, 0.37) and BLOOM (0.01, 0.27, 0.22) perform significantly worse. A similar pattern is observed for correctness-related issues (Figure 5B), where GPT again records high scores (BLEU 0.60, ROUGE 0.79, METEOR 0.90), reflecting its stable performance across different usability contexts; Gemini performs slightly better in this category (0.07, 0.39, 0.38) compared to completeness, whereas BLOOM remains weak (0.01, 0.28, 0.23). For satisfaction-related issues, such as interface intuitiveness and performance (Figure 5C), GPT maintains its lead (0.61, 0.80, 0.90), highlighting its ability to interpret user sentiment and address experiential concerns more effectively, while Gemini (0.05, 0.37, 0.36) and BLOOM (0.01, 0.20, 0.16) again lag behind. Overall, these results confirm GPT’s consistent superiority in producing contextually appropriate and actionable recommendations, with Gemini showing moderate capability and BLOOM demonstrating limited effectiveness across all usability dimensions.
Overall, as detailed in Table 7, GPT demonstrated consistently strong performance across all usability dimensions, highlighting its robustness in addressing issues related to missing functionality (completeness), functional errors (correctness), and suboptimal user experience (satisfaction). In contrast, Gemini and BLOOM exhibited minimal variation across these categories, suggesting a limited capacity to differentiate between nuanced usability factors within a prompt-based setting. These findings underscore that while prompt-based LLMs are capable of producing actionable usability recommendations, the choice of model significantly impacts the quality of the outputs. GPT emerges as the most reliable across all metrics and usability dimensions, whereas Gemini and BLOOM may require further optimization, such as advanced prompting techniques or fine-tuning, to achieve comparable performance.
Figure 6 shows sample usability recommendations generated using prompting techniques with the three LLMs (GPT, Gemini, and BLOOM) in response to a given negative user review, shown alongside the corresponding expert-provided recommendation.
5.3. Results of Fine-Tuning LLMs
This section analyzes the performance of the three LLMs—GPT, Gemini, and BLOOM—in generating actionable usability improvement recommendations using a fine-tuning approach based on negative app reviews. The models were evaluated across the three core usability factors (completeness, correctness, and satisfaction) using standard metrics: BLEU, ROUGE, and METEOR.
The results are displayed in Table 8 and further visualized in Figure 7A–C, which highlight model-specific performance across the three usability dimensions—completeness, correctness, and satisfaction—using standard evaluation metrics.
The fine-tuned GPT model consistently outperforms Gemini and BLOOM across all usability dimensions—completeness, correctness, and satisfaction—demonstrating its strong ability to generate accurate and contextually appropriate recommendations. For completeness-related issues (Figure 7A), GPT achieved the highest scores (BLEU 0.64, ROUGE 0.45, METEOR 0.89), reflecting its effectiveness in addressing gaps in application functionality, while Gemini (0.04, 0.15, 0.35) showed limited improvement and BLOOM (0.12, 0.28, 0.26) displayed moderate gains from fine-tuning. For correctness-related concerns (Figure 7B), GPT again led with scores of 0.61, 0.44, and 0.88, confirming its capacity to interpret technical problems and provide reliable solutions, whereas Gemini (0.04, 0.17, 0.36) and BLOOM (0.12, 0.28, 0.25) remained comparatively weak. Similarly, for satisfaction-related issues (Figure 7C), GPT maintained top performance (BLEU 0.64, ROUGE 0.45, METEOR 0.88), effectively capturing user experience and emotional aspects, while Gemini and BLOOM showed only modest improvements with performance patterns similar to earlier categories. Overall, these results confirm GPT’s robustness and adaptability when fine-tuned, while underscoring the need for further optimization of Gemini and BLOOM to address the full complexity of usability-related feedback.
Overall, the results presented in Table 8 clearly indicate that the fine-tuned GPT model significantly outperforms both Gemini and BLOOM in generating high-quality, actionable usability recommendations across all core usability factors. GPT’s consistent scores across BLEU, ROUGE, and METEOR suggest it effectively understands and responds to the causes of user dissatisfaction. Meanwhile, even after fine-tuning, Gemini and BLOOM require further optimization or access to larger, domain-specific datasets to achieve performance levels comparable to GPT.
Figure 8 presents sample recommendations generated by the three fine-tuned LLMs (GPT, Gemini, and BLOOM) in response to a given negative user review, alongside the corresponding recommendation provided by a human expert to enhance usability.
This sample reveals several key insights. GPT-4’s output closely mirrors the expert recommendation, demonstrating high contextual alignment and completeness. In contrast, Gemini’s recommendation lacks specificity, while BLOOM’s output shows partial relevance but omits key usability aspects. The side-by-side analysis emphasizes the advantage of fine-tuning in enhancing model accuracy, while also underscoring the continued value of human oversight. Ultimately, such comparisons validate the feasibility of integrating LLMs as assistive agents in usability review workflows—particularly when models are properly adapted to domain-specific tasks.
5.4. Comparison Between Prompting and Fine-Tuning Approaches
GPT-4 consistently delivered strong results across both approaches. In the prompting setting, it achieved high METEOR scores (0.90) and ROUGE scores up to 0.80. Fine-tuning slightly improved BLEU (from 0.62 to 0.64) but led to reduced ROUGE scores, suggesting a trade-off between fluency and lexical overlap. Gemini showed weak performance in both settings, with minimal improvement after fine-tuning, indicating limited sensitivity to usability context. BLOOM, while weak under prompting (BLEU: 0.01), benefited from fine-tuning, achieving BLEU scores of 0.12 and modest gains in METEOR and ROUGE.
In summary, GPT-4 excels in both prompting and fine-tuning, while BLOOM improves notably with fine-tuning. Prompting is efficient for high-performing models, but fine-tuning offers better alignment for less capable LLMs.
To further clarify the trade-offs between prompting and fine-tuning, Table 9 summarizes their key strengths, limitations, and recommended usage scenarios based on the experimental results and practical implications of this study.
This comparative view suggests that prompting is ideal for high-performing models and scenarios requiring rapid deployment, while fine-tuning is more effective when working with lower-tier or open-source models and where task-specific adaptation is essential. Software teams can leverage this insight to balance accuracy, cost, and speed when integrating LLMs into usability engineering workflows.
5.5. Discussion
While prior studies such as Dai et al. [9], Sanner et al. [10], and Wang and Lim [17] have demonstrated the potential of large language models (LLMs) like ChatGPT in recommendation tasks using zero-shot or few-shot prompting techniques, their focus has primarily been on traditional domains such as product and content recommendation. Dai et al. [9] assessed ChatGPT across ranking-based recommendation paradigms using prompt engineering and role-based context, while Sanner et al. [10] showed that LLMs can compete with cold-start recommenders through template-based prompting for item and language preferences. Similarly, Wang and Lim [17] explored zero-shot next-item recommendation by repurposing pre-trained LLMs to predict future user interactions without fine-tuning. In contrast, our study targets the software usability domain and goes beyond prompting by incorporating expert-annotated datasets and fine-tuning three distinct LLMs—GPT-4, Gemini, and BLOOM—to identify the causes of usability issues in negative app reviews and generate actionable, developer-oriented recommendations. Furthermore, our work contributes a deployable system and validates output quality through both automatic metrics and expert human evaluation, offering a more application-driven and domain-specific approach than the more exploratory or generalized methodologies seen in prior work.
The results of this study provide clear evidence that LLMs, particularly GPT-4, can play a transformative role in usability engineering. The ability to accurately interpret and generate responses from negative app reviews suggests a new paradigm in AI-assisted software quality assurance. GPT-4’s strong performance across completeness, correctness, and satisfaction demonstrates its capacity to generate actionable, contextually sound recommendations that align with developer expectations and user-centered design.
These findings also contribute to a growing body of research that supports the integration of LLMs into human-in-the-loop software engineering processes. Notably, the success of prompting-based approaches with high-performing models like GPT-4 suggests that lightweight integration of LLMs can provide value without the need for extensive fine-tuning, which is resource-intensive. Meanwhile, for models like BLOOM and Gemini, fine-tuning considerably improved their output quality, emphasizing its importance for lower-tier or general-purpose models.
From a software engineering perspective, this framework aligns with principles of continuous feedback, agile development, and usability testing. LLMs can augment traditional usability evaluation by providing scalable, automated interpretation of user sentiment—particularly valuable in early prototyping or post-deployment stages. Furthermore, the integration of this framework into a web-based tool bridges the gap between research and real-world practice, offering software teams immediate insight into usability bottlenecks.
However, it is critical to recognize that LLMs are not yet replacements for human usability experts. Their outputs can be inconsistent without domain adaptation and expert verification. As such, their optimal use is as intelligent assistants that augment, rather than replace, human judgment in the software development lifecycle. While LLMs significantly streamline the analysis of negative user reviews and the generation of usability recommendations, their integration should follow a human-in-the-loop (HITL) approach. In this paradigm, the AI system serves as a first-pass analyzer, transforming large volumes of unstructured feedback into structured, actionable suggestions. These suggestions are then reviewed, validated, and refined by usability experts, product managers, or designers before implementation. This process ensures the following:
Contextual Relevance: Human experts can assess whether a recommendation fits the app’s goals, user base, and design constraints.
Feasibility Check: Not all AI-generated suggestions are technically or practically feasible; expert review ensures realistic decision-making.
Bias Mitigation: Human oversight helps identify and correct possible biases or misinterpretations produced by the model.
Iterative Improvement: Feedback from experts can be used to fine-tune the model or refine prompting strategies over time.
Therefore, LLMs should not be seen as autonomous decision-makers but rather as collaborative partners that augment human judgment. This hybrid workflow balances efficiency and accuracy, supporting more robust, ethical, and user-centered software development practices.
These insights pave the way for future research into hybrid AI–human usability workflows and highlight the importance of maintaining expert involvement in AI-assisted design processes.
5.6. Limitations
This study presents several limitations that should be acknowledged. First, the reliance on automated evaluation metrics such as BLEU, ROUGE, and METEOR, while useful for assessing lexical and semantic alignment, may not fully capture the contextual relevance and practical feasibility of the generated recommendations. Human validation is essential for ensuring the quality of outputs. Additionally, the manual annotation process for creating the dataset demanded significant time and expertise from domain specialists, which may pose challenges for scalability and rapid adoption in different contexts.
Moreover, while GPT-4 demonstrated consistent performance, the sensitivity of models like BLOOM and Gemini to prompt formulation and training data quality limits their reproducibility and generalization across varied usability contexts. Fine-tuning these large-scale language models requires substantial computational resources, which may not be accessible to smaller research teams or software developers, further complicating implementation.
These limitations underscore the need for further research and development to enhance model robustness, expand the applicability of the framework, and incorporate more comprehensive evaluation methods to ensure practical usability in diverse scenarios.
6. Challenges
Despite the strong performance of GPT-4 and the feasibility of the proposed framework, several technical, practical, and methodological challenges remain. These challenges encompass data-related limitations, model usability concerns, and broader issues of scalability and generalization, as follows:
Manual Annotation Burden: Creating a high-quality annotated dataset demanded considerable time and expertise from domain specialists, posing challenges to scalability and rapid adoption.
Model Sensitivity: While GPT-4 delivered consistently strong results, BLOOM and Gemini exhibited greater sensitivity to prompt formulation and training data quality, limiting reproducibility and generalizability.
Computational Demands: Fine-tuning large-scale language models required substantial computational resources, including high-end GPUs, which may not be accessible to smaller research groups or software development teams.
Prompt Engineering Expertise: Designing effective prompts—especially in zero-shot and few-shot settings—requires in-depth understanding of model behavior and natural language generation, making it challenging for non-expert users to achieve reliable results.
Evaluation Limitations: While automated metrics such as BLEU, ROUGE, and METEOR provide valuable insights into lexical and semantic alignment, they fall short in capturing contextual relevance and implementation feasibility. Human validation remains indispensable.
Limited Tool Generalization: The current tool implementation focuses solely on mobile app reviews. Adapting it to other domains or platforms would require retraining models and redesigning certain interface and workflow components.
7. Conclusions
This study proposed a large language model (LLM)-based framework that leverages GPT-4, Gemini, and BLOOM to automatically extract actionable usability recommendations from negative app reviews. By employing both prompting (zero-shot and few-shot) and fine-tuning strategies, the framework systematically maps unstructured user feedback to three key usability dimensions: correctness, completeness, and satisfaction. Using a manually annotated dataset of 309 Instagram reviews, the results demonstrated that fine-tuned models significantly outperformed prompt-based methods in terms of accuracy, clarity, and relevance. Among the models evaluated, GPT-4 consistently achieved the highest performance across BLEU (0.64), ROUGE (0.80), and METEOR (0.90) scores, confirming its superior capacity for contextual understanding and recommendation generation. In addition to quantitative evaluations, expert human assessments validated the clarity and usefulness of the outputs, particularly for developer-oriented usability improvements. As a practical contribution, the study introduced a web-based Review Analyzer Tool that integrates the fine-tuned models to facilitate real-time analysis of negative reviews. Overall, the findings underscore the potential of fine-tuned LLMs—especially GPT-4—as intelligent, scalable tools for enhancing usability evaluation and supporting user-centered software design.
Building on the promising outcomes of this study, several directions can be pursued to further enhance the proposed framework:
Expanding the framework to encompass more diverse applications beyond mobile apps could provide broader insights into user experience across various digital platforms.
Exploring more advanced and hybrid LLM architectures.
Enhancing fine-tuning strategies by incorporating advanced techniques such as Reinforcement Learning from Human Feedback (RLHF).