Article

Evaluating the Usability and Ethical Implications of Graphical User Interfaces in Generative AI Systems

Commonwealth Scientific and Industrial Research Organisation (CSIRO), Data61, Clayton, Melbourne, VIC 3168, Australia
* Author to whom correspondence should be addressed.
Computers 2025, 14(10), 418; https://doi.org/10.3390/computers14100418
Submission received: 24 August 2025 / Revised: 21 September 2025 / Accepted: 26 September 2025 / Published: 2 October 2025

Abstract

The rapid development of generative artificial intelligence (GenAI) has revolutionized how individuals and organizations interact with technology. These systems, ranging from conversational agents to creative tools, are increasingly embedded in daily life. However, their effectiveness relies heavily on the usability of their graphical user interfaces (GUIs), which serve as the primary medium for user interaction. Moreover, the design of these interfaces must align with ethical principles such as transparency, fairness, and user autonomy to ensure responsible usage. This study evaluates the usability of the GUIs of three widely used GenAI applications, namely ChatGPT (GPT-4), Gemini (1.5), and Claude (3.5 Sonnet), using a heuristics-based and user-based testing approach (an experimental-qualitative investigation). A total of 12 participants from a research organization in Australia took part in structured usability evaluations, applying 14 usability heuristics to identify key issues and ethical concerns. The results indicate that Claude’s GUI is the most usable of the three, particularly due to its clean and minimalistic design. However, all applications demonstrated specific usability issues, such as insufficient error prevention, lack of shortcuts, and limited customization options, affecting the efficiency and effectiveness of user interactions. Despite these challenges, each application exhibited unique strengths, suggesting that, while functional, these systems require significant enhancements to fully support user satisfaction and ethical usage. The insights of this study can guide organizations in designing GenAI systems that are not only user-friendly but also ethically sound.

1. Introduction

Generative AI (GenAI) technology represents a significant leap forward in the field of artificial intelligence, enabling machines to create text, images, and other forms of media autonomously [1]. GenAI applications, e.g., ChatGPT [2], Google Gemini [3], and Claude [4], are widely used across various domains, including education [5], research [6], and healthcare [7]. These applications, built on machine learning and deep learning models, are being integrated into a wide array of tasks, from creative content generation to automating technical processes. Despite the significant advancements in the performance and capabilities of GenAI systems, there remains a critical need to assess the usability of their graphical user interfaces (GUIs) [8,9], as these interfaces play a crucial role in ensuring accessibility, user satisfaction, and ethical adherence. GenAI systems are increasingly applied across diverse domains, from content creation to decision support, and are becoming integral to daily interactions. However, their GUIs often present usability challenges that can hinder effective and ethical usage. Common issues include complex or cluttered layouts, unintuitive navigation, insufficient error prevention and recovery mechanisms, limited customization options, and inadequate feedback during interactions. Such problems can reduce user efficiency, increase errors, and negatively impact user satisfaction, while also compromising ethical principles such as transparency, fairness, and user autonomy in human–AI interactions [10,11]. Understanding these usability challenges is crucial for designing GenAI applications that are both user-friendly and ethically aligned, motivating the present study to evaluate widely used GenAI GUIs using complementary heuristics-based and user-based methods, which are explained later in this paper.
The graphical user interface is the connection between people and computer systems, combining design and technical features to help users interact with devices and software [8]. A good GUI makes it easy for users to engage with elements like buttons, text, etc., and its main goal is to make products or services simple and efficient to use, improving the overall experience, accessibility, and ease of use [12]. The connection between interfaces and GenAI is important for shaping how users interact with AI-generated content [13]. Poorly designed interfaces for GenAI can negatively impact user experience and the overall effectiveness of AI applications [13].
Evaluating GUI usability is essential for understanding how well users can interact with GenAI applications to achieve their intended goals effectively, efficiently, and with satisfaction. Despite their potential, GenAI applications often face criticism for their opaque nature, complexity, and lack of intuitive interfaces, which can hinder adoption and effective use by novice users [14]. Users value GenAI for its practical benefits and personalization, yet they encounter challenges such as managing extensive text, effective prompting, and limited nuanced understanding [15]. These challenges are often compounded by specific issues such as incorrect answers given with undue confidence, vague responses, misinformation, and repetitive content [16,17]. For advanced systems like GPT-4, the generation of human-like content, such as medical notes containing patients’ information, is often criticized for its lack of transparency in how such information is generated, which can further diminish user satisfaction and impact overall usability [18]. GenAI applications, including Claude, ChatGPT, Microsoft Copilot, and Gemini, offer diverse benefits but face significant GUI-related challenges. These challenges include unlabeled buttons and inadequate navigation support for screen readers [9]. Moreover, blind users encounter additional issues, such as inaccessible output formats, where generated content is often presented in ways that screen readers cannot easily interpret, hindering comprehension. Furthermore, the lack of alternative text for images means that when GenAI tools produce visual content, they frequently omit descriptive text, leaving blind users without essential context. Complex navigation structures further complicate the user experience, as intricate interface layouts are difficult to navigate using screen reader technology, leading to confusion and inefficiency [9]. A confusing or frustrating interface may make it difficult for users to understand and engage with AI-generated content, potentially causing them to lose interest. Additionally, if the interface fails to convey the purpose and meaning of AI outputs, users might misinterpret the information, leading to errors in judgment or decision-making [13]. Such design flaws can also undermine trust in the AI system, raising doubts about the accuracy and reliability of the generated content.
Besides evaluating the usability of the graphical user interfaces of GenAI applications, it is crucial to examine whether the GUI aligns with ethical principles in order to address and reduce the ethical concerns associated with AI applications; for example, a lack of transparency in an interface can obscure how user data are collected and used, raising privacy concerns [19,20,21]. As GenAI applications become more prevalent in both personal and professional settings, there is an urgent need to ensure that these systems are designed with a strong ethical foundation [22,23]. These applications offer numerous benefits, including their ability to assist with complex problem-solving and enhance productivity through automation. However, significant ethical concerns also arise, such as generating inaccurate responses, resulting in user interactions that may not adequately address ethical implications [17].
Although considerable research has been conducted on generative AI technology, there remains a notable gap in integrating GUI aspects with ethical considerations. Existing literature has primarily examined the GUIs of GenAI technology, focusing on how interface factors impact user interaction and satisfaction [8,9,16]. However, these studies often neglect the ethical dimensions associated with these applications. Conversely, research that explores the ethical concerns related to GenAI applications frequently overlooks critical aspects of GUI design and usability [17,22]. Our study aims to bridge this gap by evaluating the usability of the GUIs of emerging GenAI applications specifically through the lens of Australia’s AI ethical principles (https://www.industry.gov.au/publications/australias-artificial-intelligence-ethics-principles/australias-ai-ethics-principles; accessed on 5 March 2025). By combining these dimensions, our research not only assesses how well GenAI applications meet interface standards but also how they align with ethical considerations. This integrated approach provides a more comprehensive understanding of the ethical implications of GenAI applications while addressing the shortcomings of existing research in both areas. Figure 1 illustrates the research gap between GUI-focused and ethics-focused studies in the context of generative AI applications. The blue circle represents prior work that primarily investigates the usability of graphical user interface aspects of GenAI systems [10,11,24], focusing on interaction design and user experience but overlooking ethical dimensions. In contrast, the green circle highlights studies that address ethical concerns such as bias, transparency, and fairness in GenAI [25,26,27], while paying limited attention to interface usability. Our study, shown in the overlapping area, brings these two perspectives together by evaluating GenAI GUIs not only in terms of usability but also with respect to ethical principles such as transparency, fairness, and user autonomy. This integration provides a more holistic understanding of how users interact with GenAI applications and offers guidance for developing systems that are both user-friendly and ethically aligned. The objectives of this research are as follows:
  • Evaluate the usability of the graphical user interfaces (GUIs) of three widely used GenAI applications, i.e., Gemini, Claude, and ChatGPT, through heuristics and user-based testing techniques.
  • Provide insights into the ethical shortcomings of the graphical user interfaces (GUIs) of these applications.
  • Provide useful suggestions to guide companies in designing generative AI applications, focusing on aligning GUI design with ethical principles, ensuring that these aspects meet standards for transparency, fairness, accountability, etc.
Based on this study’s goal, we defined the following research questions:
RQ1: How usable are the graphical user interfaces (GUIs) of selected generative AI applications in supporting efficient, effective, and satisfying user interactions?
RQ2: What are the key ethical shortcomings associated with the usability of the graphical user interface design of generative AI applications?
RQ3: How can the design of GUIs for generative AI applications be improved to align with ethical principles such as transparency, fairness, and user autonomy?
The significance of this research lies in its potential to address critical ethical challenges, including transparency, user autonomy, etc., associated with the usability of the graphical user interface design of generative AI applications. By systematically evaluating the GUI factors of widely used GenAI applications, this study aims to uncover key ethical shortcomings that impact user trust and system effectiveness. While it is possible that, by the time this research is completed, the GUIs of these applications may have evolved with additional features or design changes, the findings remain highly relevant. Understanding and documenting existing usability and ethical concerns provides valuable insights that can inform future design improvements, ensuring that GenAI systems are not only functionally robust but also ethically aligned with principles of fairness, privacy, accountability, etc.
The paper is organized as follows: Section 2 presents the background and related work, while Section 3 outlines the research methodology. Section 4 presents the results and findings, addressing the research questions of the study. Section 5 offers a discussion, and Section 6 presents the limitations and threats to validity. Section 7 concludes the study.

2. Background and Related Work

This section presents a comprehensive review of existing literature on the GUI of GenAI, with a focus on systematically analyzing and synthesizing the current state of knowledge in this area.
Research by Adnin and Das [9] reported findings from interviews with 19 blind individuals who were asked to provide feedback on how the GenAI tools (ChatGPT 3.5 or 4, Google Gemini, Microsoft Copilot, Claude, and the Gen-AI powered image describer Be My AI) work, highlighting their GUI issues and the tools’ overall capabilities and limitations. The study revealed that participants found the buttons for copying, regenerating, and downvoting ChatGPT responses to be unlabeled. Additionally, ChatGPT and Claude did not provide appropriate heading labels, regions, or landmarks for screen reader users to navigate the interface efficiently, nor did they offer shortcuts for jumping between previous and next prompts or responses, leading to GUI usability and accessibility issues for blind individuals. Similarly, Hillmann et al. [28] evaluated the usability of the CHATU chatbot (https://chatu.qu.tu-berlin.de/home; accessed on 7 March 2025) with 21 participants. The evaluation of CHATU revealed that users find the system easy to use and satisfactory in terms of overall experience and navigation. However, despite these positive aspects, the system’s interactions are noted to lack authenticity and originality.
A study by Pinto et al. [29] conducted a meeting session with 62 participants to evaluate their experience using StackSpot AI, a coding AI assistant. The study revealed that StackSpot significantly saves time and facilitates easy access to documentation. However, the use of mixed knowledge sources results in less accurate responses, with frequent inaccurate code suggestions that require multiple interactions and lead to inconsistent outputs, causing user confusion. Overall, while StackSpot is useful from a usability perspective, the issue of response inaccuracy needs to be addressed. Similarly, another study by Oswal and Oswal [8] reported findings from a mixed-method approach, evaluating the interfaces of three GenAI website builder tools, i.e., Dorik.com, Relume.io, and Wix.com, for blind and keyboard-only users. The study employed both manual testing using qualitative techniques and automatic testing with the WAVE tool. The results revealed that these website builders are neither usable nor accessible for this user group, as the basic features of their interfaces are incompatible with adaptive technologies, including keyboards. Furthermore, even if technical accessibility issues were resolved, the platforms would still be unsuitable for screen reader users due to the lack of essential structural design features required for usable and accessible interfaces.
An experimental-qualitative study by Van Es and Nguyen [30] investigated how ChatGPT, using both its GPT-4 and GPT-4o models, portrays and describes itself through prompts to “draw” or “represent” itself. The study analyzed 50 generated images and 58 accompanying text responses via visual semiotic analysis and identified three primary themes: anthropomorphism, futurism, and (social) intelligence. The study revealed that ChatGPT consistently depicted itself as a friendly, human-like assistant, reinforcing familiar yet potentially misleading myths about AI. These representations raise critical concerns about how such anthropomorphic framings might influence public trust and understanding, potentially leading users to overestimate the reliability and human-like capabilities of AI systems. Similarly, based on two semesters of observations and debates across five design studios, a study by Iranmanesh and Lotfabadi [31] examined the integration of generative AI, particularly text-to-image tools, in architectural education. The findings of the study indicated that while these AI tools can enhance creative exploration and offer design versatility, they also present significant drawbacks. Key issues include the risk of bias, a tendency to overlook critical architectural qualities (such as human scale and abstraction), and the generation of designs that sometimes defy the practical constraints of architectonic reality. The study ultimately argued for a balanced framework that leverages the innovative potential of AI without compromising the human-centric, conceptual depth essential to architectural practice.
Several studies have focused primarily on the ethical challenges associated with GenAI applications without focusing on the usability issues of the GUI of GenAI applications. A study by Al-kfairy et al. [32] highlighted that some interfaces fail to adequately disclose the use of copyrighted material in the training or generation processes, which raises legal and ethical questions about intellectual property rights. The lack of clear visual or textual cues in interfaces to inform users about such issues is a critical design flaw from an ethical standpoint.
Research by Alabduljabbar [33] conducted a comprehensive usability evaluation of generative AI applications by analyzing 11,549 user reviews from Apple’s App Store and Google Play, collected between January and March 2024. The study focused on five generative AI apps, i.e., ChatGPT, Bing AI, Microsoft Copilot, Gemini AI, and Da Vinci AI, and identified common usability challenges and user satisfaction levels. The findings revealed that ChatGPT attained the highest composite usability scores among both Android and iOS users, registering scores of 0.504 and 0.462, respectively. In contrast, Gemini AI recorded the lowest score among Android applications at 0.016, while Da Vinci AI had the lowest score among iOS applications at 0.275. Similarly, another study by Mugunthan [34] explored user behaviors and identified usability issues encountered by professionals using AI text-generation tools like ChatGPT. Through eight 90-minute moderated qualitative usability tests, the research uncovered patterns and challenges in user interactions with these tools, offering valuable guidance for improving interface design. The study emphasized the importance of early usability testing to address potential problems and enhance the effectiveness of generative AI applications. While these studies provide valuable insights into the usability of generative AI applications, they primarily focus on interface efficiency, user satisfaction, and functional improvements, without explicitly linking GUI challenges to broader ethical concerns. Issues such as accessibility, fairness, transparency, and user autonomy remain under-explored in these evaluations. There is a clear gap in integrating GUI aspects with ethical considerations. Most studies focus on the usability and design of GenAI tools, exploring their impact on user satisfaction and interaction [33,34]. However, these studies often overlook the ethical challenges associated with these technologies. Our study bridges this gap by not only assessing the usability of GenAI applications but also examining how GUI-related shortcomings contribute to ethical risks. By merging these dimensions, our research not only assesses how well GenAI tools adhere to usability and interface standards but also examines their alignment with key ethical considerations.
Beyond the individual studies reviewed above, prior work in human–computer interaction (HCI) has established a range of methods for evaluating graphical user interfaces that extend beyond the 14 heuristics employed in our study. Common approaches include cognitive walkthroughs, which assess how easily new users can learn a system [35]; think-aloud protocols, where participants verbalize their thoughts while performing tasks to uncover interaction barriers [36]; standardized usability questionnaires such as the System Usability Scale (SUS) [37]; and persona- or scenario-based evaluations that simulate diverse user perspectives [38]. More advanced methods include eye-tracking, keystroke logging, and A/B testing, which allow researchers to capture fine-grained interaction behaviors [39]. These approaches are widely recognized in usability engineering and have also been adapted for evaluating AI-driven and conversational interfaces [40].
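As a concrete illustration of one such instrument, the minimal sketch below scores a SUS questionnaire using the standard formula: odd-numbered (positively worded) items contribute the rating minus 1, even-numbered (negatively worded) items contribute 5 minus the rating, and the summed contributions are scaled by 2.5 to a 0–100 range. The example ratings are hypothetical and are not drawn from our study.

```python
def sus_score(responses):
    """Score a System Usability Scale questionnaire.

    `responses` holds ten ratings on a 1-5 scale. Odd-numbered
    (positively worded) items contribute rating - 1; even-numbered
    (negatively worded) items contribute 5 - rating. The summed
    contributions are multiplied by 2.5, yielding a 0-100 score.
    """
    assert len(responses) == 10, "SUS uses exactly ten items"
    total = sum(
        (r - 1) if i % 2 == 1 else (5 - r)
        for i, r in enumerate(responses, start=1)
    )
    return total * 2.5

# Hypothetical ratings for one participant; 68 is the commonly cited
# average SUS score, so 82.5 would indicate good perceived usability.
print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 2]))  # -> 82.5
```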
For GenAI systems specifically, GUI evaluations require attention not only to efficiency and effectiveness but also to ethical and experiential dimensions such as transparency, fairness, and user autonomy [41]. While heuristic evaluation remains one of the most efficient and widely adopted methods due to its low resource intensity and ability to uncover a broad range of usability problems early [42], it should be complemented by user-based methods to ensure a holistic understanding of usability in high-stakes AI contexts. Accordingly, our study adopts a mixed approach, combining heuristic evaluation with user-based testing to capture both technical and ethical shortcomings of GenAI GUIs, which are explained in the next section.

3. Research Methodology

This research employed expert consultation and user-based testing to rigorously evaluate the GenAI applications in terms of the usability of their graphical user interfaces, with a focus on ethical principles, using 14 usability heuristics drawn from the study by [14]. Heuristics-based evaluation systematically identifies potential usability violations against established principles [14], while user-based testing captures real-world user interactions and subjective experiences [43]. By combining these complementary methods, the study provides a more comprehensive understanding of both the functional and experiential aspects of GenAI GUIs. A combination of qualitative and quantitative approaches was adopted to ensure a comprehensive evaluation of the GUIs of generative AI (GenAI) applications. These approaches were applied to assess the usability and ethical alignment of GenAI interfaces, drawing insights from user-based testing and expert consultations. Each approach is discussed in the following subsections.
Methods at a Glance
  • Participants: 12 research scientists from CSIRO Data61 (Australia), all active users of generative AI applications with expertise in artificial intelligence (AI), ethical AI, and GenAI.
  • Recruitment and Setting: Recruited via email invitation; sessions were held online via Microsoft Teams.
  • Expert Consultation: Two structured meetings with AI and ethics experts guided the selection of GenAI applications.
  • Applications Tested: ChatGPT (GPT-4, March 2025 build), Gemini (1.5, March 2025 build), Claude (3.5 Sonnet, March 2025 build).
  • Tasks: Participants interacted with the assigned GenAI application (one app per group) through 14 structured tasks mapped to usability heuristics. Example tasks included generating a 1000-word blog post, exploring settings and icons, deleting conversations, regenerating responses, and locating help/documentation features.
  • Timings: 60 min per session.
  • Capture Dates: March–April 2025.
  • Issue Identification: Usability issues were identified using 14 Nielsen-inspired heuristics [14].
  • Analysis: Thematic coding and mapping to ethical principles.

3.1. Expert Consultation

Discussions with AI and ethics experts from a research organization in Australia were conducted to gather their perspectives on which GenAI applications should be evaluated. The experts consulted for this study included professionals specializing in GenAI testing, ethical GenAI, AI governance, responsible AI, and accessibility. Their diverse expertise ensured a comprehensive evaluation that considered not only GUI usability aspects but also ethical implications, such as fairness, transparency, and inclusivity. Expert recommendations helped identify the most relevant generative AI applications, ensuring they are widely used and representative of various generative tasks (e.g., text generation). The selection of GenAI applications, based on the experts’ discussions, followed these criteria:
  • Applications should be widely used and well known among the public.
  • Applications should be easily accessible to anyone, either free of charge or through a subscription.
  • Applications should be up to date with the most recent versions. For instance, Bard has been rebranded as Gemini, so Gemini was selected.
Figure 2 illustrates the process of selecting generative AI applications for this study, which was carried out through a series of expert discussions. Two structured meetings were held with AI and ethics experts. The first meeting focused on introducing the research idea and gathering initial feedback on the study’s aim, objectives, and relevance. Experts provided valuable insights into the current landscape of generative AI technologies and their potential applications. In the second meeting, the discussion shifted to various generative AI applications and their relevance to the study’s goals. After thorough deliberation, four applications were shortlisted: ChatGPT, Claude, Bard AI, and Gemini. These applications were selected for their prominence in the field and their ability to represent diverse approaches to generative AI. Subsequently, the predefined selection criteria were applied to assess these applications in greater depth. Based on this, three generative AI applications were finalized for the study: Gemini 1.5, ChatGPT (GPT-4), and Claude 3.5 Sonnet, as shown in Figure 3. These applications were chosen for their advanced capabilities and alignment with the study’s objectives, ensuring a robust analysis of the usability of their GUIs and ethical considerations. This study focused exclusively on the text generation functionality of the selected GenAI applications, as this represents their most widely used capability and the primary interaction mode supported by their graphical user interfaces at the time of evaluation.

3.2. User-Based Testing

This approach was employed to evaluate the usability of the graphical user interfaces of generative AI (GenAI) applications, which were selected through expert consultations, as detailed above. The user-based testing utilized 14 usability heuristics to guide participants in evaluating the GUIs of the selected GenAI applications. These heuristics were drawn from the study by [14] and are defined below:
  • Visibility of system status: The system should always keep users informed about what is going on through appropriate feedback within a reasonable time.
  • Match between system and the real world: Follow real-world conventions, making information appear in a natural and logical order.
  • User control and freedom: Users often choose system functions by mistake and will need a marked “emergency exit” to leave the unwanted state without having to go through an extended dialogue. Support undo and redo.
  • Consistency and standards: Users should not have to wonder whether different words, situations, or actions mean the same thing.
  • Error prevention: Even better than good error messages is a careful design that prevents a problem from occurring in the first place. Either eliminate error-prone conditions or check for them and present users with a confirmation option before committing to the action.
  • Recognition rather than recall: Minimize the user’s memory load by making objects, actions, and options visible. The user should not have to remember information from one part of the dialogue to another. Instructions for the use of the system should be visible or easily retrievable whenever appropriate.
  • Flexibility and efficiency of use: Accelerators—unseen by the user—may often speed up the interaction for the expert user such that the system can cater to both inexperienced and experienced users. Allow users to tailor frequent actions and use the common shortcuts.
  • Aesthetic and minimalist design: Dialogues should not contain information that is irrelevant or rarely needed. Every extra unit of information in a dialogue competes with the relevant units of information and diminishes their relative visibility.
  • Help users recognize, diagnose, and recover from errors: Error messages should be expressed in plain language (no codes), precisely indicate the problem, and constructively suggest a solution concerning the user data.
  • Help and documentation: Even though it is better if the system can be used without documentation, it may be necessary to provide tutorials and a real-time chatbox with experts and concise hints. Any such information should be easy to search, focused on the user’s task, list concrete steps to be carried out, and not be too large.
  • Guidance: The system should guide the user to the next appropriate step based on the problem and the data by suggesting a sequence of actions and providing recommendations.
  • Trustworthiness: The system should demonstrate trustworthiness by protecting user data and being truthful with the user. It should demonstrate to the user how the predictions were made.
  • Adaptation to growth: The system should adapt to user growth by reducing guidance, restrictions, and tips and allowing more customization and extendibility.
  • Context relevance: Display information relevant to the user’s current dataset and problem.

3.2.1. Recruitment and Participants

Participants were recruited through internal organization email invitations sent after obtaining ethics approval. The invitation included a participant information sheet that detailed the project objectives, research aims, how the sessions would be conducted, whether the sessions would be recorded, and how the data would be used. A formal consent form was provided to all participants, and informed consent was obtained before participation.
Twelve participants from a research organization in Australia, who actively use GenAI applications, were recruited for the GUI usability evaluation. The sample included an equal distribution of six male and six female participants. Their current organizational roles included Senior Research Scientists (n = 4), Postdoctoral Researchers (n = 3), Senior Software Engineers (n = 3), and Research Scientists (n = 2). In terms of experience, five participants had 7–10 years in their current role, five had 1–3 years, and two had 4–6 years. This range ensured that both early-career and senior professionals were represented in the study. A summary of participant demographics is presented in Table 1.
The dataset size of 12 participants is consistent with prior usability research standards, which indicate that 5–12 participants are often sufficient to uncover the majority of usability issues [44]. While this provides valuable insights, we acknowledge that a larger sample would capture broader variations in user demographics and expertise. Expanding the participant pool will therefore be a priority in future work.

3.2.2. Study Design and Procedure

In this approach, participants were invited to an online session hosted on Microsoft Teams. The one-hour session began with an introductory segment where participants were briefed on the research objectives and the structure of the user-based testing. This introduction was essential to ensure that all participants understood the purpose of the study and the significance of their feedback in evaluating the GUIs of the GenAI applications.
As it was impractical for each participant to evaluate all three GenAI applications in one hour, the 12 participants were divided into three groups of four individuals each. Each group was assigned one GenAI application to evaluate: Group 1 interacted with Gemini, Group 2 with ChatGPT, and Group 3 with Claude. This division allowed for a focused and detailed evaluation of each application while ensuring an even distribution of effort among participants. Participants were then provided with a structured form containing tasks designed to assess various usability aspects of the GUI. These tasks were crafted to align with the 14 usability heuristics, ensuring a comprehensive evaluation process. The structure of the tasks followed a systematic approach based on the 14 usability heuristics: under each heuristic, its definition was provided, followed by a corresponding task and a list of candidate issues. An example task is: “Open the GenAI application assigned to you and switch between settings, help, and main task areas. Are the layout, colors, and controls consistent across these sections? You can either list the issues below or select from the following options”. Participants were then presented with a list of GUI usability issues, allowing them to select multiple issues they encountered. In cases where a participant experienced an issue not listed, additional space was provided for them to describe the problem, as illustrated in Figure 4. This structured form ensured that tasks were aligned with the 14 usability heuristics, providing a comprehensive evaluation of the GUI’s various usability aspects.
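To make this form layout concrete, the following minimal sketch models one heuristic block as a simple data structure. The field names and candidate-issue strings are illustrative assumptions, not the actual wording of the form in Appendix A.

```python
from dataclasses import dataclass, field

@dataclass
class HeuristicEntry:
    """One block of the structured form: a heuristic, its definition,
    an evaluation task, a checklist of candidate issues, and free-text
    space for issues not covered by the checklist."""
    heuristic: str
    definition: str
    task: str
    candidate_issues: list[str]
    selected_issues: list[str] = field(default_factory=list)
    other_issues: str = ""  # free text for unlisted issues

entry = HeuristicEntry(
    heuristic="Consistency and standards",
    definition=("Users should not have to wonder whether different "
                "words, situations, or actions mean the same thing."),
    task=("Switch between settings, help, and main task areas and "
          "check that layout, colors, and controls are consistent."),
    candidate_issues=[
        "Components are not consistently aligned",
        "No accessible help or navigation guidance",
    ],
)
# A participant ticks one checklist item during the session.
entry.selected_issues.append("Components are not consistently aligned")
```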
The task scenarios were designed to evaluate the GenAI GUIs across multiple usability heuristics using realistic text-based interactions. Participants performed tasks such as (i) inputting a given prompt to generate a 1000-word blog post on “How to Build Confidence in the Workplace” and reviewing responses for issues related to visibility of system status, match between system and the real world, user control and freedom, etc.; (ii) navigating between settings, help, and main task areas to assess layout consistency and interface intuitiveness; (iii) using features such as regenerating a response or deleting a conversation to evaluate error prevention and system feedback; and (iv) locating help sections or tutorials to assess the accessibility of documentation and guidance. These tasks were carefully mapped to the 14 usability heuristics, ensuring a comprehensive evaluation of both functional and ethical aspects of the GenAI GUIs.
For each evaluation task designed to test the GUIs of the GenAI applications, we defined clear criteria to determine task success or failure. A task was considered successful if participants were able to complete it smoothly and achieve the intended outcome without confusion or unnecessary steps. Tasks were marked as partial success if participants completed the task but experienced minor difficulties, hesitation, or inefficiencies. A task was considered a failure if participants were unable to complete it or required external assistance due to interface design limitations. This approach ensured that all usability issues reported by participants could be contextualized relative to the intended task outcomes, providing a systematic and consistent basis for assessing GUI usability. Table 2 shows the evaluation of usability tasks designed for the three GenAI applications. Each task was classified as Success, Partial Success, or Failure based on whether participants were able to complete it smoothly, experienced minor difficulties, or could not complete it, respectively. Partial successes include cases where tasks were completed but participants raised concerns about task design or interface clarity, such as repeated prompts for similar tasks, which were intentionally used to allow controlled comparison across applications. The table provides a clear overview of how each application supported or challenged user interaction across the evaluated tasks. To support reproducibility, the structured task forms used in this study are provided in Appendix A and have been shared as a public artifact (https://forms.gle/bXpT125exThhTKT87; accessed on 5 July 2025). This form includes the complete task descriptions and heuristic definitions, enabling reuse in future studies.
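This three-level outcome scheme can be expressed compactly. The sketch below tallies task records into a Table 2-style per-application summary; the records shown are invented for illustration and do not reproduce our actual results.

```python
from collections import Counter
from enum import Enum

class Outcome(Enum):
    SUCCESS = "Success"          # completed smoothly with the intended result
    PARTIAL = "Partial Success"  # completed with minor difficulty or hesitation
    FAILURE = "Failure"          # not completed, or external help was needed

# Invented records: (application, heuristic-aligned task, outcome).
records = [
    ("Claude", "H1", Outcome.SUCCESS),
    ("Claude", "H5", Outcome.PARTIAL),
    ("Gemini", "H5", Outcome.PARTIAL),
    ("ChatGPT", "H10", Outcome.FAILURE),
]

# Tally outcomes per application for a Table 2-style overview.
summary = Counter((app, outcome.value) for app, _, outcome in records)
for (app, outcome), n in sorted(summary.items()):
    print(f"{app:8s} {outcome:16s} {n}")
```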
During the session, participants completed the tasks, providing feedback on the usability issues they encountered while interacting with the assigned application. The responses collected from these forms were analyzed against AI ethics principles to gain deeper insights into GUI usability concerns and areas for improvement, as discussed in the next section.

3.3. Extraction Procedure

The data extraction process was designed to systematically organize and analyze the responses collected from participants. Microsoft Excel was used to store the responses separately for each GenAI application evaluated by the participant groups. This structured approach facilitated efficient data management and analysis. Participants’ responses were recorded in dedicated Excel sheets. These responses included details about the GUI usability issues identified for each of the 14 tasks, which were designed based on the 14 heuristics. Graphs were generated for each response on the 14 tasks to visually represent the data, such as the frequency of GUI usability issues identified. These visualizations helped in identifying patterns, trends, and areas where specific GenAI applications exhibited notable strengths or weaknesses in their GUI design, which are explained later in this paper. The extracted data were then analyzed against AI ethics principles to explore their alignment with ethical considerations. The analysis provides deeper insights into the usability concerns and their implications for responsible AI design. All data handling was conducted with adherence to ethical guidelines, ensuring participant anonymity and maintaining the confidentiality of responses throughout the process.
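As an illustration of this step, the snippet below tallies the issue selections for one task and renders the kind of frequency bar chart described above. The issue strings and counts are hypothetical stand-ins for the form’s checklist items, not our recorded data.

```python
from collections import Counter
import matplotlib.pyplot as plt

# Hypothetical selections for one task, pooled across the four
# participants in a group; each string is a checklist item ticked
# on the structured form.
selections = [
    "No undo/redo functionality",
    "No undo/redo functionality",
    "Insufficient error prevention mechanisms",
    "No safeguards against critical actions",
]

counts = Counter(selections)
plt.barh(list(counts.keys()), list(counts.values()))
plt.xlabel("Participants selecting the issue (n = 4)")
plt.title("H5: Error Prevention (illustrative data)")
plt.tight_layout()
plt.show()
```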

3.4. Synthesis Procedure

For the data synthesis, we conducted a formal thematic analysis as presented by [45]. Each phase of thematic analysis is described as follows:
  • Familiarization with data: In this phase, we analyzed the collected responses to understand and familiarize ourselves with the terms used by participants, particularly the GUI issues they added themselves in the extra space provided. Initially, we anticipated that additional research might be required to comprehend any new or unfamiliar terms. However, after thoroughly reviewing all the responses, we found that no extra research was necessary, as the terms were sufficiently clear and understandable within the context provided by the participants.
  • Creating Initial Codes: In this phase, we assigned initial codes to each GUI usability issue selected by the participants in their responses. These codes summarized the core idea of each issue. For example, the issue “The response lacks real-world context or examples” under the “match between system and the real-world” heuristic was coded as “Poor Contextual Relevance.” A few GUI usability issues were simple and self-explanatory; therefore, no code was assigned to them.
  • Searching for Themes: After coding the issues, we moved to the next phase, where we grouped the codes into broader themes based on patterns identified in the data. For instance, the code “Poor Contextual Relevance” was developed into the “Poor Real-World Applicability” theme (this issue-to-code-to-theme mapping is illustrated in the sketch after this list).
  • Reviewing Themes: In this step, we reviewed all sets of themes and finalized them. We assessed each theme for consistency, relevance, and clarity. This process helped refine the themes to capture the broader patterns and key insights from the data accurately.
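The resulting two-stage mapping can be represented as simple lookup tables. The minimal sketch below encodes only the worked example from the text, with a fall-through for self-explanatory issues that received no code; the full mapping would hold one entry per coded issue.

```python
# Issue -> initial code; only the worked example from the text is
# shown, with further entries added per participant response.
issue_to_code = {
    "The response lacks real-world context or examples":
        "Poor Contextual Relevance",
}

# Initial code -> final theme (a Table 3-style grouping).
code_to_theme = {
    "Poor Contextual Relevance": "Poor Real-World Applicability",
}

def theme_for(issue: str) -> str:
    """Resolve a raw GUI issue to its final theme; self-explanatory
    issues that received no code fall through unchanged."""
    code = issue_to_code.get(issue, issue)
    return code_to_theme.get(code, code)

print(theme_for("The response lacks real-world context or examples"))
# -> Poor Real-World Applicability
```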
Table 3 presents the GUI usability issues identified (not all of which are mentioned), the initial codes created, and the final themes. The themes represent key aspects of the GenAI GUI issues. The next section provides a detailed analysis of data collected from user-based testing on GUI usability issues and the ethical alignment of three selected GenAI applications.

4. Findings

This section presents the results of the evaluation conducted through user-based testing of three generative AI (GenAI) applications—Gemini, ChatGPT, and Claude—in terms of their GUI usability issues and ethical shortcomings. The feedback received from the participants has been analyzed and explained in this section to address the research questions.
RQ1: How usable are the graphical user interfaces (GUIs) of selected generative AI applications in supporting efficient, effective, and satisfying user interactions?
Table 4 presents the GUI usability issues of three GenAI applications, i.e., Gemini, ChatGPT, and Claude, based on user-based evaluation. The table uses themes (refer to Table 3) to categorize and display the analysis, summarizing the GUI issues identified by participants and showing the number of selections made for each issue. For example, one GUI usability issue, i.e., “The response lacks real-world context or examples”, selected by three participants for ChatGPT, is categorized under the theme “Poor Real-World Applicability.” We conducted a thematic analysis to manage space, as we had to display results from 14 tasks across three GenAI applications (i.e., 14 × 3 = 42 graphs) during the user-based evaluation. Additionally, we have included Figure 5, Figure 6, Figure 7 and Figure 8, which represent the feedback from participants without thematic grouping. These figures provide a more granular view of the specific issues highlighted by users, complementing the thematic analysis in the table.
Gemini: Table 4 (second column) presents the GUI usability issues of Gemini through user-based evaluation. Each issue under a particular heuristic has been analyzed and detailed below. Since four participants evaluated Gemini’s GUI usability, the number of selections presented in the analysis reflects the percentage of participants.
H1: Visibility of System Status. Two out of the four participants (50 percent) reported that Gemini lacks time indicators for task completion, making it difficult for users to anticipate how long Gemini will take to respond. The absence of time indicators and visual feedback mechanisms may lead to uncertainty for users, affecting their ability to track system processes efficiently.
H2: Match Between System and the Real World. Half of the participants (50 percent) found that Gemini exhibits poor real-world applicability, meaning that its responses may not always align with real-world contexts or user expectations. Additionally, two participants (50 percent) highlighted that the system generates long responses, which can reduce the perceived efficiency of interactions. The long responses and limited real-world applicability indicate a gap between user expectations and actual system performance, potentially hindering a seamless user experience.
H3: User Control and Freedom. Three participants (75 percent) identified multiple usability issues under this heuristic, while one participant found no issues. Users reported limited input flexibility, meaning the application does not allow easy correction or modification of the input prompt (e.g., no option to edit or resubmit). Additionally, participants noted restricted interaction control, the absence of a clear reset function, and the inability to interrupt or cancel output generation. These limitations suggest a lack of control over system interactions, which may frustrate users who need more autonomy in modifying or halting ongoing tasks. The lack of control mechanisms in Gemini’s interface restricts user autonomy, making it difficult to efficiently navigate or correct unintended actions.
H4: Consistency and Standards. All four participants (100 percent) reported no issues under this heuristic, indicating that Gemini maintains a good, consistent visual design and adheres to expected usability standards, meaning the layout, colors, and controls are consistent across the application’s settings, help, main task areas, etc. The consistency in design suggests that the interface is predictable and follows conventional design principles. The fact that no issues were reported in this area suggests that Gemini has a reliable and consistent user interface, making it easy for users to navigate.
H5: Error Prevention. Two participants (50 percent) pointed out the absence of undo/redo functionality, limiting users’ ability to reverse or correct unintended actions, e.g., deleting a conversation. Additionally, one participant (25 percent) identified insufficient error prevention mechanisms, while another participant (25 percent) noted a lack of safeguards against critical actions. Moreover, since participants could select more than one usability issue, three participants (75 percent) also highlighted inconsistent communication regarding conversation management, which can be confusing when users attempt to navigate their interactions with the system. The system’s limited error prevention mechanisms and lack of clear communication may lead to frustration, as users may struggle to recover from mistakes or unintended inputs.
H6: Recognition Rather Than Recall. Two participants (50 percent) reported difficulty in easily accessing system features, forcing users to recall steps or options from memory rather than recognize them intuitively. Additionally, two participants (50 percent) mentioned the absence of visible shortcuts, e.g., no visible buttons or options to easily re-enter or modify the prompt, reducing efficiency and increasing the cognitive load required to navigate the system. The lack of visible shortcuts and easily accessible features may hinder usability, particularly for new users who require intuitive navigation aids.
H7: Flexibility and Efficiency of Use. Two participants (50 percent) reported that Gemini lacks efficiency features for experienced users, e.g., keyboard shortcuts, pre-configured actions, etc., limiting their ability to perform tasks quickly. Additionally, two participants (50 percent) noted a lack of customization options for responses, e.g., adjusting response length, tone, or style, reducing personalization. Similarly, these participants (50 percent) pointed out the absence of mode-switching capabilities, preventing users from adapting the interface to their preferences or expertise levels. The lack of efficiency-boosting features and customization options suggests that Gemini may not adequately support both novice and advanced users, reducing overall flexibility.
H8: Aesthetic and Minimalist Design. One participant (25 percent) found the response design cluttered, negatively impacting readability, meaning the response had too much text or formatting that made it difficult to focus on the key points, while three participants (75 percent) reported that the aesthetic and minimalist design of Gemini’s interface is satisfactory. The majority opinion suggests that, despite minor readability concerns, the interface maintains a visually appealing and minimalistic design. While Gemini’s interface is generally well-designed, minor readability issues in response formatting may impact the user experience for some individuals.
H9: Help Users Recognize, Diagnose, and Recover From Errors. Two participants highlighted multiple issues under this heuristic. One of them (25 percent) pointed out that Gemini provides insufficiently detailed error messages, making it harder for users to understand and resolve issues. Another participant (25 percent) noted that error notifications are delayed, hindering quick problem diagnosis, and that Gemini generates long responses with unnecessary details, making it difficult to identify relevant information. However, two participants (50 percent) reported no issues under this heuristic, indicating adequate support through error messages. While most users found error handling acceptable, some identified issues with delayed and insufficiently detailed error messages, which could impact problem resolution efficiency.
H10: Help and Documentation. Three participants (75 percent) found that Gemini provides overly generic help, suggesting that the documentation lacks specific guidance for different scenarios. One participant (25 percent) reported no issues, suggesting that Gemini’s help and documentation features were sufficient for them. Nevertheless, the high percentage of users identifying issues indicates that Gemini’s support resources may not be sufficiently tailored to address user needs. Generic help documentation may not provide adequate guidance, particularly for users unfamiliar with the system’s functionalities.
H11: Guidance. Two participants (50 percent) highlighted that the system provides irrelevant guidance, meaning that the instructions or prompts do not align with user needs. However, two participants (50 percent) reported no issues, indicating a mixed response regarding the effectiveness of system guidance. While some users found Gemini’s guidance effective, others perceived it as irrelevant, highlighting inconsistencies in how the system supports users.
H12: Trustworthiness. Two participants (50 percent) reported that Gemini lacks references for the information it provides, making it difficult to verify response accuracy. Additionally, two participants (50 percent) found the system to be lacking in transparency, meaning the application does not inform the user about how it handles personal data (e.g., data privacy or usage policies), which can raise concerns about trustworthiness. The absence of source references and transparency mechanisms may negatively impact user trust in the reliability of Gemini’s responses.
H13: Adaptation to Growth. Three participants (75 percent) indicated that Gemini lacks an adaptive interface tailored to user proficiency levels, meaning the system has no mechanism to adapt its interface or interaction style based on user proficiency (e.g., no advanced features or settings unlocked over time). Additionally, two participants (50 percent) noted a one-size-fits-all guidance approach, meaning the system provides the same level of guidance to experienced users as it does to new users, without recognizing user expertise or experience. These participants also reported the absence of a dynamic user feedback mechanism. The lack of adaptability in Gemini’s interface may hinder long-term user engagement, as it does not effectively cater to varying user skill levels or evolving needs.
H14: Context Relevance. Two participants (50 percent) found Gemini’s content output to be overly generic and lacking specificity. Additionally, two participants (50 percent) reported that the responses seemed shallow or superficial, failing to provide in-depth or nuanced information. Gemini’s tendency to generate generic and shallow responses may limit its usefulness in scenarios requiring detailed and context-specific information.
Overall, Gemini demonstrated several areas needing improvement (refer to Table 4 (second column) and Figure 5 and Figure 9), particularly in system feedback and user control. Other issues included the absence of time indicators for task completion and limited flexibility in user input and interaction control, as shown in Figure 9. While Gemini maintained good consistency standards, it struggled with providing efficient features for experienced users and adaptive interface designs. Despite some positive feedback on its minimalist design, the overall usability was hindered by insufficient error messages and generic help documentation, indicating areas that require focused enhancements. In the analysis of Gemini’s trustworthiness, as presented in Figure 5, participants identified key issues related to the lack of references for the provided responses. Transparency concerns highlight a need for Gemini to improve in providing clear sources for information and being more transparent about its operations and data handling, to foster greater trust among users.
ChatGPT: Table 4 (third column) presents the GUI usability issues of ChatGPT through user-based evaluation. Each issue under a particular heuristic has been analyzed and detailed below. Since four participants evaluated ChatGPT’s GUI usability, the number of selections presented in the analysis reflects the percentage of participants.
H1: Visibility of System Status. In the area of system status visibility, all of the evaluators (100 percent) reported that ChatGPT does not provide any time indication for task completion, which suggests that users are left without cues regarding the duration of ongoing processes. Interestingly, one participant (25 percent) also noted that there were no issues, reflecting a slight divergence in user perceptions. The absence of clear and visible feedback mechanisms showing how long the system takes to generate the response and what exactly the system is doing appears to be a critical concern, potentially leading to uncertainty about the progress of operations.
H2: Match Between System and the Real World. Regarding the match between the system and the real world, two out of four participants (50 percent) indicated that ChatGPT suffers from poor real-world applicability, implying that its responses lack real-world context or examples; e.g., participants asked ChatGPT to generate a 1000-word blog post on “How to Build Confidence in the Workplace” and received a response lacking real-world examples relatable to workplace scenarios, reflecting a lack of natural interactions. In contrast, the other two evaluators (50 percent) found no issues, indicating that ChatGPT follows real-world conventions, making information appear in a natural and logical order. This contrast in user perceptions suggests that while some users find ChatGPT’s responses contextually relevant and logically structured, others feel that the lack of real-world examples diminishes the practical applicability of its outputs.
H3: User Control and Freedom. When assessing the user control and freedom heuristic, ChatGPT’s interface showed several limitations. All participants (100 percent) pointed out the absence of a clear reset functionality, which restricts users from easily restarting or revising their queries. Additionally, two participants (50 percent) experienced a lack of revision functionality, meaning no undo or redo options for the generated content, while one participant also noted limited flexibility in topic refinement and an inability to interrupt or cancel output generation. The cumulative effect of these shortcomings is a significant restriction on user autonomy, indicating that the interface does not sufficiently empower users to manage or correct their interactions.
H4: Consistency and Standards. Feedback on consistency and standards was mixed for ChatGPT. One participant (25 percent) observed issues with visual alignment, meaning that components like buttons, text fields, and labels are not consistently aligned, leading to a disjointed appearance, and also highlighted the absence of accessible help and navigation guidance. However, half of the evaluators (50 percent) reported no issues and even commended the interface for good consistency. While a portion of users find the design consistent, the observed differences in alignment and navigational support suggest that the interface could benefit from tighter adherence to standard design principles.
H5: Error Prevention. Error prevention emerged as a notable area of concern. All evaluators (100 percent) noted the absence of undo/redo functionality, underscoring a critical gap in allowing users to recover from mistakes, e.g., deleting a conversation. In addition, 75 percent of participants felt that ChatGPT has insufficient preventive guidance and error mitigation (e.g., against accidental deletion of content), while 50 percent of participants also raised concerns over the lack of safeguards and action confirmations for irreversible tasks (e.g., no prompt for confirmation before irreversible actions), as well as the absence of consequence warnings. These findings collectively point to significant vulnerabilities in preventing and managing errors, which could undermine user confidence and overall interaction quality.
H6: Recognition Rather Than Recall
For the heuristic of recognition rather than recall, three out of four participants (75 percent) reported a lack of visible shortcuts, meaning that users must rely more on memory to navigate the interface. One participant (25 percent) further emphasized that the system overly depends on users' memory for interaction. Enhancing the visibility of shortcuts and other navigational aids could reduce the cognitive burden on users, thereby improving the overall usability of the interface.
H7: Flexibility and Efficiency of Use
Evaluators highlighted several challenges related to flexibility and efficiency. Every participant (100 percent) observed that ChatGPT lacks efficiency features for experienced users: the application provides no shortcuts or accelerators (e.g., keyboard shortcuts, pre-configured actions) to speed up expert workflows and no customization options for responses (e.g., adjusting response length, tone, or style). In addition, 75 percent of participants reported that help documentation is not easily accessible for new users, and half (50 percent) also mentioned the absence of mode switching as well as an overly simplistic design that limits advanced user efficiency. These responses indicate that ChatGPT's interface does not effectively cater to users with varying levels of expertise, suggesting a need for more adaptable and customizable interaction options.
H8: Aesthetic and Minimalist Design
Regarding the aesthetic and minimalist design of the interface, the majority of participants (75 percent) praised ChatGPT for its visual appeal and minimalism. However, one evaluator noted that the response design appears cluttered, with too much text or formatting, making it difficult to focus on the key points and negatively affecting readability. Although the overall aesthetic is well received, minor adjustments to layout clarity could further enhance the user experience.
H9: Help Users Recognize, Diagnose, and Recover from Errors
In terms of error recognition and recovery, three participants (75 percent) reported no issues, suggesting that the system generally provides adequate support for diagnosing and resolving errors. However, one participant (25 percent) pointed out a lack of proactive error prevention guidance, indicating that there may be occasional lapses in directing users on how to mitigate errors. While error recovery is mostly handled well, addressing the gap in proactive guidance could further streamline the troubleshooting process.
H10: Help and Documentation
All evaluators (100 percent) reported that it is difficult to find help documentation within ChatGPT's interface, and half (50 percent) also noted that the application lacks clear instructions or guidelines on how to use the system effectively. One participant further highlighted that ChatGPT provides no help or tutorial options to guide users through their tasks and that no easily accessible documentation or support resources are available within the application. This consistent feedback underlines a significant usability barrier, suggesting that help resources need to be made more accessible and comprehensive to better support user needs.
H11: Guidance
For guidance, most participants (75 percent) felt that the system provided adequate support, with only one participant (25 percent) experiencing a lack of post-input guidance, meaning the system does not provide any suggestions or guidance on how to proceed after receiving the input. Although the majority of users are satisfied with the guidance provided, fine-tuning post-input support could address the needs of those few who require additional direction.
H12: Trustworthiness
Issues of trustworthiness were particularly prominent. Three participants (75 percent) highlighted concerns regarding the lack of transparency in the responses generated by ChatGPT, pointing out that the absence of references or citations makes it difficult to verify the credibility of the provided information. The unreliable nature of the content, stemming from questionable factual accuracy, was also a recurring concern. This lack of verifiability and transparency contributed to a diminished sense of trust in the system's outputs. In addition, one evaluator (25 percent) raised concerns about ChatGPT's tendency to generate unverified or misleading information, further exacerbating trust issues. These responses collectively highlight significant credibility concerns, emphasizing the need for improved transparency and validation mechanisms to build user trust.
H13: Adaptation to Growth
The adaptation of the interface to user growth was seen as a major shortcoming. All participants (100 percent) criticized the one-size-fits-all guidance approach: the system provides the same level of guidance to experienced and new users alike, without recognizing user expertise, and lacks any mechanism to adapt its interface or interaction style to user proficiency (e.g., no advanced features or settings unlocked over time). Moreover, 75 percent of participants also observed that the interface lacks adaptive design features and a dynamic feedback mechanism, and 50 percent noted issues with evolving user permissions, meaning the application does not reduce restrictions or allow more flexibility for experienced users over time. Such uniformity in the user experience suggests that ChatGPT's interface does not sufficiently evolve with the user's needs, indicating an opportunity to introduce more personalized and dynamic features.
H14: Context Relevance
Finally, in terms of context relevance, every evaluator (100 percent) reported that ChatGPT's outputs tend to be generic and lacking in specificity, i.e., the generated content fails to address the specific nuances of the user's prompt (for example, when asked for suggestions on building trust in the workplace, the response lacked workplace-specific examples or advice), while three participants (75 percent) also observed that the responses are shallow or superficial. This indicates that the system often fails to deliver detailed, context-rich content, which could limit its effectiveness in scenarios requiring nuanced and tailored interactions.
Overall, ChatGPT faced considerable criticism for providing responses that are often seen as too generic and of limited relevance to real-world situations (refer to Table 4 (third column) and Figure 6 and Figure 7). One of the key issues highlighted by participants was the absence of critical features like undo/redo functionality, which made it difficult for users to correct mistakes or modify their inputs efficiently, as presented in Figure 6. The lack of sufficient error prevention measures (Figure 6) added to this frustration, as users were not always given clear indications or safeguards to avoid making mistakes in the first place. Another recurring problem was the interface's inability to cater to more advanced users: many participants felt that the system lacked efficiency features to support experienced users seeking a more streamlined, powerful interaction; for example, the system does not allow content adjustments according to user needs, as presented in Figure 7. This was compounded by the interface's inflexibility, which made it difficult to adapt to users' changing needs or preferences over time. Participants also raised concerns about the visual design of the interface, pointing out inconsistencies in alignment and layout that made the overall experience less user-friendly. Moreover, accessing help documentation was often challenging, further hindering users from resolving issues on their own. While the aesthetic design of ChatGPT was generally appreciated, more proactive guidance and transparency are clearly needed. As Figure 7 shows, participants highlighted that ChatGPT lacks visible transparency about the sources of the generated information, making it difficult to assess its reliability, and does not provide citations or references for the presented information, further reducing user trust. Questionable factual accuracy in generated content and the failure to inform users about how their data is handled (e.g., data privacy policies) also contributed to trust-related concerns. Together with the recurring doubts about content reliability, these gaps underscore significant weaknesses in the interface's overall usability. To improve the user experience, ChatGPT's interface needs to become more adaptable, contextually relevant, and responsive to user expectations.
Claude: Table 4 (fourth column) presents the GUI usability issues of Claude through user-based evaluation. Each issue under a particular heuristic has been analyzed and detailed below. Since four participants evaluated Claude’s GUI usability, the number of selections presented in the analysis reflects the percentage of participants.
H1: Visibility of System Status
For Claude, 75 percent of the evaluators (three out of four) reported that there is no time indication for task completion, suggesting that users are not informed about how long ongoing processes might take. In contrast, 25 percent (one participant) did not observe any issues with this aspect. When this point was discussed with the participant, they mentioned that they did not rely on time indicators and felt the system's responses were sufficiently clear without explicit timing information. This "no-issue" feedback may reflect individual differences in user expectations or prior experience with similar interfaces. Nonetheless, the majority view indicates a need for clearer feedback on task duration, which could help users better gauge system responsiveness.
H2: Match Between System and the Real World
All evaluators (100 percent) noted no issues regarding the match between the system and the real world. This unanimity suggests that Claude's interface successfully presents information in a manner that aligns with natural, real-world interactions and expectations. Such positive feedback underscores that Claude effectively contextualizes its outputs, contributing to an intuitive user experience.
H3: User Control and Freedom
In the realm of user control and freedom, 75 percent of the participants (three out of four) identified the absence of a clear reset functionality, while 25 percent (one participant) pointed out the inability to interrupt or cancel output generation. These responses indicate that users may find it difficult to regain control over interactions once initiated. Enhancing control features, especially by introducing a straightforward reset mechanism, could empower users to manage their interactions more effectively.
H4: Consistency and Standards
Claude's interface was largely regarded as consistent; 75 percent of evaluators reported no issues, though 25 percent (one participant) observed inconsistent typography across the interface. While the overall design adheres well to established standards, addressing minor typographic inconsistencies could further solidify the interface's uniformity.
H5: Error Prevention
Participants raised several concerns about error prevention: 50 percent (two out of four) noted the absence of undo/redo functionality, another 50 percent mentioned insufficient preventive guidance and error mitigation, and 75 percent (three participants) felt that error prevention measures were inadequate. Additionally, 50 percent also reported that there is a lack of action confirmation for irreversible tasks and a corresponding absence of consequence warnings for critical actions. These issues highlight a vulnerability in preventing and managing errors; improving these features could significantly boost user confidence and reduce the risk of unintended actions.
H6: Recognition Rather Than Recall
All evaluators (100 percent) reported no issues under this heuristic, indicating that Claude's design successfully minimizes reliance on users' memory, likely by offering ample cues and shortcuts. This strength in facilitating recognition over recall helps reduce cognitive load and supports a smoother navigation experience.
H7: Flexibility and Efficiency of Use
The feedback for flexibility and efficiency is mixed. While 50 percent of the participants reported no issues, the remaining 50 percent noted the absence of mode switching for user customization. This split response suggests that although some users find the interface sufficiently flexible, others are constrained by its limited customization options. Addressing these concerns by incorporating adaptable customization features could better accommodate a wider range of user preferences and expertise levels.
H8: Aesthetic and Minimalist Design
Opinions on aesthetic design were again divided: 50 percent of evaluators indicated that the response design is cluttered, negatively affecting readability, while the other 50 percent praised the minimalist, visually appealing design. Refining the layout to reduce clutter while maintaining a sleek aesthetic could reconcile these contrasting perceptions and enhance overall readability.
H9: Help Users Recognize, Diagnose, and Recover from Errors
All four participants (100 percent) reported no issues with Claude's mechanisms for error recognition and recovery. This unanimous positive feedback suggests that the system's error-handling features are effective, contributing to a reliable user experience when issues arise.
H10: Help and Documentation
Several issues emerged in the help and documentation area: 50 percent of the evaluators noted a lack of help and tutorial options, another 50 percent highlighted an absence of clear usage instructions, and an equal proportion indicated that support resources are not easily accessible. The combined feedback points to a need for more comprehensive and readily available help documentation to better support users.
H11: Guidance
In terms of guidance, all participants (100 percent) reported no issues, suggesting that Claude's interface offers clear and effective instructions for addressing the specific queries users input. This strong performance in providing guidance reinforces overall user support during interactions.
H12: Trustworthiness
Trustworthiness emerged as a concern: 75 percent of evaluators (three out of four) pointed out a lack of transparency and the absence of references for the provided responses, while 25 percent noted issues with factual accuracy and data handling transparency. These concerns about reliability and credibility indicate that enhancing transparency and incorporating verifiable references could improve user trust in the system.
H13: Adaptation to Growth
When it comes to adapting to user growth, all participants (100 percent) criticized the one-size-fits-all guidance approach, meaning the system provides the same level of guidance to experienced and new users alike, without recognizing user expertise. Fifty percent also noted inflexibility in evolving user permissions and a lack of adaptive interface design responsive to user proficiency, and 25 percent mentioned the absence of a dynamic user feedback mechanism. This feedback suggests that the static nature of the interface limits its ability to evolve with the user's needs and that introducing more personalized, adaptive features could foster long-term engagement.
H14: Context Relevance
For context relevance, 75 percent of the evaluators reported that Claude's outputs are generic and lacking in specificity, and an equal proportion found the responses to be shallow or superficial. Only one participant (25 percent) did not observe any issues. These responses highlight a need for more nuanced and detailed content, ensuring that the system's outputs better reflect the context and complexity required by users.
Overall, Claude's GUI usability issues generally reflect a well-received user experience with some areas for improvement (refer to Table 4 (fourth column) and Figure 8). The system shows strong alignment with real-world applicability, with no issues reported under the heuristic of matching the system with the real world. However, it struggles with user control and freedom, particularly the absence of a clear reset functionality and the inability to interrupt or cancel output generation, which limits user flexibility, as shown in Figure 8. Participants indicated that error prevention remains a notable concern, with issues such as the lack of undo/redo functionality and insufficient preventive guidance. While Claude performs well in areas like recognizing and diagnosing errors and providing guidance, there are concerns about trustworthiness due to a lack of transparency and references for the provided content, as shown in Figure 8. Moreover, the system's ability to adapt to user growth is limited by a "one-size-fits-all" approach to guidance and a lack of adaptive interface design. Despite these drawbacks, Claude excels in offering a generally efficient and aesthetically minimalist design, though improvements are needed in context relevance and customization features to further enhance its usability.
Summary: Comparative GUI Usability Analysis of Claude, Gemini, and ChatGPT
This study evaluated three GenAI applications: Claude 3.5 Sonnet (March 2025 build), Gemini 1.5 (March 2025 build), and ChatGPT (GPT-4, March 2025 build). All evaluations were conducted between March and April 2025 to ensure version consistency.
Among the three, Claude appears to have relatively fewer GUI usability issues and a stronger overall design. For example, Claude received unanimously positive feedback in areas such as “Match Between System and the Real World”, where all evaluators reported no issues, and in “Recognition Rather Than Recall,” where every participant found that the design minimized reliance on memory. Additionally, Claude’s guidance is well received (with 100 percent of participants noting no issues in that area), and its error recovery mechanisms work effectively (as evidenced by all users reporting no problems with error recognition and recovery). These strengths indicate that Claude’s interface, at least in certain critical aspects, facilitates natural interactions and provides clear support for users.
In contrast, both Gemini and ChatGPT show more prominent challenges. For instance, in the "User Control and Freedom" category, Gemini had all evaluators (100 percent) report limitations such as the absence of a reset functionality, while ChatGPT's users also struggled with a lack of controls like undo/redo options and revision capabilities. Moreover, ChatGPT was marked down in areas like "Help and Documentation" and "Trustworthiness," with every participant noting difficulties in finding accessible help resources and concerns over transparency and the provision of reliable references. Although both Gemini and ChatGPT have some strengths (Gemini, for instance, was commended for consistency and aesthetic design, and ChatGPT for maintaining a generally minimalist interface), the severity and frequency of issues across multiple heuristics suggest that their overall designs are less robust than Claude's. Overall, Claude stands out with its intuitive real-world alignment, effective error recovery, and clear guidance, while Gemini and ChatGPT face notable hurdles in user control, error prevention, and support documentation. This overall comparison suggests that, so far, Claude offers a more user-friendly and consistent GUI experience, making it a stronger candidate in terms of usability for supporting efficient, effective, and satisfying user interactions. The complete figures generated for each task assigned to the participants can be accessed through the following link: https://docs.google.com/spreadsheets/d/1Tb36QeFxiH9tRhQVeUAxyG7ZJX2zE3LU/edit?usp=sharing&ouid=116945118200994725511&rtpof=true&sd=true (accessed on 5 July 2025).
While it is acknowledged that GUIs in GenAI applications evolve rapidly, the findings from this study extend beyond transient visual changes or minor feature adjustments. The issues identified, such as the lack of clear documentation, limited customization, and insufficient error prevention mechanisms, represent structural usability limitations rather than temporary interface variations. These concerns relate to fundamental interaction design choices that affect user autonomy, efficiency, and trust, and are therefore likely to persist across future versions unless deliberately addressed. Conversely, strengths such as Claude’s real-world alignment, error recovery, and minimal reliance on memory are also grounded in core design principles, suggesting that these advantages are robust and transferable across updates. Thus, although cosmetic aspects (e.g., icon placement or layout styling) may shift over time, the comparative differences highlighted in this study provide enduring insights into the usability and ethical alignment of these systems.
Influence of User Proficiency and Usage Patterns
To better understand the observed usability outcomes, we considered participant proficiency and GenAI usage patterns. Based on responses to the form question on frequency of use, five participants reported using GenAI applications daily, six reported weekly usage, and one reported monthly usage. This distribution indicates that most participants were regular users and likely had intermediate to high proficiency with GenAI systems.
Consequently, participants were generally able to navigate the applications efficiently and complete tasks successfully, which may have contributed to the relatively high success rates observed across most tasks. However, despite this familiarity, certain usability challenges persisted, particularly regarding error prevention, limited customization, and clarity of system feedback. This analysis demonstrates that even for experienced users, the GUIs of these applications have areas requiring improvement, highlighting the importance of designing interfaces that are both intuitive and robust for a broad spectrum of users.
Overall, incorporating user proficiency and usage frequency provides a nuanced understanding of the evaluation results, allowing us to contextualize task performance and heuristic findings beyond an aggregate comparison of the three applications.
RQ2: What are the key ethical shortcomings associated with the usability of the graphical user interface design of Generative AI applications?
The 14 heuristics used to evaluate the GUI usability of the three GenAI applications have been mapped to Australia's AI ethics principles using a systematic approach that aligns the core attributes of each heuristic with the corresponding ethical principles. This alignment is based on their mutual focus on enhancing user interaction, trust, and equitable outcomes in technology. Both usability heuristics and ethical principles aim to ensure systems are designed and evaluated to prioritize human-centered values, safety, transparency, etc. [14,46]. The usability heuristics represent foundational principles for designing user interfaces that are intuitive and efficient [14], while Australia's AI ethics principles provide a framework for ensuring ethical and responsible AI development [46]. Mapping these heuristics to the ethical principles highlights how GUI usability considerations contribute to broader ethical outcomes, such as transparency, accountability, and fairness. Each usability heuristic was examined for its underlying goals and characteristics, and the ethics principles were analyzed to identify their key aspects, including transparency, human-centered values, and accountability. Heuristics were then matched to principles based on their shared objectives. For instance, the heuristic "Visibility of System Status" aligns with the principle of "Transparency and Explainability" because it ensures users are informed about system processes, thereby promoting transparency.
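To make the mapping procedure concrete, the following illustrative TypeScript structure encodes a few of the heuristic-to-principle pairs discussed in this section (the complete mapping appears in Table 5; the representation itself is a sketch, not part of the evaluation instrument):

```typescript
// Illustrative encoding of the heuristic-to-principle mapping (see Table 5).
// Only a subset of pairs is shown; labels follow the wording used in the text.
const heuristicToPrinciples: Record<string, string[]> = {
  "Visibility of System Status": ["Transparency and Explainability"],
  "Trustworthiness": [
    "Transparency and Explainability",
    "Privacy Protection and Security",
  ],
  "Context Relevance": ["Human, Societal, and Environmental Well-being"],
  "Guidance": [
    "Human, Societal, and Environmental Well-being",
    "Transparency and Explainability",
    "Accountability",
  ],
};

// A simple lookup makes the shared objectives explicit during analysis.
for (const [heuristic, principles] of Object.entries(heuristicToPrinciples)) {
  console.log(`${heuristic} -> ${principles.join(", ")}`);
}
```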
Table 5 presents the mapping of 14 heuristics with AI ethics principles. This mapping is grounded in the unique ethical considerations emphasized in Australia’s AI ethics principles. For example, principles such as “Human, Societal, and Environmental well-being” extend the scope of heuristics like “Context Relevance” to include broader societal impacts. This mapping serves as a robust framework for evaluating generative AI applications, ensuring both usability and ethical compliance are integrated into the design and assessment process. By aligning heuristics with ethical principles, the analysis provides actionable insights into fostering responsible and user-friendly AI systems. The mapping of the 14 heuristics to AI ethics principles reveals several ethical shortcomings in the GUIs of the selected generative AI applications that directly impact user trust and decision-making.
Ethical Concerns of GUI Usability of GenAI Applications: The following outlines how usability issues in the GUIs of generative AI applications, as mapped to key AI ethics principles, highlight their ethical implications:
Transparency and Explainability: Several heuristics map to this principle, including "Visibility of system status," "Aesthetic and minimalist design," "Help and documentation," and "Trustworthiness." Our analysis revealed that all three applications, i.e., Gemini, ChatGPT, and Claude, struggle to provide clear, real-time feedback to users; for instance, participants consistently reported a lack of time indicators or status updates. This shortfall means that users do not fully understand what the system is doing or how long processes might take, thereby reducing transparency and explainability. When users are left uninformed, they are unable to assess the decision-making process of the GenAI or determine whether the system is operating as expected. Moreover, issues in "Aesthetic and minimalist design," such as cluttered layouts or inconsistent visual elements, further hinder the clarity of information presentation, potentially leading to misinterpretation or oversight of critical information. Problems with "Help and documentation" compound these concerns; when users cannot easily access detailed, context-specific guidance, they are left without the necessary tools to verify or challenge the system's operations. Additionally, trustworthiness issues, such as unreferenced or opaque responses, undermine the ethical imperative for clarity and accountability. This lack of accessible explanation disproportionately affects users with varying technical backgrounds, ultimately preventing them from making fully informed decisions and eroding their trust in the system. Ethically, these deficiencies inhibit users' ability to understand, contest, and rely upon the technology, which is fundamental for responsible and transparent GenAI use.
Human-Centered Values, Fairness, and Contestability: Heuristics such as "Match between system and the real world," "User control and freedom," "Recognition rather than recall," "Flexibility and efficiency of use," and "Adaptation to growth" are tied to ensuring the system aligns with human-centered values and fairness. For example, while some users appreciated that Claude's outputs aligned well with real-world contexts, ChatGPT's and Gemini's responses were sometimes generic or poorly contextualized, raising concerns about fairness in communication. Furthermore, significant limitations in user control (e.g., the inability to reset interactions or interrupt outputs, as found in ChatGPT and Gemini) restrict users from contesting or correcting outcomes. This limitation of control diminishes the user's autonomy and their ability to challenge or modify results, which is essential for a fair and inclusive system. Additionally, challenges with "Recognition rather than recall" (such as missing shortcuts) increase the cognitive burden on users, thereby potentially disadvantaging less experienced or differently abled users. These factors together highlight ethical shortcomings in ensuring that the system is inclusive, respects diverse user needs, and allows users to challenge outcomes effectively. GenAI applications that do not adapt to user needs or provide sufficient control options can appear biased or unfair, potentially excluding less experienced users and leading to frustration.
Reliability, Safety, and Privacy Protection: Several heuristics underpin these ethical dimensions, including “Error prevention,” “Help users recognize, diagnose, and recover from errors,” and “Trustworthiness.” Across the applications (Gemini, ChatGPT, and Claude), our findings indicate significant concerns in these areas. For instance, all three applications exhibited a lack of error prevention measures, which increases the risk of users executing harmful actions. This vulnerability not only compromises the overall reliability of the applications but also jeopardizes user safety and data privacy by making it difficult to correct mistakes or mitigate unintended outcomes. Additionally, the heuristic “Help users recognize, diagnose, and recover from errors” revealed that users often face inadequate support for error recovery, further eroding their ability to safely navigate and rectify issues during interactions. Compounding these concerns, the “Trustworthiness” heuristic, which is also mapped to privacy protection and security, highlighted that opaque responses and the absence of clear references undermine user confidence in the system’s ability to securely handle sensitive or critical operations. Together, these shortcomings raise significant ethical concerns, as they diminish the system’s capacity to safeguard user interests, protect privacy, and ensure a reliable and secure interaction environment.
Human, Societal, and Environmental Well-being (HSE) and Accountability:
The “Guidance” heuristic is mapped to multiple ethical dimensions—including human, societal, and environmental well-being, as well as transparency, explainability, and accountability. Our analysis revealed that while some aspects of guidance in the three GenAI applications (Gemini, ChatGPT, and Claude) are functional, there remain gaps, particularly in ensuring that users receive comprehensive, context-specific support. Moreover, the “Adaptation to growth” heuristic highlights that current interfaces often do not evolve with the user’s needs, leading to a one-size-fits-all approach. This rigidity may hinder long-term engagement and fail to accommodate the evolving proficiency levels of users, thereby affecting the ethical commitment to inclusivity and continuous improvement. Without adaptive and accountable design, users may find it challenging to rely on the system over time, which can affect both user satisfaction and overall societal trust in these emerging technologies.
Overall, the ethical shortcomings observed in the GUI usability of the three GenAI applications are multifaceted. A significant concern is the lack of transparency and explainability, marked by insufficient system status feedback, cluttered visual designs, poor documentation, and opaque response generation, which undermines users' ability to understand how the system operates and to trust its outputs. This issue is compounded by limited user autonomy and contestability, where restrictive controls, inflexible interfaces, and a high cognitive load prevent users from effectively challenging or modifying system outcomes, thereby compromising fairness and inclusivity. Moreover, the reliability and safety of the GenAI applications are at risk due to inadequate error prevention and recovery mechanisms, which not only increase the likelihood of unintended outcomes but also heighten the potential for breaches of user privacy. In addition, gaps in comprehensive guidance and adaptability raise concerns about accountability and long-term user well-being, as the systems fail to evolve in line with users' changing needs. Finally, the generation of generic and superficial content that does not meet complex real-world expectations further detracts from the overall utility of the systems, affecting fairness and their practical effectiveness. These shortcomings compromise transparency, user control, reliability, and privacy, all of which are crucial for fostering trust and facilitating informed decision-making; addressing them through improved GUI design can enhance user trust and promote more effective and satisfying interactions.
RQ3: How can the design of GUIs for generative AI applications be improved to align with ethical principles?
The design of graphical user interfaces (GUIs) in generative AI applications significantly influences user interactions, shaping trust, control, and overall experience [10]. A well-designed GUI ensures that ethical principles such as transparency, fairness, reliability, and accountability are embedded into the user experience. To address RQ3, this study proposes the following improvements based on the key usability challenges identified earlier.
Building Trust Through Clarity: Transparency and explainability are crucial for user trust in GenAI applications. Users need clear system feedback to understand how and why decisions are made. A lack of visibility into system status can lead to confusion and frustration. For example, Gemini does not provide time indicators for task completion, leaving users uncertain about the system’s progress. Adding progress indicators or visual cues would enhance explainability, ensuring users can anticipate outcomes and understand system behavior. Clear and timely feedback fosters trust and confidence, encouraging users to engage with the application without hesitation.
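As a minimal sketch of such a cue, the TypeScript fragment below displays elapsed time while a response is being generated; the element handling and wording are hypothetical and intended only to illustrate the recommendation:

```typescript
// Sketch of an elapsed-time status indicator shown during response generation.
// The status element and messages are hypothetical illustrations.
function showGenerationStatus(statusEl: HTMLElement): () => void {
  const start = Date.now();
  statusEl.textContent = "Generating response... 0 s elapsed";
  const timer = window.setInterval(() => {
    const seconds = Math.round((Date.now() - start) / 1000);
    statusEl.textContent = `Generating response... ${seconds} s elapsed`;
  }, 1000);
  // Return a cleanup callback to invoke once the response arrives.
  return () => {
    window.clearInterval(timer);
    statusEl.textContent = "Response complete";
  };
}
```

Even an indeterminate indicator of this kind gives users a visible signal that the system is working, directly addressing the H1 findings reported above.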
Empowering User Control and Flexibility: Human-centered design prioritizes user empowerment, ensuring that interactions are adaptable and inclusive. Limited user control, such as an inability to modify inputs or reset conversations, restricts engagement and increases frustration. For instance, ChatGPT lacks topic refinement flexibility and a clear reset functionality, making it difficult for users to restart or adjust interactions. Therefore, it is important to introduce intuitive controls for conversation resets and response adjustments, as this would provide users with greater agency, promoting fairness and inclusivity. Flexible and adaptive interfaces accommodate diverse user needs, ensuring that GenAI applications remain accessible and equitable.
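One possible realization, sketched below under the assumption of a simple message-list state, is a reset action that archives the current conversation instead of destroying it, so users can start over without losing earlier work:

```typescript
// Illustrative reset control that archives rather than discards the session.
// The state shape is an assumption made for this sketch.
interface Message {
  role: "user" | "assistant";
  text: string;
}

interface SessionState {
  current: Message[];
  archive: Message[][]; // past conversations remain recoverable
}

function resetConversation(state: SessionState): SessionState {
  return {
    current: [], // start fresh immediately
    archive:
      state.current.length > 0
        ? [...state.archive, state.current] // keep the old thread retrievable
        : state.archive,
  };
}
```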
Ensuring Reliable and Safe Interactions: Users must be able to trust that a GenAI system operates reliably and safely. Poor error prevention mechanisms and a lack of corrective options can result in irreversible mistakes, negatively affecting user confidence. As observed, Claude and Gemini lack undo functionalities, increasing the risk of unintended actions with no recovery options. Implementing an undo feature and proactive error notifications would minimize user errors and reinforce system reliability and safety. By integrating robust error prevention and recovery mechanisms, GenAI applications can mitigate risks and enhance overall user trust.
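A conventional way to provide such recovery is an undo/redo history built from two stacks; the generic sketch below assumes each edit produces a new immutable state and is not tied to any vendor's API:

```typescript
// Generic undo/redo history using the standard two-stack pattern.
// The state type T and its integration points are assumptions.
class History<T> {
  private past: T[] = [];
  private future: T[] = [];

  constructor(private present: T) {}

  commit(next: T): void {
    this.past.push(this.present);
    this.present = next;
    this.future = []; // a new edit invalidates the redo branch
  }

  undo(): T {
    const prev = this.past.pop();
    if (prev !== undefined) {
      this.future.push(this.present);
      this.present = prev;
    }
    return this.present;
  }

  redo(): T {
    const next = this.future.pop();
    if (next !== undefined) {
      this.past.push(this.present);
      this.present = next;
    }
    return this.present;
  }
}
```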
Securing User Data and Strengthening Accountability: Privacy protection and accountability depend on transparent communication about data handling, error reporting, and decision traceability. When users lack clarity about how their data is managed, their trust in the system diminishes. For example, Gemini provides insufficient guidance on error handling, increasing the risk of users mishandling their data. Enhancing system explanations of data usage, security policies, and error recovery would empower users to make informed choices and foster accountability. Providing users with clear, accessible information on privacy settings and system decisions strengthens both security and ethical responsibility.
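A concrete interface element supporting this recommendation is an always-reachable data-handling summary; the sketch below uses invented placeholder fields and does not describe any vendor's actual policy:

```typescript
// Illustrative data-handling disclosure surfaced in the interface.
// All field names and values are invented placeholders.
interface DataHandlingNotice {
  retainedFor: string;
  usedForTraining: boolean;
  sharedWithThirdParties: boolean;
  optOutAvailable: boolean;
}

function renderNotice(n: DataHandlingNotice): string {
  return [
    `Conversations retained for: ${n.retainedFor}`,
    `Used to train models: ${n.usedForTraining ? "yes" : "no"}`,
    `Shared with third parties: ${n.sharedWithThirdParties ? "yes" : "no"}`,
    `Opt-out available in settings: ${n.optOutAvailable ? "yes" : "no"}`,
  ].join("\n");
}
```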
Contextual and Inclusive Design for Societal Well-being: GenAI applications should consider societal and environmental implications by offering context-specific guidance rather than generic, one-size-fits-all responses. Users rely on AI for nuanced decision-making, and poorly tailored outputs can hinder effective outcomes. For instance, Claude was observed to generate generic content rather than adapting responses to user-specific contexts, limiting its practical usefulness. Incorporating more dynamic, context-aware AI responses would improve alignment with societal values and enhance decision-making quality. A more responsive and inclusive AI design ensures that applications cater to diverse user needs while promoting ethical AI deployment.
The effectiveness of GenAI applications depends not only on their capabilities but also on the design of their interfaces. Addressing GUI usability challenges, such as unclear system feedback, limited user control, unreliable error handling, and inadequate privacy communication, can significantly enhance ethical alignment. By designing user interactions with transparency, flexibility, reliability, security, and inclusivity in mind, GenAI applications can foster greater trust, usability, and ethical responsibility. Future research should explore dynamic user interface (UI) adaptations that respond to evolving user needs, ensuring that AI technologies remain both responsible and user-centric.

5. Discussion

The findings from the user-based evaluation of three generative AI (GenAI) applications, i.e., Gemini, ChatGPT, and Claude, highlight significant usability challenges and ethical shortcomings that impact user experience and trust. This section discusses these insights concerning the research questions and suggestions to enhance the usability and ethical alignment of GenAI interfaces.

5.1. GUI Usability: Challenges and Opportunities

The results demonstrate that the graphical user interfaces (GUIs) of the selected GenAI applications exhibit notable deficiencies in supporting efficient, effective, and satisfying interactions. These shortcomings include a lack of user control, insufficient transparency, poor error prevention, and limited adaptability. However, each application presents unique strengths and weaknesses that provide opportunities for enhancement.

5.1.1. Common Usability Issues Across GenAI Applications

Despite differences in individual performance, all three applications struggle with GUI usability aspects that directly affect user satisfaction and task efficiency:
  • Lack of User Control and Freedom: the inability to reset actions, as seen in all three applications, restricts users from efficiently correcting mistakes or refining their queries.
  • Inadequate Error Prevention Mechanisms: the absence of proactive measures to prevent errors, such as contextual warnings or guided corrections, leads to increased frustration and inefficiency.
  • Lack of Visibility of System Status and Transparency: Users found it difficult to understand system processes due to missing indicators (e.g., time estimation for responses) and a lack of clear source attribution. Users highlighted that the GenAI applications did not provide references for the generated content, leading to a lack of transparency and making it difficult to verify the accuracy and reliability of the information.
  • Limited Adaptability and Customization: the one-size-fits-all approach adopted by these applications does not cater to diverse user expertise levels, reducing accessibility for both novice and experienced users.
While Claude appears to offer a more balanced user experience, particularly in maintaining consistency and usability heuristics, it still requires improvements in flexibility and contextual relevance.

5.1.2. Ethical Shortcomings in GUI Usability and Their Implications

The mapping of usability heuristics to AI ethics principles reveals several ethical concerns that impact user trust and responsible AI deployment. User trust and AI trustworthiness are critical for the responsible deployment of generative AI applications. The GUI usability shortcomings observed in the three GenAI systems directly impact user trust by undermining transparency, fairness, reliability, privacy, accountability, etc. These ethical concerns, while distinct, collectively shape how users perceive and interact with GenAI systems.
One key factor affecting trust is the lack of transparency and explainability. Users often struggle to understand how the system operates, leading to diminished confidence in its decisions. This is exacerbated by limited feedback mechanisms, which fail to provide clear rationales for GenAI-generated responses. Enhancing system visibility through intuitive design elements, such as explicit status indicators, clear error messages, and rational explanations for outputs, can improve explainability and foster trust. Similarly, human-centered values and fairness are compromised when users lack control over GenAI-generated content or when accessibility features are insufficient. Restricted user flexibility can lead to the exclusion of certain groups, particularly those with domain-specific needs or disabilities. Addressing these issues requires adaptive user assistance and more inclusive design strategies to ensure equitable access and fair interactions with the AI system.
The absence of reliability and safety measures further weakens trust. Without robust error prevention mechanisms, users may unknowingly rely on incorrect or misleading outputs, especially in high-stakes scenarios. Implementing safeguards such as input validation, confidence indicators, and error recovery mechanisms can significantly enhance system reliability and mitigate risks. Another major concern is privacy and security. Users often remain unaware of how their data is processed, stored, or shared, increasing their vulnerability to privacy risks. A more transparent approach, such as explicit data handling disclosures and privacy-centric interface designs, can reinforce user confidence in the system’s security practices.
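As an example of the confidence indicators mentioned above, the sketch below attaches a qualitative confidence badge to each generated answer; the score thresholds and wording are assumptions chosen purely for illustration:

```typescript
// Sketch of a confidence badge attached to a generated answer.
// Score thresholds and labels are illustrative assumptions.
type ConfidenceLevel = "high" | "medium" | "low";

function confidenceLabel(score: number): ConfidenceLevel {
  if (score >= 0.8) return "high";
  if (score >= 0.5) return "medium";
  return "low";
}

function renderAnswer(text: string, score: number): string {
  const level = confidenceLabel(score);
  const note =
    level === "low"
      ? " -- please verify this answer against a primary source."
      : "";
  return `${text}\n[model confidence: ${level}${note}]`;
}
```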
Accountability plays a crucial role in GenAI trustworthiness, yet users frequently encounter difficulties in tracing GenAI-generated decisions. The lack of auditability and feedback mechanisms prevents users from assessing the rationale behind outputs or challenging erroneous responses. Strengthening accountability through traceable decision pathways and user-controllable feedback features would contribute to more responsible AI interactions. Finally, societal and environmental well-being is affected when GenAI systems fail to consider contextual factors in their responses. This limitation can reduce their relevance in real-world applications and lead to ethical concerns in decision-making. Embedding context awareness in system design by incorporating real-world constraints and ethical considerations can enhance the system’s alignment with societal needs.
By holistically addressing these usability-related ethical shortcomings, GenAI systems can enhance trustworthiness, fostering greater user confidence and responsible AI adoption.

5.1.3. Recommendations for Ethical and Responsible GUI Design

The following suggestions can help integrate ethical considerations into the GUI design of GenAI applications:
  • Transparency Mechanisms: display content provenance (see the sketch following this list), ensure explainable AI features, and provide interactive transparency options where users can inquire about system decision-making.
  • Robust Privacy Controls: include clear opt-in/opt-out data-sharing policies, improve visibility of data-handling practices, and provide user-friendly privacy settings.
  • Trust-Building Features: introduce explicit disclaimers on AI limitations, allow user feedback to influence responses, and create mechanisms for users to challenge or verify AI outputs.
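To make the first recommendation concrete, the sketch below shows one way a generated response could carry machine-readable provenance that the interface renders alongside the text; the record shape is an assumption for illustration, not an existing API:

```typescript
// Illustrative provenance record attached to each generated response
// (hypothetical shape; no evaluated application exposes this structure).
interface SourceRef {
  title: string;
  url: string;
}

interface AttributedResponse {
  text: string;
  model: string; // which model produced the output
  generatedAt: string; // ISO timestamp for auditability
  sources: SourceRef[]; // citations users can follow and verify
}

function renderWithProvenance(r: AttributedResponse): string {
  const sources = r.sources.length
    ? r.sources.map((s, i) => `[${i + 1}] ${s.title} (${s.url})`).join("\n")
    : "No sources available; treat this output with caution.";
  return `${r.text}\n\nGenerated by ${r.model} at ${r.generatedAt}\nSources:\n${sources}`;
}
```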
The results underscore the need for GenAI applications to adopt a more user-centric and ethically responsible interface design. In particular, issues such as insufficient error prevention, lack of shortcuts, and limited feedback on system status highlight areas where user experience and ethical alignment can be improved. By addressing these usability barriers and aligning with AI ethics principles, these systems can enhance user trust, effectiveness, and satisfaction. Building on these findings, future research should focus on conducting longitudinal studies to assess usability improvements over time, exploring adaptive and personalized interfaces that respond to users’ proficiency and usage patterns, and developing standardized frameworks that integrate usability heuristics with AI ethics guidelines. These directions are a direct extension of the current study: adaptive interfaces could mitigate observed differences in task efficiency among users, while standardized frameworks could provide consistent guidance to address recurring usability and ethical issues across applications. By incorporating such enhancements, GenAI applications can move towards a design that not only facilitates seamless interactions but also upholds ethical AI principles, ensuring responsible and equitable user engagement.

6. Limitations and Threats to Validity

Limitations: Our study involved a relatively small sample of 12 participants. This limited sample size may not capture the full range of user experiences or the diversity of user expertise, which can affect the generalizability of our findings. To address this limitation, future studies are planned to involve larger and more diverse participant groups, including users with varying technical backgrounds and levels of GenAI experience, to ensure broader representativeness. Despite this limitation, our study provides valuable, in-depth insights into key usability challenges that can inform future interface improvements.
Generative AI platforms, including ChatGPT, are in a state of rapid evolution. For instance, ChatGPT has introduced a new feature called “Reason,” which affects how it responds to queries. While this feature indicates that the GUI is becoming more sophisticated, it also means that certain usability issues, such as the lack of clear feedback or timeline indicators, may be addressed in future iterations. To mitigate this, longitudinal studies or repeated evaluations of GenAI GUIs can track changes over time, allowing researchers to assess the persistence of usability issues and the impact of new features. Consequently, our study offers a snapshot of current usability challenges that not only highlight areas needing improvement but also serve as a baseline for tracking future enhancements.
Internal Validity: The tasks and scenarios used during the evaluations might not have fully captured the range of interactions that users typically perform, limiting the applicability of our findings to real-world settings. Despite these challenges, the structured nature of heuristic evaluation allowed us to uncover consistent trends and pinpoint specific usability issues, thereby providing a meaningful contribution to the body of research on GenAI interfaces.
External Validity: As GenAI systems are continuously updated and enhanced, the GUI usability issues identified in this study might not persist in future versions. For example, while ChatGPT currently lacks a proper timeline or feedback mechanism despite the introduction of the “Reason” feature, subsequent updates may resolve these issues, thereby altering the overall usability landscape. The results obtained from our specific participant group may not be entirely representative of the broader user base, which can include a wide variety of demographics and usage contexts. This limits the extent to which our conclusions can be generalized to all users of these GenAI applications as these technologies evolve. Nonetheless, our research highlights critical areas such as user control, error prevention, trustworthiness, and help documentation that remain in need of further refinement, thereby providing a valuable roadmap for future improvements in GenAI GUI design.

7. Conclusions and Future Work

This study provides valuable insights into the usability and ethical implications of generative AI applications, with a focus on the graphical user interfaces of Gemini, ChatGPT, and Claude. While each application demonstrated certain strengths, such as Gemini’s consistency, ChatGPT’s guidance, and Claude’s aesthetic design, all three faced critical issues that compromised their GUI usability and alignment with ethical principles. The absence of essential features like system status indicators, error prevention mechanisms, and user control functionality reduces the efficiency and effectiveness of these applications, ultimately impacting user satisfaction. For example, the lack of undo/redo options increases users’ cognitive load and limits their sense of control, while unclear system feedback diminishes perceived transparency. Similarly, insufficient customization options can negatively affect user autonomy by restricting the ability to tailor interactions to personal workflows. Ethical shortcomings, such as a lack of transparency, accountability, and fairness, were found to exacerbate the GUI usability issues, creating barriers to user trust and informed decision-making. These ethical concerns are compounded by a lack of clarity regarding data usage and privacy protection, further diminishing users’ confidence in the applications.
To address these issues, the study proposes the following recommendations for GenAI GUI design:
  • Enhance transparency: include clear system status indicators and real-time feedback to ensure users understand system responses and progress.
  • Support user autonomy: provide customizable controls, adaptive interfaces, and undo/redo options to allow users to manage and tailor their interactions.
  • Reduce cognitive load: implement error prevention and recovery mechanisms, context-specific guidance, and simplified layouts to streamline user tasks.
  • Improve trustworthiness: integrate built-in reference mechanisms and source display for generated content to increase accountability and reliability.
  • Foster ethical alignment: design interfaces that promote fairness, accessibility, and informed decision-making through clear instructions and guidance.
These recommendations directly address identified GUI issues and their consequences for transparency, user autonomy, and cognitive load. By implementing these improvements, GenAI applications can enhance usability while aligning with key ethical principles, thereby increasing user trust and facilitating responsible AI adoption.
In conclusion, while the evaluated GenAI applications serve as useful tools, their GUIs need substantial improvement to meet both usability and ethical standards. As GenAI continues to play an increasingly significant role in various domains, ensuring that these applications are both user-friendly and ethically aligned will be essential for their long-term success and societal acceptance.
In future work, we plan to expand the evaluation to a broader range of GenAI applications and conduct an additional focus group study to further validate the proposed improvements and inform the design of a more ethically aligned GUI framework. A replication study is also planned, focusing on evaluating GUI usability alongside user experience (UX) using updated versions of GenAI applications. Meanwhile, the task forms have been made publicly available to support future replications and comparative analyses by other researchers. To protect participant privacy, individual responses containing demographic information are not shared; only the blank task form is accessible. While this evaluation was limited to text generation interfaces, future research will extend the usability analysis to multimodal GenAI systems (e.g., image and video generation) to provide a more comprehensive assessment.

Author Contributions

Conceptualization, A.B. and W.H.; Methodology, A.B.; Validation, A.B. and W.H.; Formal analysis, A.B.; Writing—original draft, A.B.; Supervision, W.H.; Project administration, W.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data related to this study can be accessed from the links (https://www.industry.gov.au/publications/australias-artificial-intelligence-ethics-principles/australias-ai-ethics-principles, accessed on 5 March 2025; https://chatu.qu.tu-berlin.de/home, accessed on 7 March 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Tasks Performed by the Participants

Figure A1. Tasks.

References

  1. Sengar, S.S.; Hasan, A.B.; Kumar, S.; Carroll, F. Generative Artificial Intelligence: A Systematic Review and Applications. arXiv 2024, arXiv:2405.11029. [Google Scholar] [CrossRef]
  2. OpenAI. ChatGPT. 2024. Available online: https://openai.com/chatgpt (accessed on 5 September 2024).
  3. Google. Google Gemini. 2024. Available online: https://gemini.google.com/app (accessed on 5 September 2024).
  4. Anthropic. Claude AI by Anthropic. 2024. Available online: https://www.anthropic.com/claude (accessed on 5 September 2024).
  5. Perkins, M.; Furze, L.; Roe, J.; MacVaugh, J. The Artificial Intelligence Assessment Scale (AIAS): A framework for ethical integration of generative AI in educational assessment. J. Univ. Teach. Learn. Pract. 2024, 21. [Google Scholar] [CrossRef]
  6. Holmes, W.; Miao, F. Guidance for Generative AI in Education and Research; UNESCO Publishing: Paris, France, 2023. [Google Scholar]
  7. Wang, C.; Liu, S.; Yang, H.; Guo, J.; Wu, Y.; Liu, J. Ethical considerations of using ChatGPT in health care. J. Med. Internet Res. 2023, 25, e48009. [Google Scholar] [CrossRef] [PubMed]
  8. Oswal, S.K.; Oswal, H.K. Examining the Accessibility of Generative AI Website Builder Tools for Blind and Low Vision Users: 21 Best Practices for Designers and Developers. In Proceedings of the 2024 IEEE International Professional Communication Conference (ProComm), Pittsburgh, PA, USA, 14–17 July 2024; pp. 121–128. [Google Scholar]
  9. Adnin, R.; Das, M. “I look at it as the king of knowledge”: How Blind People Use and Understand Generative AI Tools. People 2024, 16, 92. [Google Scholar]
  10. Weisz, J.D.; Muller, M.; He, J.; Houde, S. Toward general design principles for generative AI applications. arXiv 2023, arXiv:2301.05578. [Google Scholar] [CrossRef]
  11. Hornbæk, K. Current practice in measuring usability: Challenges to usability studies and research. Int. J. Hum.-Comput. Stud. 2006, 64, 79–102. [Google Scholar] [CrossRef]
  12. Alvarez-Cortes, V.; Zayas-Perez, B.E.; Zarate-Silva, V.H.; Uresti, J.A.R. Current trends in adaptive user interfaces: Challenges and applications. In Proceedings of the Electronics, Robotics and Automotive Mechanics Conference (CERMA 2007), Cuernavaca, Mexico, 25–28 September 2007; pp. 312–317. [Google Scholar]
  13. Kim, T.S.; Ignacio, M.J.; Yu, S.; Jin, H.; Kim, Y.G. UI/UX for Generative AI: Taxonomy, Trend, and Challenge. IEEE Access 2024, 12, 179891–179911. [Google Scholar] [CrossRef]
  14. Yamani, A.Z.; Al-Shammare, H.A.; Baslyman, M. Establishing Heuristics for Improving the Usability of GUI Machine Learning Tools for Novice Users. In Proceedings of the CHI Conference on Human Factors in Computing Systems, New York, NY, USA, 11–16 May 2024; pp. 1–19. [Google Scholar]
  15. Ribeiro, R. Conversational Generative AI Interface Design: Exploration of a Hybrid Graphical User Interface and Conversational User Interface for Interaction with ChatGPT; Malmö University (Faculty of Culture and Society, School of Arts and Communication): Malmö, Sweden, 2024. [Google Scholar]
  16. Skjuve, M.; Følstad, A.; Brandtzaeg, P.B. The user experience of ChatGPT: Findings from a questionnaire study of early users. In Proceedings of the 5th International Conference on Conversational User Interfaces, Eindhoven, The Netherlands, 19–21 July 2023; pp. 1–10. [Google Scholar]
  17. Ray, P.P. ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet Things-Cyber-Phys. Syst. 2023, 3, 121–154. [Google Scholar] [CrossRef]
  18. Lee, P.; Bubeck, S.; Petro, J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N. Engl. J. Med. 2023, 388, 1233–1239. [Google Scholar] [CrossRef]
  19. Perera, H.; Hussain, W.; Whittle, J.; Nurwidyantoro, A.; Mougouei, D.; Shams, R.A.; Oliver, G. A study on the prevalence of human values in software engineering publications, 2015–2018. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, Seoul, Republic of Korea, 27 June–19 July 2020; pp. 409–420. [Google Scholar]
  20. Batool, A.; Zowghi, D.; Bano, M. AI governance: A systematic literature review. AI Ethics 2025, 5, 3265–3279. [Google Scholar] [CrossRef]
  21. Batool, A.; Zowghi, D.; Bano, M. Responsible AI governance: A systematic literature review. arXiv 2023, arXiv:2401.10896. [Google Scholar] [CrossRef]
  22. Beltran, M.A.; Ruiz Mondragon, M.I.; Han, S.H. Comparative Analysis of Generative AI Risks in the Public Sector. In Proceedings of the 25th Annual International Conference on Digital Government Research, New York, NY, USA, 11–14 June 2024; pp. 610–617. [Google Scholar] [CrossRef]
  23. Arnesen, S.; Broderstad, T.S.; Fishkin, J.S.; Johannesson, M.P.; Siu, A. Knowledge and support for AI in the public sector: A deliberative poll experiment. AI Soc. 2024, 40, 3573–3589. [Google Scholar] [CrossRef]
  24. Ahmed, A.; Imran, A.S. The role of large language models in UI/UX design: A systematic literature review. arXiv 2025, arXiv:2507.04469. [Google Scholar] [CrossRef]
  25. Akbar, M.A.; Khan, A.A.; Liang, P. Ethical aspects of ChatGPT in software engineering research. IEEE Trans. Artif. Intell. 2023, 6, 254–267. [Google Scholar] [CrossRef]
  26. Hagendorff, T. Mapping the ethics of generative AI: A comprehensive scoping review. Minds Mach. 2024, 34, 39. [Google Scholar] [CrossRef]
  27. Aničin, L.; Stojmenović, M. Bias analysis in stable diffusion and MidJourney models. In Intelligent Systems and Machine Learning; Mohanty, S.N., Garcia Diaz, V., Satish Kumar, G.A.E., Eds.; Springer: Cham, Switzerland, 2023; pp. 378–388. [Google Scholar]
  28. Hillmann, S.; Kowol, P.; Ahmad, A.; Tang, R.; Möller, S. Usability and User Experience of a Chatbot for Student Support. In Proceedings of the Elektronische Sprachsignalverarbeitung 2024, Tagungsband der 35, Konferenz, Regensburg, 6–8 March 2024; pp. 22–29. [Google Scholar]
  29. Pinto, G.; De Souza, C.; Rocha, T.; Steinmacher, I.; Souza, A.; Monteiro, E. Developer Experiences with a Contextualized AI Coding Assistant: Usability, Expectations, and Outcomes. In Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering-Software Engineering for AI, Lisbon, Portugal, 14–15 April 2024; pp. 81–91. [Google Scholar]
  30. van Es, K.; Nguyen, D. “Your friendly AI assistant”: The anthropomorphic self-representations of ChatGPT and its implications for imagining AI. AI Soc. 2024, 40, 3591–3603. [Google Scholar] [CrossRef]
  31. Iranmanesh, A.; Lotfabadi, P. Critical questions on the emergence of text-to-image artificial intelligence in architectural design pedagogy. AI Soc. 2024, 40, 3557–3571. [Google Scholar] [CrossRef]
  32. Al-kfairy, M.; Mustafa, D.; Kshetri, N.; Insiew, M.; Alfandi, O. Ethical challenges and solutions of generative AI: An interdisciplinary perspective. Informatics 2024, 11, 58. [Google Scholar] [CrossRef]
  33. Alabduljabbar, R. User-centric AI: Evaluating the usability of generative AI applications through user reviews on app stores. PeerJ Comput. Sci. 2024, 10, e2421. [Google Scholar] [CrossRef] [PubMed]
  34. Mugunthan, T. Researching the Usability of Early Generative-AI Tools; Nielsen Norman Group: Dover, DE, USA, 2023. [Google Scholar]
  35. Wharton, C.; Rieman, J.; Lewis, C.; Polson, P. The cognitive walkthrough method: A practitioner’s guide. In Usability Inspection Methods; Nielsen, J., Mack, R.L., Eds.; John Wiley & Sons: New York, NY, USA, 1994; pp. 105–140. [Google Scholar]
  36. Nielsen, J.; Mack, R.L. Usability Inspection Methods; Wiley: New York, NY, USA, 1994; ISBN 978-0471018773. [Google Scholar]
  37. Brooke, J. SUS: A “quick and dirty” usability scale. In Usability Evaluation in Industry; Taylor & Francis: London, UK, 1996; pp. 4–7. [Google Scholar]
  38. Preece, J.; Rogers, Y.; Sharp, H. Interaction Design: Beyond Human-Computer Interaction, 4th ed.; John Wiley & Sons: Chichester, UK, 2015. [Google Scholar]
  39. MacKenzie, I.S. Human-Computer Interaction: An Empirical Research Perspective, 1st ed.; Elsevier: Waltham, MA, USA, 2013. [Google Scholar]
  40. Kuppens, D.; Verbert, K. Conversational User Interfaces: A review of usability methods. In Proceedings of the ACM Conference on Human Factors in Computing Systems, Hamburg, Germany, 23–28 April 2023; pp. 1–12. [Google Scholar]
  41. Mittelstadt, B. Principles alone cannot guarantee ethical AI. Nat. Mach. Intell. 2019, 1, 501–507. [Google Scholar] [CrossRef]
  42. Nielsen, J.; Molich, R. Heuristic evaluation of user interfaces. In Proceedings of the Eighth ACM SIGCHI Conference on Human Factors in Computing Systems (CHI ’90), Seattle, WA, USA, 1–5 April 1990; ACM Press: New York, NY, USA, 1990; pp. 249–256. [Google Scholar]
  43. Rubin, J.; Chisnell, D. Handbook of Usability Testing: How to Plan, Design, and Conduct Effective Tests, 2nd ed.; Wiley: Indianapolis, IN, USA, 2008. [Google Scholar]
  44. Nielsen, J. Why You Only Need to Test with 5 Users; Nielsen Norman Group: Dover, DE, USA, 2000; Available online: https://www.nngroup.com/articles/why-you-only-need-to-test-with-5-users/ (accessed on 21 September 2025).
  45. Fereday, J.; Muir-Cochrane, E. Demonstrating rigor using thematic analysis: A hybrid approach of inductive and deductive coding and theme development. Int. J. Qual. Methods 2006, 5, 80–92. [Google Scholar] [CrossRef]
  46. Australian Government, Department of Industry, Science and Resources. In Australia’s AI Ethics Principles; Department of Industry, Science and Resources: Canberra, Australia, 2019. Available online: https://www.industry.gov.au/publications/australias-artificial-intelligence-ethics-principles (accessed on 21 September 2025).
Figure 1. Venn diagram showing studies covering GUI [10,11,24] and ethics [25,26,27] in GenAI applications.
Figure 2. Generative AI applications selection procedure results.
Figure 3. Interfaces of selected generative AI applications.
Figure 4. Task structure for GUI usability evaluation.
Figure 5. Gemini: usability of GUI issues. Each subfigure highlights a specific GUI issue identified during user testing.
Figure 6. ChatGPT: usability of GUI issues.
Figure 7. ChatGPT: usability of GUI issues.
Figure 8. Claude: usability of GUI issues.
Figure 9. Gemini: usability of GUI issues. Each subfigure highlights a specific GUI issue identified during user testing.
Table 1. Participant demographics.
Characteristic | Distribution
Gender | Female: 6; Male: 6
Current Role | Senior Research Scientist: 4; Postdoctoral Researcher: 3; Senior Software Engineer: 3; Research Scientist: 2
Duration in Current Role | 7–10 years: 5; 4–6 years: 2; 1–3 years: 5
Table 2. Task evaluation outcomes: success, partial success, or failure.

Task 1: Open the GenAI application assigned to you and input the given prompt:
Generate a 1000-word blog post on “How to Build Confidence in the Workplace”
Review the response from the generative AI (GenAI) application and identify any issues related to the ‘Visibility of System Status.’ You can either list the issues below or select from the options.
Gemini: Success | ChatGPT: Success | Claude: Success

Task 2: Open the GenAI application assigned to you and input the given prompt:
Generate a 1000-word blog post on “How to Build Confidence in the Workplace”
Review the response from the generative AI (GenAI) application and identify any issues related to the ‘Match between system and the real world.’ You can …
Gemini: Partial Success (reason: the prompt for this task was similar to the previous one, and participants initially questioned the repetition; however, maintaining a consistent prompt allowed for controlled comparison across tasks, ensuring that differences in user interaction were attributable to GUI design and usability rather than variations in the task content) | ChatGPT: Partial Success | Claude: Partial Success

Task 3: Open the GenAI application assigned to you and input the given prompt:
Generate a 1000-word blog post on “How to Build Confidence in the Workplace”
Review the response from the generative AI (GenAI) application and identify any issues related to the ‘User control and freedom.’
Gemini: Success | ChatGPT: Success | Claude: Success

Task 4: Open the GenAI application assigned to you and switch between settings, help, and main task areas. Are the layout, colors, and controls consistent across these sections? …
Gemini: Success | ChatGPT: Success | Claude: Success

Task 5: Open the GenAI application assigned to you and input the given prompt: Can you delete my conversation?
Review the response from the generative AI (GenAI) application and identify any issues related to the ‘Error prevention.’
Gemini: Success | ChatGPT: Success | Claude: Success

Task 6: Open the GenAI application assigned to you. Find and use the feature to regenerate a response. How intuitive and visible was the option? …
Gemini: Success | ChatGPT: Success | Claude: Success

Task 7: Open the GenAI application assigned to you and input the given prompt:
Generate a 1000-word blog post on “How to Build Confidence in the Workplace”
Review the response from the generative AI (GenAI) application and identify any issues related to the ‘Flexibility and efficiency of use.’ …
Gemini: Success | ChatGPT: Partial Success (reason: participants completed the task but reported minor difficulty navigating certain menu options or controls, which slightly affected task efficiency) | Claude: Success

Task 8: Open the GenAI application assigned to you and input the given prompt:
Generate a 1000-word blog post on “How to Build Confidence in the Workplace”
Review the response from the generative AI (GenAI) application and identify any issues related to the ‘Aesthetic and minimalist design.’
Gemini: Success | ChatGPT: Success | Claude: Success

Task 9: Open the GenAI application assigned to you and input the given incomplete prompt:
Generate a 1000-word blog post on
Review the response from the generative AI (GenAI) application and identify any issues related to the ‘Help users recognize, diagnose, and recover from errors.’ …
Gemini: Success | ChatGPT: Success | Claude: Success

Task 10: Open the GenAI application assigned to you and locate the help section or tutorial.
How easy was it to find the relevant documentation or guidance using the interface? …
Gemini: Success | ChatGPT: Success | Claude: Success

Task 11: Open the GenAI application assigned to you and input the given prompt:
Can you guide me on how to write a 1000-word blog post on “How to Build Confidence in the Workplace”
Review the response from the generative AI (GenAI) application and identify any issues related to the ‘Guidance.’ …
Gemini: Success | ChatGPT: Success | Claude: Success

Task 12: Open the GenAI application assigned to you and input the given prompt:
Generate a 1000-word blog post on “How to Build Confidence in the Workplace”
Review the response from the generative AI (GenAI) application and identify any issues related to the ‘Trustworthiness.’ …
Gemini: Success | ChatGPT: Success | Claude: Success

Task 13: Open the GenAI application assigned to you and input the given prompt:
Generate a 1000-word blog post on “How to Build Confidence in the Workplace”
Review the response from the generative AI (GenAI) application and identify any issues related to the ‘Adaptation to growth.’ …
Gemini: Success | ChatGPT: Success | Claude: Success

Task 14: Open the GenAI application assigned to you and input the given prompt:
Generate a 1000-word blog post on “How to Build Confidence in the Workplace”
Review the response from the generative AI (GenAI) application and identify any issues related to the ‘Context relevance.’
Gemini: Success | ChatGPT: Success | Claude: Partial Success (reason: one participant questioned the repetition of the prompt; we explained that the similarity in prompts was intentional, to observe whether the interface supported streamlined interactions, and the participant accepted this explanation and completed the task successfully)
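Read as outcome tallies, Table 2 can be summarized programmatically. The following Python sketch is illustrative only: the encoding of outcomes and all variable names are ours, not part of the study’s materials.

    from collections import Counter

    # Task outcomes per application, transcribed from Table 2.
    # "S" = Success, "P" = Partial Success; Table 2 records no failures.
    outcomes = {
        "Gemini":  ["S", "P", "S", "S", "S", "S", "S", "S", "S", "S", "S", "S", "S", "S"],
        "ChatGPT": ["S", "P", "S", "S", "S", "S", "P", "S", "S", "S", "S", "S", "S", "S"],
        "Claude":  ["S", "P", "S", "S", "S", "S", "S", "S", "S", "S", "S", "S", "S", "P"],
    }

    for app, results in outcomes.items():
        counts = Counter(results)
        rate = counts["S"] / len(results)
        print(f"{app}: {counts['S']} success, {counts['P']} partial, "
              f"success rate {rate:.0%}")

Running this sketch reproduces the headline pattern of Table 2: all three interfaces supported most tasks, with the few partial successes concentrated in Tasks 2, 7, and 14.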
Table 3. Thematic coding for GUI usability issues. This table documents the data synthesis procedure used to transform detailed GUI issues listed in the task evaluation forms into concise initial codes and broader final themes. Interpretation limitation: The entries represent researcher-driven coding of pre-listed issues for methodological clarity; they do not indicate frequency, severity, or participant preference and are not results.
GUI Usability Issues | Initial Codes | Final Themes
The response lacks real-world context or examples (e.g., it is not relatable to workplace scenarios). | Contextual Relevance | Poor Real-World Applicability
The response uses terminology that is not familiar or natural for the intended audience. | Audience Knowledge Misalignment | Poor Audience Appropriateness
The information is presented in an unnatural order (e.g., lacks logical flow or structure). | Sequence of Ideas | Weak Content Organization
The structure of the answer does not match typical real-world content formatting (e.g., no introduction, conclusion, or headings). | Structural Format | Inadequate Structural Formality
The application does not allow easy correction or modification of the input prompt (e.g., no option to edit or resubmit). | Input Modification Difficulty | Limited Input Flexibility
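The synthesis step documented in Table 3 is essentially a lookup from initial codes to final themes. A minimal Python sketch of that mapping follows; the dictionary transcribes Table 3, while the function and its fallback label are our own illustrative additions.

    # Illustrative codebook for the synthesis documented in Table 3:
    # each initial code resolves to exactly one final theme.
    CODEBOOK = {
        "Contextual Relevance": "Poor Real-World Applicability",
        "Audience Knowledge Misalignment": "Poor Audience Appropriateness",
        "Sequence of Ideas": "Weak Content Organization",
        "Structural Format": "Inadequate Structural Formality",
        "Input Modification Difficulty": "Limited Input Flexibility",
    }

    def theme_for(initial_code):
        """Map an initial code to its final theme, flagging unknown codes."""
        return CODEBOOK.get(initial_code, "UNCODED (needs researcher review)")

    print(theme_for("Sequence of Ideas"))  # -> Weak Content Organization

As the table’s caption stresses, such a codebook records only the researcher-driven mapping; it carries no frequency, severity, or preference information.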
Table 4. GUI usability issues: Gemini, ChatGPT, and Claude.
H1: Visibility of system status
Gemini: No time indication for task completion—2 selections; No indicators—1 selection; No issues—1 selection.
ChatGPT: No time indication for task completion—4 selections; No issues—1 selection.
Claude: No time indication for task completion—3 selections; No issues—1 selection.

H2: Match between system and the real world
Gemini: Poor Real-World Applicability—2 selections; Long Response—2 selections.
ChatGPT: Poor Real-World Applicability—2 selections; No issues—2 selections.
Claude: No issues—4 selections.

H3: User control and freedom
Gemini: Limited Input Flexibility—3 selections; Restricted Interaction Control—3 selections; Absence of Clear Reset Functionality—3 selections; Inability to Interrupt Output Generation—3 selections.
ChatGPT: Lack of revise functionality—2 selections; Absence of Clear Reset Functionality—4 selections; Limited Flexibility in Topic Refinement—1 selection; Inability to Interrupt Output Generation—1 selection.
Claude: Absence of Clear Reset Functionality—3 selections; Inability to Interrupt or Cancel Output Generation—1 selection.

H4: Consistency and standards
Gemini: No issues—4 selections.
ChatGPT: Inconsistent Visual Alignment—1 selection; Absence of Accessible Help and Navigation Guidance—1 selection; No issues (good consistency)—2 selections.
Claude: Inconsistent Typography—1 selection; No issues—3 selections.

H5: Error prevention
Gemini: Absence of Undo/Redo Functionality—2 selections; Insufficient Error Prevention—1 selection; Lack of Safeguards Against Critical Actions—1 selection; Inconsistent Communication—3 selections.
ChatGPT: Absence of Undo/Redo Functionality—4 selections; Insufficient Preventive Guidance and Error Mitigation—3 selections; Lack of Safeguards Against Critical Actions—2 selections; Lack of Action Confirmation for Irreversible Tasks—2 selections; Absence of Consequence Warnings for Critical Actions—2 selections.
Claude: Absence of Undo/Redo Functionality—2 selections; Insufficient Preventive Guidance and Error Mitigation—2 selections; Insufficient Error Prevention—3 selections; Lack of Action Confirmation for Irreversible Tasks—2 selections; Absence of Consequence Warnings for Critical Actions—2 selections.

H6: Recognition rather than recall
Gemini: Lack of easy access to features—2 selections; Lack of visible shortcuts—2 selections.
ChatGPT: Lack of visible shortcuts—3 selections; Reliance on Memory for System Interaction—1 selection.
Claude: No issues—4 selections.

H7: Flexibility and efficiency of use
Gemini: Lack of Efficiency Features for Experienced Users; Lack of customization options for responses—2 selections; Absence of Mode Switching—2 selections.
ChatGPT: Lack of Efficiency Features for Experienced Users—4 selections; Lack of customization options for responses—4 selections; Lack of easy-to-access help for new users—3 selections; Absence of Mode Switching for User Customization—2 selections; Overly Simplistic Design Limiting Advanced User Efficiency—2 selections.
Claude: Absence of Mode Switching for User Customization—2 selections; No issues—2 selections.

H8: Aesthetic and minimalist design
Gemini: Cluttered Response Design—1 selection; Good aesthetic and minimalist design—3 selections.
ChatGPT: Cluttered Response Design Affecting Readability—1 selection; Good aesthetic and minimalist design—3 selections.
Claude: Cluttered Response Design Affecting Readability—2 selections; Good aesthetic and minimalist design—2 selections.

H9: Help users recognize, diagnose, and recover from errors
Gemini: Insufficiently Detailed Error Messages—1 selection; Delayed Error Notification—1 selection; Long response with unnecessary details—1 selection; No issues—2 selections.
ChatGPT: Lack of Proactive Error Prevention Guidance—1 selection; No issues—3 selections.
Claude: No issues—4 selections.

H10: Help and documentation
Gemini: The provided help is too generic—3 selections; No issues—1 selection.
ChatGPT: Difficult to find help documentation—4 selections; Absence of Clear Usage Instructions—2 selections; Lack of help or tutorial options, lack of easily accessible support resources, and lack of readily available help—1 selection.
Claude: Lack of help and tutorial options—2 selections; Absence of Clear Usage Instructions—2 selections; Lack of easily accessible support resources—2 selections.

H11: Guidance
Gemini: Irrelevant guidance—2 selections; No issues (good guidance)—2 selections.
ChatGPT: Lack of Post-Input Guidance—1 selection; No issues—3 selections.
Claude: No issues—4 selections.

H12: Trustworthiness
Gemini: Lack of references for the provided responses—2 selections; Lack of transparency—2 selections.
ChatGPT: Lack of references for the provided responses—3 selections; Lack of transparency—3 selections; Unreliable Content Due to Lack of Factual Accuracy—3 selections; Generates unverified information—1 selection.
Claude: Lack of transparency—3 selections; Lack of references for the provided responses—3 selections; Unreliable Content Due to Lack of Factual Accuracy—1 selection; Lack of Transparency in Data Handling—1 selection.

H13: Adaptation to growth
Gemini: Lack of Adaptive Interface for User Proficiency—3 selections; One-Size-Fits-All Guidance Approach—2 selections; Lack of Adaptive Interface Design—2 selections; Absence of Dynamic User Feedback Mechanism—2 selections.
ChatGPT: One-Size-Fits-All Guidance Approach—4 selections; Lack of Adaptive Interface for User Proficiency—4 selections; Lack of Adaptive Interface Design—3 selections; Absence of Dynamic User Feedback Mechanism—3 selections; Inflexibility in Evolving User Permissions—2 selections.
Claude: One-Size-Fits-All Guidance Approach—4 selections; Inflexibility in Evolving User Permissions—2 selections; Lack of Adaptive Interface for User Proficiency—2 selections; Lack of Adaptive Interface Design—2 selections; Absence of Dynamic User Feedback Mechanism—1 selection.

H14: Context relevance
Gemini: Generic Content Output Lacking Specificity—2 selections; Shallow or Superficial Responses—2 selections.
ChatGPT: Generic Content Output Lacking Specificity—4 selections; Shallow or Superficial Responses—3 selections.
Claude: Generic Content Output Lacking Specificity—3 selections; Shallow or Superficial Responses—3 selections; No issues—1 selection.
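Because Table 4 records issue counts per heuristic and application, totals are easy to recompute. The Python sketch below is ours and illustrative only; it is limited to H5 (error prevention), with the counts transcribed directly from the table.

    # Selection counts for heuristic H5 (error prevention), from Table 4.
    h5_counts = {
        "Gemini":  {"Absence of Undo/Redo Functionality": 2,
                    "Insufficient Error Prevention": 1,
                    "Lack of Safeguards Against Critical Actions": 1,
                    "Inconsistent Communication": 3},
        "ChatGPT": {"Absence of Undo/Redo Functionality": 4,
                    "Insufficient Preventive Guidance and Error Mitigation": 3,
                    "Lack of Safeguards Against Critical Actions": 2,
                    "Lack of Action Confirmation for Irreversible Tasks": 2,
                    "Absence of Consequence Warnings for Critical Actions": 2},
        "Claude":  {"Absence of Undo/Redo Functionality": 2,
                    "Insufficient Preventive Guidance and Error Mitigation": 2,
                    "Insufficient Error Prevention": 3,
                    "Lack of Action Confirmation for Irreversible Tasks": 2,
                    "Absence of Consequence Warnings for Critical Actions": 2},
    }

    for app, issues in h5_counts.items():
        # Sum the participant selections recorded for each issue type.
        print(f"{app}: {sum(issues.values())} selections "
              f"across {len(issues)} issue types")

Under this tally, ChatGPT accumulates the most H5 selections (13), followed by Claude (11) and Gemini (7), consistent with the insufficient error prevention noted across all three applications.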
Table 5. Heuristics and AI ethics principles.
Heuristics | Mapped Australia’s AI Ethics Principles | Explanation
Visibility of system status | Transparency and explainability | Systems that provide clear feedback about their processes support transparency by informing users about the state and actions of the system, fostering explainability.
Match between system and the real world | Human-centered values; fairness | Ensuring the system uses language and concepts familiar to users supports inclusiveness and respects user diversity, aligning with fairness and human-centered design.
User control and freedom | Human-centered values; contestability | Providing users control over system actions aligns with contestability, allowing users to challenge or modify outcomes.
Consistency and standards | Reliability | Adherence to design standards enhances reliability by ensuring predictable and dependable system behavior.
Error prevention | Reliability and safety; privacy protection and security | Preventing errors minimizes the risk of harmful or unintended outcomes, contributing to system reliability and safeguarding user privacy.
Recognition rather than recall | Human-centered values; fairness | Reducing cognitive load promotes inclusiveness and accessibility, which align with fairness and human-centered values.
Flexibility and efficiency of use | Fairness; human-centered values | Providing flexibility accommodates diverse user needs, ensuring fairness and inclusivity.
Aesthetic and minimalist design | Transparency and explainability; human-centered values | Simplified designs improve understanding and accessibility, fostering transparency and supporting user-centered values.
Help users recognize, diagnose, and recover from errors | Reliability and safety; contestability | Assisting users in managing errors ensures system safety and aligns with contestability by empowering users to address issues.
Help and documentation | Transparency and explainability; human-centered values; fairness | Comprehensive help resources support transparency and fairness by equipping users with the knowledge needed for effective interaction.
Guidance | Human, societal and environmental well-being; transparency and explainability; accountability | Offering clear guidance ensures the system promotes well-being, maintains transparency, and upholds accountability.
Trustworthiness | Privacy protection and security; transparency and explainability; accountability | Ensuring trustworthiness addresses critical ethical concerns, including data privacy, transparency, and responsibility.
Adaptation to growth | Human-centered values | Designing systems to adapt to user growth supports inclusivity and ongoing relevance.
Context relevance | Human, societal, and environmental well-being; fairness; transparency | Aligning system functionality with context ensures fairness and promotes societal and environmental well-being.
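The heuristic-to-principle mapping in Table 5 can also be treated as a data structure, for example when tracing which heuristics operationalize a given principle during a GUI audit. A partial, illustrative Python encoding follows; the dictionary entries transcribe rows of Table 5, while the helper function and variable names are ours.

    # Partial encoding of Table 5: usability heuristics mapped to
    # Australia's AI Ethics Principles. Remaining rows follow the same pattern.
    HEURISTIC_TO_PRINCIPLES = {
        "Visibility of system status": ["Transparency and explainability"],
        "User control and freedom": ["Human-centered values", "Contestability"],
        "Error prevention": ["Reliability and safety",
                             "Privacy protection and security"],
        "Trustworthiness": ["Privacy protection and security",
                            "Transparency and explainability", "Accountability"],
    }

    def principles_for(heuristic):
        """Return the principles mapped to a heuristic (empty list if unmapped)."""
        return HEURISTIC_TO_PRINCIPLES.get(heuristic, [])

    # Invert the mapping: which heuristics operationalize a given principle?
    transparency_heuristics = [
        h for h, ps in HEURISTIC_TO_PRINCIPLES.items()
        if "Transparency and explainability" in ps
    ]

    print(principles_for("Error prevention"))
    print(transparency_heuristics)

Encoding the mapping this way makes the many-to-many relationship between heuristics and principles explicit and easy to query in both directions.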
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
