Usability Evaluation of Wearable Smartwatches Using Customized Heuristics and System Usability Scale Score

Abstract: The mobile and wearable nature of smartwatches poses challenges in evaluating their usability. This paper presents a study employing customized heuristic evaluation and the system usability scale (SUS) on four smartwatches, along with their mobile applications. A total of 11 heuristics were developed and validated by experts by combining Nielsen's heuristics and Motti and Caine's heuristics. In this study, 20 participants used the watches and participated in the SUS survey. A total of 307 usability issues were reported by the evaluators. The results of this study show that the Galaxy Watch 5 scored highest in terms of efficiency, ease of use, features, and battery life compared to the other three smartwatches and has fewer usability issues. The results indicate that ease of use, features, and flexibility are important usability attributes for future smartwatches. The Galaxy Watch 5 received the highest SUS score of 87.375. Both evaluation methods showed no significant differences in results, and customized heuristics were found to be useful for smartwatch evaluation.


Introduction
The relationship between humans and technology is best exemplified through wearable technology, which increases human potential through the provision of smart gadgets that can help to monitor and track body parameters. Wearable devices provide continuous monitoring of an individual's health, activity, and fitness, which has significant potential for enhancing human activities and standard of living. There are sensors embedded in these devices that gather personal information and exchange this information through Wi-Fi, Bluetooth, cellular technology, etc. [1]. Wearable technologies offer a wide range of functionalities for unobtrusive health monitoring, such as heart rate tracking, blood sugar tracking, blood pressure analysis, gait analysis, sleep score, as well as a variety of other health factors. These devices are the subject of extensive research [2] and can measure physical activity, such as steps walked, calories burnt, or workout intensity, using a bracelet-like gadget worn on the wrist. Then, the data are transmitted to a mobile application, either wirelessly through Bluetooth synchronization or through the connection of the device to a smartphone, where objectives, progress, and activity may be recorded [3]. The most widely used wearable devices are wrist-worn smartwatches, which can receive and send notifications, messages, or calls. For example, the Apple Watch series can receive incoming calls and has a comparatively large screen, but is more expensive than other smartwatches. The mainstream wrist-worn devices include the Apple Watch, Samsung Gear S, Mi Band, Huawei Honor, Fitbit Surge, and Jawbone Up3, with many others available for purchase.
Currently, the major aspects of wearable devices are extensively researched, such as the dependability and accuracy of evaluation. Recently, researchers have raised concerns regarding the long-term usage of wearable devices, highlighting the importance of combining behavioral change approaches such as goal setting, feedback, and rewards with other evidence-based techniques [4]. In addition, whilst wearable technology has a wide range of potential applications, users do face usability issues with these devices [5]. Furthermore, studies have found that the acceptability of wearable items for regular consumers, as well as their comfort level, is critical to a product's success [6]. For these reasons, it is necessary to identify and address the usability issues of wearable devices because the success of these devices relies on users' product experience. As user experience is one of the core aspects contributing to a product's popularity, it is crucial that the device performs its functionality in an easy, comfortable, and intuitive way. Therefore, continuous usability testing is essential to observe users as they interact with a product or prototype and ensure a satisfactory user experience, contributing to the widespread and quick adoption of wearable devices.
Usability is a multifaceted concept that can be explored in various ways. The term "usability" has been used extensively for the past few decades, and various individuals interpret it differently. Some people associate usability with ease of use or convenience and examine it from the standpoint of an interactive interface; meanwhile, others refer to usability as a conceptual scale for the real-time evaluation of a product's functionality obtained from feedback from potential users [7]. Regardless of the testing paradigm, usability testing assesses a product's ability to fulfill its intended functions. Food, consumer items, websites or web applications, computer interfaces, papers, and electronics are all examples of products that benefit from usability testing.
There are various factors in developing usability criteria. According to Nielsen, usability has five key characteristics: learnability, satisfaction, efficiency, low mistake rate/quick error recovery, and memorability [8]. The International Organization for Standardization (ISO) describes the effectiveness and efficiency of, and satisfaction with, a product as usability parameters [9]. Shackel defined usability attributes as effectiveness, learnability, flexibility, and user attitude [10]. According to Hix and Hartson, the usability parameters are initial performance, long-term performance, learnability, retainability, advanced feature usage, first impressions, and long-term user satisfaction [11]. Furtado defined the usability parameters as ease of use and learning [12]. The parameters defined by ISO and Nielsen are commonly used for usability evaluation [13]. There are several usability evaluation techniques. Any approach or technique used to perform usability evaluation or testing to enhance the usability of an interactive system at any point of its development is known as a usability evaluation method (UEM). Several usability testing methods have been introduced, such as laboratory-based formative assessment with users, heuristics, questionnaires, and other expert-based usability evaluation techniques, model-based analytic approaches, all types of expert assessment, and the remote evaluation of interactive software after field deployment.
The factors that influence the adoption of smart wearable devices among potential customers are wearability, ease of use, compelling design, functionality, and price. Wearability refers to the existence of pain, degree of comfort, ease in wearing, requiring support to wear, or willingness to wear again. Meanwhile, ease of use means no interference in daily activities, comprehensibility, and learnability. Compelling design means visually appealing. Functionality refers to the core features. Although conventional usability evaluation methods such as thematic analysis, heuristic evaluation, interviews, and think-aloud and cognitive walkthroughs can be used, the main shortcoming of these methods is that the findings in a laboratory can sometimes be difficult to illustrate [14]. Alternatively, the Agency for Healthcare Research and Quality has suggested that questionnaires can be effectively used for usability evaluation [15]. The most widely used questionnaires are the system usability scale (SUS), UTAUT, and the post-study system usability questionnaire (PSSUQ), which cover use, satisfaction, ease, etc.
The dependability and accuracy evaluation of wearable devices are currently the focus of extensive research. Concerns have been raised relating to the use of wearable devices, and researchers have indicated the importance of combining behavioral change approaches such as goal setting, feedback, and rewards alongside other evidence-based techniques [15]. A mixed response to adopting wearable devices has also been observed, with the response of users being less positive than expected [5]. In addition, whilst wearable technology represents a significant range of potential applications, usability issues still remain to be solved by research [5]. Two key features have also been identified that are critical for a product's success: the acceptability of the items to regular consumers and their comfort level [6].
The usability issues of wearable devices thus remain to be solved, as their success depends on users' experience; it is critical that a wearable device is easy and intuitive to use and comfortable to wear. As a result, continuous usability testing is essential for manufacturers to ensure the widespread and rapid adoption of their particular product. This testing observes users interacting with the product or prototype and is a key factor in ensuring a satisfactory user experience.
The key aim of this paper is to investigate the usability and other issues relating to the existing commercially available wearable smartwatches. The usability will be analyzed based on the customized heuristic evaluations and the system usability scale (SUS). These usability evaluations of smartwatches will provide a comprehensive, user-specific, and customized approach to measuring usability, which will provide quantitative and statistical data for benchmarking and further design improvements.

Literature Review
Smartwatches have received significant attention for their versatility, which satisfies a wide range of consumer interests, including fitness and health monitoring [4]. Other than basic health and fitness features, smartwatches offer a wide range of features varying from person to person, such as real-time vital signs and overall health monitoring in senior people with Parkinson's, heart disease, or other chronic conditions. In addition, they may capture both crucial and trivial data regarding patient location and behavior more quickly and precisely [16]. According to recent IDC surveys on wristwatch usage, the industry will continue to expand exponentially, with 373 million units expected to be shipped in 2020, up from 100 million in 2016. In 2016, smartwatches made up 25% of all smart wearables. Although smartwatches have a wide range of functionalities, they still face technical and usability challenges, such as aged people being less familiar with new technologies or ease of use or comfort issues for patients [17]. The usability testing needs to examine how consumers really use their smartwatches rather than observing how they were intended to be used. The usability test aims to understand consumers' requirements of these devices through the examination of usability concerns connected to the real tasks that users undertake using their smartwatches [12].
Despite many studies exploring the usability, perceived value, and role of smartwatches in patient monitoring, only limited research has evaluated the relationship of usability and brand factors influencing usability using different usability evaluation methods. This section provides a summary of the fragmented research on wearable technology. Y. Wu et al. [18] proposed a novel method for evaluating a smartwatch's usability based on eye movement tracking. The eye tracker recorded the testers' eye movements, and the eye movement data were added to the system for calculating the usability rating index. In the study, 10 participants were asked to perform specific tasks with Motorola 360 smartwatches and were interviewed afterwards. The results of the task test showed that eye movement data can accurately assess the icons on smartwatch interfaces and illustrate how users search for certain features. J. Chun et al. [19] completed a study to identify challenges in smartwatch usability using in-depth interviews. Many users appeared to find mobile devices convenient for checking pushed alerts. However, smartwatches obtained a poor response when individuals tried to interact with visually rich material.
According to the study, individuals like to use smartwatches to listen to music and check the weather. The study of N. Anggraini et al. [20] observed how usability, brand, and price influenced customers' impressions of smartwatches in Indonesia. In order to conduct the study, 116 Indonesian respondents were surveyed. The participants were less concerned with usability and instead focused on brands and pricing when making their purchases. Most of them did not consult evaluations and recommendations from others when they already had a brand in mind. However, one of the shortcomings was the limited number of participants involved, as with a greater number of participants, the results could be generalized to Indonesia's smartwatch users.
M. Bang et al. [21] designed a nurses' watch app with the Motorola 360 smartwatch in order to automate patient monitoring and checklist systems. The usability of the IT systems, including complexity, training, and need for support for the user interface, was evaluated using the system usability scale (SUS). The mean SUS score was only average because the question on comfort and security from the SUS questionnaire received very poor marks. Neuropsychiatric diseases are the primary cause of disability globally, but current mental health monitoring methods rely on subjective DSM-5 specifications; however, developments in EEG and video monitoring technology have not been extensively embraced because of inconvenience. Kamdar and Wu [22] presented a novel platform, the Passive, Real-time Information for Sensing Mental Health (PRISM), through the integration of heart rate, light, and motion data from a smartwatch application and text input from a web application. The SUS questionnaire was used to evaluate its usability, and a total of 13 healthy participants were asked to wear the Samsung Gear S smartwatch for usability evaluation. The SUS questionnaire demonstrated that participants had a positive attitude toward PRISM. The participants said that the system was simple to use, requiring little expertise or training; however, they lacked the motivation to use it regularly.
In the study of C.R. Laborde et al. [23], the authors evaluated user satisfaction, usability, and compliance with the help of a real-time, online assessment and mobility monitoring (ROAMM) mobile app designed for smartwatches. In the study, 28 participants were asked to wear smartwatches and fill out a standardized questionnaire. The ROAMM wristwatch app received high marks from older people with knee osteoarthritis, indicating their satisfaction with the app. The condition of atrial fibrillation (AF) is difficult to diagnose since it often presents with mild symptoms. Smartwatches can be used for long-term, non-invasive monitoring, which could improve AF care. The main objective of the study of E.Y. Ding et al. [24] was to evaluate the efficacy of arrhythmia discrimination using a wristwatch. A total of 40 participants were observed, and a questionnaire was completed evaluating several aspects of the device's usability. A real-time algorithm was used to analyze the pulse recordings. The results showed that the majority of participants thought the smartwatch was very usable. The general level of comfort and the level of data privacy when wearing a wristwatch for rhythm monitoring were both positively correlated with younger age and past cardioversion, respectively. The participants deemed the smartwatch to be extremely acceptable despite their age, lack of knowledge of smartwatches, and a significant load of comorbidities. Although smartwatches may be potential tools for atrial fibrillation identification, elderly stroke patients have not been given enough attention.
E. Dickson et al. [25] introduced the pulse watch, a smartwatch-based AF detection system, and evaluated its precision, usability, and adherence in stroke patients. The participants filled out questionnaires to evaluate numerous psychosocial factors and health-related behaviors. The results demonstrated that older stroke patients found the scheme useful and would stick to the monitoring schedule. Smartwatches have also been demonstrated to accurately assess blood pressure in glaucoma patients; however, their usability evaluation has been neglected in previous research. For this purpose, S.B. Bhanvadia et al. [26] conducted an experiment where adult participants received a wristwatch blood pressure monitor for indoor monitoring using the Omron BP monitor and an associated mobile app. Usability testing methodologies included the post-study system usability questionnaire (PSSUQ), which was used for assessing aspects of user satisfaction such as the overall system usefulness, information quality, and interface quality, and the system usability scale (SUS), which was used to assess the overall usability of smartwatches. Furthermore, usability on the basis of age, gender, and race was also evaluated. The usability evaluations demonstrated that the smartwatches had satisfactory usability ratings, although older age was linked to lower levels of perceived usability and user experience. Table 1 shows the current state-of-the-art methods to evaluate the usability of smartwatches.
The interface of smartwatches is one of the main factors contributing to a satisfactory user experience. Everything visualized on the screen should be clear and self-explanatory so that users do not become confused. However, this is limited by its size. The real dilemma is to identify usability issues while improving the user experience for smartwatches. The buyers' judgment of the level of satisfaction for a certain product might vary. Smartwatches are particularly adapted to record the variances in mood, exhaustion, sleep quality, etc., remotely and conveniently as experienced by users. These are the factors that influence the usability of smartwatches, and a single heuristic method alone may be insufficient when evaluating the usability of smart devices, which also require more context-specific heuristics. However, as Table 1 shows, researchers have infrequently used heuristics for the usability evaluation. In addition, there is only one evaluation method in each of the studies; however, if we conduct usability evaluations using more than one method, we may reveal more smartwatch usability issues. Therefore, in this study, we sought to evaluate the usability of four different smartwatches: the Samsung Galaxy Watch 5, Samsung Galaxy Watch 4, Fitbit Charge 5, and Fitbit Versa 2. Our main contributions are the following:

• We used two evaluation methods for the usability evaluation of these watches to find the technical issues, design issues, etc., in the real world. One is heuristic evaluation with customized heuristics, and the other is using the SUS score. The details are provided in Sections 3 and 4.
• We developed a total of 11 customized heuristics through a combination of Nielsen's heuristics for the user interface evaluation of watches with Motti and Caine's [27] heuristics, which are specifically utilized for the usability evaluation of wearable devices, i.e., smartwatches.
• A total of 10 usability evaluators used the smartwatches for 10 days and completed the customized survey. Alongside the customized heuristics questionnaire, a further 20 users completed the SUS (system usability scale) questionnaire.

Heuristic Evaluation
Heuristic evaluation, proposed by Nielsen and Molich, is a method of usability engineering that employs standard heuristics to evaluate smart devices' usability and to identify design problems [28]. Its classical form involves multiple evaluators exploring and investigating the product separately and listing the problems encountered while interacting with the system. Finally, they assign a severity score to each problem and present a report with suggested solutions to the identified problems [28]. Nielsen and Molich's heuristics are a common, effective tool for usability evaluation, are widely recognized, and have been successfully applied to various digital interfaces, desktop applications, and website interfaces [28].
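The classical procedure described above (independent evaluators, a severity score per problem, and a merged report) can be illustrated with a minimal Python sketch. The findings, the evaluator labels, and the 0 to 4 severity range are hypothetical and chosen only for illustration; the sketch also assigns a single heuristic to each problem, which is a simplification.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical findings: (evaluator, problem, violated heuristic, severity).
# The 0-4 severity range is an assumption made for this example.
findings = [
    ("E1", "Small touch targets on watch face", "Ease of use", 3),
    ("E2", "Small touch targets on watch face", "Ease of use", 4),
    ("E2", "Sync status not shown in app", "Visibility of system status", 2),
    ("E3", "Sync status not shown in app", "Visibility of system status", 3),
    ("E3", "Small touch targets on watch face", "Ease of use", 3),
]


def merge_findings(findings):
    """Merge independent evaluator lists into one report, most severe first."""
    severities = defaultdict(list)
    heuristic_of = {}
    for _evaluator, problem, heuristic, severity in findings:
        severities[problem].append(severity)
        heuristic_of[problem] = heuristic
    report = [
        {
            "problem": problem,
            "heuristic": heuristic_of[problem],
            "evaluators": len(scores),  # how many evaluators reported it
            "mean_severity": round(mean(scores), 2),
        }
        for problem, scores in severities.items()
    ]
    return sorted(report, key=lambda row: row["mean_severity"], reverse=True)


report = merge_findings(findings)
for row in report:
    print(row)
```

Averaging is only one possible way to combine severity ratings; a median or a consensus discussion among evaluators would fit the same structure.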
To qualify the usability problems identified in the questionnaires, various research has employed a severity scale that graded the magnitude of the identified usability concern, making it possible to detect problems that prevent smart devices' proper functioning [29]. Figure 1 shows Jakob Nielsen's widely recognized set of 10 usability principles that are used to improve user interfaces by identifying usability problems [30]. These heuristic principles were developed by Nielsen and Molich through the introduction of heuristic evaluation for usability inspection [31]. Some researchers have suggested, however, that Nielsen and Molich's heuristics may be insufficient or unsuitable for evaluating smartwatches' usability [32], as the screen size, input capability, battery life, and interaction time of wearable interfaces are limited, yet their heuristics can provide a starting point in the usability evaluation to supplement heuristics or guidelines specifically tailored to smartwatch interfaces. In 2014, Motti and Caine designed a specific heuristic for wearable devices with a human-centered focus [27]. Their heuristics also consider the aesthetics of smartwatch interfaces, which are crucial for user acceptance and satisfaction. Motti and Caine's 20 heuristics for human-centered wearable devices are mostly applicable to the tested device [27]. One or more heuristics can be attributed to each problem identified. Table 2 shows the 20 principles enumerated by Motti and Caine.
Aesthetics embrace the attractiveness of any wearable object [33]; an attractive design improves its desirability [34]. Affordance relates to a device's intuitiveness in terms of physical interactions between the user and the device [35]. Comfort requires an acceptable temperature, texture, shape, weight, and tightness and implies freedom from pain and discomfort [36]. After wearing a device for some time, users should be adequately comfortable and no longer feel it [37]. Contextual awareness, embracing the scenarios in which wearable devices are used, must be clearly understood and considered during the design process.
Comfort is strongly affected by a device's purpose [38] and varies significantly by social context. Humans vary in shape, size, and dimensions, as well as in preferences, interests, and wishes. Thus, wearable devices' look and feel should allow customization, accounting for factors such as users' sensibilities, wishes, and interests [39]. Ease of use recognizes the need for a simple, straightforward, intuitive interface [38], which enhances device usability and increases user engagement. Input and output interfaces should be easy to use [40]. Ergonomy refers to a device's physical shape, constraints, ergonomic aspects related to bodily anatomy, and how users perceive it [41].
Fashion includes the perception of a wearable device's desirability [38]. In other words, it indicates how stylish the technology is and contributes to its becoming more (or less) ubiquitous. Intuitiveness describes how interactions occur, such as those involving the existing buttons, keys, commands, and features [38]. This heuristic applies the concept of affordance to the cognitive aspects of interactions. Physiological sensors have various degrees of intrusiveness, which may involve using body tissue to diagnose a given physiological state or condition. Devices that are non-intrusive are often obtrusive and, to some extent, cumbersome. Devices should be anatomically transparent [27], allowing natural body movements, and the design should duly consider the human body's anatomical characteristics and constraints.
Unlike technology, which has continually grown in capacity, humans have a finite processing capacity and can perform only a limited number of concurrent activities before experiencing cognitive overload, posing a distinct challenge to designers of wearable devices. A mobile interface that does not account for human cognitive capabilities in its design may hinder a user's primary task [38]. Privacy refers to how confidential interactions can be performed when using the device [42]. Reliability describes users' extent of trust and confidence in a device [36]. Resistance involves understanding the context in which a wearable device is used so as to improve its resistance to specific types of wear; this heuristic helps practitioners identify acceptable levels of resistance, with special consideration given to impact, temperature, humidity, flexure, and laundering. To ensure durability, devices must stand up to wearing and cleaning [37].
Users tend to be less patient on the move than at a desk, making it imperative to give them feedback in near real-time, thus offering outstanding system responsiveness [38]; users' efficiency and productivity are enhanced when they can be responsive to their tasks [39]. Satisfaction concerns how a device meets users' expectations, wishes, and requirements. A device's simplicity embraces ease of use, intuitiveness, and affordance, enabling users to interact more efficiently through straightforward interaction options and necessary feedback [38]. The principles of minimalistic design are respected by including only those features and interaction options that are fundamental to accomplishing available tasks. Subtlety refers to the transparency of the device's communication; for example, notifications for the device's owner should not disturb bystanders. In other words, notifications should not cause social problems [43]. User-friendliness accommodates the mental model of the end user, proposing options that easily and intuitively facilitate interaction. Recovery from errors should be possible [36]. Wearability considers objects' physical shapes and devices' active relationships with the human form [38].
The combination of Nielsen and Molich's and Motti and Caine's heuristics has helped to categorize and formulate better solutions to smartwatches' usability problems.

SUS Score
The system usability scale (SUS) is a widely used, standardized questionnaire that assesses perceived usability. In 2009, researchers reported that 43% of post-study questionnaires used in industrial usability studies incorporated the SUS [44]. Survey respondents rate statements about the system's complexity and whether they believe training or support is needed to use it effectively. According to Brooke [45], this simple, 10-item posttest questionnaire quickly assesses a product's usability without requiring complicated analysis. Table 3 presents the questionnaire.
1. I think that I would like to use this system frequently.
2. I found the system unnecessarily complex.
3. I thought the system was easy to use.
4. I think that I would need the support of a technical person to be able to use this system.
5. I found the various functions in this system were well integrated.
6. I thought there was too much inconsistency in this system.
7. I would imagine that most people would learn to use this system very quickly.
8. I found the system very cumbersome to use.
9. I felt very confident using the system.
10. I needed to learn a lot of things before I could get going with this system.
In general, the SUS can quickly, easily, and comprehensively evaluate usability. The SUS comprises 10 closely related questions on a 5-point Likert scale, of which questions 1, 3, 5, 7, and 9 are positive, and 2, 4, 6, 8, and 10 are negative. A higher SUS score implies better product usability.
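As an illustration of how the positive and negative items combine into a single score, the following Python sketch applies the standard SUS scoring rule described by Brooke: each positive (odd-numbered) item contributes its response minus 1, each negative (even-numbered) item contributes 5 minus its response, and the sum is multiplied by 2.5 to give a value between 0 and 100. The function names and the sample responses are hypothetical and are not data from this study.

```python
from statistics import mean


def sus_score(responses):
    """Return the 0-100 SUS score for one participant's 10 responses (each 1-5)."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    total = 0
    for item, response in enumerate(responses, start=1):
        if not 1 <= response <= 5:
            raise ValueError("responses must be on a 1-5 Likert scale")
        if item % 2 == 1:
            total += response - 1  # positive items: 1, 3, 5, 7, 9
        else:
            total += 5 - response  # negative items: 2, 4, 6, 8, 10
    return total * 2.5


def mean_sus(all_responses):
    """Average the per-participant SUS scores for one device."""
    return mean(sus_score(r) for r in all_responses)


# Hypothetical responses from three participants for one smartwatch.
sample = [
    [5, 1, 5, 1, 5, 1, 5, 1, 5, 1],
    [4, 2, 4, 2, 4, 2, 4, 2, 4, 2],
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
]
print(mean_sus(sample))
```

Scoring each participant first and averaging afterwards keeps the per-participant scores available for later statistical comparison between devices.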
Evaluating the usability criteria requires expertise in both the customization of usability principles and their specific domain for wearable smartwatches. So, evaluators need a thorough understanding of all the challenges and considerations that should be kept in mind to develop effective heuristics. In turn, this involves extensive research, analysis, and iteration to ensure the heuristics accurately reflect the requirements and expectations of smartwatch users. Moreover, participants involved in the heuristics should be representative of the target user population, i.e., those who are professionals or fitness enthusiasts, which is a challenging task. Thus, data acquisition, analysis, and interpretation form a complex problem that is addressed in this study through careful planning and comprehensive analysis of the evaluation methods.

Proposed Methodology
In this section, we outline our research design and describe the study conducted for the evaluation of smartwatch usability. The field of human-computer interaction (HCI) has been using usability evaluation methods since the early 1980s, spurred on by the need to improve the usability of smart devices and applications. Usability testing methods include field studies, laboratory experiments, expert-based inspection methods, etc. [46]. There is only scarce research that compares the evaluation methods for smartwatches; however, there have been studies investigating wearable usability using different methods. Therefore, in this research, we test the usability of four smartwatches using two usability evaluation methods. We conducted a heuristic evaluation and used the system usability scale to evaluate the usability of the four smartwatches, comprising the Samsung Galaxy Watch 4, Samsung Galaxy Watch 5, Fitbit Charge 5, and Fitbit Versa 2, along with their mobile applications: the Galaxy Wearable (Samsung Gear) app for the Samsung watches and the Fitbit mobile application for the Fitbit watches. For the heuristic evaluation, we used customized heuristics combining Nielsen's 10 heuristics [47] and Motti and Caine's 20 heuristics [27,28] designed and validated by HCI experts.

Smartwatches and Applications to Be Evaluated
In this study, we use the following four watches to investigate their usability with their mobile applications:
1. Samsung Galaxy Watch 4, Galaxy Wearable (Samsung Gear) app;
2. Samsung Galaxy Watch 5, Galaxy Wearable (Samsung Gear) app;
3. Fitbit Charge 5, Fitbit application;
4. Fitbit Versa 2, Fitbit application.
The Samsung Galaxy Watch 4 (Figure 2a) was released in August 2021 alongside other Samsung watches but became the most popular due to some significant changes, including health and fitness features [48]. This device is equipped with an optical heart rate sensor, an electrical heart rate sensor, and a bioelectrical impedance analysis (BIA) sensor, which provides data to improve health. The advanced BIA sensor measures body skeleton mass, body water, body fat mass, and body mass index (BMI) and provides insights to the user to help them manage their health more effectively [49]. Samsung introduced this after the significant concerns of increasing obesity rates, not just among seniors but also among young adults and children. This has been a particular feature of the post-COVID-19 lockdowns around the world [50].
The Samsung Galaxy Watch 5 (Figure 2b) was released in August 2022 and included a new temperature feature through the addition of an infrared temperature sensor. This can determine the user's basal body temperature and uses it as a baseline to determine different changes in body temperature. Moreover, the battery life has been improved to last 50 h due to the powerful Exynos-W920 chipset along with 1.5 GB RAM and 16 GB storage. In addition to existing body monitoring sensors, the Samsung Galaxy Watch 5 provides improved sleep tracking. The Galaxy Watch 5 was the most popular watch for Samsung in 2022.
Fitbit Charge 5 (Figure 2c) was also released in August 2021 and is considered the best fitness tracker among Fitbit watches due to its improved features, including heart rate monitoring, sleep tracking, stress-management tools, GPS tracking, and activity tracking. The Fitbit Versa marked a move into smartwatch territory with a much larger, squared-off screen and a few extra features beyond health and fitness tracking. Fitbit Versa 2 is the latest device released by Fitbit Inc. The Fitbit Versa 2 improved on its predecessor with a raft of updated features, including Alexa support, better sleep tracking, and Fitbit Pay on all models [51]. Fitbit is a popular brand of wearable trackers [52]. The Fitbit Versa 2 and its value pricing work for anyone who is intrigued by the idea of a smartwatch; Fitbit served 27.6 million users worldwide in 2018, selling over 13.9 million units [53]. More recent statistics state that Fitbit had 31 million users by the end of 2020 [54].
Fitbit Versa 2 (Figure 2d) is the most affordable Fitbit smartwatch. As these watches were only released in recent years but are used widely, only a handful of articles cover their functionality; therefore, we selected these widely used smartwatches for our usability evaluation.

Study Design for Heuristic Evaluation
A customized heuristic evaluation was performed using Nielsen's 10 heuristics and Motti and Caine's [27] 20 principles, with 10 evaluators analyzing the usability of the four smartwatches and their four mobile applications. The Nielsen and Molich study originally found that three to five evaluators were sufficient for detecting the majority of usability issues; however, this number remains under debate [55]. For this reason, we chose 10 evaluation experts in the field of usability testing to perform the heuristic evaluations. Our group's HCI experts chose this variation in the heuristic evaluation based on the fact that it proved to be simpler and faster. For the usability evaluation of the interfaces of smartwatches and their mobile applications, we used Nielsen's heuristics. Motti and Caine's heuristics are design decisions oriented toward the human-centered aspects of the wearable domain. Combining these sets of principles, the HCI experts designed a questionnaire comprising a total of 11 heuristics, as shown in Table 2, with details of each heuristic; these are relevant to the context in which our evaluation is conducted. In terms of user interfaces, qualities such as self-descriptiveness, consistency and standards, aesthetic design, reducing short-term memory load, and matching the system with the real world are important. Additionally, interface evaluation is crucial for usability evaluation. The interface is the primary means through which users interact with a smartwatch. Its design and usability greatly impact the user experience. Ease of use, flexibility and efficiency of use, and features evaluate the intuitiveness, simplicity, and availability or functionality of specific features the smartwatch provides, respectively [18]. These heuristics provide valuable guidelines for assessing different aspects of the user experience. HCI experts validated their evaluation findings through multiple discussion sessions.
The evaluators assigned a severity level to each problem they found in either the interface, the design of a mobile application, or the smartwatch. An estimation of how much more usability work may be needed was also based on the severity ratings, with the results informing the decision on the allocation of resources to the most serious problems. The severity ratings are a combination of how frequently a problem occurs, its impact, and its persistence [56]. Therefore, for the overall assessment of the usability problems, a rating scale was used that combined all three factors to facilitate decision-making, as presented in Table 4. This severity rating scale followed the tradition established in previous research studies using the system usability scale (SUS) in order to maintain consistency with previous research. The severity rating scale used in our study is a widely studied and common approach documented in academic papers [57]. This method ensured that our evaluation results were consistent with current usability research standards. While the severity rating scale employed in our study's design was adapted from existing studies, we also ensured that it was appropriate and fit our study aims. The evaluators also provided comments on why they allocated this severity level to a particular problem and suggested possible solutions. Due to the radical changes in technology over the years, evaluating usability with the 10 heuristics (HEs) of Nielsen and Molich alone could not provide sufficient insights. In addition, because smart devices demand more context-specific heuristics rather than a generic set, the usability requirements of different user interfaces, audiences, and tasks require customized heuristics. When the heuristics are tailored to the specific interface and task, the evaluation can be more efficient because it focuses on the most relevant usability issues [58].
Based on the existing information in different studies, scant research on the customization of usability heuristics using these methods is available. Therefore, we combined two sets of heuristic principles: one for the interface usability analysis and the other specifically designed for wearable devices. The HCI experts of our group designed 11 heuristics for the evaluation of smartwatches and applications. Motti and Caine state that a more human-centered approach to wearable design must take into account human aspects, facilitating consideration of human factors during design phases [27]. These human factors include users' needs, preferences, and expectations; moreover, human-centered wearable design contributes to a better user experience. As an innovative approach to design, human-centered design starts with understanding users' perspectives and designing accordingly [59]. Therefore, six of our heuristics (visibility of system status, match between system and real world, consistency and standards, aesthetic and minimalist design, flexibility and efficiency of use, and error tolerance) were sourced from Nielsen's heuristics, while Motti and Caine's heuristics (privacy, wearability, comfort, ease of use, and satisfaction) are also relevant and helpful in the design process of wearables and mobile applications. These are the heuristics that were unanimously agreed on in this study and used in the subsequent evaluation process. They were further reviewed by three human-computer interaction (HCI) experts to ensure the consistency and validation of these heuristics. In order to provide a comprehensive assessment and evaluation of the HEs, each heuristic was broken down into multiple items. These 11 heuristics are defined below:

• Self-descriptiveness: This heuristic is used to evaluate whether the design is user-friendly or not. Each screen header should define the purpose of that screen; additionally, the user should be able to know where they are when looking at the screen header and the sentences and words that have been used, e.g., when looking at the time, the header of that screen should be self-explanatory, or when the user touches a button, there should be visual feedback indicating that the action has been noted. For instance, if users delete something, a confirmation of successful deletion, or the object disappearing, provides the visual feedback. In total, we designed five items to obtain the mean severity value for this heuristic.

• Consistency and standards: This heuristic identifies the need for user interface elements across different tasks or across different versions to be consistent in their language, icons, and symbolism. We designed four items specifically related to this heuristic.

• Aesthetic and minimalist design: The Nielsen Norman Group [4] presented aesthetics and minimalism as their eighth usability heuristic. This principle was summarized by Donald Norman as follows: interfaces should not contain unnecessary information or information that is rarely required. The evaluators were given six items under this heuristic.

• Reduce short-term memory load: Short-term memory load can be reduced by showing pulldowns and menus on screens. As Nielsen says, the more you recognize something, the easier it is to remember it. Thus, it is important that objects, actions, and options can be accessed by the user in a way that minimizes the user's memory load. We presented three items for this category.

• Match between the system and the real world: This heuristic is the second usability principle presented by Nielsen and indicates that the language of the system should be the same as that of the user. The words and phrases should be simple to understand. Experts could score three items with severity ratings under this category in the questionnaire.

• Error tolerance: If the user makes a mistake, there should always be a way to go back instead of penalizing the user. Systems should also provide options for user customization.

• Privacy: The brand should ask for permission before collecting its users' personal data.

• Ease of use: This heuristic represents the capacity of the smartwatch to let its users perform tasks effectively and quickly while enjoying the experience. This is the heuristic that fundamentally influences the adoption of smartwatches.

• Flexibility and efficiency of use: This represents the speed with which the system responds to user actions or requests. We designed a total of six items under this heuristic.

• Features: This requires investigation as to whether these smartwatches provide basic functionality with good accuracy or are only focused on the design. The evaluators had seven items available for scoring, relating to the time a task took, the response time of the system, etc.

• Overall satisfaction: This heuristic evaluates whether the user is satisfied with the application and the smartwatch's performance, features, etc.

Procedure for Performing Heuristic Evaluation
A total of 10 evaluators were recruited to perform the usability evaluation. They were HCI experts and had evaluated other interfaces and applications. This study was conducted within a controlled environment in a soundproof laboratory. Throughout this study, the laboratory remained vacant, devoid of any extraneous disturbances. The researchers ensured that every participant's mobile phone was connected to the internet and their watches were charged and working properly before conducting usability sessions. The smartwatches and their corresponding mobile applications were evaluated by the 10 evaluators for 10 days, and evaluations lasted up to 1 h every day. The evaluators used the applications and smartwatches and wrote daily notes about usability problems they identified. After 10 days of use, they completed the designed questionnaire, giving each problem a severity rating and also providing comments on how that problem could be solved or the reason underlying the allocation of this severity number. The expert evaluators investigated the user interface of mobile applications and smartwatches, as well as the hardware design, the features of each watch, and their assessment of the data accuracy.
Finally, we performed heuristic evaluation calculations using Equation (1) [60], as follows:

$$\sum H_x = x_1 + x_2 + x_3 + \dots + x_n \quad (1)$$

where $\sum H_x$ is the total rating score on all sub-aspects of each usability heuristic, and $x_1, x_2, x_3, \dots, x_n$ are the usability ratings on each question of that heuristic. The severity rating of each usability aspect was calculated using Equation (2):

$$\text{Severity rating} = \frac{\sum H_x}{n} \quad (2)$$

where $n$ is the number of usability sub-aspects, using the severity rating value to indicate the magnitude of the problems identified.
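As an illustration of Equations (1) and (2), the following minimal Python sketch sums the item ratings of one heuristic and divides by the number of items. It assumes that Equation (2) is the arithmetic mean of the item ratings; the function name `heuristic_severity`, the example ratings, and the 0 to 4 rating range are hypothetical and are not taken from the study data or from Table 4.

```python
def heuristic_severity(ratings):
    """Return (total, mean) of the severity ratings given to one heuristic.

    ratings: one severity rating per questionnaire item (x1..xn).
    """
    if not ratings:
        raise ValueError("at least one item rating is required")
    total = sum(ratings)            # Equation (1): x1 + x2 + ... + xn
    severity = total / len(ratings) # Equation (2): total divided by n
    return total, severity


# Hypothetical ratings for a five-item heuristic (0-4 scale assumed).
total, severity = heuristic_severity([1, 0, 2, 1, 1])
print(total, severity)
```

Under this reading, a heuristic with more items does not automatically receive a higher severity value, because the total is normalized by the item count.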

Study Design for Usability Evaluation Using SUS Score
In addition to the heuristic evaluation, we chose SUS for usability evaluation as it has been widely adopted in the usability evaluation of products. Its versatility makes it useful for the usability evaluation of mobile devices and wearables using a short 10-item questionnaire. A total of 20 participants who had no prior experience with Samsung and Fitbit watches were recruited; their consent was obtained, and they were given the smartwatches. The participants were all aged between 20 and 30 years old. They were asked to use the smartwatches for 30 days. At the end of the 30-day period, they were invited to complete an SUS evaluation questionnaire to rate their experiences with these products. For the descriptive statistical analysis, their basic information, including age and gender, was collected and is presented in the Results section. We calculated the SUS score as follows [61]:
1. The respondent's responses to the odd-numbered statements were summed, and 5 was subtracted from the total (see Equation (3));
2. The sum of the respondent's responses to the even-numbered statements was subtracted from 25 (see Equation (4));
3. The results from Equations (3) and (4) were summed and multiplied by 2.5 (see Equation (5)).

$$X = (Q_1 + Q_3 + Q_5 + Q_7 + Q_9) - 5 \quad (3)$$

$$Y = 25 - (Q_2 + Q_4 + Q_6 + Q_8 + Q_{10}) \quad (4)$$

$$\text{SUS\_Score} = (X + Y) \times 2.5 \quad (5)$$
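The scoring steps can be sketched in Python as follows. This is a minimal illustration of the standard SUS scoring rule, assuming each of the 10 responses is an integer from 1 (strongly disagree) to 5 (strongly agree); the function name `sus_score` and the example responses are hypothetical and do not come from the participants' data.

```python
def sus_score(responses):
    """Compute the SUS score (0-100) for one respondent.

    responses: list of 10 integers in 1..5, ordered Q1..Q10.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("each response must be between 1 and 5")
    x = sum(responses[0::2]) - 5    # odd-numbered items Q1, Q3, ..., Q9
    y = 25 - sum(responses[1::2])   # even-numbered items Q2, Q4, ..., Q10
    return (x + y) * 2.5


# Hypothetical respondent who answers 3 (neutral) to every statement.
print(sus_score([3] * 10))
```

The score reported for a watch would then be the mean of the individual scores over all respondents.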

Results and Discussion
In this section, the results from the questionnaire and findings are discussed. The usability evaluations were performed on both wearable devices and apps, using both the heuristic evaluation and SUS. This section also provides a comparison between the results of this research and those of current research. We conducted the heuristic evaluation using 10 evaluators who used the watches for 10 days and completed the questionnaires, adding, if necessary, brief comments or explanations about problems and their solutions. In the group of evaluators, six were female and the rest were male. Three of the evaluators were Samsung watch users, and seven were Fitbit users. In our study, the Fitbit and Samsung Watch users were randomly selected as evaluators. Therefore, the distribution of Samsung Watch users and Fitbit users was coincidental. For the SUS survey, 20 users used the watches for 30 days and completed the SUS questionnaire. Of the users, 15 were male and 5 were female. None of them had any prior experience with the Samsung Galaxy Watch 4, Galaxy Watch 5, Fitbit Charge 5, or Fitbit Versa 2.

Evaluation Method 1: Heuristic Evaluation Results
The HCI experts were given the task of installing the application on mobile phones, connecting it to the watches, and analyzing them for usability issues. A total of 20 usability principles were evaluated by each of the 10 experts independently. We compiled the usability issues encountered by each usability expert to create this heuristic evaluation report and present it in Tables 5 and 6. These tables show the number of criteria violated per severity rating for each heuristic listed. Three hundred and seven (307) usability issues were identified in all four watches, of which one hundred and nine (35.5%) were discovered in the Fitbit Charge 5, with twenty-two minor, nineteen major, and fifteen disastrous problems. However, only 46 (14.9%) usability issues were reported for the Samsung Galaxy Watch 5. Out of 307, 66 issues (21.4%) were reported in the Galaxy Watch 4, and the remaining 86 (28%) were reported in the Fitbit Versa 2. Most of the problems were located in H8 "Ease of Use", H9 "Flexibility and Efficiency of Use", and H10 "Features". The Galaxy Watch 4 demonstrates huge advances in its hardware and software due to its Exynos W920 architecture, Samsung's first 5 nm wearable processor [61]. There are two Cortex-A55 cores on the Exynos W920, along with a Mali-G68 MP2 GPU, designed to deliver 20 percent faster CPU performance and 10 times faster GPU performance than the Exynos W9110. In addition to higher power efficiency, the 5 nm processor also results in longer smartwatch battery life. A dedicated low-power processor on the Exynos W920 handles always-on displays and other tasks while consuming very little power [62]. This architecture also manages heart rate and notifications in the background more easily. The heuristic evaluations observed that it has a great UI design as compared to other watches. The Galaxy Watch 4's UI design combines the look and feel of Samsung's Tizen platform and Wear OS [63]. The watch has a virtual rotating bezel, two buttons on the right-hand side, and a colorful and intuitive interface that allows for easy navigation and customization. Watch 4 has a redesigned notification system that allows more actions and interactions, including a more consistent and seamless experience across Samsung devices, such as syncing settings, installing apps, and transferring data. The customizable watch face editor allows you to choose from various customization options, colors, and styles. The simplified settings menu is easier to navigate and adjust [64], again allowing user customization, including gesture features that are so sensitive that even moving the wrist can swipe up or down the layout on the watch.
A few evaluators commented that its haptic bezel is not very smooth. One other design feature is that the excess strap is tucked underneath the wrist strap; therefore, there is no extra flap that can bother the user, dangle, or become lost. This is a good design approach and increases the watch's wearability. The Samsung Galaxy Watch 5 (GW5) has the same chipset, RAM, and storage as the GW4. The main difference is its battery size. The watch has a 284 mAh (milliampere-hour) battery, which is rated for 40 h of battery life. However, the evaluators noted that a full (100%) charge lasted around 32 h. It has a touch-sensitive bezel for easy navigation, as in the GW4. Another difference is the Bluetooth version, which is upgraded to 5.2. The GW4 uses a C-type cable to charge and provides a 45% charge in 30 min. According to the evaluators, it also has sapphire glass upgraded from gorilla glass, which provides a more solid and expensive feel. Fitbit Versa 2 only weighs 40 g, which makes it comfortable to wear; however, the Samsung watches weigh less. Versa 2 provides a voice input feature in the watch to set timers and reminders using Alexa. The evaluators observed that its voice recognition was more than 95% effective. The evaluators found that completing tasks with it was quite easy and fast, and the system information was understandable, as shown in Table 5. For example, the Fitbit obtained the highest score for the heuristic relating to flexibility and efficiency of use. On the other hand, the Fitbit Charge 5 is a discreet fitness tracker band that is simple to use and has a bright display, but at the expense of the battery. The Fitbit is also easily wearable in daily life as it is very light, has auto activity detection, works accurately, and uses an electrodermal activity (EDA) sensor instead of a button, making it more aesthetically pleasing. Table 5 clearly shows that the fewest usability issues were found in the Galaxy Watch 5.
A report on the heuristic evaluation was produced based on the usability issues encountered by each usability expert in Tables 5 and 6. Out of 307 usability issues, 109 (35.5%) were found in the Fitbit Charge 5, with 22 minor, 19 major, and 15 disastrous problems. However, only 46 (14.9%) usability issues were reported for the Samsung Galaxy Watch 5, 66 (21.4%) were reported for the Galaxy Watch 4, and the remaining 86 (28%) were reported for the Fitbit Versa 2.
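The reported shares can be checked with a short Python sketch that divides each watch's issue count by the total of 307. The counts are those stated in the text; the variable names are hypothetical. Rounded to two decimals, the shares come out at 35.50%, 14.98%, 21.50%, and 28.01%, so the reported 14.9% and 21.4% appear to be truncated rather than rounded.

```python
# Issue counts per watch as reported in the heuristic evaluation.
issue_counts = {
    "Fitbit Charge 5": 109,
    "Fitbit Versa 2": 86,
    "Galaxy Watch 4": 66,
    "Galaxy Watch 5": 46,
}

total_issues = sum(issue_counts.values())

# Share of all reported issues attributed to each watch, in percent.
issue_shares = {
    watch: 100 * count / total_issues
    for watch, count in issue_counts.items()
}

for watch, share in issue_shares.items():
    print(f"{watch}: {issue_counts[watch]} issues ({share:.2f}%)")
```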
The evaluators mostly rated self-descriptiveness problems as cosmetic issues, indicating that this was not something requiring immediate attention for these watches. Most problems were identified in H8 "Ease of Use", H9 "Flexibility and Efficiency of Use", and H10 "Features". The usability problems are explained above. Smartwatches from Samsung and Fitbit require user consent before storing any personal data. Samsung takes privacy seriously and seeks user consent before collecting personal data through their devices, including smartwatches. As part of the set-up process for Samsung or Fitbit smartwatches, the user is required to agree to the terms and conditions. These terms include information about data collection and usage. The brands collect data such as your name, email, location, health metrics, payment information, etc. These brands also state that they ask for your consent before collecting or sharing your data [65]. In addition, the evaluators reported that they could manage app permissions to control what data they wanted to provide. The evaluators also provided comments regarding usability and design issues and the better features of the smartwatches, and we categorized those comments into positive and negative comments, which are presented in Tables 7 and 8. Table 7 presents the comments on Samsung watches, and Table 8 presents those for Fitbit watches.
Negative comments relate to the usability problems, and it is observed that Samsung watches need improvement for quick and efficient synchronization with the mobile applications. Table 7 shows that the GW5 has more positive comments than the GW4 from the evaluators. However, some improvements are still required, e.g., relating to the accuracy of activity tracking features in the GW5, because four of the evaluators were dissatisfied with the results. We also observed that Charge 5 has design issues, including missing voice input, speakers, and a voice assistant; however, Versa 2 has a better battery life and a voice assistant that makes setting reminders and alarms easier. From the results in Table 6 and the comments from Table 8, we concluded that Fitbit Versa 2 has fewer usability problems than Fitbit Charge 5 according to the evaluators. However, the Galaxy Watch 5 still has the lowest number of usability problems compared to the other three watches. We present these usability problems in Table 9 with their severity rating value.

Evaluation Method 2: SUS Evaluation Results
An SUS score above 68 is considered above average, and a score below 68 is considered below average; thus, 68 is the average score, making it the 50th percentile [55]. An SUS score greater than 80 is considered excellent, while a score below 51 is considered an awful design in terms of efficiency, effectiveness, and overall ease of use. Once the 20 participants filled out the SUS survey for each watch, we calculated the mean for each watch regarding that item number (#), as shown in Table 10. GW5 obtained the highest SUS score of 87.375 among all four watches and was considered as possessing excellent usability. The battery life of these devices is one of the most significant obstacles to their acceptance and usability in consumer markets [66]. With a battery life of up to 50 h, the GW5 is the most useful in daily life, even though it is almost identical to GW4. In addition to improving battery life, the GW5 has some features that make it more appealing, such as an aluminum metal frame with a sapphire crystal display. However, GW5 is much more expensive than GW4. Moreover, the SUS score in Table 10 shows Fitbit Versa 2 has a greater usability score than Fitbit Charge 5. In the Samsung Galaxy range, the Galaxy Watch 5 is better. From these results, we also found that the most recently launched watch, the GW5, had a higher usability value than the other Samsung watch (GW4). However, in the Fitbit series, Fitbit Versa 2 was more efficient than the Fitbit Charge 5. The Galaxy watches have higher values in the odd-numbered questions, which indicates that the users mostly agree that this system is easy to use and efficient, and lower values in the even-numbered questions, indicating that they did not find the system unnecessarily complex and inefficient, as compared to Fitbit.
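The interpretation thresholds stated above (68 as the average, above 80 as excellent, below 51 as awful) can be expressed as a small Python helper. This is only a sketch of the bands as described in the text; the function name `sus_band` and the label strings are hypothetical.

```python
def sus_band(score):
    """Map an SUS score (0-100) to the interpretation band used in the text."""
    if score < 0 or score > 100:
        raise ValueError("an SUS score lies between 0 and 100")
    if score > 80:
        return "excellent"
    if score > 68:
        return "above average"
    if score == 68:
        return "average"
    if score >= 51:
        return "below average"
    return "awful"


# The Galaxy Watch 5 mean SUS score reported in the study.
print(sus_band(87.375))  # excellent
```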
From these evaluation results, we concluded that the customized heuristics and SUS both favored the Galaxy Watch 5 because the evaluators identified fewer usability issues for it, and it received the most positive comments among the four watches in terms of usability. Hence, this also demonstrated that the customized heuristics are valid for the usability evaluation of smartwatches. In this work, we overcame the limitations of previous research with a combination of different evaluation methods and the customization of 11 heuristics for usability evaluation.
We performed a comparative analysis of our findings with previous research to determine whether our results generalize to other studies. The Galaxy Watch 4's haptic bezel was identified as not being very smooth, which sometimes made navigation difficult. As a result, the user lost control and freedom, and the system became intolerable. The findings were similar to those found in other studies [67], where users found it difficult to navigate between pages when applications lacked backward and forward navigation buttons. As a result, this aspect of the design made it difficult to navigate a smartwatch. Fitbit Versa 2 does not pair automatically after resetting, thus violating the heuristic of ease of use. In addition, Fitbit Charge 5 does not support voice assistants because it does not have a microphone and speaker, making alarm settings difficult and violating the heuristic of ease of use. In a previous study [68], it was shown that ease of use in the system improves the quality of the user's experience with that system.
The results of the SUS survey for smartwatches showed that the users found them easy to learn and use, resulting in improved performance. This aligns with previous studies [69], which also found the systems easy to learn. The Watch 5 was found to be particularly effective for everyday use due to its long battery life, customization options, and wearability, as also reported by the authors of [70]. Overall, the usability results were positive for the effectiveness of the watch applications. However, issues were reported after updates, which confirmed the findings of earlier studies by the authors of [69,71] that the applications were efficient for users. Additionally, the participants reported that they would easily remember how to use the watches in the future, a finding supported by the authors of [72,73]. The users also encountered fewer errors while using the watch applications, a result consistent with earlier studies by [71,72,74]. The users were also satisfied with the features, functionalities, design, information, and display quality of the Watch 5 among all four watches, as reported in studies by [69,74,75]. The Galaxy watches were also found to be easy to use, leading to user satisfaction, as reported in studies by [75,76]. The new haptic bezel feature of the Galaxy watch was highly rated for its aesthetic appeal, as also observed in [76]. The Fitbit Charge 5 provided advanced fitness tracking features, as reported in [75], which found the application useful in achieving fitness-related goals.

Conclusions
The goal of our study was to evaluate the usability of smartwatches using different usability evaluation methods. To achieve this goal, we inspected four smartwatches, namely the Samsung Galaxy Watch 4, the Galaxy Watch 5, the Fitbit Charge 5, and the Fitbit Versa 2, with their respective mobile applications, Samsung Gear and the Fitbit mobile application. We employed customized heuristics and the system usability scale (SUS). The customized heuristics employed a collaborative evaluation using Nielsen and Molich's ("Usability 101: Introduction to Usability," n.d.) heuristics and Motti and Caine's [27] principles for the usability of wearable devices. A total of 10 experts used the watches for 10 days and investigated the usability issues that arose from the watches and their applications. Some identified problems were that there was no automatic reconnection between the smartwatch and the application, the automatic settings after updates were reported as annoying, and inconsistent battery life impeded everyday use. Evaluators also reported problems regarding the wearability of the Fitbit because the strap of the watch appeared to be loose. However, the Galaxy Watch 5 obtained the highest usability mean value among all the watches. Most of its usability problems were found in ease of use and in flexibility and efficiency of use.
The system usability survey was completed by 20 participants who had no prior experience with Samsung and Fitbit watches. The results from the SUS score indicated that the Galaxy Watch 5 had a slightly higher SUS score than the Galaxy Watch 4 and thus was rated as having the highest usability. In addition, the Fitbit Versa 2 obtained a higher score value than the Charge 5. We also observed that people rated the GW5 the highest because of the battery life and its classic design. Therefore, this paper conclusively contributes to the usability evaluation of smartwatches, overcoming problems encountered with customized heuristic evaluation and the system usability scale, and identifies the watch with the best usability among all watches. The results achieved in this study will be influential in providing better design features and shaping the user experience of smartwatches as they are continuously gaining popularity among users due to their innovative use of technology. Future work will include extending the parameters of the user evaluation study and investigating how usability evaluation methods can be better adapted to capture relevant usability issues in other wearable devices, e.g., headsets, smart rings, and smart shoes.
Table fragment (evaluator comments on the Galaxy watches):
- Touch display nonfunctional
- GW4 error messages closed faster than the time to read them
Ease of Use
- The GW4 rotating bezel is a handy navigation tool
- Typing keyboard is better in GW5
- Satisfied with the accuracy of sleep and heart monitoring features in GW5
- GW4 has inconsistent battery life
- GW5 has most of the features that are the same as Watch 4

Table 1. A summary of the literature review on usability evaluation of smartwatches.
1 Reference number. 2 Number of participants who participated in the study.

Table 3. The system usability survey questionnaire.

Table 4. Severity scale for heuristic evaluation.

Table 5. Usability issues encountered by evaluators with severity for the Galaxy watches.

Table 6. Usability issues encountered by evaluators with severity for the Fitbit watches.

Table 7. Participants' comments on Galaxy smartwatches sorted by usability principle.

Table 8. Participants' comments on Fitbit smartwatches sorted by usability principle.

Table 9. Usability problems with their severity value based on comments.

Table 10. System usability score for each watch.