Quantifying Usability via Task Flow-Based Usability Checklists for User-Centered Design

Abstract: In this study, we investigated the effectiveness of a method for quantifying overall product usability through an expert review. The expert review used a general-purpose task flow-based usability checklist that provides a single quantitative usability score. This checklist was expected to reduce rating variation among evaluators. To confirm the effectiveness of the checklist, two experiments were performed. In Experiment 1, the usability score obtained using the proposed checklist was compared with traditional usability measures (task completion ratio, task completion time, and subjective rating). The results demonstrated that the usability score obtained using the proposed checklist shows a tendency similar to that of the traditional measures. In Experiment 2, we investigated the inter-rater agreement of the proposed checklist by comparing it with a similar method. The results demonstrated that the inter-rater agreement of the proposed task flow-based usability checklist is greater than that of structured user interface design and evaluation (SIDE).


Introduction
To achieve human-centric designs in industry, quantifying usability is important [1]. Generally, usability evaluations are classified as formative or summative. Summative evaluations quantify usability, which has several advantages in product development. A quantitative usability value provides an overall, intuitive sense of product usability, which can be used to compare a product under development with existing product models and/or competing products. Moreover, a single quantitative score can serve as a benchmark that facilitates communication with stakeholders, because a numerical value is easy for them to understand. In this study, we propose a method to quantify usability that reduces rating variation among expert review evaluators. Our method is expected to provide a single quantitative usability score without usability testing and to help resolve the disadvantages of quantification via inspection methods. Furthermore, to evaluate each task and subtask of the evaluated products, we examined a general-purpose usability checklist in which the scope of evaluation of each item is narrow and specific, which helps enhance evaluation reliability [2]. Moreover, the checklist items, which focus on the tasks of the evaluated products, are not specialized for specific products. We conducted two experiments to investigate the effectiveness of the proposed checklist. In Experiment 1, to demonstrate the validity of the proposed checklist, we compared the usability score of the proposed checklist with traditional usability measures (i.e., task completion ratio, task completion time, and subjective ratings). In Experiment 2, we investigated the inter-rater agreement of the proposed checklist and compared it with that of the structured user interface design and evaluation (SIDE) approach.

Overview of the Task Flow-Based Usability Checklist
Wada [23] proposed 14 flow design patterns based on an investigation of the actual task flows of over 100 products. He summarized the task flow of each product using the DEMATEL (decision-making trial and evaluation laboratory) method and correspondence analysis. Thirteen of the 14 patterns were used to develop the proposed task flow-based usability checklist [24]. Using the usability checklist along the task flow enables a usability evaluation based on a user scenario. Table 1 shows the 13 patterns.
Designs 2019, 3, 2
These patterns were developed as a reference for designing operation flows and demonstrate several types of common task flows in user interfaces. Figure 1 shows an example of a flow design pattern: the typical task flow of pattern 1 ("procedure with parameter adjustment"), summarized from several tasks of actual systems. Note that the proposed task flow-based usability checklist comprises 13 patterns with respect to the flow design pattern (the supplementary pattern (pattern 14) was not considered). Because the patterns show only typical task flows, the relative importance (weight) of each pattern was not discussed.
Table 1 (excerpt). Search in special terminals to find and view information; 10 Access to information in a screen (shallow hierarchy); 11 Access and search information in a screen (deep hierarchy); 12 Search in an information-intensive system; 13 Access and edit stored information.
The items of the task flow-based usability checklist are available for each design pattern (Appendix A). Because the checklist items evaluate the common tasks and subtasks in each pattern, the checklist can be used not only for a specific product but also for any product that shares common tasks with a flow design pattern. In addition, the scope of evaluation is narrower than that of general checklists because each item focuses on a task and/or subtask of a product, not on the overall impression of the product. Narrowing the evaluation scope of each item makes the rating process much easier, which can result in reduced rating variation among evaluators. Moreover, five- or seven-point scales are expected to result in greater rating variation. Thus, to minimize rating variation, the proposed checklist adopts a binary scale (0: No, 1: Yes).

Scope of the Checklist
There are many methods for evaluating usability, each with a different purpose and scope. It is difficult to cover all the purposes and scopes of usability evaluation with a single method. The proposed checklist also has a limited purpose and scope. As mentioned in the Introduction, we assumed that the proposed checklist is used for quantifying usability during an iterative design process in the early (upstream) stage of design development. In the iterative design process, a total usability score can be set as a benchmark. During the next iteration, usability improvements can be confirmed by comparison with the previous usability score. This also indicates that the proposed checklist focuses on short-term usability, i.e., temporary use of the product. The proposed checklist does not cover long-term usability, which should be considered for user experience.

Evaluation Procedure of the Proposed Checklist
Figure 2 shows the evaluation procedure of the proposed checklist.

1. Selecting tasks of an evaluated product: for evaluation, frequently used and/or important tasks should be selected.

2. Selecting flow design patterns fitting the evaluated task flow: from the 13 patterns, the flow design patterns corresponding to each task are selected; when a task involves more than one pattern, all of them are selected.

3. Evaluating each task using checklist items: each checklist item is then rated using the two-point scale (0: No, 1: Yes).

4. Calculating total score: finally, the ratio of all "1: Yes" ratings to all checklist items is calculated, which represents the overall usability score of the proposed task flow-based usability checklist.
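The score calculation in step 4 can be illustrated with a minimal sketch. The function name `usability_score`, the grouping of ratings by task level or subtask, and the example ratings are all hypothetical and are not taken from the paper; only the definition (share of "1: Yes" ratings among all rated items) follows the text.

```python
from typing import Dict, List


def usability_score(ratings: Dict[str, List[int]]) -> float:
    """Return the ratio of "1: Yes" ratings to all rated checklist items.

    `ratings` maps a (hypothetical) group label, e.g. a subtask name,
    to the binary ratings of its checklist items.
    """
    all_ratings = [r for items in ratings.values() for r in items]
    if not all_ratings:
        raise ValueError("no checklist items were rated")
    if any(r not in (0, 1) for r in all_ratings):
        raise ValueError("ratings must be 0 (No) or 1 (Yes)")
    return sum(all_ratings) / len(all_ratings)


# Hypothetical ratings by one evaluator for one task (not data from the paper).
example = {
    "task-level": [1, 1, 0, 1, 1, 0, 1],
    "select the function": [1, 0, 1],
    "begin a task": [1],
}
score = usability_score(example)  # 8 "Yes" ratings out of 11 items
print(f"usability score: {score:.1%}")
```

When several evaluators rate the same task, their individual scores could then be averaged, as was done with the two evaluators in Experiment 1.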

Checklist Items of Each Pattern
The checklist items were designed based on a task analysis [25,26] of user interfaces involving tasks that corresponded to each design pattern. For this purpose, five tasks were analyzed per pattern, and the items required to design the user interface were examined for each task using the items of the SIDE approach [21]. When considering the items for each task, the task was considered with respect to the three steps of human information processing: effective acquisition of information → ease of understanding and judgment → comfortable operation [26]. To ensure the validity of the checklist items, they were examined based on the results of the 3P task analysis [26] of several products that have the task flows of each pattern. For the user interface design of each task flow, the checklist items that should be considered were obtained from a summary of this investigation (Table 2). Table 3 lists examples of checklist items (pattern 1). The checklist items comprise task-level and subtask-level items. Task-level items were used to evaluate the usability of the overall task (Table 3), i.e., they are the items common to all subtasks of the pattern. Subtask-level items were used to evaluate each subtask in the task flow of the pattern (Table 4). Both task-level and subtask-level items are rated to calculate the usability score.

Checklist Items
Are there any clues for identifying the following operation?
Can the users easily understand the vocabulary or the icons?
Are there any friendly or smooth forms of feedback for the operation?
Can the users easily understand the operation method?
Can the users immediately understand the relationship between different aspects of the UI?
Are the layouts of operation panels or screens standardized?
Is there consistency in the operation method?

Table 4. Checklist items of pattern 2 (subtask-level).

Select the function
Can the users easily understand where the choices are?
Is the operation panel or screen simple?
Can the users easily grasp the entirety of the selecting functions?

Enter necessary information by choice
Can the users easily understand where the choices are?
Can the users easily grasp all the choices?

Enter necessary information by key operation
Can the users operate the UI with few and efficient operation procedures?
Can the users easily understand the operation portion?
Can the users easily grasp the entirety of the operation portion?

Begin a task
Can the users easily understand the operation portion?

Method
To validate the proposed checklist, Experiment 1 was conducted, in which the usability evaluation results of the proposed checklist and traditional usability metrics were compared. Using the proposed checklist, usability evaluations were performed by two usability professionals, and the usability score of each task was calculated. Both professionals were certified as Certified HCD Professional by the Human Centered Design Organization (HCD-Net) in Japan and as Certified Professional Ergonomist (CPE, a qualification endorsed by the International Ergonomics Association) by the Japan Ergonomics Society, and had five years of industry experience. The evaluated products and tasks are listed below. Note that the average score of the two evaluators was used as the checklist's usability score. For the traditional usability metrics, we performed usability testing [27] in which the subjects were 10 undergraduate and graduate students (average age = 23.1; standard deviation (SD) = 1.2). Because usability testing in industry typically does not recruit many participants and instead repeats testing with a small sample size (≈10 subjects) in the early stage of development, we set the sample size for each task to 10. All subjects gave their consent after receiving a brief explanation of the goal and content of the experiment. Importantly, the subjects had no previous experience using the evaluated products. Note that the evaluated products and tasks were the same as those used in the above checklist evaluation.
The experiment was performed with an individual experimenter in an experimental chamber. The experiment was initiated after the experimenter confirmed that the participant understood the experiment. When the participant believed he or she had completed the task, he or she was required to report this orally to the experimenter. After the participant finished the task, a questionnaire was administered to obtain subjective ratings. The order of the eight tasks was randomized, and the participants were required to complete each task as quickly and accurately as possible. The evaluation measures were task completion ratio, task completion time, and subjective ratings [28]. Task completion ratio is defined as the ratio of participants that completed a task to all participants. Task completion time is defined as the average time from the start of the task to its successful completion. The ASQ (after-scenario questionnaire), which is a post-task satisfaction questionnaire, was adopted for obtaining subjective ratings [13,14]. The ASQ comprises the following items: (1) overall, I am satisfied with the ease of completing the tasks in this scenario; (2) overall, I am satisfied with the amount of time it took to complete the tasks in this scenario; and (3) overall, I am satisfied with the support information when completing the tasks. Each participant was required to answer each item on a seven-point scale (1 = strongly disagree and 7 = strongly agree). The ASQ score is defined as the average score of the three abovementioned items.
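The three usability testing measures defined above can be sketched as follows. The `Trial` record, the function names, and the example data are hypothetical. Two points are assumptions rather than statements from the paper: the completion time is averaged over successful trials only, and the per-task ASQ score is taken as the mean of the participants' three-item averages.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional, Tuple


@dataclass
class Trial:
    """One participant's attempt at one task (hypothetical record layout)."""
    completed: bool
    seconds: float                    # time from task start to the reported end
    asq_items: Tuple[int, int, int]   # three ASQ items, each rated 1..7


def task_completion_ratio(trials: List[Trial]) -> float:
    """Ratio of participants who completed the task to all participants."""
    return sum(t.completed for t in trials) / len(trials)


def task_completion_time(trials: List[Trial]) -> Optional[float]:
    """Average time of successful trials (assumption: failures are excluded)."""
    times = [t.seconds for t in trials if t.completed]
    return mean(times) if times else None


def asq_score(trials: List[Trial]) -> float:
    """Mean over participants of the average of the three ASQ items."""
    return mean(mean(t.asq_items) for t in trials)


# Hypothetical data for one task (not data from the paper).
trials = [
    Trial(True, 40.0, (6, 5, 6)),
    Trial(True, 60.0, (5, 5, 5)),
    Trial(False, 120.0, (2, 3, 1)),
    Trial(True, 50.0, (7, 6, 5)),
]
ratio = task_completion_ratio(trials)
avg_time = task_completion_time(trials)
asq = asq_score(trials)
```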

Results
Table 5 lists all usability metrics that were calculated using the proposed checklist and usability testing for each task. Table 6 lists the correlation coefficients among the scores. All correlation coefficients were significant and >0.6 or <−0.6, indicating that each usability score had a strong correlation with the other scores. Although the sample size used to calculate the correlation coefficients was small, the p value of each coefficient was sufficiently small. Thus, we consider this result reliable.
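As an illustration of how a correlation coefficient between two per-task scores can be obtained, the following minimal sketch computes Pearson's r. It assumes that the reported coefficients are Pearson correlations, which the text does not state explicitly. The function name `pearson_r` and the example values are hypothetical and do not reproduce Table 5.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / sqrt(sxx * syy)


# Hypothetical per-task values for eight tasks (not the paper's data).
checklist_score = [0.90, 0.75, 0.60, 0.85, 0.55, 0.70, 0.95, 0.65]
completion_time = [35.0, 62.0, 80.0, 41.0, 95.0, 58.0, 30.0, 77.0]
r = pearson_r(checklist_score, completion_time)
print(f"r = {r:.2f}")  # negative here: higher checklist scores, shorter times
```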

Discussion
As summarized in Table 6, the correlations among the usability scores were significant and strong. The checklist score was strongly correlated with the task completion ratio (R = 0.63), task completion time (R = −0.64), and ASQ score (R = 0.77), indicating that the checklist score shows the same tendency as the evaluation results obtained via usability testing. Furthermore, the absolute values of the correlation coefficients among the task completion ratio, task completion time, and ASQ score were >0.7. These coefficients were higher than those involving the checklist score. This difference was caused by the different characteristics of expert review and usability testing. The checklist is a type of expert review, which is an estimation made by usability professionals. In contrast, usability testing extracts data from actual users. Thus, the correlations among scores obtained via usability testing should be greater than the correlations between those scores and the checklist score.
Usability testing attempts to evaluate three aspects of usability, namely, effectiveness, efficiency, and satisfaction, which are evaluated according to task completion ratio, task completion time, and subjective ratings, respectively. Satisfaction is considered to be a dependent variable, and effectiveness and efficiency are independent variables [29][30][31]. The correlation coefficient between the ASQ score and usability performance was ~0.9. Moreover, we examined a multiple regression model that predicted the ASQ score based on the task completion ratio and time (Table 7). Both task completion ratio and time had a significant standardized β value (partial regression coefficient), as listed in Table 7, which explains the variation in the ASQ score (subjective satisfaction; R² = 0.93). This also supports the validity of the usability testing results, because the performance scores were highly related to subjective satisfaction. Moreover, the correlation between the ASQ and checklist scores was strong (R = 0.77) and the tendency of the scores for each task was similar (Figure 3), which indicates that the checklist score could also explain the variation in the ASQ score (R² = 0.60). Because the prediction of subjective satisfaction is important, we consider that the proposed checklist provides an effective usability score.
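For readers who want to see how standardized β values and R² relate to pairwise correlations, here is a minimal sketch for a regression with two predictors. It uses the standard closed-form solution for standardized variables and is not the authors' analysis. The function name and the input correlations are hypothetical, since the exact pairwise values are not given in this section.

```python
from typing import Tuple


def standardized_regression(r_y1: float, r_y2: float, r_12: float) -> Tuple[float, float, float]:
    """Standardized betas and R^2 for y regressed on two predictors.

    r_y1, r_y2: correlations of the outcome with predictors 1 and 2.
    r_12: correlation between the two predictors.
    """
    denom = 1.0 - r_12 ** 2
    if denom <= 0.0:
        raise ValueError("predictors are perfectly collinear")
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    r_squared = beta1 * r_y1 + beta2 * r_y2
    return beta1, beta2, r_squared


# Hypothetical correlations (outcome: ASQ score; predictors: completion
# ratio and completion time). These are NOT the values behind Table 7.
b_ratio, b_time, r2 = standardized_regression(r_y1=0.90, r_y2=-0.85, r_12=-0.70)
print(f"beta(ratio) = {b_ratio:.2f}, beta(time) = {b_time:.2f}, R^2 = {r2:.2f}")

# With a single predictor, R^2 is simply the squared correlation.
single_predictor_r2 = 0.77 ** 2
```

For the single-predictor case, 0.77 squared is about 0.59, which is in line with the reported R² = 0.60 if the reported correlation was rounded.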

Method
To evaluate the rating variation of the proposed checklist among evaluators, we conducted Experiment 2. The rating variation was compared between the proposed checklist and the SIDE approach. The participants were four graduate students who majored in usability and human-centered design. We recruited four participants to examine the rating variation among evaluators because three to four evaluators generally use a usability checklist for a given product. The participants were not informed of the goal of this experiment. The evaluation included 13 tasks that corresponded to each design pattern. The products and tasks are described as follows. For evaluation, we used 13 checklists that corresponded to each task. For SIDE, the original 29 items (Appendix B) were used because SIDE is a general-purpose checklist and the evaluated product and/or task is not confined, similar to the case of the proposed checklist. According to Yamaoka [18], SIDE can be used not only by usability experts but also by development engineers. In the original SIDE, each item is rated on a three-point scale (−1: Not good, 0: Fair, 1: Good); however, in our study, we employed a two-point scale (0: Not good and fair, 1: Good) for a fair comparison with the proposed checklist. Note that, with a larger rating scale, the rating variation among raters tends to increase.
All participants evaluated all the tasks using both methods, and the order of the two methods was counterbalanced. To minimize the order effect, the evaluation interval between the two methods was three months. The dependent variable of the experiment was inter-rater agreement, defined as the ratio of the number of checklist items that all evaluators rated with the same score to the total number of checklist items for the given task.
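The inter-rater agreement defined above can be computed with a short sketch. The function name `inter_rater_agreement` and the example ratings are hypothetical. The sketch follows the stated definition, so an item counts as agreed only when every evaluator gave it the same score.

```python
from typing import List


def inter_rater_agreement(ratings_by_evaluator: List[List[int]]) -> float:
    """Share of checklist items on which all evaluators gave the same rating.

    Each inner list holds one evaluator's ratings for the same task,
    in the same item order.
    """
    if not ratings_by_evaluator or not ratings_by_evaluator[0]:
        raise ValueError("need at least one evaluator and one item")
    n_items = len(ratings_by_evaluator[0])
    if any(len(r) != n_items for r in ratings_by_evaluator):
        raise ValueError("all evaluators must rate the same items")
    # zip(*...) regroups the ratings item by item across evaluators.
    agreed = sum(1 for item in zip(*ratings_by_evaluator) if len(set(item)) == 1)
    return agreed / n_items


# Hypothetical binary ratings of five items by four evaluators.
ratings = [
    [1, 1, 0, 1, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
]
agreement = inter_rater_agreement(ratings)  # items 1, 3, and 5 are unanimous
```

Because agreement requires unanimity, this measure is strict: a single dissenting evaluator removes an item from the numerator.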

Results
Table 8 shows an example of the evaluation of task 1. Figure 4 shows the mean inter-rater agreement of each method. To compare the inter-rater agreement between the two methods, a paired t-test was performed. The results indicated that the inter-rater agreement of the proposed checklist was significantly greater than that of the SIDE approach (|t| = 5.40, p < 0.01). Moreover, the difference between the proposed checklist and SIDE was ~33.5%.
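The paired t statistic used for this comparison can be sketched as follows, under the assumption that the pairs are the per-task agreement values of the two methods. The function name `paired_t` and the example values are hypothetical and are not the paper's data.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need paired samples of equal length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("t is undefined when all differences are equal")
    t = mean(diffs) / (sd / sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-task agreement (%) for the two methods.
checklist_agreement = [80.0, 75.0, 90.0, 70.0, 85.0]
side_agreement = [50.0, 45.0, 40.0, 55.0, 35.0]
t_value, df = paired_t(checklist_agreement, side_agreement)
print(f"t = {t_value:.2f}, df = {df}")
```

With 13 paired tasks there are 12 degrees of freedom, for which the two-sided critical value at the 0.01 level is about 3.05, so the reported |t| = 5.40 is consistent with p < 0.01.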

Discussion
As shown in Figure 4, the inter-rater agreement of the proposed checklist was significantly greater than that of the SIDE approach, which indicates that the rating variation of the proposed checklist was smaller and that the evaluators gave the same score on more checklist items. A smaller rating variation is expected to lead to less variation in the overall usability evaluation results. To confirm the variation of the usability score, we calculated the standard deviation (SD) of the usability score of both methods. The usability score of both methods is defined as the ratio of "1" ratings to the total number of items. The SD of the proposed checklist was 19.16, whereas that of SIDE was 28.61. The usability score of SIDE showed larger variation, indicating that the usability score of the proposed checklist could be more reliable than that of SIDE. Although a binary scale is assumed to provide higher inter-rater agreement with any method, the inter-rater agreement of the proposed checklist was significantly higher than that of SIDE even when SIDE used a binary scale. This indicates that the 13 task flow patterns and the checklists are effective for summarizing the usability score.
The participants provided some comments about the difficulty of the proposed checklist. They thought that selecting the proper checklist may be difficult because the proposed checklist consists of several patterns. Although the inter-rater agreement of the proposed checklist was higher than that of the other method, this problem may be a challenge for beginners in usability evaluation.

General Discussion
In Experiment 1, the results confirmed that the usability score of the proposed checklist correlated with the usability testing results. Moreover, the checklist could also be used to predict the subjective rating score. Thus, the proposed checklist is considered a valid usability evaluation method. In Experiment 2, the results confirmed that, compared to SIDE, the proposed checklist achieved greater inter-rater agreement (i.e., smaller rating variation) among usability evaluators, which indicates that the proposed checklist might be a more reliable method than the current expert review. Taken together, Experiments 1 and 2 address the verification and validation of the proposed checklist.
In both Experiments 1 and 2, the proposed task flow-based checklist was used to evaluate multiple products. Using the proposed checklist, the eight tasks in Experiment 1 and the 13 tasks in Experiment 2 were successfully evaluated, which indicates that the proposed usability checklist can be used as a general-purpose checklist for multiple products.
In both Experiments 1 and 2, the evaluators using the proposed checklist were instructed about the checklist for about an hour by one of the authors. The author demonstrated the procedure of the proposed method with an example. Because the concept and the procedure of the proposed method are simple, the evaluators did not report any difficulty in understanding and applying the checklist. Additionally, we did not observe any misunderstandings or errors in the evaluations by the evaluators. Thus, the proposed checklist is easily understandable for usability practitioners.
However, in our study, we did not apply the proposed checklist to an industry development process. Ultimately, the effectiveness of the proposed usability evaluation method has to be examined in an actual design development process. In the future, an evaluation in actual development should be planned.

Limitations
The limitations of this study that should be addressed in future work are summarized as follows. First, the sample sizes of both Experiments 1 and 2 were minimal. To enhance the reliability and generalize the findings of this study, the number of participants, evaluators, types of products, and tasks in the usability testing should be increased. Second, the applicable products were limited in this study because of the limited number of tested tasks. To apply the proposed checklist to many types of products and tasks, the range of applicable products should be investigated and extended, and the types of products and tasks should be systematically categorized. Third, the learnability of the proposed checklist should be clarified so that the checklist can be extended to the industry development process. This study only showed subjective reports from the evaluators. The usability of the proposed checklist itself should also be confirmed and enhanced. Moreover, a binary scale evaluation might miss detailed information about usability problems. Although one of the aims of the proposed checklist was to increase the inter-rater agreement, this concern should be addressed in future work. Finally, selecting the correct pattern may be difficult. Both the experts in Experiment 1 and the participants in Experiment 2 reported difficulty in selecting the correct pattern, although they were aware of the advantages of the proposed checklist described in this paper. This problem should be addressed in future research by providing concrete procedures for selecting the correct pattern.

Conclusions
In this study, we verified the effectiveness of a task flow-based usability checklist that utilizes flow design patterns. We conducted Experiment 1 to validate the checklist by comparing it with traditional usability measures. The usability score of the checklist demonstrated the same tendency as the traditional usability measures. In Experiment 2, we focused on rating variation among evaluators. The inter-rater agreement of both the proposed checklist and SIDE was investigated to demonstrate the proposed method's reliability. The results confirmed that the inter-rater agreement of the proposed task flow-based usability checklist was greater than that of SIDE. In this manner, the effectiveness of quantifying usability using the proposed task flow-based usability checklist was confirmed.

Figure 1. Example of task flow shown in flow design pattern 1.

Figure 2. Calculation of usability score by proposed task flow-based usability checklist.

Figure 3. Usability score of each task for the proposed checklist and ASQ. ASQ, after-scenario questionnaire.

Table 8. Example of the evaluation of the proposed checklist (Pattern 1). ("0" means "No", "1" means "Yes".)
Are there any clues for identifying the following operation?

Table 1. List of flow design patterns.

Table 2. Correspondence between task flow in patterns and structured user interface design and evaluation's (SIDE's) items (pattern 1).

Table 5. Usability measures of each method.

Table 6. Correlation coefficients among usability measures.

Table 7. Result of multiple regression analysis.