The Development of User-Centric Design Guidelines for Web3 Applications: An Empirical Study
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This article presents the development of a set of guidelines for designing Web3 applications. The introduction and related work are well presented, although both are very focused on Web3 technologies, and no previous results are presented on the design of guidelines or heuristic evaluation. As the aim of this work is guideline development, and heuristic evaluation is used as a basis for guidelines improvement, I think it would be necessary to introduce a literature review on both aspects.
The methodology seems adequate, but I consider the rationale for the work to be misguided. The authors present a 3 phase methodology for developing and validating their design guidelines for Web3 applications. First, they apply a systematic approach to get initial guidelines, but I can’t see the result of this phase. What are the first initial guidelines? Appendix B refers to the intermediate version of the guidelines framework and initial guidelines are not presented. In this phase authors state that they selected 31 sources and identified 18 papers, but none of them are referenced or added in the supplementary file.
My major concerns are about the methodology employed for heuristic evaluation. First of all, the authors selected 25 apps for this phase. Why so many apps for a heuristic evaluation? And why only 2 evaluators for a heuristic evaluation? Did the evaluators install all 25 apps on their mobiles? Were they both iOS or Android mobiles? If not, the results will not be valid.
On the other hand, the 14 heuristics are poorly defined. No methodology has been applied to define heuristics. Several articles and papers can be found in the literature. There is no definition or examples of the use of each heuristic. I do not understand how the evaluators have applied severity criteria to each heuristic. The authors perform statistical calculations with the quantitative data obtained from the two evaluators. The results obtained could be not significant, which would invalidate the work done.
Instead of developing 14 heuristics, why not use Ethereum's 7 heuristics? https://ethereum.org/en/developers/docs/design-and-ux/heuristics-for-web3/ Or even Nielsen's 10 heuristics for complex apps: https://www.nngroup.com/articles/usability-heuristics-complex-applications/
The use of only two evaluators to conduct a heuristic evaluation is not representative. It is widely accepted among the HCI community that the minimum number of evaluators in a heuristic evaluation should be 3 to 5 evaluators: https://www.nngroup.com/articles/how-to-conduct-a-heuristic-evaluation/theory-heuristic-evaluations/
If the aim of the work is to define guidelines for Web3 apps, shouldn't the authors have focused on developing an app (or prototype) from their guidelines to heuristically evaluate that app/prototype and then refine the guidelines?
I don't understand how the intermediate version of the guidelines (Appendix B) has been refined and how the final version of the guidelines (Appendix C) has been reached. I think the final version (Appendix C) should be evaluated to see how much the guidelines have been improved and whether this version can be considered a candidate version for developers to start using or whether it needs further refinement.
In conclusion, the work is encouraging, but it should be reconsidered how the methodology of evaluation and refinement of the guidelines has been implemented and whether the stated objectives have been achieved by obtaining final guidelines ready for use by developers.
Author Response
We extend our sincere gratitude to the reviewer for their thorough and constructive feedback and for helping us improve our manuscript. Please find below a detailed account of the amendments made to the manuscript in response to your comments and suggestions.
Comment: “This article presents the development of a set of guidelines for designing Web3 applications. The introduction and related work are well presented, although both are very focused on Web3 technologies, and no previous results are presented on the design of guidelines or heuristic evaluation. As the aim of this work is guideline development, and heuristic evaluation is used as a basis for guidelines improvement, I think it would be necessary to introduce a literature review on both aspects.”
Response: The heuristic evaluation in this study is utilized as one of several validation methods for the guidelines, focusing on preliminary usability and coverage. Section 2.7 now includes a detailed discussion of methodologies for developing design guidelines.
Comment: “The methodology seems adequate, but I consider the rationale for the work to be misguided. The authors present a 3 phase methodology for developing and validating their design guidelines for Web3 applications. First, they apply a systematic approach to get initial guidelines, but I can’t see the result of this phase. What are the first initial guidelines? Appendix B refers to the intermediate version of the guidelines framework and initial guidelines are not presented.”
Response: The first set of guidelines has been included in Appendix A, and more details about the methodology for developing these guidelines have been expanded in Section 3.1. Additionally, Section 4.1 now elaborates on the initial guidelines and the process leading to their refinement.
Comment: “In this phase authors state that they selected 31 sources and identified 18 papers, but none of them are referenced or added in the supplementary file.”
Response: The references for all 31 sources are provided in Section 3.1.1.
Comment: “My major concerns are about the methodology employed for heuristic evaluation. First of all, the authors selected 25 apps for this phase. Why so many apps for a heuristic evaluation? And why only 2 evaluators for a heuristic evaluation? Did the evaluators install all 25 apps on their mobiles? Were they both iOS or Android mobiles? If not, the results will not be valid.”
Response: The heuristic evaluation was intended as a broad preliminary test of the guidelines' applicability across diverse application types. Section 3.2.2 specifies that for cross-platform applications, we evaluated all available platforms (web, mobile iOS, mobile Android) to ensure comprehensive coverage. The use of 25 applications allowed us to examine how the guidelines performed across different Web3 categories including DeFi, NFT marketplaces, wallets, and social platforms, as detailed in Table 1. While we acknowledge that two evaluators is fewer than typically recommended for traditional heuristic evaluation, our focus was on preliminary guideline validation rather than definitive usability assessment of the applications themselves.
Comment: “On the other hand, the 14 heuristics are poorly defined. No methodology has been applied to define heuristics. Several articles and papers can be found in the literature.”
Response: We acknowledge this limitation in our methodology. We have expanded Section 3.2.3 to better explain how the heuristics were derived from our initial guidelines and developed based on the unique characteristics of Web3 applications. The detailed evaluation template used is provided in Appendix B.
Comment: “There is no definition or examples of the use of each heuristic. I do not understand how the evaluators have applied severity criteria to each heuristic.”
Response: Section 3.2.5 provides a detailed explanation of the evaluation process, including how severity scores from 0 to 4 were applied and how evaluators reached consensus on ratings.
Comments: “The authors perform statistical calculations with the quantitative data obtained from the two evaluators. The results obtained could be not significant, which would invalidate the work done.” and “The use of only two evaluators to conduct a heuristic evaluation is not representative. It is widely accepted among the HCI community that the minimum number of evaluators in a heuristic evaluation should be 3 to 5 evaluators: https://www.nngroup.com/articles/how-to-conduct-a-heuristic-evaluation/theory-heuristic-evaluations/”
Response: The heuristic evaluation phase was designed as a preliminary validation step, not a comprehensive usability assessment. While we acknowledge that two evaluators is fewer than typically recommended for traditional heuristic evaluation, our purpose was to identify patterns for guideline refinement rather than generate statistically significant metrics about the applications themselves. This is clarified in Sections 3.2.4 and 3.2.5, which detail our evaluation process and consensus-building approach.
Comment: “Instead of developing 14 heuristics, why not use Ethereum's 7 heuristics? https://ethereum.org/en/developers/docs/design-and-ux/heuristics-for-web3/ Or even Nielsen's 10 heuristics for complex apps: https://www.nngroup.com/articles/usability-heuristics-complex-applications/”
Response: While the Ethereum and Nielsen heuristics were considered as reference points, the 14 heuristics were tailored specifically to the Web3 design challenges identified during this study. Section 3.2.3 outlines how these heuristics were developed to ensure full alignment with the proposed guidelines.
Comment: “If the aim of the work is to define guidelines for Web3 apps, shouldn't the authors have focused on developing an app (or prototype) from their guidelines to heuristically evaluate that app/prototype and then refine the guidelines?”
Response: Validation of the guidelines was conducted through expert design tasks, where seven professionals applied the guidelines to design a Web3 wallet interface for peer-to-peer transfers. This process, detailed in Section 3.4, allowed practical testing of the guidelines through concrete design challenges.
Comment: “I don't understand how the intermediate version of the guidelines (Appendix B) has been refined and how the final version of the guidelines (Appendix C) has been reached.”
Response: Section 4.6 now details the evolution of the guidelines, documenting the iterative refinement process from the intermediate to the final version.
Comments: “I think the final version (Appendix C) should be evaluated to see how much the guidelines have been improved and whether this version can be considered a candidate version for developers to start using or whether it needs further refinement.” and “it should be reconsidered how the methodology of evaluation and refinement of the guidelines has been implemented and whether the stated objectives have been achieved by obtaining final guidelines ready for use by developers.”
Response: Section 5.4 outlines plans for future evaluation through the Smart Fidget Toy project, as well as the need for longitudinal studies and investigation of how the guidelines apply to emerging technologies. Section 6 positions this research as foundational work, with the guidelines serving as a stepping stone for continued refinement and application based on real-world implementation feedback.
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript presents a proposed guideline for the human-centered design of Web3 applications. It is carefully prepared and deals with current topics. However, I have a significant doubt that disturbs my perception of the whole paper.
Line 316 states: "The evaluation was conducted by two researchers with expertise in Web3 technologies and HCI."
Does this mean that the research described in subsection 4.1.1 and used as a basis for drawing conclusions and proposing general principles is based on a two-element research sample?
If this is indeed the case, then unfortunately the studies described cannot be treated as statistically verifiable because the opinion of two people (even if they are excellent specialists) cannot be extrapolated to the entire population.
Other problems arise in this context. Figure 2 shows the "Distribution of Severity Scores." The histogram shows a total of 25 scores (3 + 5 + 9 + 6 + 2), suggesting that 25 selected applications were evaluated. But why 25 scores if there were two expert evaluators? Did they agree on the scores? How were the scores for each of the fourteen heuristics identified in Appendix A aggregated? Each heuristic represents a different aspect of information system quality, so how they are aggregated is critical to the reliability and meaningfulness of the results obtained. In this situation, what is the basis for using the formal statistical methods described in lines 530-532?
Figure 3 shows the AVR as a continuous variable. This does not make sense because there are independent application types on the X-axis, so it is unwarranted to present the average for each of them as a continuous line.
As for other aspects of the manuscript (sections: Introduction, Related Work, Materials and Methods), I rate them highly because they are well-written. The summary definitely lacks a clear response to the research questions posed in lines 55–57.
Author Response
We extend our sincere gratitude to the reviewer for their thorough and constructive feedback and for helping us improve our manuscript. Please find below a detailed account of the amendments made to the manuscript in response to your comments and suggestions.
Comment: “Line 316 states: "The evaluation was conducted by two researchers with expertise in Web3 technologies and HCI." Does this mean that the research described in subsection 4.1.1 and used as a basis for drawing conclusions and proposing general principles is based on a two-element research sample?
If this is indeed the case, then unfortunately the studies described cannot be treated as statistically verifiable because the opinion of two people (even if they are excellent specialists) cannot be extrapolated to the entire population.”
Response: The evaluation conducted by two researchers was a preliminary validation step to assess the guidelines' applicability across diverse Web3 applications. This phase was specifically designed to refine the guidelines rather than draw statistically verifiable conclusions, as clarified in Sections 3.2.1 and 4.2. The subsequent expert validation with seven professionals, described in Section 3.4, provided a broader perspective through concrete design tasks and structured evaluation sessions, offering more robust validation of the refined guidelines.
Comment: “Other problems arise in this context. Figure 2 shows the "Distribution of Severity Scores." The histogram shows a total of 25 scores (3 + 5 + 9 + 6 + 2), suggesting that 25 selected applications were evaluated. But why 25 scores if there were two expert evaluators? Did they agree on the scores? How were the scores for each of the fourteen heuristics identified in Appendix A aggregated? Each heuristic represents a different aspect of information system quality, so how they are aggregated is critical to the reliability and meaningfulness of the results obtained. In this situation, what is the basis for using the formal statistical methods described in lines 530-532?”
Response: Section 3.2.5 now provides a detailed explanation of the evaluation process and consensus-building between evaluators. When evaluating each application, both researchers independently applied a 0-4 severity scale to each heuristic, and most variations were minor and easily reconcilable. In cases where ratings differed (typically by no more than one point on the severity scale), the evaluators discussed the application until they reached a consensus score. Section 4.2 clarifies that these quantitative results were used to identify recurring patterns and areas for refinement rather than to derive statistically significant findings. The discussion of statistical methods has been revised to better reflect the exploratory nature of this phase.
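To make the reconciliation step concrete, the sketch below illustrates the kind of procedure described above: two evaluators' 0-4 severity ratings per heuristic are compared, ratings that differ by at most one point are averaged into a consensus value, and a per-application average is computed. The ratings, function name, and averaging rule are illustrative assumptions for this response, not the exact data or procedure used in the study.

```python
# Illustrative sketch of the consensus step described above.
# The ratings below are made up; the study's actual data and
# reconciliation rule may differ.

from statistics import mean

# Independent 0-4 severity ratings from the two evaluators,
# one value per heuristic (14 heuristics in the study).
evaluator_a = [1, 2, 0, 3, 1, 2, 1, 0, 2, 3, 1, 1, 2, 0]
evaluator_b = [1, 3, 0, 3, 2, 2, 1, 1, 2, 3, 1, 2, 2, 0]

def consensus(a, b):
    """Resolve a pair of ratings for one heuristic.

    Ratings that differ by at most one point (the typical case
    reported above) are averaged and rounded; larger gaps would
    be flagged for further discussion between the evaluators.
    """
    if abs(a - b) <= 1:
        return round((a + b) / 2)
    raise ValueError(f"Discuss further: ratings {a} and {b} differ by more than 1")

consensus_scores = [consensus(a, b) for a, b in zip(evaluator_a, evaluator_b)]
app_average_severity = mean(consensus_scores)

print(consensus_scores)
print(f"Average severity for this application: {app_average_severity:.2f}")
```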
Comment: “Figure 3 shows the AVR as a continuous variable. This does not make sense because there are independent application types on the X-axis, so it is unwarranted to present the average for each of them as a continuous line.”
Response: The visualization in Figure 3 has been corrected to use a bar chart representation, accurately reflecting the discrete and independent nature of the application categories being compared.
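For illustration only, a minimal matplotlib sketch of the corrected presentation is shown below: average severity per application category drawn as discrete bars rather than a connected line. The category labels and values are placeholders, not the figures reported in the manuscript.

```python
# Minimal sketch of the corrected Figure 3 style: discrete bars per
# application category instead of a continuous line. Values are
# placeholders, not the study's actual results.

import matplotlib.pyplot as plt

categories = ["DeFi", "NFT marketplace", "Wallet", "Social"]  # assumed labels
average_severity = [1.8, 2.1, 1.4, 2.4]                        # placeholder data

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(categories, average_severity)
ax.set_xlabel("Application category")
ax.set_ylabel("Average severity rating")
ax.set_ylim(0, 4)  # severity scale used in the evaluation
ax.set_title("Average severity by application category (illustrative)")
fig.tight_layout()
plt.show()
```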
Comment: “The summary definitely lacks a clear response to the research questions posed in lines 55–57.”
Response: Section 5.3 now explicitly addresses each research question posed in the introduction, providing a detailed synthesis of our findings. For RQ1, we identify key usability challenges specific to Web3 applications. For RQ2, we analyze how current applications address these challenges through various approaches. For RQ3, we present the four key task-based principles that emerged from our research to guide Web3 application development.
Reviewer 3 Report
Comments and Suggestions for Authors
The paper presents a comprehensive and empirically validated framework for user-centric design guidelines tailored to Web3 applications, addressing significant gaps in usability and user experience research within this domain. It effectively combines a thorough literature review, heuristic evaluation of 25 diverse Web3 applications, and expert validation sessions to propose actionable recommendations. The findings emphasize progressive disclosure, task-oriented structures, and enhanced user education as critical factors for overcoming barriers like cognitive load and the complexity of blockchain concepts. However, the framework could be further refined by incorporating longitudinal studies to measure its long-term impact and adapting to rapidly evolving Web3 technologies. Additionally, clearer distinctions between overlapping categories, such as user understanding and educational materials, could improve practical implementation.
Author Response
We extend our sincere gratitude to the reviewer for their thorough and constructive feedback and for helping us improve our manuscript. Please find below a detailed account of the amendments made to the manuscript in response to your comments and suggestions.
Comment: However, the framework could be further refined by incorporating longitudinal studies to measure its long-term impact and adapting to rapidly evolving Web3 technologies.
Response: Plans for incorporating longitudinal studies are outlined in Section 5.4. Specifically, we detail a future project involving the Smart Fidget Toy, which will apply these guidelines and include user assessments of design outcomes over time. This project will help validate the guidelines' effectiveness in real-world implementations. Additionally, Section 6 emphasizes that this research serves as a foundational step in a longer journey to improve Web3 usability, acknowledging the need for continued adaptation to emerging technologies such as Layer 2 solutions, cross-chain interactions, and AI integration.
Comment: “Additionally, clearer distinctions between overlapping categories, such as user understanding and educational materials, could improve practical implementation.”
Response: This issue has been addressed through the evolution of the guidelines, particularly in the transition from our category-based structure to a task-oriented framework. As detailed in Section 4.6, the final version organizes the guidelines around four key task flows: Onboarding Flow, Transaction Flow, Settings & Configuration, and Other Features. This reorganization, informed by expert feedback during validation sessions, eliminates category overlap and provides clearer implementation guidance for practitioners.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
After reading the new revised version of this paper again, I consider it acceptable to publish it as it is. The authors have taken care of all the comments and recommendations. Although some aspects of the heuristic evaluation and some explanations of the results could still be improved, the paper is much improved compared to its previous version.
Author Response
We sincerely thank you for your thorough review of our revised manuscript. Your feedback throughout this process has been invaluable in strengthening the paper. We appreciate your support for publication and are grateful for all the constructive comments that helped us improve the quality of our work.
Reviewer 2 Report
Comments and Suggestions for Authors
Thank you very much for addressing most of my comments and concerns. Nevertheless, I ask you again to remove the block citation (line 281). Information about which articles the authors read is irrelevant; the key is what interesting and important information they found and used for the manuscript. If such important information was in the cited articles, it should be clearly and individually indicated.
In addition, I am not convinced that the research sample used (2 + 7 experts) allows the study to be considered reliable, unbiased, and statistically significant. Unfortunately, this is not something that can be easily corrected in the manuscript, so I leave it to the editor to decide.
Author Response
Comment: "Nevertheless, I ask you again to remove the block citation (line 281). Information about which articles the authors read is irrelevant; the key is what interesting and important information they found and used for the manuscript. If such important information was in the cited articles, it should be clearly and individually indicated."
Response: We have completely revised Section 3.1.1 to remove the block citation and better integrate individual sources throughout the methodology description. The citations are now meaningfully woven into the narrative, highlighting specific contributions from each source to our understanding of Web3 design principles.
Comment:"In addition, I am not convinced that the research sample used (2 + 7 experts) allows the study to be considered reliable, unbiased, and statistically significant."
Response: We appreciate your concern about the sample size. Our approach was intentionally structured as a multi-phase validation process, where the initial two-expert evaluation served as a preliminary screening phase, followed by more extensive expert validation. We acknowledge this limitation in our paper and have ensured transparency about our methodology's scope and limitations; accordingly, no further edits were made to the manuscript in response to this comment.
Reviewer 3 Report
Comments and Suggestions for Authors
The revised manuscript has addressed the previously identified issues effectively. The authors have provided clear explanations and made improvements in the areas of methodology and data analysis.
Author Response
We greatly appreciate your time and effort in reviewing both versions of our manuscript. Thank you for confirming that our revisions have effectively addressed the previously identified issues.