Review Reports - Usability of Virtual Reality Systems in Engineering Product Development: A Multi-Experiment Evaluation of Software, Hardware, and User Factors

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Thank you for the opportunity to review this manuscript. The topic is novel, relevant and practically important, particularly given the growing role of VR systems in engineering product development and industrial design review. The manuscript addresses an area with clear applied potential, and the multi-experiment structure suggests that the authors have collected valuable practical observations.

However, my recommendation for rejection should not be understood as an act of bad will or as a dismissal of the topic. Rather, it reflects the fact that the current version does not yet meet the methodological, statistical and reporting standards expected of an empirical scientific article. In particular, the manuscript makes comparative and impact-related claims that are not sufficiently supported by the study design or statistical analysis.

I believe that the work could become a valuable contribution after substantial restructuring. The authors may consider reframing the manuscript as an exploratory multi-case usability evaluation, reducing causal language, clarifying the scoring procedures, improving the statistical reporting, and strengthening the methodological transparency. Detailed comments and suggestions are provided in the attached review file.

Comments for author File: Comments.pdf

Author Response

Reviewer 1:

We would like to thank the reviewer for the thorough, constructive, and insightful comments. We greatly appreciate the time and effort invested, and the suggestions have helped us substantially improve the clarity, rigor, and overall quality of the manuscript.

Below, we address the issues raised in detail. The changes to the text are highlighted accordingly in the manuscript.

Major issues

No formal statistical analysis is reported despite the stated objective to "quantify the impact" of software, hardware and user background (lines 60-70) and despite several causal/comparative claims in the abstract (lines 16-20). The Results section is almost entirely descriptive and interpretive. No inferential tests, effect sizes, confidence intervals, variance estimates, model specification, missing-data handling or reliability analyses are reported
- The original manuscript presented descriptive usability outcomes because the six experiments differed in multiple contextual factors. To address this concern, the manuscript has been revised in two ways. First, we clarified that the study represents an exploratory multi-experiment usability evaluation rather than a controlled impact study (Lines 24-25, Lines 1108-1109). Second, the interpretation of results has been moderated accordingly. These changes ensure that the analytical claims align with the study design.
The term "significant" is used in the abstract (line 16) and the Discussion repeatedly uses confirmatory phrasing such as "clearly demonstrated", "confirms", and "significantly shaped" (e.g., lines 875-889), but the manuscript does not present statistical tests that would justify significance or confirmation. These statements should be softened or supported with appropriate analyses.
- We agree that the use of confirmatory language could imply statistical testing. All instances of terms such as “significant,” “clearly demonstrated,” and “confirms” have been revised to more cautious formulations (e.g., “suggests” (Line 16), “indicates,” “observed in this dataset”) to better reflect the exploratory nature of the study.
The experimental design is heavily confounded. Software version, hardware, user group, context, sample size and task difficulty change across experiments. Therefore, differences in usability degree cannot be attributed to one factor. For example, Experiment IV used PC-based VR with five senior engineers, while Experiment V used standalone VR with 40 junior engineers and a different software version and context (lines 432-467). This prevents valid claims about the independent effect of hardware or software.
- We agree that, due to the simultaneous variation of multiple factors (e.g., software version, hardware configuration, user characteristics, and usage context), the effects observed in this study cannot be attributed to any single factor in isolation.
- However, this variability is inherent to the nature of VR systems and reflects the conditions under which such systems are developed and used in practice. A VR system is fundamentally defined by the interplay of its core components, including hardware (e.g., standalone and PC-based systems), software implementation, user interaction, and contextual factors. These elements cannot be meaningfully separated without altering the nature of the system itself.
- Accordingly, this study adopts an exploratory approach that aims to examine usability across different configurations rather than to isolate causal effects of individual variables. The variability across experiments is therefore a deliberate feature of the study design, intended to capture realistic usage scenarios and to provide broader insights into the behavior of VR systems in practice. To avoid overinterpretation, we have revised the manuscript to clarify the exploratory nature of the study and to ensure that the conclusions are presented in a cautious manner.
There is a major inconsistency in the description of Experiment I. In the Methods, participants are described as divided into a standalone headset group and a PC-connected headset group (lines 340-343), whereas the Results describe a cross-cultural comparison of junior engineers from Germany and South Africa (lines 512- 518), and Table 3 lists Experiment I simply as "PC-Based / International Teams". The authors must clarify the actual design, group sizes, hardware allocation, country allocation and analysis target.
- This was an error: Experiment I was conducted only with a PC-based system. The description of Experiment I has been corrected (Line 452).
SUS scoring appears to be non-standard. The manuscript describes SUS as ten five-point Likert statements (lines 243-245), but reports "SUS" as values such as 3.17, 3.05, 3.18, 2.77, 2.9 and 3.0. A standard SUS score should be transformed to a 0-100 scale. If the authors intentionally report mean item ratings, this should not be called a SUS score and should be clearly labelled as a non-standard mean Likert score
- We acknowledge that the System Usability Scale (SUS) is typically scored on a 0–100 scale following a standardized calculation procedure.
- In the present study, each of the ten SUS items was rated on a 5-point Likert scale, and mean values were computed across participants for each item, followed by an overall average score across all items. This approach was adopted to provide a straightforward and interpretable representation of user responses within the data collection environment (Google Forms), which directly supports Likert-scale aggregation but does not automatically implement the standard SUS scoring transformation.
- While this approach differs from the conventional SUS scoring method, the use of a 1–5 scale allows for intuitive interpretation of user perceptions, where higher values indicate more positive usability evaluations. To avoid confusion, we have clarified this scoring approach in the revised manuscript and explicitly stated that the results should be interpreted as relative usability indicators rather than standardized SUS scores.
- We clarified this method of calculation now in the paper (line 292-295).
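For reference, the conventional SUS transformation mentioned by the reviewer can be sketched as follows. This is a minimal illustration of the standard scoring rule (odd-numbered items contribute rating minus 1, even-numbered items contribute 5 minus rating, and the sum is multiplied by 2.5); the function name `sus_score` and the example response vectors are assumptions for illustration and are not taken from the manuscript.

```python
def sus_score(responses):
    """Return the standard 0-100 SUS score for one participant.

    `responses` holds the ten item ratings (1-5) in questionnaire order,
    i.e. responses[0] is item 1. This is an illustrative helper, not the
    authors' analysis code.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly ten item ratings")
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("each rating must lie between 1 and 5")

    total = 0
    for item_number, rating in enumerate(responses, start=1):
        if item_number % 2 == 1:
            # Odd items are positively worded: contribution is rating - 1.
            total += rating - 1
        else:
            # Even items are negatively worded: contribution is 5 - rating.
            total += 5 - rating
    # Each item contributes 0-4, so the sum (0-40) is scaled to 0-100.
    return total * 2.5


# Two hypothetical participants with the same raw item mean (3.0)
# but opposite usability judgements.
best_case = [5, 1] * 5
worst_case = [1, 5] * 5
print(sus_score(best_case), sus_score(worst_case))
```

Because the even-numbered items are reverse-scored, a mean of raw 1-5 ratings cannot be converted into a SUS score without the item-level data: the two hypothetical response vectors above share a raw mean of 3.0 but sit at opposite ends of the 0-100 scale.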
The "proportion of discovered problems" is presented as if it were an empirical outcome, but it is calculated from an assumed p = 0.31 (lines 321-323). Table 3 consequently reports values such as 99.9%, 97.5% and 84.4%, which are theoretical expectations based only on n and the assumed discovery probability. This does not demonstrate that those proportions of actual problems were found. The claim at lines 921-922 that 99.9% of the system's problems had been identified after the first experiment is especially problematic, because Experiment V later identified 18 missing functions and 23 recommendations (lines 751-754).
- The proportion of discovered problems reported in Table 3 is based on the well-established model proposed by Nielsen and Landauer, in which the probability of discovering usability problems (p = 0.31) is assumed as an average value derived from prior studies.
  However, the identification of additional missing functions and recommendations in later experiments does not contradict the theoretical model. Rather, it reflects the evolving nature of the system and the variability in testing conditions, including changes in software versions, user groups, and usage contexts. As new system features are introduced and different user perspectives are considered, new usability issues may emerge that were not present or detectable in earlier stages.
- This observation supports our overall argument that usability outcomes in VR systems are influenced by a combination of factors such as context, hardware, software, and user characteristics and reinforces the exploratory nature of this study.
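The point under discussion, that these proportions follow from n and the assumed p alone, can be made concrete with a short sketch of the Nielsen and Landauer expectation 1 - (1 - p)^n. The function name `expected_discovery` is an assumption for illustration; the participant counts are the group sizes mentioned in the review (n = 5, 10, 23, 40) and p = 0.31 is the assumed value cited above.

```python
def expected_discovery(n, p=0.31):
    """Expected share of usability problems found by n participants.

    Implements 1 - (1 - p)**n, where p is the assumed probability that a
    single participant uncovers a given problem. The result is a model
    expectation, not a measured proportion.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    return 1.0 - (1.0 - p) ** n


# Group sizes taken from the review; no observed data enter the calculation.
for n in (5, 10, 23, 40):
    print(f"n = {n:2d}: expected proportion = {expected_discovery(n):.1%}")
```

For n = 5 the expression evaluates to roughly 84.4%, matching the value quoted for the smallest groups. Since no observed problem counts appear anywhere in the calculation, the output describes what the model predicts for a given sample size rather than what was actually found in an experiment.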
Outcome measures are insufficiently described. The manuscript mentions a tailored inspection/empirical questionnaire and seven dimensions (lines 223-238), but does not provide the exact items, item counts per dimension, scoring rules, whether items were validated, or whether reliability was assessed. Without the questionnaire or appendix, the method cannot be replicated
- The questionnaire and the corresponding evaluation method used in this study are not newly developed but are based on a previously published and validated instrument, which is fully described in Reference [33]. This reference includes the complete set of items, the distribution of items across the seven dimensions, the scoring procedure, as well as details regarding validation and reliability assessment. We have clarified this in the text (Line 356-659).
The Results section mixes results, interpretation, discussion and speculation. For instance, lines 860-864 suggest that students attributed unfavorable course outcomes to the software. This is a strong psychological interpretation and should not be stated unless it is supported by transparent qualitative coding, quotes, interrater procedures and/or triangulation
- We agree that the original wording may have implied a stronger causal or psychological interpretation than is supported by the data. These observations are based on patterns identified in participant responses, including inconsistencies between closed-ended ratings (e.g., yes/no questions or scale items) and corresponding open-text comments provided by the same participants.
- However, we acknowledge that these observations were not derived from a formal qualitative analysis procedure (e.g., systematic coding, inter-rater validation, or triangulation). Therefore, the interpretation that students may have attributed unfavorable outcomes to the software should not be presented as a definitive conclusion.
- To address this, we have revised the manuscript by: (1) removing or rephrasing interpretative statements from the Results section, (2) relocating such observations to the Discussion section, and (3) presenting them in a more cautious manner as tentative interpretations.
  Additionally, we explicitly acknowledge as a limitation that discrepancies between response formats were observed but not systematically analyzed or statistically tested, and that potential influences such as grading-related bias were not controlled in Experiment VI. The text has now been clarified as follows (Lines 976-981): “Since the participants were students enrolled in the course and the experiment formed part of their assessment, grading considerations may have influenced their responses. Although discrepancies were observed between open-text comments and scoring fields, the study did not explicitly control for or statistically evaluate the potential effect of grading-related bias. This should therefore be considered a limitation of Experiment VI.”
The manuscript is not yet linguistically polished. Numerous grammar, collocation and style errors occur throughout the text. The paper can be understood, but it requires thorough proofreading by a proficient academic English editor before publication.
- Before the initial submission, the manuscript was proofread by several experienced researchers with extensive publication records and also checked using AI tools. Unfortunately, the reviewer did not provide any specific examples, so we do not know exactly where we should make improvements.
- Nevertheless, we have revised the text once again, placing particular emphasis on greater clarity and improved linguistic quality.
Reference quality is uneven. All numbered references [1]-[31] appear to be cited in the text, but several entries lack DOI, publication venue, URL or access information where appropriate. Reference formatting is also inconsistent, especially for conference papers, books and web sources.
- We have checked the references. In one instance, we had originally omitted the DOI and have now added it. Other publications do not have a DOI, only an ISBN. We believe that all cited references can now be clearly identified.

Introduction

The introduction is generally relevant and leads logically to the stated research gap. The objective at lines is clear and appropriate for an applied usability paper. However, the research aim is too broad relative to the study design. The authors claim they will quantify the impact of software version, hardware type and user background (lines 65-66), but the subsequent design does not isolate these variables. This should be reframed as an exploratory multi-case evaluation rather than an impact study.
- We have revised the manuscript to better reflect its exploratory multi-case design and have made this clear now on several occasions. Accordingly, the study does not aim to isolate or quantify causal effects but instead focuses on identifying and describing usability patterns across different configurations.
The introduction should provide explicit research questions or hypotheses. At present, the aim is presented as a list of applied objectives, but not as testable research questions. Suggested RQs: (1) How do usability ratings vary across iterative software versions and use cases? (2) What usability problems recur across engineering VR scenarios? (3) Which user-group-specific requirements emerge from students, junior engineers and senior engineers?
- Thank you for this valuable suggestion. We agree that the research aims can be clarified further. As the study follows an exploratory, multi-case design across several experimental setups, it is not structured around predefined hypotheses or strictly testable research questions. Instead, each experiment addresses a specific objective, and the overall contribution lies in identifying patterns and trends across different configurations, including user groups, use cases, software versions, and hardware setups. To address this comment, we have revised the introduction to more clearly articulate the overarching research focus and to explicitly frame the study in terms of exploratory research questions aligned with its design.
- However, we will certainly incorporate the excellent research questions you have raised into the planning of the next study. To do so, however, the study design will need to be structured somewhat differently from that of the present study.
The statement that standalone systems are better for field testing and PC-based systems provide higher precision (lines 54-57) is relevant, but it becomes problematic when later used to interpret the results without a controlled comparison. The introduction should distinguish between literature-based assumptions and findings tested in the present study.
- Thank you for this important clarification. We agree that the distinction between literature-based assumptions and findings from the present study should be made more explicit.
- We have revised the introduction to clearly separate prior literature statements from the empirical investigation presented in this paper. The statements regarding the advantages of standalone and PC-based systems are now explicitly framed as background information derived from existing studies.
- In the results and discussion sections, we now more carefully relate these assumptions to our experimental observations. In particular, Experiment 4 (cable routing use case) demonstrated that standalone systems were limited in rendering quality due to hardware constraints, whereas PC-based systems allowed higher visual fidelity when using the same software configuration on a high-performance workstation.
The introduction would benefit from a sharper definition of the expected scientific contribution. Is the contribution the evaluation framework, the six-experiment dataset, software-development feedback, or practical implementation recommendations? This should be stated explicitly
- We have clarified that such observations are case-specific and do not constitute a controlled comparison of hardware types, but rather support exploratory insights into how system configurations affect usability in different contexts.

State of the Art

The State of the Art section covers VR technology, VR in product development and usability methods. This creates a broad background, but much of Section 2.1 is textbook-like and could be shortened to make room for a more analytical literature review.
- We’ve shortened and reduced the textbook-style content and therefore shifted focus to analysis.
The review should engage more directly with empirical studies on VR usability in engineering design review, ergonomic evaluation, collaborative VR and industrial adoption. The current section does not sufficiently compare the present study with similar empirical designs.
- Added comparisons with prior work in engineering review, ergonomics, collaborative VR, and industrial use (Line 152 – 192).
- Clarified relation to ISO 9241, SUS, NASA-TLX, and justified aggregation method with added explanation/limitations (Lines 206-341).
The authors should clarify how the proposed seven usability dimensions relate to ISO 9241, SUS, NASA-TLX and existing VR usability instruments. The current explanation at lines 223-238 is informative, but it does not justify why averaging yes/no items across seven dimensions is psychometrically appropriate.
- We would like to clarify that the seven usability dimensions used in our study are not newly proposed by the authors, but are derived from the established principles of human-centered design defined in ISO 9241-210:2019 (Ergonomics of Human-System Interaction – Part 210). Specifically, these dimensions reflect standardized usability-related constructs grounded in internationally recognized ergonomics guidelines rather than an ad hoc framework.
- We have also included the key points from the standard in the text and added a reference to an earlier publication (line 234-266).
Reference [12] is incomplete ("Classification of the Topicality and Relevance of Evaluation Tools for VR Applications, 2025") and lacks publication venue, publisher/proceedings, DOI or URL. It should be completed or replaced.
- Thank you for bringing this to our attention. We’ve completed the reference with full bibliographic details (venue, DOI/URL).
Some references are very recent and useful, but the section relies strongly on broad adoption/digital-transformation literature. The authors should strengthen the direct VR usability and HCI methodological background.
- The VR usability and HCI methodological background was strengthened primarily in Section 2.3 (“Usability of VR Systems”), where the manuscript now includes expanded discussion of ISO 9241 usability principles, SUS, NASA-TLX, usability dimensions, workload assessment, and methodological justification of the evaluation framework (pages 6–10). In addition, Section 2.2 was revised to include more VR-specific engineering and collaborative-review literature, thereby reducing reliance on broader digital-transformation and adoption-focused discussion.

Methodology and Experiment Setup

The multi-experiment structure is promising, and the five-step procedure at lines 291–308 gives the manuscript a coherent methodological frame.
- Thank you very much for your feedback
Participant information is insufficient. The manuscript should provide per-experiment demographics and expertise: age, gender, VR experience, CAD experience, product-development experience, nationality where relevant, recruitment method and inclusion/exclusion criteria.
- We’ve added per-experiment demographics, expertise, recruitment, and inclusion/exclusion criteria (cp. Section 3.1).
Hardware and software are underreported. The manuscript uses H1 and H2 (lines 313–314), but does not identify headset models, PC specifications, refresh rate, display resolution, tracking method, controller type, software features per version, or the exact changes between versions 1.70–1.73. This is necessary for reproducibility and for interpreting hardware/software effects.
- We specified the detailed software changes across versions 1.70–1.73. However, hardware and computer specifications were intentionally not included, as we did not aim to promote or advertise particular specifications or providers. End users can consult providers for suitable system requirements, which are typically already available as recommended specifications for supported computer or standalone systems.
Task descriptions remain too general. For each experiment, the authors should report the number of tasks, exact task instructions, task duration, time limits, whether order was standardized/randomized, whether task success was measured, and whether assistance was allowed.
- We did not include these factors because the experiments were conducted without time pressure or strict conditions, reflecting real design review situations where engineers or stakeholders perform tasks without imposed constraints. However, the task descriptions are provided in each experiment
The study lacks a clear sampling rationale. The participant numbers differ substantially across experiments (n = 5, 10, 23, 40). The authors discuss usability problem discovery, but this is not the same as a sample-size justification for comparing SUS/TLX/usability scores across groups.
- The aim of the study was to identify patterns across different user groups. Unfortunately, senior engineers in particular are very difficult to recruit in large numbers.
- We have added an explanation or clarification regarding the difference between exploratory and comparative analyses, as well as the limitations of small sample sizes, at the beginning of the section.
Experiment IV is confounded by lack of practice time: participants "did not engage in any prior practice interaction" because of limited employee time (lines 440–441). This should be treated as a major limitation, not simply as part of the procedure.
- This allows better understanding of the differences between prior training and no prior training.
Experiment VI is confounded by graded assessment pressure (lines 487–488; 799–803). This makes it difficult to compare with Experiment V even though the tasks are said to be similar. The authors should present this as a contextual case rather than as evidence of software/hardware usability.
- Reframed as a contextual case; noted grading pressure as a confound (Line 975-981).
The manuscript should clarify the distinction between inspectors and immersed users. Lines 303–305 state that inspectors complete the inspection questionnaire while immersed participants complete the empirical questionnaire, but the distribution of these roles is not reported for each experiment.
- We’ve clarified inspector and immersed user roles and distribution per experiment in description of step 4.
- In line with ISO 9241, objectively measurable criteria are assessed by an inspector. For example, there is an ‘undo’ function. Criteria based on subjective perception, on the other hand, are assessed by immersed users.
Ethics information is minimal. Although informed consent is mentioned (lines 1044–1045), the manuscript should state whether institutional ethics approval was required/obtained or waived, especially because students participated in a graded course context.
- Added statement on ethics approval/waiver and consent details (Lines 437-441).

Statistical Analysis

A dedicated Statistical Analysis section is completely missing. This is the central weakness of the manuscript.
- All comments have been addressed. A statistical analysis method has been added, including Yes/No questions (Lines 258–263) for usability questions, TLX (Lines 276–282), and SUS (Lines 285–296).

Analysis of the participants' responses

Table 3 is useful because it summarizes software version, test year, hardware, group, usability degree, and theoretical problem discovery proportion (lines 867 onward). However, it is not sufficient as a Results section.
- Thank you very much for pointing that out. The table is intended as a summary of the results, which were previously explained individually for each experiment. To improve the descriptions of the individual experiments, some information has been added to the text.
The six usability degrees are reported as: Experiment I = 51.7%, II = 62.3%, III = 50.3%, IV = 54.5%, V = 68.7%, VI = 63.6%. These values should be accompanied by dimension-level scores, confidence intervals and raw descriptive statistics.
- (Table 7) is now added for dimension-level scores.
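One way the requested confidence intervals could be obtained is a percentile bootstrap over per-participant usability degrees. The sketch below is only an illustration under stated assumptions: the manuscript does not describe such an analysis, the function name `bootstrap_mean_ci` is hypothetical, and the five scores in `hypothetical_scores` are invented placeholder values, not data from any experiment.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Returns (mean, lower, upper). A fixed seed keeps the result
    reproducible. Illustrative only; not the authors' analysis.
    """
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    resampled_means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(values), lower, upper


# Placeholder per-participant usability degrees (share of "yes" answers).
hypothetical_scores = [0.45, 0.50, 0.55, 0.60, 0.62]
mean, lower, upper = bootstrap_mean_ci(hypothetical_scores)
print(f"mean = {mean:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

With groups as small as n = 5, any such interval will be wide and should be read as a rough indication of uncertainty rather than a precise estimate.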
SUS values are reported as 3.17/3.05 in Experiment I (lines 519–521), 3.18 in Experiment II, 2.77 in Experiment III, 2.9 in Experiment IV, 3.0 in Experiment V and 2.9 in Experiment VI. These should be recalculated or relabelled as non-standard 1–5 item means.
- Please refer to the answer to point 1.5
The highest usability degree appears in Experiment V (68.7%) with 40 junior engineers using standalone VR (lines 741–746), but this does not prove that standalone VR is superior, because Experiment V differs from other experiments in sample size, user background, task context and software version.
- That’s right. In that sense, Experiment VI is most directly comparable to Experiment II. Here we see a slight improvement in usability. However, because of the newer version of the software, a direct comparison here is not entirely conclusive.
- A comparison of Experiments IV and V using functionally identical software shows a significant leap in usability. However, the composition of the test group has also changed here. So, whilst we cannot provide direct proof, we do see a strong indicator. This hypothesis must, of course, be proven in further studies.
Experiment V reports 18 missing functions and 23 recommendations (lines 751–754). This is one of the strongest practical findings and should be presented in a separate table, grouped by category and frequency.
- Do you mean that the missing features should be described in detail? I can well imagine that would be helpful. However, we are not authorized to do so here, as we have signed a cooperation agreement as part of the research project.
Experiment VI reports high workload under time pressure and a SUS value of 2.9 (lines 847–855). The text should include actual NASA-TLX subscale values and explain whether time pressure was experimentally controlled or merely contextual.
- The TLX responses are text-based; therefore, no numerical subscale values are available. Instead, the text responses were analyzed in detail and summarized to derive conclusions. For example, for Experiment VI:

The TLX assesses the participants’ subjective workload while completing the specific tasks across six categories. The survey resulted in the following evaluation:

Task Complexity:

The responses to the question regarding mental and perceptual activity indicate that users generally rate the tasks as rather easy or moderate, although some familiarization time is required to become accustomed to the software. Many users report that, after this initial learning phase, the tasks become intuitive and easy to complete. However, it also became apparent that some users require support in the form of operational guidance or tips in order to navigate the system effectively. This suggests that additional support materials or instructions could help ease the onboarding process and increase users’ confidence. Overall, the feedback is predominantly positive, indicating a high level of satisfaction with the software, particularly after the initial familiarization phase.

Physical Activity:

The responses regarding physical activity show that users generally perceive the tasks as easy and relaxed. Many users emphasize that the effective motion control allows them to complete the tasks comfortably while seated or with only minimal arm movement, thereby reducing physical strain. Some users mention that incorrect handling can require more body movement, suggesting that user guidance in such cases could be improved. Overall, the equipment (headset and controllers) is described as lightweight and comfortable to wear, contributing to the positive perception of the physical activity involved. The feedback indicates that the physical demands of the tasks are well designed and that users have a pleasant experience.

Time Pressure:

The responses regarding time pressure and task pace present a mixed picture. Some users perceive the pace as fast or appropriate, while others consider it slow. Many users report feeling no time pressure and note that the pace can be self-determined, indicating a flexible and user-friendly design. The feedback suggests that users had sufficient time to complete the tasks and that time pressure was generally perceived as absent or minimal. This reflects a positive user experience, as users are able to work at their own pace without feeling stressed.

Success and Satisfaction:

The responses regarding success in completing the tasks and satisfaction with personal performance are predominantly positive. Many users report a high level of satisfaction with their performance, with some even stating that they were very satisfied.

Effort and Performance Level:

The responses concerning the effort required to achieve the desired performance level show that most users experienced little to no effort. Many users report that they only had to exert minimal mental effort and no physical effort in order to complete their tasks successfully. Some users even state that they did not need to make any physical or cognitive effort to achieve their performance level. This suggests that the tasks are well designed and highly user-friendly, resulting in a relaxed and effortless experience.

Stress and Satisfaction:

The responses regarding emotional experiences during the tasks present a predominantly positive picture. Most users report that they did not feel stressed or irritated and instead felt relaxed and satisfied. However, there is also an indication that unintuitive operation caused stress for some users. Overall, feelings of satisfaction and relaxation predominate, suggesting that the user experience was largely positive, provided that the tasks did not last too long.

The statement that PC-based systems remain necessary for high-precision engineering applications (lines 900–905) may be plausible, but the Results should present task-specific precision data or objective performance evidence before making this claim.
- We’ve added “For example, in the specific application of cable routing in Experiment IV, it was found that the resolution in stand-alone systems is insufficient to accurately represent bending radii, for instance.”
Several result paragraphs read like discussion. The authors should separate factual findings from interpretation. For example, statements about CEO intentions or organizational ROI should be placed in Discussion or qualitative results with supporting quotes and context.
- I’m not quite sure what the reviewer means here. ROI figures were recorded. Consequently, the method of calculation was described and the results presented.

Discussion

The Discussion provides useful practical interpretation and identifies context-specific implementation principles (lines 1001–1018). These recommendations are relevant for practitioners. However, the Discussion overstates the evidence. Words such as "clearly demonstrated", "confirms", "significantly influenced" and "it is evident" should be replaced with "suggests", "indicates", or "was observed in this exploratory dataset", unless supported by statistical tests.
- We have made the necessary adjustments. Wording was moderated to avoid overstatement.
The Discussion should explicitly acknowledge that user group, task type, software version, hardware and organizational context were not independently controlled. This is the main limitation for interpreting differences across experiments.
- We've added that to Sections 3 and 6. The lack of experimental control across variables is now acknowledged.
The limitations section is currently too general (lines 1027–1033). It should list specific limitations: unbalanced sample sizes; n = 5 in Experiments III/IV; no randomized allocation; lack of controlled hardware comparison; non-standard SUS reporting; incomplete TLX reporting; no reliability metrics; no raw data; and qualitative coding not described.
- We’ve extended this paragraph. Limitations were expanded and explicitly detailed.
The practical recommendation "Use VR as a complement, not a replacement, for CAD" (lines 1016–1018) is reasonable and well aligned with the industrial comments. This should remain, but should be framed as a recommendation derived from exploratory usability feedback.
- We’ve added that. Practical recommendations were retained but framed as exploratory.
The discussion of student responses in Experiment VI should be made more neutral. The current explanation risks attributing negative feedback to student motivation rather than to the system or task design.
- We have revised the wording. Student-related interpretations were made more neutral.
The authors should more clearly relate their findings back to TAM, perceived usefulness, ease of use and workflow integration. These constructs are introduced in the literature review but not systematically operationalized in the analysis.
- The objective of this study is not to explore user acceptance. The TAM is introduced in the state of the art only to explain why usability is part of TAM and to clarify the meaning of usability.

Reviewer 2 Report

Comments and Suggestions for Authors

The paper evaluates the usability of VR systems in engineering product development and explores how software configuration, hardware type, user background, and context of use influence usability outcomes.

The topic is interesting, and the study has practical value. It effectively presents the experimental work across different user groups and industrial scenarios. I would like to make the following comments and suggestions.

In the introduction, I would suggest further explaining the practical industrial problem motivating the study.

The state-of-the-art section covers VR technology, VR applications in product development, and usability evaluation methods. I would suggest adding more recent references related to immersive engineering systems and industrial VR evaluation. In Section 2.3, I would suggest further justifying the relationship between the 7 usability dimensions and established usability theory.

In the methodology, I would suggest further describing the experimental design, control variables, and comparison strategy between experiments. The sample sizes vary between experiments, ranging from five to forty participants. Please justify.

As regards the evaluation instruments, further discuss the validity of the customized questionnaire, explaining whether the questionnaire was pre-tested or validated statistically.

The results section contains important observations and practical insights. The discussion of usability dimensions is detailed. However, statistical comparisons between experiments, user groups, or hardware types would improve the study.

In the conclusions, the limited number of industrial participants and the absence of longitudinal evaluation should be acknowledged as limitations.

The figures are relevant and support the understanding of the experiments. I would suggest providing more informative captions explaining the relevance of each figure to the usability evaluation.

The tables are adequately organized.

Author Response

Reviewer 2:

In the introduction, I would suggest further explaining the practical industrial problem motivating the study.
- We have added an example to the introduction, which will be discussed in more detail later in Experiment IV.
The state-of-the-art section covers VR technology, VR applications in product development, and usability evaluation methods. I would suggest adding more recent references related to immersive engineering systems and industrial VR evaluation. In Section 2.3, I would suggest further justifying the relationship between the 7 usability dimensions and established usability theory.
- The usability dimensions and their application in this study are now explained in more detail in Section 2.3.
In the methodology, I would suggest further describing the experimental design, control variables, and comparison strategy between experiments. The sample sizes vary between experiments, ranging from five to forty participants. Please justify.
- The points mentioned were addressed in the text and explained in more detail. The number of participants varies depending on the specific objectives and the availability of test subjects. The limited number of participants in Experiments III and IV reflects the actual size of the development teams within the company, which typically consist of around five members.
As regards the evaluation instruments, further discuss the validity of the customized questionnaire, explaining whether the questionnaire was pre-tested or validated statistically.
- The manuscript now clarifies that the questionnaire and the evaluation framework are not newly developed instruments, but are based on a previously published and validated approach described in reference [33]. This reference contains the questionnaire items, the evaluation procedure, the validation approach and the reliability assessment.
- Furthermore, the revised manuscript explains more clearly how the dimensions of the questionnaire are based on established usability principles from the ISO 9241 standard, TLX and SUS.
The results section contains important observations and practical insights. The discussion of usability dimensions is detailed. However, statistical comparisons between experiments, user groups, or hardware types would improve the study.
- We agree that statistical comparisons between experiments, user groups and hardware configurations would strengthen the analysis. However, given the exploratory nature of the study and the simultaneous variation of several contextual factors across the experiments, the experiments were not designed as controlled statistical comparisons.
- To address this issue, the manuscript has been revised to frame the results in a more cautious and descriptive manner. The interpretation of the results has been adjusted accordingly, and the study is now presented throughout as an exploratory usability evaluation involving multiple experiments, rather than as a confirmatory comparative analysis.
In the conclusions, the limited number of industrial participants and the absence of longitudinal evaluation should be acknowledged as limitations.
- Thank you very much for this valuable feedback. The concluding section has been revised to highlight the limitations of the study more clearly. In particular, the manuscript now explicitly addresses the limited number of industrial participants in some of the experiments.
The figures are relevant and support the understanding of the experiments. I would suggest providing more informative captions explaining the relevance of each figure to the usability evaluation.
- We have reviewed the figures again and, where the caption was unclear or the link to the text was ambiguous, we have added the necessary details.

Reviewer 3 Report

Comments and Suggestions for Authors

This paper demonstrates that user background and task context influence VR usability as much as hardware or software. It provides practical recommendations for context-specific VR implementation and validates an iterative evaluation approach to support industrial adoption. However, I think the following issues should be addressed before its publication.

The study lists three quantitative goals but does not formulate testable hypotheses, making the analysis largely descriptive rather than confirmatory.
Experiments III and IV used only 5 participants each, which severely limits statistical power and the generalizability of findings.
Prior VR experience, technical familiarity, and hardware exposure were not treated as covariates, potentially biasing usability comparisons across groups.
Each experiment used different tasks and evaluation criteria, making cross-experiment comparisons of "Usability Degree" questionable.
The absence of a non-VR control condition (e.g., CAD or physical mock-up) prevents attribution of observed effects specifically to VR technology.
Experiment III acknowledges limited valid questionnaire responses but does not describe how missing data were addressed (e.g., exclusion, imputation).
Internal consistency (e.g., Cronbach's α) is not reported for SUS, NASA-TLX, or the custom usability questionnaire, undermining measurement credibility.
The "Usability Degree" scores in Table 3 (e.g., 51.7%, 62.3%) are not defined in the methodology, preventing replication.
Critical user comments are cited frequently, while balanced or positive remarks from the same participants are largely omitted.
Comparisons between students, junior engineers, and senior engineers are descriptive only; significance tests such as t-test or ANOVA are absent.
Conclusions about system usability and workload from experiments with only 5 participants (III and IV) are presented with unwarranted confidence.
SUS and NASA-TLX measure different constructs, yet the paper often treats them as interchangeable indicators of "usability."
In Experiment VI, the authors suspect students were biased by their course grading but neither control nor statistically test for this effect.
Experiments V and VI switched to a standalone headset without disentangling the effects of resolution, tracking precision, or weight from usability outcomes.
Software versions improved over time, but it is unclear whether any participants repeated experiments, introducing potential learning bias.
Table 2 does not explicitly list "multi-user collaboration" as an evaluation focus for Experiment III, despite it being a key objective.
Methods and procedures such as participant introduction and task familiarization are repeatedly described across experiments, making the text unnecessarily verbose.

Author Response

Reviewer 3:

The study lists three quantitative goals but does not formulate testable hypotheses, making the analysis largely descriptive rather than confirmatory.
- While the study outlines three quantitative objectives, it was designed as an exploratory investigation due to the limited sample size. Therefore, formal hypotheses were not explicitly defined. This has now been clarified in the manuscript (Line 10, Line 25). The findings are intended to provide descriptive insights that may serve as a basis for future confirmatory studies with larger participant groups.

Experiments III and IV used only 5 participants each, which severely limits statistical power and the generalizability of findings.

- The limited number of participants reflects the actual size of development teams in the company, which typically consist of approximately five members. Increasing the number of participants was therefore not feasible without compromising the realism of the experimental setting. While this limits statistical power and generalizability, the results provide meaningful insights into usability within a representative real-world team configuration.

Prior VR experience, technical familiarity, and hardware exposure were not treated as covariates, potentially biasing usability comparisons across groups.

- A new table has been added (Table 4). Participant experience with digital and VR tools was recorded to characterize the background conditions of the groups and to ensure their comparability. As all groups showed comparable average experience levels, no exclusions were necessary. These factors were not treated as covariates, as the study aims to evaluate system usability under realistic conditions with heterogeneous user experience rather than controlled expertise.

Each experiment used different tasks and evaluation criteria, making cross-experiment comparisons of "Usability Degree" questionable.

- The tasks across experiments are not fundamentally different, but rather domain-specific variations of similar design and engineering activities. Each participant group performed tasks that reflect their real-world professional or educational context, ensuring ecological validity. For example, students in the ergonomics design program worked on tasks aligned with their coursework, with which they are already familiar through the use of CAD tools. Similarly, participants from the rail manufacturing sector performed cable routing tasks that correspond to their daily work, and professionals from the cutting machine industry engaged in tasks derived from their operational processes. In all cases, the tasks had previously been performed using CAD systems and were implemented in a VR environment for this study. Although the specific task domains differ, the underlying interaction principles, workflows, and evaluation framework remain consistent across experiments. Therefore, the "Usability Degree" can be meaningfully compared, as it reflects user interaction with the same VR system under comparable conditions rather than the domain-specific content of the tasks.

The absence of a non-VR control condition (e.g., CAD or physical mock-up) prevents attribution of observed effects specifically to VR technology.

- I believe this response largely overlaps with the previous point.

Experiment III acknowledges limited valid questionnaire responses but does not describe how missing data were addressed (e.g., exclusion, imputation).

- The empirical questionnaire results were limited due to the small number of responses. Although five participants took part in the experiment, only four completed the survey. The usability metrics were calculated in accordance with the defined methodology based on the four valid responses. The fifth participant’s data were incorporated where possible, based on the questions they answered (Lines 738–744).

Internal consistency (e.g., Cronbach's α) is not reported for SUS, NASA-TLX, or the custom usability questionnaire, undermining measurement credibility.

- The SUS and NASA-TLX questionnaires, as well as the seven usability-dimension questions, were adopted from established sources and are referenced in the manuscript (Lines 234–257, Table 1, Table 2, Reference [22]). Since these instruments were not developed in this study, their original validated questions were used and the corresponding scores were calculated according to the standard methods.
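
For context, the internal-consistency statistic named in the comment above, Cronbach's α, is computed from the item variances and the variance of the summed scores. The following is a minimal sketch of the textbook formula only; the function name and the response layout (one row per participant, one column per item) are assumptions for illustration, not the procedure used in the manuscript.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a participants-by-items table of scores.

    `responses` is a hypothetical layout: a list of rows, one per
    participant, each row holding that participant's k item scores.
    """
    n = len(responses)
    k = len(responses[0])
    if n < 2 or k < 2:
        raise ValueError("need at least two participants and two items")

    # Sample variance of each item (column) across participants.
    item_variance_sum = sum(variance(column) for column in zip(*responses))

    # Sample variance of each participant's summed score.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")

    return k / (k - 1) * (1 - item_variance_sum / total_variance)
```

With only four or five respondents, as in Experiments III and IV, such an estimate would be very unstable, so any value obtained this way should be read with caution.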

The "Usability Degree" scores in Table 3 (e.g., 51.7%, 62.3%) are not defined in the methodology, preventing replication.

- Section 2.3 describes how the usability degree is calculated from the yes/no questions (Equation (1), Line 263).

Critical user comments are cited frequently, while balanced or positive remarks from the same participants are largely omitted.

- In Experiment III, a positive comment from the CEO is included: “In addition, the CEO of the company stated clearly that they are planning to implement VR in their design-review process with clients, because it provides more clarity and allows clients to familiarize themselves with the machine, especially those who may not be able to understand CAD designs” (Lines 746–749).
- The test subjects tended to focus on identifying weaknesses, with the result that the positive aspects were rarely highlighted.

Comparisons between students, junior engineers, and senior engineers are descriptive only; significance tests such as t-test or ANOVA are absent.

- Due to the small and uneven sample sizes across the experimental groups, the comparisons between test groups are presented descriptively and should be interpreted as indicative rather than statistically conclusive. The comparison between students, junior engineers, and senior engineers was intentionally designed to reflect the different experience levels present in the industrial context. These groups represent the main stakeholder categories within the company, including working students, early-career engineers, highly experienced professionals, and internationally distributed teams. The purpose of this grouping was not to establish statistically significant differences, but rather to explore how users with varying levels of expertise interact with and respond to the introduction of new technologies such as VR. This approach provides practical insights into the usability and potential adoption of the system across the spectrum of potential users within the organization.
- We have now made this clearer in the manuscript.

Conclusions about system usability and workload from experiments with only 5 participants (III and IV) are presented with unwarranted confidence.

- We have added a statement acknowledging the small number of participants as a limitation of the study (Line 1155).

SUS and NASA-TLX measure different constructs, yet the paper often treats them as interchangeable indicators of "usability."

- The SUS measures the usability of the system, while the NASA-TLX evaluates the perceived workload of the tasks. In the study, both indicators were used in a complementary manner.
- These distinctions are now clearly identified in the state of the art (Lines 234–299, Table 1, Table 2).
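
As a point of reference for the SUS discussion, the standard scoring of the ten-item SUS questionnaire can be sketched as follows. This is an illustrative sketch of the conventional procedure only; the function name is hypothetical, and the aggregation actually applied in the manuscript may differ from it.

```python
def sus_score(answers):
    """Standard SUS score (0 to 100) for one respondent.

    `answers` holds the ten item ratings in questionnaire order,
    each an integer from 1 (strongly disagree) to 5 (strongly agree).
    """
    if len(answers) != 10:
        raise ValueError("SUS requires exactly ten item ratings")
    if any(a not in (1, 2, 3, 4, 5) for a in answers):
        raise ValueError("each rating must be an integer from 1 to 5")

    total = 0
    for position, rating in enumerate(answers, start=1):
        if position % 2 == 1:
            # Odd-numbered items are positively worded: rating minus 1.
            total += rating - 1
        else:
            # Even-numbered items are negatively worded: 5 minus rating.
            total += 5 - rating

    # The summed contributions (0 to 40) are scaled to 0 to 100.
    return total * 2.5
```

NASA-TLX is scored separately from its own workload subscales, which is one reason the two instruments should be reported as complementary measures and not combined into a single figure.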

In Experiment VI, the authors suspect students were biased by their course grading but neither control nor statistically test for this effect.

- The text has now been clarified: “Since the participants were students enrolled in the course and the experiment was part of their assessment, grading considerations may have influenced their responses. Although discrepancies were observed between open-text comments and scoring fields, the study did not explicitly control or statistically evaluate the potential effect of grading-related bias. This should therefore be considered a limitation of Experiment VI.” (Lines 976–981)

Experiments V and VI switched to a standalone headset without disentangling the effects of resolution, tracking precision, or weight from usability outcomes.

- A detailed analysis of the hardware was not the aim of this review. In a direct comparison, the HTC Vive Pro is slightly heavier and has a lower resolution, but offers slightly better tracking accuracy.
- However, hardware and computer specifications were intentionally not included, as we did not aim to promote or advertise particular specifications or providers.

Software versions improved over time, but it is unclear whether any participants repeated experiments, introducing potential learning bias.

- The experiments were conducted with different participant groups, and no participants took part in more than one experiment. Therefore, learning effects across experiments can be excluded.

Table 2 does not explicitly list "multi-user collaboration" as an evaluation focus for Experiment III, despite it being a key objective.

- Correct, and it has now been added.

Methods and procedures such as participant introduction and task familiarization are repeatedly described across experiments, making the text unnecessarily verbose.

- While the experiments share a similar overall structure, there are important differences in procedures such as participant introduction and task familiarization depending on the participant group and the specific experimental setup. In addition, the following text was added in the introduction of Section 3.1 (Lines 410–412): “All experiments followed a similar overall procedure consisting of participant introduction, system familiarization, task execution, and post-task evaluation. However, the duration and depth of the introduction and familiarization phases varied depending on the participants’ prior experience and the specific experimental context.”

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Thank you for submitting the revised manuscript and for addressing many of the previous comments. The manuscript has clearly improved. In particular, the study is now framed more appropriately as an exploratory multi-experiment usability evaluation, the description of Experiment I has been corrected, and the discussion is more cautious and better aligned with the study design.

The paper has potential as an applied contribution to VR usability evaluation in engineering product development. However, several issues still require focused revision before publication. Most importantly, the manuscript should consistently avoid wording that implies causal or statistically confirmed effects of software, hardware, user background, or task context. Since the six experiments differ simultaneously in participants, tasks, contexts, software versions, hardware configurations, and sample sizes, the findings should be presented as descriptive, case-specific observations.

I also recommend further clarification of the non-standard SUS aggregation, clearer labelling of the theoretical “proportion of discovered problems,” more transparent reporting of hardware/software configurations, and more systematic reporting of questionnaire outcomes, especially NASA-TLX values. The ethical statement and the potential influence of graded coursework in Experiment VI should also be handled more explicitly.

Overall, the manuscript is moving in a positive direction, and the Discussion has improved substantially. A detailed revision with specific points is provided in the attached review file.

Comments for author File: Comments.pdf

Author Response

Thank you very much for all your helpful comments and suggestions. Please refer to the attached PDF for specific responses to your comments.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The authors have addressed all the issues I raised. I think it can be published in its current state.

Author Response

Thank you very much for all your helpful comments and suggestions. They have been a great help in improving the quality of the publication.