Evaluating the Impact of VR Training Strategies on HRI Cooperative Assembly Performance

Farina, Paola; De Simone, Valentina; Miranda, Salvatore; Di Pasquale, Valentina

doi:10.3390/app16094305

Department of Industrial Engineering, University of Salerno, 84084 Fisciano, Italy

Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(9), 4305; https://doi.org/10.3390/app16094305

Submission received: 9 February 2026 / Revised: 16 April 2026 / Accepted: 20 April 2026 / Published: 28 April 2026

Abstract

Virtual Reality (VR) has emerged as a powerful tool for improving training strategies in advanced manufacturing through immersive experiences. Within this context, this study examines the impact of two training strategies, VR and Video-Based (VB) instructions, on system performance (execution time and human errors) in a cooperative Human–Robot Interaction (HRI) assembly task. Overall, 26 participants completed the task after receiving either VR or VB training, and a sub-sample of 6 people per group returned one month later to repeat the task, enabling an evaluation of performance over time. Objective and subjective metrics were collected, and statistical and effect size analyses were conducted to compare training effects across sessions. Results show that execution times and number of errors were comparable between VR and VB in the first real session. After one month, both groups exhibited improved performance, but VR-trained participants retained, on average, lower error rates, with a 71% reduction and the number of errors dropping to zero, and more stable error patterns, whereas VB-trained participants displayed greater variability and occasional accuracy degradation during repeated task execution. Moreover, within-group comparisons show that VR training is more effective for accuracy-critical cooperative HRI tasks. At the same time, VB remains a low-cost option for time-focused contexts, shedding light on how training modalities influence learning and forgetting in Industry 5.0.

Keywords:

Virtual Reality; Human–Robot Interaction; system performance; skill retention; cooperative assembly; workforce training

1. Background and Motivation

Modern manufacturing environments increasingly rely on close interaction between humans and collaborative robots, requiring operators to adapt to complex, hybrid work scenarios [1]. Human–Robot Interaction (HRI) encompasses all types of interactions between humans and robots. It can be classified into Human–Robot Coexistence, Cooperation, or Collaboration depending on factors such as shared workspace, timing, objectives, and physical contact [2].

This study specifically focused on Human–Robot Cooperation (HRCoop), which occurs when humans and robots share the same workspace and time but perform different tasks. Understanding how people acquire and retain skills in such contexts is crucial for ensuring both performance efficiency and safety. Structured training approaches play a key role in supporting the learning process, especially for collaborative assembly tasks [3]. Effective planning methods enhance the learning process in manufacturing operations, significantly improving operator performance and supporting optimal human–robot task allocation [4]. The ability to intervene in operators’ learning processes through appropriately designed training sessions represents a key opportunity to improve efficiency, safety, and human well-being in industrial environments.

Learning can be defined as the acquisition and refinement of procedural knowledge and psychomotor skills through experience, practice, or instruction [5]. However, extended periods without practice can induce the opposite process: forgetting, defined as the partial or total loss of previously encoded information from short- or long-term memory. This natural decay of skill retention can lead to performance degradation and increased human error during assembly or collaborative operations.

In this context, Virtual Reality (VR) has become a promising enabler of innovative training methodologies. Immersive simulations provide controlled and repeatable conditions for skill acquisition, offering a safe and engaging environment that closely replicates real-world tasks [6,7]. Compared to traditional approaches such as lectures, supervision by expert operators, or video-based tutorials, VR training allows users to practice complex or hazardous operations without risk to themselves or to the equipment [8]. Moreover, the ability to perform tasks with non-existent or unavailable equipment and to make mistakes without real consequences enhances both confidence and learning efficiency. Previous studies have shown that VR-based learning yields higher engagement, enjoyment, and knowledge retention than conventional 2D training platforms [9], while reducing cognitive load and facilitating procedural memory development. VR also enables personalized, transfer-oriented training scenarios that can be adapted to the individual’s real production environment [10].

From a memory perspective, procedural skill acquisition involves the progressive consolidation of motor sequences into long-term memory representations [11]. Research on motor learning and skill retention suggests that repeated embodied interaction strengthens procedural memory traces, reducing reliance on working memory during subsequent task execution. In immersive environments, sensorimotor coupling and active manipulation may enhance this consolidation process, potentially explaining differences in long-term error stability between training modalities. Specifically, the reduction in human error rates through VR training does not depend solely on task repetition but also on the joint activation of several neurocognitive mechanisms [11,12]: the spatial alignment between stimulus and response in VR scenarios can reduce cognitive load. At the same time, physical immersion and interaction facilitate the consolidation of procedural memory and support more effective transfer of learning and knowledge consolidation if compared to traditional methods.

1.1. Virtual Reality for Human-Robot Interaction Training

VR has been increasingly adopted in HRI working simulations for operator training and task rehearsal, enabling users to familiarize themselves with robotic systems and collaborative workflows before entering the real workspace [13]. In HRI VR-based simulations, operators’ training increases confidence, self-efficacy, and situational awareness when interacting with robots [14,15]. VR is expected not only to replicate the physical workspace but also to simulate dynamic and bidirectional interactions between humans and collaborative robots, where timing, spatial coordination, and communication cues play a crucial role [14,16].

Although VR can provide immersive visual and spatial experiences, current systems often lack haptic precision and realistic force feedback, which are fundamental for simulating contact-rich operations such as grasping, fastening, or guiding robotic arms [17,18]. This limitation affects procedural learning, since operators rely heavily on tactile cues to develop accurate motor patterns and coordination strategies [19]. As a result, the transfer of learned skills from virtual to real settings may be incomplete, leading to a mismatch in performance and perception once users interact with the real-world robot. In addition to technological constraints, VR–HRI training faces important human-centered challenges [20]. The interaction with a simulated robot may elicit different cognitive and emotional responses compared to real collaboration, influencing trust, perceived safety, and engagement [21]. The absence of real physical presence can alter situational awareness and delay the development of accurate mental models of the robot’s behavior [22]. Furthermore, prolonged exposure to immersive virtual environments can lead to cybersickness, manifested by nausea or disorientation, which negatively affects concentration, task persistence, and overall learning outcomes [23]. Beyond usability and comfort issues, the learning–forgetting process inherent to manufacturing operations represents an additional layer of complexity: operators acquire task-specific procedural knowledge through repetition and feedback, but skill retention is highly sensitive to task interruptions and environmental variability [24]. When training occurs in virtual settings, the temporal gap between learning in VR and performing can impact forgetting, especially if sensorimotor representations are not reinforced through physical interaction. Conversely, prior real-world experience might influence how users adapt to virtual tasks, affecting both performance efficiency and error patterns. 
However, empirical evidence on such bidirectional transfer of learning between VR and real collaborative assembly tasks remains scarce [25]. A recent study highlights that the medium- and long-term effects of learning with VR technologies, compared with traditional or alternative approaches, have rarely been evaluated [4]. The ability to consolidate the skills and competencies acquired during virtual simulation through scheduled training sessions spread over longer time horizons should therefore be explored in greater depth. The forgetting rate should be studied by varying the training method, the pre-learning strategy, and the operator’s previous experience with tasks and devices.

In addition to general evidence on VR-based training effectiveness, evaluating learning quality in HRI contexts requires a clearer grounding in cognitive load theory and human performance modeling. Cognitive Load Theory (CLT), developed by J. Sweller in 1988, distinguishes between intrinsic, extraneous, and germane load components, highlighting how instructional design directly affects working memory demands and procedural learning efficiency [26]. In HRI assembly tasks, intrinsic load is associated with task complexity and sequencing, while extraneous load may derive from interface design, robot motion predictability, and environmental constraints [27]. Immersive VR training may reduce extraneous cognitive load by providing spatially coherent representations and embodied interaction, potentially facilitating schema construction and long-term retention [28].

Furthermore, learning mechanisms directly influence error control processes [29]. According to classical human error theory, errors may arise from slips and lapses linked to attentional failures [30] or from mistakes related to incorrect mental models and decision strategies. The transition from controlled to more automatic processing through practice reduces reliance on working memory and enhances action consistency, thereby decreasing the likelihood of execution errors. In HRI cooperative assembly tasks, VR training may therefore contribute not only to faster task execution but also to reduced errors, stability across repetitions, and resilience to task variability [31]. Understanding how immersive VR training influences these cognitive control mechanisms is essential for interpreting differences in error typology and medium-term reliability across training modalities.

Workload assessment in HRI research is frequently operationalized through the NASA-TLX indicator, which conceptualizes workload as a multidimensional construct including mental demand, physical demand, temporal demand, effort, performance, and frustration [32]. However, several studies emphasize that subjective workload ratings should be interpreted alongside objective performance indicators and behavioral measures to avoid biased or incomplete interpretations of operator strain [33]. Very recent comparative HRI studies are starting to adopt multimodal approaches, integrating outcomes, physiological indicators, and standardized questionnaires to capture the interaction between cognitive states and system-level performance.

1.2. Research Gaps and Study Objective

Despite these advances, comparative investigations between VR-based and traditional training approaches in HRI remain fragmented, and the underlying mechanisms of learning transfer between virtual and physical environments remain underexplored [11]. Many studies focus on immediate post-training performance, while fewer address medium-term retention or analyze how learning decay manifests differently across training modalities [24]. Moreover, most comparative works primarily evaluate performance in terms of efficiency (task time), with limited emphasis on reliability-related indicators, such as error typology and stability across repetitions. This gap highlights the need for studies that explicitly integrate cognitive workload perspectives, comparative performance metrics, and retention dynamics within cooperative HRI assembly tasks.

To the best of the authors’ knowledge, the differences in workers’ performance associated with different training strategies, particularly in terms of reliability and productivity, have not yet been thoroughly investigated. The novelty of this work lies in the comparison of two training strategies, i.e., VR-based training and Video-Based training, for an HRI assembly task. In a laboratory experiment, participants’ performance was collected and statistically analyzed, including not only task execution time but also the number and type of human errors committed during task execution, both after the first training session (R-I) and after a one-month interval (R-II), to capture early indications of performance retention and decay. The analysis highlighted the most significant performance differences, providing empirical evidence on learning transfer effects in HRI. To guide the research and bring together the aspects discussed, the following Research Questions (RQs) are formulated:

RQ1: To what extent do VR-based training and VB-based training affect system performance in a real cooperative HRI assembly task? (VR vs. VB in R-I)
RQ2: How do operators’ performance levels in the real cooperative HRI assembly task evolve after one month for the groups trained with VR and VB approaches? (VR vs. VB in R-II)
RQ3: Within each training group (VR and VB), are there significant improvements or degradations in system performance between the first real execution of the cooperative HRI assembly task and the execution performed one month later? (VR and VB sub-groups comparison between R-I and R-II)

The remainder of the paper is structured as follows. Section 2 describes the methodology followed to build and carry out the experimental campaign from which the data are collected, while Section 3 presents the main findings. Section 4 discusses the most relevant outcomes, highlighting limitations and future research directions. The main conclusions of this research study are presented in Section 5.

2. Methodology

The methodology followed to organize the experimental campaign, conduct it, collect related data, and derive consistent results is schematized in Figure 1, and the following paragraphs detail each procedure. The experimental campaign was conducted in a laboratory environment, and it followed the Declaration of Helsinki of 1975 (https://www.wma.net/what-we-do/medical-ethics/declaration-of-helsinki/, accessed on 21 October 2025), revised in 2013.

The task definition is the first phase (Section 2.1), which includes both the definition of system performance to be monitored and participants’ recruitment and division into sub-groups. The second phase was the experimental campaign, with 2 distinct moments in which participants were called to perform the HRI task (R-I and R-II sessions detailed in Section 2.2). Subsequently, all data were collected and organized into separate sheets, ready for analysis according to the 4 perspectives schematized in the figure, each referring to a specific RQ. Finally, all of them are answered in the Results section.

2.1. Task Description and Technical Equipment

The experimental campaign was conducted in the T12a Laboratory of the Department of Industrial Engineering at the University of Salerno. The assembly task was performed at a workstation built on a workbench to which a UR3 collaborative robot from Universal Robots is attached. The UR3 has compact dimensions, making it suitable for tight workspaces. Its small base is well suited to workbench installation or to integration directly inside machinery, in light assembly or screwing applications. The workbench (180 cm × 70 cm × 97 cm) provided sufficient maneuvering space for the robot and the operator. The task to be reproduced is the assembly of a desk stand for a PC (Figure 2), whose components have been 3D-printed from a LEGO Technic set. The components are listed and detailed in Table 1.

This study follows a preliminary work by [34]. As shown in Figure 3, each assembly task involves 3 essential phases, each corresponding to 3 assembly kits. Each kit is identified by a container, and the robot delivers the parts corresponding to the specific phase to the operator, who then assembles them. Figure 3 illustrates all the steps the human operator performs to assemble the PC stand and complete the task. Yellow blocks represent robot-related operations, blue blocks human-related operations, and the pink section highlights the cooperation phase. In addition, the blocks are divided according to the kit required to execute these actions.

Task executions were recorded using two cameras. The first camera was positioned laterally at an inclined angle to capture both the cobot’s movements and the participant’s actions. The second camera was mounted overhead and focused on the operator’s workspace, providing a clear view of the hands and components to facilitate accurate detection of execution errors. These recordings enabled a detailed analysis of operator–system interactions and task performance. After all tests were completed, two of the authors used the recordings to detect human errors (HEs), following the HE taxonomy provided by [30].

2.2. Procedure

The experimental procedure is depicted in Figure 4. As the first step, each participant was assigned an ID and a set of questions (demographic information, including social background and level of experience with robots and VR technology) to be answered before arriving at the task venue. Upon arrival at the laboratory, participants were given an explanation of the performance data that would be collected for the study.

They were then given an information sheet and asked to sign a consent form before the experiment began. The sample of participants was randomly divided into two groups.

The first half, “Video-based training group (VB group),” received training on the task, which involved watching a video that explained all the information about the task and its components.
The second half, “VR-based training group (VR group)”, underwent training in a virtual environment, handling some of the parts they would later find in the kits and assemble. The VR session used a Unity 3D scenario (editor version 2022.3.48f1) and a Meta Quest 2 VR headset, which is designed to be lightweight and comfortable for long gaming sessions. The participant was trained in a virtual cooperative industrial workstation with a reproduction of the UR3, whose behavior was pre-programmed and deterministic, ensuring identical task execution across participants. The training consisted of a simple assembly operation performed by the participant, preceded by a predefined pick-and-place action by the robot, thereby preparing the participant for the cooperation mode to be experienced in the real task.

Each participant was randomly assigned to one of the sub-groups and informed of the training strategy. To ensure balanced group composition, stratified randomization by gender was used, resulting in equal numbers of male and female participants across the two groups. Other demographic characteristics, such as age, educational background, and familiarity with VR or robots, were comparable among participants, as all subjects were recruited from the same academic environment.
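
The stratified randomization described above can be illustrated with a minimal Python sketch. It is not the authors' procedure: the roster, the participant IDs, the `stratified_assign` helper, and the round-robin dealing rule are all assumptions introduced here for illustration; only the sample composition (14 females, 12 males) and the two group labels come from the text.

```python
import random
from collections import defaultdict


def stratified_assign(participants, stratum_key, groups=("VR", "VB"), seed=None):
    """Randomly split participants into groups, balancing each stratum.

    Hypothetical helper: members of each stratum are shuffled and then
    dealt to the groups in turn, so group sizes within a stratum differ
    by at most one.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in participants:
        strata[person[stratum_key]].append(person)

    assignment = {}
    for members in strata.values():
        rng.shuffle(members)  # random order within the stratum
        for index, person in enumerate(members):
            assignment[person["id"]] = groups[index % len(groups)]
    return assignment


# Hypothetical roster matching the reported sample: 14 females, 12 males.
roster = [
    {"id": f"P{i:02d}", "gender": "F" if i < 14 else "M"} for i in range(26)
]
assignment = stratified_assign(roster, "gender", seed=42)

# Count participants per (group, gender) cell.
cell_counts = defaultdict(int)
for person in roster:
    cell_counts[(assignment[person["id"]], person["gender"])] += 1
```

With an even number of participants in each stratum, as here, every group receives the same number of females and the same number of males; with an odd stratum size, the first listed group would receive the extra participant.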

Before the training, each participant was asked to fill in pre-test questionnaires. The VB training involved viewing the video once, whereas the VR condition required participants to manipulate the task-relevant components once within the virtual environment. After the training, participants performed the task under controlled conditions in a real environment to ensure consistent data collection. Specifically, they were instructed to assemble the laptop stand 3 times, following the instructions and without taking breaks between repetitions (R-I). The entire test lasted, on average, 20 min. Immediately after the task, participants were asked to answer a set of 3 questionnaires (NASA-TLX [35,36], STAI [37], Godspeed [38]) to assess their perceived physical and mental workload during the task and their attitude towards the robot.

All participants were invited to return after a month and repeat the task (3 task executions in a real environment–R-II) to evaluate changes in performance over a given period, including differences in training experience. Table 2 presents a detailed schematic overview of the tasks and corresponding groups.

2.3. Participants

Overall, 26 participants (14 females, 12 males) voluntarily participated in the experimentation (Age: Mean = 27.2 years, Standard Deviation = 3.31 years). They were Industrial Engineering bachelor’s (5 out of 26) and master’s (15 out of 26) degree students, and PhD students and young researchers (6 out of 26).

This distribution suggests a generally high level of education across the sample, consistent with the cognitive and technical skills required to perform the cooperative assembly task. Regarding previous experience with VR technologies, 44% of participants reported no experience, and 22% indicated only limited familiarity. Medium and high experience levels were less frequent (19% and 15%, respectively), indicating that most of the sample had little or no prior interaction with VR environments. This limited familiarity is relevant for interpreting potential learning effects and adaptation processes observed during the experiment. Similarly, participants reported limited prior experience with collaborative robots: 55% had none, and 26% reported almost none. Only a small fraction declared medium to high familiarity (11% and 8%, respectively). This result indicates that most participants approached the cooperative assembly task without substantial prior knowledge of cobot technology, supporting the study’s validity in assessing genuine learning and interaction effects. Overall, participants had little or no hands-on experience with robots, even though the engineering students had a theoretical background in robotic technology. No participants reported mobility, sight, or hearing problems.

2.4. Measures

To accurately assess the effectiveness and efficiency of the task, several metrics had to be identified and monitored. Based on literature reviews conducted in previous work [39,40], these were divided into (i) objective metrics, which include execution time, errors, and physiological measures, and (ii) subjective metrics, i.e., the human factors and perceptions derived from participants’ responses to the pre-task and post-task questionnaires.

2.4.1. Objective Metrics

Objective metrics collected during each task execution have enabled us to understand the task’s operational dynamics better. The primary metrics, also used for data analysis, are:

Task execution time (T) [s]: measurement of the total time taken to complete the entire task, to obtain an assessment of the overall efficiency of the task.
Human Errors (HEs) [total number and number per type]: identification of errors made by operators during task execution, verified through independent analysis of the recordings by two of the authors. HEs can be considered a metric for human operator performance, as the greater the number of errors, the lower the quality of the finished product and workplace safety [30]. In addition, the types of errors have been collected to better model their effects on performance. The taxonomy proposed by [30] includes the errors that are likely to occur in the execution of assembly tasks, which were then filtered according to the specific task considered in this study. Based on the cited taxonomy, three main types of errors were detected and analyzed, as reported in Table 3.

Furthermore, HE detection was verified by computing Cohen’s kappa (k), which quantifies the inter-rater reliability of the error classification process. The coefficient measures the level of agreement between the two raters (researchers) beyond chance. The agreement analysis was conducted on the three error categories (Execution, Sequence, and Selection) observed during task repetitions. According to the interpretation scale proposed by [41], the obtained k values (0.699 for R-I and 0.704 for R-II) indicate substantial agreement between the two raters, confirming the reliability of the HE detection procedure.
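
As an illustration of how such an agreement coefficient is obtained, the following Python sketch computes Cohen's kappa for two raters from its standard definition, k = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The rating lists are invented for the example and are not the study's data; the function name `cohens_kappa` is likewise an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Invented labels using the three categories named in the text.
rater_1 = ["Execution", "Execution", "Sequence", "Sequence"]
rater_2 = ["Execution", "Sequence", "Sequence", "Sequence"]
kappa = cohens_kappa(rater_1, rater_2)  # observed 0.75, expected 0.5
```

In this toy case the raters agree on three of four items, but half of that agreement is expected by chance, so kappa is 0.5 rather than 0.75.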

2.4.2. Subjective Metrics

For each participant, demographic characteristics (age and background) and prior experience with robots were recorded before the test began to verify sample homogeneity. Participants completed standardized and validated questionnaires before the task (pre-test) to assess their psychosocial background, initial stress, and robot or VR attitude. They include MRAS (Multidimensional Robot Attitude Scale) [42] and PSS (Perceived Stress Scale) questionnaires [43].

After the task, they were asked to complete additional questionnaires (post-test) to assess participants’ perceptions of the task itself and, finally, to provide an overview of their perceptions and predispositions toward cooperating with the robot in an assembly task. The questionnaires include PQ—Presence Questionnaire [44], SUS—System Usability Scale [45], VRSQ—Virtual Reality Sickness Questionnaire [46], Godspeed Questionnaire [38], NASA-TLX [35,36], and STAI—State Trait Anxiety Inventory Questionnaire [37].

2.5. Data Collection and Analysis

Data analysis aimed at capturing both individual and group-level dynamics and the effect of training strategies (VR and VB) on performance. Data related to the carried-out task (R-I and R-II) were first collected in Microsoft Excel spreadsheets and then statistically processed by RStudio (v. 4.4.2). The data analyzed statistically, including notation used in Section 3 for the presentation of results, are reported in Table 4.

The statistical procedure follows experimental setups from the literature that are comparable to this experimental design [47,48,49,50]. All tests used a significance level of 5%. For each variable, the distributional assumptions were first assessed with the Shapiro–Wilk normality test (chosen because of the small sample sizes). The null hypothesis (H0) assumed normally distributed data, and the alternative hypothesis (H1) assumed non-normality. H0 was retained, and the data were considered normally distributed, when the p-value (p) was > 0.05; variables with p < 0.05 were treated as non-normal. Comparisons were made on the metrics’ averages across repetitions and on the third-repetition values for the participants belonging to the VR and VB groups.

For variables that satisfied the normality assumption, t-tests were conducted to evaluate mean differences between the VR and VB groups (independent samples) and within the same group when comparing related measurements (paired samples). Beforehand, the homogeneity of variances was tested using Levene’s test. When Levene’s test returned p > 0.05, variances were considered homogeneous, and a t-test with equal variances was applied. When p < 0.05, variances were considered non-homogeneous, and the t-test with Welch correction was used.
Variables that did not satisfy the normality assumption in at least one condition were analyzed using the Wilcoxon test (rank-sum, equivalent to the Mann–Whitney U test, for independent samples; signed-rank for paired samples), a non-parametric test. In this case, p < 0.05 indicated a significant difference between the medians of the two conditions.
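
The test-selection logic described above can be summarized as a small decision function. This is a sketch under stated assumptions, not the authors' R script: the function name `choose_test` and its return labels are invented here, Levene's test is assumed to matter only for independent samples, and a p-value exactly equal to the threshold (a case the text does not cover) is treated as a violation.

```python
ALPHA = 0.05  # significance level stated in the text


def choose_test(shapiro_p_a, shapiro_p_b, levene_p=None, paired=False, alpha=ALPHA):
    """Return the name of the test to apply, given assumption-check p-values.

    shapiro_p_a / shapiro_p_b: Shapiro-Wilk p-values of the two conditions.
    levene_p: Levene p-value (only needed for independent, normal samples).
    """
    # Normality must hold in both conditions; otherwise go non-parametric.
    normal = shapiro_p_a > alpha and shapiro_p_b > alpha
    if not normal:
        return "Wilcoxon signed-rank" if paired else "Wilcoxon rank-sum"

    if paired:
        return "paired t-test"

    if levene_p is None:
        raise ValueError("Levene p-value is required for independent samples")
    # Homogeneous variances -> classic t-test; otherwise Welch correction.
    return "t-test (equal variances)" if levene_p > alpha else "Welch t-test"


# Shapiro-Wilk p-values reported for T_avg in Section 3.1 (VR, VB).
selected = choose_test(0.021, 0.062)
```

For the T_avg values reported in Section 3.1, normality is rejected for the VR group, so the function selects the Wilcoxon rank-sum test, which is the test applied there.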

Practical relevance was also considered. While p-values were used to determine whether differences were statistically significant, effect size was assessed to define the magnitude and practical importance of these differences [51]. Specifically, Cohen’s d was computed to quantify the standardized difference between group means. Additionally, 95% confidence intervals were calculated for the main outcome measures to estimate the precision of the effect size. This approach allows interpreting not only whether differences exist, but also their size and reliability, providing a more comprehensive assessment of training effectiveness.
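
To make the effect size concrete, the sketch below computes Cohen's d for two independent groups using the pooled standard deviation. The pooled-SD variant is an assumption, since the text does not state which formulation was used (paired comparisons in particular are often standardized differently), and the sample values are invented.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(sample_a, sample_b):
    """Cohen's d for two independent samples, using the pooled SD."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")

    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("effect size is undefined when both samples are constant")

    return (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var)


# Invented toy data: group B is shifted up by exactly one pooled SD.
d_example = cohens_d([1, 2, 3], [2, 3, 4])
```

A negative d here simply means the first group's mean is lower than the second's; the sign follows the order of the arguments.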

Until this point, the analysis focused on group-level comparisons of performance across training strategies without considering the variability of single participants and the evolution over repetitions. To enhance the robustness of the findings, a linear mixed-effects model [52] was then employed to examine the effects of training strategy, repetitions, and their interaction on the different performance outcomes (Section 2.4.1). The model included a random intercept for each subject to account for between-subject variability in baseline performance and the repeated-measures design. This approach enabled estimation of group-level effects (fixed effects) while controlling individual differences among participants. Overall, the application of the linear mixed-effects model, although not intended as the primary analysis of this study, offered a complementary perspective by incorporating subject-level variability into the analysis and providing a preliminary insight into learning outcomes.
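
One way to write the model described above is the following random-intercept specification. The notation is introduced here for illustration and is not taken from the paper; in particular, whether repetition enters as a numeric or a categorical predictor is not stated, and the normality of the random terms is the usual modeling assumption.

```latex
% y_{ij}: performance outcome (T or HEs) of subject i at repetition j
% Strategy_i: training strategy of subject i (VR or VB)
% u_i: subject-specific random intercept
\begin{aligned}
y_{ij} &= \beta_0 + \beta_1\,\mathrm{Strategy}_i + \beta_2\,\mathrm{Rep}_j
          + \beta_3\,(\mathrm{Strategy}_i \times \mathrm{Rep}_j)
          + u_i + \varepsilon_{ij}, \\
u_i &\sim \mathcal{N}(0, \sigma_u^2), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2).
\end{aligned}
```

Here the fixed effects capture the group-level influence of strategy, repetition, and their interaction, while the random intercept absorbs between-subject differences in baseline performance.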

3. Results

This section aims to describe the main results of the work, starting with a general overview of the sample and the principal data collected, and culminating in an answer to the presented RQs, thereby strengthening the body of knowledge regarding VR applications for training operators in manufacturing HRI tasks. Figure 5 details the logical grouping and the method used to analyze the data and provide answers to the presented RQs.

Section 3.1 investigates the immediate effect of training modality by comparing the performance of VR- and VB-trained operators during the first real execution (R-I) of the cooperative HRI task (part 1 of the scheme—green).
Section 3.2 focuses on the medium-term evolution of performance, comparing VR and VB groups during the second real execution (R-II) performed approximately one month later (part 2 of the scheme—violet).
Finally, Section 3.3 analyses the intra-group performance changes over time, comparing R-I and R-II within each training strategy (VR and VB) to assess improvements or degradations in system performance (parts 3 and 4 of the scheme—pink and orange respectively).

Each subsection provides a detailed, dedicated interpretation of the corresponding experimental findings.

3.1. RQ1: Analysis of First Session Results After Training

RQ1.

To what extent do VR-based training (VR) and video-based training (VB) differently affect system performance in a real cooperative HRI assembly task? (VR vs. VB in R-I).

A between-groups statistical comparison (VR-VB groups) was conducted on the performance indicators measured during the first real execution of the cooperative HRI assembly task following training (R-I session).

The comparison included T (average across the 3 task repetitions and T3), HEs, and their subcategories, reported as mean values and standard deviations (SD). The goal was to understand the differences in performance across different training approaches, as shown in Table 5.

Regarding T, a Shapiro–Wilk test for normality was performed. For the VR group, it indicated a significant departure from normality for both T_avg (Shapiro–Wilk p = 0.021) and T3 (p = 0.024), while for the VB group normality was not rejected (p = 0.062 for T_avg and 0.24 for T3). Because normality did not hold in both groups, the Wilcoxon rank-sum test was then applied; it returned p-values greater than 0.05 (p = 0.49 for T_avg and 1 for T3). Overall, no statistically significant differences were observed between the two training strategies. Specifically, T_avg was 248 ± 72 s for the VR group and 265 ± 78 s for the VB group. Similarly, T3 showed comparable values (197 ± 83 s for VR vs. 182 ± 36 s for VB), with no significant difference. Moreover, the boxplots in Figure 6 show a reduction in execution times across the 3 repetitions for both the VR and the VB groups. Taken together, these findings indicate that, in terms of speed, the two training modalities produced comparable immediate performance in the real HRI task, with no clear systematic advantage attributable to either VR or VB.

None of the error-related variables showed a statistically significant difference in mean or median values. For HE_avg, which was normally distributed (Shapiro–Wilk p = 0.10 for the VR group and 0.28 for the VB group), the t-test returned p = 0.66. For HE3, the Wilcoxon rank-sum test gave p = 0.63. Both groups showed low and comparable values. The boxplots in Figure 7 show a marked reduction in the number of errors across the 3 repetitions, with a more pronounced decrease in the VB sub-sample during the second task repetition.
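The test-selection logic implied by the analysis above can be sketched in a few lines. This is an illustrative reading of the procedure, not the authors' script: it assumes a 0.05 threshold and assumes that the t-test is used only when normality is not rejected in both groups; the function name is hypothetical. The Shapiro–Wilk p-values are those quoted in the text.

```python
ALPHA = 0.05  # assumed significance threshold


def choose_two_group_test(p_shapiro_vr, p_shapiro_vb, alpha=ALPHA):
    """Return the between-group test implied by the two Shapiro-Wilk p-values.

    Assumed rule: parametric t-test only if normality is not rejected
    in BOTH groups, otherwise the Wilcoxon rank-sum test.
    """
    both_normal = p_shapiro_vr > alpha and p_shapiro_vb > alpha
    return "t-test" if both_normal else "Wilcoxon rank-sum"


# Shapiro-Wilk p-values (VR, VB) as reported in the text for session R-I.
shapiro_p = {
    "T_avg": (0.021, 0.062),
    "T3": (0.024, 0.24),
    "HE_avg": (0.10, 0.28),
}

selected = {name: choose_two_group_test(*p) for name, p in shapiro_p.items()}
for name, test in selected.items():
    print(f"{name}: {test}")
```

Under this rule the two time indicators go to the Wilcoxon rank-sum test and HE_avg to the t-test, which matches the tests reported for these variables.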

The breakdown into specific error categories (Table 6) reveals that Exec_avg, Seq_avg, and Sel_avg are of similar magnitude in VR and VB, indicating that no single error category separates the two groups. Only Exec_avg exhibited a normal distribution, as indicated by the Shapiro–Wilk test (p-value = 0.08 for VR and 0.31 for VB). None of the variables showed a statistically significant difference between groups. Specifically, for Exec_avg, the t-test yielded a p-value of 0.53. Furthermore, the Wilcoxon rank-sum test produced p-values of 0.29, 0.91, 0.70, 0.58, and 1 for Seq_avg, Sel_avg, Exec_3, Seq_3, and Sel_3, respectively.

Table 7 summarizes the statistical significance, effect sizes (Cohen’s d, i.e., a standardized effect size measuring the difference between two group means), and 95% confidence intervals, which estimate the precision of each effect size, for the main outcome measures. None of the measures reached statistical significance, and the corresponding effect sizes were generally small to negligible, with confidence intervals crossing zero, suggesting a limited practical impact. This pattern highlights an important point: statistical significance alone does not fully capture the practical relevance of observed differences. While p-values indicate whether an observed difference is unlikely under the hypothesis of no difference between groups, Cohen’s d provides insight into the magnitude and practical importance of that difference.
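As a worked illustration of how such an effect size is obtained, the sketch below computes Cohen's d with a pooled standard deviation from the T_avg summary statistics quoted earlier (248 ± 72 s for VR, 265 ± 78 s for VB). Several elements are assumptions made here and are not taken from the paper: the group sizes (13 per group), the sign convention (VR minus VB), and the large-sample normal approximation used for the confidence interval. The output is therefore not expected to reproduce Table 7 exactly.

```python
import math


def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d (group a minus group b) using the pooled standard deviation."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


def approx_ci95(d, n_a, n_b):
    """Approximate 95% CI for d from a large-sample standard error (assumption)."""
    se = math.sqrt((n_a + n_b) / (n_a * n_b) + d**2 / (2 * (n_a + n_b)))
    return d - 1.96 * se, d + 1.96 * se


# ASSUMPTION: equal groups of 13; the group sizes are not stated in this section.
N_VR = N_VB = 13

d = cohens_d(248, 72, N_VR, 265, 78, N_VB)  # T_avg, VR minus VB
low, high = approx_ci95(d, N_VR, N_VB)
print(f"d = {d:.2f}, approx. 95% CI = ({low:.2f}, {high:.2f})")
```

With these inputs d is about −0.23, a small effect, and the approximate interval spans zero, which is the pattern described for the R-I comparisons.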

In summary, in the R-I session, both total times and error counts were relatively high, with several participants showing extended execution times associated with a larger number of HEs. The comparative analysis of performance learning curves (for both time and number of Human Errors) for all participants in the VB and VR training groups during R-I sessions is reported in Appendix B, Figure A1. In the first repetition, the correlation between T and HEs appeared positive (as highlighted in the related boxplots), indicating that during the first exposure, participants were still familiarizing themselves with the task and the interaction with the environment. In the second repetition, a general reduction in both T and HEs was observed. The correlation between the two variables decreased compared to the first repetition, remaining moderately positive. This reduction indicates an initial learning effect, where participants became more efficient and made fewer errors as they gained experience. The decrease in correlation strength suggests that some participants managed to improve their performance time independently of error reduction, pointing to an early differentiation in learning pace. By the third repetition, both T and HEs reached their lowest level, with several participants performing the task almost error-free. This stabilization of performance reflects a consolidation of learning, where participants had acquired sufficient familiarity with both the task and the interaction environment to execute it effectively and consistently. Overall, the results indicate that VR and VB training lead to comparable immediate performance in terms of both execution time and number of human errors during the first real cooperative HRI assembly execution. Based on the R-I session alone, neither training modality can therefore be considered more effective than the other.

3.2. RQ2: Performance Evolution After One Month (VR vs. VB)

RQ2.

How do operators’ performance levels in the real cooperative HRI assembly task evolve after one month for the groups trained with VR and VB approaches? (VR vs. VB in R-II).

The analysis of the second experimental session (R-II), conducted approximately one month after training (average delay ≈ 25 days, n = 6 per group), reveals an improvement in operators’ performance across both training strategies, with distinct trends between VR and VB (Table 8). Each performance indicator was statistically analyzed as for RQ1, and none reached statistical significance in either the Wilcoxon rank-sum test (p for T_avg = 0.39; T3 = 0.39; HE3 = 0.14; Exec3 = 0.42; Seq_avg = 0.18; Seq3 = 0.41; Sel_avg = 0.60; Sel3 = 0.18) or the t-test (p for HE_avg = 0.32, and for Exec_avg = 0.50).

Regarding T_avg, both groups show a reduction from the first execution, reaching 175 s for VR and 153 s for VB, while T3 converges to comparable values (146 s for VR and 152 s for VB). However, based on the available statistical outcomes, these temporal differences between VR and VB at the second session do not appear to be statistically significant.

A more pronounced divergence emerges in the error-related indicators. The HE_avg measured in the second session is lower for the VR group (0.7) than for the VB group (1.3). A similar trend is observed for HE3, which remains markedly lower for VR (0.3) than for VB (2.5).

With reference to RQ2, the findings tentatively indicate that both training strategies may support performance improvement over time, while VR could offer somewhat more stable and reliable error containment at medium-term follow-up. These results should be interpreted with caution and regarded as preliminary evidence, rather than conclusive findings, due to the limited sample size. The analysis of Exec, Seq, and Sel errors supports this interpretation (Table 9).

In fact, at the follow-up, VR shows very low values across all subcategories (Exec_avg = 0.7, Seq_avg = 0, Sel_avg = 0.1), whereas VB maintains higher residual execution-related errors (Exec_avg = 1) and non-negligible values in Seq_avg and Sel_avg. The same pattern is amplified in the third repetition, where VB shows higher Exec3, Seq3, and Sel3, indicating lower stability across task repetitions.

Table 10 summarizes the statistical significance, effect sizes (Cohen’s d), and 95% confidence intervals for the main outcome measures of this case. The effect sizes (Cohen’s d) indicate effects ranging from small to large.

Several measures exhibit moderate to large effect sizes (e.g., HE3, Seq_avg, and Sel_3), suggesting potential differences between the examined conditions. Nevertheless, the wide confidence intervals reflect a high degree of uncertainty in these estimates, likely due to limited statistical power: this analysis relies only on the 6 participants per group who repeated the task after one month.

A percentage variation analysis was also conducted between the two sub-groups, and descriptive differences emerged between the VR and VB training outcomes (results are in Appendix A, Table A1). Regarding T_avg, the VB group showed a moderately better performance, with times approximately 14% lower than the VR group, indicating a slight advantage in global task completion speed. For T3, the difference was limited (about 4%); with the values reported above (146 s for VR and 152 s for VB), it marginally favors the VR group. In contrast, all error-related metrics consistently favored the VR group. For HE_avg, the VR group committed 43% fewer human errors than the VB group, suggesting more robust retention of correct manipulation strategies. This advantage became more pronounced in HE3, where the VR group's error count was approximately 87% lower than that of the VB group, pointing to a deterioration in accuracy after video-based training. Although the differences are not statistically significant, the results show a directionally favorable trend for VR in terms of long-term accuracy and robustness.
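Because a percentage difference between two groups depends on which group is taken as the reference, a short sketch may help when reading these figures. It uses the rounded R-II means quoted in the text (T_avg of 175 s for VR and 153 s for VB; HE3 of 0.3 for VR and 2.5 for VB); the function name is hypothetical, and since the inputs are rounded the results need not match Table A1 exactly.

```python
def pct_difference(value, reference):
    """Percentage difference of `value` relative to `reference`."""
    return (value - reference) / reference * 100.0


# Rounded R-II means quoted in the text.
t_avg = {"VR": 175.0, "VB": 153.0}
he3 = {"VR": 0.3, "VB": 2.5}

vb_vs_vr = pct_difference(t_avg["VB"], t_avg["VR"])  # VB time, VR as reference
vr_vs_vb = pct_difference(t_avg["VR"], t_avg["VB"])  # VR time, VB as reference
he3_vr_vs_vb = pct_difference(he3["VR"], he3["VB"])  # VR errors, VB as reference

print(f"T_avg: VB vs VR = {vb_vs_vr:.1f}%, VR vs VB = {vr_vs_vb:.1f}%")
print(f"HE3:   VR vs VB = {he3_vr_vs_vb:.1f}%")
```

With these rounded inputs, the same T_avg gap reads as roughly 12.6% when VR is the reference and roughly 14.4% when VB is the reference, so stating the reference group explicitly avoids ambiguity.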

The comparative analysis of performance learning curves (for both time and number of Human Errors) for all participants in the VB and VR training groups during R-II sessions is reported in Appendix B, Figure A2. Overall, the VR vs. VB comparison at one month shows that, while execution speed converges between the two groups, VR-based training achieves descriptively better accuracy and more robust performance over time. From a qualitative perspective, these findings suggest that immersive training may support more stable retention of procedural correctness in real cooperative HRI assembly tasks. However, the absence of statistical significance for every performance indicator does not allow defined trends or patterns across task repetitions over time to be asserted. Overall, while the observed trends appear to favor VR, the current evidence does not support definitive conclusions regarding its superiority over VB.

3.3. RQ3: VR and VB Performance Evolution from the R-I and R-II Task

RQ3.

Within each training group (VR and VB), are there significant improvements or degradations in system performance between the first real execution and the execution performed one month later of the cooperative HRI assembly task? (VR and VB sub-groups comparison between R-I and R-II).

The intra-group comparison between the first real execution (R-I session) and the 1-month follow-up (R-II session) reveals distinct learning and retention dynamics for the two training modalities. For VR group (Table 11), a clear improvement is observed in both temporal and error-related outcomes. Statistical analyses were conducted based on data distribution. For variables exhibiting a normal distribution, T_avg and HE_avg, the t-test confirmed statistical significance, with p-values of 0.017 and 0.026, respectively. Among the non-normally distributed variables, only T3 reached statistical significance in the Wilcoxon signed-rank test (p = 0.031).
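With only six paired observations, the resolution of a signed-rank test is coarse, which is worth keeping in mind when reading the p-values above. The sketch below computes the exact two-sided p-value by enumerating all sign assignments. It is an illustration under stated assumptions: the differences are hypothetical (they are not the study data), and whether the authors used an exact or an approximate test is not stated.

```python
from itertools import product


def exact_signed_rank_p(diffs):
    """Two-sided exact p-value of the Wilcoxon signed-rank test by enumeration."""
    nonzero = [d for d in diffs if d != 0]  # zero differences are dropped
    ordered = sorted(abs(d) for d in nonzero)

    def avg_rank(value):
        # Average rank of |d| among all absolute differences (handles ties).
        positions = [i + 1 for i, a in enumerate(ordered) if a == value]
        return sum(positions) / len(positions)

    ranks = [avg_rank(abs(d)) for d in nonzero]
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    centre = sum(ranks) / 2  # expected W+ under the null hypothesis
    observed = abs(w_plus - centre)

    extreme = 0
    for signs in product((False, True), repeat=len(ranks)):
        w = sum(r for r, positive in zip(ranks, signs) if positive)
        if abs(w - centre) >= observed - 1e-12:
            extreme += 1
    return extreme / 2 ** len(ranks)


# HYPOTHETICAL R-II minus R-I differences (s) in which all six participants improve.
hypothetical_diffs = [-60, -75, -40, -90, -55, -82]
p_min = exact_signed_rank_p(hypothetical_diffs)
print(p_min)
```

When all six differences share the same sign, the exact two-sided p-value is 2/2^6 = 0.03125, the smallest value attainable with six pairs. The p = 0.031 reported for T3 is numerically consistent with that situation, although this reading remains an inference.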

T_avg decreases from 257 s to 175 s, and T3 from 213 s to 146 s, indicating enhanced execution efficiency. More importantly, a statistically significant reduction in HE_avg is observed, from 2.5 to 0.7, accompanied by a consistent decrease in HE3 from 3.2 to 0.3. The reduction extends across all error subcategories (Exec, Seq, Sel), confirming a robust consolidation of procedural correctness over time for VR-trained operators.

For VB group (Table 12), a general improvement in T_avg is also observed, decreasing from 233 s to 153 s, while T3 remains relatively stable across sessions. HE_avg shows a marked reduction from 7.4 to 1.3, indicating that video-based training also supports medium-term learning. However, an opposite behavior is observed for HE3, which increases from 0.8 to 2.5, together with higher values in Exec3, Seq3, and Sel3, suggesting a less stable retention of fine-grained execution accuracy under repeated task conditions.

Among all variables analyzed, the only one that reached statistical significance was T_avg (t-test p-value = 0.015).

Table 13 summarizes the statistical significance, effect sizes (Cohen’s d), and 95% confidence intervals for the main measures of these two cases. Overall, T_avg significantly improved in both groups, with very large effect sizes. The VR training additionally showed significant improvements in T3 and HE_avg, indicating a broader impact on both speed and accuracy. In contrast, within the VB training, only T_avg reached statistical significance, while the remaining variables were not significant despite some large effect sizes (e.g., HE_avg: Cohen’s d = −1.52). Measures of the type of errors (Exec, Seq, Sel) did not show statistically significant changes in either group. However, moderate-to-large effects were observed in the VR group, whereas the VB group showed smaller or negligible effects. Several outcomes showed large effect sizes, with confidence intervals that included zero. This outcome was expected due to the limitations imposed by the small sample size (6 participants per group). Overall, the VR training appears more consistent and effective, whereas the VB training shows less stable statistical evidence despite some notable effect sizes.

In summary, with respect to RQ3, both VR and VB groups exhibit improvements in execution time between sessions, but VR-based training leads to a more consistent and stable reduction in errors across sessions, whereas VB shows higher variability and signs of partial performance degradation in repeated executions. Since statistical significance was not reached for every indicator analyzed, these results should be regarded as suggestive rather than conclusive.

An analysis of within-group performance changes between the first and second sessions highlights markedly different retention profiles for the two training strategies (results are in Appendix A, Table A1). For the VR group, improvements were systematic across all metrics. Overall, T_avg decreased by approximately 32%, and a comparable reduction was observed for T3, where performance improved from an initial value of 213 s in R-I to 146 s in R-II. Importantly, this reduction reflects not only a relative improvement but also convergence toward a stable and efficient execution level. In contrast, VB group started R-I with a substantially lower T3 value (169 s) but showed only a marginal reduction in R-II (152 s, −10%), indicating limited temporal consolidation despite repeated exposure. Thus, while VR participants improved from a slower initial performance to a markedly faster and more stable execution, VB participants exhibited a plateau effect, with minimal gains despite repetition.

Error-related metrics further differentiated the two groups. In the VR group, human errors decreased sharply between sessions (HE_avg −71%, HE3 −89%), indicating strong retention of fine manipulation strategies. Conversely, although the VB group showed a reduction in average human errors (HE_avg −83%), a critical divergence emerged for HE3, which increased by 200% from R-I to R-II. This increase, despite overall familiarity with the task, suggests that VB training may not adequately support the retention of error-sensitive behaviors under higher task complexity, potentially leading to overconfidence or insufficient internalization of corrective strategies when visual guidance is no longer present. A similar pattern was observed for execution-related errors: the VR group exhibited large and consistent reductions (Exec_avg −76%, Exec3 −88%), whereas the VB group showed moderate improvement in average execution errors (−40%) but a substantial increase in Exec3 (+267%), again indicating instability in complex task conditions. Selection and sequencing errors further confirmed this divergence.
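The within-group percentage changes can be recomputed from the session means quoted in the text for the follow-up sub-samples. The sketch below does so for the four main indicators; the dictionary layout and function name are hypothetical, and because the inputs are rounded means, the outputs differ slightly from the percentages quoted from Table A1 (for example, about −72% instead of −71% for VR HE_avg).

```python
def pct_change(first, second):
    """Percentage change from the first session (R-I) to the second (R-II)."""
    return (second - first) / first * 100.0


# (R-I mean, R-II mean) for the follow-up sub-samples, as quoted in the text.
sessions = {
    "VR": {"T_avg": (257, 175), "T3": (213, 146), "HE_avg": (2.5, 0.7), "HE3": (3.2, 0.3)},
    "VB": {"T_avg": (233, 153), "T3": (169, 152), "HE_avg": (7.4, 1.3), "HE3": (0.8, 2.5)},
}

changes = {
    group: {metric: pct_change(*values) for metric, values in metrics.items()}
    for group, metrics in sessions.items()
}

for group, metrics in changes.items():
    for metric, change in metrics.items():
        print(f"{group} {metric}: {change:+.1f}%")
```

The only positive change among these eight values is HE3 for the VB group, which is the divergence discussed above.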

Overall, while both training strategies supported improvements in global task efficiency over time, VR training led to more stable, generalized performance gains across increasing task complexity. VB group, despite some aligned improvements in metrics, exhibited pronounced fluctuations and regressions in high-demand conditions, highlighting limitations in long-term retention and transfer of complex cooperative assembly skills.

Although the study focused on establishing whether differences exist among the training strategies, these differences were assessed at group level rather than by following individual participants’ evolution over task repetitions. To strengthen the robustness of the analysis, a linear mixed-effects model with random intercepts was fitted to the data from the subjects who performed the assembly task 6 times (6 participants in the VR group and 6 in the VB group). The analysis showed a strong effect of repetition on execution time (β ≈ −31.7), reflecting marked improvement in performance over time across all participants. Although the VR group showed slightly longer completion times at the midpoint compared to the VB group (β ≈ 23.2), the interaction between training strategy and repetition was minimal (β ≈ 0.7), suggesting largely parallel learning trajectories. In practice, the type of training did not appreciably affect the rate of improvement. Regarding errors, the linear mixed-effects model revealed a modest reduction in HE over time (β ≈ −0.31), reflecting a slight learning process. The VR group had a slightly higher HE at the midpoint than the VB group (β ≈ 0.31). At the same time, the interaction between strategy and repetition was small and negative (β ≈ −0.41), denoting a minor tendency for the VR group to improve slightly faster.
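To make the interaction coefficients easier to read, the group-specific slopes per repetition can be derived from the reported estimates. The sketch below assumes treatment (dummy) coding with VB as the reference level (VB = 0, VR = 1), so that the repetition coefficient is the VB slope and the VR slope is the sum of the repetition and interaction coefficients. This coding is an assumption inferred from the wording of the results, not a documented model specification.

```python
def group_slopes(beta_repetition, beta_interaction):
    """Per-repetition slope for each group, ASSUMING VB = 0 and VR = 1 coding."""
    return {
        "VB": beta_repetition,
        "VR": beta_repetition + beta_interaction,
    }


# Coefficients quoted in the text (time in seconds, HE in number of errors).
time_slopes = group_slopes(beta_repetition=-31.7, beta_interaction=0.7)
error_slopes = group_slopes(beta_repetition=-0.31, beta_interaction=-0.41)

print("Time slope per repetition:", time_slopes)
print("HE slope per repetition:  ", error_slopes)
```

Under this coding, execution time falls by roughly 31 to 32 s per repetition in both groups, while the estimated decline in errors per repetition is about 0.31 for VB and about 0.72 for VR.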

Focusing on the specific typology of errors, for execution errors, the VR group showed slightly higher errors at the midpoint compared to the VB group (β ≈ 0.42). The negative interaction between strategy and repetition (β ≈ −0.35) indicates that errors decreased more rapidly in VR, denoting a steeper learning pattern for this group. A faster learning rate is thus obtained despite worse initial performance. Regarding selection and sequence errors, the VR group showed slightly lower values at the midpoint than the VB group (β ≈ −0.08 and −0.03, respectively). This difference was minimal, indicating broadly comparable performance between conditions at the midpoint. The interaction between strategy and repetition was small for both typologies of errors (β ≈ −0.03), suggesting largely similar learning trajectories across conditions. Selection errors decreased slightly over time (β ≈ −0.02), showing a modest improvement across repetitions, whereas sequence errors remained relatively stable over time (β ≈ 0.005), indicating little evidence of systematic learning across repetitions. In conclusion, the linear mixed-model analysis indicated that the learning process, as reflected in performance changes across repetitions, remained consistent across strategies. This pattern was also observed when examining errors and specific error types and remained even when accounting for inter-individual variability between participants.

4. Discussions

This study aimed to investigate how different training strategies—VR-based and VB—affect system performance in a real cooperative HRI assembly task, both immediately after training and one month later. Overall, the results show that VR primarily affects the accuracy and stability of human performance (here expressed through the HE taxonomy), rather than execution speed, with effects that remain visible in the medium term. At the same time, the limited sample size in R-II and the lack of statistical significance for most indicators at follow-up suggest that these conclusions should be interpreted with caution and considered preliminary evidence rather than definitive proof.

The primary outcomes of this study address RQ1 and make it possible to determine the extent to which VR-based and VB-based training modalities influence system performance in a real-world cooperative Human–Robot Interaction (HRI) assembly task. Given the amount of data available per participant and the number of participants, the primary analyses focus on this research question, which represents the main scope of the experimental evaluation. Regarding RQ1, VR and VB training produced comparable execution times and number of errors in the first real session (R-I). This pattern is particularly interesting when contrasted with much of the VR training literature, where immersive practice is often associated with shorter times and more efficient task execution, sometimes with less emphasis on error reduction and more focus on high-performance standards [53] or with attention paid only to motion features and technological VR-issues [54]. However, several studies in industrial and safety-critical domains have shown that VR-based training can improve procedural accuracy and reduce execution errors during subsequent real-world task performance, particularly in contexts requiring structured sequencing and spatial coordination [22,54]. These findings support the interpretation of VR training as a tool not only for the enhancement of efficiency but also for reliability and error prevention. In the presented case, VR-trained users did not complete the cooperative assembly faster than VB-trained ones; VR showed slightly higher T3 in R-I, though without statistical significance.

Regarding RQ2 and RQ3, the findings should be interpreted in an exploratory manner due to the limited number of participants who completed the one-month follow-up assessment. Therefore, any comparison between the VR and VB groups at R-II (RQ2), as well as within-group changes between the initial execution and the one-month follow-up (RQ3), should be regarded as preliminary evidence. These results do not allow for definitive conclusions but rather provide indicative trends that should be further investigated in studies with larger sample sizes and improved retention rates. With this caveat, both training strategies showed a general improvement between R-I and R-II, with shorter T_avg for both groups.

However, the way this improvement unfolds over time is different for VR and VB. For VR group, the one-month follow-up reveals a coherent and stable performance gain: T_avg and T3 both decreased, and HE3 drops close to zero. When HEs are decomposed into Exec, Seq, and Sel, VR shows a systematic reduction across all subcategories, both in average values and in the third repetition. This indicates that VR training not only lowers the total number of errors but also reduces execution slips, sequence confusions, and selection mistakes in a consistent way, pointing to robust consolidation of procedural knowledge, although the evidence remains preliminary and not statistically conclusive. Such stabilization patterns are consistent with prior evidence suggesting that immersive practice can facilitate motor sequence consolidation and enhance transfer of procedural skills to subsequent real-world execution [11,22]. In collaborative assembly with cobots, reducing errors may be more valuable than reducing cycle time, particularly in early deployment phases. These findings suggest that VR promotes a more cautious and structured interaction style: participants appear to prioritize correct action sequences and component selection, even at the cost of slightly longer execution times. From a cognitive perspective, such behavior may reflect a shift toward controlled processing and enhanced attentional regulation during task execution. According to human error theory, execution-level errors (slips) are frequently associated with attentional lapses or working memory overload rather than insufficient declarative knowledge [55]. Training modalities that support structured sequencing and reduce cognitive interference may therefore contribute to improved error monitoring and action consistency over time. This countertrend with respect to the dominant focus on time reduction in VR training studies represents a novel contribution to HRI literature. 
Rather than emphasizing speed optimization alone, the present findings tentatively support the interpretation of VR as a reliability-oriented training tool, potentially enhancing execution stability by managing cognitive workload and attentional resources [56]. However, given the limited follow-up sample, this interpretation must be considered exploratory and subject to further validation.

On the other hand, the VB group also improves in T_avg from R-I to R-II, confirming that VB instruction can support learning over time. However, HE3 and the corresponding subcategories (Exec3, Seq3, Sel3) increase in the second session, especially under repeated execution. This suggests that, although VB participants learn the task and become globally more accurate, their performance becomes less stable under repetition, with local execution and sequencing errors re-emerging in R-II. From a system perspective, VB training seems to foster improvement in “average” behavior, but with a higher residual variability and weaker resistance to perturbations or repeated trials.

When comparing VR and VB at R-II (technology effect on medium-term performance), most indicators do not reach statistical significance, which is a clear limitation of the study given the small sub-samples (n = 6 per group). Nevertheless, the difference in HE_avg, together with lower Exec, Seq, and Sel values for VR, suggests that VR-trained operators tend to maintain a more robust and homogeneous error profile over time. VB-trained operators, in contrast, appear more sensitive to local fluctuations, above all in the third repetition. These tendencies must be confirmed by studies with larger samples and should be regarded as exploratory trends rather than definitive effects: they point towards a qualitative advantage of VR in terms of error stability and retention.

Taken together, these results indicate that VR and VB should not be considered equivalent training solutions for cooperative HRI, even when their effects on execution time are similar. VR appears particularly suited when the design goal is to minimize operational errors and ensure stable behavior across repetitions and over time, for example, in tasks where safety margins are limited or where small mistakes can propagate along the production line. This perspective aligns with studies emphasizing the role of VR simulation as a preparatory scenario to reduce unsafe interactions and specific errors before real HRI deployment [54]. In such contexts, typical of a high-production mix involving several assembly operations or of collaborative tasks requiring precise spatial coordination, VR training has been shown to support the development of more reliable operator-robot interaction patterns. Specifically, from a theoretical standpoint, immersive environments may facilitate procedural knowledge consolidation through embodied sensorimotor interaction. Motor learning research shows that repeated active manipulation strengthens motor sequence representations and reduces reliance on working memory, promoting more consistent execution over time [57]. Moreover, immersive 3D contexts may enhance spatial memory encoding, potentially supporting more stable retrieval of task-relevant spatial information during later performances [11]. While these mechanisms could contribute to improved error stability, the present findings remain preliminary due to limited statistical power. VB, on the other hand, remains a viable and less resource-intensive option for contexts where moderate error rates are acceptable and where the emphasis is mainly on providing basic procedural knowledge with relatively low effort.

The analysis of error typologies is especially relevant in this regard. By aggregating Exec, Seq, and Sel errors across repetitions and sessions, VR consistently shows more pronounced reductions in all error categories, while VB improvements are more uneven. This suggests that VR not only helps operators remember “what to do” but also how and when to do it, supporting the fine temporal and spatial coordination required by collaborative assembly with cobots. It is important to note that the lack of statistical significance in the between-group comparison limits the strength of these conclusions, and the observed trends should be interpreted with caution.

From a manufacturing perspective, the reduction and stabilization of HEs in cooperative assembly tasks are directly connected to key production indicators [58,59]. Execution errors may translate into defective assemblies and rework operations; sequencing errors can generate micro-stoppage, additional inspection steps, or process interruptions; lastly, selection errors may lead to scrap, component damage, or quality deviations propagating downstream. In safety-critical HRI contexts, unstable performance profiles may also increase the likelihood of near-miss events or unsafe coordination episodes. Therefore, even in the absence of immediate cycle time reduction, improvements in error stability can have measurable implications for production efficiency, quality assurance, and operational risk management [60]. Also, the economic implications of HEs’ reduction in collaborative assembly should not be underestimated. Cost modeling studies in HRC contexts emphasize that performance variability and process disruptions significantly influence overall return on investment and operational sustainability [58]. Framing VR training as a reliability-oriented intervention thus aligns the present findings with concrete manufacturing decision criteria rather than purely academic performance comparisons.

Therefore, the choice between VR and VB training strategies should not be driven by technological innovation alone, but by the operational requirements of the specific industrial scenario, balancing all the factors cited above against the actual availability of such training resources.

The analysis of questionnaire responses helped understand the overall experience of the sample. The most interesting focus is on NASA-TLX responses: the answers indicate that the cooperative assembly task was not perceived as highly demanding overall, consistent with its relatively simple, structured nature. Participants reported moderate mental demand and low physical demand, suggesting that the task primarily required attention and coordination rather than physical effort. Temporal demand was generally limited, suggesting that the task was not experienced as time-pressured. At the same time, effort ratings remained moderate, reflecting the need for focused engagement rather than sustained cognitive strain. Perceived performance was rated relatively high, indicating that participants felt capable of completing the task successfully. Frustration levels were low overall, with some variability across individuals, likely associated with brief coordination or decision-making phases. Taken together, these results suggest that the task imposed a manageable cognitive load, supporting its suitability as a baseline cooperative assembly scenario for comparing training strategies without introducing excessive workload-related confounding effects, but it is not possible to claim that VR-based training reduces operators’ effort significantly if compared to VB training, as demonstrated in [61]. 
This study also offers a method to evaluate the usefulness of VR as an instrument for training operators in HRI tasks. The novelty of the present work lies in the additional attention paid to retention and task repetitions over time: this multi-outcome analysis enriches the body of knowledge regarding the design of HRI cooperative tasks, and it also includes the participants’ subjective evaluation of VR training and their overall impressions of the effort required by the real task, placing the human factor on the same level as the objective system performance measures of time and errors. Ref. [62] also provides a valuable reference for the evaluation of VR training approaches in manufacturing HRI tasks, but what is still missing is a long-term evaluation of operators’ skill retention and of the perceived benefit of the virtual scenario for subsequent real-world task performance. The combination of time, human error, and personal feedback collected in this experimental campaign offers a useful basis for multiple analyses, which will be extended in future work, especially by expanding the factors under evaluation and their mutual interaction and impact on performance.

4.1. Study Limitations

However, several limitations must be acknowledged. First, the sample size in R-II is small (n = 6 per group), which substantially reduces statistical power and results in non-significant differences across most indicators, despite clear descriptive trends. This reduction was caused by the unavailability of part of the sample for the second session. Future studies should replicate the experimental design with larger groups and, if possible, collect multi-site data to increase the robustness and generalizability of the findings.

Second, the experiment focused on a single cooperative assembly task and on a relatively homogeneous, young, and highly educated sample, mainly engineering students and researchers, who may not fully reflect the features, experience, and behavioral patterns of industrial operators.

Moreover, the assembly task is relatively simple, although it remains relevant from an industrial perspective. It is important to emphasize that the primary aim of this study was to investigate the effect of different training strategies on performance. While the task itself is meaningful, it did not represent the main focus of the experiment, which was conducted in a university laboratory setting. The experimental design could be applied to more complex tasks. In such a case, the results could be statistically analyzed to identify trends in time and error performance as a function of task complexity and the different training strategies adopted. Different tasks (e.g., more complex assembly sequences, maintenance or inspection activities) and more heterogeneous populations (e.g., older workers, operators with lower technological familiarity) may yield different patterns of time–error trade-offs and learning curves. Extending the study to multiple task types and user profiles would allow a more complete characterization of when VR provides the largest benefit compared to VB.

Third, the literature defines many parameters as well-established psychophysiological indicators of mental workload, stress, and autonomic nervous system activity. A combined review of ease of interpretation and available monitoring devices identified Heart Rate Variability (HRV) as a valuable physiological measure for assessing operator perception. Given the exploratory nature of this study and its strong focus on VR’s potential as a training strategy for HRI tasks, HRV was not included among the key performance indicators analyzed here. HRV has been widely adopted in ergonomics research, and recent advances in wearable technologies enable continuous, non-invasive monitoring of operators’ physiological states. In the context of HRI, several studies have highlighted the potential of HRV to capture stress responses and psychophysiological adaptation during collaboration with cobots. Future research will integrate HRV monitoring, potentially through wearable devices such as smartwatches, to complement performance-based metrics and self-reported measures. Multimodal approaches combining HRV, workload questionnaires, eye-tracking, and behavioral indicators could provide a more comprehensive understanding of cognitive and emotional demands during immersive HRI training.

Finally, the present study considered only two discrete measurement points (immediately after training and one month later). To fully understand learning and retention dynamics, additional intermediate and longer-term follow-ups would be beneficial, as well as experimental manipulations of refresher training sessions in VR or VB. Modeling the intermediate point at which a refresher becomes necessary should be of particular interest for non-repetitive tasks. This would help determine how often and in what format training should be repeated to maintain low error rates over extended periods, which would better align with the literature outcomes of the initial analysis.

4.2. Future Research Directions

This study provides novel evidence that VR training in cooperative HRI does not necessarily reduce execution time but primarily increases accuracy and stabilizes error patterns over time, in contrast to much of the existing VR literature. This insight can inform the design of hybrid training strategies in which VR is strategically adopted when error-sensitive, safety-critical tasks are introduced in collaborative manufacturing environments, while VB or other low-cost methods are reserved for less critical operations or for early familiarization phases.

Despite the improvements observed in performance, a more critical aspect concerns the economic evaluation of the results achieved. Improvements in performance indicators, such as reliability or task execution time, do not necessarily imply the economic viability of VR-based training systems. Developing VR training scenarios entails significant costs, and defining and quantifying error-related costs can significantly influence the overall feasibility of VR-based training solutions. Researchers should therefore focus on integrating performance metrics with comprehensive economic assessments to better evaluate the cost-effectiveness of such training systems. Most importantly, researchers should quantify these relationships by integrating time-cost-quality models and downtime analysis into HRI training evaluation frameworks.

Regarding the experimental design, the training tasks were conducted to minimize potential differences in practice, thereby reducing bias in the analysis. Further research is warranted to refine the experimental design and to develop improved training strategies for future applications. Such strategies should aim to provide all trainees with a comparable level of practical experience, which should be quantifiable and associated with a defined threshold.

In addition, the analysis primarily focused on differences between the VR and VB training groups, without accounting for individual variability. Individual differences warrant further investigation and were introduced only briefly in this study to enhance the robustness of the analysis (Section 3.3). The results are based on a small, relatively homogeneous sample; therefore, they should not be interpreted as generalizable evidence. Future research with larger samples is needed to explore variability and individual learning rates, focusing on individual participants and including additional participant characteristics such as age.

One of the main methodological limitations of the analysis is the repeated testing on the same dataset, which can increase the risk of false positives when assessing statistical significance. Although the present work is a small-scale application example, larger experimental campaigns must account for this issue by adopting appropriate multiple-testing correction methods. Approaches such as the Holm–Bonferroni and Benjamini–Hochberg procedures should be considered to control the family-wise error rate and the false discovery rate, respectively, thereby ensuring the validity and robustness of the statistical inference.
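
As a minimal sketch of how such corrections could be applied, the Python snippet below computes Holm–Bonferroni and Benjamini–Hochberg adjusted p-values for the ten R-II between-group p-values reported in Table 10. Treating these ten tests as a single family is an assumption made here for illustration, and the names `holm_adjust`, `bh_adjust`, and `TABLE_10_P` are hypothetical helpers that are not part of the original analysis.

```python
def holm_adjust(pvals):
    """Holm-Bonferroni step-down adjusted p-values (family-wise error rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # ascending p-values
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * pvals[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


def bh_adjust(pvals):
    """Benjamini-Hochberg step-up adjusted p-values (false discovery rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)  # descending
    adjusted = [0.0] * m
    running_min = 1.0
    for k, idx in enumerate(order):
        rank = m - k  # 1-based rank of this p-value in ascending order
        candidate = min(1.0, pvals[idx] * m / rank)
        running_min = min(running_min, candidate)  # enforce monotonicity
        adjusted[idx] = running_min
    return adjusted


# Unadjusted p-values copied from Table 10 (R-II, VR vs. VB).
# Grouping them into one family of ten tests is an illustrative assumption.
TABLE_10_P = {
    "T_avg": 0.39, "T3": 0.39, "HE_avg": 0.32, "HE3": 0.14, "Exec_avg": 0.50,
    "Seq_avg": 0.18, "Sel_avg": 0.60, "Exec3": 0.42, "Seq3": 0.41, "Sel3": 0.18,
}

if __name__ == "__main__":
    raw = list(TABLE_10_P.values())
    for name, p, h, b in zip(TABLE_10_P, raw, holm_adjust(raw), bh_adjust(raw)):
        print(f"{name:9s} raw={p:.2f}  Holm={h:.3f}  BH={b:.3f}")
```

Because every unadjusted p-value in Table 10 already exceeds 0.05, neither procedure changes the conclusions for this dataset; the adjustment matters mainly in larger campaigns where some raw p-values fall below the chosen threshold.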

Furthermore, while evidence supports the effectiveness of VR in individual skill acquisition, few studies have systematically examined the bidirectional transfer of learning, that is, how prior experience in VR influences real-world performance and, conversely, how real-world experience affects subsequent task execution in VR [61]. Future research should address this gap through larger and more structured experimental campaigns, correlating participant characteristics with performance outcomes to better isolate and quantify the effects of prior VR and task-related experience.

Lastly, future studies should replicate the experimental protocol with professional industrial workers to validate the observed trends under real manufacturing conditions and across broader age and experience profiles.

5. Conclusions

This study investigated the impact of different training strategies (VR and VB training) on system performance in a cooperative HRI assembly task, considering both immediate effects and performance evolution over time. By combining objective performance indicators and a longitudinal experimental design, the work provides empirical insights into how training modalities shape learning, retention, and execution stability in real cooperative manufacturing scenarios. Moreover, it offers preliminary insights into the potential alignment between training modalities and task requirements, which should be further investigated before deriving concrete design guidelines.

The results show that VR-based training does not primarily improve execution speed, as VR- and VB-trained operators achieved comparable task completion times and error counts in the first real session. Instead, the main advantage of VR lies in its effect on accuracy and robustness after one month: VR-trained participants consistently committed fewer HEs and maintained lower and more stable error levels. In contrast, although VB training supported improvements in average performance over time, it was associated with greater variability and occasional degradation in error-related metrics under repeated task execution, particularly in more demanding conditions. These findings suggest that VR fosters a deeper consolidation of procedural knowledge, supporting not only what actions should be performed, but also how and when they should be executed in cooperative HRI contexts. These results are descriptive and exploratory; they are not supported by consistent statistical significance, which future experiments should investigate. From an applied perspective, the study highlights an important trade-off between speed and reliability in training design. While VB remains a viable and cost-effective solution for time-oriented or low-criticality tasks, VR-based training may be more suitable for accuracy-sensitive and safety-critical HRI applications, where error containment and performance stability are prioritized over minimal cycle times. This distinction is particularly relevant in Industry 5.0 settings, where human-centered design, well-being, and system resilience are central objectives. However, these observations should be considered tentative and hypothesis-generating rather than definitive design conclusions.

As detailed in the previous section, the work has some limitations, including a reduced sample size at the one-month follow-up and a focus on a single cooperative assembly task and a relatively homogeneous population. Future research should extend the experimental design to larger and more diverse samples, multiple task typologies, and longer observation horizons to better characterize learning and forgetting dynamics. Moreover, integrating additional human-centered measures, such as psychophysiological indicators, could further enrich the understanding of cognitive and emotional processes underlying HRI training.

Overall, this work contributes to the growing body of research on immersive training for HRI by providing preliminary indications that immersive VR training may support performance stability over time. However, due to the limited sample size in the follow-up phase, these results should be interpreted cautiously. While the presented differences suggest potential advantages in reliability and error consistency, further research with larger samples is required before drawing definitive conclusions regarding long-term retention effects and before offering practical guidance for designing effective and sustainable training strategies in collaborative manufacturing systems.

Author Contributions

Conceptualization, V.D.P. and S.M.; methodology, P.F., V.D.S., S.M. and V.D.P.; software, P.F. and V.D.S.; formal analysis, P.F. and V.D.S.; writing—original draft preparation, V.D.S. and P.F.; writing—review and editing, V.D.P. and S.M.; project administration, V.D.P.; funding acquisition, V.D.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the European Union—NextGenerationEU Plan, component M4C2, investment 1.1, through the Italian Ministry for Universities and Research MUR “Bando PRIN 2022—D.D. 104 del 02-02-2022” in the context of the PRIN Project RESILIENCE.

Institutional Review Board Statement

The experimental campaign was conducted in a laboratory environment, and it was carried out following the rules of the Declaration of Helsinki of 1975 (https://www.wma.net/what-we-do/medical-ethics/declaration-of-helsinki/ accessed on 21 October 2025), revised in 2013.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Dataset available on request from the authors/Data is contained within the article.

Acknowledgments

This research work is part of the activities carried out in the context of the RESILIENCE project (Prescriptive digital twins for cognitive-enriched competency development of workforce of the future in smart factories), Code. 2022K2SAFM, CUP H53D23001310001, D53D23003690006.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Comparisons. Lower values indicate better performance for all metrics. Δ Training (%) represents the relative difference between the Video-Based and VR groups, (VB − VR)/VB, within the same session. Δ Session (%) represents the relative change between R-I and R-II within the same group.

Metric	VR R-I	VB R-I	Δ Training R-I (%)	VR R-II	VB R-II	Δ Training R-II (%)	Δ Session VR (%)	Δ Session VB (%)
T_avg	257.11	232.50	−11%	174.61	152.89	−14%	−32%	−34%
T3	213.17	169.33	−26%	145.50	152.00	+4%	−32%	−10%
HE_avg	2.50	7.39	+66%	0.72	1.28	+43%	−71%	−83%
HE3	3.17	0.83	−280%	0.33	2.50	+87%	−89%	+200%
Exec_avg	2.83	1.67	−70%	0.67	1.00	+33%	−76%	−40%
Exec3	2.83	0.50	−467%	0.33	1.83	+82%	−88%	+267%
Sel_avg	0.22	0.22	0%	0.06	0.11	+50%	−75%	−50%
Sel3	0.17	0.17	0%	0.00	0.33	+100%	−100%	+100%
Seq_avg	0.17	0.17	0%	0.00	0.17	+100%	−100%	0%
Seq3	0.17	0.17	0%	0.00	0.33	+100%	−100%	+100%
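
The percentage columns of Table A1 can be reproduced from the group means with a few lines of Python. The sketch below is illustrative only: the Δ Training formula follows the table caption, whereas the Δ Session formula, (R-II − R-I)/R-I, is an assumption inferred from the tabulated values, and the function names are hypothetical.

```python
def delta_training(vr, vb):
    """Relative difference (VB - VR) / VB in percent, as stated in the caption."""
    return (vb - vr) / vb * 100.0


def delta_session(first, second):
    """Relative change from R-I to R-II in percent (assumed: (R-II - R-I) / R-I)."""
    return (second - first) / first * 100.0


# Group means copied from the T_avg and T3 rows of Table A1 (seconds).
ROWS = {
    "T_avg": {"VR_I": 257.11, "VB_I": 232.50, "VR_II": 174.61, "VB_II": 152.89},
    "T3": {"VR_I": 213.17, "VB_I": 169.33, "VR_II": 145.50, "VB_II": 152.00},
}

if __name__ == "__main__":
    for name, r in ROWS.items():
        print(
            name,
            f"dTrain R-I={delta_training(r['VR_I'], r['VB_I']):+.0f}%",
            f"dTrain R-II={delta_training(r['VR_II'], r['VB_II']):+.0f}%",
            f"dSession VR={delta_session(r['VR_I'], r['VR_II']):+.0f}%",
            f"dSession VB={delta_session(r['VB_I'], r['VB_II']):+.0f}%",
        )
```

Both ratios are undefined when the baseline is zero, so in this sketch a metric whose reference mean is 0 would need separate handling.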

Appendix B

Figure A1. Comparative analysis of performance learning curves for Video-Based (VB) and Virtual Reality (VR) training groups in R-I sessions.

Figure A2. Comparative analysis of performance learning curves for Video-Based (VB) and Virtual Reality (VR) training groups across R-I and R-II sessions.

References

1. Inkulu, A.K.; Bahubalendruni, M.V.A.R.; Dara, A.; SankaranarayanaSamy, K. Challenges and Opportunities in Human Robot Collaboration Context of Industry 4.0—A State of the Art Review. Ind. Rob. 2022, 49, 226–239.
2. Di Pasquale, V.; De Simone, V.; Giubileo, V.; Miranda, S. A Taxonomy of Factors Influencing Worker’s Performance in Human–Robot Collaboration. IET Collab. Intell. Manuf. 2023, 5, e12069.
3. Sadrfaridpour, B.; Wang, Y. Collaborative Assembly in Hybrid Manufacturing Cells: An Integrated Framework for Human-Robot Interaction. IEEE Trans. Autom. Sci. Eng. 2018, 15, 1178–1192.
4. Di Pasquale, V.; Cutolo, P.; Esposito, C.; Franco, B.; Iannone, R.; Miranda, S. Virtual Reality for Training in Assembly and Disassembly Tasks: A Systematic Literature Review. Machines 2024, 12, 528.
5. Dwivedi, P.; Cline, D.; Joe, C.; Etemadpour, R. Manual Assembly Training in Virtual Environments. In Proceedings of the IEEE 18th International Conference on Advanced Learning Technologies (ICALT), Mumbai, India, 9–13 July 2018; pp. 395–399.
6. Borghi, S.; Ruo, A.; Sabattini, L.; Peruzzini, M.; Villani, V. Assessing Operator Stress in Collaborative Robotics: A Multimodal Approach. Appl. Ergon. 2025, 123, 104418.
7. Brunzini, A.; Papetti, A.; Messi, D.; Germani, M. A Comprehensive Method to Design and Assess Mixed Reality Simulations. Virtual Real. 2022, 26, 1257–1275.
8. Akinola, Y.M.; Agbonifo, O.C.; Sarumi, O.A. Virtual Reality as a Tool for Learning: The Past, Present and the Prospect. J. Appl. Learn. Teach. 2020, 3, 51–58.
9. Yin, F.; Chen, J.; Xue, H.; Kang, K.; Lu, C.; Chen, X.; Li, Y. Integration of Wearable Electronics and Heart Rate Variability for Human Physical and Mental Well-Being Assessment. J. Semicond. 2025, 46, 20.
10. Riemann, T.; Kreß, A.; Roth, L.; Klipfel, S.; Metternich, J.; Grell, P. Agile Implementation of Virtual Reality in Learning Factories. Procedia Manuf. 2020, 45, 1–6.
11. Levac, D.E.; Huber, M.E.; Sternad, D. Learning and Transfer of Complex Motor Skills in Virtual Reality: A Perspective Review. J. Neuroeng. Rehabil. 2019, 16, 121.
12. Radianti, J.; Majchrzak, T.A.; Fromm, J.; Wohlgenannt, I. A Systematic Review of Immersive Virtual Reality Applications for Higher Education: Design Elements, Lessons Learned, and Research Agenda. Comput. Educ. 2020, 147, 103778.
13. Lei, Y.; Su, Z.; Cheng, C. Virtual Reality in Human-Robot Interaction: Challenges and Benefits. Electron. Res. Arch. 2023, 31, 2374–2408.
14. Liu, O.; Rakita, D.; Mutlu, B.; Gleicher, M. Understanding Human-Robot Interaction in Virtual Reality. In Proceedings of the 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), Lisbon, Portugal, 28 August–1 September 2017; pp. 751–757.
15. Nenna, F.; Orso, V.; Zanardi, D.; Gamberini, L. The Virtualization of Human–Robot Interactions: A User-Centric Workload Assessment. Virtual Real. 2023, 27, 553–571.
16. Plomin, J.; Schweidler, P.; Oehme, A. Virtual Reality Check: A Comparison of Virtual Reality, Screen-Based, and Real World Settings as Research Methods for HRI. Front. Robot. AI 2023, 10, 1156715.
17. Wang, T.; Zheng, P.; Li, S.; Wang, L. Multimodal Human–Robot Interaction for Human-Centric Smart Manufacturing: A Survey. Adv. Intell. Syst. 2024, 6, 2300359.
18. Zhu, Q.; Du, J. Neural Functional Analysis in Virtual Reality Simulation: Example of a Human-Robot Collaboration Tasks. In Proceedings of the Winter Simulation Conference; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2020; pp. 2424–2434.
19. Higgins, P.; Barron, R.; Engel, D.; Matuszek, C. A Comparative Analysis of VR-Based and Real-World Human-Robot Collaboration for Small-Scale Joining. In Proceedings of the 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops, VRW 2023; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2023; pp. 845–846.
20. Malik, A.A.; Masood, T.; Bilberg, A. Virtual Reality in Manufacturing: Immersive and Collaborative Artificial-Reality in Design of Human-Robot Workspace. Int. J. Comput. Integr. Manuf. 2020, 33, 22–37.
21. Takahashi, N.; Inamura, T.; Mizuchi, Y.; Choi, Y. Evaluation of the Difference of Human Behavior between VR and Real Environments in Searching and Manipulating Objects in a Domestic Environment. In Proceedings of the 2021 30th IEEE International Conference on Robot & Human Interactive Communication (RO-MAN), Vancouver, BC, Canada, 8–12 August 2021; pp. 454–460.
22. Matsas, E.; Vosniakos, G.C.; Batras, D. Effectiveness and Acceptability of a Virtual Environment for Assessing Human–Robot Collaboration in Manufacturing. Int. J. Adv. Manuf. Technol. 2017, 92, 3903–3917.
23. Urrea, C.; Kern, J. Recent Advances and Challenges in Industrial Robotics: A Systematic Review of Technological Trends and Emerging Applications. Processes 2025, 13, 832.
24. Hoedt, S.; Claeys, A.; Aghezzaf, E.H.; Cottyn, J. Real Time Implementation of Learning-Forgetting Models for Cycle Time Predictions of Manual Assembly Tasks after a Break. Sustainability 2020, 12, 5543.
25. Higgins, P.; Barron, R.; Engel, D.; Matuszek, C. Lessons from a Small-Scale Robot Joining Experiment in VR; Association for Computing Machinery: New York, NY, USA, 2021; Volume 1.
26. Paas, F.G.; Van Merriënboer, J.J.; Adam, J.J. Measurement of Cognitive Load in Instructional Research. Percept. Mot. Ski. 1994, 79, 419–430.
27. Su, B.; Jung, S.H.; Lu, L.; Wang, H.; Qing, L.; Xu, X. Exploring the Impact of Human-Robot Interaction on Workers’ Mental Stress in Collaborative Assembly Tasks. Appl. Ergon. 2024, 116, 104224.
28. Sweller, J.; Van Merrienboer, J.J.G.; Paas, F.G.W.C. Cognitive Architecture and Instructional Design. Educ. Psychol. Rev. 1998, 10, 251–296.
29. Hill, R.H.; Goodman, L.L. Man and Automation. Math. Tables Other Aids Comput. 1959, 13, 69.
30. Esposito, C.; De Simone, V.; Di Pasquale, V.; Rinaldi, M.; Fera, M.; Miranda, S. The Role of Human Error in Human Robot Interaction. Procedia Comput. Sci. 2024, 253, 2347–2357.
31. Huck, T.P.; Ledermann, C.; Kroger, T. Testing Robot System Safety by Creating Hazardous Human Worker Behavior in Simulation. IEEE Robot. Autom. Lett. 2022, 7, 770–777.
32. Kalatzis, A.; Rahman, S.; Girishan Prabhu, V.; Stanley, L.; Wittie, M. A Multimodal Approach to Investigate the Role of Cognitive Workload and User Interfaces in Human-Robot Collaboration. In Proceedings of the ACM International Conference Proceeding Series; Association for Computing Machinery: New York, NY, USA, 2023; pp. 5–14.
33. Apraiz, A.; Lasa, G.; Mazmela, M. Evaluation of User Experience in Human–Robot Interaction: A Systematic Literature Review. Int. J. Soc. Robot. 2023, 15, 187–210.
34. Farina, P.; Rinaldi, M.; Di Pasquale, V.; Fera, M.; Macchiaroli, R.; Riemma, S. Mental Stress in Human-Robot Interaction: Experimental Insights from a Cooperative Assembly Task. In Proceedings of the XXX Summer School F. Turco-AIDI, Lecce, Italy, 10–12 September 2025.
35. Coronado, E.; Kiyokawa, T.; Ricardez, G.A.G.; Ramirez-Alpizar, I.G.; Venture, G.; Yamanobe, N. Evaluating Quality in Human-Robot Interaction: A Systematic Search and Classification of Performance and Human-Centered Factors, Measures and Metrics towards an Industry 5.0. J. Manuf. Syst. 2022, 63, 392–410.
36. Paletta, L.; Pszeida, M.; Ganster, H.; Fuhrmann, F.; Weiss, W.; Ladstatter, S.; Dini, A.; Murg, S.; Mayer, H.; Brijacak, I.; et al. Gaze-Based Human Factors Measurements for the Evaluation of Intuitive Human-Robot Collaboration in Real-Time. In Proceedings of the 24th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), Zaragoza, Spain, 10–13 September 2019; pp. 1528–1531.
37. Spielberger, C.D. State-Trait Anxiety Inventory. In The Corsini Encyclopedia of Psychology; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1983; pp. 19–21.
38. Bartneck, C. Godspeed Questionnaire Series: Translations and Usage. In International Handbook of Behavioral Health Assessment; Springer: Cham, Switzerland, 2023.
39. Di Pasquale, V.; Farina, P.; Fera, M.; Gerbino, S.; Rinaldi, M.; Miranda, S. Human-Robot Interaction: A Conceptual Framework for Task Performance Analysis. IFAC Pap. 2024, 58, 79–84.
40. Rinaldi, M.; Di Pasquale, V.; Farina, P.; Iannone, R.; Macchiaroli, R.; Grosse, E.H. Human–Robot Interaction in Industry: A Tertiary Study. Procedia Comput. Sci. 2025, 253, 1691–1701.
41. Alves, J.; Lima, T.M.; Gaspar, P.D. Is Industry 5.0 a Human-Centred Approach? A Systematic Review. Processes 2023, 11, 193.
42. Rossi, P.G. Social Robotics; Springer: Cham, Switzerland, 2017; Volume 9.
43. Cohen, S.; Kamarck, T.; Mermelstein, R. A Global Measure of Perceived Stress. J. Health Soc. Behav. 1983, 24, 385–396.
44. Vorderer, P.; Wirth, W.; Gouveia, F.R.; Biocca, F.; Saari, T.; Jäncke, L.; Böcking, S.; Schramm, H.; Gysbers, A.; Hartmann, T.; et al. MEC Spatial Presence Questionnaire (MEC-SPQ). MEC Spat. Presence Quest. 2004, 18, 15.
45. Lewis, J.R.; Utesch, B.S.; Maher, D.E. UMUX-LITE—When There’s No Time for the SUS. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Paris, France, 27 April–2 May 2013; pp. 2099–2102.
46. Kim, H.K.; Park, J.; Choi, Y.; Choe, M. Virtual Reality Sickness Questionnaire (VRSQ): Motion Sickness Measurement Index in a Virtual Reality Environment. Appl. Ergon. 2018, 69, 66–73.
47. Aharony, N.; Krakovski, M.; Edan, Y. A Transparency-Based Action Model Implemented in a Robotic Physical Trainer for Improved HRI. ACM Trans. Hum.-Robot Interact. 2024, 14, 15.
48. Aschenbrenner, D.; Leutert, F.; Çençen, A.; Verlinden, J.; Schilling, K.; Latoschik, M.; Lukosch, S. Comparing Human Factors for Augmented Reality Supported Single-User and Collaborative Repair Operations of Industrial Robots. Front. Robot. AI 2019, 6, 37.
49. Hader, B.; Wendelin, T.; Schlund, S. Improving Human-Robot Interaction Through Decision Support and Workplace-Based Learning: Prototype of a Worker Assistance System for Adaptive Task Sharing Between Robots and Humans. In Learning Factories of the Future; Lecture Notes in Networks and Systems; Springer: Cham, Switzerland, 2024; pp. 285–292.
50. Macdonald, S.; Bretin, R.; Salma, E. Evaluating Transferable Emotion Expressions for Zoomorphic Social Robots using VR Prototyping. arXiv 2024, arXiv:2410.15486.
51. Lee, D.K. Interval and Effect Size. Korean J. Anesthesiol. 2016, 69, 555–562.
52. Bates, D.; Mächler, M.; Bolker, B.M.; Walker, S.C. Fitting Linear Mixed-Effects Models Using Lme4. J. Stat. Softw. 2015, 67, 1–48.
53. Ottogalli, K.; Rosquete, D.; Rojo, J.; Amundarain, A.; María Rodríguez, J.; Borro, D. Virtual Reality Simulation of Human-Robot Coexistence for an Aircraft Final Assembly Line: Process Evaluation and Ergonomics Assessment. Int. J. Comput. Integr. Manuf. 2021, 34, 975–995.
54. Fratczak, P.; Goh, Y.M.; Kinnell, P.; Soltoggio, A.; Justham, L. Understanding Human Behaviour in Industrial Human-Robot Interaction by Means of Virtual Reality. In Proceedings of the ACM International Conference Proceeding Series; Association for Computing Machinery: New York, NY, USA, 2019.
55. EUROCONTROL. Human Factors Integration Framework Guidelines—HFIG–WP3: Detailed Framework and General Guidelines for Integration into ATM SMS; Working Draft, Edition 1.0; EUROCONTROL: Brussels, Belgium, 2007.
56. Wickens, C.D. Multiple Resources and Mental Workload. Hum. Factors 2008, 50, 449–455.
57. Doyon, J.; Benali, H. Reorganization and Plasticity in the Adult Brain during Learning of Motor Skills. Curr. Opin. Neurobiol. 2005, 15, 161–167.
58. Barravecchia, F.; Mastrogiacomo, L.; Franceschini, F. A General Cost Model to Assess the Implementation of Collaborative Robots in Assembly Processes. Int. J. Adv. Manuf. Technol. 2023, 125, 5247–5266.
59. Rinaldi, M.; Caterino, M.; Fera, M. Sustainability of Human-Robot Cooperative Configurations: Findings from a Case Study. Comput. Ind. Eng. 2023, 182, 109383.
60. El Maraghy, W.; El Maraghy, H. A New Engineering Design Paradigm—The Quadruple Bottom Line. Procedia CIRP 2014, 21, 18–26.
61. Adami, P.; Rodrigues, P.B.; Woods, P.J.; Becerik-Gerber, B.; Soibelman, L.; Copur-Gencturk, Y.; Lucas, G. Impact of VR-Based Training on Human–Robot Interaction for Remote Operating Construction Robots. J. Comput. Civ. Eng. 2022, 36, 04022006.
62. Matsas, E.; Vosniakos, G.C. Design of a Virtual Reality Training System for Human–Robot Collaboration in Manufacturing Tasks. Int. J. Interact. Des. Manuf. 2017, 11, 139–153.

Figure 1. Flowchart of the methodology applied.

Figure 2. PC desk stand for assembly.

Figure 3. HRI assembly task flowchart derived from [34].

Figure 4. Test procedure flowchart.

Figure 5. Distribution of participants across training modalities (VR and VB) and real execution sessions (R-I and R-II).

Figure 6. Boxplots for execution times across the two sub-samples.

Figure 7. Boxplots for human errors across the two sub-samples.

Table 1. PC desk stand—Flat Bill of Materials.

Part Number	Description	Quantity
A	L beam	4
B	5 beam	2
C	6 beam	2
D	8 beam	2
E	Pin	10
F	Axle 12 with stops	2

Table 2. Schematization of groups and description of the referred task. # = number.

	R-I		R-II (After 1-Month Follow-Up)
Groups According to Training Strategy	# Participants	Description	# Participants	Description
VB group	13	Training with video instructions + 3 repetitions in the real environment	6	3 repetitions in the real environment
VR group	13	Training with a VR session + 3 repetitions in the real environment	6	3 repetitions in the real environment

Table 3. Classification of HEs occurring and collected during the HRI task execution.

Type of Human Error	Details
Execution (Exec)	Wrong input to the cobot; wrong positioning (location and orientation); collision; incorrect assembly
Sequence (Seq)	Insertion, omission, substitution, reversing, dropping
Selection (Sel)	Wrong part/fasteners

Table 4. Data collection and notation.

Notation	Meaning
T_avg	Average task time on the three repetitions
T3	Task time of the third repetition, used to emphasize the final value achieved
HE_avg	Average number of errors on the three repetitions without considering differences among types of errors
HE3	Number of errors on the third repetition, without considering differences among types of errors
Exec_avg	Average number of execution errors on the three repetitions
Seq_avg	Average number of sequence errors on the three repetitions
Sel_avg	Average number of selection errors on the three repetitions
Exec3	Number of execution errors on the third repetition
Seq3	Number of sequence errors on the third repetition
Sel3	Number of selection errors on the third repetition

Table 5. System performance for the two sub-samples (VR and VB training) for the R-I session.

	Training Strategy	Mean Value	SD	Min	Max
T_avg [s]	VR	248	72	174	434
T_avg [s]	VB	265	78	174	464
T3 [s]	VR	197	83	131	457
T3 [s]	VB	182	36	135	236
HE_avg [#]	VR	2.1	1.0	0.33	3.7
HE_avg [#]	VB	2.4	1.0	0.33	3.7
HE3 [#]	VR	1.7	3.4	0	11
HE3 [#]	VB	0.9	1.1	0	3.0

Table 6. Values for HEs’ specific subcategories for R-I session.

HEs [#]	Training Strategy	Mean Value	SD	Min	Max
Exec_avg	VR	2.3	1.9	0	5.3
Exec_avg	VB	1.9	1.0	0.3	3.3
Seq_avg	VR	0.1	0.2	0	0.7
Seq_avg	VB	0.2	0.3	0	1.0
Sel_avg	VR	0.2	0.3	0	0.7
Sel_avg	VB	0.2	0.3	0	1.0
Exec3	VR	1.5	3.2	0	10.0
Exec3	VB	0.7	0.9	0	2.0
Seq3	VR	0.1	0.3	0	1.0
Seq3	VB	0.2	0.4	0	1.0
Sel3	VR	0.1	0.3	0	1.0
Sel3	VB	0.1	0.3	0	1.0

Table 7. Statistical significance and effect sizes (Cohen’s d and 95% confidence intervals) for system performance in the comparison between VR and VB training during the R-I session.

	Statistical Significance	Cohen’s d	95% Confidence Interval (CI)
T_avg [s]	No (Wilcoxon rank-sum test p = 0.49)	0.22	[−0.55, 0.99]
T3 [s]	No (Wilcoxon rank-sum test p = 1)	−0.25	[−1.01, 0.53]
HE_avg [#]	No (t-test p = 0.66)	−0.18	[−0.94, 0.60]
He3 [#]	No (Wilcoxon rank-sum test p = 0.63)	−0.31	[−1.08, 0.47]
Exec_avg	No (t-test p = 0.63)	−0.25	[−1.02, 0.52]
Seq_avg	No (Wilcoxon rank-sum test p = 0.29)	0.41	[−0.38, 1.18]
Sel_avg	No (Wilcoxon rank-sum test p = 0.91)	0	[−0.77, 0.77]
Exec_3	No (Wilcoxon rank-sum test p = 0.70)	−0.36	[−1.13, 0.42]
Seq_3	No (Wilcoxon rank-sum test p = 0.58)	0.23	[−0.54, 1.00]
Sel_3	No (Wilcoxon rank-sum test p = 1)	0	[−0.77, 0.77]
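
The effect sizes in Table 7 can be approximated from the summary statistics in Table 5. The following is a minimal sketch, not the authors' procedure: it assumes 13 participants per group (the 26 participants split evenly), a pooled-standard-deviation Cohen's d computed as VB minus VR, and a large-sample normal approximation for the confidence interval. The function names are hypothetical.

```python
import math


def cohens_d_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d (group b minus group a) using the pooled standard deviation."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_b - mean_a) / math.sqrt(pooled_var)


def approx_ci95(d, n_a, n_b):
    """Large-sample normal approximation to the 95% confidence interval of d."""
    se = math.sqrt((n_a + n_b) / (n_a * n_b) + d**2 / (2 * (n_a + n_b)))
    return d - 1.96 * se, d + 1.96 * se


# Table 5, T_avg [s]: VR 248 (SD 72), VB 265 (SD 78).
# ASSUMPTION: n = 13 per group; the sign convention (VB minus VR) is also assumed.
d = cohens_d_from_summary(248, 72, 13, 265, 78, 13)
lo, hi = approx_ci95(d, 13, 13)
print(f"d = {d:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

Under these assumptions the sketch gives d of about 0.23 with an interval of roughly [−0.54, 1.00], close to the 0.22 and [−0.55, 0.99] reported for T_avg. Small differences are expected because the inputs are rounded table values rather than raw data.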

Table 8. System performance for the two sub-samples (VR and VB training) for the R-II session.

	Training Strategy	Mean Value	SD	Min	Max
T3 [s]	VB	152	58	111	258
T3 [s]	VR	146	22	125	188
T_avg [s]	VB	153	22	134	192
T_avg [s]	VR	175	52	137	276
HE3 [#]	VB	2.5	3.1	0	8
HE3 [#]	VR	0.3	0.5	0	1
HE_avg [#]	VB	1.3	1.1	0.3	3.3
HE_avg [#]	VR	0.7	0.7	0	2

Table 9. Values of the specific HE subcategories for the R-II session.

HEs [#]	Training Strategy	Mean Value	SD	Max
Exec_avg	VB	1	0.9	2.7
Exec_avg	VR	0.7	0.7	2
Seq_avg	VB	0.2	0.3	0.7
Seq_avg	VR	0	0	0
Sel_avg	VB	0.1	0.2	0.3
Sel_avg	VR	0.1	0.1	0.3
Exec3	VB	1.8	2.8	7
Exec3	VR	0.3	0.5	1
Seq3	VB	0.3	0.8	2
Seq3	VR	0	0	0
Sel3	VB	0.3	0.5	1
Sel3	VR	0	0	0

Table 10. Statistical significance and effect sizes (Cohen’s d and 95% confidence intervals) for system performance in the comparison between VR and VB training during the R-II session.

	Statistical Significance	Cohen’s d	95% Confidence Interval (CI)
T_avg [s]	No (Wilcoxon rank-sum test p = 0.39)	−0.54	[−1.69, 0.63]
T3 [s]	No (Wilcoxon rank-sum test p = 0.39)	0.15	[−0.99, 1.28]
HE_avg [#]	No (t-test p = 0.32)	0.60	[−0.58, 1.74]
HE3 [#]	No (Wilcoxon rank-sum test p = 0.14)	0.98	[−0.25, 2.17]
Exec_avg	No (t-test p = 0.50)	0.40	[−0.75, 1.54]
Seq_avg	No (Wilcoxon rank-sum test p = 0.18)	0.85	[−0.36, 2.02]
Sel_avg	No (Wilcoxon rank-sum test p = 0.60)	0.36	[−0.79, 1.49]
Exec_3	No (Wilcoxon rank-sum test p = 0.42)	0.75	[−0.45, 1.91]
Seq_3	No (Wilcoxon rank-sum test p = 0.41)	0.58	[−0.60, 1.72]
Sel_3	No (Wilcoxon rank-sum test p = 0.18)	0.91	[−0.31, 2.09]

Table 11. System performance for the VR group across the R-I and R-II sessions.

VR Group
	Session	Mean Value	SD	Min	Max
T_avg [s]	I	257	95	174	434
T_avg [s]	II	175	52	137	276
T3 [s]	I	213	121	131	457
T3 [s]	II	146	22	125	188
HE_avg [#]	I	2.5	1.0	1.3	3.7
HE_avg [#]	II	0.7	0.7	0	2.0
HE3 [#]	I	3.2	4.7	0	11.0
HE3 [#]	II	0.3	0.5	0	1.0
Exec_avg [#]	I	2.8	2.5	0	5.3
Exec_avg [#]	II	0.7	0.7	0	2.0
Seq_avg [#]	I	0.2	0.2	0	0.3
Seq_avg [#]	II	0	0	0	0
Sel_avg [#]	I	0.2	0.3	0	0.7
Sel_avg [#]	II	0.1	0.1	0	0.3
Exec3 [#]	I	2.8	4.5	0	10.0
Exec3 [#]	II	0.3	0.5	0	1.0
Seq3 [#]	I	0.2	0.4	0	1.0
Seq3 [#]	II	0	0	0	0
Sel3 [#]	I	0.2	0.4	0	1.0
Sel3 [#]	II	0	0	0	0

Table 12. System performance for the VB group across the R-I and R-II sessions.

VB Group
	Session	Mean Value	SD	Min	Max
T_avg [s]	I	233	58	174	315
T_avg [s]	II	153	22	134	192
T3 [s]	I	169	42	135	227
T3 [s]	II	152	58	111	258
HE_avg [#]	I	7.4	3.9	4.0	14.3
HE_avg [#]	II	1.3	1.1	0.3	3.3
HE3 [#]	I	0.8	1.0	0	2.0
HE3 [#]	II	2.5	3.1	0	8.0
Exec_avg [#]	I	1.7	1.0	0.3	3.3
Exec_avg [#]	II	1.0	0.9	0	2.7
Seq_avg [#]	I	0.2	0.2	0	0.3
Seq_avg [#]	II	0.2	0.3	0	0.7
Sel_avg [#]	I	0.2	0.4	0	1.0
Sel_avg [#]	II	0.1	0.2	0	0.3
Exec3 [#]	I	0.5	0.8	0	2.0
Exec3 [#]	II	1.8	2.8	0	7.0
Seq3 [#]	I	0.2	0.4	0	1.0
Seq3 [#]	II	0.3	0.8	0	2.0
Sel3 [#]	I	0.2	0.4	0	1.0
Sel3 [#]	II	0.3	0.5	0	1.0
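
The session-to-session changes in Tables 11 and 12 can be read as relative changes in the group means. The sketch below is illustrative only: it uses the rounded means copied from the two tables, and the helper `percent_change` and the dictionary layout are hypothetical names introduced here.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (session I mean, session II mean), copied from Tables 11 and 12
session_means = {
    ("VR", "T_avg [s]"): (257, 175),
    ("VB", "T_avg [s]"): (233, 153),
    ("VR", "HE_avg [#]"): (2.5, 0.7),
}

changes = {key: percent_change(*pair) for key, pair in session_means.items()}
for (group, metric), change in changes.items():
    print(f"{group} {metric}: {change:+.1f}%")
```

With these rounded inputs, average execution time falls by about 32% for VR and 34% for VB, and the VR average error count falls by about 72%. The latter is close to the 71% reduction stated in the abstract; the gap presumably comes from rounding in the tabulated means.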

Table 13. Statistical significance and effect sizes (Cohen’s d and 95% confidence intervals) for system performance in the comparison between R-I and R-II for VR and VB sub-groups.

	VR_Training			VB_Training
	Statistical Significance	Cohen’s d	95% Confidence Interval (CI)	Statistical Significance	Cohen’s d	95% Confidence Interval (CI)
T_avg [s]	Yes (t-test p = 0.017)	−1.42	[−2.56, −0.22]	Yes (t-test p = 0.015)	−1.49	[−2.66, −0.26]
T3 [s]	Yes (Wilcoxon signed-rank test p = 0.031)	−0.68	[−1.56, 0.24]	No (t-test p = 0.57)	−0.25	[−1.05, 0.58]
HE_avg [#]	Yes (t-test p = 0.026)	−1.28	[−2.36, −0.14]	No (Wilcoxon signed-rank test p = 0.17)	−1.52	[−2.71, −0.28]
HE3 [#]	No (Wilcoxon signed-rank test p = 0.18)	−0.67	[−1.54, 0.25]	No (t-test p = 0.28)	0.49	[−0.38, 1.32]
Exec_avg	No (t-test p = 0.05)	−1.02	[−2.00, 0.02]	No (t-test p = 0.17)	−0.66	[−1.53, 0.26]
Seq_avg	No (Wilcoxon signed-rank test p = 0.15)	−0.91	[−1.85, 0.09]	No (t-test p = 1)	0	[−0.80, 0.80]
Sel_avg	No (Wilcoxon signed-rank test p = 0.37)	−0.60	[−1.45, 0.30]	No (t-test p = 0.61)	−0.22	[−1.02, 0.60]
Exec_3	No (Wilcoxon signed-rank test p = 0.37)	−0.63	[−1.49, 0.28]	No (Wilcoxon signed-rank test p = 0.34)	0.41	[−0.45, 1.23]
Seq_3	No (Wilcoxon signed-rank test p = 1)	−0.41	[−1.23, 0.45]	No (Wilcoxon signed-rank test p = 1)	0.41	[−0.45, 1.23]
Sel_3	No (Wilcoxon signed-rank test p = 1)	−0.41	[−1.23, 0.45]	No (t-test p = 0.61)	0.22	[−0.60, 1.02]
