4.1. Subjective Measure of the Cognitive Aids in VRDS
After completing the driving tasks, each participant filled out another rating questionnaire about their experience with the respective driving condition. The questionnaire items concerning the cognitive aids are listed in Table 2. Each item was answered on a five-point Likert scale ranging from strongly agree to strongly disagree.
For the first question, which related to how easily participants could interpret the simulated driving condition, 34.6% of the participants selected the “strongly agree” option and 37.8% selected “agree” for G1. In G2, 18.2% selected “strongly agree” and 23.1% selected “agree”. For G3, 15.8% of the participants strongly agreed, and 20.3% agreed. These results suggest that G1’s drive-specific cognitive aids helped participants better interpret driving conditions compared to G2 and G3.
For the second question, which addressed the ease of converting simulated traffic scenarios into real driving responses, 36.9% of the participants in G1 selected “strongly agree”, while 28.7% selected “agree”. In G2, the “strongly agree” and “agree” responses were 12.3% and 25.4%, respectively. For G3, 35.0% strongly agreed and 27.2% agreed. These results show that G1 provided effective support for translating simulation into practical actions, though G3 also showed strong perceived usefulness.
For the third question, regarding whether participants could perform driving tasks independently without instructor guidance, 35.9% in G1 strongly agreed, and 38.1% agreed. In G2, 13.1% selected “strongly agree”, and 29.8% selected “agree”. For G3, 38.4% chose “strongly agree”, while 26.8% selected “agree”. This indicates that both G1 and G3 enabled users to function autonomously, with G1 slightly ahead in guided independence.
For the fourth question, which asked whether the condition was ideal for future driving simulations, 34.6% of the participants in G1 strongly agreed and 41.5% agreed. In G2, 21.3% strongly agreed, and 44.8% agreed. For G3, only 7.8% chose “strongly agree” while 20.4% agreed. This indicates a stronger preference for G1 and G2 over G3 in terms of suitability for future driving simulations. Table 3 and Figure 11 summarize the participants’ feedback on the cognitive aids across all groups and conditions.
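The four questions can be compared more directly by combining the “strongly agree” and “agree” shares into a single agreement rate per group. The following is a minimal sketch of that arithmetic using only the percentages reported above; the variable and function names are illustrative and not part of the study.

```python
# Reported percentages of ("strongly agree", "agree") responses per question and group.
reported = {
    "Q1": {"G1": (34.6, 37.8), "G2": (18.2, 23.1), "G3": (15.8, 20.3)},
    "Q2": {"G1": (36.9, 28.7), "G2": (12.3, 25.4), "G3": (35.0, 27.2)},
    "Q3": {"G1": (35.9, 38.1), "G2": (13.1, 29.8), "G3": (38.4, 26.8)},
    "Q4": {"G1": (34.6, 41.5), "G2": (21.3, 44.8), "G3": (7.8, 20.4)},
}


def combined_agreement(table):
    """Sum the two agreement categories for every question and group."""
    return {
        question: {
            group: round(strongly + agree, 1)
            for group, (strongly, agree) in groups.items()
        }
        for question, groups in table.items()
    }


agreement = combined_agreement(reported)
for question, groups in agreement.items():
    print(question, groups)
```

On this combined measure, G1 has the highest agreement rate for all four questions, while G3 comes close to G1 on the second and third questions and falls well behind on the fourth.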
4.2. Reliability and Validity Analysis of the Questionnaires
In this part of the analysis, we examined the validity and reliability of the questionnaires (see Table 2) using the participants’ feedback. To assess these properties, we calculated the Average Variance Extracted (AVE), Composite Reliability (CR), and Cronbach’s alpha. Following Fornell and Larcker [23], AVE evaluates convergent validity, with a recommended AVE value greater than 0.50. Internal consistency reliability was assessed through Cronbach’s alpha and CR, both of which should exceed 0.70 [24]. Cronbach’s alpha is particularly suitable for evaluating Likert-scale instruments [25,26,27]. The two reliability measures are expected to agree: when Cronbach’s alpha is above 0.70, the CR value should also be above 0.70, and when alpha falls below this threshold, CR is expected to fall below it as well.
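The three indices named above follow standard formulas, which can be sketched as below. This is an illustrative implementation under stated assumptions, not the authors' analysis script: Cronbach's alpha is computed from a participants-by-items response matrix using sample variances, and AVE and CR are computed from standardized factor loadings. The response matrix and the loadings are hypothetical values chosen for demonstration.

```python
from statistics import variance


def cronbach_alpha(responses):
    """responses: one row per participant, one column per questionnaire item."""
    k = len(responses[0])
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def average_variance_extracted(loadings):
    """Mean of the squared standardized loadings of one construct."""
    return sum(l * l for l in loadings) / len(loadings)


def composite_reliability(loadings):
    """(sum of loadings)^2 divided by itself plus the summed error variances."""
    squared_sum = sum(loadings) ** 2
    error_variance = sum(1 - l * l for l in loadings)
    return squared_sum / (squared_sum + error_variance)


# Hypothetical five-point Likert responses (5 participants x 4 items).
example_responses = [
    [5, 4, 5, 4],
    [4, 4, 4, 3],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
]
# Hypothetical standardized loadings for one construct.
example_loadings = [0.8, 0.7, 0.9]

print(round(cronbach_alpha(example_responses), 3))
print(round(average_variance_extracted(example_loadings), 3))
print(round(composite_reliability(example_loadings), 3))
```

With these definitions, a construct meets the thresholds quoted above when its AVE exceeds 0.50 and both alpha and CR exceed 0.70.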
Table 1 presents the AVE, CR, and Cronbach’s alpha values for G1 and G2, based on their feedback. As shown in Table 4, the square roots of AVE for both questionnaires in G1 and G2 exceed the estimated correlation values. Likewise, the Cronbach’s alpha and CR values for all groups (G1, G2, and G3) are above 0.70, indicating good internal consistency and strong discriminant validity across the questionnaires.
In addition to the AVE, CR, and Cronbach’s alpha, a correlation matrix was computed to assess the discriminant validity. The square root of the AVE values exceeded the inter-construct correlations, confirming that each construct was distinct and measured unique aspects of the questionnaire. Table 5 presents the inter-construct correlations along with the square roots of the AVE.
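The discriminant validity check described above compares the square root of each construct's AVE with its correlations to the other constructs. A minimal sketch of that comparison follows; the construct names, AVE values, and correlations are hypothetical placeholders and are not taken from Table 5.

```python
import math


def fornell_larcker_ok(ave, correlations):
    """Return True if sqrt(AVE) of both constructs exceeds every pairwise |r|.

    ave: {construct: AVE value}
    correlations: {(construct_a, construct_b): correlation coefficient}
    """
    return all(
        math.sqrt(ave[a]) > abs(r) and math.sqrt(ave[b]) > abs(r)
        for (a, b), r in correlations.items()
    )


# Hypothetical values for illustration only.
example_ave = {"usability": 0.64, "understanding": 0.58, "assistance": 0.71}
example_correlations = {
    ("usability", "understanding"): 0.52,
    ("usability", "assistance"): 0.47,
    ("understanding", "assistance"): 0.39,
}

print(fornell_larcker_ok(example_ave, example_correlations))
```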
4.3. Performance Metrics
This part of the analysis examined the trainees’ driving performance in the three cognitive-aid VRDSs. The recorded data comprise the number of errors committed and the task completion times during the driving simulations. To assess overall driving performance, a composite score combining driving errors and completion times was computed. Both variables were standardized, and the resulting values were summed to produce a single performance index (errors + time composite). This method ensured that accuracy and efficiency contributed equally to the analysis.
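The composite index described above can be sketched as a sum of z-scores. The text does not state how the standardization was carried out, so this sketch assumes that each variable is standardized across all participants pooled together using the sample standard deviation; the participant values are hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values to zero mean and unit sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def composite_index(errors, times):
    """Sum of standardized errors and standardized completion times.

    Lower values indicate better performance (fewer errors, less time).
    """
    return [ze + zt for ze, zt in zip(z_scores(errors), z_scores(times))]


# Hypothetical per-participant data: error counts and completion times (min).
example_errors = [8, 11, 18, 9, 12]
example_times = [3.2, 4.5, 6.5, 3.4, 4.4]

example_index = composite_index(example_errors, example_times)
for participant, score in enumerate(example_index, start=1):
    print(participant, round(score, 2))
```

Because both z-scores have unit variance, neither errors nor time dominates the sum, which is what gives the two variables equal weight.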
Number of Errors during Driving Task Execution: The number of errors committed by each trainee during the driving tasks was recorded. These errors include driving in the wrong direction (entering a lane in the opposite direction of travel at any point was counted as one error per occurrence); colliding with obstacles (contact with road boundaries, obstacles, pedestrians, or sign boards was counted as one error per collision event); and exceeding the recommended speed limit (30, 60, or 80 km/h) by more than 5 km/h for longer than 3 s (counted as one error), among others. All errors were weighted equally, with each instance counted as one error, and the total error score per participant was computed as the sum of all error instances.

The mean errors of the three groups were compared using analysis of variance (ANOVA). The difference in errors among the three groups is statistically significant, F(2, 43) = 16.23, p = 0.001 (p < 0.05), η² = 0.44, 95% CI [0.05, 0.32]. Comparing the errors of G1 (mean, 8.1 errors; SD, 1.23) with those of G2 and G3 yielded a significant difference (p = 0.002, p < 0.05). This indicates that trainees in G1 (image–arrow VRDS) committed significantly fewer errors than those in G2 and G3. Likewise, comparing the errors of G2 (mean, 10.8 errors; SD, 1.31) with those of G3 yielded a significant difference (p = 0.004, p < 0.05), with G2 committing fewer errors on average than G3. In addition, the Tukey–Kramer post hoc analysis indicates that the number of errors differs significantly between G1 and G2 (p = 0.002), G1 and G3 (p = 0.001), and G2 and G3 (p = 0.004), all at p < 0.05.
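The speeding rule described above (exceeding the limit by more than 5 km/h for longer than 3 s) can be sketched as an episode counter over a sampled speed trace. The logging format is not described in the text, so this sketch assumes time-stamped speed samples and assumes that each continuous over-limit episode is counted at most once; the function name and the trace are hypothetical.

```python
def count_speeding_errors(samples, limit_kmh, margin_kmh=5.0, min_duration_s=3.0):
    """Count speeding episodes in a list of (time_s, speed_kmh) samples.

    An episode is a continuous run of samples above limit + margin. It is
    counted once, as soon as it has lasted longer than min_duration_s.
    """
    errors = 0
    episode_start = None  # time at which the current episode began
    counted = False       # whether the current episode was already counted
    for time_s, speed_kmh in samples:
        if speed_kmh > limit_kmh + margin_kmh:
            if episode_start is None:
                episode_start = time_s
                counted = False
            elif not counted and time_s - episode_start > min_duration_s:
                errors += 1
                counted = True
        else:
            episode_start = None
    return errors


# Hypothetical 1 Hz trace in a 60 km/h zone: one long and one short episode.
example_trace = [
    (0, 60), (1, 66), (2, 66), (3, 66), (4, 66), (5, 66),
    (6, 60), (7, 70), (8, 70), (9, 60),
]
print(count_speeding_errors(example_trace, limit_kmh=60))
```

Only the first episode, which lasts longer than 3 s, contributes an error; the second episode ends too early to count.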
The means and standard deviations of G1, G2, and G3 based on the errors are provided in Table 6 and Figure 12.
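The F statistic and η² reported for the group comparison can be obtained from the between-group and within-group sums of squares. The sketch below assumes a one-way between-subjects design and uses hypothetical error counts rather than the study data; it does not compute the p value, which requires the F distribution (available, for example, through `scipy.stats.f_oneway`).

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within, eta_squared) for a list of groups."""
    all_values = [v for group in groups for v in group]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, df_between, df_within, eta_squared


# Hypothetical error counts for three small groups.
example_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w, eta_sq = one_way_anova(example_groups)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}, eta squared = {eta_sq:.4f}")
```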
Task Completion Time: The mean task completion time and standard deviation of each group were also examined to analyze the trainees’ performance in the VRDSs. The mean times of the three groups were compared using analysis of variance (ANOVA). The difference in task completion time among the three groups is statistically significant, F(2, 43) = 65.34, p = 0.004 (p < 0.05), η² = 0.76, 95% CI [0.28, 0.59]. Comparing the task completion time of G1 (mean, 3.26 min; SD, 0.56) with that of G2 and G3 yielded a significant difference (p = 0.001, p < 0.05). This indicates that trainees in G1 (image–arrow VRDS) completed the tasks significantly faster than those in G2 and G3. Similarly, comparing the task completion time of G2 (mean, 4.49 min; SD, 0.67) with that of G3 yielded a significant difference (p = 0.003, p < 0.05), with G2 completing the tasks faster on average than G3. In addition, the Tukey–Kramer post hoc analysis indicates that the task completion times differ significantly between G1 and G2 (p = 0.001), G1 and G3 (p = 0.001), and G2 and G3 (p = 0.003), all at p < 0.05.
The means and standard deviations of successful tasks performed during driving for each group are provided in Table 6 and graphed in Figure 13.
Composite Index (Errors + Time): A composite score was calculated by summing the standardized values of mean errors and completion times. ANOVA indicated significant group differences, F(2, 43) = 41.26, p < 0.001, η² = 0.66, 95% CI [0.23, 0.49]. G1 (errors: 8.1 ± 1.23; time: 3.26 ± 0.56) achieved the lowest composite score (best performance), followed by G2 (errors: 10.8 ± 1.31; time: 4.49 ± 0.67), and G3 (errors: 18.4 ± 1.43; time: 6.51 ± 0.68).
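For a one-way between-subjects ANOVA, η² can be recovered from the F statistic and its degrees of freedom as η² = F·df₁ / (F·df₁ + df₂). The sketch below applies this identity to the three F values reported in this section, assuming a one-way design with df = (2, 43) throughout; the function name is illustrative.

```python
def eta_squared_from_f(f_stat, df_between, df_within):
    """Eta squared implied by an F statistic in a one-way ANOVA."""
    return f_stat * df_between / (f_stat * df_between + df_within)


# F statistics reported for errors, completion time, and the composite index.
reported_f = {"errors": 16.23, "time": 65.34, "composite": 41.26}

recovered = {
    name: round(eta_squared_from_f(f_value, 2, 43), 2)
    for name, f_value in reported_f.items()
}
print(recovered)
```

The recovered values (0.43, 0.75, and 0.66) are close to the reported effect sizes of 0.44, 0.76, and 0.66.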
From the above results, we conclude that G1 (image–arrow aids) is the most effective environment for improving task performance in a VR driving simulator. Participants in G1 were less prone to errors and completed more successful tasks in less time than those in G2 and G3. In addition, the subjective feedback of G1 showed higher satisfaction than that of G2 and G3. Overall, G1 performed significantly better than G2 and G3 on the performance measures (errors + time) and was rated more favorably on subjective factors such as usability, ease of use, understanding, and assistance.