This section presents the empirical results of the study. First, descriptive statistics of the physiological stress indicators are reported for both the real and simulated drives across all route segments. Subsequently, the time series analysis is introduced to examine the similarity between simulator and real-world conditions in greater temporal detail. Finally, the main findings are summarized and discussed with respect to physiological stress patterns observed in both environments.
4.1. Descriptive Statistics
Tables 1–7 report the minimum, maximum, mean, and standard deviation for each variable (SCR, SCL, PA, HR, RR-interval, RMSSD, and SDNN), both for the real drive (Real Ride) and the simulated drive (Sim. Ride), and separately for each of the seven driving route segments. Driving route segments 1, 3, and 5 are classified as rural traffic, segments 2, 6, and 7 as urban traffic, and segment 4 as highway traffic. Paired t-tests (N = 68; SPSS v27; two-tailed; α = 0.05; pairwise deletion) were conducted to examine whether the two driving environments differ significantly, and the corresponding p-values are also included in Tables 1–7.
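The segment-wise comparison can be sketched as follows. This is a minimal illustration in Python with SciPy, not the SPSS procedure used in the study; the function name `paired_t_by_segment`, the array layout (participants × segments), and the synthetic heart-rate values are assumptions made purely for the example.

```python
import numpy as np
from scipy import stats


def paired_t_by_segment(real, sim, alpha=0.05):
    """Two-tailed paired t-test per route segment.

    real, sim: arrays of shape (n_participants, n_segments); NaN marks missing data.
    """
    real = np.asarray(real, dtype=float)
    sim = np.asarray(sim, dtype=float)
    results = []
    for seg in range(real.shape[1]):
        # Pairwise deletion: keep only participants with both values for this segment
        mask = ~np.isnan(real[:, seg]) & ~np.isnan(sim[:, seg])
        t, p = stats.ttest_rel(real[mask, seg], sim[mask, seg])
        results.append({
            "segment": seg + 1,
            "n": int(mask.sum()),
            "t": float(t),
            "p": float(p),
            "significant": bool(p < alpha),
        })
    return results


# Synthetic example data (hypothetical, not the study data)
rng = np.random.default_rng(0)
real_hr = rng.normal(83, 12, size=(68, 7))
sim_hr = real_hr - 5 + rng.normal(0, 4, size=(68, 7))
real_hr[3, 2] = np.nan  # one missing value in segment 3

results = paired_t_by_segment(real_hr, sim_hr)
for row in results:
    print(row)
```

With pairwise deletion, a participant with a missing value is dropped only from the affected segment, so the per-segment n can differ slightly from N = 68.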
Data were screened for outliers; extreme values (>3 × IQR) were winsorized to the nearest non-extreme bound to limit their influence while retaining observations [78]. The normality assumption was evaluated on the difference scores (Real − Sim) for each segment and measure using Q–Q plots, histograms, and Shapiro–Wilk/Kolmogorov–Smirnov tests. While the normality tests frequently rejected perfect normality (consistent with the behavior of physiological measures that often exhibit mild asymmetry [79]), visual inspection indicated only slight to moderate skewness without substantial anomalies.
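A minimal sketch of this screening step is given below. It assumes that "extreme" means lying more than 3 × IQR beyond the first or third quartile and that such values are clipped to those fences; the exact rule used in the study may differ. The helper names `winsorize_iqr` and `normality_of_differences` are hypothetical.

```python
import numpy as np
from scipy import stats


def winsorize_iqr(x, k=3.0):
    """Clip values lying more than k * IQR beyond the quartiles to the fence values.

    Assumption: fences are Q1 - k*IQR and Q3 + k*IQR.
    """
    x = np.asarray(x, dtype=float)
    q1, q3 = np.nanpercentile(x, [25, 75])
    iqr = q3 - q1
    return np.clip(x, q1 - k * iqr, q3 + k * iqr)


def normality_of_differences(real, sim):
    """Shapiro-Wilk test and sample skewness on the difference scores (Real - Sim)."""
    diff = np.asarray(real, dtype=float) - np.asarray(sim, dtype=float)
    diff = diff[~np.isnan(diff)]
    w, p = stats.shapiro(diff)
    return {"W": float(w), "p": float(p), "skew": float(stats.skew(diff))}


# Hypothetical values with one extreme observation
values = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 100.0])
cleaned = winsorize_iqr(values)
print(cleaned)

rng = np.random.default_rng(1)
real = rng.normal(83, 12, size=68)
sim = real - 5 + rng.normal(0, 4, size=68)
check = normality_of_differences(real, sim)
print(check)
```

The observation is retained but pulled in to the fence, which limits its influence on the mean and standard deviation without reducing the sample size.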
Given the sample size (N = 68), the paired-sample t-test is robust to moderate deviations from normality of the difference scores [80], and by the Central Limit Theorem, the sampling distribution of mean differences is approximately normal for N ≥ 30, particularly when underlying departures are mild [81]. No correction for multiple comparisons was applied because the seven route segments represent a priori distinct driving contexts rather than repeated tests of the same effect, and the seven physiological measures capture different constructs. Note that, in a within-subject design, the paired analysis reduces error variance by accounting for inter-individual variability. Consequently, the standard deviation of the difference scores is typically smaller than the raw standard deviations of each condition, which can yield relatively large t-values even when mean differences appear numerically small.
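The last point follows from the standard variance identity for paired differences. In the sketch below, the symbols $s_{\mathrm{Real}}$, $s_{\mathrm{Sim}}$, $r$ (the between-condition correlation across participants), and $\bar{D}$ (the mean difference) are notation introduced here for illustration, and the numbers are hypothetical round values, not study results.

$$
s_D^2 = s_{\mathrm{Real}}^2 + s_{\mathrm{Sim}}^2 - 2\,r\,s_{\mathrm{Real}}\,s_{\mathrm{Sim}},
\qquad
t = \frac{\bar{D}}{s_D / \sqrt{N}}
$$

For example, with $s_{\mathrm{Real}} = s_{\mathrm{Sim}} = 12$ and $r = 0.9$, we obtain $s_D^2 = 2 \cdot 144 \cdot (1 - 0.9) = 28.8$, so $s_D \approx 5.37$. A mean difference of $\bar{D} = 5$ with $N = 68$ then gives $t \approx 5 / (5.37 / \sqrt{68}) \approx 7.7$, although the difference is small relative to the raw standard deviations.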
4.2. Time Series Analysis
By analyzing time-resolved physiological trajectories, we can assess whether the digital twin tracks the real-time evolution of stress responses in a way that is compatible with its intended use: continuous monitoring, adaptive interventions, and predictive modelling. To achieve this, the dataset was restructured into a long-format time series representation to enable trajectory-based analysis across the full driving duration. Through this process, columns are transformed into rows, converting a dataset from a wide format to a long format [82]. In our case, each segment of the route (column) becomes a new observation (row). This allows us to calculate averages for the entire duration of the drive rather than focusing solely on individual segments. To compare the physiological data between the entire real driving route and its DT, across participants, we first calculated the mean values of the physiological parameters and then performed a paired t-test.
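The wide-to-long restructuring can be illustrated with pandas. This is a sketch under assumed column names (`participant`, `condition`, `seg1` to `seg3`, `hr`) and made-up values; the study's actual data layout is not specified here.

```python
import pandas as pd

# Hypothetical wide-format data: one column per route segment
wide = pd.DataFrame({
    "participant": [1, 1, 2, 2],
    "condition": ["real", "sim", "real", "sim"],
    "seg1": [80.0, 76.0, 90.0, 84.0],
    "seg2": [82.0, 78.0, 92.0, 86.0],
    "seg3": [84.0, 80.0, 94.0, 88.0],
})

# Unpivot: each segment column becomes its own row
long_df = wide.melt(
    id_vars=["participant", "condition"],
    var_name="segment",
    value_name="hr",
)

# Mean over the whole drive per participant and condition
per_drive = long_df.groupby(["participant", "condition"], as_index=False)["hr"].mean()

# One row per participant with paired real/sim columns, ready for a paired t-test
paired = per_drive.pivot(index="participant", columns="condition", values="hr")
print(paired)
```

The resulting `paired` table has one real and one simulated whole-drive mean per participant, which is the input shape a paired t-test expects.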
The analysis of skin conductance showed that for the real drive, the Skin Conductance Response (SCR) (Table 1) had a mean value of 9.49 (range: 0.00–17.80; SD = 2.57), which was almost identical to the mean value of 9.53 (range: 1.61–16.72; SD = 2.49) for the simulated drive (p = 0.829). However, for the tonic Skin Conductance Level (SCL) (Table 2), a highly significant difference was observed: in the real drive, the mean was 12.87 (range: 2.32–57.18; SD = 6.31), while in the simulated drive, the mean was significantly higher at 14.47 (range: 2.32–102.55; SD = 8.62; p < 0.001). The Peak Amplitude (PA) (Table 3) also showed significant differences, with a mean of 0.29 (range: 0.00–0.84; SD = 0.18) in the real drive compared to 0.39 (range: 0.03–1.05; SD = 0.24) in the simulation (p < 0.001).
Significant differences were also found in heart rate (HR) (Table 4) between the real and simulated drives. The mean HR during the real drive was 82.79 bpm (range: 51.52–114.24; SD = 12.89), which was higher than in the simulated drive (mean = 77.02 bpm; range: 48.31–106.50; SD = 11.14; p < 0.001). The average RR interval (Table 5) was 758.50 ms (range: 530.12–1208.01; SD = 129.13) in the real drive, significantly shorter than 814.15 ms (range: 534.41–1231.49; SD = 133.13) in the simulated drive (p < 0.001). Regarding heart rate variability, represented by RMSSD (Table 6) and SDNN (Table 7), no significant differences were observed between the two conditions (RMSSD: p = 0.165; SDNN: p = 0.524).
Our analysis uses the continuous time series of physiological recordings. After normalizing timestamps to a common [0,1] scale, we compute Pearson correlations of the full physiological trajectories and visually inspect their temporal evolution, which enables a direct comparison of temporal signal dynamics between real and simulated driving across the full interaction period.
Although averaging gives an insight into the similarity of driving behavior in the simulator and on the real road, this is only a rough indication. Peaks and extreme values, which indicate intense or stress-related responses, are smoothed out by averaging, resulting in a loss of valuable information [83]. Mean values obscure important characteristics of time series data by failing to capture variability and response stability, which are critical for interpretation [84]. Context-specific effects, such as those arising during cornering or emergency braking, are also lost through averaging, masking the true nature of responses [85]. Moreover, averaging hinders analysis of adaptation and learning over time and prevents examination of dynamic interactions between variables [53]. For these reasons, we go one step further and investigate the time series similarity in more detail. To achieve this, we use the Pearson Correlation Coefficient on normalized time series. This linear correlation between two variables X and Y ranges from −1 to 1, where 1 is a perfect positive linear correlation, −1 is a perfect negative linear correlation, and 0 is no linear correlation. We also normalize the time series to a common scale, enabling a meaningful comparison independent of absolute values or differing measurement durations.
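For reference, the standard sample definition of the coefficient is given below for two series observed at the same $n$ normalized time points. The symbols $x_i$, $y_i$, and $n$ are notation introduced here for illustration.

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

Because $r$ is unchanged when either series is shifted by a constant or multiplied by a positive factor, it reflects similarity in the shape of the two trajectories rather than in their absolute levels.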
Conducting this time-series analysis required addressing several technical challenges that go beyond the procedures described in [8] and constitutes an analytical innovation for physiological DT validation:
- (1) Synchronizing real and simulated data streams with differing durations;
- (2) Normalizing the time axis to a dimensionless scale to allow comparison of drives with different lengths;
- (3) Restructuring the dataset via unpivoting so that each route segment becomes a separate observation in order to increase the effective sample size per segment and enable calculation of mean values over the entire drive instead of only segment-level averages;
- (4) Selecting similarity metrics that are both robust to noise and interpretable for practitioners (i.e., combining Pearson Correlation Coefficients of normalized trajectories with visual inspection of the curves, which jointly reveal pattern similarity and deviations that are completely invisible in segment-wise means).
To ensure a meaningful comparison between the time series from the simulator and real-world driving, we normalized the time axis for both datasets to a common scale of [0,1]. This normalization adjusts the timestamps of each dataset so that their relative progression is comparable, irrespective of their differing durations or absolute time values. The normalization is performed using the following formula, where Timestamp is the original time value for each data point, and Timestamp_min and Timestamp_max are the minimum and maximum time values within the dataset:

$$
\mathrm{NormalizedTime} = \frac{\mathrm{Timestamp} - \mathrm{Timestamp}_{\min}}{\mathrm{Timestamp}_{\max} - \mathrm{Timestamp}_{\min}}
$$

This formula ensures that the earliest timestamp maps to 0 and the latest timestamp maps to 1, creating a dimensionless time scale that is independent of the original measurement duration. By normalizing both the simulator and real-world datasets, we aligned their time axes, enabling direct visual comparison of the signal dynamics.
In simple terms, this normalization ensures that all drives “start at 0 and end at 1”, no matter how long they actually lasted in minutes or seconds. Without this step, two time series with slightly different durations could not be compared fairly, because one would have more data points or a longer timeline than the other. On the normalized scale, the same relative moments of the drive can be compared, for example the beginning (0.0), the middle (0.5), or the final part of the drive (1.0). We then use the normalized time (NormalizedTime) as the common time axis to visually overlay both time series and to calculate correlation values. This step is critical for comparing datasets with different lengths or temporal resolutions, as it eliminates the impact of absolute time differences. The normalized time series were then plotted to visually assess their similarity.
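The normalization and the subsequent correlation can be sketched in Python as follows. The authors' own implementation is available in the cited repository and may differ. In particular, resampling both signals onto a shared grid by linear interpolation is an assumption made here so that the two series have equal length, and the function names, grid size, and synthetic signals are hypothetical.

```python
import numpy as np


def normalize_time(timestamps):
    """Map timestamps to [0, 1]: (t - t_min) / (t_max - t_min)."""
    t = np.asarray(timestamps, dtype=float)
    span = t.max() - t.min()
    if span == 0:
        raise ValueError("Timestamps must span a non-zero duration.")
    return (t - t.min()) / span


def trajectory_correlation(t_real, y_real, t_sim, y_sim, n_points=200):
    """Pearson r between two signals compared at the same relative drive positions.

    Assumptions: timestamps are sorted in increasing order, and both signals are
    linearly interpolated onto a shared normalized grid of n_points samples.
    """
    grid = np.linspace(0.0, 1.0, n_points)
    real_on_grid = np.interp(grid, normalize_time(t_real), y_real)
    sim_on_grid = np.interp(grid, normalize_time(t_sim), y_sim)
    return float(np.corrcoef(real_on_grid, sim_on_grid)[0, 1])


# Synthetic example: same shape, different duration, sampling rate, and level
t_real = np.linspace(0.0, 600.0, 601)      # 600 s drive
t_sim = np.linspace(100.0, 580.0, 241)     # 480 s drive with an offset start
y_real = np.sin(2 * np.pi * normalize_time(t_real))
y_sim = 2.0 * np.sin(2 * np.pi * normalize_time(t_sim)) + 5.0

r = trajectory_correlation(t_real, y_real, t_sim, y_sim)
print(round(r, 3))
print(normalize_time([10, 20, 30]))
```

In this constructed example the two signals differ in duration, sampling density, offset, and amplitude but share the same shape over the normalized drive, so the correlation is close to 1.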
Figure 5 illustrates a biosignal time series for a participant driving the urban and rural segments, comparing data collected in the simulator and real-world conditions. The Pearson Correlation Coefficients for the urban drive (r = 0.31) and the rural drive (r = 0.34) suggest a moderate level of linear similarity between the simulator and real-world measurements. The normalization revealed strong parallels in the overall patterns of the tonic signal across both environments. This visual alignment, combined with the Pearson Correlation Coefficients, provides robust evidence for the comparability of the simulator and real-world measurements, supporting the validity of using simulators in such studies.
Figure 5. Comparison of Two Time Series (for a participant driving the urban and rural segments, real and simulated); see code on GitHub [86].
4.3. Discussion of Results
Analysis of the biosignal data reveals several notable findings regarding the physiological stress responses of the participants. In terms of cardiovascular responses, the real ride generally produces higher heart rate values, indicating greater physical exertion and more intense emotional experiences. In contrast, the simulated ride produces a less pronounced heart rate response, suggesting a less physically demanding experience. The longer RR intervals and higher RMSSD values during the simulated ride indicate a more relaxed heart rate variability, reflecting a less stressful physiological response. The real ride, with shorter RR intervals and higher SDNN values, results in greater cardiovascular stress and more intense responses. In addition, in terms of skin responses, the simulated ride tended to elicit greater sympathetic activation, as reflected by increased skin conductance response (SCR) and greater variability, particularly in the later stages. This suggests that the simulated experience may be perceived as more emotionally arousing or stressful, despite potentially being less physically challenging than the real ride. This trend is further supported by higher peak values and greater variability in skin conductance during the simulated rides.
The Skin Conductance Response (SCR) generally indicates higher sympathetic activation during the simulated ride in most segments, as reflected by somewhat higher mean values and greater fluctuations. The maximum values are also higher during the simulated ride, suggesting stronger autonomic responses.
Overall, the simulated ride leads to a stronger and more variable activation of the sympathetic nervous system compared to the real ride, which may indicate higher emotional stress or arousal during the simulation. In all seven segments, there is a general trend that the simulated rides exhibit higher Skin Conductance Level (SCL) than the real rides. This could suggest that the simulated experiences were either perceived as more intense or that the reactions were more strongly stimulated, even though they may offer fewer physical challenges compared to the real ride. The standard deviations are also generally higher in the simulated ride, which points to greater variability in skin conductance values, suggesting a more diverse range of reactions among participants.
The simulated rides exhibit higher peak amplitudes (PAs) in all segments compared to the real rides, suggesting that the simulated experiences generally elicit stronger peak responses. Particularly in later segments, the simulated rides show intense reactions and greater variability, possibly reflecting a broader emotional range.
The heart rate (HR) during the real ride is higher in all segments compared to the simulated ride, suggesting greater physical load. The higher maximum heart rate values during the real rides may reflect stronger peak reactions due to physical exertion or intense emotional experiences. In contrast, the simulated ride shows lower mean and maximum values, suggesting that it was less physically demanding or less exciting.
The RR intervals (RR-Int) are longer in all segments during the simulated versus the real ride, suggesting less physiological strain and more relaxed heart rate variability.
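The mean RMSSD is higher during the simulated ride, indicating stronger parasympathetic activity and greater relaxation. Similarly, SDNN values tend to be higher, reflecting greater heart rate variability during simulation. Over the full drive, however, neither difference reached statistical significance (RMSSD: p = 0.165; SDNN: p = 0.524), so these trends should be interpreted with caution.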
Overall, the analysis shows that the real ride generally produces higher heart rate values, indicating greater physical exertion, while the simulated ride elicits greater sympathetic activation in skin responses, pointing to higher emotional arousal.
Together, these findings suggest that simulated driving induces greater variability in physiological responses—especially sympathetic activation and heart rate variability—whereas real driving produces stronger, more consistent physical stress responses. This highlights the different physiological characteristics of real versus simulated experiences.
In summary, physiological data show complementary stress patterns in real and simulated driving: real driving induces stronger cardiovascular activation reflecting higher physical load, whereas simulation induces greater sympathetic responses reflecting higher emotional arousal. This supports the validity of driving simulators for studying stress while highlighting inherent physiological differences from real-world driving.