1. Introduction
Over the past decade, the technologies powering self-driving cars have advanced rapidly toward commercial deployment. The current standard for levels of driving automation is defined by the Society of Automotive Engineers (SAE) International’s J3016 Levels of Driving Automation [
1]. The first three levels, 0, 1, and 2, vary from no driving automation to partial driving automation where the human driver monitors the road. Beginning at Level 3, the automated driving system monitors the road, and as the levels progress to 4 and 5, the abilities of the driving system increase, such that at Level 5, the vehicle is fully capable of automation in all scenarios. Mercedes-Benz’s “Drive Pilot” system represents a Level 3 automation system that fully handles the driving task within well-defined operational domains at slow to moderate speeds [
2]. Level 4 systems have progressed to full commercial deployment, with Waymo operating a driverless robotaxi service across multiple US cities that has demonstrated substantially lower crash rates than human drivers [
3]. Level 5 systems, however, remain unrealized due to the number of edge cases that such a system would need to handle. The questions for the immediate future of self-driving car technology then revolve around the adaptability of the technology rather than the technology itself. While self-driving car technologies continue to mature, the large-scale adoption of self-driving cars remains uncertain. Current research accordingly points to limited acceptance of the technology among the general public.
In 2021, Morning Consult surveyed 2200 adults in the United States and found that
of respondents believed autonomous vehicles were less safe than their human-driven counterparts and that only
believed autonomous vehicles were safer than a human driver [
4]. This skepticism has persisted: the AAA’s 2025 survey found that 6 in 10 US drivers remain afraid of riding in a self-driving car, with only
expressing trust in the technology [
5]. With a sustained majority of the public expressing skepticism toward the technology even as these systems enter commercial deployment, research on improving consumer trust in self-driving cars is needed [
6,
7]. Research conducted by Lee et al. suggests that there are several factors affecting the adoption of self-driving cars. These factors include the perceived usefulness, self-efficacy, perceived risk, and psychological ownership of the vehicle [
8]. Similarly, Choi and Ji found that trust and perceived usefulness are the strongest determinants of intention to adopt autonomous vehicles [
9]. Underlying most of these factors is the inherent trust a user has in the self-driving technology. Trust is a key factor in a person’s willingness to accept the technology and is largely shaped by prior experiences [
7,
10]. Lee and See [
6] provide a foundational framework for understanding trust in automation, defining it as the attitude that an agent will help achieve an individual’s goals in a situation characterized by uncertainty and vulnerability. Parasuraman and Riley [
11] further distinguish between appropriate use, misuse (over-reliance), and disuse (under-reliance) of automation, each driven by the calibration of trust relative to actual system capability.
Survey studies have revealed a social dilemma in autonomous vehicle ethics: while respondents generally approve of utilitarian autonomous vehicles (AVs) that sacrifice their passenger to save a greater number of lives, they themselves would prefer to ride in a self-protective vehicle [
12]. This tension between collective endorsement and individual self-interest complicates trust formation, as passengers must accept a system whose ethical programming may not prioritize their personal safety. Shariff et al. [
13] propose that the discussion of risk needs to be posed in terms of “absolute risk” rather than relative risk: by driving a self-driving car, one diminishes one’s total risk of injury, and therefore one should not focus on edge cases where one’s safety may not be prioritized. When considering risk from an absolute perspective, users may be more likely to adopt a self-driving car, as their chances of survival on any given drive are overall maximized by doing so.
One method of cultivating trust for self-driving cars is to improve human–machine interaction. By designing self-driving cars that communicate with passengers and give them a more active role in the experience, developers can deliver a more trustworthy system, e.g., through adaptive mood control [
14] and adaptive driving mode [
15]. This is supported by research conducted by Hartwich et al., whose evidence suggests that even given an SAE Level 4–5 system where no human interaction is required, the introduction of monitoring tools significantly improves passenger trust [
16]. Further research conducted by Hartwich et al. shows that the first experience with self-driving cars greatly impacts the trust one associates with the technology [
17].
In addition to the first experience significantly affecting a user’s perception of the technology, research from Shahrdar et al. shows how trust is greatly affected by the driving style used and that defensive driving builds more trust than aggressive driving in virtual reality simulated tests [
18]. These tests also showed that while initial experiences were important, trust in the system can be rebuilt following faulty behavior given enough time experiencing safer and more defensive driving from the self-driving car. Cegarra et al. found that initial trust affects how conventional vehicle drivers behave in traffic with AVs, with higher trust leading to more risk-taking behavior around autonomous vehicles [
19]. Furthermore, the amount of control a user has appears to be a significant factor in a user’s ability to trust a given system. Research has shown that in the classical scenario of a person being chauffeured, there is an increased level of discomfort with being a passenger compared with being an active driver [
20], and it appears that this analogy extends to the self-driving car scenario; yet, there is still a decreased amount of trust in robotic drivers versus a human driver given equivalent driving behaviors, as shown by Mühl et al. [
21]. This suggests that self-driving cars not only have to perform as well as a human driver but must also perform even better to gain equivalent trust from their passengers.
Studies by Kolekar et al. suggest self-driving behavior can be made more human-like with the introduction of “driver risk field” modeling where the car’s behavior is tuned to a given driver’s perceived risk when executing driving maneuvers [
22]. This generates autonomous behavior that is more in line with human driving than the current status quo, which only tries to minimize the real risk posed by a driving scenario. Beyond increasing the interactions passengers can experience with a self-driving car, the driving style can also be modified to increase trust in the system. Research conducted by Basu et al. showed that a more defensive driving style led to higher comfort and preference in autonomous driving scenarios [
23]. When participants were surveyed on their driving preferences, they responded that they would want an experience similar to their own driving style for a self-driving car. Yet, when passengers were placed in a simulation, participants preferred a driving style that they thought was their own but was in fact much more conservative than their actual driving style [
23].
This coincides with research conducted by Hajiseyedjavadi et al., who showed that in a simulator, drivers preferred their own driving styles over a faster one but still provided negative feedback when replaying their own driving style on urban roads versus rural roads, suggesting that environmental conditions play a significant part in preferred driving style [
24]. These results are partly supported by Dettmann et al., who showed that younger drivers preferred autonomous driving styles similar to their own, while older drivers showed greater tolerance for driving styles more aggressive than their own [
25]. Dettmann et al. also concluded that the main factors in driving style preference are “speed, acceleration and deceleration behavior as well as distance control” [
25]. In a related study, Schlüter et al. found that technological affinity and skepticism toward the technology should also be taken into account when designing adaptive driving style autonomous vehicle technologies [
26].
Further research conducted by Bellem et al. showed how participants in a simulation study preferred driving styles that minimized acceleration and jerk when performing driving actions such as lane changing [
27]. Bellem et al. also reported that personality traits associated with participants did not have any significant observable effects on autonomous driving preferences [
27]. Consistent with the finding that drivers prefer conservative self-driving car (SDC) behavior, Craig et al. reported that surveyed participants expect a self-driving car to behave in a slightly less aggressive manner than their own driving style [28]. Methods proposed by Park et al. suggest adapting the driving behavior based on electroencephalography feedback to establish and maintain trust in the system [29,30].
Beyond any technical limitations, there are also numerous legal issues that arise with highly autonomous systems. From a legal standpoint, liability frameworks for autonomous vehicles remain fragmented; while some manufacturers now accept responsibility when their automated driving systems are engaged [
2], no uniform international standard has emerged. In the case of semi-autonomous systems, liability attribution becomes less clear. Research from Awad et al. suggests that drivers are blamed more than the automated systems in these semi-autonomous situations even when both make errors [
31]. These findings suggest that more clearly defining liability attribution in autonomous systems could help improve public trust in the technology. With these questions in mind, we must further consider how users will respond to these technologies outside of the demographics in which research is collected. An important question remains as to how research participants are biased by the infrastructure and cultural norms of the country in which research takes place. There are some surveys that provide an international view, such as research conducted by Deloitte in 2020, which provided responses by country (South Korea, Japan, United States, Germany, India, and China) detailing the percentage of consumers who believe SDCs will not be safe. The results provided by Deloitte for most countries follow the United States’ sentiments (∼50% believe they will not be safe) with some outliers, such as China, whose survey data suggests a more trusting sentiment, and India, whose survey data suggests a less trusting sentiment [
32]. Large-scale cross-national studies further illustrate cross-cultural variation: Kyriakidis et al. [
33] surveyed over 5000 respondents from 109 countries and found that respondents from more developed countries expressed greater concern about automated driving, while Muzammel et al. [
34] demonstrated that Hofstede’s cultural dimensions significantly predict national-level variation in AV acceptance. More recent multi-country validation studies [35,36] confirm that cultural values moderate AV acceptance, underscoring the need for cross-cultural validation of trust instruments.
With prominent Level 4 deployments such as Waymo’s driverless service operating primarily in US cities [
3], it becomes challenging to understand the global needs of this technology when so much of this research is based on the experiences of US drivers on American roads. In this paper, we establish new survey-based scales that measure drivers’ own driving behaviors and the behaviors they expect from self-driving cars. Because trust calibration theory predicts that higher trust in automation reduces the perceived need for conservative safety margins [
6], we hypothesize that individuals with greater AI trust will show a smaller gap between their self-reported driving aggressiveness and the aggressiveness they prefer from an SDC. Using these scales, we address the following research questions:
- RQ1.
Do drivers prefer SDC driving behavior that is more conservative than their own, and does this preference vary across countries?
- RQ2.
Is Artificial Intelligence (AI) trust level associated with the gap between self-reported driving aggressiveness and preferred SDC aggressiveness?
- RQ3.
Do cross-cultural differences exist in driving behavior, SDC preferences, AI trust, and driver safety behaviors across the US, Germany, and Panama?
This journal version extends the conference paper [
37] with multi-group measurement invariance testing, discriminant validity analysis of AI Trust (AIT) versus AI Driving Mechanics Trust (AIDMT), AIT threshold sensitivity analysis, multivariate regression controlling for demographics, Bonferroni correction for multiple comparisons, and post-hoc power analysis.
2. Materials and Methods
All statistical analyses were implemented in Python (version 3.12.3) and R (version 4.3.3) by the authors; Anthropic Claude (Claude Opus 4.6, 2026) was used as a coding assistant and to verify consistency of reported statistics across the manuscript.
2.1. Participants and Sampling
This study contributes cross-cultural data on driving behavior and SDC preferences through several newly defined metrics based on question groupings that further characterize user expectations of self-driving cars. We surveyed 157 people across the United States, Germany, and Panama who were recruited through local networking and through PollPool.com. Panamanian data were primarily collected through the distribution of the survey among local contacts in the area who distributed the survey further among their peers. Data from Germany and the United States were primarily collected through PollPool. These regions were chosen as the US and Germany represent two global hubs of automotive manufacturing and are at the forefront of self-driving technologies at scale. In contrast, Panama was chosen as an emerging market due to its high number of consumer sales across Central America, where in 2019 it had the highest number of vehicles registered or sold in the region [
38]. Panama also provides an important contrast as a Latin American market where road safety remains a significant public health challenge [
39] and where cross-cultural communication norms may shape technology acceptance differently than in Western industrialized nations [
40], making AV deployment both promising and sensitive to cultural context. Research on AV perceptions in Latin America remains scarce, with Marroquin et al. [
41] providing one of the few regional studies, finding that interpersonal trust is the strongest predictor of AV acceptance across 18 Latin American countries.
Survey data were machine-translated from English into German and Spanish and subsequently reviewed by a native speaker of each language to verify the accuracy and cultural appropriateness of each item. Formal back-translation was not conducted; however, the measurement invariance analyses reported in
Section 3.9 provide partial empirical evidence that the scales function similarly across the three language versions, though measurement invariance alone cannot establish semantic equivalence (see
Section 4.5). Data were collected over the course of 2022. The respondents totaled 50 from the United States, 66 from Germany, and 41 from Panama. Surveys distributed in the United States and Germany were administered via Google Forms, with PollPool used as a recruitment platform to direct participants to the survey, introducing potential self-selection bias toward tech-savvy respondents. Panamanian data were collected through snowball sampling among local contacts, which may introduce network-based sampling bias. Both represent convenience samples and are not nationally representative; however, multivariate analyses controlling for demographic covariates partially address potential demographic skew across groups (see
Section 2.9). This analysis contributes cross-cultural driving behavior and SDC preference data from an underrepresented Latin American market (Panama) alongside two Western automotive hubs (US and Germany), pairing self-reported driving behavior metrics with SDC driving style preference metrics across multiple countries.
2.2. Survey Procedure and Instruments
Participants completed a 57-question survey covering demographics, personal driving behaviors, and trust in AI and self-driving cars. The survey was structured into seven distinct parts as follows:
Part-1 provides demographic information including data about what country the driver currently resides in and what country they have driven the most in as well as age, gender, ethnicity, education, employment status, income range, etc.
Part-2 provides information regarding how a driver behaves on non-highway roads, and it is used to define an aggressiveness score for the driver.
Part-3 provides information regarding how a driver behaves on highway roads, and it is also used to define an aggressiveness score for the driver.
Part-4 provides information regarding general driving behaviors such as parking, turning, and driving under difficult weather conditions.
Part-5 provides information regarding how much drivers currently trust Artificial Intelligence (AI) and its applications to self-driving cars on a 5-point Likert-type scale: distrust, somewhat distrust, neutral, somewhat trust, and trust.
Part-6 provides data regarding how drivers expect AI/autonomy to perform on non-highway roads.
Part-7 provides data regarding how drivers expect AI/autonomy to perform on highway roads.
2.3. Quantitative Measurement
Each question asked can be related to a quantitative value to define Driving Behavior Aggressiveness (DBA), Self-Driving Car Aggressiveness (SDCA), AI Driving Mechanics Trust (AIDMT), general AI Trust (AIT), and Driver Safety Score (DSS) metrics.
Responses from highway-based and non-highway-based questions were averaged together to provide a more general scoring of the driver’s aggressiveness in all situations. The same averaging method was applied to the SDCA items across highway and non-highway contexts. For DBA scores, a score of 0 represents a conservative driver and a score of 1 represents an aggressive driver. For SDCA scores, a score of 0 represents a conservative SDC and a score of 1 represents an aggressive SDC. These scores can then be used to contrast expectations of an SDC to a participant’s own driving behaviors. For the trust scales, AIT and AIDMT scores of 0 represent full distrust and scores of 1 represent full trust. For DSS, a score of 0 represents the safest (most cautious) driving actions and 1 represents the least safe.
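The scoring described above can be sketched in a few lines. This is a minimal illustration, not the authors' scoring code: it assumes every item is an integer response on a 1-to-5 scale that is rescaled linearly onto [0, 1], and that highway and non-highway items are pooled before averaging. The function names `likert_to_unit` and `scale_score` and the example responses are hypothetical.

```python
from statistics import mean


def likert_to_unit(response: int, n_points: int = 5) -> float:
    """Rescale a 1..n_points response onto [0, 1] (assumed linear coding)."""
    if not 1 <= response <= n_points:
        raise ValueError(f"response {response} outside 1..{n_points}")
    return (response - 1) / (n_points - 1)


def scale_score(highway, non_highway, n_points: int = 5) -> float:
    """Pool highway and non-highway items and average them into one composite."""
    pooled = list(highway) + list(non_highway)
    return mean(likert_to_unit(r, n_points) for r in pooled)


# Hypothetical responses for one participant (not data from the study).
dba = scale_score(highway=[4, 5, 3], non_highway=[2, 4, 3, 3])
sdca = scale_score(highway=[2, 3, 2], non_highway=[2, 2, 3, 2])

# A positive gap means the participant prefers an SDC that drives
# more conservatively than they report driving themselves.
gap = dba - sdca
```

Keeping DBA and SDCA on the same 0-to-1 range is what makes the within-participant gap directly interpretable.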
2.4. Unidimensionality of Scales
As a prerequisite to most consistency and reliability tests as they relate to new scales, the assumption of the unidimensionality of each scale must be evaluated. In this survey, we introduce five new scales to measure various driving behaviors based on survey responses. These scales are constructed from subsets of questions hypothesized to measure a single underlying factor. To evaluate whether these questions are unidimensional (i.e., they measure a single larger factor), we conducted an iterative confirmatory factor analysis (CFA) of these questions, in which theoretically motivated models were refined through inspection of modification indices and item diagnostics.
For the Driving Behavior Aggressiveness (DBA) scores, we constructed a scale from seven items relating to highway and non-highway driving. Correlated residuals (error covariances) were specified between item pairs guided primarily by empirical modification indices and, where applicable, substantive rationale such as parallel item content across road types (e.g., the same driving mechanics measured in highway versus non-highway contexts). Similarly, for the Self-Driving Car Aggressiveness (SDCA) metric, we constructed a scale of seven items that measure how aggressively a given user would want their self-driving car to behave. These are the same seven items as in the DBA scale, now framed as the self-driving car performing the action. The correlated residual structure specified for the DBA scale was also applied to the SDCA scale to maintain consistency. The next scale constructed was the AI Driving Mechanics Trust (AIDMT) scale, which employed six items that gauge, on a 5-point Likert-type scale, how much trust a user has in an AI car performing a variety of highway driving mechanics such as controlling speed and lane changing. Subsequently, the AI Trust (AIT) scale was constructed from six items that determine how much a user would trust AI performance in various complex scenarios. Finally, the Driver Safety Score (DSS) scale was created from seven items that determine how safe the sampled user is based on how they respond to various driving conditions.
During CFA model development, items with standardized factor loadings below 0.40 or whose removal improved model fit were candidates for exclusion. The AIT scale was reduced from 7 to 6 items (one item dropped due to local dependency with an adjacent item), the DBA and SDCA scales from 10 to 7 items each (braking items from both road types and the non-highway speed item were removed), and the DSS scale from 10 to 7 items. The AIDMT scale retained 6 of its original 16 items; notably, all retained items concerned highway driving tasks, so the scale as validated reflects highway-specific driving mechanics trust rather than general driving mechanics trust. Full original item pools are available in the archived survey instruments [
42].
The CFA considered how each item in a scale related to a single latent variable using polychoric correlations to account for the categorical nature of the ordinal response scales [
43]. All CFA models were estimated using the Weighted Least Squares Mean and Variance adjusted (WLSMV) estimator, which is recommended for ordinal categorical data [
43]. No missing data were present at the item level. The DSS model included one correlated residual between items involving lateral vehicle movement (turning behavior and night driving lane changes), guided by modification indices and specified on the same empirical basis as the DBA and SDCA residuals. Goodness of fit for each model was evaluated using the mean-adjusted (scaled) fit indices produced by WLSMV, which correct for non-normality due to the ordinal data. The primary fit indices reported are the $\chi^2$ test statistic and its p-value, the Comparative Fit Index (CFI), and the Tucker–Lewis Index (TLI). Root Mean Square Error of Approximation (RMSEA) is also provided but should be interpreted cautiously for this study due to known issues with overestimating misfit when the degrees of freedom and sample size are small [
44].
Table 1 presents the fit parameters for each scale. Standardized factor loadings range from 0.45 to 0.92 for DBA, 0.74 to 0.89 for SDCA, 0.75 to 0.85 for AIT, 0.74 to 0.87 for AIDMT, and 0.49 to 0.72 for DSS. The p-value for each scale is greater than 0.05 and thus non-significant. CFI and TLI values exceed 0.95 for all five scales, indicating good model fit [
45]. RMSEA values range from 0.025 to 0.074, indicating close to reasonable fit, though as noted above, RMSEA should be interpreted cautiously given the small sample size and low degrees of freedom; the CFI/TLI values provide stronger evidence of model fit.
2.5. Consistency and Reliability in Measurement
To establish the reliability of our analysis, the consistency of each scale must be demonstrated. While Cronbach’s alpha is often used for this purpose in the literature, its assumptions are frequently violated in practice: it assumes tau-equivalence of the items within each set (i.e., each item contributes equally to the general factor being measured) and that the general factor being measured is unidimensional [46,47].
Furthermore, if either of these requirements is not met, the reported Cronbach’s alpha value may over- or underestimate the reliability of the test, depending on the pattern of loadings and error correlations [46,48]. To address these issues, we consider a model-based measure of reliability known as McDonald’s Omega [48,49], which takes a factor analysis approach to deriving correlations between items. McDonald’s Omega maintains the same range and threshold of accepted consistency as Cronbach’s alpha, where Omega values greater than 0.7 are considered acceptable, while not requiring tau-equivalence among the items. Similar to Cronbach’s alpha, the data are required to be unidimensional, which we have established for our data using the CFA in the previous section.
Table 2 presents the resulting Omega values. Our results for McDonald’s Omega exceed 0.7, confirming good reliability for all five scales.
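For readers unfamiliar with the statistic, the following sketch shows how Omega can be computed from standardized loadings of a single-factor model. It is an illustration under simplifying assumptions, not the authors' procedure: residuals are taken to be uncorrelated, so each residual variance is one minus the squared loading, and the loadings shown are hypothetical rather than the values behind Table 2. With correlated residuals, as specified for some scales here, the denominator would also need the residual covariances.

```python
def mcdonald_omega(loadings):
    """Omega for a one-factor model with standardized loadings.

    omega = (sum of loadings)^2 / ((sum of loadings)^2 + sum of residual variances),
    where each residual variance is 1 - loading^2 (uncorrelated residuals assumed).
    """
    if not loadings:
        raise ValueError("at least one loading is required")
    common = sum(loadings) ** 2
    unique = sum(1.0 - lam ** 2 for lam in loadings)
    return common / (common + unique)


# Hypothetical loadings for a six-item scale (illustrative only).
example_loadings = [0.75, 0.78, 0.80, 0.82, 0.84, 0.85]
omega = mcdonald_omega(example_loadings)
acceptable = omega > 0.7  # threshold stated in the text
```

Because each loading enters individually, the estimate does not depend on items contributing equally to the factor, which is the tau-equivalence assumption that Cronbach's alpha requires.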
2.6. Measurement Invariance
This study compares scale scores across three culturally and linguistically distinct groups; as such, it is essential to establish that the scales function equivalently across groups. To do so, we established evidence for measurement invariance [50,51]. Without this evidence, observed cross-cultural differences may reflect measurement artifacts rather than true group differences. Prior work by Chien et al. [
52] validated a trust-in-automation scale across US, German, Taiwanese, and Turkish samples, demonstrating the importance of such validation for cross-cultural trust research. Multi-group CFA was conducted for each of the five scales using the WLSMV estimator with theta parameterization, which is preferred for ordinal response data [
53]. Three progressively restrictive models were tested: (1)
configural invariance (same factor structure across groups), (2)
metric invariance (equal factor loadings), and (3)
scalar invariance (equal loadings and thresholds). Model comparisons used the change in CFI ($\Delta\mathrm{CFI}$), where $|\Delta\mathrm{CFI}| \le 0.01$ indicates that invariance holds [
50]. We note that Chen [
50] suggested a stricter threshold of
for small, unequal samples; however, the 0.01 criterion remains the most widely applied standard in applied measurement invariance research [
51] and is adopted here. For scales where some response categories were empty within specific country groups (DSS, SDCA, and DBA), sparse categories were collapsed to enable model convergence. Results are presented in
Section 3.9.
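The stepwise comparison can be expressed as a small decision rule. The sketch below assumes the CFI of each nested model has already been estimated elsewhere (for example by a multi-group CFA in R) and only applies the 0.01 criterion to successive models; the function name `invariance_steps` and the CFI values are hypothetical.

```python
def invariance_steps(cfi_by_model, max_drop=0.01, tol=1e-9):
    """Compare configural -> metric -> scalar models by their drop in CFI.

    Returns, for each more restrictive model, the CFI drop relative to the
    previous model and whether that drop stays within the criterion.
    """
    order = ["configural", "metric", "scalar"]
    results = {}
    for previous, current in zip(order, order[1:]):
        drop = cfi_by_model[previous] - cfi_by_model[current]
        # tol guards against floating-point noise at the boundary.
        results[current] = {"drop": drop, "holds": drop <= max_drop + tol}
    return results


# Hypothetical CFI values for one scale (not results from the study).
steps = invariance_steps({"configural": 0.990, "metric": 0.985, "scalar": 0.970})
```

In this made-up example the metric model would be retained and the scalar model would not, which is the pattern usually described as partial invariance.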
2.7. Discriminant Validity: AIT Versus AIDMT
The moderate observed Pearson correlation between the AIT and AIDMT scales (
) raises the question of whether these scales measure genuinely distinct constructs. AIT captures
general trust in AI and autonomous technologies, whereas AIDMT measures trust in
specific driving mechanics performed by an autonomous vehicle. This distinction mirrors the established generalized versus task-specific trust framework in organizational psychology [
54]. To empirically validate this distinction, three complementary analyses were conducted: (1) a comparison of two-factor versus one-factor CFA models using chi-square difference testing, (2) the Fornell–Larcker criterion, requiring each scale’s Average Variance Extracted (AVE) to exceed the squared inter-factor correlation, and (3) the Heterotrait–Monotrait (HTMT) ratio, where HTMT
supports discriminant validity [
55]. Results are presented in
Section 3.10.
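Two of the three checks reduce to simple arithmetic on loadings and item correlations, sketched below. The code is illustrative only: it assumes standardized loadings, a plain nested-list item correlation matrix with positive correlations, and hypothetical inputs. It reports the HTMT ratio without applying a cutoff.

```python
from itertools import combinations, product
from math import sqrt
from statistics import mean


def ave(loadings):
    """Average Variance Extracted: mean of squared standardized loadings."""
    return mean(lam * lam for lam in loadings)


def fornell_larcker(loadings_a, loadings_b, factor_corr):
    """True when both AVEs exceed the squared inter-factor correlation."""
    shared = factor_corr ** 2
    return ave(loadings_a) > shared and ave(loadings_b) > shared


def htmt(corr, items_a, items_b):
    """Heterotrait-Monotrait ratio from an item correlation matrix.

    Numerator: mean correlation between items of different scales.
    Denominator: geometric mean of the mean within-scale correlations.
    """
    hetero = mean(corr[i][j] for i, j in product(items_a, items_b))
    mono_a = mean(corr[i][j] for i, j in combinations(items_a, 2))
    mono_b = mean(corr[i][j] for i, j in combinations(items_b, 2))
    return hetero / sqrt(mono_a * mono_b)


# Hypothetical 4-item example: items 0-1 belong to scale A, items 2-3 to scale B.
corr = [
    [1.0, 0.8, 0.4, 0.4],
    [0.8, 1.0, 0.4, 0.4],
    [0.4, 0.4, 1.0, 0.8],
    [0.4, 0.4, 0.8, 1.0],
]
ratio = htmt(corr, items_a=[0, 1], items_b=[2, 3])
distinct = fornell_larcker([0.8, 0.8], [0.8, 0.8], factor_corr=0.5)
```

A low ratio means items correlate more strongly within their own scale than across scales, which is the pattern expected if general AI trust and driving-mechanics trust are separate constructs.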
2.8. AIT Threshold Justification
This study classifies participants as “trustful” (AIT $> 0.5$) or “distrustful” (AIT $\le 0.5$) when examining the relationship between AI trust and SDC aggressiveness preferences (
Section 3). The 0.5 threshold represents the neutral midpoint of the 0–1 scale: an average item response above 0.5 indicates a net positive disposition toward trust, while a score at or below 0.5 indicates a net negative disposition, yielding
and
. To verify that findings are not sensitive to this particular threshold, a separate sensitivity analysis was conducted across seven thresholds (0.35, 0.40, 0.45, 0.50, 0.55, 0.60, and 0.65); in the sensitivity analysis, participants scoring exactly at each threshold were excluded to create clean group separation (e.g., at 0.50, 16 participants with AIT $= 0.50$ were excluded, yielding
and
). Additionally, continuous analyses using Spearman correlations and Ordinary Least Squares (OLS) regression were performed to complement the dichotomous approach. To move beyond a null-result interpretation for the trustful group, a Two One-Sided Tests (TOST) equivalence procedure [
56], implemented using two one-sided Wilcoxon signed-rank tests on the paired DBA–SDCA differences, was applied to determine whether the DBA–SDCA difference in the trustful group was statistically equivalent to zero within a meaningful margin. The equivalence margin was set to
, equal to the distrustful group’s observed pseudo-median, representing the smallest empirically meaningful gap. TOST results are presented alongside the paired DBA–SDCA comparison in
Section 3; threshold sensitivity results are presented in
Section 3.11.
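The grouping logic of the sensitivity analysis can be sketched as follows. This is a simplified illustration: it splits participants at each threshold, drops exact ties, and summarizes each group by the median DBA minus SDCA gap. The median is used here only as a stand-in summary; the Wilcoxon and TOST tests described above are not implemented. Function names and data are hypothetical.

```python
from statistics import median

THRESHOLDS = (0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65)


def split_by_threshold(ait, threshold, tol=1e-9):
    """Return (trustful indices, distrustful indices, number excluded at the threshold)."""
    trustful = [i for i, score in enumerate(ait) if score > threshold + tol]
    distrustful = [i for i, score in enumerate(ait) if score < threshold - tol]
    excluded = len(ait) - len(trustful) - len(distrustful)
    return trustful, distrustful, excluded


def sensitivity(ait, dba, sdca, thresholds=THRESHOLDS):
    """Summarize the DBA - SDCA gap in each trust group at every threshold."""
    def median_gap(indices):
        return median(dba[i] - sdca[i] for i in indices) if indices else None

    rows = []
    for threshold in thresholds:
        trustful, distrustful, excluded = split_by_threshold(ait, threshold)
        rows.append({
            "threshold": threshold,
            "n_trustful": len(trustful),
            "n_distrustful": len(distrustful),
            "n_excluded": excluded,
            "gap_trustful": median_gap(trustful),
            "gap_distrustful": median_gap(distrustful),
        })
    return rows


# Hypothetical scores for four participants (not data from the study).
ait = [0.2, 0.5, 0.5, 0.8]
dba = [0.6, 0.5, 0.5, 0.4]
sdca = [0.3, 0.5, 0.5, 0.4]
table = sensitivity(ait, dba, sdca)
```

If the direction of the gap is stable across all seven rows, the conclusion does not hinge on the choice of 0.5 as the cut point.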
2.9. Multivariate Analysis
To assess whether observed cross-cultural differences persisted after accounting for demographic heterogeneity, OLS regression was conducted for each scale with country as the primary predictor and age, gender, and education level as covariates. Demographics were harmonized across the three survey languages by standardizing response categories. Education was coded ordinally (0 = none, 1 = high school or equivalent, 2 = bachelor’s, 3 = master’s, and 4 = doctoral/professional), with country-specific degree names mapped to the nearest equivalent level. Income was excluded due to incompatible currency units across countries. Driving experience (years of licensure) was not collected in the survey instrument and therefore could not be included as a covariate. Seven participants were excluded from the regression analyses due to missing demographic data (4 from Panama and 3 from Germany), yielding $n = 150$. OLS regression on averaged ordinal scale scores is standard practice in psychometric research when the dependent variable represents a multi-item composite [57,58]; averaging across 6–7 items produces a quasi-continuous distribution, and with $n = 150$, the Central Limit Theorem supports the assumption that the sampling distributions of the OLS coefficient estimates are approximately normal. Results are presented in
Section 3.12.
3. Results
This section compares the scores generated from each of the defined quantitative metrics (DBA, SDCA, AIDMT, AIT, and DSS) against the collected demographic data to test whether there are statistically observable differences across demographics within a nation’s population as well as from an international perspective. Distributions were compared using two-sided Mann–Whitney U tests for independent groups and two-sided Wilcoxon signed-rank tests for paired within-subject comparisons (e.g., DBA versus SDCA and AIT versus AIDMT) and reported if the resulting
p-value was less than 0.05. Direction of effects was conveyed through signed effect size measures (Cliff’s δ, Hodges–Lehmann estimates, pseudo-medians, and rank-biserial correlations) rather than test sidedness. In comparisons involving multiple countries, a per-construct Bonferroni correction [59] was applied within each scale to control the family-wise error rate. Because each scale measures a conceptually distinct construct, the correction was applied within each family of three country-pair comparisons (α = 0.05/3 ≈ 0.0167) rather than across all 15 tests globally. This approach scopes the correction to tests that share the same construct-specific null hypothesis, avoiding the inflation of type II error that results from penalizing each construct for comparisons on unrelated scales. Under a global Bonferroni correction across all 15 tests (α = 0.05/15 ≈ 0.0033), four of eight significant results would no longer reach significance (DBA US–Panama, SDCA US–Panama, SDCA Panama–Germany, and DSS Panama–Germany). Within-country paired comparisons (DBA vs. SDCA and AIT vs. AIDMT) are reported at α = 0.05 without correction, as each paired test addresses a distinct within-subject hypothesis.
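The difference between the per-construct and global corrections is simple arithmetic, sketched below. The function and constant names are illustrative, and the p-value in the usage note is hypothetical rather than a result from this study.

```python
SCALES = ("DBA", "SDCA", "AIT", "AIDMT", "DSS")
COUNTRY_PAIRS = (("US", "Germany"), ("US", "Panama"), ("Panama", "Germany"))
ALPHA = 0.05


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# One family per scale: three country-pair comparisons each.
per_construct_alpha = bonferroni_threshold(ALPHA, len(COUNTRY_PAIRS))

# One family for everything: 5 scales x 3 country pairs = 15 tests.
global_alpha = bonferroni_threshold(ALPHA, len(SCALES) * len(COUNTRY_PAIRS))


def is_significant(p_value, threshold):
    """A test survives the correction when its p-value falls below the threshold."""
    return p_value < threshold
```

A hypothetical p-value of 0.01 would survive the per-construct threshold (about 0.0167) but not the global threshold (about 0.0033), which is the pattern described for the four results that depend on the choice of family.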
As the Mann–Whitney U test provides only a significance value without quantifying the magnitude of difference, additional statistical tools were employed for measures found significant. The Hodges–Lehmann (HL) estimator [
60], bootstrapped with 10,000 iterations, was used to estimate the median location shift between distributions along with 95% confidence intervals; positive HL values indicate the first group scored higher than the second. This analysis also reports Cliff’s δ [61] as a standardized effect size measure describing the tendency for scores in one group to be higher than the other. Cliff’s δ is measured on a scale of −1 to +1, where +1 represents the case when all values in group A are greater than all values in group B, 0 represents groups with perfect overlap, and −1 represents all values in group A being less than all values in group B. For paired within-subject comparisons, the pseudo-median of within-subject differences and its 95% confidence interval are reported as the location-shift measure, along with the matched-pair rank-biserial correlation as the effect size. Positive pseudo-median values indicate the first scale scored higher; the rank-biserial correlation ranges from −1 to +1 with the same directional interpretation.
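The two between-group measures defined above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors’ implementation: the percentile bootstrap and the fixed seed are choices made here, and the function names are hypothetical.

```python
import random
from statistics import median


def cliffs_delta(a, b):
    """Share of (x, y) pairs with x > y minus the share with x < y; lies in [-1, 1]."""
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))


def hodges_lehmann(a, b):
    """Two-sample Hodges-Lehmann shift: median of all pairwise differences x - y."""
    return median(x - y for x in a for y in b)


def bootstrap_ci(a, b, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap interval for the Hodges-Lehmann shift (assumed method)."""
    rng = random.Random(seed)
    estimates = sorted(
        hodges_lehmann(rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
        for _ in range(n_boot)
    )
    lower = estimates[int((1 - level) / 2 * n_boot)]
    upper = estimates[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper
```

Both functions take the first group as the reference for sign, so a positive value means the first group tends to score higher, matching the convention in the text.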
3.1. Summary Statistics
Table 3 summarizes the demographic characteristics of participants by country. Panamanian respondents were older on average (
;
) than US (
;
) and German (
;
) respondents. The US sample was predominantly male (68.0%), whereas the German sample was predominantly female (63.6%). Education levels varied, with Germany having the highest proportion of bachelor’s (55.6%) and master’s (23.8%) degree holders, while Panama had the highest proportion of high school or some college education (60.0%). Seven participants had incomplete demographic data (four from Panama and three from Germany) and were excluded from regression analyses, yielding
for those models.
To provide further evidence that our metrics measure different aspects of a respondent’s driving profile, we considered the correlations among their responses. The Pearson correlation coefficient was evaluated between each measured metric across all demographics. The basic correlation analysis shows that most inter-scale correlations are weak to moderate, with the exception of AIT and AIDMT (
), whose distinctness is empirically verified through discriminant validity analysis (
Section 3.10). The remaining correlations support the conclusion that each scale is measuring a different aspect of driving behaviors and preferences. These results are shown in
Table 4. The summary statistics for each surveyed demographic were also generated and are shown in
Table 5. These tables provide a high-level overview of how each demographic tended to respond in each measured scale. A more detailed examination of how these distributions compare with one another is shown in the following sections; however, some differences in the distribution of responses across demographics are already apparent. Detailed paired and between-group distribution comparisons for all scales are presented in
Table 6, and all cross-country Mann–Whitney U tests with Bonferroni correction are summarized in
Table 7.
Table 8.
DBA scale items.
| Item | DBA Questions Used for Scale |
|---|---|
| DBA1 | Which best describes your driving behavior most of the time in terms of speed while driving on: THE HIGHWAY |
| DBA2 | Which best describes your lane changing behavior when driving on: THE HIGHWAY |
| DBA3 | How would you describe the way you accelerate and decelerate while driving on: THE HIGHWAY |
| DBA4 | How often do you pass others when driving on: THE HIGHWAY |
| DBA5 | What best describes your lane changing behavior when driving on: NON-HIGHWAY ROADS |
| DBA6 | How would you describe the way you accelerate and decelerate while driving on: NON-HIGHWAY ROADS |
| DBA7 | How often do you pass other vehicles when driving on: NON-HIGHWAY ROADS |
Table 9.
SDCA scale items.
| Item | SDCA Questions Used for Scale |
|---|---|
| SDCA1 | If you are traveling in a self-driving car, and the car is in control of the speed, what range speed would you feel most comfortable with when driving on: HIGHWAY ROADS |
| SDCA2 | If you are traveling in a self-driving car, on HIGHWAY ROADS, you expect the car to change lanes: |
| SDCA3 | If you are traveling in a self-driving car, on HIGHWAY ROADS, your preference for the way it accelerates and decelerates would be: |
| SDCA4 | If you are traveling in a self-driving car, how often would you expect the car to pass other vehicles when driving on: HIGHWAY ROADS |
| SDCA5 | If you are traveling in a self-driving car, on NON-HIGHWAY ROADS, you expect the car to change lanes: |
| SDCA6 | If you are traveling in a self-driving car, on NON-HIGHWAY ROADS, your preference for the way it accelerates and decelerates would be: |
| SDCA7 | If you are traveling in a self-driving car, how often would you expect the car to pass other vehicles when driving on: NON-HIGHWAY ROADS |
Table 10.
AIT scale items.
| Item | AIT Questions Used for Scale |
|---|---|
| AIT1 | What is your trust level to utilize Artificial Intelligence or Fully Autonomous Technologies? |
| AIT2 | What is your trust level that self-driving cars will keep your own safety as its primary objective? |
| AIT3 | What is your trust level that self-driving cars will be able to navigate in construction zones that include temporary detours that would ordinarily go against the flow of traffic? |
| AIT4 | What is your trust level that self-driving cars will be able to navigate in crowded pedestrian areas? |
| AIT5 | What is your trust level that self-driving cars will successfully get you to the EXACT destination you requested? |
| AIT6 | What is your trust level in the ability of self-driving cars to navigate safely with no person in the vehicle? |
Table 11.
AIDMT scale items.
| Item | AIDMT Questions Used for Scale |
|---|---|
| AIDMT1 | If you were in a self-driving car, what tasks do you feel comfortable handing over to the car to perform autonomously? Please rank your trust level when it comes to the following driving tasks being performed by a self-driving car when driving on: HIGHWAY ROADS [The speed of the car] |
| AIDMT2 | If you were in a self-driving car, what tasks do you feel comfortable handing over to the car to perform autonomously? Please rank your trust level when it comes to the following driving tasks being performed by a self-driving car when driving on: HIGHWAY ROADS [Changing lanes] |
| AIDMT3 | If you were in a self-driving car, what tasks do you feel comfortable handing over to the car to perform autonomously? Please rank your trust level when it comes to the following driving tasks being performed by a self-driving car when driving on: HIGHWAY ROADS [Signaling] |
| AIDMT4 | If you were in a self-driving car, what tasks do you feel comfortable handing over to the car to perform autonomously? Please rank your trust level when it comes to the following driving tasks being performed by a self-driving car when driving on: HIGHWAY ROADS [Driving in hazardous weather conditions] |
| AIDMT5 | If you were in a self-driving car, what tasks do you feel comfortable handing over to the car to perform autonomously? Please rank your trust level when it comes to the following driving tasks being performed by a self-driving car when driving on: HIGHWAY ROADS [Braking] |
| AIDMT6 | If you were in a self-driving car, what tasks do you feel comfortable handing over to the car to perform autonomously? Please rank your trust level when it comes to the following driving tasks being performed by a self-driving car when driving on: HIGHWAY ROADS [Maintaining a certain distance from the cars around you] |
Table 12.
DSS scale items.
| Item | DSS Questions Used for Scale |
|---|---|
| DSS1 | What best describes your driving behavior when making a turn? |
| DSS2 | What best describes your driving behavior when driving on a winding road? |
| DSS3 | When driving in that less than perfect weather condition, what happens to your speed? |
| DSS4 | How would you describe your lane changing behavior when driving in that less than perfect weather condition? |
| DSS5 | Which best describes your driving behavior most of the time in terms of speed while driving at night? |
| DSS6 | Which best describes your behavior most of the time when it comes to signaling lane changes while driving at night: |
| DSS7 | Which best describes your lane changing behavior when driving at night? |
3.2. International DBA and SDCA Metrics Across All Demographics
The following DBA and SDCA scores shown in
Figure 1a,b were generated and compared at an international level. These distributions show that across the combined international sample, most drivers behave in a conservative to moderate fashion and most drivers prefer a more conservative self-driving car compared with their own driving behaviors. This finding supports previous research conducted in [
28], which provided similar distributions across these metrics (see
Table 6).
Using a two-sided Wilcoxon signed-rank test (paired), we found evidence that these distributions are statistically different, with the underlying distribution of the DBA score being statistically greater than the underlying distribution of the SDCA score (positive pseudo-median and
; see
Table 6) with a
p-value
. Also relevant is the relationship between people’s trust in AI and how they would like their SDC to perform. For this analysis, we considered a driver with an AIT score of less than or equal to
to be generally distrustful of AI technology and a driver with a trust score greater than
to be generally trustful of AI technology. Under these parameters, the following SDCA score distributions are considered and shown in
Figure 1. Using a two-sided Wilcoxon signed-rank test, we compared each participant’s own DBA and SDCA scores within each trust group. The comparison of the SDCA of drivers who are distrustful of AI technology with their own driving behaviors shows a
p-value of <0.001, indicating that the underlying distributions are not equal and agreeing with the general result that people want an SDC that is
more conservative than their own driving behavior. However, this result changes when we consider the SDCA of drivers who are trustful of AI technology compared with their own driving behavior. In this case, we failed to reject the null hypothesis with a
p-value of
, finding no evidence that the trustful SDCA distribution differs from the DBA distribution. A TOST equivalence test confirmed that the trustful group’s DBA–SDCA difference is statistically equivalent to zero within the equivalence margin of
(the distrustful group’s pseudo-median): TOST
; the 90% confidence interval of the trustful pseudo-median
falls entirely within the equivalence bounds. Thus, the trustful group’s DBA–SDCA gap is not merely non-significant but
demonstrably negligible, supporting the interpretation that users who are more trustful of AI technology show
DBA–SDCA equivalence, consistent with acceptance of a driving style comparable to their own.
3.3. American Respondent Analysis
This analysis examines whether the same difference in preference holds when comparing DBA and SDCA scores for American respondents. A two-sided Wilcoxon signed-rank test (paired) failed to reject the null hypothesis when comparing these two distributions (
; see
Table 6). No evidence was found that American respondents prefer a different SDC aggressiveness level than their own driving behavior; however, this null result should not be interpreted as evidence of equivalence given the modest sample size (
). The two trust metrics AIT and AIDMT were computed and showed that
American respondents had higher trust in AI’s ability to perform the mechanics of driving than in the technology as a whole with a corresponding
p-value of
from a two-sided Wilcoxon signed-rank test (paired; see
Table 6), providing evidence that the AIT and AIDMT distributions are stochastically different (AIDMT > AIT based on positive pseudo-median).
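The caution above, that a non-significant paired test is not evidence of equivalence, can be made concrete with a short sketch of the paired measures and an equivalence check. The confidence-interval inclusion rule used here (a 90% interval lying inside the equivalence bounds) is the standard counterpart of a TOST at the 0.05 level, but the bootstrap interval, seed, and function names are assumptions and may differ from the procedure the authors used.

```python
import random
from statistics import median


def pseudo_median(diffs):
    """Median of the Walsh averages (d_i + d_j) / 2 over all pairs with i <= j."""
    n = len(diffs)
    return median((diffs[i] + diffs[j]) / 2 for i in range(n) for j in range(i, n))


def rank_biserial(diffs):
    """Matched-pair rank-biserial correlation from signed ranks of nonzero differences."""
    nonzero = sorted((d for d in diffs if d != 0), key=abs)
    if not nonzero:
        return 0.0
    ranks = [0.0] * len(nonzero)
    i = 0
    while i < len(nonzero):
        j = i
        # Tied absolute differences share the average of their rank positions.
        while j + 1 < len(nonzero) and abs(nonzero[j + 1]) == abs(nonzero[i]):
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    positive = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    negative = sum(r for r, d in zip(ranks, nonzero) if d < 0)
    return (positive - negative) / (positive + negative)


def equivalent_within(diffs, margin, n_boot=2000, seed=0):
    """True when the 90% bootstrap interval of the pseudo-median lies inside (-margin, margin)."""
    rng = random.Random(seed)
    estimates = sorted(
        pseudo_median(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    lower = estimates[int(0.05 * n_boot)]
    upper = estimates[int(0.95 * n_boot) - 1]
    return -margin < lower and upper < margin
```

Here `diffs` would hold each participant’s DBA minus SDCA score. A small sample can produce a non-significant Wilcoxon test while `equivalent_within` still returns `False`, because a wide interval can extend beyond the equivalence bounds.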
3.4. German Respondent Analysis
The German responses show a clear difference between the DBA and SDCA distributions. Using a two-sided Wilcoxon signed-rank test (paired), we observed a
p-value of <0.001, providing statistical evidence that the German DBA and SDCA distributions are stochastically different (DBA > SDCA based on positive pseudo-median; see
Table 6). This measurement supports the idea that
German drivers prefer a more conservative SDC than their own driving behaviors. Notably,
Germans had the highest mean DBA score across all participants for this research, though this difference was not statistically significant versus US respondents after Bonferroni correction (see
Table 7). When considering AI Trust metrics, German respondents showed higher trust in AI’s ability to perform the mechanics of driving (AIDMT) than in the technology as a whole (AIT). This is supported by a
p-value of <0.001 from a two-sided Wilcoxon signed-rank test (paired; see
Table 6) (AIDMT > AIT based on positive pseudo-median).
3.5. Panamanian Respondent Analysis
The Panamanian responses showed a statistical difference between the DBA and SDCA scores with a
p-value of
using a two-sided Wilcoxon signed-rank test (paired), providing evidence that the DBA and SDCA distributions are stochastically different (DBA > SDCA based on positive pseudo-median; see
Table 6). This result provides statistical evidence that
Panamanian drivers prefer a more conservative car behavior than their own driving behavior. The two trust metrics AIT and AIDMT were also evaluated. Notably,
Panamanian respondents had significantly lower AIDMT scores compared with their American and German counterparts (see
Table 7), while AIT differences did not reach significance after Bonferroni correction. Furthermore, Panamanians had nearly equal trust in AI’s ability to perform driving mechanics and in the technology as a whole; a two-sided Wilcoxon signed-rank test (paired) failed to reject the null hypothesis (
; see
Table 6), failing to provide evidence that the two distributions are stochastically different. This is in sharp contrast with both Americans and Germans, who had more trust in AI’s ability to manage driving mechanics versus the technology as a whole.
3.6. Americans Versus Germans
When we compared Americans versus Germans, we found no significant differences in the generated distributions for any of the five metrics, DBA, SDCA, AIT, AIDMT, and DSS, after per-construct Bonferroni correction (all
; see
Table 7). Note that post hoc power was below 0.26 for all US–Germany comparisons (see
Section 4.5), so these null results should not be interpreted as evidence of equivalence. This paper also considers which individual AI Trust questions yielded statistically different distributions across demographics. These questions give insight into the needs of each demographic in terms of general AI trust. The resulting distributions show that Americans reported higher trust in the AI’s ability to navigate crowded pedestrian areas compared with Germans, with a
p-value of
. Note that the individual trust–question comparisons reported in the cross-country subsections below are exploratory and are not corrected for multiple comparisons; they should be interpreted as hypothesis-generating rather than confirmatory.
3.7. Americans Versus Panamanians
When we compared Americans versus Panamanians, we found statistically significant results in four of the five metrics, DBA, SDCA, AIDMT, and DSS, with
p-values of
,
, <0.001, and <0.001, respectively, in two-sided Mann–Whitney U tests, all surviving per-construct Bonferroni correction (
; see
Table 7), with positive Cliff’s
values confirming that the distributions for these metrics were statistically higher for American respondents when compared with Panamanian respondents. The results illustrate that
Panamanian drivers prefer a more conservative SDC experience relative to American drivers. Additionally, the results show that
Panamanians reported lower trust in AI’s ability to perform driving mechanics. The lower DSS scores further indicate that
Panamanian drivers tend to take more cautious driving actions compared with American drivers. Among the individual AI trust questions with statistically significant differences, Americans reported higher trust than Panamanians that a self-driving car will keep their safety as its primary objective, with a
p-value of
. Additionally, Americans reported higher trust in AI’s ability to perform with no person in the vehicle compared with Panamanians, with a
p-value of
in two-sided Mann–Whitney U tests.
3.8. Panamanians Versus Germans
When we compared Panamanians with Germans, we found statistically significant differences in DBA, SDCA, AIDMT, and DSS scores (
,
,
, and
, respectively; all
; see
Table 7). These distributions indicate that
Panamanian drivers prefer a more conservative driving style compared with German drivers. Additionally,
Panamanian drivers expect a more conservative self-driving car compared with German drivers.
A statistically significant difference also emerged in trust in the SDC’s ability to perform driving mechanics: the AIDMT distributions show that Panamanians are less trustful of AI’s ability to perform driving mechanics than Germans. The lower DSS score distribution indicates that Panamanian drivers tend to take more cautious driving actions compared with German drivers. Among the individual AI trust questions with statistically significant differences, Panamanians reported higher trust in AI’s ability to navigate a crowded pedestrian area than Germans (). Additionally, Germans reported higher trust in AI’s ability to navigate to an exact destination compared with Panamanians ().
3.9. Measurement Invariance Results
Multi-group CFA results for measurement invariance are summarized in
Table 13 and
Figure 2. Four of five scales (AIT, AIDMT, SDCA, and DBA) achieved full scalar invariance, with all ΔCFI values below the 0.01 threshold recommended by Chen [
50]. The DSS scale achieved configural invariance but failed the configural-to-metric transition (
), indicating that factor loadings may not be fully equivalent across groups for this scale. To investigate further, each item loading was systematically freed one at a time to identify the source of non-invariance. Freeing the loading for item DSS6 (“Which best describes your behavior most of the time when it comes to signaling lane changes while driving at night?”) was sufficient to achieve partial metric invariance (
; CFI
). Item DSS5 also independently achieved partial invariance when freed (
). Following Byrne et al. [
62], cross-cultural comparisons on the DSS remain interpretable under partial metric invariance, as the majority of factor loadings (six of seven) are invariant across groups. Notably, the two primary scales of interest for cross-cultural comparisons (SDCA and DBA) both demonstrate full scalar invariance, supporting the validity of mean-level comparisons across groups.
3.10. Discriminant Validity Results
Three complementary analyses confirmed discriminant validity between AIT and AIDMT. First, the two-factor CFA model (CFI = 0.995) fit significantly better than a one-factor model (CFI = 0.975;
difference
), confirming that AIT and AIDMT measure distinct latent constructs. Second, the Fornell–Larcker criterion was satisfied: AVE(AIT) = 0.649 and AVE(AIDMT) = 0.689, both exceeding the squared inter-factor correlation of 0.545. Third, the HTMT ratio was 0.711, well below the 0.85 threshold recommended by Henseler et al. [
55]. These results show that while AIT and AIDMT are related (latent factor
; observed Pearson
; see
Table 4), they measure different aspects of trust.
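The second and third criteria can be checked directly from the values reported above. The sketch below only restates that arithmetic; the function names are illustrative, and the square root is simply the correlation implied by the reported squared inter-factor correlation.

```python
import math

# Values as reported in this section.
AVE = {"AIT": 0.649, "AIDMT": 0.689}
SQUARED_CORRELATION = 0.545
HTMT = 0.711
HTMT_THRESHOLD = 0.85


def fornell_larcker_ok(ave_by_factor, squared_corr):
    """Each factor's AVE must exceed the squared correlation between the factors."""
    return all(ave > squared_corr for ave in ave_by_factor.values())


def htmt_ok(htmt, threshold=HTMT_THRESHOLD):
    """The heterotrait-monotrait ratio must fall below the chosen threshold."""
    return htmt < threshold


# Correlation implied by the reported squared value (about 0.74).
implied_correlation = math.sqrt(SQUARED_CORRELATION)
```

Both checks pass for the reported values: the smaller AVE (0.649) exceeds 0.545, and 0.711 is below 0.85.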
3.11. AIT Threshold Sensitivity Results
Table 14 and
Figure 3 present the sensitivity analysis results. The SDCA distributions of trustful and distrustful groups differ significantly at all seven tested thresholds (0.35–0.65), with trustful individuals consistently reporting higher SDCA scores. Combined with the paired DBA–SDCA tests in
Table 6a, this supports the finding that AI-trustful individuals show a smaller DBA–SDCA gap than distrustful individuals regardless of threshold choice. Continuous analyses also support this finding: the Spearman correlation between AIT scores and the DBA–SDCA gap was
(
), indicating that higher AI trust is associated with a smaller gap between one’s own driving aggressiveness and preferred SDC aggressiveness. OLS regression on the full sample (
) confirmed that AIT significantly predicts SDCA (
;
) after controlling for country, with an overall model
.
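A threshold sweep of this kind can be sketched as follows. The even 0.05 spacing of the seven thresholds between 0.35 and 0.65 is an assumption, as is the use of Cliff’s δ to summarize each split; the study reports significance tests at each threshold, which this sketch does not reproduce.

```python
def cliffs_delta(a, b):
    """Share of (x, y) pairs with x > y minus the share with x < y; lies in [-1, 1]."""
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))


# Seven thresholds from 0.35 to 0.65; the 0.05 step is assumed.
THRESHOLDS = [round(0.35 + 0.05 * k, 2) for k in range(7)]


def sweep(ait_scores, sdca_scores, thresholds=THRESHOLDS):
    """For each threshold, compare SDCA between trustful (AIT > t) and distrustful (AIT <= t)."""
    results = {}
    for t in thresholds:
        trustful = [s for a, s in zip(ait_scores, sdca_scores) if a > t]
        distrustful = [s for a, s in zip(ait_scores, sdca_scores) if a <= t]
        if trustful and distrustful:  # skip thresholds that leave one group empty
            results[t] = cliffs_delta(trustful, distrustful)
    return results
```

A positive δ at every threshold would correspond to the reported pattern of trustful respondents giving higher SDCA scores regardless of where the cutoff is placed.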
3.12. Multivariate Regression Results
Table 15 presents the results of OLS regression models controlling for age, gender, and education. After demographic adjustment, the Panama country effect remained significant for DBA (
;
), AIDMT (
;
), and DSS (
;
), indicating that the observed driving aggressiveness, trust, and safety differences between Panamanian respondents and the reference group (Germany) persist beyond demographic composition. Gender emerged as a significant predictor for AIT (
;
) and AIDMT (
;
), with male respondents reporting higher trust scores. The SDCA country effect for Panama was attenuated after demographic controls (
), suggesting that some of the observed cross-cultural differences in SDC aggressiveness preferences may be partially attributable to demographic heterogeneity across samples. The persistence of DBA, AIDMT, and DSS country effects after controlling for demographics provides evidence that cross-cultural differences in driving behavior, AI driving trust, and safety behaviors reflect cultural variation rather than demographic differences alone. These results suggest that SDC aggressiveness preferences (SDCA) may be influenced by both cultural and demographic factors, while driving behavior aggressiveness (DBA), trust perceptions (AIDMT), and safety behaviors (DSS) are more strongly tied to cultural context. The modest
R² values (0.079–0.170) indicate that country and demographics explain only a small proportion of variance in scale scores; unmeasured factors such as driving experience, prior AV exposure, and personality traits likely account for additional variance.
3.13. Bonferroni-Corrected Results
Table 7 presents all 15 cross-country Mann–Whitney U tests with per-construct Bonferroni correction [
59]. Because each of the five scales measures a conceptually distinct construct, the correction is applied within each family of three country-pair comparisons (α = 0.05/3 ≈ 0.0167), rather than globally across all 15 tests. All eight tests significant at the uncorrected α = 0.05 level also survive the per-construct Bonferroni correction. Four of five scales show significant differences between both the US and Panama and between Germany and Panama (DBA, SDCA, AIDMT, and DSS); only the AIT scale does not reach significance for any country pair. No significant differences were found between the US and Germany on any scale.
3.14. Post Hoc Power Analysis
Post hoc power analysis (computed at the Bonferroni-corrected α = 0.05/3 ≈ 0.0167) revealed that 5 of 15 pairwise comparisons achieved adequate statistical power (≥0.80), with a median power of 0.711 across all tests. Tests involving the largest effect sizes (AIDMT comparisons with Panama and DSS US versus Panama) achieved a power of ≥0.85. However, comparisons involving small effect sizes between the US and Germany were underpowered (power < 0.26 in all cases), which should be considered when interpreting non-significant US–Germany differences.
5. Conclusions
Regarding RQ1, drivers internationally prefer SDC behavior that is more conservative than their own, though this preference reached significance only in Germany and Panama, not in the US. Regarding RQ2, AI trust level is consistently associated with the DBA–SDCA gap across all tested thresholds and in continuous analysis, with trustful individuals showing DBA–SDCA equivalence (TOST ). Regarding RQ3, significant cross-cultural differences exist on four out of five scales (DBA, SDCA, AIDMT, and DSS) between Panama and both Western countries but not between the US and Germany; notably, general AI trust (AIT) did not differ significantly for any country pair.
At an international level, comparing the three countries’ survey results combined, this analysis found that drivers with higher trust in AI technologies (based on their AIT scores) showed DBA–SDCA equivalence (TOST ), consistent with acceptance of a driving style comparable to their own, while drivers who were distrustful of AI technologies preferred a self-driving car that was more conservative than their own driving style. When comparing individual countries, further observations emerge. Notably, Panamanian respondents had the lowest average SDCA score, indicating a preference for a more conservative SDC experience compared with US respondents (), with a significant difference also observed relative to German respondents (). When comparing Panamanian respondents to German respondents, statistical differences were found in four of the five quantitative measurements, suggesting that technology design or deployment strategy may need to be adapted for populations with trust and driving behavior profiles similar to those observed in Panama to improve social acceptability of SDCs. However, the SDCA country effect was attenuated after controlling for demographics (), suggesting that SDC aggressiveness preferences may be influenced by both cultural and demographic factors.
Future work could include expanding the sample size and increasing the number of nations surveyed to better understand the global needs of SDC technology. As autonomous vehicles continue to be deployed [
71] and mixed traffic scenarios become more common [
72], public trust is likely to evolve, warranting periodic reassessment of these metrics across diverse populations.
Based on the findings of this exploratory study, we suggest the following directions for SDC manufacturers and policymakers:
1. Default driving profiles could be calibrated to regional driving mechanics trust (AIDMT) baselines and aggressiveness preferences, with more conservative defaults for markets with lower measured AIDMT scores.
2. Adaptive driving systems may benefit from incorporating real-time trust assessment to personalize the driving experience.
3. Since US and German respondents trusted specific driving mechanics (AIDMT) more than general AI scenarios (AIT), trust-building efforts for these markets may benefit from addressing the specific low-trust AIT scenarios (e.g., construction zone navigation and autonomous operation without a human present) rather than relying solely on demonstrations of driving mechanics competence as an entry point for broader trust formation.
4. Given the significant cross-cultural differences observed in four out of five scales despite similar general AI trust levels, cross-cultural validation of trust metrics should be considered as a component of international SDC deployment.
5. For populations exhibiting significantly lower driving mechanics trust (AIDMT) and aggressiveness scores (such as Panamanian respondents in this study), longitudinal research should investigate how trust evolves with exposure to autonomous vehicle technology and whether incremental introduction of autonomous features can build trust over time.