Where We Come from and Where We Are Going: A Systematic Review of Human Factors Research in Driving Automation

Abstract: During the last decade, research has brought forth a large number of studies that investigated driving automation from a human factors perspective. Due to the multitude of possibilities for the study design with regard to the investigated constructs, data collection methods, and evaluated parameters, at present, the pool of findings is heterogeneous and nontransparent. This literature review applied a structured approach, where five reviewers investigated n = 161 scientific papers of relevant journals and conferences focusing on driving automation between 2010 and 2018. The aim was to present an overview of the status quo of existing methodological approaches and investigated constructs to help scientists in conducting research with established methods and advanced study setups. Results show that most studies focused on safety aspects, followed by trust and acceptance, which were mainly collected through self-report measures. Driving/Take-Over performance also marked a significant portion of the published papers; however, a wide range of different parameters were investigated by researchers. Based on our insights, we propose a set of recommendations for future studies. Amongst others, this includes validation of existing results on real roads, studying long-term effects on trust and acceptance (and of course other constructs), or triangulation of self-reported and behavioral data. We furthermore emphasize the need to establish a standardized set of parameters for recurring use cases to increase comparability. To ensure a holistic consideration of automated driving, we moreover encourage researchers to investigate other constructs that go beyond safety.


Introduction
The advent of automated driving (AD) systems and Human-Machine Interfaces (HMI) marks one of the biggest game changers in transportation research and development of our time. In 2013, the Society of Automotive Engineers [1] published the first version of their definition describing different levels of driving automation, which addresses challenges, sets foundation for future standardization, and establishes a common language. With SAE Level 2 (L2) driving automation already on the road, it is only a matter of time until SAE Level 3 (L3) AD systems (ADS), or even higher levels, become commercially available. Thereby, the enormous potential of technological progress has sparked enthusiasm among human factors and Human-Computer Interaction researchers to develop user interfaces for this novel technology. Investigations of such interfaces through user studies are then conducted to first determine feasibility, and, in the next step, to fine-tune conceptual approaches. Here, basic research findings from engineering psychology, and human factors on one side, as well as computer science on the other, have been applied by automotive industry and academia. Furthermore, some lessons learned from prior automation development in the aviation sector [2][3][4] could be transferred. However, automated driving systems imply different preconditions which makes the situation much more complex. First, the driving environment is highly time-critical, and thus interventions must happen in seconds or even fractions of a second, while, in airplanes, pilots usually have more time to respond to critical events. Second, there exists greater variety among the targeted user groups. 
Automated vehicles are consumer products, and driver-passengers will have different levels of training and technology experience, acceptance of and trust in automation, etc., and further come with additional goals (such as "driving fun" or the desire to engage in non-driving related activities) that are less relevant, or not even present, in classical operator settings. As a result, there is a wide range of challenges that need to be overcome, and a multitude of papers addressing these timely issues have been published over the last years. However, it is often hard to integrate and/or compare the obtained findings, as topics are often investigated differently. Furthermore, due to the sheer number of results, it is hard for researchers to identify gaps that they could build upon.
Thus, we claim that the time has come to systematically review which topics in driving automation received the most attention, and which methods have been applied to investigate these. Therefore, the present work aims to combine these approaches and to provide researchers and practitioners with an overview of current and past topics of AD. It also points towards improvements in the future, and unveils directions that have yet to be investigated. Moreover, we plan to give an impetus towards a standardization of methodology concerning self-report, behavioral, and physiological measures, as well as appropriate approaches to triangulating them. The main contribution of this work is twofold, concerning the status quo and derivation of future research directions:
• Providing an overview of emerging possibilities for study design in driving automation research
• Outlining which constructs have been investigated, which data collection methods have been applied, and which specific parameters have been calculated and reported
To reach this aim, we developed and followed a structured approach for reviewing related literature. First, we summarize important topics in driving automation that have been addressed in the last years, followed by a precise description of inclusion/exclusion criteria for publications used in this literature review. We then outline a procedure for creating a database in which relevant contributions can be stored. Finally, descriptive results of queries on the database are presented. The results of this literature review are expected to add to a better understanding of current trends and research directions of AD. Hence, it holds a mirror to this community on what has been accomplished and which future aspects need more attention (see Figure 1).
It is important to note here that the present paper does explicitly not target one specific area of research in automated driving. On the contrary, it serves the purpose of getting an overview of constructs on a higher level. From this overview, the work provides an in-depth summary of methodological approaches and measures within each of the constructs. This combination of high level comparison between research approaches (and their relative importance until now) as well as the insights into methodological approaches within each marks the novel contribution of this work. We acknowledge the fact that there is a large variety of research questions combined here. However, this work marks a first step to unveil trends in research on automated driving research. A more constrained work that focuses more specifically on one particular construct might be future research but is deliberately not part of the present work. Such work can also take into account the variety of research questions. Eventually, this work aims to come up with recommendations for empirical user studies on automated driving. From the reasoning above, we are already aware that each recommendation might not be suitable for largely different research questions. However, the aim is to derive recommendations in a general manner so that researchers can adapt these to the specific research purpose. Moreover, these recommendations are not mandatory in a sense that we mean them to be forced upon researchers. In our opinion, the recommendations from such a profound database are meant for researchers to include these in their initial contemplation of study design.

Research Status on Automated Vehicle HMIs
The HMI on the one hand shows information of the task (i.e., driving) to the user and on the other hand offers a possibility to provide input from the user to the driving automation system. User studies on automated vehicle (AV) HMIs have focused on different constructs, applied different collection methods, and also evaluated different parameters. In addition, different study design approaches are possible, which in turn depend on the respective construct and collection measure. However, there is no common agreed-upon methodological framework for evaluating AVs (and potential in-vehicle HMIs). Consortia reports are possible sources that researchers could consult [5][6][7]. Moreover, there have been first efforts to give an overview of methodological approaches in human-computer interaction research on user experience in general [8], the evaluation of in-vehicle information systems [9], on the evolution from manual to automated driving [10], and also with a focus on AD [11], however, without reviewing a broader set of categories.

Topics of Interest
Early research efforts in AD have often focused on Take-Over Request (TOR) scenarios, where the system exceeds its operational design domain or encounters an emergency, and prompts the driver to interrupt his/her non-driving related task (NDRT) to regain manual vehicle control. These transitions are either due to sensor failures/malfunctions (imminent transition) or because the system issues a planned indication to take over; in both cases, it is the user's role to ensure a safe transition to manual driving. These studies have revealed different issues such as controllability [12][13][14], fatigue [15,16], mode awareness [17], or automation trust [18][19][20][21][22]. Other studies frequently applied survey approaches to investigate public acceptance and readiness for the introduction of this technology to the consumer market [23][24][25]. The downside of these acceptance-related studies is that, mostly, no realistic AD system is provided to the users. At best, a description of such a system is given that requires a lot of imaginative power. A construct closely linked to acceptance is trust in automation. Here, there have been driving simulator studies that supported realistic ADS representations and HMIs [18,20,26,27,28]. Other topics that have recently emerged go beyond safety-related issues, such as usability [29,30] and user experience (UX) [31][32][33] of AD systems and HMIs. In the usability domain, research questions mainly focus on the measures and appropriate conditions under which users effectively and efficiently interact with driving automation and in turn are satisfied with the interface [34]. This also marks an additional factor in common acceptance frameworks [35,36]. User Experience [37] expands usability, going beyond mere pragmatic aspects of using driving automation by adding hedonic qualities.
With drivers being relieved of driving themselves, there might be a lack of fulfillment of needs [32,38,39,40], and, consequently, despite effectiveness and efficiency in interaction, positive emotions, and thus positive attitudes towards driving automation, are not guaranteed. UX research therefore aims at investigating underlying mechanisms needed to substitute former driving experiences with other, potentially meaningful activities, and appropriate user interfaces which carve out advantages and balance negative effects of automation.

Possibilities for Study Design
Besides different topics of interest in AD HCI research, and thus constructs, collection methods, and parameters, which are all closely tied to dependent measures, there is also a variety of study design possibilities when conducting user studies on AD. One important aspect is the study environment, which is a driving simulator in many instances. Here, the degree of immersion varies from low to medium fidelity simulation [41] to high fidelity driving simulation with [28,42] or without [13,43] a motion-based platform. Moreover, depending on the availability of a (real or simulated) automation function, studies on test tracks and real roads are possible [44,45]. Other types of studies use an interview or survey setting, either to gain insights into AD from users who have used such a system before [46,47] or to target the readiness of the consumer population [25,31]. Another aspect providing researchers with possibilities for study design is the representation of the automation. At the moment, most user studies in the AD context are set up in a simulation environment, since the technology is not yet mature. A vast majority of studies have been conducted in driving simulators, where AV functions and HMIs can be implemented without much effort, and tests in a risk-free and standardized environment are possible. However, first on-road tests have been conducted in Wizard-of-Oz [48,49], or even real settings [50,51]. As mentioned before, there are studies that are not placed in a lab environment, but rather take a survey approach and represent the driving automation by means of static descriptions [25] or sketches [31]. Studies also differ in the type of research or main focus and contribution they bear. Some studies target conceptual development with a subsequent proof-of-concept user study [52][53][54]. Other approaches more generically cover basic research topics, and focus on fundamentals of human perception and action in the AD context.
Such research is rather independent from specific HMI concepts, but implications hold true for the variety of conceptual approaches [55,56]. A similar independence from certain HMI concepts is a characteristic of methodological work. Such studies aim at providing instruments and tools for proper study conduction when evaluating HMIs [29,30,57].
The heterogeneity of constructs, use cases, and AD operationalization has recently led to first efforts in methodological standardization. For example, various taxonomies [58][59][60] have been proposed to describe TOR scenarios and related use cases [61]. In addition, regarding TOR, many research groups follow their own measurement procedures and evaluation methods, which highlights the need for commonly accepted standards.

Method
With the considerations about the current research status, topics of interest in AD HCI research, study possibilities, and the authors' experience in AD research in mind, we defined the reviewing process for this literature review.

Venue Selection Process
We conducted a pilot study to identify journals and conferences in the Human Factors community that have published relevant and representative work on driving automation. An online survey was distributed via social media (e.g., Twitter, LinkedIn, etc.) and to peers of the authors. In the survey, participants could indicate (1) both the top 3 journals and conferences where they have already published as well as (2) both the top 3 journals and conferences where they consider submitting an article (favored). Moreover, the survey included questions on whether the authors have already published original research on automated driving (yes vs. no) and the year of the author's first publication. Eventually, demographic data (i.e., age, gender, academic degree, and academic background) were collected.
Demographics showed that mean age of the n = 21 participants (n = 5 female) was 32.81 years (SD = 5.65, MIN = 24, MAX = 49). Most participants (n = 10) held a Master's degree, n = 9 a PhD, and n = 2 were professors. The Academic Background showed that the majority were psychologists (n = 8), followed by engineers (n = 5), computer scientists (n = 5). n = 2 participants had a Human Factors or Media Informatics background (multiple choice was allowed). Out of the 21 participants, n = 19 have already published whereas n = 2 have not yet published their research. The earliest publication dates back to 1999 and the latest to 2016.
For the identification of relevant venues, we counted the overall number of instances independently from its position (i.e., 1st vs. 2nd vs. 3rd rank). The results regarding journals showed that Transportation Research Part F (n = 7 publications, n = 11 favored), the Journal of Human Factors (n = 7 publications, n = 11 favored) and Accident Analysis and Prevention (n = 7 publications, n = 6 favored) were the most frequently indicated venues. Regarding conferences, the AutomotiveUI (n = 6 publications, n = 13 favored), the Human Factors and Ergonomics Society Annual Meeting (n = 2 publications, n = 11 favored), and the Conference on Human Factors in Computing Systems (CHI; n = 2 publications, n = 6 favored) were mentioned most frequently. Based on this expert survey, we selected these three journals and conferences to be included in our structured literature review.
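The counting rule described above (each mention counts once, regardless of whether the venue was ranked 1st, 2nd, or 3rd) can be sketched in a few lines of Python. This is only an illustration: the function name `count_mentions` and the venue labels in `responses` are hypothetical placeholders, not the actual survey data.

```python
from collections import Counter


def count_mentions(responses):
    """Count how often each venue is named, ignoring its rank position."""
    counts = Counter()
    for ranked_venues in responses:
        # dict.fromkeys removes duplicates within one answer while keeping order,
        # so a respondent naming the same venue twice is counted only once.
        for venue in dict.fromkeys(ranked_venues):
            counts[venue] += 1
    return counts


# Hypothetical answers: each list is one respondent's top 3 in rank order.
responses = [
    ["Venue A", "Venue B", "Venue C"],
    ["Venue B", "Venue A", "Venue D"],
    ["Venue C", "Venue A", "Venue B"],
]

counts = count_mentions(responses)
# The three most frequently named venues, independent of rank.
top_three = [venue for venue, _ in counts.most_common(3)]
```

Running the same tally separately for the "already published" and "favored" answers would yield the two counts reported per venue.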

Paper Selection Process
The basis for the selection of papers for the present literature analysis was all research papers published in the respective venues from 2010 through 2018 (inclusive). We developed a decision tree to decide, in a standardized and step-wise manner, whether or not to include each paper. This decision tree is depicted in Figure 2. It features four steps represented through binary decisions, each of which has to be answered with "yes" for a paper to be included in our analysis. To pass the first step of the decision tree, the paper had to contain at least one of a set of keywords related to driving automation in the full text (see below). These keywords were selected to initially reduce the number of papers in a reasonable way, while at the same time ensuring that no potentially relevant paper would be excluded. The first step of the decision tree was carried out by querying the respective online databases using the following search terms:
"automated driving" OR "autonomous driving" OR "self-driving" OR "self driving" OR "autonomous vehicle" OR "automated vehicle"
Papers that did not feature AD represented by at least one of the keywords, as well as short papers, posters, and adjunct proceedings [62], were excluded in this step. For the remaining papers, the next step in the selection process consisted of examining whether the paper's objective was an empirical study, in order to discard other literature reviews as well as juridical, theoretical, or ethical papers [11,63,64]. The subsequent step of the decision tree addressed the primary focus of the empirical papers. If this was not research and development of AD, the respective paper was excluded from further analysis. This step was incorporated to rule out work on concepts that, in principle, could be used for the development of AD, but were not originally investigated with that purpose [65].
In the last step of the selection process, we took a closer look at the levels of automation [1] that were investigated. The level of mere driver assistance (Level 1) as well as concepts which do not count as driving automation according to the SAE definition [66,67] were out of scope for this literature analysis. Thus, only papers examining L2 and/or higher levels of automation (i.e., simultaneous lateral and longitudinal vehicle control) were included. Overall, n = 161 research papers passed all steps of the decision tree and were included in the present work. To ensure inter-rater reliability, the reviewers' decisions on a random selection of 10 papers were compared by means of intra-class correlations (ICC). ICC estimates were calculated based on a single rating regarding the inclusion criteria (outcome of the decision tree to find out whether a paper should be included for further analysis), using a two-way random-effects model. Results revealed a high inter-rater reliability with a correlation of r = 0.809 (F(9,36) = 22.170, p < 0.001).
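The four binary steps of the decision tree can be sketched as a simple filter. The following Python example is a minimal illustration under stated assumptions: the `Paper` fields and the function names are hypothetical, the three example papers are invented, and in the actual review steps two to four were judged manually by the reviewers rather than read from stored flags.

```python
from dataclasses import dataclass
from typing import Optional

# Search terms of step 1, as listed in the text.
KEYWORDS = (
    "automated driving", "autonomous driving", "self-driving",
    "self driving", "autonomous vehicle", "automated vehicle",
)


@dataclass
class Paper:
    full_text: str
    is_full_paper: bool        # False for short papers, posters, adjunct proceedings
    is_empirical_study: bool   # False for reviews, juridical/theoretical/ethical papers
    primary_focus_is_ad: bool  # research and development of AD is the primary focus
    highest_sae_level: int     # highest SAE level examined in the paper


def first_failed_step(paper: Paper) -> Optional[int]:
    """Return the number of the first failed step, or None if all four pass."""
    text = paper.full_text.lower()
    checks = (
        paper.is_full_paper and any(keyword in text for keyword in KEYWORDS),
        paper.is_empirical_study,
        paper.primary_focus_is_ad,
        paper.highest_sae_level >= 2,  # Level 1 and below are out of scope
    )
    for step, passed in enumerate(checks, start=1):
        if not passed:
            return step
    return None


def is_included(paper: Paper) -> bool:
    return first_failed_step(paper) is None


# Invented examples for illustration only.
candidate = Paper("A simulator study on take-over requests in automated driving.",
                  True, True, True, 3)
level_one = Paper("Adaptive cruise control as a step towards automated vehicle use.",
                  True, True, True, 1)
review = Paper("A literature review on autonomous driving.", True, False, True, 3)
```

Here `candidate` passes all steps, `level_one` fails at step 4, and `review` fails at step 2.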

Paper Reviewing Strategy
After selecting the papers, we developed a reviewing strategy for literature analysis in two expert workshops. The first workshop lasted six hours and aimed at developing a standardized reviewing procedure. After the workshop, the resulting categories/dimensions as well as their emerging relations were translated into a database. For a detailed description of the categories and the database, see Database Structure. Subsequently, five reviewers classified the selected 161 research papers by sorting them into the categories of the database. In case a new category occurred in the papers that had not been considered before, it was added to the database. After reviewing a subset of the 161 papers, we conducted another expert workshop which lasted approximately four hours. During this workshop, lessons learned from a first subset of publications (n = 42) were derived and each reviewer could put unclear classification up for discussion. This ensures potentially high reviewer agreement in classification of the remaining papers (ambiguities in the subset have been resolved during the workshop).

Database Structure
We set up an MS Access database to capture the relevant information needed for our investigations. The schema consists of the five main tables Paper, Conference, Construct, CollectionMethod, and Parameter. In the following, we introduce the most relevant properties:
• Paper: For each paper, we collected descriptive information (title, abstract, year, authors, conference), as well as the levels of automation addressed in the study. The following additional information was collected: the type of user (driver, passenger, external), road type (urban, highway, rural, not relevant), study type (lab, test track, real road, survey), the representation of the AV (static text description, sketch, driving simulator, Wizard-of-Oz, real vehicle), study period (single session, short-term, long-term), type of research (basic research, concept evaluation, method development, model development), as well as participant information, such as the number of subjects, their mean age, and whether they were internally (students, employees, etc.) or externally recruited.
• Construct: Represents the topics of investigation. To avoid subjective interpretation by the reviewers, we only collected constructs which were explicitly mentioned by authors in the papers (such as Safety, Trust, Acceptance, etc.). All constructs which were investigated by only a single paper are summarized within an Other construct. Generic investigations on participants' opinion and general perception without directly mentioning specific constructs are summarized with a General Attitude construct.
• Collection Method: Relevant data collection methods, such as driving performance, TOR performance, secondary task performance, ECG, EEG, standardized questionnaire, interview, etc. Secondary task performance in particular refers to a participant's performance in a task not related to driving (e.g., number of missed target stimuli on a tablet). Again, we came up with initial suggestions that were expanded in case a new item emerged during the reviewing process.
• Parameter: Parameters that were used in the different data collection methods, for example standard deviation of lateral position (SDLP) [68], response time (could be used to measure driving performance), gaze-off-road time, gaze standard deviation (eye-tracking), technology acceptance model [35], NASA-TLX [69] (examples for standardized questionnaires), etc.
• Relationships: To structure our data, we created a relationship table to represent (n:m) relations of papers, collection methods, parameters, and constructs. Thus, each paper can investigate different constructs, where each construct can be assessed by one (or multiple) data collection methods, and each data collection method by one (or multiple) parameters. For each relation, we categorized whether the combination represents behavioral, self-reported, or physiological data. Furthermore, we assessed if the parameter was measured before/during/after a trial in the experiment.
This data model allowed us to store all information without duplicates (each combination of paper/construct/collection method/parameter was stored only once using database key constraints), while the relations allow for performing powerful queries on the data (in comparison to pure list/sheet based representations). For example, we can ask the database which papers investigated a certain construct using physiological sensors, how many papers utilized a certain standardized scale, or how constructs changed over the years.
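To make the data model more tangible, the following sketch rebuilds a simplified version of it with Python's built-in `sqlite3` module. This is an assumption-laden stand-in and not the authors' MS Access schema: the table and column names are hypothetical, the Conference table and most Paper attributes are omitted, and the inserted rows (including the "Heart rate" parameter) are invented examples.

```python
import sqlite3

SCHEMA = """
CREATE TABLE paper (id INTEGER PRIMARY KEY, title TEXT NOT NULL, year INTEGER);
CREATE TABLE construct (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE collection_method (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE parameter (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
-- n:m relationship table; the composite key stores each combination only once.
CREATE TABLE relation (
    paper_id INTEGER NOT NULL REFERENCES paper(id),
    construct_id INTEGER NOT NULL REFERENCES construct(id),
    method_id INTEGER NOT NULL REFERENCES collection_method(id),
    parameter_id INTEGER NOT NULL REFERENCES parameter(id),
    data_type TEXT NOT NULL
        CHECK (data_type IN ('behavioral', 'self-reported', 'physiological')),
    timing TEXT NOT NULL CHECK (timing IN ('before', 'during', 'after')),
    PRIMARY KEY (paper_id, construct_id, method_id, parameter_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Invented example rows for illustration only.
conn.executemany("INSERT INTO paper VALUES (?, ?, ?)",
                 [(1, "Hypothetical paper A", 2016), (2, "Hypothetical paper B", 2018)])
conn.executemany("INSERT INTO construct VALUES (?, ?)", [(1, "Workload"), (2, "Trust")])
conn.executemany("INSERT INTO collection_method VALUES (?, ?)",
                 [(1, "ECG"), (2, "Standardized questionnaire")])
conn.executemany("INSERT INTO parameter VALUES (?, ?)",
                 [(1, "Heart rate"), (2, "NASA-TLX")])
conn.executemany("INSERT INTO relation VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 1, "physiological", "during"),
    (1, 1, 2, 2, "self-reported", "after"),
    (2, 1, 2, 2, "self-reported", "after"),
])

# The key constraint rejects a repeated paper/construct/method/parameter combination.
try:
    conn.execute("INSERT INTO relation VALUES (1, 1, 1, 1, 'physiological', 'during')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Example query: which papers investigated a construct using physiological data?
rows = conn.execute(
    """
    SELECT DISTINCT p.title
    FROM relation r
    JOIN paper p ON p.id = r.paper_id
    JOIN construct c ON c.id = r.construct_id
    WHERE c.name = ? AND r.data_type = 'physiological'
    ORDER BY p.title
    """,
    ("Workload",),
).fetchall()
titles = [title for (title,) in rows]
```

In this toy data set, only the first paper is returned by the query, and the repeated insert is refused by the composite primary key.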

Results
In the following, we report the results obtained from the final selection of 161 papers. All selected conferences and journals (see above) are represented in the final analysis: Human Factors and Ergonomics Society Annual Meeting (n = 34), AutomotiveUI (n = 42), CHI (n = 10), Accident Analysis and Prevention (n = 18), Human Factors (n = 20), and Transportation Research Part F (n = 37). All results were obtained using the built-in structured query language (SQL) of MS Access.

General Study Details
Regarding AD research, we found that SAE L3 is the most frequently studied level of automation with 58.39% (n = 94) followed by Level 2 (36.65%, n = 59) and Level 4 (22.36%, n = 36). Level 5 was investigated in 19.88% (n = 32) of the studies. Thereby, the number of publications gradually increased up to 2018 regardless of a specific level of automation. The one that stands out is SAE L3, which attracted earlier attention than the other levels (see Figure 3); however, the steepest increase in AD research was observed between 2015 and 2016.  Overall, 73.29% (n = 118) of all studies were conducted in a lab environment, 13.66% (n = 22) as a survey, and 11.80% (n = 19) on real road, while only 2.48% (n = 4) reported results obtained on a test track. This is in accordance with the utilized AV representation. In addition, 71.43% (n = 115) of the papers reported to have used a driving simulator, 12.42% (n = 20) a real vehicle, 11.18% (n = 18) a textual description, 7.45% (n = 12) a Wizard-of-Oz setup, and 3.11% (n = 5) a visualization. We observed thereby a clear tendency towards single session evaluation (94.41%, n = 152), and only nine papers called in participants multiple times, where 3.11% (n = 5) investigated automated driving use over a short time period (e.g., up to one week) [70] and three studies (1.86%) long-term effects in longitudinal studies (e.g., by following a survey approach [47]). While 60.25% (n = 97) of the papers conducted basic research such as observing pedestrian interaction behavior with AVs on real roads [71], 34.78% (n = 56) papers evaluated a specific concept in their study (e.g., a haptic seat to prepare driver for TORs) [72]. Smaller percentages of studies conducted empirical research about AD with the aim to create a method (8.07%, n = 13) or a model (1.86%, n = 3). The center of empirical research on AD is clearly the driver, who was investigated in 78.26% (n = 126) of all studies. 
Passengers (9.32%, n = 15) and other road users (6.83%, n = 11, e.g., pedestrians) are still a side topic in AD HCI research. The environmental driving context is varying, while 47.83% (n = 77) investigated a highway setting, 19.88% (n = 32) addressed urban, and 19.25% (n = 31) rural road conditions. For the remaining studies (13.04%, n = 21), the road type was either not relevant or not described.
Study participants' mean age in most papers is below 40 years (n = 98), and only a few papers report higher mean ages, such as Frison et al. [73], who specifically invited participants older than 65 to investigate acceptance of AVs in comparison with younger participants (see Figure 4). In 12.42% (n = 20) of the papers, participants' age is not reported.
In total, 57.76% (n = 93) of the papers selected their participants randomly, while only 29.81% (n = 48) used targeted sampling. The rest (n = 20) did not provide any indication about participant sampling. Thereby, 35.40% (n = 57) of papers recruited participants internally (e.g., students or employees of their institution), and 44.10% (n = 71) invited external participants. The remaining publications (20.50%) did not provide this information.
Over half of the papers triangulated different types of data (55.28%, n = 89). Thereby, 44.72% (n = 72) of the papers triangulated behavioral and self-reported data. Less common is the triangulation of self-reported, behavioral and psycho-physiological data (6.21%, n = 10). The combination of behavioral and psycho-physiological data (n = 5), or self-reported and psycho-physiological data (n = 2) is even more rarely applied. A large portion of the papers (44.72%, n = 72) is working with only one type of data. These papers use mainly self-report measures (27.95%, n = 45), while 16.77% (n = 27) of the papers report exclusively behavioral data.
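The shares reported in this paragraph follow directly from the paper counts and the total of 161 papers. The short Python sketch below re-derives them as a consistency check; the counts are taken from the text, while the dictionary keys and the helper `share` are illustrative names only.

```python
TOTAL_PAPERS = 161

# Paper counts as reported in the text, grouped by combination of data types.
triangulated = {
    "behavioral + self-reported": 72,
    "self-reported + behavioral + psycho-physiological": 10,
    "behavioral + psycho-physiological": 5,
    "self-reported + psycho-physiological": 2,
}
single_type = {
    "self-reported only": 45,
    "behavioral only": 27,
}


def share(n, total=TOTAL_PAPERS):
    """Percentage of all reviewed papers, rounded to two decimals."""
    return round(100 * n / total, 2)


n_triangulated = sum(triangulated.values())  # papers combining data types
n_single = sum(single_type.values())         # papers with one data type only

share_triangulated = share(n_triangulated)
share_single = share(n_single)
```

The two groups add up to the full sample (89 + 72 = 161), which matches the reported 55.28% and 44.72%.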

Constructs and Methodological Approaches
To specify empirical research on AD in more detail, we took a closer look at the constructs which were investigated in the individual studies. The most frequently investigated construct is safety, followed by trust, acceptance, and workload; see Table 1 for all constructs, the respective number of papers, and exemplary references. We observed a broad range of 36 distinct constructs that were each addressed only once. We summarize them here as Other (e.g., Personalization; n = 36, 22.36% in Table 1).
In the following paragraphs, individual constructs, ordered according to how frequently they were investigated, are described by elaborating on the applied collection methods (n, number of distinct papers) and collected parameters (n_p, total number of parameters). One distinct paper can investigate more than one parameter.

Safety
Parameters (p) selected for the AD studies on safety mainly include behavioral data (only n_p = 22 parameters were collected as self-reported data; n_p represents the number of parameters). The most frequently applied collection method for safety is the measurement of participants' TOR performance, which is applied by n = 58 distinct papers. Thereby, the most frequently collected parameter is participants' reaction time, which includes the time to the first driving action like braking or accelerating, to system deactivation, to a button press, or to hands on the steering wheel. The lateral position is another frequently utilized parameter, including maximal lateral position, standard deviation of lateral position, or Daimler Lane Change Performance [89]. Furthermore, Time to Collision (TTC), speed, TOR timings, and acceleration and braking parameters were also repeatedly collected. Driving performance (which, in contrast to TOR performance that assesses the immediate response, is calculated based on longer phases of manual driving) is also often used (n = 24), e.g., by collecting data on participants' lateral position, speed, and reaction times. Eye-tracking (n = 12), which considers gaze percentage, duration, and number of gazes on areas of interest such as mirrors or the road; observation (n = 12) of participants' crossing behavior, NDRT engagement parameters, etc.; and self-defined questionnaires (n = 11) are further important collection methods for safety. Standardized questionnaires like the scale for criticality assessment of driving and traffic scenarios [90], secondary task performance, and interviews are used less often (n ≤ 3); see Table 2 for more details.
In total, n_p = 89 different parameters were collected across all papers investigating safety; 83% of the studies were conducted in a driving simulator, 11% in a real vehicle, 5% used static text, 4% a Wizard-of-Oz setup, and 2% a sketch.
Trust in AD is the second most frequently investigated construct (see Table 1). Here, in contrast to safety, more self-reported data (n_p = 42) are reported. Most common is the use of a self-defined questionnaire (n = 19) or a standardized questionnaire; the Automation Trust Scale (ATS) [91] is especially popular (n = 12). The interpersonal trust scale (ITS) [92], trust in technology scale [93], Van der Laan scale [94], the Trust Perception Scale-HRI [95], and the Propensity to Trust Scale [96] are each used only once. Interviews, structured and semi-structured, are used more rarely (n = 4). However, behavioral parameters (n_p = 19) are also collected, by observing (n = 5) body pose and movements, acceleration and braking behavior, gaze duration on an area of interest, etc. In addition, eye-tracking (n = 4) is conducted. Only one paper [97] used driving performance (braking and steering behavior) and TOR performance (reaction time) measures. This paper additionally collected participants' hands-on-wheel and eyes-on-road time using observation; see Table 3. In total, n_p = 28 different parameters were collected across all papers investigating trust; 78% of the studies were conducted in a driving simulator, 16% in a real vehicle, 11% used a textual description, and 5% a Wizard-of-Oz setup to represent the AV.
Acceptance is investigated by self-reported parameters as intensively as trust (n p = 41). Researchers mostly apply standardized questionnaires (n = 18), of which the Van der Laan acceptance scale [94] is the most frequently used. Furthermore, the technology acceptance model (TAM) [35], the Unified Theory of Acceptance and Use of Technology (UTAUT) [36], and the Car Technology Acceptance Model (CTAM) [24] are frequently used approaches. The self-defined questionnaire is also a popular collection method here (n = 13). Behavioral data (n p = 3) on acceptance are collected by observing (n = 3) the time to system activation and the number/share of times the automation was enabled/disabled by study participants. Driving performance and qualitative methods like interviews or focus groups are applied less frequently (n <= 2); see Table 4.
In total, n p = 15 different parameters were collected across all papers investigating acceptance; 53% of the studies were conducted in a driving simulator, 29% used only static text, 15% used a real vehicle, and 3% each used a sketch or a Wizard-of-Oz setup.
Workload is also investigated mainly by using self-reported measures (n p = 26), usually by implementing standardized questionnaires. The NASA Task Load Index (NASA-TLX, n = 17) [69] is the most popular; other questionnaires like the Driver Activity Load Index (DALI) [98], the Rating Scale Mental Effort (RSME) [99], the scale for subjectively experienced effort (SEA) [100], or the global mental workload measurement by Wierwille and Casali [101] are applied more rarely, as are self-defined questionnaires (n = 4) and semi-structured interviews (n = 1). Behavioral data (n p = 7) were collected via secondary task performance (i.e., NDRT performance such as the number of solved tasks, using the Surrogate Reference Task or the Twenty Question Task), observation, eye-tracking, and driving performance measures; see Table 5.
In total, n p = 12 different parameters were collected across all papers investigating workload; 88% of the studies were conducted in a driving simulator, only 13% in a real vehicle, and 6% used a Wizard-of-Oz setup.

General Attitude, Situation Awareness, and Stress
The majority of papers investigate participants' general attitude towards AD through self-reported data (n p = 22), collected with a self-developed questionnaire (n = 15) or interviews (n = 5). One paper derived insights based on observation (see Table 6).
In total, n p = 3 different parameters were collected across all papers investigating general attitude; 40% used a driving simulator, 30% static text, 15% a real vehicle, and 15% a Wizard-of-Oz setup.
Situation Awareness, in contrast, is examined more by behavioral (n p = 26) than by self-reported data (n p = 14). However, almost as many papers apply a self-developed questionnaire (n = 6) as use eye-tracking (n = 7). With eye-tracking, gaze duration, number, and percentage are collected, as well as glancing and blinking behavior. In addition, TOR performance measures (n = 2) such as lateral position, reaction time, acceleration, and time to collision parameters were collected. Popular standardized questionnaires are the situational awareness rating technique (SART) and the situation awareness global assessment technique (SAGAT) [102]. See Table 7.
In total, n p = 25 different parameters were collected across all papers investigating situation awareness; 67% of the studies used a driving simulator, 28% a real vehicle, and 17% a Wizard-of-Oz setup.
Table 7. Situation Awareness collection methods and parameters.
Stress is investigated by psycho-physiological (n p = 11) and self-rated data (n p = 8). Standardized questionnaires like the Short Stress State Questionnaire (SSSQ) [103], Dundee Stress State Questionnaire (DSSQ) [104], or the Driver Stress Inventory (DSI) [105] are used, as are heart rate variability, GSR, and EMG. Observation, interviews, and driving performance measures were only used by single papers. See Table 8.
In total, n p = 17 different parameters were collected across all papers investigating stress; 80% of the studies were performed in a driving simulator, 20% in a real vehicle, and 7% each used static text or a Wizard-of-Oz setup.
Participants' interaction behavior with AV/ADS was mainly investigated through behavioral data (n p = 11), obtained by observing (n = 6), for example, pedestrians' walking behavior or the number/share of times the automation was activated/deactivated, or by eye-tracking analysis (percentage of gaze on different AOIs). However, standardized questionnaires (n = 3) like the Pedestrian Behavior Questionnaire (PBQ) [106], Brief Sensation Seeking Scale (BSSS-8) [107], or the theory of planned behavior (TPB) [108] were also used. Two papers developed their own questionnaire, and one paper each applied eye-tracking, conducted an interview, or collected data on secondary task performance; see Table 9.
In total, n p = 13 different parameters were collected across all papers investigating interaction behavior; 45% of the studies were conducted in a driving simulator, 36% used a Wizard-of-Oz setup, and 27% a real vehicle.
Drowsiness/Fatigue was investigated by self-reported (n p = 11) and behavioral data (n p = 7). Standardized questionnaires (n = 7) like the Karolinska Sleepiness Scale (KSS) [109], Driver Stress Inventory (DSI) [110], and Multidimensional Fatigue Inventory (MFI) [111] are mainly used, as well as self-developed questionnaires (n = 2).
However, researchers also collected (n = 2) blinking behavior or yawning, as well as eye-tracking and driving performance data, or adapted the UX Curve method [112]. See Table 10.
In total, n p = 12 different parameters were collected across all papers investigating drowsiness/fatigue; 91% of the studies were conducted in a driving simulator, 9% used a real vehicle, and a further 9% used static text as AV representation.
User Experience is mainly investigated by self-reported measures (n p = 16); n p = 5 parameters are behavioral, and psycho-physiological data (Heart Rate Variability) are used only once. Various standardized questionnaires (n = 4) like AttrakDiff [113], UEQ [114], etc. are applied, as are interviews (n = 4) and other qualitative methods like UX Curve [112], think aloud, and sorting. Behavioral data are collected by one paper regarding participants' driving performance, e.g., acceleration, braking, speed, and lane changes. See Table 11.
In total, n p = 13 different parameters were collected across all papers investigating UX.
Researchers interested in productivity collect behavioral data (n p = 20), primarily about participants' secondary task performance (n = 6), including performance (e.g., characters per second, number of answered questions, etc.), accuracy (e.g., error rate), and engagement parameters (NDRT duration, frequency, or percentage). Single papers also observed participants, used eye-tracking, or collected driving performance measures. In addition, only one semi-structured interview was conducted; see Table 12.
In total, n p = 18 different parameters were collected across all papers investigating productivity.
In contrast, comfort was investigated by self-reported data (n p = 11), mainly by self-defined (n = 7) or standardized questionnaires (n = 2), including the Driving Style Questionnaire (MDSI), UEQ, TAM, and UTAUT.
One paper reported behavioral data about participants' acceleration; see Table 13.
In total, n p = 6 different parameters were collected across all papers investigating comfort.
Emotions are investigated by self-reported data (n p = 12) using self-defined (n = 4) and standardized questionnaires (n = 4) like PANAS, Affect Grid, Multi-Modal Stress Questionnaire (MMSQ), Affect Scale [115], etc. Furthermore, emotions are investigated by observing facial expressions, the think aloud technique, or an interview (each n = 1). See Table 14.
In total, n p = 9 different parameters were collected across all papers investigating emotions.
Usability is likewise investigated mainly by self-developed (n = 4) and standardized questionnaires (n = 3); the SUS, for example, is a popular method. Semi-structured interviews (n = 2) and the think aloud technique are also applied. Hence, solely self-reported data are collected (n p = 11); see Table 15.
In total, n p = 4 different parameters were collected across all papers investigating usability.
Cognitive processes, in contrast, are analyzed by self-reported (n p = 4), behavioral (n p = 4), and psycho-physiological data (n p = 2). Most papers developed their own questionnaires (n = 3), and one paper used a standardized questionnaire, the Driver Stress Inventory (DSI). EEG is utilized twice, while one paper each applied driving performance measures (e.g., lateral position) or a detection task method collecting reaction time and accuracy. See Table 16.
In total, n p = 6 different parameters were collected across all papers investigating cognitive processes.
Motion Sickness is mainly investigated by standardized questionnaires (n = 4), namely the Simulator Sickness Questionnaire (SSQ) [116] and the Motion Sickness Assessment Questionnaire (MSAQ) [117]. One paper developed its own questionnaire. In addition, Heart Rate Variability and the Motion Sickness Dose Value were collected; see Table 17.
In total, n p = 5 different parameters were collected across all papers investigating motion sickness.

Other Constructs
A wide range of additional, more specific constructs were investigated. Cooperation, well-being, mental models, and ethics were each addressed by only a few papers (n <= 3), and many constructs were addressed by only a single paper (n = 1), e.g., personalizing, intuitiveness, immersion, helpfulness, annoyance, motivation, etc. Cooperation was investigated mainly by driving performance measures (like reaction time, duration of vehicle interaction, etc.) and self-defined questionnaires, and well-being by self-defined and standardized questionnaires and observation, whereas researchers interested in mental models and ethics solely applied self-defined questionnaires. The more specific constructs summarized here as "others" were investigated in most cases by self-defined questionnaires (n = 15) or specific standardized questionnaires (n = 12).

Discussion of Findings and Recommendations for Future AD Studies
The present literature review shows that self-report measures are the most prominent measures in HMI evaluation for AD. In comparison, there is a further need to report accompanying behavioral measures, such as driving/TOR performance or gaze behavior. The reason is quite simple: users' reported attitudes frequently do not match the behavior that one observes [118]. For example, usability comprises effectiveness and efficiency (behavioral) as well as satisfaction (attitude) in order to provide a comprehensive product evaluation. Attitudes and behavior are often separate dimensions that do not match in various areas of research [11,119,120]. For a comprehensive evaluation of HMIs and driving automation features from a Human Factors perspective, it might be necessary to include both of these data sources. Of course, a researcher who is only interested in users' attitudes can collect these alone but will eventually have to face the discussion of whether and how valid the insights are. Moreover, concerning areas of research, safety aspects have received the most attention to date, followed by trust and acceptance. In the following, we discuss findings on (1) study setup, (2) data collection methods, (3) reported parameters, and (4) investigated constructs, and derive recommendations (REC) for future research in the context of automated driving.

Study Setup
Concerning representation and study type, we found that most studies were conducted using driving simulators. Hence, present research offers high internal validity, which makes it possible to interpret the effects of differing conditions on dependent measures. On the downside, the obtained results might lack external validity, and there is no guarantee that they generalize to real-world settings, since driving simulators may lack realism due to an insufficient field of view or motion feedback. Thus, without a feeling of presence [121], the transfer of behavior found in the laboratory to real-world settings might be limited [122]. Conditions in realistic driving studies are not as controlled as in a laboratory and might differ in terms of surrounding traffic, weather conditions, or vehicle speeds. Despite the restriction of high internal but limited external validity, we can assume that strong and consistent effects, such as the dependency of TOR response time on the driver state [13,15,21,123,124], will also be reflected in real-world driving. Especially when it comes to safety and trust issues in real driving, research should determine the validity of the findings. While participants in driving simulator studies might behave in a more liberal way due to the absence of realistic severe consequences, it remains to be seen how these effects transfer to the real road. At this point, the question arises as to whether the scenarios tested in driving simulator research, and their frequency within a study, are representative of human-automation interaction. Up to now, no valid statement on the frequency of transitions from automated to manual driving is possible. An indication might be available from the California disengagement report [125], where developers of ADS have to report the number of safety driver interventions. 
As the systems are still under development and have not reached a maturity level for the commercial market, it is questionable whether these numbers resemble those of future series products. Future research based on FOT or NDS data might be valuable for putting the importance of such scenarios into perspective. Relative validity of effects in driving simulator studies can well be assumed [126], but research efforts are still necessary to validate safety-relevant human-automation interaction scenarios. Despite this criticism, we acknowledge that driving simulators are still immersive research tools providing a certain and, in many cases, sufficient degree of external validity [127,128]. Nevertheless, the present results point towards the need to conduct studies in real vehicles equipped with driving automation technology [50,129,130] or in Wizard-of-Oz settings [48,49,131].
⇒ REC 1: Existing study results obtained in driving simulators should be validated in realistic on-road settings.
Review results also revealed that most studies targeted single sessions of interaction and thus provide only snapshots of first-time use. While usability studies can already make reliable estimates about user behavior within a single-session experiment [29], UX, trust, and acceptance might take a longer period of time until behavior and attitudes have reached a stable level [8,131,132]. Long-term studies tackling exactly these issues are rather scarce since they require high effort. One example of such a study comes from Beggiato et al. [88], who observed users of L1 automation (i.e., Adaptive Cruise Control) over ten repeated one-hour sessions (however, due to the restriction to SAE L1, this paper would not satisfy the inclusion criteria for this review). Such studies can provide valuable insights into acceptance, trust, and system understanding. The present database only includes a small number of publications that investigated long-term use, such as Dikmen and Burns [47]. This study, however, did not follow a longitudinal approach but rather a cross-sectional approach by surveying Tesla drivers. We expect important insights into behavioral adaptation over longer time periods, such as the amount of actual use or the NDRTs that users engage in. There is still a blind spot in research on driving automation that studies similar to field operational tests [133] or naturalistic driving studies [134] could fill once systems are commercially available. First efforts in this direction with L2 automation are reported in Gaspar and Carney [135] or planned in the L3-Pilot project [136]. Thus, there are open research directions towards ongoing and long-lasting effects of AD on use and interaction between the human operator and the ADS.
⇒ REC 2: In contrast to single session experiments, insights into the long-lasting effects of AD usage are scarce and can benefit from longitudinal study designs.
The age distribution of all samples within the database showed a bell-shaped curve (see Figure 4). This indicates that, across all identified studies, participants' ages were balanced and, overall, findings can be well generalized to the population. Both young (and therefore novice) and elderly drivers are considered in these studies. To maintain and further increase validity, we call on authors of future studies to continue to consider users' diversity regarding age and to further consider gender, cultural, and other differences in their studies.
⇒ REC 3: To increase validity, the sample characteristics should be adapted to the addressed research questions. In addition, researchers need to explain why the particular sample was chosen and considered as valid.

Data Collection Methods
Concerning collection methods, the review results show that a vast majority of studies collected self-report data. In comparison, behavioral data were only reported in every second publication. This raises the question of what the reasons for this observation are. One obvious reason is that survey approaches [31,137] might focus more on technology readiness and deliberately collect attitudinal measures only. For these approaches, to date, there is no available behavioral criterion, such as buy/usage rates. Studies operationalizing acceptance via the TAM [35] cannot provide insights into behavioral measures. As soon as the functions are available, however, research needs to investigate whether predictions made by these studies hold true. With commercial availability of L2 automation, such a study could have already been conducted, but, to our knowledge, this is still missing. One positive aspect of self-report measures is that psychometrically validated scales have been applied frequently. This shows the professionalism of the research in the reviewed venues and the deliberate preparation of study designs.
Another factor for the imbalance might be that self-report measures are much easier to collect. It takes comparatively little effort to hand out a questionnaire or interview participants. In contrast, the collection of behavioral data is much more complex. For example, dynamic vehicle data require extensive pre-processing before descriptive and inferential analyses can be run. The collection of eye-tracking data requires even more resources due to the need for manual calibration to ensure data quality, although such data make it possible to draw direct inferences about cognitive processes [138]. For example, prior research has suggested that the number of gaze switches (i.e., monitoring behavior) can serve as an indicator of trust/reliance [18] or interface understanding [29]. One solution to the difficulty and extensive effort of collecting behavioral data might be experimenters' single-item ratings of interaction performance [139][140][141]. However, this approach requires well-trained raters and, ideally, ratings are given single-blind, so that the rater is not aware of the experimental condition assigned to the participant. In addition, there should be more than one rater to ensure inter-rater reliability. Despite the additional time and cost, researchers should consider the collection and analysis of behavioral data when planning a user study, since it can provide additional valuable insights about the tested interface or feature. It is not for nothing that the usability standard ISO 9241 [34] includes effectiveness and efficiency as behavioral components and satisfaction as an attitudinal component. These sources of data do not always align well [11,118], which is not necessarily a bad study outcome. Rather, it supports the assumption that both sources of data are necessary to derive a holistic impression of an interface. This has also been emphasized by Pettersson et al. [8,142], who expressed the urgency of triangulation in user studies. We appeal to researchers in the field of human-automation interaction to always consider additional behavioral observations. The importance of collecting behavioral measures of course depends on the respective research question. For example, there are instances where researchers are interested specifically in the public opinion of a large sample of respondents [25]. As explicitly mentioned in the Introduction of this work, we do not see the recommendations as mandatory but rather as thought-provoking. Thus, they might guide exclusive attitude research towards future work on the behavioral consequences.
⇒ REC 4: Researchers should consider the collection of more than a single source of data (i.e., triangulating behavioral, psychophysiological, and self-reported data). Insights from self-report measures might be worthy of future research regarding their effects on user behavior.
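One simple way to check whether self-reported and behavioral data align, as the triangulation in REC 4 suggests, is a rank correlation between a questionnaire score and a behavioral indicator recorded for the same participants. The sketch below implements Spearman's rank correlation with the Python standard library only. The choice of Spearman's coefficient and all function names are assumptions made for this illustration; the reviewed papers do not prescribe a particular statistic.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the tie-corrected ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

For example, `spearman(trust_scores, gaze_switch_counts)` (both hypothetical per-participant lists) would quantify how closely reported trust follows monitoring behavior. A weak or negative coefficient is not a failed study; it is exactly the kind of divergence between attitude and behavior discussed above.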

Parameters
The results of the literature review showed that research provides a heterogeneous pool of parameters, for example considering driving or TOR performance measures in controllability studies (see Tables 2-12). To better compare results between different laboratories, there is an urgent need to standardize procedures for evaluating human-automation interaction. Efforts in this direction are reported by Wintersberger et al. [59], who suggest TOR performance measures based on SAE J2944, and by Naujoks et al. [60], who outline a standardized set of use cases for control transitions between levels of automation. Since there has been a lot of research on TOR controllability, it is now time to combine this body of knowledge and standardize methodology along the lines of driver distraction research [143,144].
⇒ REC 5: Consider existing or proposed standards for measurements to allow comparison of study results.

Constructs
First, it turned out that some studies included parameters whose focus of research was not clearly defined. Despite setting up a large number of constructs during the expert workshop and adding further options during the review process, there still remained a considerable number of parameters that could not be assigned free of doubt. This resulted in the two constructs general attitude and interaction behavior (see Table 1). In these instances, information was rather scarce and only high-level indications about interface evaluation were provided by the authors. We therefore encourage researchers to first clearly state objectives and classify these within a specific body of research. This should allow both authors and readers to eventually compare the obtained results with existing findings and derive implications of the reported work.
⇒ REC 6: Clearly communicate which constructs are addressed by an experiment to foster transparency.
The majority of parameters were well classifiable within our database. The safety construct constituted the largest part of all measures (see Table 1). This shows the importance of safety concerns when it comes to investigating driving automation technology. Investigations of TOR performance during system failures have been the first scenarios in human-automation interaction research [12][13][14]145]. Here, many issues such as trust [18,21,97], controllability [12], fatigue [15], or mode awareness [17] have been discovered. Right now, it seems that a re-orientation of research is taking place, which is also reflected in the remaining constructs of the database. Trust [146] and intention to use [35] constitute precursors of actual system use, and, due to progress in time, technology, and scientific evidence, research has investigated these constructs more closely. 
Here, the aforementioned open research directions of long-term studies, absolute validity concerns in realistic driving, and the agreement between attitudinal and behavioral measures apply.
Besides safety and acceptance, there remain other constructs that have rather been neglected until now. The reason why research has given less attention to usability, UX, or productivity might be that these are rather precursors of safety and acceptance. Another reason might be that the scenario where automation fails and humans need to step into action was an important issue for determining the feasibility of driving automation in the first place. We argue that more attention should be paid to other types of interaction such as ongoing automation and user-initiated or planned transfers of control, as these use cases will most likely occur more frequently than automation failures [59,147]. From that, the need to investigate efficient and effective interaction arises [29]. Furthermore, it might not be sufficient to develop interfaces that users are satisfied with [34]; interfaces should also be fun and enjoyable to interact with [32,148]. Moreover, safety-critical issues of AD, like TORs in SAE L3, also impact users' overall driving experience [149]. Hence, an additional focus on UX and research on users' emotions and need fulfillment can pave the way to developing not only proper but great HMIs for AVs, which increase individual but also societal acceptance of this emerging technology.
We also emphasize at this point that the respective research question largely impacts the investigated constructs. The existing large number of findings on safety might benefit researchers in the formulation of research questions since statements on future research and unanswered questions are an inherent part of a researcher's work. At the same time, this implies that there is a lot of room for research questions beyond established constructs.
⇒ REC 7: Depending on the particular research question, we suggest going beyond established constructs like safety, trust, acceptance, or workload and regard topics from different perspectives.

Limitations
The literature review presented here comes with some limitations. First, the restriction to the six most relevant sources (three journals and three conferences) limits the reach of included works, and other venues/journals also publish cutting-edge research on driving automation matching our inclusion criteria. However, comparing the obtained results with a prior review concentrating only on one of the six included sources [150] shows that the results did not drastically change after extension. Second, we (subjectively) felt a trend towards experiments that are published multiple times with a slightly adapted focus (for example, a conference publication presenting first insights into an experiment is followed by a more detailed journal submission). As our review focused on publications rather than single experiments, we cannot rule out that some studies in our database are duplicates, which may have slightly impacted the results (for example regarding the distribution of participants' age, study types, or the share of the levels of automation addressed). Third, the inter-rater reliability is calculated based on the inclusion/exclusion criteria of papers, while the subsequent classification process is not completely free from subjective assessment. Although we tried to keep the level of subjective interpretation to a minimum (by defining a standardized reviewing procedure), not all involved decisions were free from ambiguity. For example, some publications evaluated multiple constructs but did not provide information on the mapping of investigated parameters to these constructs. In such cases, the decision falls to the reviewer, and, to minimize such effects, the authors discussed potential inconsistencies together and made adaptations (if necessary) before the final database analyses. Moreover, the methodological insights presented here combine a large variety of research questions spanning many different areas of research. 
To derive specific insights for one target of research (e.g., how do drivers take over from automated to manual control), a separate consideration is necessary using relevant publications for such a category. Despite this limitation, our results mark a first step in this direction by providing researchers with methods and parameters that they most likely have to consider when setting up an experiment investigating a certain construct with relevance to automated driving. Finally, this work contains literature from six venues, and one needs to be aware that the recommendations apply to the literature that was included here. However, the claim that these venues cover a representative share of driving automation studies strengthens the argument that the recommendations might also apply to other venues. The keywords initially applied to reduce the number of papers might be criticized as arbitrary, and a different set of keywords could have led to a different collection of publications. However, these keywords were chosen after extensive discussion among the authors of this paper and thus are based on the best of our knowledge of human factors research in driving automation.
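As an illustration of how inter-rater reliability on inclusion/exclusion decisions can be quantified, the following Python sketch computes Cohen's kappa for two raters. This section does not state which agreement statistic was used for the five reviewers, so the choice of Cohen's kappa, the two-rater setup, and the function name are assumptions made for this example.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e the agreement expected by chance from each rater's label
    frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / n ** 2
    if p_e == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_o - p_e) / (1 - p_e)
```

A call such as `cohens_kappa(decisions_reviewer_1, decisions_reviewer_2)`, with each list holding hypothetical "include"/"exclude" labels per paper, returns 1.0 for perfect agreement and values near 0 when agreement is no better than chance. With more than two raters, a multi-rater statistic such as Fleiss' kappa would be the more natural choice.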

Conclusions
In this article, we have reviewed the status quo of methods utilized in human factors research in driving automation. We followed a structured approach to give an overview of the research domain by selecting relevant papers and reviewing them in a standardized manner using a relational database. There is a good portion of research on different aspects of driving automation, indicating that researchers in the community work on the issue of developing and improving human-machine interfaces for automated vehicles. When researchers plan to engage in research and development of automated driving, the present work provides them with an overview of the current landscape. Thus, one can derive information about main research areas and emerging trends that have not been studied extensively yet. Additionally, this literature review provides researchers and practitioners with suggestions about methodological tools (i.e., collection methods and specific parameters) that they can use when assessing a certain construct of a driving automation system. To conclude, we list a set of recommendations to be considered in future experiments addressing automated driving:
• Existing study results obtained in driving simulators should be validated in realistic on-road settings. As most experimental results were obtained in driving simulators (or with an even lower degree of realism/immersion), their main findings must urgently be validated in more realistic settings, especially when addressing constructs that incorporate risk (such as trust in automation).
• In contrast to single session experiments, insights into the long-lasting effects of AD usage are scarce and can benefit from longitudinal study designs. Longitudinal studies offer another huge potential for future work. Such studies can not only validate results obtained in single-session experiments; they might even reveal new issues which have not been addressed yet.
• Depending on the particular research question, we suggest going beyond established constructs like safety, trust, acceptance, or workload. In addition, take constructs into account that have not yet been intensively addressed (such as personalization, cooperation, wellbeing, etc.). Cooperation in particular refers to different parties performing as a team together rather than only one party at a time [87]. When designing HMIs for vehicles, consider the full spectrum of user experience research and also include user satisfaction (such as hedonic qualities) in HMI evaluation, as these aspects will finally decide between success and failure of the technology on the market.
• To increase validity, the sample characteristics should be adapted to the addressed research questions. In addition, researchers need to explain why the particular sample was chosen and considered valid. Aim for a participant sample that allows for investigating the proposed research question plausibly. Do not merely list the need for a more diverse sample as a limitation, but also discuss how your results could be affected by biases in participant sampling.
• Researchers should consider the collection of more than a single source of data (i.e., triangulating behavioral, psychophysiological, and self-reported data). Insights from self-report measures might be worthy of future research regarding their effects on user behavior. Evaluation of different types of data sometimes leads to contradicting results. Such conflicts should not be avoided, as in the best case they contribute to a better understanding of established theory. Only comprehensive evaluation of the involved factors allows for drawing meaningful conclusions.
• Consider existing or proposed standards for measurements to allow for comparison of study results. The possibility to compare study results is a key element of scientific practice. Thus, if possible, utilize standardized methods (regarding parameters, their measurement times, as well as their calculation). In case there is no such standard or best practice, build upon related work instead of inventing additional measurements.
• Clearly communicate which constructs are addressed by an experiment to foster transparency. Minimize the potential for ambiguity. Clearly state which constructs are investigated and how the selection of methods/parameters used for evaluation is related to them.