Evaluation of Multimodal External Human–Machine Interface for Driverless Vehicles in Virtual Reality

Abstract: With the development and promotion of driverless technology, researchers are designing various types of external interfaces to induce trust in this new technology among road users. In this paper, we investigated the effectiveness of a multimodal external human–machine interface (eHMI) for driverless vehicles in a virtual environment, focusing on a two-way road scenario. Three phases of the interaction between driverless vehicles and pedestrians were taken into account: identifying, decelerating, and parking. Twelve eHMIs are proposed, which consist of three visual features (smile, arrow and none), three audible features (human voice, warning sound and none) and two physical features (yielding and not yielding). We conducted a study to identify a more efficient and safer eHMI for driverless vehicles when they interact with pedestrians. Based on the study outcomes, in the case of yielding, the interaction efficiency and pedestrian safety of the multimodal eHMI design were satisfactory compared with the single-modal system. The visual modality in the eHMI of driverless vehicles has the greatest impact on pedestrian safety. In addition, the “arrow” was more intuitive to identify than the “smile” in terms of visual modality.


Introduction
With the development of self-driving technologies, future traffic situations will be shaped by three traffic factors (pedestrians, driverless vehicles and other traffic participants) rather than two (drivers and other traffic participants) [1]. The interactions between driverless vehicles and other traffic participants (non-motor vehicles and pedestrians) will play an important role in the mixed traffic environment. Although traffic lights, traffic signs, and road markings provide clear guidance for people, there is still some confusion, such as "who goes first", under certain circumstances. Traditionally, when pedestrians are not sure whether they can cross the road, they rely on cues from driver behavior, such as eye contact, posture and gestures, as well as physical signals, such as brake lights or indicators. However, current driverless vehicles are machines without emotions and feelings, and they lack such communication and coordination capabilities. When this interaction disappears, pedestrians will not be able to infer the intention of autonomous vehicles, which may hinder efficient interaction between pedestrians and vehicles. A survey study [2] shows that failure to establish interpersonal communication is one of the main reasons for the increased risk for vulnerable road users when interacting with driverless vehicles. Therefore, we need to ensure that the artificial interaction of driverless vehicles can safely and smoothly replace the implicit interaction that is common today between the on-board driver and pedestrians. This opens a completely new field for human factors researchers and human-machine interface (HMI) designers [3]. Recently, researchers have proposed potential external feature ideas for driverless vehicles and investigated pedestrians' preferences for these features using real-world studies [4][5][6], video recording methods [7], and simulation-based studies [8][9][10][11].
There are two distinct focuses of research into human-machine interfaces for driverless vehicles. The major emphasis of related work is on the interface between the driverless vehicle and the person inside the vehicle, which we call the iHMI (interior human-machine interface) [12][13][14]. The external human-machine interface (eHMI), which handles the interaction between the driverless vehicle and the pedestrian, is less explored and is the focus of our work. Although the term human-machine interface describes a visual interface in most cases, some researchers [15] consider that it also includes auditory display and all vehicle controls, including the traditional pedals and steering wheel, vehicle dynamics [16], and additional tactile elements such as pulse, vibration, and other physical guidance. Therefore, the human-machine interface can be seen in a broad sense as a window between human and machine. It plays a central role in the operation of the joint system.
The current design of the eHMI for driverless vehicles focuses on elements such as position, color, information type and technology. Clamann et al. [2] developed a prototype display placed in front of the radiator grille for vehicle-pedestrian communication. The human-like symbols on the display were presented in white on a black background. Ackermann et al. [17] focused on the visual modality and presented five examples of eHMI design variations. In their research, projection and LED display/light strip were used for technology; windscreen, radiator grille and street surface for position; and text- versus symbol-based message and message content for information type. Clercq et al. [18] studied the effect of vehicle physical behavior, vehicle size and eHMI type on pedestrian crossing decisions. The above studies mostly focus on a single modality of eHMI. In contrast, Deb et al. [19] used both visual and audible modalities in designing the eHMI. Visually, the information type still adopted a "smile" symbol and text. Mahadevan et al. [20] implemented four proof-of-concept prototypes to test their interfaces in a street crossing scenario. One of the prototypes mixed three modalities of cues: a visual cue based on the local street, an auditory cue from pedestrians' cellphones and a physical cue from the driverless vehicle. They observed that three modalities of cues can be important for creating an eHMI for driverless vehicles. However, their work consisted of controlled studies using a Wizard-of-Oz prototyping approach and was rooted in the vehicle-pedestrian interaction patterns they observed in their own country, so it lacks realism and generalizability. Sucha et al. [16] studied the factors that influence pedestrians' feeling of safety when crossing a road at a crosswalk. The most important factor for pedestrians' safety is the speed of the vehicle, which is consistent with research results reported in the literature [21,22].
The distance of the vehicle from the pedestrian and various signs given by drivers (waving a hand, flashing lights, etc.) are also main factors that pedestrians use to decide on their behavior [23]. Previous work reveals that pedestrians are used to noticing both vehicle cues and the eHMI when making crossing decisions; therefore, when driverless vehicles are first introduced, a single modality is not enough to compensate for the loss of driver behaviors.
What is more, it is necessary to further explore the design of the visual modality because of its great influence on pedestrian perception [1], which has not been sufficiently investigated in previous studies. The information types of the visual modality mainly include symbol, text and light. For example, the Baidu-Hongqi L4 driverless vehicle adopted a "symbol + text" type, where the icon used two lines with perspective, a human silhouette, and a custom symbol (composed of an ellipse and two dots located symmetrically in it); the text used was "Is about to start", "Please pass", and "Thanks for the compliment". The Mercedes-Benz F015 used special lighting that flows in one direction at the location of the radiator grille to remind pedestrians to pass ("light" type). Habibovic et al. [24] mounted a lighting interface displaying visual signals to pedestrians at the top of the windshield ("light" type). The signals were developed around four messages describing the vehicle's patterns and intentions. Chang et al. [25] used cartoon eyes on the headlights to communicate with pedestrians ("symbol" type). Semcon's conceptual smiling car adopted a smiley, which has been chosen in theoretical studies by most current researchers [18,19] ("symbol" type): when the car detects a pedestrian, a smile lights up on the front display, confirming that the car intends to park for the pedestrian. Text can convey the specific meaning of an autonomous vehicle's message, but it requires basic reading skills and often brings cognitive load when used by several cars on the road. As for light, its visibility may be reduced in poor weather or very bright sun, and lighting design in cars must meet strict design standards in China. As such, we choose the "symbol" type as the information transmission channel in the visual modality.
In addition, the symbol elements currently used have no direct relationship with traffic context, and previous studies did not consider whether the source of these design elements is reasonable, and whether they can be quickly understood by pedestrians.
In this paper, we proposed a multimodal eHMI for driverless vehicles based on three modalities: visual, audible, and physical. In order to adapt to the mixed traffic environment, a new symbol element for the visual modality was extracted from the specific scenario and designed. Then, by constructing a virtual reality pedestrian simulator, we measured the interaction efficiency and psychological security of pedestrians under different modal combinations and compared the performance of the new symbol elements. Finally, we provided some suggestions for the future design of the eHMI of driverless vehicles, according to the experimental results.

Framework for a Multimodal eHMI of a Driverless Vehicle
In this section, we provide an overview of our framework of a multimodal external human-machine interface. Our pipeline utilizes visual, auditory and physical signals to construct a multimodal eHMI of a driverless vehicle, in order to promote the communication with pedestrians more efficiently.
As shown in Figure 1, the interaction between pedestrians and the driverless vehicle was first divided into two or three phases, according to the vehicle behavior. Driverless vehicles in a mixed traffic environment on today's roads need to provide various forms of communication to pedestrians to allow for safe, smooth and intuitive interactions [26]. A driverless vehicle may or may not yield to a pedestrian when it encounters them, depending on the integrated decision-making capabilities of the smart car. Then, from the perspective of the semantic expression of the driverless vehicle, we use both intention information and suggestion information, which are conveyed to pedestrians by a multimodal eHMI in different phases. For a yielding vehicle, "Pedestrian identified" and "Slowing down" describe the vehicle's own situation and can be classified as intention information, whereas "Please pass safely" is suggestion information. An unyielding vehicle presents only intention information. Finally, we designed various complementary eHMIs in different modalities to convey these semantics to pedestrians.

Physical Modality
The physical modality refers to physical feedback based on the vehicle itself, such as vehicle type, approach speed, gap size, and so on. These vehicle behaviors probably affect pedestrians' anticipation of the vehicle's intentions. Other scholars [20,27] suggested that the physical modality can also be a kind of haptic feedback, such as a mobile phone vibration [28] or haptic actuator. However, our goal is to design a vehicle-based eHMI, which means placing cues on the vehicle only. Therefore, only physical-visual feedback was taken into account. Vehicle behavior often provides pedestrians with an understanding of its maneuvers, which is especially critical in a shared space. Thus, we choose two cases for the study: yielding (the approaching speed decreases gradually) and not yielding (the approaching speed remains unchanged). Driverless vehicles cannot always give way to pedestrians in the process of interacting with them [29,30]. Imagine this scenario: some driverless vehicles with good safety performance are driving down a road. Will pedestrians ignore the traffic rules and just walk on the street because it will not hurt them? Pedestrians' aggressive behaviors toward driverless vehicles may be unavoidable, because the vehicles seem "safe": they are programmed to follow road rules and have to accommodate pedestrians' behaviors [31]. Pedestrians may therefore believe that driverless vehicles will avoid them under any circumstances, creating a feedback loop that makes it difficult for vehicles to enter dense traffic areas [29]. In this study, by designing the eHMIs, we try to convey a danger signal or warning message to pedestrians in advance when the driverless vehicle is not yielding.
For yielding vehicles, the gap between 50 m and 30 m from the pedestrian is defined as the identifying phase, the gap between 30 m and 5 m is defined as the decelerating phase, and the phase when the vehicle is waiting for the pedestrian to pass is defined as the parking phase (See Figure 2). Under the condition of not yielding, the driverless vehicle runs at a constant speed of 36 km/h.
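The phase boundaries for a yielding vehicle can be expressed as a small lookup on the vehicle-to-pedestrian gap. The following is a minimal sketch rather than the authors' implementation: the names `Phase` and `phase_for_gap` are hypothetical, and which phase owns the exact boundary values (30 m and 5 m) is an assumption, since the text does not specify it.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    IDENTIFYING = "identifying"
    DECELERATING = "decelerating"
    PARKING = "parking"


# Gap thresholds in metres, taken from the phase definitions in the text.
IDENTIFY_START_M = 50.0
DECELERATE_START_M = 30.0
STOP_GAP_M = 5.0


def phase_for_gap(gap_m: float) -> Optional[Phase]:
    """Return the interaction phase of a yielding vehicle for a given gap.

    Returns None while the vehicle is still farther away than 50 m.
    Assigning the boundary values to the later phase is an assumption.
    """
    if gap_m > IDENTIFY_START_M:
        return None
    if gap_m > DECELERATE_START_M:
        return Phase.IDENTIFYING
    if gap_m > STOP_GAP_M:
        return Phase.DECELERATING
    # At 5 m the vehicle has stopped and waits for the pedestrian.
    return Phase.PARKING
```

For example, `phase_for_gap(40.0)` yields `Phase.IDENTIFYING` and `phase_for_gap(12.0)` yields `Phase.DECELERATING`.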

Visual Modality
The visual modality mainly uses visual cues, such as text, patterns and symbols. There have been many studies on the feasibility of text as a visual cue [28,32], but few on different visual symbols. In this section, two visual features are put forward for comparison; one is a new symbol, "arrow", and the other is "smile".
According to the empirical principle of visual perception [33], observers tend to perceive elements based on past experience. That is why we chose the arrow as the basic unit to design eHMIs for driverless vehicles in the visual modality. The arrow element is a familiar symbol in traffic environment systems, and exists in the common cognition of the public.
Based on the principle of simplicity and easy identification, we designed visual features 1, 2 and 3 on the basis of the arrow element through array and deformation; they correspond to the three interaction phases, respectively (Figure 3). A comparative study was conducted with the "smile" proposed by Semcon (a Swedish technology company) in 2016 (Figure 4).

Auditory Modality
In the auditory modality, a human voice and a warning sound were proposed to study the multimodal eHMI. Electric vehicles must be equipped with an AVAS (Acoustic Vehicle Alerting System) to warn pedestrians of passing vehicles when they are driving at speeds below 30 km/h [34]; at high speeds, the tires and wind create enough noise to alert passers-by to the presence of a vehicle. In this paper, the human voice and the warning sound play automatically when the vehicle is at speeds below 30 km/h. Considering that a human voice can be heard clearly only at very low speeds, this auditory feature was used only in the yielding situation. A female voice played a 3-s-long message ("Slowing down") in the decelerating phase, and a 3.75-s-long message ("Please pass safely") with a 1 s delay in the parking phase. In the not yielding situation, the driverless vehicle emits the warning sound, which has a frequency of 1500 Hz; its volume is set to 20 dB above ambient noise, with an absolute volume of 80 dB.
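The auditory rules described above can be summarized as a small selection function. This is an illustrative sketch only: `AudioCue`, `select_cue` and `implied_ambient_db` are hypothetical names, the phase labels are plain strings chosen for the example, and the 30 km/h speed condition is not modelled. The ambient level is not stated in the text; it is simply the value implied by "20 dB above ambient" and "80 dB absolute".

```python
from dataclasses import dataclass
from typing import Optional

# Values stated in the text for the warning sound.
WARNING_FREQ_HZ = 1500
WARNING_LEVEL_DB = 80
WARNING_ABOVE_AMBIENT_DB = 20


@dataclass(frozen=True)
class AudioCue:
    kind: str                    # "voice" or "warning"
    message: str                 # spoken text; empty for the warning tone
    duration_s: Optional[float]  # None where the text gives no duration
    delay_s: float = 0.0         # delay after the phase begins


def implied_ambient_db() -> int:
    """Ambient level implied by the two volume figures in the text."""
    return WARNING_LEVEL_DB - WARNING_ABOVE_AMBIENT_DB


def select_cue(audio_feature: str, yielding: bool, phase: str) -> Optional[AudioCue]:
    """Pick the auditory cue for a trial condition, or None for silence."""
    if audio_feature == "human voice" and yielding:
        if phase == "decelerating":
            return AudioCue("voice", "Slowing down", 3.0)
        if phase == "parking":
            return AudioCue("voice", "Please pass safely", 3.75, delay_s=1.0)
        return None  # no voice message is described for the identifying phase
    if audio_feature == "warning sound" and not yielding:
        # The text gives no phase or duration for the warning tone,
        # so the phase argument is ignored here.
        return AudioCue("warning", "", None)
    return None
```

Under these assumptions, `implied_ambient_db()` returns 60, and a human voice is never selected for a vehicle that does not yield.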

Materials and Methods
An experimental study was designed using a virtual environment developed in Unity and experienced with an HTC Vive Pro, which includes a head-mounted display, two controllers and two lighthouse sensors. Concern for participants' physical health and control of the experimental setting for each trial were significant reasons for choosing VR in this article. Data were collected in two ways: controller-collected objective measures of reaction time and survey-based responses on different subjective measures. All participants involved in this study gave their informed consent.

Participants
Ten participants were recruited from Southeast University. All participants had normal full color vision, had no hearing problems, and were able to walk at a normal pace for a distance of over 500 m. The subjects included 6 males (aged 22-30, M = 24.83, SD = 2.86) and 4 females (aged 21-24, M = 23.00, SD = 1.41). They had no experience of motion sickness in real life, and no symptoms of simulation sickness, such as vomiting or dizziness, occurred at the beginning of or during the experiment. Thus, all participants completed the entire study, which took around 35 min. Each participant was compensated for their participation.

Apparatus
The experiment was conducted in a laboratory where the HTC Vive Pro lighthouse sensors were mounted on tripods facing each other, approximately 8 m apart and 2 m high, angled down at 30°. The HMD had a binocular resolution of 2880 × 1600 px, a refresh rate of 90 Hz and a FOV of 110 degrees. The Vive sensors track the position and orientation of the participant's head-mounted display. The scenario and the driverless vehicle in this virtual experiment were modeled in 3ds Max. The car model largely mimicked Google's first self-driving car prototype. The animation of the eHMI for the driverless vehicle was produced mainly using Photoshop and the Timeline component in Unity. Data collection was performed on an MSI laptop equipped with an Intel Core i9-8950HK processor and a 1080 graphics card.

Virtual Environment
The experimental traffic environment was a non-signalized intersection where the interaction between pedestrian and driverless vehicle occurred on a two-way road. To eliminate the interference of irrelevant variables and keep pedestrians' attention on the eHMI of the driverless vehicle, the following measures were taken: (1) the scenario had no traffic lights and no zebra crossing because these two kinds of cues could cause confusion among pedestrians in judging the intent of the car; (2) the car was painted light gray to contrast with and emphasize the visual effect of the eHMIs; and (3) the car came from the right side of the participant in each trial, with no other traffic flow. Twelve eHMIs for the driverless vehicle were designed and presented in 12 separate virtual scenes, which the operator switched manually.

Experiment Design
The experimental paradigm adopted in this experiment was that the driverless vehicle could detect the pedestrian (which had been fully realized in reality), and it might yield or not yield for them. When pedestrians have no right of way, driverless vehicles may give way to pedestrians out of politeness or safety. But they do not have to give way to pedestrians every time, otherwise it would be difficult to actually drive on the road.
A randomized complete block design with 12 treatment combinations was used for the experiments (Table 1). The three independent factors were visual features (three levels: smile, arrow, and none), auditory features (three levels: human voice, warning sounds and none) and physical features (two levels: yielding and not yielding). The experiment tested two hypotheses:
Hypothesis 1. Compared with a single-modal eHMI, a multimodal eHMI provides pedestrians with a higher sense of security, and performs better in improving the interaction efficiency between the pedestrian and the driverless vehicle.
Hypothesis 2. For the visual modality, "arrow" gives pedestrians a higher sense of security than "smile", and performs better in improving the interaction efficiency between the pedestrian and the driverless vehicle.
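A full 3 × 3 × 2 crossing of the factors would give 18 cells, so the 12 treatment combinations imply a restriction. The sketch below shows one way to arrive at 12, assuming the pairing described in the Auditory Modality section (human voice only when yielding, warning sound only when not yielding). Table 1 is not reproduced here, so this rule and the function names are assumptions for illustration.

```python
from itertools import product

VISUAL = ("smile", "arrow", "none")
AUDITORY = ("human voice", "warning sound", "none")
PHYSICAL = ("yielding", "not yielding")


def is_valid(visual: str, auditory: str, physical: str) -> bool:
    """Assumed pairing rule: voice with yielding, warning with not yielding."""
    if auditory == "human voice":
        return physical == "yielding"
    if auditory == "warning sound":
        return physical == "not yielding"
    return True  # "none" can accompany either physical behaviour


def treatment_combinations() -> list:
    """Enumerate all factor combinations that satisfy the pairing rule."""
    return [c for c in product(VISUAL, AUDITORY, PHYSICAL) if is_valid(*c)]


if __name__ == "__main__":
    for combo in treatment_combinations():
        print(" + ".join(combo))
```

Under this rule, each physical behaviour has six combinations (three visual levels times two admissible auditory levels), which matches the six groups per scenario reported in the results.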
The multimodal eHMI was evaluated by combining objective and subjective measurements. The evaluation indices are pedestrians' sense of security on the road and the interaction efficiency, which corresponds to how well pedestrians accept and understand the messages conveyed by driverless vehicles. For the objective measurement, the "feel-safe time" of pedestrians in each trial was collected through a controller and used to calculate the "feel-safe percentage", defined as the total time spent pressing the trigger key divided by the total time of the trial. Specifically, in the case of yielding, the total time of each trial was 12.372 s, which started when the trial began and ended when the vehicle had been stopped for 6 s (at a distance of 5 m between the car and the pedestrian). In the case of not yielding, the speed was maintained at 36 km/h throughout. The total time was 6 s, which started when the trial began and ended when the car drove in front of the pedestrian (a distance of 0 m between the vehicle and the pedestrian). For the subjective evaluation, scales were the main method.
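The feel-safe percentage defined above can be computed from the trigger press and release timestamps of a trial. The following sketch assumes the controller log is available as a list of (press, release) times in seconds from trial start; that log format and the function name are hypothetical, and only the two trial durations come from the text.

```python
from typing import Iterable, Tuple

# Trial durations in seconds, as stated in the text.
YIELDING_TRIAL_S = 12.372
NOT_YIELDING_TRIAL_S = 6.0


def feel_safe_percentage(
    press_intervals: Iterable[Tuple[float, float]],
    trial_duration_s: float,
) -> float:
    """Percentage of the trial during which the trigger was held.

    press_intervals holds (press_time, release_time) pairs measured from
    trial start. Intervals are assumed not to overlap, since a single
    trigger cannot be pressed twice at once.
    """
    if trial_duration_s <= 0:
        raise ValueError("trial_duration_s must be positive")
    held = 0.0
    for press, release in press_intervals:
        # Clip each interval to the trial window before summing.
        start = max(0.0, press)
        end = min(trial_duration_s, release)
        if end > start:
            held += end - start
    return 100.0 * held / trial_duration_s
```

For instance, a participant who holds the trigger for the first 3 s of a not yielding trial and then releases it obtains `feel_safe_percentage([(0.0, 3.0)], NOT_YIELDING_TRIAL_S)`, that is, 50.0.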

Procedure
Participants were first asked to read and fill out a basic information form and to provide informed consent. The consent form stated the task that the participants needed to complete as follows: "Each time you feel safe, please perform the following operations: (1) Press the trigger button on the controller. (2) Press and hold the trigger button as long as you feel safe. (3) When you feel danger or nervous, release it." Then the simulator sickness questionnaire (SSQ) was administered to record a baseline score. After this, the subjects put on the headset and became familiar with the virtual environment in the study. This part helped them learn how to use the controller and adapt to the VE. Next, they were introduced to the formal experiment and the guideline was described orally as follows: "Now you are going to cross the road, and you have reached the middle of the road. At this time, a driverless vehicle is coming from the right side of your lane at nearly 50 m, and you do not see a driver. You can obtain the intention of the driverless vehicle through an external interface and listen to the audible cues coming from the vehicle during the experiment. Hold down the trigger button on the controller for as long as you feel safe, and release when you feel danger or nervous. Are you ready? Do it now!" Every participant experienced 36 trials in a random order (Figure 5). A practice session demonstrating the capability of the system was carried out before the formal experiment in order to overcome the "novelty effect" for the participants. After every 12 trials, participants took off the headset for a 3 min break and completed the SSQ to report any immediate discomfort from their exposure to the VR. The whole experiment lasted about 35 min, and the average time of the subjects in the virtual environment was less than 25 min. To ensure their safety with respect to simulation sickness, participants responded to the SSQ [35] at the beginning, in the middle, and at the end of the experiment. In addition, they were asked to fill out a PQ (Presence Questionnaire) [36] and an IRS (Interface Rating Scale) (see Appendix A) at the end of the experiment.

Results
SPSS 26.0 was used for data sorting and statistical analysis of the objective and subjective measurement data. The feel-safe time recorded during the experiment was collected from Unity 2017.4.3, and the feel-safe percentage was then calculated programmatically. In this section, the feel-safe percentages of the 12 treatment combinations in the yielding and not yielding cases are first compared and then classified into single-modal and multimodal groups for comparison. We further explore the effects of the two visual features on pedestrian crossing in the visual modality. Finally, the results of the SSQ, PQ, and IRS are reported based on central tendency (i.e., mean or median) to measure participants' simulator sickness symptoms, presence, and subjective preference for the eHMI on the driverless vehicle.
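The text does not spell out how the feel-safe percentage was derived from the recorded feel-safe time. A minimal sketch, assuming the percentage is the share of a trial's duration during which the trigger was held, is given below in Python. The function name, the interval format, and the example values are hypothetical and are not taken from the study's Unity scripts.

```python
from typing import Iterable, Tuple


def feel_safe_percentage(
    held_intervals: Iterable[Tuple[float, float]],
    trial_duration: float,
) -> float:
    """Return the percentage of a trial during which the trigger was held.

    held_intervals: (press_time, release_time) pairs in seconds, measured
        from the start of the trial and assumed not to overlap.
    trial_duration: length of the trial in seconds (assumed denominator).
    """
    if trial_duration <= 0:
        raise ValueError("trial_duration must be positive")

    held = 0.0
    for press, release in held_intervals:
        # Clip each interval to the trial window [0, trial_duration].
        start = max(0.0, press)
        end = min(trial_duration, release)
        if end > start:
            held += end - start

    return 100.0 * held / trial_duration


# Hypothetical trial: trigger held for 6 s, released for 2 s, held for 2 s.
example = feel_safe_percentage([(0.0, 6.0), (8.0, 10.0)], trial_duration=10.0)
print(f"{example:.2f}%")  # prints "80.00%"
```

Under this assumption, a participant who never releases the trigger scores 100%, which matches the maximum value discussed for the multimodal combinations.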

Objective Data Analyses
First, the objective data for the 12 treatment combinations of independent variables under the two physical modalities were statistically analyzed. Table 2 gives the descriptive statistics of the feel-safe percentage under the yielding condition. In this case, the group without an audio-visual external human-machine interface (Yielding) scored the lowest (Mean = 79.64%, SD = 0.142). Next came the smile-only group (Smile + Yielding), the human voice-only group (Human voice + Yielding), and the arrow-only group (Arrow + Yielding), with mean scores and standard deviations of 84.33% (0.106), 90.36% (0.098) and 92.81% (0.055), respectively. When the eHMI had both visual and auditory modalities, the feel-safe percentage was the highest, and the "Arrow + Human voice + Yielding" group (Mean = 95.66%, SD = 0.050) scored higher than the "Smile + Human voice + Yielding" group (Mean = 94.20%, SD = 0.080).
Figure 6 shows the distribution of feel-safe percentages for the different treatment combinations under the two scenarios. The boxplots show the median and quartile ranges of the feel-safe percentage for the six groups, and the height of each box reflects the degree of fluctuation of the data to some extent. Figure 6a shows that when the eHMI of the driverless vehicle presented multimodal combinations, the maximum value reached 100%; that is, participants responded markedly more positively when the vehicle was approaching and getting ready to park. However, when the driverless vehicle decided not to yield, participants responded negatively, according to Figure 6b.
Then, we assigned eHMIs with two or fewer modalities to a single-modal group and eHMIs with three modalities to a multimodal group. For example, the "Smile + Human voice + Yielding" group contains three modalities (visual, auditory and physical) and therefore belongs to the multimodal group. We then compared the single-modal and multimodal groups under the yielding condition. Since the data of these two independent samples do not follow a normal distribution, a Mann-Whitney test was used for analysis. Table 3 shows that the average feel-safe percentage of the single-modal group was significantly lower than that of the multimodal group (z = −4.995, p < 0.001).
However, in the case of not yielding, the combinations with visual and auditory modalities did not provide a higher sense of safety. Instead, the group with no visual and auditory cues (Not yielding) scored highest (Mean = 75.51%, SD = 0.184). The combinations that included a warning sound all showed a lower average feel-safe percentage (63.05%, 67.84%, 69.83%), suggesting that the warning sound does not convey effective information to pedestrians (see Table 4).
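The single-modal versus multimodal comparison above can be illustrated with a minimal Python sketch of the Mann-Whitney U test, using the large-sample normal approximation with a tie correction and no continuity correction. This is an illustration under those stated assumptions, not the SPSS procedure used for the analysis. The function names and sample values are hypothetical, so the output does not reproduce Table 3.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Return (U for sample x, z, two-sided p) via the normal approximation."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = list(x) + list(y)
    n = n1 + n2
    ranks = average_ranks(pooled)

    rank_sum_x = sum(ranks[:n1])
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    # Variance of U, reduced by the tie correction term sum(t^3 - t).
    tie_term = sum(t**3 - t for t in Counter(pooled).values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        raise ValueError("all observations are identical; z is undefined")

    z = (u_x - n1 * n2 / 2) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return u_x, z, p


# Hypothetical feel-safe percentages for two independent groups.
single_modal = [70.0, 75.0, 80.0, 82.0, 85.0]
multimodal = [88.0, 90.0, 92.0, 95.0, 97.0]
u, z, p = mann_whitney_u(single_modal, multimodal)
print(f"U = {u:.1f}, z = {z:.3f}, p = {p:.4f}")
```

A negative z for the first sample indicates that its ranks are lower on average than those of the second sample, which is the direction reported for the single-modal group.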
Similarly, we performed a Mann-Whitney test on the two independent single-modal and multimodal samples in the not yielding case, and the results showed no significant difference between the two groups (p = 0.057 > 0.05).
In particular, we studied the effects of different visual features in the visual modality on the efficiency and safety of pedestrians when crossing the street. Since the multimodal eHMI had no significant impact on a pedestrian's sense of safety when the vehicle did not yield, only the yielding situation was considered here. We compared three independent samples: no visual feature (None), the smile feature, and the arrow feature. A Kruskal-Wallis test at a significance level of α = 0.05 showed a significant difference between the three groups (H (2.87) = 15.422, p < 0.001). In pairwise comparisons, there was no significant difference between the "smile" and "none" groups (adjusted p = 0.095). However, there was a significant difference between the "none" and "arrow" groups (adjusted p < 0.001). The difference between the "smile" and "arrow" groups was also statistically significant (adjusted p = 0.016 < 0.05). To explore which visual feature made pedestrians feel safer, we calculated the 25th, 50th, and 75th percentiles of the two groups. As shown in Table 5, the median value in the arrow group (94.52%) was significantly higher than that in the smile group (83.94%).
Simulator sickness includes "visual induced motion sickness (VIMS), visual simulation sickness, virtual reality-induced symptoms and effects" [37]. It is often caused by the visually induced perception of movement rather than by motion alone [38]. Here, we discuss the impact on the participants of using a head-mounted display to experience a virtual reality environment. The total SSQ scores gathered periodically throughout the experiment increased with time (Table 6).
The findings revealed that participants reported minimal simulation sickness (5 < difference from baseline score < 10) in the first 20 min. However, 30 min or longer of exposure may lead to significant simulation sickness (10 < difference from baseline score < 15). The three subscales of the SSQ (nausea, oculomotor disturbance, and disorientation) were evaluated in more detail, as well as the total score. A repeated measures analysis of variance showed that oculomotor disturbance and total scores increased significantly over time. The oculomotor disturbance score is related to the visual aspects of the simulator and can be explained by eyestrain resulting from prolonged exposure to the display. In short, during the whole experiment, the simulator sickness of the participants was relatively weak and negligible.
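As a worked illustration of the scores discussed above, the sketch below applies the conventional SSQ weighting (9.54 for nausea, 7.58 for oculomotor disturbance, 13.92 for disorientation, and 3.74 for the total score) to raw subscale sums and computes the change from baseline. It assumes that this standard scoring was applied here, which the text does not state explicitly. The names and the example raw sums are hypothetical and are not participant data.

```python
from dataclasses import dataclass

# Conventional SSQ scale weights; assumed to match the scoring used here.
NAUSEA_WEIGHT = 9.54
OCULOMOTOR_WEIGHT = 7.58
DISORIENTATION_WEIGHT = 13.92
TOTAL_WEIGHT = 3.74


@dataclass(frozen=True)
class SSQScores:
    nausea: float
    oculomotor: float
    disorientation: float
    total: float


def score_ssq(raw_nausea: int, raw_oculomotor: int, raw_disorientation: int) -> SSQScores:
    """Convert raw subscale sums (symptom ratings added up) to weighted scores."""
    if min(raw_nausea, raw_oculomotor, raw_disorientation) < 0:
        raise ValueError("raw subscale sums cannot be negative")
    return SSQScores(
        nausea=raw_nausea * NAUSEA_WEIGHT,
        oculomotor=raw_oculomotor * OCULOMOTOR_WEIGHT,
        disorientation=raw_disorientation * DISORIENTATION_WEIGHT,
        # The total is the sum of the raw subscale sums times its own weight.
        total=(raw_nausea + raw_oculomotor + raw_disorientation) * TOTAL_WEIGHT,
    )


def change_from_baseline(baseline: SSQScores, later: SSQScores) -> float:
    """Difference in total score between a later administration and baseline."""
    return later.total - baseline.total


# Hypothetical raw sums at the start and at the end of a session.
start = score_ssq(raw_nausea=0, raw_oculomotor=1, raw_disorientation=0)
end = score_ssq(raw_nausea=1, raw_oculomotor=2, raw_disorientation=1)
print(f"difference from baseline: {change_from_baseline(start, end):.2f}")
```

The difference from baseline is the quantity compared against the bands quoted in the text (5 to 10 and 10 to 15).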

Presence Questionnaire Statistical Analysis Results
The presence questionnaire measures perception of involvement and immersion in a virtual environment. It was reported that the mean score of the 19 items for the validated VE was 98.11, with a standard deviation of 15.78. A high average PQ score (M = 114.13, SD = 4.73, Min = 107, Max = 125) supported the authenticity of the designed experimental environment and the participants' perception of immersion. Table 7 displays the mean score (value range from 1 to 7) and standard deviation for each subscale of the PQ. The involvement score showed that participants could move nimbly within the virtual environment. The immersion score also indicated that the environment responded realistically to the participants' actions. The visual fidelity score established the reliability of the simulator in terms of depth perception and visual clarity [39]. The interface quality scores showed the usability of this simulator for performing assigned tasks [36]. The average score on sound (M = 5.22, estimated separately for items 5, 11, and 12) showed a realistic representation of the auditory modality in the virtual environment.

Interface Rating Scale Statistical Analysis Results
In this part, we designed two questions to obtain subjective evaluations of visual feature types and multimodal eHMIs under static conditions. For the first question, "What visual feature is placed on a driverless vehicle to make you feel safer and more comfortable when you are crossing the street?", the "arrow" symbol was the first choice for most participants. All participants said that having a visual feature made them feel safer and more secure than not having one.
For the second question, "What combination of eHMI on a driverless vehicle makes you feel safer and comfortable when you are crossing the street?", the average score of the multimodal eHMIs was higher than that of the single-modal eHMIs. The combination of arrow and human voice in the yielding situation received the highest rating, while no eHMI received the lowest. In the not yielding case, most participants thought that the warning sound, whether alone or in combination, made them nervous and uneasy. Some participants commented, "when it makes a beep sound, I began to panic. I don't even understand the intention", or "if it's not going to stop, I hope it can get through, don't need to issue any instructions to me". The overall IRS scores also supported this point.
In interviews at the end of the experiment, most participants reported that the warning sounds made them feel in danger. When the driverless vehicle did not intend to make way for pedestrians, participants wanted it to pass as quickly as possible without giving them extra information, as a traditional vehicle would. In addition, participants reported that, in the virtual reality environment, the visual eHMI design seen through the head-mounted display was very clear.

Effect on Psychological Comfort
Psychological comfort refers to being in a state of wellbeing, thereby reducing stress [15]. For pedestrians, the typical measure of psychological comfort is the subjective expression of ease or pleasantness when crossing, which depends on whether the driverless vehicle offers them reassuring messages. Psychological comfort is experienced when pedestrians feel at ease and have confidence that the driverless vehicle will behave predictably and negotiate with them in a friendly way. There is also extensive literature on pressure from the crossing situation [40,41].
We examined people's preferences among possible eHMI designs that driverless vehicles can use to interact safely and smoothly with pedestrians. The results indicated that participants could clearly distinguish between different design features with respect to perceived intent, and multimodal eHMIs that map the speed and distance information to sounds and flashing patterns were perceived as more efficient than vehicles with no displays. Specifically, the arrow and human voice combination in the yielding case was perceived as the safest. Compared to empirical studies looking to develop new means of driverless vehicle-pedestrian communication, this paper demonstrated a simulator-based approach to designing multichannel interactions presented on the external displays of driverless vehicles. Multimodal eHMIs for a driverless vehicle were designed and evaluated in contextualized scenarios that simulated real-world constraints in pedestrian crossings. We hope that additional information sources on external interfaces will lead to more attentive pedestrians and, eventually, to safer crossings.
In general, the multimodal eHMI does not impose an additional cognitive burden on pedestrians, but transmits useful signals to them from several channels. These results correspond to those previously showing that an HMI should not increase the already complex condition of communication in a traffic environment [42]. Most pedestrians are usually distracted and inattentive in the process of crossing the road [43], and the use of multimodal interaction can avoid the failure and low efficiency of single-modal information transmission.

Effect on Feature Evaluating
From the perspective of physical modality, the movement of driverless vehicles, such as approach speed, plays a fundamental role in the safety of pedestrians on the road. In traditional interactions between pedestrians and vehicles, pedestrians rely on approach speed and gap to judge both the awareness and the intent of the driver [21,44,45]. Our findings indicate that physical modality, including vehicle movement patterns, will continue to be a significant cue in interactions between driverless vehicles and pedestrians, even in the presence of eHMIs. On the one hand, pedestrians prefer driverless vehicles that yield to them. On the other hand, if the driverless vehicle does not intend to negotiate with pedestrians, the information transmitted by the eHMI is often meaningless. The results for the multimodal eHMI under the two physical modalities are completely opposite, which indicates that pedestrians respond differently to different vehicle motion patterns under different driving intentions. As a result, in the field of intelligent vehicles, engineers and designers should classify and grade the physical modalities of driverless vehicles and design the corresponding eHMI for better communication with other traffic participants.
From the perspective of visual modality, the arrow symbol is more directional and intuitive than the smile symbol. The arrow is a very common traffic sign on the road and, generally, there is no language barrier to recognizing it, which can provide a familiar and comfortable feeling for pedestrians. By contrast, the smile was a less effective feature because it gave participants no clear information about the vehicle's expected behavior. In interviews after the experiment, participants said they had some doubts about whether the car was thanking them for not crossing the road or trying to persuade them to do so. These results are consistent with findings from previous research regarding the preference of pedestrians for smileys: pedestrians do not consider the smile, alone or in combination, to be the best option [18,19]. We suggest that features applied to eHMIs should be designed with familiar elements (shape, color, etc.) to enhance trust in driverless vehicles during the transition period. However, the current literature does not provide a consistent answer regarding color types for visual features. Red front brake lights [46], blue lights [47], and green lights [48] have been proposed to convey the information that the vehicle is slowing down. There should be a uniform standard for the use of color in the eHMI for driverless vehicles. Regardless of the technology, there is an inevitable drawback to deploying eHMIs in real life: light and weather conditions can compromise the visibility and readability of the information.
From the perspective of auditory modality, the statistical results show that the human voice can make pedestrians feel safe. A possible reason is that the human voice gives pedestrians a sense of familiarity and intimacy, and the information conveyed by the human voice is relatively straightforward. By contrast, the rapid and hurried warning sound alarms pedestrians, which suggests that participants prefer a positive feature design to a negative one. The auditory modality provides an additional means of interaction for the visually impaired and those whose vision is compromised to some extent, but it may cause noise pollution in a complex traffic environment.
As driverless vehicles become more common and more widely deployed, these considerations may help guide engineers and designers who hope to work on standardizing some of the above factors.

Limitations
The pedestrian simulator scenario developed here focuses only on a two-way road without traffic lights. In the real world, however, there are many other scenarios where interactions between pedestrians and driverless vehicles could occur, such as intersections with traffic lights, parking lots, and so on. The ways and intentions of communication may vary from scenario to scenario. In addition, considering that too many environmental variables may distract the participants, we studied only the interaction between one driverless vehicle and one pedestrian. In future studies, more specific interaction scenarios can be added to study pedestrian behavior under different circumstances. Finally, the design of the eHMI for the driverless vehicle will be carried out under multi-scene and multi-dimension conditions in a complex traffic environment.

Conclusions
Although the introduction of driverless vehicles may bring a lot of convenience to people's lives, the trust issue will be around for a long time, especially among pedestrians, who are one of the vulnerable groups in terms of road safety. In this paper, a multimodal external human-machine interface was put forward from visual, auditory, and physical perspectives to promote better interaction between the driverless vehicle and a pedestrian. Then, a confirmatory experiment was conducted in a virtual environment. The results show that the multimodal eHMI can provide pedestrians with a higher sense of security, provided that the driverless vehicle yields to pedestrians. In the comparative study of the two visual features, the participants preferred the "arrow" symbol. When driverless vehicles do not intend to yield, they should minimize the transmission of additional information and pass through as soon as possible, so as to avoid confusion and tension for pedestrians. In general, future driverless vehicles should be equipped with an eHMI to communicate and exchange information with the surrounding traffic participants, so as to improve interaction efficiency on the road and the safety of pedestrians. Automobile manufacturers and designers should take these results into account to set up safety management information systems for driverless vehicles and to work toward unified standards.