“Just One Short Voice Message”—Comparing the Effects of Text- vs. Voice-Based Answering to Text Messages via Smartphone on Young Drivers’ Driving Performances

Despite the well-known distracting effects, many drivers still engage in phone use while driving, particularly texting, and especially among young drivers, with new messaging modes emerging. The present study examines the effects of different answering modes on driving performance. Twenty-four students (12 females), aged between 19 and 25 years (M = 20.83, SD = 1.53), volunteered for the study. They completed the Lane Change Task (LCT) with baseline and dual-task runs in a driving simulator. In the dual-task runs, participants answered text messages on a smartphone by voice or text input with varying task complexity. Driving performance was measured by lane deviation (LCT) and subjective demand by the NASA-TLX. Across all trials, driving performance deteriorated during dual-task runs compared with the baseline runs, and subjective demand increased. Analysis of the dual-task runs showed a benefit of voice-based answering to received text messages that leveled off in the complex task. All in all, benefits of voice-based over text-based answering were found in both driving performance and subjective measures. Nevertheless, this benefit was mostly lost in the complex task, and neither the driving performance nor the demand measured in the baseline conditions could be reached.


Introduction
Smartphones have become ubiquitous in our everyday lives and are used in many situations, even in situations where their use is inappropriate, like driving. Research has shown the numerous detrimental effects of phone use while driving. More than 40 years ago, the first adverse effects were found [1]. Recent meta-analyses show effects on reaction time, lane keeping, headway and speed in both simulator and field studies [2][3][4]. The effects of phone use on driving seem comparable to drunk driving at the legal limit when the time on task is controlled [5,6]. With advancing technology, texting gained popularity as another potential distracting feature of phone use.
Similarly, texting while driving became an important research field, with meta-analyses showing effects on reaction times, collisions, lane keeping, speed and headway [7]. Impairments are even more significant for texting than for phone calls [8,9]. Although drivers recognize the distraction of texting and adapt by choosing longer headways, risky driving behavior such as increased glances off the road has still been shown [10,11]. Although aware of the increased risk of road traffic crashes, many drivers engage in phone use while driving, even though manual phone use is illegal in many countries [12,13]. Phone use and distracted driving are still relevant today, and distracted driving will become even more relevant with the increasing number of infotainment systems and advanced driver assistance systems in modern vehicles. Additionally, in this study, the task complexity of the secondary task of text- and voice-based answering is varied to better understand the effect of workload on driving performance, which has rarely been investigated before [9,35].
It is expected in this study that engaging in either secondary task, text- or voice-based answering, will lead to increased attentional demand and thus to impaired driving performance. Voice-based answering should lead to less impaired driving performance than text-based answering, as the two tasks use different modalities according to Wickens' multiple resource theory [36]: voice-based answering relies more on auditory attention, whereas text-based answering relies primarily on visual attention. Higher answer complexity should lead to more impaired driving performance, as increased task demands compete for attentional resources.

Design
Two within-person independent variables were realized to investigate possible detriments and to compare the effects of voice- and text-based answering on driving performance. Answering mode (text vs. voice input) served as the first independent variable, and task complexity (simple vs. complex task) as the second. Both variables were manipulated in a 2 × 2 factorial design, crossing each level of answering mode with each level of task complexity for a total of four conditions (text and simple task, voice and simple task, text and complex task, and voice and complex task). Driving performance and subjectively perceived demand served as dependent variables. A repeated-measures within-subject design was used to control for individual differences in driving performance.
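As an illustration (the condition labels are ours, not taken from the study materials), the 2 × 2 crossing of the two factors can be enumerated programmatically:

```python
from itertools import product

# The two within-person factors of the 2 x 2 design.
ANSWERING_MODES = ("text", "voice")
TASK_COMPLEXITIES = ("simple", "complex")

# Crossing every level of one factor with every level of the other
# yields the four dual-task conditions.
CONDITIONS = [
    f"{mode}/{complexity}"
    for mode, complexity in product(ANSWERING_MODES, TASK_COMPLEXITIES)
]
# CONDITIONS == ["text/simple", "text/complex", "voice/simple", "voice/complex"]
```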

Participants
A total of 24 students (12 females) with a mean age of 20.83 years (SD = 1.53), ranging from 19 to 25 years, participated in this study. All participants were frequent drivers, driving at least once a week and on average 5802 km (SD = 4043) per year, and had held a driver's license for at least 2 years. All participants were frequent smartphone and instant messaging users, using both daily, and were accustomed to voice messaging. All had normal or corrected-to-normal vision, were right-handed and were novices to the LCT. No participant reported instances of simulator sickness. All participants were university students and participated voluntarily. All participants were able to abort the study at any time and were debriefed afterward. All subjects gave their informed consent for inclusion before they participated in the study. The study was conducted under the Declaration of Helsinki and the Ethical Principles and Protocol Code of the Federation of German Psychologists Associations (adapted from the American Psychological Association (APA) Code of Ethics). The Ethik-Kommission der Deutschen Hochschule der Polizei (Ethics Committee of the German Police University) provided post hoc approval of the study (approval number DHPol-EthK.2020.Su1).

Messaging Task
The messaging task was carried out on a Samsung Galaxy S II GT-I9100 with a 4.3" display and a resolution of 480 × 800 pixels running Google Android version 4.1.2. The smartphone was fixated on the dashboard within easy reach for all participants to ensure comparability across trials. All tasks were completed within the instant messaging app WhatsApp version 4.12.449. Messages were sent to the smartphone by the experimenter. The texting task was adapted from Burge and Chaparro [38]. Participants received text messages containing letter strings made of five random letters (e.g., MZPAL). They were told to answer the text messages in four different ways, varying the answering mode and task complexity. In the simple task, participants had to copy the text messages by repeating the letters in order and sending them back. In the complex task, they were asked to sort the letters in alphabetical order before sending them back. Secondly, messages were answered either via text, by typing on the touch-screen keyboard, or via voice input, by recording the message. Text messages were sent in capital letters but answered in lowercase letters, as pretests showed this to be more comfortable for participants. Participants were told to answer the messages as soon as they received them and to respond correctly. The following message was sent as soon as the participants had answered. This way, the tasks were completed continuously during runs with just the network communication delay in between.
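To make the task concrete, here is a minimal sketch (our own illustration, not the study's software) of how such stimuli and the expected correct replies can be generated:

```python
import random
import string

def make_stimulus(rng: random.Random, length: int = 5) -> str:
    """Draw a random uppercase letter string such as 'MZPAL' (as sent)."""
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(length))

def expected_answer(stimulus: str, complexity: str) -> str:
    """Correct lowercase reply: letters copied in order for the simple
    task, sorted alphabetically for the complex task."""
    letters = stimulus.lower()
    return "".join(sorted(letters)) if complexity == "complex" else letters

# expected_answer("MZPAL", "simple")  -> "mzpal"
# expected_answer("MZPAL", "complex") -> "almpz"
```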

Lane Change Task
Driving performance was measured by the mean deviation in meters. As recommended by ISO 26022 [39], the adaptive model was used. In the adaptive model, the participants' baseline is individually derived from actual baseline runs instead of a normative baseline for every participant. It is advantageous compared with the basic model because it discriminates more effectively and allows better comparability between participants by considering individual driving styles [40,41]. The LCT track between the start sign and the last lane change sign was used for analysis, resulting in a total driving time of 160 s.
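As a simplified sketch of the metric (ISO 26022 specifies the exact reference-path construction; this illustration shows only the averaging step), the mean deviation compares the driven lateral position against a reference path sampled at the same track positions:

```python
def mean_deviation(driven, reference):
    """Mean absolute lateral deviation (in meters) between the driven
    lateral positions and a reference path at the same track positions.
    In the adaptive model, the reference is derived from the participant's
    own baseline runs rather than from a normative path."""
    if len(driven) != len(reference):
        raise ValueError("paths must be sampled at the same positions")
    return sum(abs(d - r) for d, r in zip(driven, reference)) / len(driven)

# A driver 0.5 m off the reference everywhere has a mean deviation of 0.5.
```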

NASA-TLX
The NASA-TLX [42] was used to measure subjective demand, with scores summed up for each dual-task run and the baseline run of the LCT (see Figure 3). After each dual-task run and after the first baseline run, participants were asked to fill in the questionnaire and to rate the subjective demand of the previous run according to their perception. The raw version of the NASA-TLX was used because it has been shown to generate results comparable to the weighted version [43] while requiring less time.
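The raw TLX simply drops the pairwise weighting step and aggregates the six subscale ratings directly; summed per run as in this study, that amounts to the following (subscale names per the NASA-TLX; the dictionary layout is our illustration):

```python
# The six NASA-TLX subscales, each rated by the participant after a run.
SUBSCALES = ("mental demand", "physical demand", "temporal demand",
             "performance", "effort", "frustration")

def raw_tlx(ratings: dict) -> float:
    """Raw (unweighted) TLX score: the plain sum of the six subscale
    ratings, skipping the pairwise-comparison weighting of the original
    procedure."""
    return sum(ratings[s] for s in SUBSCALES)
```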



Procedure

Figure 3 displays the procedure of the experiment. First, potential participants were informed about the experimental goals and procedures. If the participants provided their informed consent, they were familiarized with the setup and received instructions for the different tasks. Second, participants received task instructions and a training session to ensure familiarity with the setup and tasks. The training sessions contained the messaging task, the LCT and the dual-task condition (messaging task and LCT simultaneously). Third, baseline runs were conducted for the messaging task and the LCT. The order of the letter strings that had to be repeated while answering messages was standardized for each task condition. The order of blocks for the baseline runs was randomized. The baseline runs allowed comparability with performance during the dual-task condition.

The training and baseline runs were followed by the experimental dual-task condition, with participants actually answering while driving. The dual-task runs were completed in a blocked design, with one condition per run. This meant participants either answered by text or voice message and worked on only simple or only complex messages while driving. According to ISO 26022 [39], participants were instructed to complete the lane changes quickly and efficiently and to complete the driving and messaging tasks to the best of their ability, emphasizing performance on both. The order of dual-task runs was randomized. The order of letter strings and the order of the tracks were standardized for each condition to ensure the same requirements across participants. Finally, after completing the dual-task runs, another baseline run for the LCT was conducted to control for learning effects. To test for sufficient familiarity with the LCT, performance in the baseline runs had to be under a mean deviation of 1.2 in the basic model according to the ISO 26022 [39] requirement. The experiment ended with a questionnaire (i.e., demographics, driving, smartphone experience and usage) and each participant's debriefing. After each of the four dual-task runs and after the first baseline run, participants completed the NASA-TLX to measure subjective demand.
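The run-order logic can be sketched as follows (function name and seeding scheme are our illustration, not the study's software): the four dual-task runs are shuffled per participant, while the stimuli within each condition stay fixed so that requirements are identical across participants.

```python
import random

CONDITIONS = ["text/simple", "text/complex", "voice/simple", "voice/complex"]

def dual_task_run_order(participant_id: int):
    """Randomize the order of the four blocked dual-task runs for one
    participant; the letter strings within each condition remain
    standardized across participants."""
    rng = random.Random(participant_id)  # reproducible per participant
    order = CONDITIONS.copy()
    rng.shuffle(order)
    return order
```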

Results
Unless stated otherwise, differences between conditions were analyzed via repeated-measures ANOVA and multiple t-tests with Bonferroni correction. Differences between the dual-task and baseline runs were analyzed via ANOVA, followed by analyses of the differences between the dual-task runs only with another ANOVA. Furthermore, data were screened for outliers via z-transformation. One participant's dataset was excluded from analysis for having outlying values in several runs of the LCT. The values of the remaining 23 participants were included in the analyses. Figure 4 shows an overview of the driving performance and subjective demand during dual-task runs.
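The z-transformation screening can be sketched as follows (the study does not report its exclusion cutoff; the default of 3 below is a common convention, not taken from the paper):

```python
import statistics

def outlier_flags(values, cutoff=3.0):
    """Standardize the values and flag those whose absolute z-score
    exceeds the cutoff."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [abs((v - mean) / sd) > cutoff for v in values]
```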

Table 1 shows a summary of the means and standard deviations for driving performance and the subjective measures. The mean of the two baseline runs was used as the baseline in the following analyses. Differences in the mean deviation across all runs, the four dual-task runs and the baseline, were analyzed via repeated-measures ANOVA. Since Mauchly's test indicated a violation of the assumption of sphericity (χ²(9) = 30.58, p < 0.001), the degrees of freedom were corrected using Greenhouse-Geisser estimates of sphericity (ε = 0.60 [44]). The omnibus F-test showed that the mean deviation differed significantly across runs (F(2.
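The Greenhouse-Geisser correction simply scales both degrees of freedom of the repeated-measures F-test by ε; with the five runs (four dual-task runs plus the baseline), 23 participants and the reported ε = 0.60, the corrected numerator df is about 2.4. A sketch:

```python
def gg_corrected_dfs(k: int, n: int, epsilon: float):
    """Greenhouse-Geisser correction: multiply both ANOVA degrees of
    freedom by epsilon. k = number of repeated-measures levels,
    n = number of participants."""
    df_effect = epsilon * (k - 1)
    df_error = epsilon * (k - 1) * (n - 1)
    return df_effect, df_error

# Five runs, 23 participants, epsilon = 0.60 -> df of roughly (2.4, 52.8)
```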
A factorial repeated-measures ANOVA for just the dual-task runs showed a significant main effect of answering mode (F(1,22) = 27.87, p < 0.001, η² = 0.22) and a significant main effect of task complexity (F(1,22) = 92.7, p < 0.001, η² = 0.43). On average, participants perceived driving while sending voice-based answers (M = 59.35, SD = 22.14) to be less demanding than while sending text-based answers (M = 75.46, SD = 18.18), and working on the simple task (M = 56.04, SD = 18.88) to be less demanding than working on the complex task (M = 78.77, SD = 18.34). Furthermore, a significant interaction effect was found (F(1,22) = 12.96, p = 0.002, η² = 0.03). Post hoc comparisons showed that subjective demand differed significantly between voice- and text-based answering when working on the simple task (t(22) = 6.52, p < 0.001, d = −1.33) as well as during the complex task (t(22) = 2.87, p = 0.009, d = −0.61). When sending text answers, subjective demand differed significantly between the simple and complex tasks (t(22) = −9.31, p < 0.001, d = 1.01), as it did when sending voice answers (t(22) = −7.86, p < 0.001, d = 1.97). Table 2 shows a summary of the means and standard deviations for answering performance. Differences in the number of responses between the dual-task runs and the baseline runs were tested via repeated t-tests and corrected using Bonferroni adjustment. During the dual-task runs, messaging performance decreased significantly compared with the baseline runs (all p < 0.001, except for voice-based answering during the simple task with p = 0.002), with participants writing fewer answers while driving. Differences in the number of erroneous answers between the dual-task and the baseline runs were analyzed in the same way. Similarly, answering performance decreased during dual-task runs compared with the baseline runs, with participants writing more erroneous answers.
The comparisons were significant for text-based answers during both the simple (p = 0.018) and complex (p = 0.032) tasks, but for voice-based answers, significance was found only during the complex (p = 0.001) and not the simple (p = 0.487) task.
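The Bonferroni adjustment used for these comparisons amounts to multiplying each p-value by the number of tests (capped at 1), or equivalently testing each against α/m. A minimal sketch:

```python
def bonferroni(p_values):
    """Bonferroni-adjust a family of p-values: multiply each by the
    number of tests and cap the result at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# With two tests, p = 0.01 adjusts to 0.02.
```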

Discussion
We conducted this study to compare the possible detrimental effects of text- and voice-based answering to text messages on driving performance. Participants answered text messages of varying complexity with different answering modes (text vs. voice) while driving in the Lane Change Task. Driving performance and subjective measures were obtained. Driving performance in terms of mean deviation differed significantly from the baseline runs in all dual-task runs. As expected, the increased demand of the secondary task led to an objective decrease in performance in the primary task of driving. The increased demand was also visible in the subjective dimension, in the higher NASA-TLX scores for dual-task runs. Driving while sending answers was perceived as more demanding than baseline driving. This effect was found for all conditions of the answering tasks (i.e., text- and voice-based answers and simple and complex tasks).
When analyzing just the dual-task runs, it can be seen that both answering mode and task complexity had individual impacts on driving performance. Typing a text-based answer led to more significant impairment than recording a voice-based answer, and sorting the answer content alphabetically beforehand (complex task) led to greater impairment than copying it (simple task). Regarding effect sizes, the influence of task complexity was slightly larger than that of answering mode; nevertheless, both effects were substantial. These effects were also reflected in the subjectively experienced demand of the runs, as the NASA-TLX showed. Sending text-based answers and working on the complex task were more strenuous for participants than sending voice-based answers and working on the simple task. When looking at driving performance, the interaction of answering mode and task complexity also played an important role. In general, using voice input was less disadvantageous than text-based answering while still remaining inferior to baseline driving. This difference was more prominent in the simple task condition but was largely lost in the complex task condition. The increase in mean deviation in the LCT from the simple to the complex task was more pronounced when using voice-based answering than when using text-based answering. Voice input showed fewer disadvantages in driving performance, but this difference nearly leveled off when facing a more complex task.
Despite some clear evidence, the current study also faces some limitations that should be addressed in further research. The first limitation is the LCT's inability to effectively distinguish between the visual and cognitive demands of the task. The effects of the complex task could stem from longer gaze times instead of higher cognitive demand. Follow-up studies could use a more fine-grained analysis of the LCT data, which was not available for this study but was used by Engström and Markkula [45]. Furthermore, the LCT in a simulator setting can be seen as entirely artificial (i.e., not taking traffic or speed into account). Unlike in naturalistic driving, participants worked continuously on the task and could not freely choose when to answer. Jamson et al. [22] mentioned that choosing when to answer is a benefit of text messaging compared with a phone call, since messaging can be interrupted more quickly than a phone call.
Further research in more naturalistic settings is necessary to examine the performance and the choice of different answering modes in a complex driving environment, as Peng, Boyle and Hallmark [29] did by looking at driver inattention. As only young drivers participated in the present study, effects on other age groups should be addressed by further research. Recent studies on texting while driving have shown that older drivers might be more negatively affected, as young drivers seem more experienced with texting [46], while distracted driving is a higher risk factor for young drivers [47].
Regardless of the limitations, evidence shows that voice-based answering can be less disadvantageous to driving performance than text-based answering. Nevertheless, it is essential to underline that baseline performance without a secondary task could not be achieved. As in earlier studies looking at different metrics [21,27], driving performance in lane keeping was similarly impacted. This finding is also in line with previous research showing that voice-based systems are helpful compared with manual control while still impairing driving performance [18,20,48]. Especially when the effect of task complexity is considered, it can be seen that voice-based solutions have limitations. When the verbal interaction gets too complicated, the advantage of using a voice system is mostly lost. Voice-based answering can thus not be seen as a safe alternative to text-based answering while driving: both secondary tasks impair driving performance to a safety-critical extent. As young drivers are especially prone to texting while driving [25,49], their driving behavior will probably be impacted. As He et al. [27] stated, voice-based messaging is not a panacea. Therefore, it can be emphasized again that messaging, and especially texting, while driving should be avoided at all times.

Conclusions
The objective of this study was to compare the effects of text-and voice-based answering to text messages on young drivers' driving performance. The effects of different answering modes (voice vs. text) could be clearly distinguished in terms of driving performance and subjective demand reported by participants. While answering via voice was less disadvantageous and reported as less demanding than text-based answering, baseline driving performance and demand could still not be reached, showing again that messaging while driving, independent of answering mode, impacted the driving performance of young drivers and should be avoided.
In more general terms, the study's results might guide road safety measures in engineering, education, enforcement and public relations. Concerning engineering, the design of the driver-device interaction in manually driven vehicles is essential for increasing vehicle safety. Vehicle manufacturers should introduce fully voice-based and hands-free navigation, infotainment and communication systems as standard features in upcoming vehicles. Those systems unfold their full safety potential the simpler and more intuitive the driver-device interaction is. Furthermore, the distracting effects of manual-visual control and the negative consequences on driving performance were severe and likely underestimated in the present simulator study. To further increase vehicle safety, any manual-visual driver-device interaction should be reduced to the essentials to prevent visual and manual distractions. It seems worthwhile to evaluate traditional manual-visual driver-vehicle interfaces and find speech-based or other (neither manual nor visual) substitutes. Further research should be addressed toward alternative designs of driver-vehicle interfaces.
Concerning education, enforcement and public relations, the findings once more emphasize that manual-visual driver-device interaction bears severe risks, even with very simple tasks. The study's findings support the current regulation in many countries of banning the use of handheld devices while driving. The message that manual-visual distraction should be avoided at any time while driving is still valid and gains even more weight in today's technology-driven environment. Stakeholders in road traffic safety, such as public service and road safety organizations, should continue with related public awareness campaigns.