1. Introduction
With the integration of information technology into automotive systems, special vehicles are increasingly adopting automation and intelligent design. Consequently, the main duties of crew members are shifting from manual operations to cognitively demanding tasks [1,2]. Scientific and technological advancements have improved the capabilities of machinery and equipment, thereby amplifying the impact of human limitations on overall system performance. This trend is particularly evident as machines become more intelligent while human operators take on supervisory roles involving monitoring, planning, and diagnosis [3]. Studies reveal that more than 70% of accidents and incidents in typical safety-critical industrial domains are attributable to human error [4,5]. The design of the interaction modality within the crew cabin of special vehicles directly affects crew performance and is closely related to system safety. In recent years, human–machine interaction in special vehicle cabins has transitioned from traditional mechanical interfaces to more intelligent modalities, such as touch-based, speech-based, and multi-modal hybrid interactions [6]. Consequently, research on novel interaction modalities has received considerable attention in the human factors design of crew cabins.
Naturalistic interaction modalities, such as touch, speech, and multi-modal interactions, are being actively explored. Touch interaction enables rapid feedback [7] and fully functional, intuitive interfaces within confined operational environments [8,9]. Speech interaction, on the other hand, offers a remote and contactless interface [9,10]. It allows users to select functions through natural verbal commands, eliminating the need for hierarchical menu structures and explicit navigation [11,12], while also enhancing emotional engagement [9,13]. Previous studies have shown that the adaptability of touch and speech modalities varies across task contexts, indicating that each modality has its own strengths [14,15,16,17]. Multi-modal interaction technologies allow operators to employ multiple input channels simultaneously, which overcomes the limitations of unimodal interaction and holds favorable potential for improving the efficacy of human–machine interaction [18,19,20]. For example, in applications such as smartphone control [21], personal health data exploration [12], and web-based visualization tools [6], researchers have observed that multi-modal interaction yields superior operational efficiency and user satisfaction compared with unimodal interaction.
Numerous studies have undertaken empirical comparisons between touch and multi-modal interactions. For instance, Dudek and Schulte [10] found that when participants were free to choose the interaction modality, the average task assignment time was shortened. Another user study revealed that introducing speech input renders multi-modal interaction more accurate and faster than typical touch interaction [22]. In addition, multi-modal interactions incorporating speech and location input have the potential to reduce cognitive workload. In a selection and system control test, Zimmerer et al. [23] found that combining touch and speech interactions produced a lower cognitive workload than touch-only unimodal interaction. However, other studies have suggested that the effectiveness of multi-modal interaction is not necessarily superior to that of unimodal interaction. For example, in a simulated manned–unmanned teaming monitoring system in the Black Hawk helicopter environment, Levulis et al. [20] observed no significant difference in task performance between the multi-modal and touch interactions. Integrating speech interaction has been found to reduce the physical demands associated with specific interface inputs and to provide unique flexibility [20]. Nevertheless, the effectiveness of multi-modal interaction may depend on specific task demands, user expectations, and the context of use. Although advances in intelligent and information-driven system design are driving the adoption of touch and multi-modal interaction as the future primary human–machine interaction modalities in the crew cabins of special vehicles [9,24], whether multi-modal interaction yields better task performance and lower workload than touch interaction for a typical special vehicle crew remains unknown.
Given the increasing complexity of tasks encountered by special vehicle crews, it is essential to further clarify how various interaction modalities influence crew performance under varying task demands, particularly under high-complexity conditions. Knowledge of the role of task complexity in determining task performance remains insufficient [25]. While several studies have demonstrated that increasing task complexity typically raises demands on cognitive resources, which in turn degrade task performance and increase workload [26,27], others argue the opposite [28]. The adoption of suitable interaction modalities has been shown to enhance performance under conditions of high task complexity, while the effectiveness of a given modality may depend on specific task characteristics, as different modalities align differently with various operational demands [10,18]. Specifically, task complexity can be modulated through different information presentation channels, such as visual, auditory, and tactile modalities, each exerting distinct cognitive demands. When the interaction modality shares a sensory channel with the task, for example, speech interaction combined with auditory tasks, conflicts may arise due to competition for processing resources. Therefore, testing the applicability of specific interaction modalities under different task complexity conditions is essential. In special vehicle cabins, crew members frequently use headphones for speech communication during operations. This communication style alters the task complexity imposed through the auditory channel; nevertheless, it is typical and expected in practice. Because the requirement to maximize efficiency becomes more critical in complex tasks, it is of practical significance to modulate task complexity by adding an auditory information processing task and to verify how introducing speech interaction affects the performance of a special vehicle crew in such task-specific settings.
Moreover, the methodologies used to evaluate the effectiveness of different interaction modalities in complex operational contexts warrant further refinement. Previous studies typically assessed the effectiveness of interaction modalities through performance measurement alone, rarely including physiological assessment for a more comprehensive analysis. However, eye responses have a distinct advantage in revealing differences in the efficiency of information processing [29,30]. As task complexity varies, certain eye movements change [26,27]. Additionally, introducing speech interaction, compared with employing touch interaction alone, may reduce the need for visual saccades during multi-modal interaction [20]. Under high-complexity tasks, multi-modal interaction may offer advantages over touch interaction that are reflected in specific eye movement features. Typical eye response metrics include peak saccade velocity [31,32], fixation entropy [27], the nearest neighbor index (NNI) [33], and mean pupil diameter. Therefore, studies are required to determine the sensitivity of eye response metrics to various interaction modalities and task complexities in special vehicle crew tasks.
Although studies on the effects of typical interaction modalities and task complexity on performance have been conducted in several civil domains (e.g., car driving and smartphone control), pertinent research in the field of safety-critical equipment development remains limited [34]. Against this background, the current study addresses the following gaps. First, pertinent studies have found that multi-modal interaction may offer benefits over touch interaction in augmenting task performance and reducing workload; however, whether these benefits apply to the typical human–machine interaction tasks of a special vehicle crew is unclear. Specifically, further research is required to determine whether the benefits of multi-modal interaction over touch interaction continue to hold given the task complexity presented through various information processing channels. Moreover, with regard to the human–machine interaction tasks of a special vehicle crew, the statistical disparities that eye response metrics may indicate, and the differentiated information processing characteristics they reveal, are not yet well understood. In this study, using a high-fidelity special vehicle simulation platform, 20 experienced special vehicle drivers or test drivers completed an ergonomic experiment involving typical planning tasks for special vehicle crews. The purpose is to investigate the effects of the traditional touch modality and a novel interaction modality introducing speech interaction on crew performance (including task performance, subjective workload, and eye response) in low-complexity and high-complexity tasks. Considering the foregoing background, the following hypotheses are proposed.
Hypothesis 1 (H1).
The introduction of the speech interaction modality will result in better task performance and lower subjective workload compared with the traditional touch interaction modality, regardless of task complexity (low, involving information processing through a single visual channel, or high, involving additional auditory channel information processing).
Hypothesis 2 (H2).
A high-complexity task with the addition of an auditory channel task will result in worse task performance and higher subjective workload, as compared with a low-complexity task with a single-channel visual task, regardless of the interaction modality.
Hypothesis 3 (H3).
Different interaction modalities and task complexities will result in variations in operators' attentional strategies and workload states, as reflected by particular eye response metrics.
2. Methods
2.1. Participants
Twenty special vehicle drivers or test drivers with extensive operational experience were recruited, of whom eighteen provided valid data. Two participants were excluded: one participant's eye movement data were not collected because of equipment failure, and another participant's performance data did not meet the training criterion. All participants were male, aged 23 to 46 years (mean = 32.17, SD = 7.30). All were in good health, right-handed, free of color blindness, with normal or corrected-to-normal vision and normal hearing, and all had slept at least 8 h the night before the experiment. Before the experiment, participants were informed of the rules and procedures and signed a written informed consent form. The study was conducted in accordance with the Declaration of Helsinki and approved by the Biological and Medical Ethics Committee of Beihang University (approval number BM20230003).
2.2. Experimental Platform and Equipment
The experiment was conducted on a new high-fidelity platform for simulating special vehicle tasks. The setup comprises four parts: hardware, main control, task simulation, and real-time data recording systems. The hardware of the simulation platform consists of multiple touchscreen displays, control components, a simulator host, a server, and other devices. The main control system selects among crew tasks and terminal display control interfaces and sets the corresponding initial task parameters. The task simulation system consists of a control simulation component, an integrated display and control terminal, and an embedded control and information terminal software system; it simulates high-fidelity tasks and mediates the exchange of human–machine information and control. The real-time data system includes a background information terminal recording module whose primary function is to gather and record human–machine interaction behaviors; it also records the precise timing of particular operations, such as touch and speech inputs. The Tobii Pro Glasses 2 wearable eye-tracking system (Tobii Technology, Stockholm, Sweden) was used to collect participants' eye responses. The system samples at 50 Hz with an accuracy better than 0.5°; its eye-tracking range is 82° horizontally and 48° vertically, the scene camera resolution is 1920 × 1080, and the frame rate is 25 Hz. Calibration was completed using the one-point calibration method.
2.3. Experimental Design
Two-factor repeated measures with two task complexities × two interaction modalities are used in the experimental design, with each participant completing four distinct experimental conditions. The first independent variable describes the interaction modality of a special vehicle information display interface, including two levels: the traditional touch interaction and the novel touch–speech multi-modal interaction. Participants can use either the touch modality or hybrid touch–speech multi-modal interaction for inputting the same target task. Note that the number of steps required to complete the same task in the multi-modal interaction in which speech is introduced is less than that in the touch interaction. This indicates that a certain manual touch operation has been replaced by an automatic system backstage process according to system recognition and speech input. Additionally, the second independent variable, task complexity, can encompass multiple dimensions such as information presentation channels and time pressure. Given the prevalence and effectiveness of information transmission through various channels in specialized vehicle operations, this study manipulates task complexity by varying information processing demands across multiple presentation channels [
35,
36]. Specifically, the low-complexity task solely involves processing visual channel information, whereas the high-complexity task involves processing an additional auditory channel information based on the low-complexity task. As shown in
Figure 1, participants respond to the typical task presented by the visual channel in the low-complexity task condition, and recognize and provide feedback to the typical visual input and the audible warning input in the high-complexity condition. To balance the different levels in the repeated measures design and eliminate the effects of exercise and fatigue, the experimental sequence is arranged using a Latin Square design.
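To make the counterbalancing concrete, the following minimal Python sketch generates a 4 × 4 balanced Latin square over the four conditions. The condition labels match those used in Section 3, while the assignment of rows to participants is an illustrative assumption rather than the study's documented protocol.

```python
# Illustrative counterbalancing sketch: a 4x4 balanced Latin square in
# which every condition occupies each ordinal position once and each
# ordered pair of adjacent conditions occurs exactly once across rows.
CONDITIONS = ["T_L", "T_H", "N_L", "N_H"]

def balanced_latin_square(conditions):
    """Rows of a balanced Latin square (standard construction, even n)."""
    n = len(conditions)
    rows = []
    for r in range(n):
        order, j, k = [], 0, 1
        for i in range(n):
            if i % 2 == 0:            # even slots walk forward: r, r+1, ...
                idx = (r + j) % n
                j += 1
            else:                     # odd slots walk backward: r-1, r-2, ...
                idx = (r + n - k) % n
                k += 1
            order.append(conditions[idx])
        rows.append(order)
    return rows

# Hypothetical assignment: participant p runs row p mod 4 of the square.
for p, order in enumerate(balanced_latin_square(CONDITIONS), start=1):
    print(f"sequence {p}: {order}")
```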
The dependent variables comprise three categories of performance metrics: (1) task completion time and number of misoperations for the crew task; (2) the National Aeronautics and Space Administration Task Load Index (NASA-TLX) score for subjective workload [37]; and (3) mean peak saccade velocity [31,38], fixation entropy [27], NNI [33], and mean pupil diameter [39] for eye response. NASA-TLX is considered one of the most widely used subjective measures of workload [40], while eye-tracking measurements are recognized as a physiological method sensitive to workload [41,42]; therefore, both methods are employed in this study. Task completion time refers to the total duration taken by a crew member to complete a given task; the number of misoperations refers to the total count of incorrect actions made by the operator during the task. The NASA-TLX score is derived by combining the rating for each subscale with its corresponding weight to produce an overall workload score [37]. The formula is as follows:

$$\mathrm{Score} = \frac{\sum_{i=1}^{6} W_i R_i}{\sum_{i=1}^{6} W_i}$$

where $W_i$ represents the weight assigned to each of the six subscales, based on the participant's assessment of their relative importance, and $R_i$ is the rating for each subscale.
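As a concrete illustration, the short Python sketch below computes the weighted score just defined; the subscale weights (derived from the standard 15 pairwise importance comparisons, so they sum to 15) and ratings shown are hypothetical values, not data from this study.

```python
# Sketch of the NASA-TLX weighted scoring defined above.
SUBSCALES = ["MD", "PD", "TD", "PE", "EF", "FR"]

def nasa_tlx_score(weights, ratings):
    """Overall workload: weighted mean of the six subscale ratings."""
    total_weight = sum(weights[s] for s in SUBSCALES)  # 15 by construction
    return sum(weights[s] * ratings[s] for s in SUBSCALES) / total_weight

# Hypothetical participant data (weights from pairwise comparisons,
# ratings on the 0-100 scale):
weights = {"MD": 4, "PD": 1, "TD": 3, "PE": 2, "EF": 3, "FR": 2}
ratings = {"MD": 70, "PD": 30, "TD": 65, "PE": 40, "EF": 60, "FR": 35}
print(f"NASA-TLX score: {nasa_tlx_score(weights, ratings):.2f}")
```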
Saccades are rapid eye movements that occur between fixations; the mean peak saccade velocity measures the highest speed at which the eyes move during these swift shifts in gaze. Fixation entropy quantifies the diversity of an individual's gaze behavior, representing the extent to which fixations are distributed or concentrated across a given region or stimulus [33]. The formula for fixation entropy ($H$) is derived from Shannon's entropy in information theory and is given as follows:

$$H = -\sum_{j=1}^{n} p_j \log_2 p_j$$

where $p_j$ represents the probability of a fixation occurring at the $j$-th position within a specific area of interest (AOI), and $n$ denotes the total number of AOIs.
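A minimal sketch of this computation, assuming fixations have already been mapped to AOI labels (the labels below are hypothetical), is:

```python
import math
from collections import Counter

def fixation_entropy(aoi_sequence):
    """Shannon entropy (bits) of the fixation distribution over AOIs."""
    counts = Counter(aoi_sequence)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical fixation-to-AOI assignments for one trial:
fixations = ["map", "map", "menu", "map", "warning", "menu", "map"]
print(f"H = {fixation_entropy(fixations):.3f} bits")
```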
The NNI measures the degree to which fixations are spatially clustered or dispersed, offering insight into the concentration or distribution of gaze within a defined AOI [33]. The NNI is calculated as follows:

$$\mathrm{NNI} = \frac{\bar{d}_{\mathrm{obs}}}{\bar{d}_{\mathrm{exp}}}, \quad \bar{d}_{\mathrm{obs}} = \frac{1}{N}\sum_{k=1}^{N} \min(d_{kl}), \quad \bar{d}_{\mathrm{exp}} = \frac{1}{2}\sqrt{\frac{A}{N}}$$

where $\min(d_{kl})$ represents the minimum distance between point $k$ and its nearest neighbor $l$ (with $l$ ranging from 1 to $N$, and $l \neq k$), $N$ corresponds to the number of fixation points, and $A$ denotes the polygonal area defined by the outermost fixations. Mean pupil diameter refers to the average size of the left and right pupils during the task.
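The following sketch computes the NNI from fixation coordinates under these definitions; the coordinates and the area value are hypothetical stand-ins (in practice, A would be the measured area of the polygon enclosing the outermost fixations).

```python
import math

def nni(points, area):
    """Nearest neighbor index for (x, y) fixation points within area A."""
    n = len(points)
    # Observed mean distance from each fixation to its nearest neighbor.
    mean_observed = sum(
        min(math.dist(p, q) for q in points if q is not p)
        for p in points
    ) / n
    # Expected mean nearest-neighbor distance for a random distribution.
    mean_expected = 0.5 * math.sqrt(area / n)
    return mean_observed / mean_expected

fixations = [(120, 80), (135, 95), (300, 220), (310, 240), (500, 100)]
print(f"NNI = {nni(fixations, area=56000):.2f}")  # <1 clustered, >1 dispersed
```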
2.4. Experimental Tasks and Procedures
Task procedures for typical crew positions were developed in this study through field research and interviews with domain experts (designers, developers, and frontline operators). The four stages and their corresponding operational designs were determined based on this field research and validated by multiple frontline operators, ensuring strong representativeness. The experimental tasks are designed as follows:
Under the low-complexity condition, participants are expected to complete the entire task, consisting of four phases (or subtasks): task decomposition, road planning, perception planning, and strike planning. Participants process the visual channel input presented by the information display interface of the special vehicle simulation platform. In the task decomposition phase, participants check the tasks issued by the main control system, report the task information verbally, and carry out the task decomposition following the prompts of the task guidance module. In the road planning phase, participants search for and enter the start and end points of the target (or enter them by voice) on the display control terminal, following the prompts of the task guidance module, to complete road planning and delivery. In the perception planning phase, participants select specific operation equipment and set its search mode and parameters following the prompts of the task guidance module. In the strike planning phase, participants add targets or assign executive equipment, deploy the equipment, and establish specific parameters for existing targets following the prompts of the task guidance module.
Under the high-complexity condition, participants are required to respond to randomly presented auditory channel warnings in addition to completing the visual channel task. The auditory warnings comprise high voltage, low voltage, high oil quantity, and low oil quantity. Warning messages are delivered through headphones at random intervals averaging three per minute, and, depending on the warning type, participants must implement certain responses as rapidly and accurately as possible; a scheduling sketch is given after this paragraph. The experiment has two phases: training and the formal experiment. In the training phase, participants sign the informed consent form after becoming fully acquainted with the experimental platform and task. Each participant completes four experimental tasks under the different conditions and fills in the NASA-TLX scale after each condition. A rest period of approximately 10 min is scheduled between consecutive experimental conditions, each of which lasts approximately 15 min.
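For concreteness, the sketch below shows one way such a warning stream could be generated. It is an illustrative reconstruction, not the platform's actual software, and the use of exponentially distributed inter-arrival times is an assumption.

```python
import random

WARNING_TYPES = ["high voltage", "low voltage",
                 "high oil quantity", "low oil quantity"]
MEAN_INTERVAL_S = 20.0  # three warnings per minute on average

def schedule_warnings(duration_s, seed=42):
    """Return (onset_seconds, warning_type) pairs for one trial."""
    rng = random.Random(seed)
    schedule, t = [], 0.0
    while True:
        t += rng.expovariate(1.0 / MEAN_INTERVAL_S)  # random gap, mean 20 s
        if t >= duration_s:
            return schedule
        schedule.append((round(t, 1), rng.choice(WARNING_TYPES)))

for onset, kind in schedule_warnings(duration_s=120):
    print(f"{onset:6.1f} s  ->  {kind}")
```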
3. Results
Statistical analysis was conducted using IBM SPSS Statistics 25.0 with a significance level of α = 0.05. The interactions and main effects of the two factors, interaction modality and task complexity, on the assessed indicators were examined using repeated-measures analysis of variance (ANOVA) [43]. When a main effect or interaction was significant, the Bonferroni method [44] was applied for post hoc comparisons and simple effect analysis. The Greenhouse–Geisser correction [45] was used when the sphericity assumption was not satisfied. A minimal analysis sketch is given below.
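For readers reproducing the analysis outside SPSS, the sketch below runs the equivalent 2 × 2 repeated-measures ANOVA in Python with statsmodels; the long-format file and column names are assumptions, and "dv" stands for any of the dependent measures (e.g., task completion time).

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Long format expected: one row per participant x modality x complexity
# cell, with columns: participant, modality, complexity, dv.
data = pd.read_csv("crew_performance.csv")  # hypothetical file name

aov = AnovaRM(
    data,
    depvar="dv",
    subject="participant",
    within=["modality", "complexity"],
).fit()
print(aov)  # F, degrees of freedom, and p for main effects and interaction
```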
3.1. Task Performance Results
Task performance comprises task completion time and number of misoperations. The descriptive results indicate that task completion time decreases gradually across conditions from "T_H" (traditional touch interaction modality with high task complexity presented by an audio–visual dual task; mean ± standard deviation: 1030.78 ± 623.09 s) through "T_L" (traditional touch interaction modality with low task complexity presented by a visual single task; 843.61 ± 250.96 s) and "N_L" (novel interaction modality in which speech interaction is introduced, with low task complexity presented by a visual single task; 480.67 ± 271.45 s) to "N_H" (novel interaction modality in which speech is introduced, with high task complexity presented by an audio–visual dual task; 461.11 ± 202.49 s). Likewise, the number of misoperations decreases gradually from "T_H" (6.78 ± 7.00) through "T_L" (6.22 ± 5.81) and "N_L" (3.72 ± 4.88) to "N_H" (3.39 ± 5.81).
A two-way repeated-measures ANOVA reveals that the interaction between interaction modality and task complexity is not significant for task completion time or number of misoperations (ps > 0.05). The interaction modality has significant main effects on task completion time (F(1,17) = 47.698, p < 0.001, $\eta_p^2$ = 0.737) and number of misoperations (F(1,17) = 13.894, p = 0.002, $\eta_p^2$ = 0.450). Compared with the traditional touch interaction modality, the task completion time of the novel interaction modality in which speech is introduced is significantly shorter (p < 0.001), and the number of misoperations is lower (p = 0.002), as shown in Figure 2. Specifically, the novel interaction modality reduced task completion time by 49.76% (from 937.194 ± 96.004 to 470.889 ± 46.263) and the number of misoperations by 45.29% (from 6.500 ± 1.732 to 3.556 ± 1.052) relative to the traditional touch modality. The main effect of task complexity on task completion time and number of misoperations is not significant (ps > 0.05).
3.2. Subjective Workload Results
The NASA-TLX scale was used to calculate the subjective workload. The descriptive results indicate an upward trend of the NASA-TLX total score in the sequence N_L, T_L, N_H, and T_H, as listed in Table 1.
As listed in Table 2, the two-factor repeated-measures ANOVA indicates that the interaction between modality and task complexity is not significant (p > 0.05). The main effect of interaction modality on the NASA-TLX score is significant (p = 0.001): the NASA-TLX score under the novel interaction modality in which speech is introduced is significantly lower than under the traditional touch interaction modality. The main effect of task complexity on the NASA-TLX score is also significant (p = 0.002): the NASA-TLX score of the low-complexity task is significantly lower than that of the high-complexity task.
The NASA-TLX total and sub-scale scores are shown in Figure 3. Table 2 indicates that the interaction modality has significant effects on all six NASA-TLX sub-scales. The novel interaction modality in which speech is introduced has significantly lower mental demand (MD), physical demand (PD), temporal demand (TD), effort (EF), and frustration (FR) sub-scale scores (ps < 0.05) and a significantly higher performance (PE) sub-scale score compared with the traditional touch interaction modality (p < 0.05). The main effect of task complexity is significant on four NASA-TLX sub-scales (ps < 0.05), the exceptions being the PE and EF sub-scales: the high-complexity task has considerably higher MD, PD, TD, and FR scores than the low-complexity task (ps < 0.05).
3.3. Results of Eye Response
Mean peak saccade velocity, fixation entropy, NNI, and mean pupil diameter are the eye movement metrics chosen for this investigation. As shown in Figure 4, the mean peak saccade velocity exhibits an upward trend in the sequence T_L, T_H, N_H, and N_L (194.89 ± 20.99, 195.86 ± 17.96, 206.53 ± 28.22, and 206.55 ± 39.72 °/s, respectively). Fixation entropy exhibits an upward trend in the sequence N_L, N_H, T_L, and T_H. Similarly, the NNI exhibits an upward trend in the sequence T_L, N_L, T_H, and N_H (0.44 ± 0.07, 0.45 ± 0.09, 0.47 ± 0.07, and 0.49 ± 0.06, respectively). The mean pupil diameter also exhibits an upward trend in the sequence T_L, N_L, N_H, and T_H (3.68 ± 0.44, 3.72 ± 0.43, 3.77 ± 0.41, and 3.77 ± 0.42 mm, respectively).
The two-way repeated-measures ANOVA for mean peak saccade velocity reveals that the interaction effect between interaction modality and task complexity is not statistically significant (p > 0.05). The main effect of interaction modality is significant (F(1,17) = 6.783, p = 0.019, $\eta_p^2$ = 0.285): the mean peak saccade velocity of the traditional touch interaction modality is significantly lower than that of the novel interaction modality in which speech is introduced. Task complexity has no significant effect (p > 0.05). For fixation entropy, no significant interaction occurs between task complexity and interaction modality (p > 0.05). Interaction modality has a significant main effect (F(1,17) = 6.833, p = 0.018, $\eta_p^2$ = 0.287): fixation entropy under the traditional touch interaction modality is considerably higher than under the novel interaction modality in which speech is introduced. Task complexity has no significant effect (p > 0.05). For mean pupil diameter, the interaction between interaction modality and task complexity is not statistically significant (p > 0.05). The main effect of task complexity is significant (F(1,17) = 14.954, p = 0.001, $\eta_p^2$ = 0.468): the mean pupil diameter in the low-complexity task is significantly smaller than in the high-complexity task. Interaction modality has no significant effect (p > 0.05). For the NNI, no significant interaction occurs between task complexity and interaction modality (p > 0.05). Task complexity has a significant main effect (F(1,17) = 10.017, p = 0.006, $\eta_p^2$ = 0.371): the NNI of the low-complexity task is considerably lower than that of the high-complexity task. Interaction modality has no significant effect (p > 0.05).
4. Discussion
To examine the changes in performance (including task performance, subjective workload, and eye response) in typical planning tasks for a special vehicle crew, an experiment was designed with two interaction modalities (traditional: touch interaction; novel: touch–speech multi-modal interaction) under two levels of task complexity (low: visual single task; high: audio–visual dual task). Results revealed that with the introduction of the speech interaction modality, participants demonstrated improved task performance, reduced subjective workload, greater mean peak saccade velocity, and lower fixation entropy under both task complexities compared with the traditional touch interaction modality. Comparing the high-complexity with the low-complexity task, no significant degradation in task performance (task completion time or number of misoperations) was observed, whereas subjective workload was higher, as revealed by higher NASA-TLX total and sub-scale scores (except for the PE and EF sub-scales), and mean pupil diameter and NNI increased. For task performance and subjective workload, no significant interaction effect between interaction modality and task complexity was observed.
After the introduction of the speech interaction modality, under both task complexities, participants demonstrated a shorter task completion time (49.76% reduction), fewer misoperations (45.29% reduction), and lower subjective workload compared with the traditional touch interaction modality. This is manifested by the lower NASA-TLX total and sub-scale scores (MD, PD, TD, EF, and FR) and the higher PE sub-scale score of the novel interaction modality. Therefore, H1 is confirmed. This result agrees with those of Levulis et al. [20] and Dudek and Schulte [10]. Introducing speech interaction into traditional touch interaction for special vehicle crew operations is an exploratory endeavor in engineering practice. Typically, the novel interaction modality in which speech is introduced can combine the benefits of fast feedback and facile correction offered by touch interaction with the object selection and non-contact capabilities of speech interaction [20,46]. Furthermore, considering that the introduction of speech in the novel interaction modality simplifies certain task processes (e.g., triggering automation through speech commands), this inherent characteristic of speech interaction may contribute to the performance improvements. Future research could adopt more refined experimental designs, such as isolating the task simplification effect of speech, to explore this in greater depth. As demonstrated by the lower NASA-TLX scores, the subjective workload of the novel interaction modality in which speech is introduced is lower than that of traditional touch interaction. This is possibly due to the contactless interface of speech interaction [9], which reduces the physical demand associated with frequent upper-limb movements toward the display and lowers the information search cost [20]. It is important to note that, while statistically significant, performance metrics with medium effect sizes still require further validation in real-world and complex scenarios to assess their practical potential.
Compared with the low-complexity single visual channel task, the high-complexity task with the added auditory channel yields higher NASA-TLX total and sub-scale scores (except for the PE and EF sub-scales). However, neither completion time nor number of misoperations increased; that is, H2 is rejected. This demonstrates that no discernible reduction in task performance occurs, although subjective workload increases with the shift from single visual channel processing to audio–visual dual-channel processing. A likely explanation is that the added auditory warning task in the high-complexity condition increases task demand, thereby increasing participants' subjective workload. This result is similar to that of Gulati et al. [28]. However, no distinct degradation in operator performance is found, which is inconsistent with previous investigations in which performance declined [26,27]. The audio–visual dual-channel presentation of the high-complexity task in this study enables operators to coordinate time sharing more easily. Participants may have successfully managed the increased workload by adjusting their strategies (e.g., prioritization). Critically, from the perspective of multiple resource theory, these adaptive behaviors indicate that operators can effectively distribute and share different types of cognitive resources, processing more information without depleting any single resource pool and thereby maintaining performance even under increased subjective demand. Consequently, the decline in task performance observed for the single visual channel complex tasks of previous studies is mitigated here.
Tasks with high complexity are frequently encountered in the engineering applications of special vehicles, particularly when tasks are delivered through multiple channels, such as the auditory and visual channels. Therefore, task complexity is introduced to observe its influence on the main effect of interaction modality. The findings indicate that interaction modality and task complexity have no appreciable interaction effects on task performance or subjective workload measures. Accordingly, whether task complexity is low (single visual channel) or high (with the addition of an auditory channel), introducing speech into the novel interaction modality results in better task performance and lower subjective workload compared with the traditional touch modality. Previous studies report that operators' perceived temporal demands and selection preferences regarding various interaction modalities vary when task complexity is altered [10,47]. However, the findings of this study indicate that even under high task complexity, the advantages of the novel interaction modality with speech interaction over the conventional touch interaction modality on task performance remain evident, and the workload advantage is likewise preserved. These findings are not consistent with those of previous studies [26,27], although the results on subjective workload are similar to those of J. Lee et al. [48]. According to multiple resource theory [49], the audio–visual dual-channel presentation used in this study enables better time sharing, preventing the degradation of the performance and subjective workload advantages of the novel interaction modality over the traditional modality under high task complexity. This implies that even under high task complexity, the novel interaction modality in which speech interaction is introduced can continue to serve as a primary design choice to maintain satisfactory crew task performance and subjective workload.
The examination of eye response metrics indicates that the novel interaction modality in which speech is introduced yields a greater mean peak saccade velocity and lower fixation entropy than the traditional touch interaction modality. Reduced fixation entropy typically denotes a more regular and systematic visual exploration strategy [50,51] and is related to reduced subjective workload [52]. Meanwhile, higher peak saccade velocity is typically associated with lower attentional demands and subjective effort [53]. This result suggests that under the novel interaction modality in which speech interaction is added, participants adopt superior attentional strategies and face lower mental resource demands than under the touch interaction modality. The results also show that participants' mean pupil diameter and NNI significantly increased in the high-complexity task compared with the low-complexity task. Increases in NNI and pupil diameter are typically closely correlated with increased cognitive workload [50,52]. This implies that, in typical special vehicle information display interface activities, the high-complexity task may lead to higher cognitive workload than the low-complexity task. The subjective workload results presented above are generally consistent with the eye response metric results; thus, H3 is confirmed. That is, specific eye response metrics are sensitive to changes in task complexity and interaction modality, and to some extent this finding reveals the changes in participants' attentional strategies and workload states.
The findings of this study mainly contribute to the body of knowledge on task complexity and interaction modalities with respect to performance advantages in typical planning tasks for special vehicles. Nevertheless, this study has limitations. First, although a new high-fidelity platform for simulating special vehicle tasks and eye-tracking measurements were employed, only a limited set of typical tasks (task decomposition, road planning, perception planning, and strike planning) and eye-tracking metrics reflecting cognitive workload were tested. Future studies will require more realistic and complex task scenarios of special vehicles (e.g., noise, network latency, team-level dynamics, additional task complexity dimensions, and multitasking interruptions), along with comprehensive physiological measures (e.g., heart rate variability, electroencephalogram), to validate the results. Moreover, owing to workplace constraints and institutional requirements, the final sample comprised 18 males aged 23 to 46; future studies are planned to include female participants and a larger, more diverse sample to explore gender differences and age-related factors. Additionally, the brief task durations prevented the assessment of operator fatigue or adaptation effects, which are critical factors in extended real-world operations and will be addressed in future studies on sustained performance. Finally, because only a few hybrid interaction modalities, such as touch and speech, have been considered thus far, further research is required to study additional hybrid interaction modalities (e.g., gesture, eye-gaze, and other multi-modal combinations [54,55]). The research results can offer valuable support for the subsequent design of the information display interface of special vehicle crew cabins and serve as a point of reference for related military equipment domains.