Article

Research on Safe Multimodal Detection Method of Pilot Visual Observation Behavior Based on Cognitive State Decoding

1 School of Optical Engineering, Xi’an Technological University, Xi’an 710021, China
2 School of Mechatronic Engineering, Xi’an Technological University, Xi’an 710021, China
* Author to whom correspondence should be addressed.
Multimodal Technol. Interact. 2025, 9(10), 103; https://doi.org/10.3390/mti9100103
Submission received: 11 August 2025 / Revised: 11 September 2025 / Accepted: 25 September 2025 / Published: 1 October 2025

Abstract

Pilot visual behavior safety assessment is a cross-disciplinary technology that analyzes pilots’ gaze behavior and neurocognitive responses. This paper proposes a multimodal analysis method for pilot visual behavior safety based on cognitive state decoding, with the aim of achieving quantitative and efficient assessment of pilots’ observation behavior. To overcome the subjective limitations of traditional methods, we construct an observation behavior detection model driven by facial images that enables dynamic, quantitative analysis of observation behavior. The “Midas contact” problem of observation behavior is addressed by building a cognitive analysis method that uses multimodal physiological signals. To resolve the isolation of features across multidimensional signals, we propose a bidirectional long short-term memory (LSTM) network that aligns the rhythmic features of physiological signals; it captures dynamic correlations among physiological behaviors, such as prefrontal theta oscillations and chest-abdominal respiratory coordination, to decode the cognitive state underlying pilots’ observation behavior. Finally, a decision-level fusion method based on improved Dempster–Shafer (DS) evidence theory provides a quantifiable detection strategy for aviation safety standards. This dual-dimensional “visual behavior–neurophysiological cognition” assessment system reveals the dynamic correlation between visual behavior and cognitive state among pilots of varying experience, and can provide a new paradigm for pilot neuroergonomics training and early warning of vestibular-visual integration disorders.

1. Introduction

With economic and social development, the aviation transportation industry has grown year on year, and major transportation indicators have maintained steady, rapid growth [1]. Meanwhile, as aircraft flight-assistance and guidance equipment has been progressively optimized, the proportion of aviation accidents attributable to pilot operation errors has risen. Sant’Anna et al. [2] found that more than 50% of aviation accidents involve human error at some point in the causal chain of flight control. Human factors, including operational errors, are considered important contributors to general aviation accidents [3]. Among them, insufficient basic operating skills, low operational proficiency, and weak enforcement of flight regulations are major causes of human error. Pilots therefore bear primary responsibility for flight operation safety and serve as the “last line of defense for civil aviation safety” protecting the lives of passengers.
In summary, the proportion of flight safety accidents caused by pilots’ operation errors is increasing year by year. Analysis of flight operation data (including normal flight operations, events, and accidents) shows that operation errors are increasingly frequent during manual flight [4]. Investigations and analyses of aviation accidents show that human factors have become an important cause of flight accidents. Wrong decisions and operations by flight crews are largely rooted in poor human–machine interaction, reflecting the limited human–machine interaction performance of the aircraft cockpit. The Federal Aviation Administration of the United States holds that, to meet the needs of safe flight, the knowledge and skills of manual flight operations must be maintained and improved [5]. Its safety alert further notes that modern automatic flight systems are used routinely, that their continuous use is not conducive to building and maintaining manual flight knowledge and skills, and that it may erode pilots’ manual control capabilities. To improve pilots’ manual operation capabilities, many airlines regularly organize pilot retraining and re-evaluate pilots’ flight capabilities and behaviors during the training process.
In the 1950s and 1960s, the National Aeronautics and Space Administration (NASA) organized large numbers of pilots to fly with eye trackers and counted how often they attended to different instruments during flight. From these gaze frequencies, a scheme for arranging aircraft instrument positions was developed. For a qualified pilot, the most important element of safe flight control is observation ability: the capacity to detect, in time, deviations (or impending deviations) of the aircraft’s track and parameters from the expected data. Airlines therefore prescribe a safe observation sequence and observation method for flight instruments; by following the prescribed method, pilots can acquire flight data deviations in a timely and efficient manner. However, current assessments of pilot flight ability focus mainly on flight control data, while the assessment of visual observation behavior remains a manual exercise that is neither quantitative, scientific, nor efficient. The lack of observation behavior assessment allows some pilots to pass on the strength of past flight experience, while poor gaze habits create hidden risks for their future flight safety. For example, missing the altimeter during descent can easily lead to an excessive landing impact; an incorrect instrument gaze sequence, or one that ignores the instrument layout, can easily confuse observation behavior; and uneven distribution of attention across instruments can easily cause data deviations to be missed. In addition, existing visual observation behavior detection methods do not solve the “Midas contact” problem [6]: they cannot determine whether a pilot’s dwell on an instrument is a conscious observation or an unconscious pause. Failing to distinguish the latter also prevents the pilot from learning of deviations in flight data in time, making it an invalid observation.
This paper proposes three innovative contributions to the key bottleneck problems in the safety detection of pilot visual behavior:
(1)
Algorithm breakthrough based on facial images and dynamic cognitive monitoring: In view of the subjective defects of traditional evaluations that rely on manual observation and instructor scoring, we propose a new detection method that integrates an artificial intelligence network for facial image analysis with a Markov state transition model. By designing an integrated parallel visual behavior classification network and using a Markov chain to model gaze state transition probabilities, we achieve dynamic, quantitative monitoring and analysis of pilots’ visual observation behavior, providing an efficient and intelligent means of determining whether a pilot’s visual observation behavior pattern is safe.
(2)
Multimodal feature fusion solves the “Midas contact” problem: Given the core difficulty of distinguishing invalid gazes from effective gazes, we propose a multimodal physiological signal analysis method based on feature-layer fusion to identify the cognitive state of pilot observation behavior. To address the isolation of simultaneous multimodal data features, we deeply fuse heterogeneous data such as electroencephalogram (EEG), eye movement, respiration, and electrodermal activity at the feature level to capture the dynamic correlation between cognitive load and physiological response. On this basis, the pilot’s cognitive state is detected and classified through a bidirectional LSTM network.
(3)
Decision-level fusion visual normativeness detection framework: We study the decision-level fusion model based on DS evidence theory, integrating visual observation behavior and cognitive state assessment results to break through the limitations of single-dimensional detection. By designing a dynamic confidence weight allocation mechanism, we can achieve a comprehensive evaluation of visual normativeness at the decision level and provide a quantifiable detection benchmark for aviation safety standards.

2. Related Work

Pilot visual behavior is a key aspect of aviation safety. In the 1950s, researchers began to analyze and study pilot visual behavior patterns. Researchers represented by Fitts [7] examined pilots’ observation behavior in actual flight by counting the number of times each instrument was looked at. Wang et al. [8] used eye movement image tracking data in a real flight environment to establish a statistical classification model that can be used in the evaluation of pilot behavior. Dehais et al. [9] collected eye movement data from pilots in light aircraft, analyzed the distribution of pilots’ attention across flight phases, and achieved classification of pilot behavior in different flight phases; this study serves as a reference for analyzing the safety of pilots’ flying behavior in actual flight. These studies made clear that analyzing pilots’ observation behavior is crucial for improving flight safety and quantifying pilots’ flight capabilities.
In recent years, with advances in computing, researchers have primarily employed image detection technology to analyze pilots’ observation behavior and flight patterns. Lyu et al. [10] proposed a pilot behavior detection method based on the Flashlight model, which uses flight eye tracking data to analyze and detect pilot behavior; verification showed that the algorithm achieves high accuracy in detecting pilot flight performance. Wu et al. [11] designed a non-invasive pilot observation behavior detection method that combines the ResNet50 model with the Swin Transformer model and uses the pilot’s facial image information to analyze observation behavior; after verification, its classification accuracy for specific target areas reached 90%. Although this study used only a single data modality for training, it successfully used statistical methods to extract and detect pilot cognitive behavior. Wang et al. [12] used pilot image data to extract multimodal eye movement signals such as gaze ratio, pupil diameter, and number of blinks, and classified pilot performance with the XGBoost algorithm; experiments showed that this classification method reached an accuracy of 77.02%. Jin et al. [13] proposed a pilot visual search performance evaluation model based on an attention allocation algorithm, using eye movement indicators such as gaze duration and number of gazes as training data and innovatively applying expert scoring to assign different weight coefficients to different training data; the results showed that the algorithm can effectively reflect the pilot’s visual search performance level. Although these multi-dimensional eye movement methods did not fully evaluate gaze behavior, such as the gaze trajectory, they broadened the data dimensions and modalities and further improved the accuracy of pilot behavior detection in specific scenarios. Peysakhovich et al. [14] trained a linear support vector machine model, using pilots’ eye movement images to extract attention distribution and instrument scanning sequences; after verification, this method increased flight phase classification accuracy to 64.1%. Gao et al. [15] extracted the pilot’s gaze point from gaze behavior images and designed an optimized hidden semi-Markov model, with which they analyzed pilots’ observation behavior trajectories; this method was verified to reach a detection accuracy of 93.55%, significantly improving temporal behavior modeling. Although analysis based solely on eye movement data struggles to detect pilots’ cognitive states during gaze, classifying and evaluating flight behavior from gaze trajectories further advanced the detection of pilot observation behavior.
As research on pilot physiology continues to deepen, it has become apparent that eye movement and image data alone cannot fully describe the observational behaviors and flight patterns of different pilots. Therefore, researchers have begun leveraging multimodal signals to comprehensively model and evaluate pilot behavior patterns. Korek et al. [16] combined eye tracking data with flight control data to statistically analyze the observational and flight control behaviors of different pilots. This research expanded the data modality and provided new insights into multimodal detection methods. Building on previous research, Jiang et al. [17] proposed an artificial intelligence network-based detection model to monitor pilot visual behavior and flight performance. The researchers utilized multimodal data, including eye movement, gaze tracking, flight control, and flight parameters, to detect pilot observation and behavior. This method has been validated to achieve a detection accuracy of over 90% for specific flight behaviors. This method further expands the data modality of detection methods and, through artificial intelligence, enables the detection and classification of various flight behaviors. Smith et al. [18] further developed this approach. Researchers combined eye movement signals with electroencephalogram (EEG) signals and constructed a detection model for visual behavior and cognitive level using artificial intelligence algorithms. This method successfully incorporates EEG and eye movement signals to detect pilot visual behavior. To more comprehensively characterize pilots’ observation and cognitive states, Li et al. [19] collected electrodermal activity, facial expressions, and the NASA Task Load Index to assess pilots’ flight training performance. This study proposed a pilot flight capability detection algorithm. Building on previous research findings, this method further enriches the physiological data modalities used in pilot observation and cognitive model detection methods. It also validates the effectiveness of cross-modal information complementation in cognitive state modeling. To accurately assess pilot training effectiveness, Huang et al. [20] proposed a pilot training behavior detection algorithm based on neural networks and reinforcement learning. This method is trained using facial expressions, eye movements, and EEG data. It successfully achieves automated detection of pilots’ specific flight capabilities.
Current research progress shows that pilot flight behavior and flight ability detection technology is moving towards multimodality, intelligence, and automation, and researchers have achieved rich results in this field. However, existing assessments of pilot flight ability are largely limited to control ability and cognitive load. Automated assessment and detection of pilot observation ability, and of visual behavior safety based on cognitive state decoding, still require further research.
Compared to existing work, this paper investigates the following aspects. First, by extracting dynamic gaze features from continuous facial images, we construct a sequential visual behavior model to quantify observation behavior in a flight environment, overcoming the limitations of previous methods in temporal modeling. Second, the proposed multimodal fusion mechanism addresses the “Midas contact” problem and effectively detects interference caused by unconscious observation. Furthermore, by fusing multi-source temporal information, this paper addresses the isolation and lack of coordination of multimodal features at the same moment in traditional methods. Finally, this paper introduces the Dempster-Shafer (DS) evidence theory for decision fusion, achieving more accurate and robust assessment of visual observation behavior and cognitive state.

3. Materials and Methods

To identify the safety of pilots’ flight observation behavior, this paper proposes a multimodal detection model for pilot visual safety based on cognitive state decoding. The framework combines pilot visual behavior with multimodal neurophysiological signals and applies fusion algorithms at the data layer, feature layer, and decision layer to address the lack of quantification and low efficiency in the assessment of pilot visual observation behavior. The activity diagram of the observation behavior detection method is shown in Figure 1.

3.1. Flight Observation Behavior Safety Data Collection Equipment and Data Sources

Our dataset collection strategy is built under the framework of aviation human factors engineering research and aims to train an efficient artificial intelligence model for flight observation behavior safety detection from multimodal physiological signal characteristics. To this end, we designed a pilot observation behavior detection platform. The platform is based on a six-axis simulated flight platform and is equipped with multiple types of sensors to collect eye movement, EEG, electrocardiogram (ECG), respiration, and electrodermal signals in real time during pilot training. To better collect pilot eye movement signals and remove the restrictions that collection equipment places on head movement, we set up a three-camera, eight-light-source visual behavior collection array on the display of the flight simulation platform to capture the observation behavior images of flight trainees. This arrangement captures the pilot’s complete facial image while reducing restrictions on the head movement angle. The facial image collection scheme with three cameras and eight light sources is shown in Figure 2. The camera used is the AZURE-1214MM, with a focal length of 12 mm, a resolution of 1.3 megapixels, a field of view of 49°, and distortion of <1%.
To solve the Midas contact problem of observation behavior, that is, to judge the validity of an observation, we train with multimodal signals such as eye movement, EEG, ECG, respiration, and electrodermal activity, and use them to judge the effectiveness of pilots’ observation behavior and the cognitive state of the observation process. This study collects high-density, multimodal biometric data of pilots during benchmark tests and route orientation training. The signal modalities include 32-lead EEG signals, used to analyze neurocognitive features such as theta wave oscillation in the prefrontal cortex and alpha wave suppression in the parieto-occipital area; three-lead electrocardiogram (ECG) signals, which quantify the state of autonomic nervous regulation; dual-channel chest and abdomen respiratory sensors, which capture respiratory sinus rhythm characteristics; and galvanic skin response (GSR) sensors, which monitor the activation level of the sympathetic-adrenal medullary system. In the benchmark test, three cognitive states are induced through neuropsychological tasks during flight: the flight cognitive concentration state (CA), the distracted state (DA), and the startle reflex (SS). The flight simulation scenario constructs an ecological validity verification environment through human–computer interaction events, such as inserting sudden warnings and sudden wind shear alerts into instrument monitoring tasks, while pilot physiological data under the different cognitive states are collected simultaneously for time-locked analysis. The architecture of the pilot visual observation behavior safety detection solution is shown in Figure 3.
Regarding the selection of multimodal signals: EEG time-frequency features allow the cognitive resource allocation pattern to be decoded; the LF/HF ratio of the ECG reflects the dynamic balance between vagal tone and task load; the chest-abdomen coordination index of the respiratory signal characterizes the stress response level; and GSR is significantly correlated with arousal level. By constructing a multimodal feature fusion network, this dataset can effectively identify deviations in pilots’ neurophysiological homeostasis within complex human–machine systems and, in particular, provides quantitative evaluation indicators for detecting the effectiveness of instrument gaze. The multimodal signal acquisition sensors are listed in Table 1. The multi-physiological data acquisition devices used in this study are high-performance ErgoLAB series sensors produced by Beijing Kingfar Technology Co., Ltd. (Beijing, China).
It should be noted that the multimodal dataset constructed for this study was derived from a high-fidelity flight simulation environment. This platform provides realistic visual scenes, flight dynamics, and mission flow. Compared to a live flight operating environment, simulators cannot fully replicate the constant vibrations, cabin pressure fluctuations, and random turbulence experienced in real flights. To address minor deviations from the simulated flight environment, we employed a multi-layered cross-validation mechanism to ensure the reliability of data labels. The core validation layer was conducted by experienced flight instructors in an independent control room. During this experiment, the flight instructors observed the pilots’ operating behaviors, voice responses, and facial expressions, and simultaneously linked simulator event logs. Subjects’ status was independently assessed and annotated using the NASA-Task Load Index (TLX). This process deeply integrated the instructors’ professional judgment based on aviation human factors. Objective corroboration was achieved through time-locking of the experimental protocol. The start and end times of all induction processes were strictly synchronized with the multimodal physiological signal acquisition system. Subjective verification was achieved through retrospective interviews after the experimental phase, asking pilots to describe their subjective experiences of key events to cross-check the instructors’ assessments.

3.2. Pilot Visual Observation Behavior Detection Network

This paper proposes a pilot gaze behavior analysis model that integrates multi-scale visual features and spatiotemporal attention mechanisms. Its core architecture consists of a parallel collaborative FPN module for local detail extraction, a global context-aware Swin-Transformer module, and a Markov module responsible for behavior pattern reasoning. On the local feature extraction side, a self-designed dynamic-weight feature pyramid network captures the key facial detail features of pilots as they gaze at different instruments through an adaptive cross-layer fusion strategy; on the global feature modeling side, an improved Swin-Transformer network is introduced, with temporal position encoding embedded in its windowed self-attention mechanism to extract the feature correlations and spatiotemporal dependencies that arise when pilots gaze at different instruments.
The network is trained with spatiotemporal facial images of pilots observing the instruments. After the two feature streams are weighted by a dynamic gating unit, the pilot’s observation path is obtained. We input the gaze path into a sequence decision maker based on the Markov algorithm to classify different observation behaviors. Following flight behavior assessment requirements, we divide pilots’ observation behaviors into four types: repeated observation of instruments, correct observation of instruments, missed observation of instruments, and observation of instruments in an incorrect sequence. The Markov sequence decision-maker module models the state transition constraints of gaze behavior in the time dimension by constructing an operating-procedure template of instrument inspection specifications. When the visual feature response of a required instrument at a given moment falls below a preset threshold, a “missed observation” judgment is triggered; if an instrument area receives repeated high-confidence recognition signals in a non-essential time window, it is marked as “repeated observation”; and the optimal state path is traced back through the Viterbi algorithm to detect order deviations between the actual gaze sequence and the standard operating procedure, achieving accurate localization of “sequence errors”. The model innovatively combines computer vision feature extraction with sequential probabilistic reasoning, constructing the mapping between gaze behavior and operating specifications in a data-driven way without a predefined rule base. The pilot visual observation behavior detection network structure is shown in Figure 4.

3.2.1. Global Feature Extraction Module

This study proposes an improved Swin-Transformer network architecture for global feature modeling in the pilot visual behavior detection network. As a landmark Transformer variant in visual tasks, the Swin-Transformer’s advantages lie in its hierarchical windowing mechanism and cross-window connection strategy. While effectively modeling the spatial layout correlations between instruments and the spatiotemporal dependencies of pilot gaze movements, it significantly reduces computational complexity, meeting the requirements of real-time detection. The network consists of four stages, each of which includes multiple shifted window-based multi-head self-attention modules (SW-MSA) and a feed-forward network (FFN). The algorithm achieves feature map size reduction and receptive field expansion through layer-by-layer downsampling. To address the temporal dynamics of pilot gaze behavior, this paper embeds temporal position encoding (TPE) within the standard Swin-Transformer. Assume that the input sequence containing $T$ time steps is $X_t \in \mathbb{R}^{T \times H \times W \times C}$. The temporal encoding injects the spatiotemporal position information through the following formula:
$$\bar{X}_t = X_t + P_{\text{space}} + P_{\text{time}}(t),$$
where $P_{\text{space}} \in \mathbb{R}^{H \times W \times C}$ is the two-dimensional spatial position encoding, and $P_{\text{time}}(t) \in \mathbb{R}^{C}$ is a learnable time-step embedding vector. This design enables the network to simultaneously model the static dependencies of the cockpit instrument layout and the temporal evolution of the pilot’s gaze trajectory over the mission phase.
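To make the encoding step concrete, the following is a minimal sketch (PyTorch assumed) of adding a learnable spatial encoding and a learnable per-time-step embedding to a frame feature sequence, as in the formula above. The module name, tensor shapes, and dimensions are illustrative assumptions rather than the paper’s implementation.

```python
# Minimal sketch of the spatio-temporal position encoding in the formula above:
# X_bar_t = X_t + P_space + P_time(t). Shapes are illustrative assumptions.
import torch
import torch.nn as nn

class TemporalPositionEncoding(nn.Module):
    def __init__(self, T: int, H: int, W: int, C: int):
        super().__init__()
        # Learnable 2-D spatial encoding shared across time steps: (1, H, W, C)
        self.p_space = nn.Parameter(torch.zeros(1, H, W, C))
        # Learnable per-time-step embedding: (T, 1, 1, C), broadcast over H and W
        self.p_time = nn.Parameter(torch.zeros(T, 1, 1, C))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, H, W, C) frame features entering a Swin-Transformer stage
        return x + self.p_space + self.p_time

# Usage example: an 8-frame window with 56x56 patch grids and 96-channel embeddings
tpe = TemporalPositionEncoding(T=8, H=56, W=56, C=96)
frames = torch.randn(8, 56, 56, 96)
print(tpe(frames).shape)  # torch.Size([8, 56, 56, 96])
```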

3.2.2. Local Feature Extraction Module

This paper uses a feature pyramid network to extract the pilot’s local features. We redesigned the original feature pyramid architecture; to further improve the network’s ability to extract local features, we introduced the SE self-attention mechanism and further optimized the feature pyramid structure. The structure of the feature pyramid module is shown in Figure 5.
Feature Pyramid Network (FPN) is a convolutional neural network for detecting multi-scale objects [21,22]. FPN combines the fine-grained spatial information of shallow feature maps with the semantic information of deep feature maps, significantly improving the performance of object detection. The core structure of FPN contains a bottom-up path and a top-down path.
The bottom-up path is the forward pass of the convolutional neural network. In the forward pass, the feature map changes size after some layers and remains unchanged after others; layers that do not change the feature map size are grouped into one stage, and the output of the last layer of each stage is taken as that stage’s feature, thereby forming a feature pyramid. Specifically, these are the feature outputs of the last residual structure in each of the five stages of the ResNet network. The feature maps are then upsampled along the top-down path so that each upsampled feature map matches the size of the feature map in the next-lower layer. The feature maps generated by the bottom-up path are C1, C2, C3, C4, and C5 in Figure 6, and those generated by the top-down path are P2, P3, P4, and P5 in Figure 6.
We connect the SE self-attention module in series with the low-level Block 1 module of the feature pyramid. Its role is reflected in the optimization of multi-scale features. In the low-level features, the SE module strengthens the local details related to target detection through the channel attention mechanism, while weakening the interference of background noise. In the pilot cognitive state detection task, the model’s ability to capture subtle differences in the pilot’s visual images is improved.
The SE (Squeeze-and-Excitation) self-attention mechanism is a channel-level dynamic feature calibration module. Its core goal is to adaptively enhance the response of task-related feature channels by modeling nonlinear dependencies between channels while suppressing the interference of redundant or noisy channels. The mechanism can be decomposed into three stages: compression, excitation, and recalibration. The mathematical process is as follows:
In the compression stage, the network performs global average pooling on the input feature map $x \in \mathbb{R}^{H \times W \times C}$ to compress the spatial information of each channel into a scalar descriptor [23]:
$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j), \quad c \in \{1, 2, \ldots, C\}.$$
The channel statistics vector $z \in \mathbb{R}^{C}$ obtained in this way essentially captures the global distribution characteristics of the channel responses.
The excitation stage learns the nonlinear interactions between channels through fully connected layers with a bottleneck structure and generates the channel weight vector:
$$u = \delta(W_1 z + b_1), \quad s = \sigma(W_2 u + b_2),$$
where $W_1 \in \mathbb{R}^{(C/r) \times C}$ and $W_2 \in \mathbb{R}^{C \times (C/r)}$ are learnable parameters, $r$ is the compression ratio, $\delta$ is the ReLU function, and $\sigma$ is the Sigmoid function. This step explicitly models channel importance through the gating mechanism.
In the recalibration stage, the network multiplies the weight vector $s \in [0, 1]^{C}$ generated in the excitation stage by the original feature map channel by channel to achieve a soft selection of feature channels:
$$\tilde{x}_c(i, j) = s_c \cdot x_c(i, j), \quad \forall c, i, j.$$
Therefore, in the output feature map $\tilde{X} \in \mathbb{R}^{H \times W \times C}$, the responses of high-weight channels are enhanced while those of low-weight channels are suppressed, effectively improving the feature extraction capability of the feature pyramid.
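As an illustration of the three stages above, the following is a minimal sketch (PyTorch assumed) of an SE block; the channel count and reduction ratio r are illustrative defaults, not values reported in this paper.

```python
# Minimal sketch of Squeeze-and-Excitation recalibration: squeeze (global
# average pooling), excitation (bottleneck FC + sigmoid), and channel-wise
# rescaling of the input feature map.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)        # z_c: per-channel global average
        self.excite = nn.Sequential(                  # bottleneck fully connected layers
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),                             # s in [0, 1]^C
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        z = self.squeeze(x).view(b, c)                # compression stage
        s = self.excite(z).view(b, c, 1, 1)           # excitation stage
        return x * s                                  # recalibration: soft channel selection

# Usage: recalibrate a low-level feature map from the feature pyramid
feat = torch.randn(4, 256, 56, 56)
print(SEBlock(256)(feat).shape)  # torch.Size([4, 256, 56, 56])
```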

3.2.3. Sequential Detection Algorithm for Visual Behavior

We classify pilots’ gaze behaviors by introducing a pilot visual behavior classification method based on a hidden Markov model [24,25]. The correct gaze sequence of the pilot’s observation behavior is shown in Figure 7. Based on the transfer rules of the pilot’s instrument gaze sequence, we define the hidden states as specific instrument types (such as the attitude indicator, airspeed indicator, and altimeter) and the observation sequence as the instruments actually gazed at. We use the statistical characteristics of the state transition probabilities to indirectly infer behavior labels (missed viewing, repeated viewing, sequence errors, and correct viewing). The model captures the temporal dependencies between instruments through the state transition matrix and associates the hidden states with the actual gaze positions via the observation probability matrix. Finally, statistical features are extracted from the decoded state sequence and mapped to the four behavior labels. The observation behavior sequence detection model includes the following:
(1)
Hidden state set: $S = \{s_1, s_2, \ldots, s_N\}$, representing $N$ types of instruments.
(2)
Observation set: $O = \{o_1, o_2, \ldots, o_T\}$, representing the instruments actually gazed at, $o_t \in S$, over $T$ moments.
(3)
State transition probability matrix: $A = [a_{ij}]$, where $a_{ij} = P(y_t = s_j \mid y_{t-1} = s_i)$, representing the probability of transferring from instrument $s_i$ to $s_j$.
(4)
Observation probability matrix: $B = [b_j(o_t)]$, where $b_j(o_t) = P(o_t \mid y_t = s_j)$. If the gaze position is noise-free (i.e., the observation is a fixed value), it simplifies to $b_j(o_t) = \delta(o_t = s_j)$, where $\delta$ is the indicator function. The initial state distribution of the detector is $\pi = \{\pi_i\}$, where $\pi_i = P(y_1 = s_i)$.
In order to detect different observation behaviors of pilots, we established the hidden state sequence $Y = \{y_1, y_2, \ldots, y_T\}$ and the observation sequence $O$ of the pilot’s gaze behavior according to the order in which the pilot gazes at the instruments, and defined $A = [a_{ij}]$, where
$$a_{ij} = P(y_t = s_j \mid y_{t-1} = s_i).$$
We calculate it by maximum likelihood estimation:
$$a_{ij} = \frac{N(s_i, s_j)}{\sum_{k=1}^{N} N(s_i, s_k)},$$
where $N(s_i, s_j)$ is the frequency of transitions from state $s_i$ to $s_j$; $C(s_k)$ is the frequency of state $s_k$ in the sequence; $\theta_{\text{low}}$ is the frequency threshold for missed-viewing behavior; $\alpha$ is the sensitivity coefficient for sequence error detection; and $\beta$ is the path probability threshold for correct gaze behavior.
We classify the different gaze behaviors of pilots with a behavior classification rule based on the statistical characteristics of the decoded sequence $Y = \{y_1, y_2, \ldots, y_T\}$. Using the Markov algorithm, the pilot’s gaze behavior is classified into missed viewing, repeated viewing, sequence errors, and correct viewing. The detection method for each gaze behavior is as follows (a code sketch of these rules is given after the list):
(1)
Missed viewing: the frequency of state $s_k$ satisfies $C(s_k) < \theta_{\text{low}}$, where $C(s_k) = \sum_{t=1}^{T} \delta(y_t = s_k)$ and $\delta(\cdot)$ is the indicator function.
(2)
Repeated viewing: there are $\tau \geq 2$ consecutive observations with $y_t = y_{t+1} = \cdots = y_{t+\tau-1} = s_k$.
(3)
Sequence error: the transition probability $a_{ij} < \alpha\, a_{\text{ref},ij}$, where $a_{\text{ref},ij}$ is the reference transition probability of the standard operating procedure and $\alpha \in (0, 1)$ is the sensitivity threshold.
(4)
Correct viewing: the path probability $P(Y \mid O) \geq \beta$, and none of the above abnormal conditions are met.
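The following is a minimal sketch (NumPy assumed) of the classification rules above, treating the decoded state sequence as the observed gaze sequence under the noise-free assumption. The instrument set, reference matrix, and threshold values are illustrative placeholders rather than the configuration used in this paper.

```python
# Minimal sketch of the behavior-classification rules: the transition matrix is
# estimated by maximum likelihood and the labels are assigned with the
# thresholds theta_low, alpha and a minimum run length tau. All values below
# are illustrative placeholders.
import numpy as np

INSTRUMENTS = ["attitude", "airspeed", "altimeter", "heading"]  # hidden states s_1..s_N

def estimate_transitions(seq, n_states):
    """a_ij = N(s_i, s_j) / sum_k N(s_i, s_k)."""
    counts = np.zeros((n_states, n_states))
    for prev, curr in zip(seq[:-1], seq[1:]):
        counts[prev, curr] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

def classify_behaviors(seq, a_ref, theta_low=2, alpha=0.5, tau=2):
    labels = set()
    seq = np.asarray(seq)
    # Missed viewing: state frequency C(s_k) below theta_low
    for k in range(len(INSTRUMENTS)):
        if np.sum(seq == k) < theta_low:
            labels.add(f"missed:{INSTRUMENTS[k]}")
    # Repeated viewing: tau or more consecutive identical states
    run = 1
    for prev, curr in zip(seq[:-1], seq[1:]):
        run = run + 1 if curr == prev else 1
        if run >= tau:
            labels.add(f"repeated:{INSTRUMENTS[curr]}")
    # Sequence error: observed transition probability below alpha * reference value
    a_hat = estimate_transitions(seq, len(INSTRUMENTS))
    for prev, curr in zip(seq[:-1], seq[1:]):
        if a_hat[prev, curr] < alpha * a_ref[prev, curr]:
            labels.add(f"sequence_error:{INSTRUMENTS[prev]}->{INSTRUMENTS[curr]}")
    return labels or {"correct"}

# Usage: a gaze sequence that skips the altimeter and dwells on the attitude indicator
a_ref = np.full((4, 4), 0.25)   # placeholder standard-procedure transition matrix
print(classify_behaviors([0, 0, 0, 1, 3, 1, 0], a_ref, theta_low=1))
```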

3.3. Cognitive Detection Network Based on Multi-Dimensional Physiological Rhythm Feature Fusion

To determine whether the pilot is in a normal cognitive state when observing the flight instruments, and thereby solve the “Midas contact” problem of gaze, we use multimodal physiological data and a two-stream LSTM network with feature-layer fusion to detect and classify the pilot’s cognitive level during gaze. Our data come from multimodal physiological signal collection of professional pilots performing tasks in a simulated flight environment. The experiment designed three typical cognitive state scenarios: a distracted state (pilots perform secondary tasks such as conversation), a focused state (pilots perform normal traffic-pattern flight tasks), and a panic state (induced through emergencies such as engine failure or sudden weather changes in the flight simulator).

3.3.1. Flight Training Multimodal Data Preprocessing and Data Layer Fusion Algorithm

Flight training involves multiple stages, including takeoff, climb, descent, and landing, and spans long periods and many subjects. The multimodal data generated during training are therefore high-dimensional and large-scale before processing. To handle the computational burden of such data, substantial computing power is needed to integrate and extract features from this large volume of training data for efficient pilot performance assessment. On the other hand, previous studies of multimodal data for pilot training assessment have considered only the frequency-domain characteristics of a single physiological signal, capturing neither the complementary relationships expressed by multiple physiological sensors at a given moment during flight training nor the frequency-domain characteristics of the multimodal data as a whole. To address both issues, we used the short-time Fourier transform (STFT) to resample and fuse the data. The EEG signal component diagram during flight training is shown in Figure 8.
In the STFT, the original signal is divided into several fixed-length windows. The signal within each window is Fourier transformed to obtain the corresponding spectrum. This allows the spectral characteristics of the signal to be analyzed within each time window, thereby obtaining information about the frequency variation in the signal over time. By moving the window, a time-frequency plot of the entire signal is obtained. The formula for the STFT is
$$X_{\mathrm{STFT}}(\tau, f) = \int_{-\infty}^{+\infty} x(t)\, h(t - \tau)\, e^{-j 2 \pi f t}\, dt,$$
where $x(t)$ is the time-domain signal at time $t$, $h$ is the window function, and $\tau$ is the position of the center of the Fourier transform window on the time axis. In this paper, a Hamming window is used as the window function, with a window length of 0.5 s and a window overlap of 50%. To ensure consistency of the input data, the data are normalized. The feature-layer fusion structure diagram is shown in Figure 9. The colored lines in Figure 9 represent the temporal trends of the multimodal data after fusion: the red line represents eye movement data, the green line EEG data, the blue line EDA signals, and the purple line ECG signals.
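As a concrete illustration of this preprocessing step, the following is a minimal sketch (SciPy assumed) of computing an STFT with a 0.5 s Hamming window and 50% overlap and normalizing the resulting spectrogram; the 256 Hz sampling rate and the synthetic signal are illustrative assumptions.

```python
# Minimal sketch of the STFT-based resampling step with a 0.5 s Hamming window
# and 50% overlap; sampling rate and toy signal are illustrative assumptions.
import numpy as np
from scipy.signal import stft

fs = 256                                   # assumed sampling rate, Hz
t = np.arange(0, 10, 1 / fs)
eeg_channel = np.sin(2 * np.pi * 6 * t) + 0.3 * np.random.randn(t.size)  # theta-band toy signal

nperseg = int(0.5 * fs)                    # 0.5 s window
f, tau, Zxx = stft(eeg_channel, fs=fs, window="hamming",
                   nperseg=nperseg, noverlap=nperseg // 2)

# Normalize the magnitude spectrogram before feature-layer fusion
spec = np.abs(Zxx)
spec = (spec - spec.mean()) / (spec.std() + 1e-8)
print(spec.shape)   # (frequency bins, time frames) for one channel of one modality
```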

3.3.2. Observational Behavior Cognitive State Detection Network

To model the multimodal physiological cognitive data of pilots, we classify different pilot cognitive states through a two-stream LSTM network. The network adopts a dual-branch architecture to process the original time series signal and statistical features, respectively, and realizes cognitive state classification through hierarchical feature fusion. The structure of the bidirectional long short-term memory neural network is shown in Figure 10.
The bidirectional long short-term memory neural network adds a backward network to the original LSTM [26,27]. The forward LSTM captures the feature information of the first half of the input sequence, and the backward LSTM captures the feature information of the second half; this dual-LSTM structure can fully extract features from both directions, and splicing and combining them in the output layer yields more generalized feature information. The calculation of the bidirectional long short-term memory network is given in Formulas (8) and (9), where $M_1, M_2, M_3, M_4, M_5$, and $M_6$ are the corresponding weight parameters [28,29]. The output $H_m^t$ of the forward-propagation LSTM is
$$H_m^t = F(M_1 X_t + M_2 H_m^{t-1}).$$
The output $H_n^t$ of the backward-propagation LSTM is
$$H_n^t = F(M_3 X_t + M_5 H_n^{t-1}).$$
Concatenating $H_m^t$ and $H_n^t$ combines the vectors of the forward and backward propagation layers:
$$\mathrm{concat}(H_m^t, H_n^t) = \begin{bmatrix} H_m^1 \\ H_m^2 \\ \vdots \\ H_m^n \end{bmatrix} \oplus \begin{bmatrix} H_n^1 \\ H_n^2 \\ \vdots \\ H_n^n \end{bmatrix} = \begin{bmatrix} H_m^1 & H_n^1 \\ H_m^2 & H_n^2 \\ \vdots & \vdots \\ H_m^n & H_n^n \end{bmatrix}.$$
The concatenated vector is fed into the softmax [30] function to obtain the value of the output layer $Y_t$.
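To illustrate the dual-branch design, the following is a minimal sketch (PyTorch assumed) of a two-stream classifier in which a bidirectional LSTM processes the fused time series and a small fully connected branch processes statistical features before hierarchical fusion and a softmax output. All layer sizes, feature dimensions, and the class count are illustrative assumptions, not the paper’s configuration.

```python
# Minimal sketch of a two-stream classifier: a bidirectional LSTM branch for the
# fused raw time series plus an MLP branch for statistical features, fused and
# passed through softmax to obtain CA / DA / SS probabilities.
import torch
import torch.nn as nn

class TwoStreamBiLSTM(nn.Module):
    def __init__(self, seq_dim=32, stat_dim=16, hidden=64, n_states=3):
        super().__init__()
        # Sequence branch: forward and backward LSTMs over the fused signal window
        self.bilstm = nn.LSTM(seq_dim, hidden, batch_first=True, bidirectional=True)
        # Statistical branch: hand-crafted features (e.g., LF/HF ratio, blink rate)
        self.stat_net = nn.Sequential(nn.Linear(stat_dim, hidden), nn.ReLU())
        # Hierarchical fusion: concatenated bidirectional output plus statistical embedding
        self.head = nn.Linear(2 * hidden + hidden, n_states)

    def forward(self, seq, stats):
        out, _ = self.bilstm(seq)             # (B, T, 2*hidden)
        seq_feat = out[:, -1, :]              # last-step concatenation of both directions
        fused = torch.cat([seq_feat, self.stat_net(stats)], dim=1)
        return torch.softmax(self.head(fused), dim=1)

# Usage: a batch of 8 windows with 50 fused time steps and 16 statistical features
model = TwoStreamBiLSTM()
probs = model(torch.randn(8, 50, 32), torch.randn(8, 16))
print(probs.shape)   # torch.Size([8, 3])
```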

3.4. D-S Decision Layer Algorithm

The improved D-S evidence decision model proposed in this paper introduces conflict monitoring and manual review mechanisms on top of the traditional evidence theory framework to construct a dynamic evaluation system for the safety of pilots’ visual behavior. First, an identification frame containing the safety levels Θ = {safe, risky, serious hidden danger} is established; the behavior probabilities output by the visual behavior detection network are mapped to the evidence source $m_1$, and the state probabilities of the cognitive state evaluation network are converted to the evidence source $m_2$. Semantic alignment of the heterogeneous evidence is achieved by designing an evidence conversion matrix, in which visual behavior categories are mapped to safety levels according to aviation safety operation specifications, and cognitive states are associated with safety indicators according to neurophysiological standards [31]. When the conflict factor between the two types of evidence exceeds a preset threshold, the manual review mechanism is triggered and a senior flight instructor makes the final judgment based on the situational context, which effectively resolves the decision failures caused by highly conflicting evidence in the automated cockpit environment. The D-S evidence theory fusion decision model is shown in Figure 11.
In the specific algorithm flow, the multimodal detection results are first aligned in time and space, and dynamic confidence is constructed using the evidence sequence within the sliding time window $W_t = [t - \Delta t, t]$:
$$\tilde{m}_i(t) = \alpha\, m_i(t) + (1 - \alpha)\, \frac{1}{\Delta t} \sum_{\tau = t - \Delta t}^{t - 1} m_i(\tau), \quad i = 1, 2,$$
where $\alpha$ is the time decay factor, which realizes the gradual forgetting of historical evidence. The conflict factor between the two pieces of evidence is then calculated:
$$K_t = 1 - \sum_{A \cap B \neq \emptyset} m_1(A)\, m_2(B).$$
When $K_t > \theta_k$, the situation is judged to be an extreme conflict scenario, in which the conventional Dempster combination rule may produce a paradox. The system then freezes the automatic decision-making process, activates the manual review interface, and presents multi-dimensional visual information, including the original physiological signal waveforms, gaze heat map, and cognitive load curve, for experts to make a joint judgment. For situations where the conflict threshold is not reached, an improved weighted evidence fusion strategy is adopted:
$$m(C) = \frac{\sum_{A \cap B = C} w_1 m_1(A)\, w_2 m_2(B)}{1 - K_t + \epsilon}.$$
The weight coefficient $w_i$ is dynamically generated by the evidence quality assessment module:
$$w_i = 1 - \frac{H(m_i)}{\log |\Theta|}, \quad H(m_i) = -\sum_{A \subseteq \Theta} m_i(A) \log m_i(A).$$
This method can give greater weight to evidence with low information entropy and enhance the decision-making contribution of reliable information.
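The following is a minimal sketch (NumPy assumed) of this fusion step over the three singleton safety levels: entropy-based weights, the conflict factor, and the weighted combination with a manual-review fallback. The threshold value and the example masses are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of the improved DS fusion over singleton hypotheses
# {safe, risky, serious}: entropy-based weights, conflict factor K_t, and
# weighted combination; high conflict triggers manual review instead of fusion.
import numpy as np

LEVELS = ["safe", "risky", "serious"]

def entropy_weight(m):
    h = -np.sum(m * np.log(m + 1e-12))
    return 1.0 - h / np.log(len(m))                  # w_i = 1 - H(m_i)/log|Theta|

def fuse(m1, m2, theta_k=0.8, eps=1e-6):
    # Conflict factor over the singleton frame: K_t = 1 - sum_A m1(A) m2(A)
    k = 1.0 - float(np.sum(m1 * m2))
    if k > theta_k:
        return None, k                               # freeze automatic decision, trigger manual review
    w1, w2 = entropy_weight(m1), entropy_weight(m2)
    fused = (w1 * m1) * (w2 * m2) / (1.0 - k + eps)  # weighted Dempster-style combination
    return fused / fused.sum(), k                    # renormalize the fused masses

# m1: visual-behavior evidence, m2: cognitive-state evidence (illustrative values)
m1 = np.array([0.70, 0.20, 0.10])
m2 = np.array([0.60, 0.30, 0.10])
fused, k = fuse(m1, m2)
print(dict(zip(LEVELS, np.round(fused, 3))), "conflict =", round(k, 3))
```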

4. Experimental Results and Analysis

To ensure the reliability and reproducibility of the experimental conclusions, this study rigorously standardized the data collection process and statistical analysis methods. Twenty active civil aviation pilots were recruited and divided into two groups based on their total flight hours (TFH): a senior group (TFH ≥ 3000 h, average TFH 4850 ± 620 h) and a junior group (TFH < 3000 h, average TFH 1650 ± 410 h), with 10 participants in each group. The amount of data collected for each label was evenly distributed.
This experiment strictly adhered to the ethical principles of the Declaration of Helsinki [32]. All participants passed an aeromedical physical examination, and the experimental protocol was approved by the ethics committee of Air Force Medical University. Written informed consent was obtained from all participants [33]. In the X-Plane 11 simulated flight environment, a total of 4500 sets of synchronized multimodal data samples were collected (each set corresponding to a facial image sequence within a 5 s time window, 32-lead EEG, eye tracking, ECG, respiratory, and GSR signals), with each pilot contributing an average of 225 sets of data. The dataset was randomly divided into training and test sets based on participant stratification. The flight training process diagram for flight trainees is shown in Figure 12.
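As an illustration of the participant-stratified split described above, the following is a minimal sketch (scikit-learn assumed) that keeps all windows from one pilot in the same partition so that identity leakage does not inflate test accuracy; the array shapes and labels are illustrative placeholders.

```python
# Minimal sketch of a participant-stratified train/test split: samples are
# grouped by pilot identity so that no pilot appears in both partitions.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

n_samples = 4500
X = np.random.randn(n_samples, 128)                 # fused feature vectors (placeholder)
y = np.random.randint(0, 3, n_samples)              # CA / DA / SS labels (placeholder)
pilot_id = np.repeat(np.arange(20), 225)            # 20 pilots x 225 windows each

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=pilot_id))
print(len(train_idx), len(test_idx))                # roughly 80% / 20% split by pilot groups
```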

4.1. Analysis of Experimental Results of Pilot Visual Behavior Analysis Model

This study uses 224 × 224 pixel facial images as input and divides the dataset into an 80% training set and a 20% test set through random segmentation with an identifier-separation strategy. Facial images of pilot visual observations in different situations are shown in Figure 13.
The network training cycle is set to 100 epochs, and the batch size is 32. To verify the effectiveness of the model, we conducted comparative experiments with the basic Transformer network [34] and the feature pyramid network [35]. The average classification accuracy of the three is 80%, 82%, and 88%, respectively, showing the significant advantages of the proposed parallel network.
From the comparative analysis of the model architectures, the basic Transformer network performs well in capturing global attention features, but its average accuracy is relatively low because of its limited ability to extract local detail features. The feature pyramid network improves the representation of spatial details through its multi-scale feature fusion mechanism, raising the accuracy to 82%. The parallel network designed in this study integrates the hierarchical attention mechanism of the Swin-Transformer with the multi-scale representation advantages of the feature pyramid; it retains global context information while strengthening the discrimination of local features, ultimately raising the average classification accuracy to 88%. The recognition accuracy of the different networks is shown in Figure 14.
The pilot’s attention heatmap is shown in Figure 15. A fine-grained analysis of target area classification shows that the model maintains a stable classification accuracy for instrument areas such as airspeed (AS), pressure altitude (PA), attitude (PG), and heading. This is due to the discrete spatial distribution of these areas within the cockpit and the significant differences in their visual features. The classification accuracy for all target areas meets the practical benchmark requirements, verifying the model’s robustness in complex cockpit environments.
The experimental results show that the proposed parallel network effectively improves the discrimination accuracy of the fixation target area by co-optimizing the global attention mechanism and the local feature enhancement strategy. In particular, when dealing with instrument areas that are spatially adjacent and visually similar, the model’s anti-interference ability verifies the rationality of the architectural design. The classification accuracy confusion matrix for each label is shown in Figure 16. In the attention heatmap of Figure 15, each highlighted dot represents the pilot’s gaze focus on the instrument panel; the longer the gaze duration, the darker the dot.

4.2. Analysis of Complementary Experimental Results of Multimodal Observation Behavior Cognition Detection Model

In the complementary experiment on the multimodal observation behavior cognitive model, we verified the differing contributions of the physiological modalities to cognitive state detection through a systematic ablation study. The experimental results show that when the EEG signal alone is removed, the classification accuracy of the model on the test set drops sharply from 93.2% for the complete multimodal model to 71.5%, indicating that neurocognitive features are irreplaceable for pilot situational awareness detection. Ablating the eye movement signals reduces model accuracy by 17.3%, confirming the strong correlation between visual attention allocation patterns and cognitive load, consistent with the findings of Kang et al. [36] on eye movement-EEG fusion detection. In contrast, removing the ECG, respiration, and electrodermal signals individually reduces accuracy by 7.4%, 5.1%, and 3.8%, respectively, with successively smaller effect sizes, reflecting the auxiliary and complementary role of autonomic nervous system signals. This experiment verifies the hypothesis of synergistic enhancement among cross-modal features in the multimodal combination model. The result reveals, from an empirical perspective, the core value of EEG signals in decoding prefrontal cognitive function and the key role of eye movement data in evaluating visual-spatial working memory. Through rigorous ablation control experiments, this study provides a quantitative basis for the optimal sensor configuration of cockpit multimodal monitoring systems and confirms the soundness and necessity of a multi-source fusion architecture based on the neural-visual dual modality for cognitive state detection. The accuracy comparison results of the complementary experiments are shown in Figure 17.
The multimodal fusion bidirectional LSTM model proposed in this study achieved 93.2% cognitive state detection accuracy in the simulated civil aircraft environment. When applied to the mechanical instrument layout of a primary trainer aircraft, the model’s accuracy remained high at 92.8%, owing to the discrete spatial distribution and high visual distinctiveness of the instruments. In the cockpit environment of a mainstream civil aircraft, the accuracy reached 89.5%, still significantly outperforming single-modal detection methods.
Figure 18 shows the visualization results of cognitive state classification based on multimodal physiological signal fusion. The experimental data includes EEG rhythm features, eye movement patterns, heart rate variability, respiratory rhythm, and skin conductance response. The time series features are extracted and presented by the bidirectional LSTM algorithm. In the laboratory baseline data, the focus state (blue circle) shows a coordinated pattern of stable high-frequency EEG activity and regular respiratory rhythm, forming obvious clustering characteristics in the three-dimensional embedding space; the distraction state (red square) shows the characteristics of enhanced low-frequency components of EEG and abnormal heart rate fluctuations, which is significantly separated from the focus state; the panic state (green triangle) presents a unique discrete distribution in the feature space due to the sudden increase in pupil dilation (eye movement data) and skin conductance caused by sudden stress. The experimental results verify the effectiveness of multimodal fusion for the recognition of cognitive states in complex scenes. The results show that by integrating neural activity, autonomic nervous regulation, and visual behavior characteristics, a cognitive assessment model across experimental environments can be established, providing a reliable basis for attention monitoring in flight training.
To further study the performance of the two-stream LSTM algorithm based on feature-layer fusion, we used our own multimodal cognitive data as dataset I and NASA’s multimodal pilot cognitive state data [37] as dataset II. The two datasets were imported into the network for training and compared with baseline networks; the results are shown in Figure 19a,b. The classification efficiency of the cognitive detection models based on random forest, CNN [38], and RNN on both datasets was relatively low, with accuracy rates below 30%, mainly due to the inherent limitations of these models. When capturing the temporal dynamics of pilot physiological signals, the CNN model is limited by the static nature of its local receptive field and struggles to model the temporal coupling of cross-modal physiological parameters such as EEG rhythm phase synchronization and heart rate variability. Although the RNN [39] model can model time series, it cannot fully capture nonlinear interactions such as the sharp rise in skin conductance triggered by a sudden alarm or the coupling between respiratory rhythm and eye movement trajectory. The accuracy of the SVM [40] algorithm exceeded 50% on both datasets. In contrast, the LSTM algorithm optimized through feature-layer fusion achieved accuracies of 71% and 68%, respectively, demonstrating the role of downsampled feature-layer fusion in the dynamic selection and fusion of cross-modal temporal features. In particular, it addressed the joint modeling of EEG time-frequency features and respiratory signals, significantly improving the ability to distinguish between distracted states and loss of situational awareness, and validating the benefit of multimodal feature complementarity for cognitive detection. The improved multi-scale fusion bidirectional LSTM network proposed in this study achieved nearly 90% accuracy on both datasets, a significant improvement over the previous algorithms.
To verify the advantages of the proposed multimodal detection model for pilot visual safety detection, we compared the performance of different detection methods through systematic experiments. The accuracy and delay comparison of the different detection methods is shown in Table 2. The experimental results show that although the accuracy of traditional manual discrimination can reach 80%, it carries a significant delay of 500 ms, and as workload increases the manual method becomes prone to misjudgment due to subjective fatigue; single eye movement signal detection has the lowest accuracy because it cannot overcome the invalid-gaze misjudgments caused by the “Midas contact” problem. In contrast, the multimodal fusion detection method proposed in this paper achieves a performance breakthrough through multi-dimensional technical innovation. In terms of accuracy, the bidirectional LSTM network with feature-layer fusion captures cross-modal temporal correlation features such as EEG theta wave oscillation and the ECG LF/HF ratio, and combines DS evidence theory to dynamically adjust the decision weights of visual behavior and cognitive state, raising the accuracy to 91.3%. In terms of real-time performance, the parallel computing architecture of the trinocular vision acquisition array and the hierarchical window attention mechanism of the improved Swin-Transformer are jointly optimized to keep the detection delay at 62 ms, 87.6% shorter than manual detection, meeting the real-time monitoring needs of the cockpit. These results show that integrating techniques from computer vision, cognitive neuroscience, and information fusion provides the first solution for pilot visual safety monitoring that combines aviation-grade accuracy with practical real-time performance, and that the method can meet the needs of pilot visual behavior detection.

4.3. Discussion

We used the proposed multimodal detection method for pilot visual safety based on cognitive state decoding to examine the flight visual observation habits of pilots with different years of experience. Ten pilots with varying experience levels were tested for their visual observation behavior habits. A comparison of visual observation safety among pilots of different experience levels is shown in Figure 20, where the red, yellow, and blue lines represent the observation habits of senior, mid-level, and junior pilots, respectively. We found that pilots’ visual behavior patterns were significantly positively correlated with their flight experience, a finding that offers important guidance for aviation safety training. Pilots with more than 3000 h of flight experience showed more stable and reliable visual observation habits: senior pilots maintained instrument gaze behavior at a safe level during critical flight phases, and the match between their gaze transfer paths and standard operating procedures was significantly higher than in the other groups. Pilots with fewer than 3000 flight hours had less stable instrument gaze habits than senior pilots. Although they still maintained relatively standardized observation behaviors in the later stages of flight training, they were not as good as senior pilots in terms of average safety and stability. This may be related to the degradation of manual operation skills caused by over-reliance on automation systems, which corroborates the Federal Aviation Administration’s safety warning about the “automation capability trap”.
The multimodal detection system proposed in this study effectively revealed the dynamic association between visual behavior and cognitive state in pilots of varying experience levels, a finding with both theoretical and practical significance. Experimental results showed that senior pilots (≥3000 flight hours) exhibited greater visual-cognitive coordination stability: their gaze paths aligned significantly better with standard operating procedures than those of the junior group, and their prefrontal theta oscillations showed stronger phase synchronization with visual search behavior during multitasking. This confirms the role of experience in promoting cognitive automation. Through long-term training, senior pilots develop more efficient visual sampling strategies and neural resource allocation patterns, using peripheral vision monitoring to coordinate attention between focal and peripheral regions while maintaining primary task performance. In contrast, junior pilots (<3000 flight hours), while capable of performing basic maneuvers, exhibited higher cognitive load in their visual behavior, such as an increased coefficient of variation in gaze dispersion and delayed suppression of EEG alpha waves, reflecting immature attention allocation and cognitive resource management. By fusing multimodal data, the system accurately captured these experience-related differences in cognitive-behavioral coupling. This not only validates the detection model but also provides a quantitative basis for differentiated training design, for example, strengthening situational awareness and attention allocation training for junior pilots while helping senior pilots maintain cognitive agility and avoid automation-induced loss of vigilance. These findings deepen our understanding of the relationship between pilot experience and cognitive performance and demonstrate the system’s ecological validity and application potential in revealing the dynamic link between vision and cognition.
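The experience-related indicators mentioned above can be estimated with standard signal-processing tools. The following Python sketch shows one way to compute the coefficient of variation of gaze dispersion and a theta-band phase-locking value between prefrontal EEG and a continuous gaze-shift signal; the sampling rate, band edges, and synthetic inputs are assumptions for the sketch rather than the paper’s processing pipeline.

```python
# Illustrative computation of two experience-sensitive indicators:
# the coefficient of variation (CV) of gaze dispersion and a phase-locking value
# (PLV) between band-limited prefrontal theta activity and a gaze-shift signal.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 256  # assumed EEG sampling rate (Hz)

def dispersion_cv(gaze_xy: np.ndarray) -> float:
    """CV of gaze dispersion, given an (n, 2) array of gaze coordinates."""
    center = gaze_xy.mean(axis=0)
    dispersion = np.linalg.norm(gaze_xy - center, axis=1)
    return float(dispersion.std() / dispersion.mean())

def theta_gaze_plv(eeg: np.ndarray, gaze_signal: np.ndarray) -> float:
    """PLV between 4-7 Hz filtered EEG and a continuous gaze-shift signal."""
    b, a = butter(4, [4 / (fs / 2), 7 / (fs / 2)], btype="band")
    theta_phase = np.angle(hilbert(filtfilt(b, a, eeg)))
    gaze_phase = np.angle(hilbert(gaze_signal - gaze_signal.mean()))
    return float(np.abs(np.mean(np.exp(1j * (theta_phase - gaze_phase)))))

# Synthetic example data standing in for one trial.
rng = np.random.default_rng(0)
gaze_xy = rng.normal(size=(300, 2))
eeg = rng.normal(size=fs * 10)
gaze_signal = rng.normal(size=fs * 10)
print(dispersion_cv(gaze_xy), theta_gaze_plv(eeg, gaze_signal))
```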
While further research is needed to apply our method in real-world flight environments, the real-time cognitive monitoring framework proposed in this study has significant potential for integration with intelligent cockpit systems. We will subsequently explore combining observational behavior assessment with large language models trained on human feedback to build intelligent aviation safety assistance systems with continuous learning and adaptive capabilities. We will further investigate deep reinforcement learning from preference feedback for detecting flight control behaviors [41] and reinforcement learning from human feedback for detecting pilot behavior [42]. This approach extracts multimodal behavioral data from pilots in real-world flight environments and constructs a joint visual-cognitive decoding model grounded in pilot feedback. By establishing a reinforcement learning mechanism driven by human feedback, the system can dynamically adapt to the cognitive behavior patterns of different pilots. For the “Midas contact” problem, the gaze validity threshold can be optimized based on actual flight performance and expert annotations, achieving a more refined trade-off between multimodal signal processing and cognitive state decoding. When the system detects that a crew member is in a hazardous control state, it automatically triggers a multi-level response mechanism: a primary warning generates a contextualized SOP briefing or voice prompt, while intermediate and advanced warnings further increase the alert intensity, as sketched below. This layered response mechanism can advance flight behavior monitoring from passive warning to active assistance. By leveraging the contextual reasoning and generation capabilities of large language models, the system can provide personalized cognitive enhancement strategies and customized flight control recommendations for pilots of varying experience levels. Ultimately, it can be deeply integrated into the intelligent architecture of next-generation cockpits, enabling flight monitoring capable of real-time perception, intelligent interpretation, and autonomous collaboration, and providing more comprehensive and adaptive technical support for flight safety.
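As a concrete illustration of the escalation logic just described, the following Python sketch maps a fused risk score to warning tiers and responses. The thresholds, tier names, and actions are hypothetical placeholders chosen for the example, not values or interfaces from the paper.

```python
# Minimal sketch of a layered warning response. Thresholds and actions are
# illustrative assumptions, not the paper's calibrated parameters.
from enum import Enum

class WarningLevel(Enum):
    NONE = 0
    PRIMARY = 1
    INTERMEDIATE = 2
    ADVANCED = 3

def classify_risk(fused_unsafe_mass: float) -> WarningLevel:
    """Map the fused 'unsafe' belief mass to a warning tier (assumed thresholds)."""
    if fused_unsafe_mass < 0.3:
        return WarningLevel.NONE
    if fused_unsafe_mass < 0.5:
        return WarningLevel.PRIMARY
    if fused_unsafe_mass < 0.7:
        return WarningLevel.INTERMEDIATE
    return WarningLevel.ADVANCED

def respond(level: WarningLevel) -> str:
    """Return the escalating response associated with each tier."""
    return {
        WarningLevel.NONE: "continue monitoring",
        WarningLevel.PRIMARY: "play contextual SOP briefing / voice prompt",
        WarningLevel.INTERMEDIATE: "raise visual and auditory alert intensity",
        WarningLevel.ADVANCED: "trigger crew alert and log event for review",
    }[level]

print(respond(classify_risk(0.62)))
```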
The dual-dimensional “visual behavior–neurophysiological cognition” assessment system constructed in this study also shows transfer value for other high-risk human–computer interaction settings. For oil-well instrument operators, the method can quantify gaze compliance with key instruments such as pressure gauges and flow meters by reusing the dynamic weighted feature pyramid network, while the bidirectional LSTM network decodes multimodal signals to provide early warning of operational delays caused by cognitive overload. For air traffic controller situational awareness monitoring, the model can integrate radar screen gaze trajectories with voice command features to mitigate monitoring blind spots caused by tower workload. For vehicle drivers, the algorithm enables early detection of fatigue. The technology could ultimately be deployed in real-time safety systems such as those in vehicles and control towers.

5. Conclusions

This study addresses the safety hazards caused by deviations in pilots’ visual observation behavior in civil aviation, focusing on two core shortcomings of existing detection methods: first, traditional visual behavior evaluation relies on single-modal data, making it difficult to distinguish “effective gaze” from “ineffective glance” (the “Midas contact” problem); second, the lack of dynamic monitoring of pilots’ cognitive state makes it impossible to quantify the cognitive effectiveness of gaze behavior. To this end, this paper proposes a multimodal fusion framework for pilot visual behavior safety detection. By integrating facial visual features with multi-dimensional physiological signals, it constructs a full-process detection system covering visual behavior pattern recognition, cognitive state assessment, and comprehensive safety decision-making.
At the method level, this study combines computer vision and cognitive neuroscience to construct a two-stream collaborative detection network. For visual behavior detection, a parallel architecture based on a dynamic weight feature pyramid and an improved Swin-Transformer is designed and implemented. Through cross-scale feature fusion and spatiotemporal position encoding, it achieves 88% accuracy in the instrument gaze behavior classification task, 6% and 8% higher than the traditional feature pyramid network and the basic Transformer, respectively. For cognitive state detection, the proposed bidirectional LSTM network achieves 93.2% accuracy in cognitive level classification through temporal feature fusion of multimodal physiological signals, a significant advantage over single-modality detection. Finally, the two types of detection results are fused at the decision layer using DS evidence theory, which effectively resolves decision conflicts and realizes safety detection of pilots’ visual observation behavior.
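For reference, the cognitive-state branch can be prototyped with a standard bidirectional LSTM classifier. The PyTorch sketch below is a minimal illustration under assumed layer sizes, feature dimension, and a three-class cognitive label space; it is not the paper’s exact architecture or training configuration.

```python
# Minimal PyTorch sketch of a bidirectional LSTM classifier over fused
# multimodal physiological feature sequences. Sizes and the three cognitive
# classes are assumptions for illustration.
import torch
import torch.nn as nn

class BiLSTMCognitiveClassifier(nn.Module):
    def __init__(self, feature_dim: int = 16, hidden_dim: int = 64, num_classes: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=feature_dim,
            hidden_size=hidden_dim,
            num_layers=2,
            batch_first=True,
            bidirectional=True,
        )
        self.head = nn.Linear(2 * hidden_dim, num_classes)  # forward + backward states

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feature_dim) fused EEG/ECG/EDA/eye-movement features
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])  # classify from the final time step

model = BiLSTMCognitiveClassifier()
dummy = torch.randn(8, 128, 16)  # batch of 8 windows, 128 steps, 16 features
logits = model(dummy)
print(logits.shape)  # torch.Size([8, 3])
```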
In cognitive detection, this study revealed the dynamic correlation between pilot visual behavior and cognitive state and found that pilots across seniority groups exhibit transitional characteristics in attention allocation, providing useful guidance for aviation safety training. Application of the system in flight training has reduced the incidence of unsafe gaze behavior, supporting its practical value for aviation safety. Although the dual-dimensional “visual-cognitive” evaluation system still faces challenges in real-world flight environments, the technology offers new insights for optimizing cockpit human–machine interaction, innovating pilot training systems, and providing early warning of vestibular dysfunction.

Author Contributions

Conceptualization, H.Z. and C.W.; methodology, H.Z.; software, H.Z.; validation, H.Z.; formal analysis, H.Z.; investigation, P.W.; resources, C.W.; data curation, P.W.; writing—review and editing, C.W. and H.Z.; visualization, H.Z.; project administration, C.W.; funding acquisition, C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant Number: 52072293), the Key Research and Development Projects of Shaanxi Province (Grant Number: 2024GX-YBXM-126), and the Key Laboratory of Metrological Optics and Application, State Administration for Market Regulation (No. SXJL2023003KF).

Institutional Review Board Statement

The experiments involved simulated flight trials using wearable physiological sensors, with no clinical or high-risk procedures. The study was exempt from formal ethics review by the college but falls under the oversight of the Ethics Committee of Xi’an Technological University.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The original dataset presented in this study can be found in an online repository at: [Multimodal Physiological Monitoring During Virtual Reality Piloting Tasks (physionet.org)], dataset identifier [v1.0.0]. The supporting code and materials for this study are openly available in the GitHub repository at: [LegendaryFlutist/MDF-LSTM (github.com)].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, G.; Xue, Y.; Zheng, X.; Yang, S.; Shao, M.; Li, Z. Study on the influence mechanism of the development of China’s civil aviation transport industry on the upgrading of industrial structure. J. Intell. Technol. Innov. 2025, 3, 24–44. [Google Scholar] [CrossRef]
  2. de Sant’Anna, D.A.L.M.; de Hilal, A.V.G. The impact of human factors on pilots’ safety behavior in offshore aviation companies: A Brazilian case. Saf. Sci. 2021, 140, 105272. [Google Scholar] [CrossRef]
  3. Sheffield, E.; Lee, S.Y.; Zhang, Y. A systematic review of general aviation accident factors, effects and prevention. J. Air Transp. Manag. 2025, 128, 102859. [Google Scholar] [CrossRef]
  4. Zhang, X.; Sun, Y.; Zhang, Y.; Su, S. Multi-agent modelling and situational awareness analysis of human-computer interaction in the aircraft cockpit: A case study. Simul. Model. Pract. Theory 2021, 111, 102355. [Google Scholar] [CrossRef]
  5. Prinzel, L.J.; Krois, P.; Ellis, K.K.; Vincent, M.; Stephens, C.; Oza, N.; Chancey, E.T.; Davies, M.; Mah, R.; Ackerson, J.; et al. The adaptable and resilient safety system: The human factor in future in-time aviation safety management systems. In Proceedings of the AIAA SCITECH 2024 Forum, Orlando, FL, USA, 8–12 January 2024; p. 1603. [Google Scholar] [CrossRef]
  6. Freitas, A.; Santos, D.; Lima, R.; Santos, C.G.; Meiguins, B. Pactolo bar: An approach to mitigate the Midas touch problem in non-conventional interaction. Sensors 2023, 23, 2110. [Google Scholar] [CrossRef] [PubMed]
  7. Fitts, P.M.; Jones, R.E.; Milton, J.L. Eye movements of aircraft pilots during instrument-landing approaches. Ergon. Psychol. Mech. Models Ergon. 2005, 3, 56. [Google Scholar]
  8. Wang, Y.; Li, W.C.; Nichanian, A.; Korek, W.T.; Chan, W.T.K. Future flight safety monitoring: Comparison of different computational methods for predicting pilot performance under time series during descent by flight data and eye-tracking data. In International Conference on Human-Computer Interaction; Springer Nature Switzerland: Cham, Switzerland, 2024; pp. 308–320. [Google Scholar]
  9. Dehais, F.; Juaneda, S.; Peysakhovich, V. Monitoring eye movements in real flight conditions for flight training purpose. In Eye-Tracking in Aviation. Proceedings of the 1st International Workshop (ETAVI 2020); ISAE-SUPAERO, Université de Toulouse; Institute of Cartography and Geoinformation (IKG), ETH Zurich: Zürich, Switzerland, 2020; pp. 54–60. [Google Scholar]
  10. Lyu, M.; Li, F.; Qu, X.; Li, Q. Flashlight model: Integrating attention distribution and attention resources for pilots’ visual behaviour analysis and performance prediction. Int. J. Ind. Ergon. 2024, 103, 103630. [Google Scholar] [CrossRef]
  11. Wu, G.; Wang, C.; Gao, L.; Xue, J. Hybrid Swin transformer-based classification of gaze target regions. IEEE Access 2023, 11, 132055–132067. [Google Scholar] [CrossRef]
  12. Wang, Y.; Yang, L.; Korek, W.T.; Zhao, Y.; Li, W.C. The evaluations of the impact of the pilot’s visual behaviours on the landing performance by using eye tracking technology. In International Conference on Human-Computer Interaction; Springer Nature: Cham, Switzerland, 2023; pp. 143–153. [Google Scholar] [CrossRef]
  13. Jin, H.; Hu, Z.; Li, K.; Chu, M.; Zou, G.; Yu, G.; Zhang, J. Study on how expert and novice pilots can distribute their visual attention to improve flight performance. IEEE Access 2021, 9, 44757–44769. [Google Scholar] [CrossRef]
  14. Peysakhovich, V.; Ledegang, W.; Houben, M.; Groen, E. Classification of flight phases based on pilots’ visual scanning strategies. In Proceedings of the 2022 Symposium on Eye Tracking Research and Applications, Seattle, WA, USA, 8–11 June 2022; pp. 1–7. [Google Scholar] [CrossRef]
  15. Gao, L.; Wang, C.; Wu, G. Hidden semi-Markov models-based visual perceptual state recognition for pilots. Sensors 2023, 23, 6418. [Google Scholar] [CrossRef]
  16. Korek, W.T.; Mendez, A.; Asad, H.U.; Li, W.C.; Lone, M. Understanding human behaviour in flight operation using eye-tracking technology. In International Conference on Human-Computer Interaction; Springer International Publishing: Cham, Switzerland, 2020; pp. 304–320. [Google Scholar] [CrossRef]
  17. Jiang, G.; Chen, H.; Wang, C.; Xue, P. Transformer network intelligent flight situation awareness assessment based on pilot visual gaze and operation behavior data. Int. J. Pattern Recognit. Artif. Intell. 2022, 36, 2259015. [Google Scholar] [CrossRef]
  18. Smith, K.J.; Endsley, T.C.; Clark, T.K. Predicting situation awareness from physiological signals. arXiv 2025, arXiv:2506.07930. [Google Scholar] [CrossRef]
  19. Li, T.; Lajoie, S. Predicting aviation training performance with multimodal affective inferences. Int. J. Train. Dev. 2021, 25, 301–315. [Google Scholar] [CrossRef]
  20. Huang, W.; Wang, C.; Jia, H.B.; Xue, P.; Wang, L. Modeling and analysis of fatigue detection with multi-channel data fusion. Int. J. Adv. Manuf. Technol. 2022, 122, 291–301. [Google Scholar] [CrossRef]
  21. Alreshidi, I.; Moulitsas, I.; Jenkins, K.W. Multimodal approach for pilot mental state detection based on EEG. Sensors 2023, 23, 7350. [Google Scholar] [CrossRef] [PubMed]
  22. Shi, Z.; Hu, J.; Ren, J.; Ye, H.; Yuan, X.; Ouyang, Y.; He, J.; Ji, B.; Guo, J. HS-FPN: High frequency and spatial perception FPN for tiny object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 27 February–2 March 2025; Volume 39, No. 7. pp. 6896–6904. [Google Scholar]
  23. Gong, Y.; Yu, X.; Ding, Y.; Peng, X.; Zhao, J.; Han, Z. Effective fusion factor in FPN for tiny object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 1160–1168. [Google Scholar]
  24. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  25. Naik Bukht, T.F.; Al Mudawi, N.; Alotaibi, S.S.; Alazeb, A.; Alonazi, M.; AlArfaj, A.A.; Jalal, A. A novel human interaction framework using quadratic discriminant analysis with HMM. Comput. Mater. Contin. 2023, 77, 1557–1573. [Google Scholar] [CrossRef]
  26. Cheng, X.; Huang, B. CSI-based human continuous activity recognition using GMM–HMM. IEEE Sens. J. 2022, 22, 18709–18717. [Google Scholar] [CrossRef]
  27. Jacob, R.J. The use of eye movements in human-computer interaction techniques: What you look at is what you get. ACM Trans. Inf. Syst. (TOIS) 1991, 9, 152–169. [Google Scholar] [CrossRef]
  28. Mahadevaswamy, U.B.; Swathi, P. Sentiment analysis using bidirectional LSTM network. Procedia Comput. Sci. 2023, 218, 45–56. [Google Scholar] [CrossRef]
  29. Sherstinsky, A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Phys. D Nonlinear Phenom. 2020, 404, 132306. [Google Scholar] [CrossRef]
  30. Imrana, Y.; Xiang, Y.; Ali, L.; Abdul-Rauf, Z. A bidirectional LSTM deep learning approach for intrusion detection. Expert Syst. Appl. 2021, 185, 115524. [Google Scholar] [CrossRef]
  31. Zeng, J.J.; Chao, H.C.; Wei, J. Abnormal behavior detection based on DS Evidence theory for air-ground Integrated Vehicular Networks. IEEE Internet Things J. 2025, 12, 11347–11355. [Google Scholar] [CrossRef]
  32. Ashcroft, R.E. The declaration of Helsinki. Oxf. Textb. Clin. Res. Ethics 2008, 21, 141–148. [Google Scholar]
  33. Nijhawan, L.P.; Janodia, M.D.; Muddukrishna, B.S.; Bhat, K.M.; Bairy, K.L.; Udupa, N.; Musmade, P.B. Informed consent: Issues and challenges. J. Adv. Pharm. Technol. Res. 2013, 4, 134–140. [Google Scholar] [CrossRef] [PubMed]
  34. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; Volume 30. [Google Scholar]
  35. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 22–25 July 2017; pp. 2117–2125. [Google Scholar]
  36. Kang, Y.; Liu, F.; Chen, W.; Li, X.; Tao, Y.; Huang, W. Recognizing situation awareness of forklift operators based on eye-movement & EEG features. Int. J. Ind. Ergon. 2024, 100, 103552. [Google Scholar] [CrossRef]
  37. Yang, F.; Song, X.; Yi, W.; Li, R.; Wang, Y.; Xiao, Y.; Ma, L.; Ma, X. Lost data reconstruction for structural health monitoring by parallel mixed Transformer-CNN network. Mech. Syst. Signal Process. 2025, 224, 112142. [Google Scholar] [CrossRef]
  38. Yunita, A.; Pratama, M.I.; Almuzakki, M.Z.; Ramadhan, H.; Akhir, E.A.P.; Mansur, A.B.F.; Basori, A.H. Performance analysis of neural network architectures for time series forecasting: A comparative study of RNN, LSTM, GRU, and hybrid models. MethodsX 2025, 15, 103462. [Google Scholar] [CrossRef]
  39. Lorenz, O.; Thomas, U. Real Time Eye-tracking System using CNN-based Facial Features for Human Attention Measurement. In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2019), Prague, Czech Republic, 25–27 February 2019; pp. 598–606. [Google Scholar]
  40. Burges, C.J.C. A Tutorial on Support Vector Machines for Pattern Recognition. Data Min. Knowl. Discov. 1998, 2, 121–167. [Google Scholar] [CrossRef]
  41. Christiano, P.F.; Leike, J.; Brown, T.; Martic, M.; Legg, S.; Amodei, D. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; Volume 30. [Google Scholar] [CrossRef]
  42. Wong, M.F.; Tan, C.W. Aligning crowd-sourced human feedback for reinforcement learning on code generation by large language models. IEEE Trans. Big Data 2024, 20, 1–12. [Google Scholar] [CrossRef]
Figure 1. Activity diagram of observation behavior detection method.
Figure 2. Visual behavior collection method.
Figure 3. Pilot visual observation behavior safety detection solution architecture.
Figure 4. Pilot visual observation behavior detection network structure.
Figure 5. Structure of feature pyramid module.
Figure 6. Feature pyramid working principle diagram.
Figure 7. Correct gaze sequence for pilots’ observation behavior. The numbers in Figure 7 indicate the correct steps and order for pilots to look at different instruments.
Figure 8. EEG signal components during flight training.
Figure 9. Feature layer fusion structure diagram.
Figure 10. The structure of a bidirectional long short-term memory neural network.
Figure 11. D-S evidence theory fusion decision model.
Figure 12. Flight training process for flight trainees.
Figure 13. Facial images of pilot visual observations in different situations.
Figure 14. Comparison of pilot visual behavior detection accuracy using different algorithms.
Figure 15. Heat map of pilot visual attention distribution.
Figure 16. Confusion matrix for pilot cognitive state detection.
Figure 17. Comparison of output accuracy of multimodal signal complementary experiment.
Figure 18. Data distribution results of different cognitive states before and after data fusion.
Figure 19. Comparison of classification results of different networks.
Figure 20. Comparison of visual observation safety of pilots with different years of experience.
Table 1. Pilot multi-modal signal acquisition sensors.
Device Name | Sensor Type | Collection Frequency
ErgoLAB EDA | Galvanic skin sensor | 1024 Hz
ErgoLAB ECG | ECG sensor | 1024 Hz
Xatu_Eyetracking | Eye movement detection sensor | 1024 Hz
ErgoLAB 32-lead | EEG sensor | 256 Hz
Table 2. Accuracy and latency of different detection methods.
Identification Method | Manual Identification | Single Eye Movement Signal | Fusion Information
Accuracy (%) | 80 | 65 | 91.3
Latency (ms) | 500 | 32 | 62
