A Lightweight Framework for Pilot Pose Estimation and Behavior Recognition with Integrated Safety Assessment

Wu, Honglan; Lu, Xin; Sun, Youchao; Liu, Hao

doi:10.3390/aerospace12110986

by Honglan Wu, Xin Lu, Youchao Sun * and Hao Liu

College of Civil Aviation, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China

* Author to whom correspondence should be addressed.

Aerospace 2025, 12(11), 986; https://doi.org/10.3390/aerospace12110986

Submission received: 24 September 2025 / Revised: 29 October 2025 / Accepted: 31 October 2025 / Published: 3 November 2025

(This article belongs to the Section Air Traffic and Transportation)

Abstract

With the rapid advancement of aviation technology, modern aircraft cockpits are evolving toward high automation and intelligence, making pilot-cockpit interaction a critical factor influencing flight safety and efficiency. Pilot pose estimation and behavior recognition are critical for monitoring pilot state, preventing operational errors, and enabling adaptive human–machine interaction, thus playing an essential role in aviation safety assurance and intelligent cockpit development. However, existing methods face challenges in real-time performance, reliability, and computational complexity in practical applications. Traditional approaches, such as wearable sensors and image-processing-based algorithms, demonstrate certain effectiveness but still exhibit limitations in aviation environments. To address these issues, this paper proposes a lightweight pilot pose estimation and behavior recognition framework, integrating Vision Transformer with depth-wise separable convolution to optimize the accuracy and efficiency of keypoint detection. Additionally, a novel multimodal data fusion technique is introduced, along with a scientifically designed evaluation system, to enhance the robustness and security of the system in complex environments. Experimental results on a pilot keypoint detection dataset captured in a simulated cockpit environment show that the proposed method achieves 81.9 AP, while substantially reducing model parameters and notably improving inference efficiency compared with HRNet. This study provides new insights and methodologies for the design and evaluation of aviation human-machine interaction systems.

Keywords: pilot pose estimation; behavior recognition; vision transformer; lightweight design; multimodal data fusion; aviation safety

1. Introduction

With the rapid advancement of aviation technology, modern aircraft cockpits are progressively evolving towards higher levels of automation and intelligence [1]. In this process, the interaction between pilots and the cockpit has become a critical factor determining flight safety and efficiency. Particularly, human-body sensing interaction technology, due to its ability to significantly enhance the operational convenience and comfort of pilots, has emerged as one of the core areas in cockpit design. Among these intelligent cockpit technologies, pilot pose estimation and behavior recognition are essential for safety and situational awareness [2].

Despite the crucial role of posture estimation and behavior recognition in aviation, current methods encounter various difficulties when applied in real-world settings, particularly regarding their ability to operate in real time and maintain reliability under aviation conditions. For instance, traditional posture estimation methods primarily rely on wearable sensors [3] or classical computer vision algorithms based on image processing [4]. However, these methods encounter several challenges when applied in aviation environments. While wearable sensors can directly capture posture information, their high device costs, poor wearing comfort, and potential interference with the natural interaction between pilots and the cockpit limit their practicality. On the other hand, traditional image processing-based algorithms are constrained by the highly dynamic flight environment and the demand for real-time performance. These limitations motivate the adoption of deep learning for accurate and contactless pilot pose estimation.

The core task of human body keypoint localization is to predict the positions of human body keypoints in images. In recent years, thanks to the rapid development of artificial intelligence technology, this field has achieved groundbreaking progress. Researchers have explored various technical approaches to improve detection accuracy, such as enhancing network design [5], integrating PVT and SAR [6], and global modeling [7]. In human pose estimation tasks, global dependency relationships are crucial. However, traditional convolutional neural networks (CNNs) are limited by their local receptive fields, making it difficult to model the relationships between distant keypoints [8]. To address this, Transformer has been widely adopted due to its global modeling capabilities. Through its self-attention mechanism, Transformer significantly improves global modeling. Nevertheless, representative methods such as TransPose [9] and HRFormer [10], while achieving state-of-the-art results in general human pose estimation, are primarily designed for generic tasks and rely on relatively heavy architectures. Recognizing complex pilot actions, which require detecting multiple hand and limb keypoints simultaneously, further increases computational complexity, limiting their real-time applicability in aviation scenarios. Therefore, developing lightweight models that maintain accuracy while meeting real-time requirements in aviation environments is a critical challenge.

Lightweight design is key to addressing the challenges of computational complexity and real-time performance. Common approaches include techniques based on convolutional decomposition [11] and model compression strategies [12], which reduce model complexity while also addressing dynamic changes and occlusion issues in aviation environments to some extent. However, these methods still face challenges in complex action recognition tasks. For instance, in multi-keypoint detection tasks, compressed models may lead to keypoint loss or reduced localization accuracy [13]. In multimodal data fusion, lightweight strategies may weaken the model’s ability to capture temporal information [14]. Therefore, there is an urgent need for a lightweight framework that can ensure efficient inference while enhancing the perception of complex action details, thereby improving the reliability of aviation human-machine interaction systems.

In pilot action recognition tasks, safety assessment is a critical component to ensure the reliable operation of the system. Accurately assessing the behavioral state of pilots not only helps improve flight safety but also provides a scientific basis for flight training and operational standards. However, traditional data-driven assessments often rely on single-modal data [15], such as visual data. Single-modal data struggles to comprehensively reflect the actual state of pilots, particularly under complex environments, occlusions, or dynamic conditions. Multimodal data fusion techniques (e.g., combining visual data, inertial measurement units, and physiological signals) offer new possibilities for comprehensively evaluating system performance [16]. By integrating multimodal data, it is possible to more comprehensively capture the behavioral characteristics of pilots, significantly enhancing the robustness of assessments, especially in complex environments [17]. Recent studies have employed approaches such as Fault Tree Analysis, Bayesian Networks, and machine learning-based fusion models [18] to leverage multimodal information, demonstrating the potential of combining multiple data sources to assess pilot state and workload more comprehensively [19]. In addition, explainable AI [20] techniques can reveal which features or sensor inputs contribute most to model predictions, improving interpretability and supporting objective safety assessment. Nevertheless, existing assessment methods still have shortcomings in the design of evaluation metrics, making it difficult to fully leverage the advantages of multimodal data, particularly in complex environments where they fail to meet practical requirements.

In summary, although existing methods have made some progress in pilot pose estimation and behavior recognition tasks, they still face the following key challenges: (1) Transformer-based methods, while significantly improving global modeling capabilities, suffer from high computational complexity, making real-time inference difficult in resource-constrained scenarios; (2) Existing evaluation methods lack a scientifically supported metric system, making it challenging to fully leverage the advantages of multimodal data and meet practical requirements in complex environments. To address these challenges, this paper proposes a lightweight framework for pilot pose estimation and behavior recognition, and innovatively introduces multimodal data fusion technology to achieve a scientific evaluation of system performance. The main contributions of this paper are as follows:

(1) This paper innovatively proposes a vision Transformer framework for pilot pose estimation, combining a collaborative optimization strategy of order-swapped attention mechanisms and self-attention mechanisms. Depth-wise separable convolutions are introduced to enhance local feature extraction capabilities, achieving joint optimization of hand and limb keypoint detection, significantly improving the accuracy and efficiency of pilot keypoint localization.

(2) The effectiveness of the proposed model is validated through extensive experiments. The results demonstrate that the model significantly enhances the accuracy and efficiency of pilot keypoint localization in complex environments.

(3) Based on extensive experiments and practice, a scientific metric system is summarized, providing a reliable basis for system safety assessment. Additionally, multimodal data fusion technology is innovatively introduced, and the system’s reliability and safety are evaluated using this metric system, effectively ensuring the robustness and stability of the system in complex environments.

To verify the effectiveness of the proposed method, this paper conducts a comprehensive evaluation through theoretical analysis and experimental validation. The structure of this paper is shown in Figure 1 and is organized as follows: Section 2 introduces related research progress; Section 3 details our methodology and model architecture; Section 4 presents experimental validation of pilot pose estimation; Section 5 evaluates the safety of the proposed model; and finally, Section 6 concludes the paper.

2. Related Research Progress

2.1. Posture Estimation and Behavior Recognition

Over the past few years, keypoint detection has seen considerable progress, forming the basis for pose estimation. Traditional methods primarily rely on feature-based analysis [21] and are typically used in relatively simple scenarios. With the advent of deep learning [22], CNN-based methods [23] (e.g., MobileNet [24], OpenPose [25]) have achieved higher accuracy, can learn multi-level features, and are applicable to complex scenes. Research in multimodal fusion [18] has expanded the application of keypoint detection, such as combining RGB and depth images [26], as well as infrared and thermal images [27]. At the same time, self-supervised learning methods [28] have been shown to improve the generalization ability and computational efficiency of keypoint detection, as demonstrated in studies on pressure maps [29] and thermal images [30]. Moreover, the development of lightweight models, such as YOLOv8n-Pose [31], has significantly advanced real-time processing techniques.

Nevertheless, the performance of pose estimation largely depends on the precision of keypoint detection, particularly in pilot pose estimation tasks where accurate identification of hand and limb keypoints is essential. Consequently, enhancing the reliability and accuracy of keypoint detection has become a major focus of current research. With the development of deep learning, pose estimation has evolved from early inference based on statistical methods such as maximum likelihood estimation [32,33] to various deep learning-based models [34,35], with the estimation dimension gradually expanding from 2D to 3D [36,37]. For instance, SMT-PTEC [38] and GraphMLP [39] both propose graph-like MLP structures to capture local details, which improve the recovery of spatial information.

Global correlation plays a critical role in human pose estimation, and the localized nature of CNNs can limit their ability to capture long-range interactions [8]. Transformer-based models [9,40,41,42,43,44] were introduced to overcome CNNs’ local dependency limitations by capturing global feature correlations through self-attention. However, these models often involve high computational complexity and insufficient fine-grained feature extraction, limiting their real-time applicability in high-dynamic pilot pose estimation scenarios.

With increasing emphasis on efficiency, lightweight Transformer variants have been developed to balance accuracy and speed, such as the CSA-Block hybrid attention network [45], PLPose [46], and HRFormer [10], which optimize feature representation and attention computation through window-based or sparse attention mechanisms [47]. Nevertheless, lightweight models still face trade-offs between efficiency and accuracy, indicating the need for further architectural optimization to enhance real-time performance and robustness in pilot pose estimation.

In summary, while existing research has made progress in improving the real-time performance and accuracy of pilot pose estimation, challenges remain in balancing model lightweighting with maintaining high accuracy, enhancing feature extraction efficiency, and optimizing keypoint detection strategies.

To tackle these challenges, we introduce a lightweight vision Transformer architecture, augmented with depth-wise separable convolutions to improve feature extraction capabilities. The specific methodology is detailed in Section 3.

2.2. Safety Assessment

With the continuous improvement of civil aviation safety requirements and the rapid advancement of intelligent cockpit technologies [48], pilot pose estimation and behavior recognition systems, as core components ensuring flight safety, have garnered widespread attention for their safety and reliability evaluation techniques. From early-stage simple evaluations based on rules and expert experience to the current comprehensive evaluations leveraging artificial intelligence and multi-dimensional data, safety assessment technologies have undergone significant evolution. This developmental trajectory can be summarized as a progressive transition from rule-driven to data-driven, and further to intelligent evaluation.

In the early stages, safety assessments primarily relied on predefined rules [49] and expert experience [50], lacking flexibility and struggling to adapt to complex and dynamic flight scenarios. With the advancement of statistical analysis and traditional machine learning technologies, safety assessments transitioned from rule-driven to data-driven [51] approaches. Researchers began employing algorithms such as Support Vector Machines [52] and Random Forests [53], combined with historical data for model training, significantly enhancing the generalization capabilities of assessments. For instance, PS-AE-LSTM [54] utilized time-series data, integrating Autoencoders and LSTM to achieve risk assessment. Despite these advancements, traditional methods still relied on manually designed features, making it difficult to capture complex behavioral patterns and demonstrating limited capabilities in processing high-dimensional data.

The full utilization of multimodal data has further advanced the development of safety assessment methodologies. Researchers have employed methods such as Fault Tree Analysis [55], Bayesian Networks [56], and Fuzzy Theory [57], integrating multimodal data (e.g., visual, auditory, and biological signals) to conduct comprehensive evaluations. Wei et al. [58], based on Safety-II theory, collected multidimensional data and proposed a quantitative assessment framework utilizing both implicit and explicit multi-source data to evaluate pilot flight safety. Pei et al. utilized visual data by converting pilot flight videos into frame sequence datasets. After preprocessing the images, they employed a GRNN model to assess pilot mental fatigue [59]. Li et al. [2] established a neurophysiological model using electroencephalogram (EEG) data and evaluated the model’s feasibility using leave-one-subject-out cross-validation. Ma et al. [60] collected EEG, electrocardiogram (ECG), and skin conductance signals to extract high-dimensional feature matrices, constructing a multimodal fusion workload recognition model based on machine learning. However, existing research still faces challenges in the utilization of multi-source data, such as low data fusion efficiency, insufficient model interpretability, and poor adaptability to extreme scenarios, which affect the reliability of the evaluation results in practical applications.

These issues further constrain the construction of the indicator system, leading to the use of empirical indicators in most existing evaluation systems, which fail to systematically define and quantify the correlation between pilot state and safety risks. For instance, most studies focus on the analysis of single-modal physiological or behavioral data, without considering the interactions between different modalities. Moreover, some studies, although establishing evaluation standards for specific scenarios, lack generalizability and cannot be applied to complex flight environments.

To address these challenges, virtual simulation technology has gradually emerged as a critical direction in safety assessment [61]. Virtual simulation can simulate various flight scenarios under controlled conditions, including extreme situations and emergencies, thus providing an efficient platform for the generation and fusion of multi-source data. Through the simulation environment, researchers can generate large amounts of multimodal data, optimize data fusion algorithms, and verify the robustness and adaptability of evaluation models. Moreover, virtual simulation also provides an experimental basis for the construction of an indicator system, enabling the testing and validation of evaluation standards under different scenarios. However, current research primarily uses virtual simulation for data collection and testing, without establishing a complete safety assessment system, and still lacks a scientific and systematic indicator system to quantify the relationship between pilot state and safety risks.

To sum up, existing research still faces shortcomings in the integration of multi-source data and the design of safety assessment indicator systems, which hinders the full utilization of multi-modal data advantages and the improvement of assessment accuracy and adaptability. To address this, this paper proposes a new safety assessment indicator system based on multi-source data, and utilizes the collected multi-source data for scoring and quantitative evaluation to enhance the accuracy and applicability of the assessment.

3. Methodology

This study presents a hybrid model for pose estimation and behavior recognition for civil aviation pilots (Figure 2). The core architecture of the model comprises three main modules. First, a lightweight pilot keypoint detection network is designed, which integrates Order-switched Attention (OSA) and Standard Attention (SA) [62] in parallel branches, significantly reducing the model’s parameter count and computational complexity while maintaining high detection accuracy. Second, a Depth-wise Separable Convolution [63] module is introduced into the Feedforward Neural Network (FFN) to enhance the perception of local features by expanding the model’s receptive field. Based on the above keypoint detection framework, this paper further constructs a specialized hand keypoint detection sub-network to achieve joint optimization of limb and hand keypoint detection. To address the pilot behavior recognition task, this paper integrates a K-Nearest Neighbor (KNN) classifier based on the keypoint detection network to construct a complete behavior recognition and classification system. To verify the effectiveness of the model and explore the influence of each component on the performance of keypoint localization, a series of ablation experiments are conducted. These experiments systematically evaluate the contribution of the core components, such as the order-switching attention module, to the performance of the model. This is achieved through a combination of quantitative and qualitative methods to ensure that the model prediction results are interpretable and traceable. Furthermore, this study proposes a comprehensive evaluation index system for pilot behavior recognition, thus providing a scientific quantitative standard for system performance evaluation.

To further illustrate the implementation process of the proposed model, Figure 3 presents the software workflow of the pilot behavior recognition and evaluation system, showing the data flow from input acquisition to final performance assessment.
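The behavior recognition step described above (detected keypoints passed to a K-Nearest Neighbor classifier) can be sketched as follows. This is a minimal illustration only: the feature normalization, the three-keypoint toy poses, and the behavior labels are assumptions made for the example and are not taken from the paper.

```python
import math
from collections import Counter

def keypoints_to_feature(keypoints):
    """Flatten (x, y) keypoints into a translation- and scale-normalized vector.

    This normalization is an illustrative assumption, not the paper's recipe.
    """
    cx = sum(x for x, _ in keypoints) / len(keypoints)
    cy = sum(y for _, y in keypoints) / len(keypoints)
    centered = [(x - cx, y - cy) for x, y in keypoints]
    # Scale by the largest distance from the centroid; use 1.0 if all points coincide.
    scale = max(math.hypot(x, y) for x, y in centered) or 1.0
    return [v / scale for point in centered for v in point]

def knn_predict(train, query, k=3):
    """Majority vote over the k training samples closest to `query` (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: three keypoints per pose and two hypothetical behavior labels.
train = [
    (keypoints_to_feature([(0, 0), (2, 0), (1, 3.0)]), "reach_overhead_panel"),
    (keypoints_to_feature([(0, 0), (2, 0), (1, 2.8)]), "reach_overhead_panel"),
    (keypoints_to_feature([(0, 0), (2, 0), (1, -3.0)]), "hold_control_column"),
    (keypoints_to_feature([(0, 0), (2, 0), (1, -2.8)]), "hold_control_column"),
]

# Same pose as the first sample, but shifted and enlarged in the image.
query = keypoints_to_feature([(10, 10), (14, 10), (12, 16)])
predicted = knn_predict(train, query)
```

Normalizing the keypoints before the distance computation keeps the vote from depending on where the pilot appears in the frame; whether the original system does this is not stated in the text.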

3.1. Pilot Body Keypoint Detection Model

3.1.1. HRNet-Former

Given HRNet’s [64] excellent performance in human keypoint detection tasks, this study presents an improved HRNet-Former module. This module incorporates Four-stage Progressive Feature Extraction, inheriting HRNet’s core architecture. Specifically, high-resolution feature maps serve as the initial input in each stage, and a multi-branch parallel structure is used to construct a feature pyramid from high to low resolution [65]. During feature extraction, the resolution of the feature maps decreases progressively at each branching level, while the number of feature channels increases exponentially, doubling at each descending level. To enhance feature representation, this study introduces the Multi-scale Feature Fusion Mechanism (MFM), which enables deep fusion and information complementarity of feature maps at different resolutions through Cross-level Feature Interaction (CLEI).
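The resolution and channel progression across the parallel branches can be illustrated with a small shape calculation. The sketch assumes that the resolution is halved at each lower branch (as in the original HRNet) and uses a hypothetical top-branch size of 32 channels at 64 × 48; the text above only states that resolution decreases and channels double.

```python
def branch_shapes(channels, height, width, num_branches=4):
    """Return (channels, height, width) for each parallel branch.

    Assumes channels double and spatial size halves at every lower branch.
    """
    return [(channels * 2 ** i, height // 2 ** i, width // 2 ** i)
            for i in range(num_branches)]

# Hypothetical top-branch feature map: 32 channels at 64 x 48.
shapes = branch_shapes(channels=32, height=64, width=48)
for level, shape in enumerate(shapes, start=1):
    print(f"branch {level}: channels={shape[0]}, height={shape[1]}, width={shape[2]}")
```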

3.1.2. Standard Attention Module and Order-Switched Attention Module

The structural composition of each stage is delineated in Table 1, where the high-resolution branch employs the order-switched attention module, and the low-resolution branch utilizes the standard attention module.

Unlike the locally convolved CNN structure, the self-attention mechanism features global feature interaction. To prepare the input for the Transformer, the output of the first stage $M \in \mathbb{R}^{U \times H \times W}$ is converted into a two-dimensional sequence $X \in \mathbb{R}^{L \times D}$ ($L = H \times W$; $U$ denotes the number of channels; $D$ is the feature dimension, set to be the same as the number of channels) as follows:

$$X = \mathrm{Img2Seq}(M) \tag{1}$$

This operation allows the Transformer to treat each spatial feature as a token, enabling global contextual modeling beyond local convolution. Here, $\mathrm{Img2Seq}(\cdot)$ represents the operation of converting image features into a sequence while preserving spatial location information. To this end, the location information $P \in \mathbb{R}^{L \times D}$ is embedded into the two-dimensional sequence $X^{l}$. The sequence is then subjected to a linear transformation to obtain the query $Q \in \mathbb{R}^{L \times D}$, the key $K \in \mathbb{R}^{L \times D}$, and the value $V \in \mathbb{R}^{L \times D}$. The raw attention computes the pairwise similarity between $Q$ and $K$, resulting in a complexity of $O(L^{2} D)$. The attention computation is expressed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V \tag{2}$$

This attention operation captures the global dependencies among all spatial positions, adaptively weighting features according to their relevance to each other.

Note that in the high-resolution branch the feature dimension $D$ is significantly smaller than the sequence length $L$, so the order-switched attention is applied in that branch. Raw (standard) attention is applied in the low-resolution branch; the specific configuration of each branch is shown in Table 1.

As shown in Figure 4, the order-switched attention changes the model complexity by first computing $K^{T} V$, which yields a complexity of $O(L D^{2})$. The single-head order-switched attention is then applied to capture the global interactions in the sequence, as outlined below:

$$\mathrm{Attention}(Q, K, V) = Q \, \mathrm{Softmax}\left(\frac{K^{T} V}{\sqrt{d_{k}}}\right) \tag{3}$$

Here, $\mathrm{Softmax}(\cdot)$ denotes the activation applied over the attention window, $\sqrt{d_{k}}$ scales the dot-product result, and the superscript $T$ denotes the transpose operation. This operation enables efficient global dependency modeling while reducing computational cost by reordering the attention computation.

Intuitively, this order-switched mechanism reduces computational complexity by changing the order in which attention is computed—first attending along the feature dimension and then along the spatial dimension, rather than jointly over all tokens. This decomposition breaks the original quadratic dependency into two smaller linear operations, thereby maintaining global interaction while significantly lowering the overall computation cost.
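To make the difference between Equations (2) and (3) concrete, the following pure-Python sketch evaluates both on toy inputs and compares the size of the softmax window and the number of multiplications. It is an illustration under stated assumptions, not the authors' implementation: a single head with $d_{k} = D$ is assumed, the softmax in Equation (3) is applied row-wise (the text does not specify the axis), and the toy sizes and input values are arbitrary.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Numerically stable softmax applied to each row independently."""
    result = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result

def scaled(a, factor):
    return [[v / factor for v in row] for row in a]

def standard_attention(q, k, v):
    """Eq. (2): Softmax(Q K^T / sqrt(d_k)) V, with an L x L attention window."""
    window = softmax_rows(scaled(matmul(q, transpose(k)), math.sqrt(len(q[0]))))
    return matmul(window, v), window

def order_switched_attention(q, k, v):
    """Eq. (3): Q Softmax(K^T V / sqrt(d_k)), with a D x D attention window."""
    window = softmax_rows(scaled(matmul(transpose(k), v), math.sqrt(len(q[0]))))
    return matmul(q, window), window

def multiply_counts(seq_len, dim):
    """Multiplications in the two matrix products of each variant."""
    return 2 * seq_len * seq_len * dim, 2 * seq_len * dim * dim

L, D = 6, 2  # tiny toy sizes, chosen only for readability
q = [[math.sin(i + j) for j in range(D)] for i in range(L)]
k = [[math.cos(i - j) for j in range(D)] for i in range(L)]
v = [[0.1 * (i + j) for j in range(D)] for i in range(L)]

out_std, win_std = standard_attention(q, k, v)
out_osa, win_osa = order_switched_attention(q, k, v)
```

Both variants return an $L \times D$ output, but the two are not numerically equivalent: the order-switched form normalizes a $D \times D$ window instead of an $L \times L$ one. In this count the multiplication ratio between the two forms is $L / D$, which is why the switched order pays off when $L$ is much larger than $D$.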

The multi-head attention operation is performed using multiple single-head attention operations with a weight matrix $W^{O}$ as follows:

$$\mathrm{MultiAtt}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_{1}, \dots, \mathrm{head}_{n}) W^{O} \tag{4}$$

This formulation allows the model to jointly attend to information from different representation subspaces, enhancing feature diversity and robustness.

The output of the multi-head attention is fused with the input sequence $X^{l} \in \mathbb{R}^{L \times D}$ to obtain $f \in \mathbb{R}^{L \times D}$:

$$f = X^{l} + \mathrm{MSA}(X^{l}) \tag{5}$$

where $\mathrm{MSA}(\cdot)$ denotes the multi-head self-attention operation. Moreover, this residual fusion helps preserve the original spatial information while incorporating globally refined contextual features.

However, the order-switched attention restricts the $\mathrm{Softmax}(\cdot)$ attention window from $L \times L$ to $D \times D$, thereby diminishing local feature details. Consequently, to enhance the local perceptual ability of the order-switched attention module, this paper proposes the $\mathrm{FFN}_{DW}$ module, which augments local attention by inserting an additional depth-wise convolution into the $\mathrm{FFN}$ layer. The structure of the $\mathrm{FFN}_{DW}$ module is illustrated in Figure 5: a $1 \times 1$ convolution, followed by a $k_{s} \times k_{s}$ depth-wise convolution, and concluding with a $1 \times 1$ convolution.
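A quick parameter count shows why the depth-wise layer keeps this block light. The sketch below counts the weights of the 1 × 1, depth-wise $k_{s} \times k_{s}$, 1 × 1 layout and compares it with the same layout using a dense $k_{s} \times k_{s}$ convolution in the middle. The expansion ratio of 4, the kernel size of 3, the channel width of 32, and the omission of biases are assumptions for illustration; the text does not give these values.

```python
def ffn_dw_params(channels, expansion=4, kernel_size=3):
    """Weights in a 1x1 -> depth-wise kxk -> 1x1 block (biases ignored)."""
    hidden = channels * expansion
    pointwise_in = channels * hidden        # 1x1 conv: channels -> hidden
    depthwise = hidden * kernel_size ** 2   # one k x k filter per hidden channel
    pointwise_out = hidden * channels       # 1x1 conv: hidden -> channels
    return pointwise_in + depthwise + pointwise_out

def dense_ffn_params(channels, expansion=4, kernel_size=3):
    """Same layout, but with a standard (dense) k x k convolution in the middle."""
    hidden = channels * expansion
    dense = hidden * hidden * kernel_size ** 2
    return channels * hidden + dense + hidden * channels

light = ffn_dw_params(32)
heavy = dense_ffn_params(32)
print(f"depth-wise block: {light} weights, dense block: {heavy} weights")
```

The depth-wise layer adds only `hidden * k * k` weights on top of the two pointwise layers, so the local receptive field is widened at a small fraction of the cost of a dense convolution.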

The fused feature $f$ is fed into the $\mathrm{FFN}_{DW}$ module, completing the feature information interaction and yielding $Y \in \mathbb{R}^{L \times D}$, as outlined below:

$$Y = \mathrm{Img2Seq}(\mathrm{FFN}_{DW}(\mathrm{Seq2Img}(f))) \tag{6}$$

$$O_{f} = \mathrm{Seq2Img}(f + Y) \tag{7}$$

These operations jointly realize the interaction and reconstruction of multi-scale features—first integrating global and local information across branches, and then restoring the sequential representation to the spatial domain to generate enhanced image features.
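Equations (1), (6), and (7) all rely on converting between an image-shaped feature map and a token sequence. A minimal sketch of the two conversions on nested lists is given below; the row-major ordering of spatial positions is an assumption, and the helper names `img2seq` and `seq2img` are chosen here for illustration.

```python
def img2seq(feature_map):
    """(C, H, W) nested lists -> (H*W, C) sequence, one token per spatial position."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [[feature_map[c][y][x] for c in range(channels)]
            for y in range(height) for x in range(width)]

def seq2img(sequence, height, width):
    """(H*W, C) sequence -> (C, H, W) nested lists; inverse of img2seq."""
    channels = len(sequence[0])
    return [[[sequence[y * width + x][c] for x in range(width)]
             for y in range(height)] for c in range(channels)]

# Toy feature map with C=2, H=2, W=3.
feature_map = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
tokens = img2seq(feature_map)
restored = seq2img(tokens, height=2, width=3)
```

Because the position of each token in the sequence encodes its spatial location, the round trip is lossless, which is what allows a convolutional block to be applied between two attention layers.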

Here, $\mathrm{Seq2Img}(\cdot)$ denotes the operation that converts a sequence back into an image-shaped feature map. The fused sequence $f + Y$ undergoes the $\mathrm{Seq2Img}(\cdot)$ operation to obtain the image feature $O_{f} \in \mathbb{R}^{D \times H \times W}$, as given in Equation (7). Finally, the Transformer encoder features pass through a $1 \times 1$ convolution whose number of output channels equals the number of keypoints, generating the keypoint heatmaps.
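The text does not describe how coordinates are read out of the heatmaps. A common choice, shown here purely as an assumed post-processing step, is to take the location of the maximum response in each keypoint channel.

```python
def decode_heatmaps(heatmaps):
    """heatmaps: K x H x W nested lists -> [(x, y, score), ...], one per keypoint.

    Uses a plain argmax per channel; sub-pixel refinement is omitted.
    """
    keypoints = []
    for heatmap in heatmaps:
        score, y, x = max(
            (value, y, x)
            for y, row in enumerate(heatmap)
            for x, value in enumerate(row)
        )
        keypoints.append((x, y, score))
    return keypoints

# Two toy 2 x 3 heatmaps (K=2).
heatmaps = [
    [[0.0, 0.1, 0.0],
     [0.2, 0.9, 0.1]],  # peak at x=1, y=1
    [[0.7, 0.1, 0.0],
     [0.0, 0.2, 0.1]],  # peak at x=0, y=0
]
decoded = decode_heatmaps(heatmaps)
```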

Compared with representative methods such as HRFormer [10] and TransPose [9], our approach introduces a lightweight order-switched attention mechanism combined with the $\mathrm{FFN}_{DW}$ module, which incorporates depth-wise separable convolutions to enhance local feature extraction. This design significantly reduces computational complexity while preserving both global and local feature interactions. Consequently, our method achieves real-time performance in aviation scenarios while maintaining high accuracy in detecting multiple hand and limb keypoints, addressing the limitations of the baselines that rely on relatively heavy architectures primarily designed for generic human pose estimation tasks.

3.2. Pilot Hand Keypoint Detection Model

This paper proposes a pilot hand keypoint detection network, which is principally composed of a palm detector module and a hand keypoint module. The palm detector is trained using YOLOv5 [66], and the hand keypoint module reuses the pilot keypoint detection model described in Section 3.1.
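In a two-stage design of this kind, keypoints predicted on the cropped palm region have to be mapped back to full-image coordinates. The sketch below shows that coordinate transform only. It assumes the palm box is given as `(x0, y0, x1, y1)` in image pixels and that the crop is resized to a fixed size before keypoint detection; neither detail is specified in the text, and the function name is hypothetical.

```python
def map_to_full_image(crop_keypoints, palm_box, crop_size):
    """Map (x, y) keypoints from a resized palm crop back to image pixels.

    palm_box:  (x0, y0, x1, y1) of the detected palm in the full image.
    crop_size: (width, height) the crop was resized to before detection.
    """
    x0, y0, x1, y1 = palm_box
    crop_w, crop_h = crop_size
    scale_x = (x1 - x0) / crop_w
    scale_y = (y1 - y0) / crop_h
    return [(x0 + x * scale_x, y0 + y * scale_y) for x, y in crop_keypoints]

# A 64 x 64 palm box resized to 128 x 128 for the keypoint model.
mapped = map_to_full_image(
    crop_keypoints=[(0, 0), (64, 64), (128, 128)],
    palm_box=(100, 200, 164, 264),
    crop_size=(128, 128),
)
```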

3.3. Assessment of Multi-Source Data Fusion

To comprehensively evaluate the effectiveness of the pilot behavior recognition system, this study adopts a multimodal data fusion approach, integrating both qualitative and quantitative indicators for a comprehensive analysis. By consolidating data from diverse sources, a thorough assessment of the system’s performance is achieved.

In multimodal fusion tasks, several statistical and evidential strategies can be employed to integrate heterogeneous data. Among these, this study adopts an extended Bayesian fusion approach to combine qualitative and quantitative information in a unified probabilistic framework. Compared with other fusion strategies such as the Dempster–Shafer evidence theory [67] or Kalman filtering [68], the Bayesian method provides a more flexible and mathematically consistent framework for integrating heterogeneous information sources [69]. Specifically, it can explicitly model uncertainty and prior knowledge, handle both qualitative and quantitative indicators in a unified probabilistic form, and avoid the conflict accumulation problem often encountered in Dempster–Shafer fusion. Moreover, unlike Kalman filtering, which assumes temporal correlation and linear Gaussian noise, the Bayesian approach is more suitable for static evaluation tasks involving expert judgment and multidimensional indicator data.

The Bayesian method, based on conditional probability theory, effectively processes data from various sources. The formulas are as follows:

$$P(\theta = \mathrm{yes}) = \partial \prod_{p = p_{1}}^{p_{n}} \left[ (M \times N) + (1 - M) \times (1 - N) \right] \tag{8}$$

$$\partial = \frac{1}{\prod_{p = p_{1}}^{p_{n}} \left[ (M \times N) + (1 - M) \times (1 - N) \right] + \prod_{p = p_{1}}^{p_{n}} \left[ M \times (1 - N) + (1 - M) \times N \right]} \tag{9}$$

Here, $P(\theta = \mathrm{yes})$ represents the probability that all evaluating experts are satisfied with the indicator under a certain criterion; $M$ and $N$ denote the satisfaction and knowledge level of the experts regarding the indicator, respectively; and $\partial$ represents the normalization factor. In essence, these Bayesian formulations quantify the likelihood of expert consensus and integrate subjective (qualitative) and objective (quantitative) evidence, ensuring a balanced probabilistic evaluation of system performance.
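Equations (8) and (9) can be evaluated directly once each expert's satisfaction and knowledge level are available. The sketch below assumes that $M$ and $N$ take one value per expert $p$ (as the product over $p$ suggests) and that both lie in $[0, 1]$; the three expert scores are made-up illustration values, not data from the study.

```python
from math import prod

def fuse_expert_opinions(opinions):
    """Evaluate Eqs. (8)-(9) for a list of per-expert (M, N) pairs.

    M is the expert's satisfaction and N the expert's knowledge level,
    both assumed to lie in [0, 1].
    """
    agree = prod(m * n + (1 - m) * (1 - n) for m, n in opinions)
    disagree = prod(m * (1 - n) + (1 - m) * n for m, n in opinions)
    total = agree + disagree
    if total == 0:
        # Happens when fully confident experts contradict each other.
        raise ValueError("opinions are in total conflict; fusion is undefined")
    return agree / total  # dividing by `total` applies the normalization factor

# Hypothetical scores from three experts for one indicator.
opinions = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.6)]
p_yes = fuse_expert_opinions(opinions)
print(f"P(theta = yes) = {p_yes:.4f}")
```

An expert with $M = 0.5$ contributes a factor of 0.5 to both products and therefore leaves the fused probability unchanged, which matches the intuition that an undecided opinion carries no evidence.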

The weighted summation of the satisfaction scores computed for each indicator yields the effectiveness evaluation value of the pilot behavior recognition system for individual actions, as shown in Equation (10).

E = \sum_{i = 1}^{23} W (u_{i}) \cdot P (u_{i} \mid \theta = \mathrm{yes})

(10)

where W (u_{i}) = \{w_{1}, w_{2}, \dots, w_{23}\} is the weight of the underlying indicator. This weighted aggregation produces the final effectiveness score, reflecting the overall performance of the pilot behavior recognition system across multiple evaluation dimensions.
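The weighted summation in Equation (10) can be illustrated as follows. The helper name `effectiveness`, the three-indicator example, and the requirement that the weights sum to 1 are assumptions made for this sketch; the paper does not state the weight values here.

```python
def effectiveness(weights, satisfaction):
    """Weighted sum of per-indicator satisfaction scores, as in Equation (10)."""
    if len(weights) != len(satisfaction):
        raise ValueError("weights and satisfaction must have the same length")
    # Assumption of this sketch: indicator weights are normalized to sum to 1.
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    return sum(w * p for w, p in zip(weights, satisfaction))


# Three hypothetical indicators instead of the full set, for brevity.
weights = [0.5, 0.3, 0.2]
satisfaction = [0.9, 0.7, 0.6]
print(round(effectiveness(weights, satisfaction), 3))
```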

4. Discussion

4.1. Pilot Body Keypoint Detection Dataset

4.1.1. Dataset Source

In this study, a high-quality pilot keypoint detection dataset is created using a cockpit simulation platform. The data collection process strictly follows the standardized operating procedures outlined in the Crew Standard Operating Procedures (CSP) Third Edition. The dataset consists of 6800 pilot pose images, including 5200 in the training set, 800 in the validation set, and 800 in the test set, which ensures that the model’s training, validation, and testing requirements are met. All image data are subjected to a standardized preprocessing procedure and uniformly resized to a resolution of 640 pixels (height) × 480 pixels (width) to eliminate potential inconsistencies in image sizes that could impact model performance. Multiple standard RGB cameras are mounted in the simulated cockpit to capture pilot poses from different viewpoints (Figure 6), ensuring comprehensive coverage and high-quality images for keypoint annotation. The specifications of these cameras are summarized in Table 2.

For keypoint annotation, this study uses 17 predefined keypoints to annotate the pilot’s limbs: the nose, both eyes, ears, shoulders, elbows, wrists, hips, and knees, together with the left and right midpoints between hip and knee. Figure 7 shows sample images captured for keypoint detection. The annotation of these keypoints provides accurate spatial position information for subsequent pose estimation and behavior recognition tasks, laying the foundation for model training and evaluation.

4.1.2. Evaluation Criteria

In this study, Object Keypoint Similarity (OKS) [70] is used as the performance evaluation metric for pilot keypoint detection. OKS provides an accurate measure of model performance by quantifying the spatial consistency between predicted keypoints and ground truth keypoints. The specific evaluation metrics include Average Precision (AP) and Average Recall (AR) at OKS thresholds of 0.50, 0.55, …, 0.90, and 0.95, calculated at 10 points to comprehensively reflect the model’s performance under different similarity thresholds. The formula for calculating OKS is defined as follows:

\mathrm{OKS} = \frac{\sum_{j} \exp \left( - \frac{d_{j}^{2}}{2 s^{2} k_{j}^{2}} \right) \cdot \delta (v_{j} > 0)}{\sum_{j} \delta (v_{j} > 0)}

(11)

where j is the type index of the pilot’s limb keypoints; d_{j} is the Euclidean distance between the predicted keypoint coordinates and the manually labelled coordinates; v_{j} is the visibility flag of keypoint j, with v_{j} > 0 indicating that the keypoint is visible, so that \delta (v_{j} > 0) equals 1 for such keypoints and 0 otherwise; s is the scale factor, usually the square root of the area of the target region, which normalizes the distance measurement; and k_{j} is the per-keypoint attenuation constant, which adjusts the weight of each keypoint.

This evaluation method provides a comprehensive quantitative criterion for the performance of the pilot keypoint detection model by taking into account the spatial error, visibility, and scale information of the keypoints.
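As an illustration of Equation (11), the following sketch computes OKS for a single person. The function name `oks`, its argument layout, and the example coordinates are hypothetical; the target area is passed directly, so that s² equals `area`.

```python
import math


def oks(pred, gt, visibility, area, k):
    """Object Keypoint Similarity for one person, following Equation (11).

    pred, gt   : lists of (x, y) coordinates, one pair per keypoint
    visibility : list of v_j flags; only keypoints with v_j > 0 are counted
    area       : area of the target region, so that s**2 == area
    k          : list of per-keypoint constants k_j
    """
    total, counted = 0.0, 0
    for (px, py), (gx, gy), v, kj in zip(pred, gt, visibility, k):
        if v <= 0:
            continue  # delta(v_j > 0) is 0 for this keypoint
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * area * kj ** 2))
        counted += 1
    # Return 0.0 when no keypoint is visible (a choice made for this sketch).
    return total / counted if counted else 0.0


# The ten OKS thresholds 0.50, 0.55, ..., 0.95 used when averaging AP and AR.
THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

# Hypothetical example: two visible keypoints and one hidden keypoint.
gt = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]
pred = [(10.0, 10.0), (21.0, 21.0), (99.0, 99.0)]
print(oks(pred, gt, visibility=[2, 2, 0], area=100.0, k=[0.1, 0.1, 0.1]))
```

A perfect prediction yields an OKS of 1, and the hidden third keypoint does not affect the score regardless of its predicted position.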

4.1.3. Experimental Setup

The experiments were conducted on an experimental setup consisting of an Ubuntu 20.04 LTS 64-bit system, two NVIDIA 3090 graphics cards (NVIDIA Corporation, Santa Clara, CA, USA), and the PyTorch 1.8.1 deep learning framework. The images were cropped to a size of 256 × 192 for training on the pilot limb keypoints dataset, and the AdamW optimizer was used in the experiments. The initial learning rate was set to 1 × 10⁻³, reduced to 1 × 10⁻⁴ at the 140th epoch, and further reduced to 1 × 10⁻⁵ at the 170th epoch, with the model being trained for a total of 210 epochs. The batch size per GPU was set to 32, and the data was augmented using image rotation and horizontal flipping.
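The step-wise learning rate schedule described above can be written as a small framework-independent helper. This is a sketch only: the function name `learning_rate` is hypothetical, and treating the "140th epoch" as the 0-based index 140 is an assumption, since the text does not say whether epochs are counted from 0 or 1.

```python
def learning_rate(epoch, base_lr=1e-3, milestones=(140, 170), gamma=0.1):
    """Step-decay learning rate for a 0-based epoch index.

    The rate is multiplied by `gamma` once for every milestone reached.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Learning rate at the start, after the first drop, and after the second drop.
for epoch in (0, 140, 170):
    print(epoch, learning_rate(epoch))

# Two GPUs with a per-GPU batch size of 32 give an effective batch size of 64.
EFFECTIVE_BATCH_SIZE = 2 * 32
```

The same helper covers the MS COCO schedule in Section 4.2.2 by passing `milestones=(160, 210)`.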

4.1.4. Result Analysis

To evaluate the effectiveness of the proposed model, it was compared with several benchmark models on the pilot keypoint detection dataset. The results, including the number of parameters of each keypoint detection network and its GFLOPs (giga floating-point operations), are presented in Table 3.

In this paper, the model is trained with an input size of 256 × 192, achieving an AP score of 81.9. Compared to SimpleBaseline, our method shows a 4.0 AP score advantage. Additionally, our method demonstrates a 1.1 AP score advantage over HRNetV1, while reducing the number of parameters by 24.0 M. A comparison with TransPose reveals that our method has 3.5 M fewer parameters and a 0.6 higher AP score. However, compared to PoseUR, the AP score of our method is 0.3 points lower, but the number of parameters is significantly reduced by 24.3 M. Overall, this method strikes a balance between the number of parameters and accuracy, providing a real-time solution for pilot behavior recognition.

Furthermore, as shown in Table 4, the inference speed of the proposed model, measured on a single NVIDIA 3090 graphics card, exceeds that of HRFormer-S, even though the proposed model has more parameters and higher complexity. This can be attributed to the mechanism HRFormer-S uses to reduce computational cost, which lowers the theoretical operation count without delivering a corresponding gain in measured speed. The findings highlight the practical relevance of the proposed model and its engineering value for future applications.

Figure 8 demonstrates the visualization results of the pilot operating images captured in this study under the keypoint detection model. The green dots in the figure mark the detected key points of the human body, and the connecting lines indicate the connectivity between the body parts, which provides the basic support for the subsequent tasks of pose estimation and behavior recognition.

4.2. MS COCO Dataset

4.2.1. Introduction to the MS COCO Dataset

The MS COCO keypoint detection dataset comprises a total of 200,000 images and 250,000 human instances, each annotated with 17 keypoints. The OKS evaluation criterion defined in Equation (11) is applied to the COCO dataset.

4.2.2. Experimental Settings

The training environment was as follows: an Ubuntu 20.04 LTS 64-bit system, four NVIDIA 3090 graphics cards, and the PyTorch 1.8.1 deep learning framework. The images were cropped to 256 × 192 for training on the MS COCO dataset, and the AdamW optimizer was used. The initial learning rate was set to 1 × 10⁻³, reduced to 1 × 10⁻⁴ at the 160th epoch and further to 1 × 10⁻⁵ at the 210th epoch, with the model trained for a total of 230 epochs. The batch size per GPU was set to 32.

4.2.3. Analysis of Experimental Results

The model was validated on the COCO val2017 dataset, and the results are presented in Table 5. With an input image size of 256 × 192, the model obtained 72.8 AP (the image brightness adjustment module was not used for these results, to ensure a fair comparison). Compared with the CPN model, the AP of this model is 4.2 points higher. Compared with the SimpleBaseline model, the number of parameters is reduced by 16.7%, the GFLOPs are decreased by 42.7%, and the AP is 2.4 points higher. Compared with the HRNetV1 network model, the number of parameters is reduced by 84.2% and the GFLOPs are decreased by 46.5%, while the AP is only 0.6 points lower.

Compared with the lightweight networks MobileNetV2 and ShuffleNetV2, the model presented in this paper has 5.1 M and 3.1 M fewer parameters, respectively, while improving performance by 8.2 and 12.9 AP points. Compared with the HRFormer-S model, the number of parameters increases by 2.0 M, but the AP score is 2.6 points higher.

4.3. FreiHAND Hand Keypoint Dataset

4.3.1. Introduction to the FreiHAND Hand Keypoint Dataset

The FreiHAND dataset [78], which contains 130,240 training samples and 3960 validation samples, is used as the hand keypoint dataset. Each training sample consists of an RGB image (224 pixels × 224 pixels) together with precise location information for 21 hand keypoints. The high-quality annotations of this dataset provide reliable support for training and validating the model on the hand keypoint detection task.

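The model performance was evaluated using the standard Percentage of Correct Keypoints (PCK) metric, defined as follows:

h = \sqrt{{(x_{2} - x_{1})}^{2} + {(y_{2} - y_{1})}^{2}}

(12)

\mathrm{PCK} = \frac{1}{n} \sum_{i = 1}^{n} \mathbb{1} (h_{i} \leq e), \quad h_{i} \in D

(13)

where h is the error distance between the predicted and true positions of a keypoint, with coordinates (x_{1}, y_{1}) and (x_{2}, y_{2}); e is the error threshold; n is the number of hand keypoints; D is the set of error distances h_{i} of these keypoints; and \mathbb{1}(\cdot) equals 1 when its condition holds and 0 otherwise, so that a keypoint counts as correct when its error does not exceed the threshold.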
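A minimal sketch of the PCK computation follows. It counts a keypoint as correct when its error distance does not exceed the threshold, which is the usual PCK convention; the function name `pck` and the example coordinates are hypothetical.

```python
import math


def pck(pred, gt, threshold):
    """Fraction of keypoints whose error distance is within `threshold`.

    pred, gt : lists of (x, y) coordinates, one pair per keypoint
    """
    if not gt or len(pred) != len(gt):
        raise ValueError("pred and gt must be non-empty and of equal length")
    # Error distance of Equation (12) for every keypoint.
    distances = [
        math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)
    ]
    correct = sum(1 for h in distances if h <= threshold)
    return correct / len(distances)


# Hypothetical example with error distances 0, 5 and 2.
pred = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
gt = [(0.0, 0.0), (0.0, 0.0), (10.0, 12.0)]
print(pck(pred, gt, threshold=2.0))
```

In this example two of the three keypoints fall within the threshold, so the score is 2/3.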

4.3.2. Analysis of Experimental Results

The experimental validation was carried out on the FreiHAND dataset, and Table 6 compares this method with other methods.

In terms of real-time performance, this model is slower than MobileNetV2, MobileNetV3, and ShuffleNetV2, but it achieves a higher prediction score. Compared with the SRHandNet network model, it improves the PCK score by 3.41.

4.4. Joint Deployment of Limb and Hand Keypoints

The single-image-based joint deployment model for pilot limb and hand keypoint detection proposed in this study is capable of simultaneously extracting 59 pilot body keypoints, including 42 hand keypoints (21 points per hand) and 17 limb keypoints. Typical flight maneuver scenarios were selected for validation, including operations such as clicking on the touch screen, pushing and pulling the throttle stick, opening the landing gear, and activating the auto-navigation button. The experimental findings demonstrate that the jointly implemented model exhibits effective performance in real-time detection tasks and establishes a substantial foundation for subsequent pilot behavior recognition.
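The keypoint bookkeeping of the joint model (17 limb keypoints plus 21 keypoints per hand) can be sketched as below. The function name `merge_keypoints` and the ordering of the merged list (limbs, then left hand, then right hand) are assumptions made for illustration; the paper does not specify an index layout.

```python
LIMB_COUNT = 17  # limb keypoints per pilot
HAND_COUNT = 21  # keypoints per hand


def merge_keypoints(limbs, left_hand, right_hand):
    """Concatenate limb and hand keypoints into a single 59-point list.

    The ordering (limbs, left hand, right hand) is an assumption of this sketch.
    """
    if len(limbs) != LIMB_COUNT:
        raise ValueError(f"expected {LIMB_COUNT} limb keypoints, got {len(limbs)}")
    for name, hand in (("left", left_hand), ("right", right_hand)):
        if len(hand) != HAND_COUNT:
            raise ValueError(
                f"expected {HAND_COUNT} {name}-hand keypoints, got {len(hand)}"
            )
    return list(limbs) + list(left_hand) + list(right_hand)


# Placeholder coordinates standing in for the outputs of the two detectors.
limbs = [(float(i), 0.0) for i in range(LIMB_COUNT)]
left = [(float(i), 1.0) for i in range(HAND_COUNT)]
right = [(float(i), 2.0) for i in range(HAND_COUNT)]
pose = merge_keypoints(limbs, left, right)
print(len(pose))  # 59
```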

Table 7 compares the proposed joint deployment model with existing methods. The results of the joint model are derived from the data presented in Table 3 and Table 6. The proposed model delivers better real-time performance with fewer parameters, demonstrating a favorable balance between computational efficiency and detection accuracy. Figure 9 shows the real-time inference results of the proposed method alongside those of several other jointly deployed models, all measured on a single NVIDIA 3090 graphics card. This figure further substantiates the efficacy and practicality of the proposed method in real-world application scenarios.

4.5. Ablation Experiment

To investigate the effect of the order-switched attention module in the Transformer module on model performance, three models were validated on the pilot’s limb keypoint dataset: the model with the original attention module, the model with order-switched attention but without the FFN_{DW} layer, and the model proposed in this paper. Table 8 compares the performance of these models on the pilot’s limb keypoint dataset for an input size of 256 × 192. The comparison with the original model shows that the proposed model approximates its recognition ability with lower complexity and higher inference speed. Furthermore, the comparison with the order-switched attention model (without FFN_{DW}) shows that inserting depth-wise convolution in the FFN layer enhances the perception of local information.
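To give a sense of why depth-wise convolution is inexpensive, the sketch below compares the weight count of a standard convolution with that of a depth-wise separable one, using the standard textbook formulas. The channel width of 64 and the 3 × 3 kernel are hypothetical values, not the configuration of the proposed network.

```python
def conv_params(c_in, c_out, kernel):
    """Weights in a standard kernel x kernel convolution (bias omitted)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(c_in, c_out, kernel):
    """Weights in a depth-wise kernel x kernel convolution followed by a
    1 x 1 point-wise convolution (bias omitted)."""
    depthwise = kernel * kernel * c_in  # one kernel x kernel filter per channel
    pointwise = c_in * c_out            # 1 x 1 convolution mixing the channels
    return depthwise + pointwise


# Hypothetical layer: 64 input channels, 64 output channels, 3 x 3 kernel.
standard = conv_params(64, 64, 3)
separable = depthwise_separable_params(64, 64, 3)
print(standard, separable, round(standard / separable, 2))
```

The depth-wise part alone contributes only kernel² weights per channel, which is why adding it to a feed-forward layer changes the parameter budget very little.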

5. Evaluation

5.1. The Construction of a Pilot Behavior Recognition Indicator System

To evaluate the feasibility of the pilot behavior recognition system in civil aviation cockpits, we developed a comprehensive evaluation indicator system based on computer vision technology, considering multiple factors such as hardware facilities, software algorithms, external lighting conditions, and individual differences. This system, validated through extensive experiments and practical applications, includes key indicators such as pose estimation and motion capture accuracy, as shown in Table 9. The qualitative indicators include {A1, A3, A4, A8, A13, A14, A22}, and the quantitative indicators include {A2, A5, A6, A7, A9, A10, A11, A12, A15, A16, A17, A18, A19, A20, A21}. This system aims to ensure that the behavior recognition system complies with aviation safety standards and is effectively applicable to typical scenarios in flight missions.

5.2. Experimental Content

The experiment was conducted on a civil aircraft cockpit simulation platform equipped with a pilot behavior recognition system and a data analysis system. A wide-angle camera was installed on top of the flight simulator to collect pilot operation data in real time. Twenty flight trainees with some flight experience were trained before the experiment, and care was taken to ensure that they operated in a stable mood and without physical fatigue. The experimenters grouped the trainees according to light intensity, and the trainees completed the flight tasks under different light conditions. The pilots’ actions were recorded and identified by the behavior recognition system, which classified each action as “correctly identified”, “not identified”, or “incorrectly identified”.
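The three outcome categories lend themselves to a simple tally. The sketch below turns a list of per-action outcome labels into rates; the function name `outcome_rates` and the ten example labels are hypothetical and are not experimental data.

```python
from collections import Counter

OUTCOMES = ("correctly identified", "not identified", "incorrectly identified")


def outcome_rates(labels):
    """Return the share of each outcome category among the recorded actions."""
    counts = Counter(labels)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unexpected outcome labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no actions recorded")
    return {name: counts[name] / total for name in OUTCOMES}


# Ten hypothetical recorded actions (illustrative values only).
labels = (
    ["correctly identified"] * 7
    + ["not identified"] * 2
    + ["incorrectly identified"]
)
print(outcome_rates(labels))
```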

For the qualitative indicators, five experts in the field were invited to provide scores. For the quantitative indicators, multiple indicator tests were conducted. The multi-source data fusion method described in Section 3.3 was then applied to calculate the satisfaction level of each indicator and the effectiveness evaluation value of the single-action-based pilot behavior recognition system.

5.3. Analysis of Experimental Results

Figure 10 illustrates the identification results of typical pilot operations—right-click, push, gear shift, and twist—recognized by the behavior recognition system. These operation samples were selected as representative actions in the constructed indicator system and used to evaluate the system’s safety performance under realistic cockpit scenarios.

Based on the experimental data and the multimodal data fusion method introduced in Section 3.3, this study evaluates the safety of the pilot behavior recognition system. By integrating qualitative and quantitative indicators through an extended Bayesian approach, the overall system satisfaction under different criteria is computed and analyzed. The specific evaluation results are presented in Table 10.

Through the aforementioned data fusion and weighted summation, this study obtained an effectiveness evaluation value of 0.782 for the pilot behavior recognition system, indicating a certain level of recognition capability, particularly excelling in routine pilot operations. Specifically, the recognition accuracy for the “touching the touchscreen” and “pushing the throttle lever” actions reached 0.782 and 0.753, respectively, approaching the accuracy of manual methods (0.98 and 0.99). However, for the more complex operation of “deploying the landing gear,” the recognition accuracy is significantly lower at only 0.474, far below the manual method’s accuracy of 0.97. This suggests that the system’s recognition capability declines when handling complex flight tasks.

To further investigate the limitations of the proposed pilot behavior recognition system, Figure 11 presents a comparison between correctly and incorrectly identified actions. The system accurately recognizes routine operations such as pushing the throttle lever and touching the touchscreen, achieving precise keypoint localization and consistent pose tracking. However, misclassifications occur in certain cases: the touchscreen interaction is mistakenly identified when the pilot’s finger is positioned above a button overlaying the touchscreen, and the throttle lever push is incorrectly recognized due to the lever blending with the background color.

These errors mainly arise from partial occlusions, ambiguous hand–instrument interactions, and visual similarity with surrounding cockpit elements. These findings indicate that while the system performs reliably for routine pilot actions, its robustness decreases under visually complex or occluded scenarios. Future improvements will focus on integrating temporal modeling and multimodal data fusion to better capture motion continuity and mitigate the influence of environmental constraints.

6. Summary

This study focuses on the accuracy and safety evaluation of pilot pose estimation and behavior recognition, proposing a lightweight computer vision-based approach (Figure 12). By constructing a joint network for limb and hand keypoint detection and integrating order-switched attention mechanisms with depth-wise separable convolution, the proposed method optimizes both recognition accuracy and computational efficiency. Experimental results demonstrate that the method exhibits strong robustness under varying lighting conditions while meeting real-time performance requirements, providing technical support for intelligent cockpits and pilot behavior monitoring. Additionally, a safety evaluation framework based on multimodal data fusion is developed, and an extended Bayesian approach is employed for quantitative analysis. Evaluation results indicate that the system can adapt to complex environmental variations, ensuring the stability of pilot behavior recognition and providing a scientific basis for flight safety management.

Despite the progress achieved in this study, further optimization is needed: (1) enhancing the recognition capability of complex operational behaviors; (2) exploring multimodal data fusion to improve behavior understanding; (3) strengthening system adaptability to extreme environments. In the future, the proposed method and evaluation framework can be extended to broader aviation safety management applications, supporting the development of intelligent cockpit technologies.

Author Contributions

Conceptualization, H.W.; Methodology, H.W. and X.L.; Software, H.W. and X.L.; Validation, H.W., X.L. and H.L.; Formal analysis, H.W. and X.L.; Data curation, H.W. and X.L.; Writing – original draft, H.W. and X.L.; Writing – review & editing, Y.S. and H.L.; Visualization, H.W.; Supervision, Y.S. and H.L.; Project administration, Y.S. and H.L.; Funding acquisition, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Joint Fund of the National Natural Science Foundation of China and the Civil Aviation Administration of China, grant numbers U2033202 and U1333119.

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CSP	Crew Standard Operating Procedures
OKS	Object Keypoint Similarity
AP	Average Precision
AR	Average Recall
PCK	Percentage of Correct Keypoints
MS COCO	Microsoft Common Objects in Context
FreiHAND	Freiburg Hand Dataset
FFN	Feed-Forward Network
GFLOPs	Giga Floating-Point Operations

References

Liang, B.; Chen, Y.; Wu, H. A Conception of Flight Test Mode for Future Intelligent Cockpit. In Proceedings of the 2020 Chinese Automation Congress (CAC), Shanghai, China, 6–8 November 2020; pp. 3260–3264. [Google Scholar]
Li, Q.; Ng, K.K.H.; Yiu, C.Y.; Yuan, X.; So, C.K.; Ho, C.C. Securing Air Transportation Safety through Identifying Pilot’s Risky VFR Flying Behaviours: An EEG-Based Neurophysiological Modelling Using Machine Learning Algorithms. Reliab. Eng. Syst. Saf. 2023, 238, 109449. [Google Scholar] [CrossRef]
Wang, A.; Chen, H.; Zheng, C.; Zhao, L.; Liu, J.; Wang, L. Evaluation of Random Forest for Complex Human Activity Recognition Using Wearable Sensors. In Proceedings of the 2020 International Conference on Networking and Network Applications (NaNA), Haikou, China, 10–13 December 2020; pp. 310–315. [Google Scholar]
Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.-E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 172–186. [Google Scholar] [CrossRef]
Li, Y.; Wang, C.; Cao, Y.; Liu, B.; Luo, Y.; Zhang, H. A-HRNet: Attention Based High Resolution Network for Human Pose Estimation. In Proceedings of the 2020 Second International Conference on Transdisciplinary AI (transai), Irvine, CA, USA, 21–23 September 2020; pp. 75–79. [Google Scholar]
Wang, Z.; Liu, Y.; Zhang, E. Pose Estimation for Cross-Domain Non-Cooperative Spacecraft Based on Spatial-Aware Keypoints Regression. Aerospace 2024, 11, 948. [Google Scholar] [CrossRef]
Su, X.; Xu, H.; Zhao, J.; Zhang, F.; Chen, X. GM-HRNet: Human Pose Estimation Based on Global Modeling. In Proceedings of the 2023 IEEE 7th Information Technology and Mechatronics Engineering Conference (Itoec), Chongqing, China, 15–17 September 2023; Volume 7, pp. 1144–1147. [Google Scholar]
A Survey on Convolutional Neural Networks and Their Performance Limitations in Image Recognition Tasks. Available online: https://onlinelibrary.wiley.com/doi/epdf/10.1155/2024/2797320 (accessed on 5 March 2025).
Yang, S.; Quan, Z.; Nie, M.; Yang, W. TransPose: Keypoint Localization via Transformer. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 11782–11792. [Google Scholar]
Yuan, Y.; Fu, R.; Huang, L.; Lin, W.; Zhang, C.; Chen, X.; Wang, J. HRFormer: High-Resolution Transformer for Dense Prediction. arXiv 2021, arXiv:2110.09408. [Google Scholar] [CrossRef]
Zhao, Z.; Song, A.; Zheng, S.; Xiong, Q.; Guo, J. DSC-HRNet: A Lightweight Teaching Pose Estimation Model with Depthwise Separable Convolution and Deep High-Resolution Representation Learning in Computer-Aided Education. Int. J. Inf. Technol. 2023, 15, 2373–2385. [Google Scholar] [CrossRef]
Caishi, H.; Sijia, W.; Yan, L.; Zihao, D.; Feng, Y. Real-Time Human Pose Estimation on Embedded Devices Based on Deep Learning. In Proceedings of the 2023 20th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), Chengdu, China, 15–17 December 2023; pp. 1–7. [Google Scholar]
He, Y.; Kang, G.; Dong, X.; Fu, Y.; Yang, Y. Soft Filter Pruning for Accelerating Deep Convolutional Neural Networks. arXiv 2018, arXiv:1808.06866. [Google Scholar]
Barabanov, M.Y.; Bedolla, M.A.; Brooks, W.K.; Cates, G.D.; Chen, C.; Chen, Y.; Cisbani, E.; Ding, M.; Eichmann, G.; Ent, R.; et al. Diquark Correlations in Hadron Physics: Origin, Impact and Evidence. Prog. Part. Nucl. Phys. 2021, 116, 103835. [Google Scholar] [CrossRef]
Reconstruction of Pilot Behaviour from Cockpit Image Recorder. Available online: https://arc.aiaa.org/doi/10.2514/6.2020-1873 (accessed on 12 March 2025).
Li, Y.; Li, K.; Wang, S.; Chen, X.; Wen, D. Pilot Behavior Recognition Based on Multi-Modality Fusion Technology Using Physiological Characteristics. Biosensors 2022, 12, 404. [Google Scholar] [CrossRef]
Wang, Z.Z.; Xia, X.; Chen, Q. Multi-Level Data Fusion Enables Collaborative Dynamics Analysis in Team Sports Using Wearable Sensor Networks. Sci. Rep. 2025, 15, 28210. [Google Scholar] [CrossRef] [PubMed]
Xie, D.; Zhang, X.; Gao, X.; Zhao, H.; Du, D. MAF-Net: A Multimodal Data Fusion Approach for Human Action Recognition. PLoS ONE 2025, 20, e0319656. [Google Scholar] [CrossRef] [PubMed]
Wang, P.; Wang, H.; Zhang, H. Cognitive Workload Assessment in Aerospace Scenarios: A Cross-Modal Transformer Framework for Multimodal Physiological Signal Fusion. Multimodal Technol. Interact. 2025, 9, 89. [Google Scholar] [CrossRef]
Rodis, N.; Sardianos, C.; Radoglou-Grammatikis, P.; Sarigiannidis, P.; Varlamis, I.; Papadopoulos, G.T. Multimodal Explainable Artificial Intelligence: A Comprehensive Review of Methodological Advances and Future Research Directions. IEEE Access 2024, 12, 159794–159820. [Google Scholar] [CrossRef]
Lowe, D.G. Object Recognition from Local Scale-Invariant Features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; Volume 2, pp. 1150–1157. [Google Scholar]
Deng, L.; Suo, H.; Jia, Y.; Huang, C. Pose Estimation Method for Non-Cooperative Target Based on Deep Learning. Aerospace 2022, 9, 770. [Google Scholar] [CrossRef]
Phisannupawong, T.; Kamsing, P.; Torteeka, P.; Channumsin, S.; Sawangwit, U.; Hematulin, W.; Jarawan, T.; Somjit, T.; Yooyen, S.; Delahaye, D.; et al. Vision-Based Spacecraft Pose Estimation via a Deep Convolutional Neural Network for Noncooperative Docking Operations. Aerospace 2020, 7, 126. [Google Scholar] [CrossRef]
Sharma, P.; Shah, B.B.; Prakash, C. A Pilot Study on Human Pose Estimation for Sports Analysis. In Pattern Recognition and Data Analysis with Applications; Gupta, D., Goswami, R.S., Banerjee, S., Tanveer, M., Pachori, R.B., Eds.; Lecture Notes in Electrical Engineering; Springer Nature: Singapore, 2022; Volume 888, pp. 533–544. ISBN 978-981-19-1519-2. [Google Scholar]
Li, X.; Du, H.; Wu, X. Algorithm of Pedestrian Pose Recognition Based on Keypoint Detection. In Proceedings of the 2023 8th IEEE International Conference on Network Intelligence and Digital Content (IC-NIDC), Beijing, China, 3–5 November 2023; pp. 122–126. [Google Scholar]
Jeong, J.; Park, B.; Yoon, K. 3D Human Skeleton Keypoint Detection Using RGB and Depth Image. Trans. Korean Inst. Electr. Eng. 2021, 70, 1354–1361. [Google Scholar] [CrossRef]
Zhu, Z.; Dong, W.; Gao, X.; Peng, A.; Luo, Y. Towards Human Keypoint Detection in Infrared Images. In Proceedings of the 29th International Conference on Neural Information Processing, ICONIP 2022, New Delhi, India, 22–26 November 2022; Springer Science and Business Media Deutschland GmbH: Berlin/Heidelberg, Germany, 2023; Volume 1792, pp. 528–539. [Google Scholar]
Yuan, M.; Bi, X.; Huang, X.; Zhang, W.; Hu, L.; Yuan, G.Y.; Zhao, X.; Sun, Y. Towards Time-Series Key Points Detection through Self-Supervised Learning and Probability Compensation. In Proceedings of the Database Systems for Advanced Applications, Tianjin, China, 17–20 April 2023; Wang, X., Sapino, M.L., Han, W.-S., El Abbadi, A., Dobbie, G., Feng, Z., Shao, Y., Yin, H., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2023; pp. 237–252. [Google Scholar]
Yu, C.; Yang, X.; Bao, W.; Wang, S.; Yao, Z. A Self-Supervised Pressure Map Human Keypoint Detection Approch: Optimizing Generalization and Computational Efficiency across Datasets. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024. [Google Scholar] [CrossRef]
Aghaomidi, P.; Aram, S.; Bahmani, Z. Leveraging Self-Supervised Learning for Accurate Facial Keypoint Detection in Thermal Images. In Proceedings of the 2023 30th National and 8th International Iranian Conference on Biomedical Engineering (ICBME), Tehran, Iran, 30 November–1 December 2023; pp. 452–457. [Google Scholar]
Vdoviak, G.; Sledevič, T. Enhancing Keypoint Detection in Thermal Images: Optimizing Loss Function and Real-Time Processing with YOLOv8n-Pose. In Proceedings of the 2024 IEEE 11th Workshop on Advances in Information, Electronic and Electrical Engineering (AIEEE), Valmiera, Latvia, 31 May–1 June 2024; pp. 1–5. [Google Scholar]
Shuster, M.D. The TRIAD Algorithm as Maximum Likelihood Estimation. J. Astronaut. Sci. 2006, 54, 113–123. [Google Scholar] [CrossRef]
Zaal, P.M.T.; Mulder, M.; Van Paassen, M.M.; Mulder, J.A. Maximum Likelihood Estimation of Multi-Modal Pilot Control Behavior in a Target-Following Task. In Proceedings of the 2008 IEEE International Conference on Systems, Man and Cybernetics, Singapore, 12–15 October 2008; pp. 1085–1090. [Google Scholar]
Roggio, F.; Trovato, B.; Sortino, M.; Musumeci, G. A Comprehensive Analysis of the Machine Learning Pose Estimation Models Used in Human Movement and Posture Analyses: A Narrative Review. Heliyon 2024, 10, e39977. [Google Scholar] [CrossRef]
Brutch, S.; Moncayo, H. Machine Learning Approach to Estimation of Human-Pilot Model Parameters. In Proceedings of the AIAA Scitech 2024 Forum, Orlando, FL, USA, 8–12 January 2024; American Institute of Aeronautics and Astronautics: Reston, VA, USA, 2024. [Google Scholar]
Wei, C.; Long, W.; Jiang, S.; Chen, C.; Wu, D.; Jiang, L. A Trend of 2D Human Pose Estimation Base on Deep Learning. In Proceedings of the 2022 4th International Symposium on Smart and Healthy Cities (ISHC), Shanghai, China, 16–17 December 2022; pp. 214–219. [Google Scholar]
Liu, Y.; Qiu, C.; Zhang, Z. Deep Learning for 3D Human Pose Estimation and Mesh Recovery: A Survey. Neurocomputing 2024, 596, 128049. [Google Scholar] [CrossRef]
Wu, Y.; Kong, D.; Gao, J.; Li, J.; Yin, B. Joint Multi-Scale Transformers and Pose Equivalence Constraints for 3D Human Pose Estimation. J. Vis. Commun. Image Represent. 2024, 103, 104247. [Google Scholar] [CrossRef]
Li, W.; Liu, M.; Liu, H.; Guo, T.; Wang, T.; Tang, H.; Sebe, N. GraphMLP: A Graph MLP-like Architecture for 3D Human Pose Estimation. Pattern Recognit. 2025, 158, 110925. [Google Scholar] [CrossRef]
Xiang, X.; Li, X.; Bao, W.; Qiao, Y.; El Saddik, A. DBMHT: A Double-Branch Multi-Hypothesis Transformer for 3D Human Pose Estimation in Video. Comput. Vis. Image Underst. 2024, 249, 104147. [Google Scholar] [CrossRef]
Sun, Q.; Pan, X.; Ling, X.; Wang, B.; Sheng, Q.; Li, J.; Yan, Z.; Yu, K.; Wang, J. A Vision-Based Pose Estimation of a Non-Cooperative Target Based on a Self-Supervised Transformer Network. Aerospace 2023, 10, 997. [Google Scholar] [CrossRef]
Zhu, M.; Ho, E.S.L.; Chen, S.; Yang, L.; Shum, H.P.H. Geometric Features Enhanced Human–Object Interaction Detection. IEEE Trans. Instrum. Meas. 2024, 73, 5026014. [Google Scholar] [CrossRef]
Cheng, C.; Xu, H. Human Pose Estimation in Complex Background Videos via Transformer-Based Multi-Scale Feature Integration. Displays 2024, 84, 102805. [Google Scholar] [CrossRef]
Zhang, T.; Li, Q.; Wen, J.; Philip Chen, C.L. Enhancement and Optimisation of Human Pose Estimation with Multi-Scale Spatial Attention and Adversarial Data Augmentation. Inf. Fusion 2024, 111, 102522. [Google Scholar] [CrossRef]
Luo, Y.; Gao, X. A Lightweight Network for Human Keypoint Detection Based on Hybrid Attention. In Proceedings of the 4th International Conference on Neural Networks, Information and Communication Engineering, NNICE 2024, Guangzhou, China, 19–21 January 2024; Institute of Electrical and Electronics Engineers Inc.: Guangzhou, China, 2024; pp. 10–15. [Google Scholar]
Xiao, Q.; Zhao, R.; Shi, G.; Deng, D. PLPose: A Bottom-up Lightweight Pose Estimation Detection Model. In Proceedings of the 2024 5th International Conference on Computer Vision, Image and Deep Learning (CVIDL), Zhuhai, China, 19–21 April 2024; pp. 1439–1443. [Google Scholar]
Xu, W.; Xu, Y.; Chang, T.; Tu, Z. Co-Scale Conv-Attentional Image Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
Suck, S.; Fortmann, F. Aircraft Pilot Intention Recognition for Advanced Cockpit Assistance Systems. In Foundations of Augmented Cognition: Neuroergonomics and Operational Neuroscience, Proceedings of the 10th International Conference, AC 2016, Held as Part of HCI International 2016, Toronto, ON, Canada, 17–22 July 2016; Schmorrow, D.D., Fidopiastis, C.M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 231–240. [Google Scholar]
Pasquini, A.; Pozzi, S.; McAuley, G. Eliciting Information for Safety Assessment. Saf. Sci. 2008, 46, 1469–1482. [Google Scholar] [CrossRef]
McMurtrie, K.J.; Molesworth, B.R.C. The Variability in Risk Assessment between Flight Crew. Int. J. Aerosp. Psychol. 2017, 27, 65–78. [Google Scholar] [CrossRef]
Guo, Y.; Sun, Y.; He, Y.; Du, F.; Su, S.; Peng, C. A Data-Driven Integrated Safety Risk Warning Model Based on Deep Learning for Civil Aircraft. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 1707–1719. [Google Scholar] [CrossRef]
Shi, L.-L.; Chen, J. Assessment Model of Command Information System Security Situation Based on Twin Support Vector Machines. In Proceedings of the 2017 International Conference on Network and Information Systems for Computers (ICNISC), Shanghai, China, 14–16 April 2017; pp. 135–139. [Google Scholar]
Jiang, S.; Su, R.; Ren, Z.; Chen, W.; Kang, Y. Assessment of Pilots’ Cognitive Competency Using Situation Awareness Recognition Model Based on Visual Characteristics. Int. J. Intell. Syst. 2024, 2024, 5582660. [Google Scholar] [CrossRef]
Sun, H.; Yang, F.; Zhang, P.; Jiao, Y.; Zhao, Y. An Innovative Deep Architecture for Flight Safety Risk Assessment Based on Time Series Data. CMES Comput. Model. Eng. Sci. 2023, 138, 2549–2569. [Google Scholar] [CrossRef]
Zheng, X.; Liu, Q.; Li, Y.; Wang, B.; Qin, W. Safety Risk Assessment for Connected and Automated Vehicles: Integrating FTA and CM-Improved AHP. Reliab. Eng. Syst. Saf. 2025, 257, 110822. [Google Scholar] [CrossRef]
Wang, J.; Fan, K.; Mo, W.; Xu, D. A Method for Information Security Risk Assessment Based on the Dynamic Bayesian Network. In Proceedings of the 2016 International Conference on Networking and Network Applications (NaNA), Hakodate, Japan, 23–25 July 2016; pp. 279–283. [Google Scholar]
Xu, C.; Hu, C.; Nie, W. Application of Fuzzy Theory and Digraph Method in Security Assessment System. In Proceedings of the 2010 International Conference on Intelligent Computation Technology and Automation, Changsha, China, 11–12 May 2010; Volume 1, pp. 754–757. [Google Scholar]
Wei, Z.; Zou, Y.; Wang, L. Applying Multi-Source Data to Evaluate Pilots’ Flight Safety Style Based on Safety-II Theory. In Engineering Psychology and Cognitive Ergonomics; Harris, D., Li, W.-C., Eds.; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2023; Volume 14017, pp. 320–330. ISBN 978-3-031-35391-8. [Google Scholar]
Pei, H.; Li, G.; Ma, Y.; Gong, H.; Xu, M.; Bai, Z. A Mental Fatigue Assessment Method for Pilots Incorporating Multiple Ocular Features. Displays 2025, 87, 102956. [Google Scholar] [CrossRef]
Ma, Y.; Liu, Q.; Yang, L. Machine Learning-Based Multimodal Fusion Recognition of Passenger Ship Seafarers’ Workload: A Case Study of a Real Navigation Experiment. Ocean Eng. 2024, 300, 117346. [Google Scholar] [CrossRef]
Liu, X.; Xiao, G.; Wang, M.; Li, H. Research on Airworthiness Certification of Civil Aircraft Based on Digital Virtual Flight Test Technology. In Proceedings of the 2019 IEEE/AIAA 38th Digital Avionics Systems Conference (DASC), San Diego, CA, USA, 8–12 September 2019; pp. 1–6. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA, 15–20 June 2019; pp. 5686–5696. [Google Scholar] [CrossRef]
Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-Captured Scenarios. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, QC, Canada, 11–17 October 2021; pp. 2778–2788. [Google Scholar]
Liu, Q.; Yao, J.; Yao, L.; Chen, X.; Zhou, J.; Lu, L.; Zhang, L.; Liu, Z.; Huo, Y. M²Fusion: Bayesian-Based Multimodal Multi-Level Fusion on Colorectal Cancer Microsatellite Instability Prediction. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2023 Workshops; Woo, J., Hering, A., Silva, W., Li, X., Fu, H., Liu, X., Xing, F., Purushotham, S., Mathai, T.S., Mukherjee, P., et al., Eds.; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2023; Volume 14394, pp. 125–134. ISBN 978-3-031-47424-8. [Google Scholar]
Qiu, H.; Yu, J.; Chyad, M.H.; Singh, N.S.S.; Hussein, Z.A.; Jasim, D.J.; Khosravi, M. Comparative Analysis of Kalman Filters, Gaussian Sum Filters, and Artificial Neural Networks for State Estimation in Energy Management. Energy Rep. 2025, 13, 4417–4440. [Google Scholar] [CrossRef]
Strelet, E.; Wang, Z.; Peng, Y.; Castillo, I.; Rendall, R.; Reis, M.S. Regularized Bayesian Fusion for Multimodal Data Integration in Industrial Processes. Ind. Eng. Chem. Res. 2024, 63, 20989–21000. [Google Scholar] [CrossRef]
Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2015, arXiv:1405.0312. [Google Scholar]
Xiao, B.; Wu, H.; Wei, Y. Simple Baselines for Human Pose Estimation and Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 472–487. [Google Scholar] [CrossRef]
Mao, W.; Ge, Y.; Shen, C.; Tian, Z.; Wang, X.; Wang, Z.; van den Hengel, A. Poseur: Direct Human Pose Regression with Transformers. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 72–88. [Google Scholar] [CrossRef]
Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar] [CrossRef]
Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded Pyramid Network for Multi-Person Pose Estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7103–7112. [Google Scholar] [CrossRef]
Wang, Y.; Li, M.; Cai, H.; Chen, W.; Han, S. Lite Pose: Efficient Architecture Design for 2D Human Pose Estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13116–13126. [Google Scholar]
Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 122–138. [Google Scholar] [CrossRef]
Yu, C.; Xiao, B.; Gao, C.; Yuan, L.; Zhang, L.; Sang, N.; Wang, J. Lite-HRNet: A Lightweight High-Resolution Network. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10435–10445. [Google Scholar]
Zimmermann, C.; Ceylan, D.; Yang, J.; Russell, B.; Argus, M.; Brox, T. FreiHAND: A Dataset for Markerless Capture of Hand Pose and Shape from Single RGB Images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
Bulat, A.; Kossaifi, J.; Tzimiropoulos, G.; Pantic, M. Toward Fast and Accurate Human Pose Estimation via Soft-Gated Skip Connections. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2020, Buenos Aires, Argentina, 16–20 November 2020; pp. 8–15. [Google Scholar] [CrossRef]
Wang, Y.; Zhang, B.; Peng, C. SRHandNet: Real-Time 2D Hand Pose Estimation with Simultaneous Region Localization. IEEE Trans. Image Process. 2020, 29, 2977–2986. [Google Scholar] [CrossRef]

Figure 1. Structure of this article.

Figure 2. Overall pipeline of the proposed pilot behavior recognition and evaluation system.

Figure 3. Software workflow of the proposed pilot behavior recognition and evaluation system.

Figure 4. Schematic of sequential attention. (a) Standard attention; (b) Order-switched attention.

Figure 5. FFN_DW module structure.

Figure 6. Intelligent Cockpit Flight Simulation Platform, where the red rectangles indicate the practical installation positions of the cameras within the cockpit.

Figure 7. Sample Images of Pilot Posture for Keypoint Annotation.

Figure 8. Visualization Results of Pilot Pose Keypoint Detection.

Figure 9. The real-time effect of the joint model.

Figure 10. Representative examples of pilot operation identification under typical actions for safety evaluation based on the proposed indicator system.

Figure 11. Visual comparison between correctly and incorrectly identified pilot actions under different cockpit conditions.

Figure 12. Overall Workflow of Pilot Behavior Recognition and Safety Evaluation Framework.

Table 1. HRNet-Former backbone structure.

Size	Stage 1	Stage 2	Stage 3	Stage 4
$64 \times 48$	$\left[\begin{array}{l} 1 \times 1, 64 \\ 3 \times 3, 64 \\ 1 \times 1, 256 \end{array}\right] \times 2$	$\left[\begin{array}{l} \text{OSA Module}, 32 \\ \text{Head} = 1,\ \text{FFN}_{DW} = 7 \times 7 \end{array}\right] \times 2$	$\left[\begin{array}{l} \text{OSA Module}, 32 \\ \text{Head} = 1,\ \text{FFN}_{DW} = 7 \times 7 \end{array}\right] \times 2$	$\left[\begin{array}{l} \text{OSA Module}, 32 \\ \text{Head} = 1,\ \text{FFN}_{DW} = 7 \times 7 \end{array}\right] \times 2$
$32 \times 24$		$\left[\begin{array}{l} \text{OSA Module}, 64 \\ \text{Head} = 1,\ \text{FFN}_{DW} = 7 \times 7 \end{array}\right] \times 2$	$\left[\begin{array}{l} \text{OSA Module}, 64 \\ \text{Head} = 1,\ \text{FFN}_{DW} = 7 \times 7 \end{array}\right] \times 2$	$\left[\begin{array}{l} \text{OSA Module}, 64 \\ \text{Head} = 1,\ \text{FFN}_{DW} = 7 \times 7 \end{array}\right] \times 2$
$16 \times 12$			$\left[\begin{array}{l} \text{OSA Module}, 128 \\ \text{Head} = 2,\ \text{FFN}_{DW} = 5 \times 5 \end{array}\right] \times 2$	$\left[\begin{array}{l} \text{OSA Module}, 128 \\ \text{Head} = 2,\ \text{FFN}_{DW} = 5 \times 5 \end{array}\right] \times 2$
$8 \times 6$				$\left[\begin{array}{l} \text{SA Module}, 256 \\ \text{Head} = 4 \end{array}\right] \times 2$
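
As a rough illustration of why the depth-wise FFN kernels listed in Table 1 stay cheap, the sketch below counts weights for a standard convolution versus a depth-wise separable one at the channel widths and kernel sizes in the table. It assumes equal input and output channels and ignores biases and normalization layers. The function names are illustrative only, and the snippet does not reproduce the actual FFN_DW layout shown in Figure 5.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a dense kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depth-wise kernel x kernel conv followed by a 1 x 1 point-wise conv."""
    depthwise = kernel * kernel * c_in  # one spatial filter per input channel
    pointwise = c_in * c_out            # 1 x 1 channel mixing
    return depthwise + pointwise


# (channels, FFN_DW kernel size) pairs taken from the OSA rows of Table 1.
table1_settings = [(32, 7), (64, 7), (128, 5)]

counts = {}
for channels, kernel in table1_settings:
    dense = standard_conv_params(kernel, channels, channels)
    separable = depthwise_separable_params(kernel, channels, channels)
    counts[(channels, kernel)] = (dense, separable)
    print(f"C={channels:3d} k={kernel}: dense={dense:6d} separable={separable:5d}")
```

Under these assumptions the separable form needs well under a tenth of the dense weights at every resolution branch, which is consistent with the small parameter budget reported for the proposed backbone.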

Table 2. Specifications of cameras used for pilot keypoint data collection.

Item	Specification/Description
Camera model	Generic RGB Camera
Sensor type	CMOS
Resolution	1920 × 1080 pixels
Frame rate	30 fps
Mounting	Driver-side upper right, Driver-side upper left
Quantity	2

Table 3. Performance comparison of the models on the pilot keypoint detection dataset.

Module	Core Network	Input Size	Params	GFLOPs	AP	AR
SimpleBaseline [71]	ResNet-50	256 × 192	34.0 M	8.9	77.9	78.5
HRNetV1 [64]	HRNet-W32	256 × 192	28.5 M	7.1	80.8	81.6
TransPose [9]	HRNet-W32	256 × 192	8.0 M	10.2	81.3	83.2
PoseUR [72]	HRFormer-B	256 × 192	28.8 M	12.6	82.1	83.8
HRFormer-S [10]	HRFormer-S	256 × 192	2.5 M	1.3	80.6	82.2
Our Model	HRNet-Former	256 × 192	4.5 M	3.8	81.9	83.3
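
To make the comparison with HRNet concrete, the following sketch recomputes the relative savings from the HRNetV1 and proposed-model rows of Table 3. The dictionary keys and the helper name are illustrative; only the numeric values come from the table.

```python
# Values transcribed from Table 3 (pilot keypoint detection dataset).
table3 = {
    "HRNetV1": {"params_m": 28.5, "gflops": 7.1, "ap": 80.8},
    "Our Model": {"params_m": 4.5, "gflops": 3.8, "ap": 81.9},
}


def relative_reduction(baseline: float, value: float) -> float:
    """Return the percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


base, ours = table3["HRNetV1"], table3["Our Model"]
param_cut = relative_reduction(base["params_m"], ours["params_m"])
flop_cut = relative_reduction(base["gflops"], ours["gflops"])
ap_gain = ours["ap"] - base["ap"]

print(f"params: -{param_cut:.1f}%  GFLOPs: -{flop_cut:.1f}%  AP: +{ap_gain:.1f}")
# prints: params: -84.2%  GFLOPs: -46.5%  AP: +1.1
```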

Table 4. Comparison of the inference speed of models.

Module	Params (M)	GFLOPs	Frame Rate (FPS)
MobileNetV2 [73]	9.6	1.97	53.4
HRNetV1 [64]	7.6	1.70	28.6
HRFormer-S [10]	2.5	1.30	33.1
Our Model	4.5	3.80	44.3

Table 5. Performance comparison of the different models on the COCO val2017 dataset.

Module	Input Size	Params	GFLOPs	AP	AR
Complex network model
CPN [74]	256 × 192	27.0 M	6.2	68.6	-
SimpleBaseline [71]	256 × 192	34.0 M	8.9	70.4	76.3
HRNetV1 [64]	256 × 192	28.5 M	7.1	73.4	78.9
DAEK [75]	128 × 96	63.6 M	3.6	71.9	77.9
TransPose-H-S [9]	256 × 192	8.0 M	10.2	74.2	78.0
PoseUR-HRNet-32	256 × 192	28.8 M	4.48	74.7	-
Light-weight network model
MobileNetV2 [73]	256 × 192	9.6 M	1.48	64.6	70.7
ShuffleNetV2 [76]	256 × 192	7.6 M	1.28	59.9	66.4
Lite-HRNet-18 [77]	256 × 192	1.1 M	0.2	64.8	71.2
HRFormer-S [10]	256 × 192	2.5 M	1.3	70.9	76.6
Our model	256 × 192	4.5 M	3.8	72.8	78.4

Table 6. Comparison of the models on the FreiHAND dataset.

Module	Params	PCK Score	FPS
MobileNetV2 [73]	9.6 M	81.48	49.5
MobileNetV3 [79]	8.7 M	83.61	41.1
ShuffleNetV2 [76]	7.6 M	82.86	45.3
SRHandNet [80]	16.3 M	95.43	28.4
Our model	4.8 M	98.84	38.6

Table 7. Real-Time Performance Comparison of Joint Models.

Joint Model (Limb + Hand)	Params	FPS
HRNetV1 + MobileNetV2	38.1 M	7.6
HRFormer-S + MobileNetV2	12.1 M	11.2
HRNetV1 + ShuffleNetV2	36.1 M	7.4
HRFormer-S + ShuffleNetV2	10.1 M	11.1
HRNetV1 + SRHandNet [80]	44.8 M	5.4
HRFormer-S + SRHandNet [80]	18.8 M	8.8
Our model	9.7 M	14.3

Table 8. Performance comparison of methods on the limb keypoint dataset.

Module	Params	GFLOPs	AP	FPS
Standard Attention	5.0 M	8.56	82.2	21.3
Order-switched Attention (without FFN_DW)	4.1 M	3.34	74.4	39.7
Proposed model	4.5 M	3.80	82.3	35.4
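
The trade-off in the ablation can be read off Table 8 with a few lines of arithmetic. The sketch below is only a convenience calculation over the table values; the `Variant` class and variable names are hypothetical and not part of the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    params_m: float  # parameters, millions
    gflops: float
    ap: float
    fps: float


# Rows of Table 8.
standard = Variant(params_m=5.0, gflops=8.56, ap=82.2, fps=21.3)
osa_only = Variant(params_m=4.1, gflops=3.34, ap=74.4, fps=39.7)  # without FFN_DW
proposed = Variant(params_m=4.5, gflops=3.80, ap=82.3, fps=35.4)

# What adding FFN_DW to order-switched attention buys and costs.
ap_recovered = proposed.ap - osa_only.ap
extra_gflops = proposed.gflops - osa_only.gflops

# Proposed model versus standard attention.
gflops_cut = 100.0 * (standard.gflops - proposed.gflops) / standard.gflops
speedup = proposed.fps / standard.fps

print(f"FFN_DW: +{ap_recovered:.1f} AP for +{extra_gflops:.2f} GFLOPs")
print(f"vs standard attention: -{gflops_cut:.1f}% GFLOPs, {speedup:.2f}x FPS")
```

Read this way, FFN_DW restores the accuracy lost by switching the attention order at a small compute cost, while the full model stays well below the standard-attention variant in GFLOPs.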

Table 9. Indicator System for Pilot Behavior Recognition in Civil Aircraft Cockpits.

System	Subsystem	Indicator
Pilot Behavior Recognition System	Pilot Pose Estimation System	A1 Reliability of the pose estimation dataset
		A2 Recognition accuracy of pose estimation algorithms
		A3 Operational procedure masking
		A4 Hardware system power
		A5 Pose estimation algorithm real-time performance
		A6 Number of parameters of the pose estimation model
		A7 Computational complexity of the pose estimation model
	Motion Capture System	A8 Reliability of motion capture datasets
		A9 Operating procedure recognition accuracy
		A10 False touch rate for operating procedure recognition
		A11 Motion capture model computational complexity
		A12 Number of motion capture model parameters
		A13 Operating procedure complexity
		A14 Urgency of operating procedures
	Lighting Environment System	A15 Stability of filtering algorithms
		A16 Communication transmission stability
		A17 Cockpit light intensity level
		A18 External ambient light intensity level
		A19 Screen light intensity level
	Individual Differences	A20 Height
		A21 Arm length
		A22 Knowledge and experience
		A23 Duration of training received

Table 10. Effectiveness evaluation results of the proposed method compared with the manual method.

Action Type	Proposed Method Efficacy	Manual Method Efficacy
Click on the touch screen	0.782	0.98
Push the throttle lever	0.753	0.99
Open the landing gear	0.474	0.97
Switch on the autopilot	0.638	0.98

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wu, H.; Lu, X.; Sun, Y.; Liu, H. A Lightweight Framework for Pilot Pose Estimation and Behavior Recognition with Integrated Safety Assessment. Aerospace 2025, 12, 986. https://doi.org/10.3390/aerospace12110986

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
