Systematic Review

Applications of Machine Learning in Assessing Cognitive Load of Uncrewed Aerial System Operators and in Enhancing Training: A Systematic Review

1 School of Science, University of New South Wales, Canberra, ACT 2612, Australia
2 School of Systems & Computing, University of New South Wales, Canberra, ACT 2612, Australia
3 CAE, Homebush West, NSW 2140, Australia
* Author to whom correspondence should be addressed.
Drones 2025, 9(11), 760; https://doi.org/10.3390/drones9110760
Submission received: 11 September 2025 / Revised: 22 October 2025 / Accepted: 24 October 2025 / Published: 3 November 2025
(This article belongs to the Special Issue UAV Piloting, Training, Cooperation, and Interaction)

Highlights

What are the main findings?
  • A review of 38 studies on the machine learning (ML) assessment of cognitive load (CL) for UAS operators and training has been conducted.
  • Support Vector Machine (SVM) was the most frequently used model, achieving up to 90% accuracy for UAS operators’ CL assessment.
What are the implications of the main findings?
  • ML approaches enhance UAS operator training by enabling real-time, adaptive instruction that is more effective than traditional training.
  • The findings provide a methodological foundation for developing applicable ML-based adaptive training systems in UAS contexts.

Abstract

This research is based on a systematic review of machine learning (ML) approaches for the assessment of cognitive load (CL) in applications for uncrewed aerial system (UAS) operator training. The review synthesises evidence on how ML techniques have been applied to assess CL using diverse data sources, including physiological signals (e.g., EEG, HRV), behavioural measures (e.g., eye-tracking), and performance indicators. It highlights the effectiveness of models such as Support Vector Machines (SVMs), Random Forests (RFs), and advanced deep learning (DL) architectures such as Long Short-Term Memory (LSTM), as well as how the use of different methods affects the performance of ML models, with studies reporting accuracies of up to 98%. The findings also indicate that, compared with traditional UAS training approaches, ML approaches can enhance training by providing adaptive assessment, with methodological factors such as model selection, data preprocessing, and validation being central to ML assessment performance. These findings highlight the value of accurate CL assessment as a foundation for adaptive training systems, supporting enhanced UAS operator performance and operational safety. By consolidating the methodological insights and identifying research gaps, this review provides valuable background for advancing ML-based CL assessment and its integration into adaptive training systems that enhance UAS operator training.

1. Introduction

Over the past decade, Uncrewed Aerial System (UAS) technology has seen significant advancements, particularly in autonomy and operational capabilities [1,2]. Despite these technological developments, human factors remain a dominant source of failure, with human error accounting for approximately 54% of operational failures in UAS contexts including military, commercial, and civil applications [3]. This persistent vulnerability highlights the critical role of human performance and decision-making in ensuring mission success and safety. Consequently, enhancing UAS operator training represents a key approach to mitigating human error, as operator proficiency directly determines mission safety, efficiency, and overall operational effectiveness [4].
However, current traditional UAS operator training generally consists of classroom-based education and instruction together with practical simulation, and relies heavily on instructor supervision [5]. Such reliance is associated with lower training effectiveness, training inefficiencies, increased costs, and a limited capacity to provide objective or personalised assessments of trainee operators [6,7]. In particular, traditional training approaches are limited in their ability to systematically address critical human error factors, especially those relating to operators’ cognitive state. As a result, recent research has increasingly investigated how Artificial Intelligence (AI), particularly machine learning (ML), can be used to assess aspects of human performance, such as cognitive load (CL), to enhance UAS operator training, and ultimately improve operator performance, training effectiveness, and overall operational safety.
Cognitive load plays a crucial role in UAS safety and training, as both excessive and insufficient load can adversely affect operator performance, decision-making, and task efficiency [8]. CL is commonly defined as “the amount of mental effort required to process task-related information within the limits of working memory capacity” [9]. It is a multidimensional construct influenced by task complexity, time pressure, prior experience, individual differences (e.g., motivation, attention, current mental state), stress, fatigue, and environmental context [9,10,11]. These understandings of CL highlight that CL is a dynamic process reflecting the real-time utilisation of working memory resources, which are inherently limited and vary across individuals and operational contexts [12]. When task demands exceed these cognitive resources, operators may experience cognitive overload, leading to slower response times, reduced motivation, fatigue, and increased error rates [13]. These effects ultimately impair UAS operators’ performance and compromise situational awareness, resulting in human error [14,15]. Therefore, understanding and managing CL is essential for optimising operator performance and maintaining safety in UAS operations. Contemporary behavioural science research further emphasises the importance of incorporating CL considerations into training design to support effective knowledge acquisition and skill development [16,17].
The complexity of UAS operations further underscores the significance of assessing CL for UAS operators. A typical UAS consists of uncrewed aircraft (drones), the remote pilot station, the control station, data links, and all supporting components required for safe and effective operation [18]. Consequently, UAS operators are responsible for the overall conduct of operations, hence their roles include remote piloting, sensor management and analysis (commonly performed by sensor operators), ground control station supervision, communications and data link maintenance, technical support, and ensuring regulatory compliance for safe and effective mission execution [19,20,21,22]. This multitasking is further complicated by limited access to direct sensory cues, such as physical movement, which makes maintaining situational awareness more difficult [23]. Instead, operators rely on remote visual inputs, often filtered through restricted camera perspectives with transmission constraints [24,25,26]. In such settings, operators are susceptible to elevated CL, which has been increasingly recognised as a major contributor to human performance errors in UAS contexts [12,14,27]. As a result, recent research has emphasised the importance of assessing and understanding UAS operators’ CL, particularly during training, to proactively enhance human performance, flight safety, and overall mission success [28,29].
Measures of CL play a vital role in enhancing the understanding of CL and human performance. Over the years, multiple measures have been developed to assess CL, including subjective, performance-based, and biometric approaches [30]. Subjective approaches, such as the NASA Task Load Index (NASA-TLX), rely on self-reported ratings of workload across six dimensions: mental demand, physical demand, temporal demand, performance, effort, and frustration [31]. Performance-based measures, on the other hand, infer CL from behavioural outcomes such as reaction time, error rate, and task completion accuracy [31]. Although these approaches are simple and cost-effective, they often fail to capture dynamic or subtle fluctuations in CL during continuous operations [32,33].
To overcome these limitations, particularly the need for the continuous and objective assessment of CL during complex UAS operator tasks relevant to training and safety, recent research has increasingly turned to physiological and behavioural biometric signals [28,31]. Neurophysiological studies in aviation and related operational domains have demonstrated that cognitive load (CL) can be continuously characterised through its physiological correlates, such as electroencephalogram (EEG) activity, heart rate variability (HRV), and other biometric data [34,35,36]. These neural and autonomic indicators have been shown to reliably reflect variations in workload and task complexity, thereby providing objective and real-time markers for cognitive state assessment [34]. Accordingly, these findings suggest that biometric measures can be central to the development of continuous, data-driven methods for assessing cognitive load in UAS operator training and safety applications.
Biometric signals, such as eye-tracking, electrocardiogram (ECG), electroencephalogram (EEG), electrodermal activity (EDA), respiration (RSP), photoplethysmography (PPG), and skin temperature (SKT), have been shown to enable the effective, objective, and continuous assessment of CL in UAS operators [37,38,39]. Each biometric signal reflects a distinct physiological or cognitive process related to CL, making the combined data highly valuable for monitoring operator workload. For instance, EEG captures neural activity linked to attentional demand, ECG and PPG reflect autonomic responses to mental effort, while eye-tracking metrics reveal visual attention and information processing patterns [19,27,40]. However, these physiological and behavioural data are inherently complex, non-linear, and high-dimensional, particularly when recorded continuously during operator task performance [33,40]. This complexity makes manual analysis challenging, thereby highlighting the suitability of machine learning (ML) approaches for extracting meaningful patterns and interpreting CL dynamics from multimodal biometric datasets [28].
The adoption of ML approaches has enabled researchers to move beyond manual statistical data interpretation, allowing for real-time monitoring, prediction, and, more generally, the assessment of CL based on biometric signals [41]. Supervised learning ML models are particularly notable, as they can classify CL by learning the relationships between biometric signals and corresponding CL levels, typically labelled using subjective NASA-TLX scores or scenario-based criteria. Moreover, ML facilitates the integration of multimodal and heterogeneous data sources, such as physiological signals (e.g., EEG, ECG, EDA) and behavioural or performance measures (e.g., eye-tracking) [30], enabling a comprehensive and nuanced understanding of CL. Applied to continuous, time series biometric data, ML models support real-time, continuous CL assessment without interfering with task execution [42]. All these capabilities collectively enable the timely detection of critical cognitive states, such as overload, that may compromise operator performance during UAS training. Therefore, by providing immediate feedback on an operator’s cognitive state, ML-based assessment can help mitigate potential human errors during complex flight operations and support the development of adaptive UAS training frameworks that dynamically adjust training parameters such as difficulty or feedback based on the model-estimated CL levels. Consequently, ML has emerged as a key approach in UAS safety, given its strengths in large-scale data analysis, pattern recognition, and early risk detection [28,43].
Current studies have used a range of ML methods for the CL assessment of UAS operators, each characterised by distinct methodological choices or factors within their respective workflows, including model engineering and data preprocessing steps. Model engineering involves the selection and design of models, such as Support Vector Machines (SVMs), Random Forests (RFs), and Neural Networks (NNs), each of which has demonstrated varying degrees of effectiveness, typically measured by performance metrics (e.g., classification accuracy) in CL assessment tasks. Data engineering and preprocessing steps, such as data cleaning, normalisation, segmentation, feature extraction, and labelling, are significant when working with biometric signals, which are often noisy, artifact-prone, and exhibit substantial interindividual variability (for example, individuals may have different heart rate ranges). Furthermore, because these biometric measures are continuous in time, segmentation is necessary to create inputs suitable for ML models, allowing them to learn CL patterns at different time points. Collectively, these ML workflow methodological factors determine the quality and structure of the input data to the model, and hence influence model performance and effectiveness [38]. Careful consideration of the entire ML workflow is therefore essential for developing robust models for CL assessment and for informing adaptive training systems.
Despite the high cognitive demands faced by UAS operators, relatively few studies have specifically applied ML to assess CL in UAS operator training. To address this gap, in this review, we expanded the literature search to include studies on traditional aviation sensor operators, particularly those in sensor- or remote-intensive roles such as air traffic controllers (ATCs) and flight simulation personnel. In UAS operations, operators may serve as UAV pilots or sensor operators. Therefore, including studies on aviation sensor operators and simulation-based pilots is methodologically justifiable, as these roles would share comparable attentional, perceptual, and cognitive workload characteristics with UAS personnel. Furthermore, these studies often employ similar biometric measures for assessing cognitive load (e.g., EEG, eye-tracking, heart rate variability), typically collected under simulated or sensor-driven environments. In addition, this review considers studies that use ML models to assess operator-related data (such as performance or biometric signals) to inform training, measures that often reflect CL and thus offer transferable insights. The ML methodological approaches and analytical techniques employed in studies of aviation sensor operators and in ML-based assessments of operator-related data offer actionable guidance for developing adaptive training systems tailored to UAS personnel. Hence, this integrative approach, which combines findings from aviation sensor operator studies and ML-based analyses of operator-related data, ensures methodological rigour and compensates for the current scarcity of research on ML applications in UAS operator CL assessment, thereby maximising the relevance and practical value of the review’s findings for UAS operator training.
Building on these broadened insights gained from studies of aviation sensor operators and the ML-based assessment of operator-related data, recent advances in Artificial Intelligence (AI) and machine learning (ML) have introduced new avenues for adaptive and data-driven approaches to operator training. AI has shown valuable potential across aviation training, having been successfully applied to tasks such as autonomous flight control [44], intelligent tutoring, and adaptive instructional strategy optimisation [45,46]. In the context of UAS operator training, AI can support the delivery of personalised learning experiences and data-driven feedback, which may enhance training effectiveness by enabling adaptive learning tailored to individual cognitive load [7]. Machine learning, as a core component of AI, leverages data-driven pattern recognition to support adaptive and automated training environments. These models can continuously assess operator-related data, such as cognitive state, during training, enabling training modules to be dynamically tailored to individual needs, which forms the foundation of adaptive training systems.
ML models inform the design of adaptive training interventions, including the autonomous adjustment of content and difficulty, targeted feedback, and the development of personalised strategies aligned with each trainee’s cognitive, behavioural, or performance profile [7,26,47,48]. These capabilities reduce the dependence on human instructors and lower instructor workload, allowing training resources to be allocated more efficiently. Moreover, continuously aligning the complexity of UAS operator training with the trainee’s cognitive load and performance state enhances the acquisition of both technical skills and non-technical cognitive competencies. This adaptive alignment helps mitigate human error and improves overall training effectiveness, efficiency, and safety, thereby strengthening operational performance in UAS missions. Hence, the integration of AI and ML offers an approach for developing scalable and individualised UAS operator training.
While several systematic reviews have addressed the application of ML in aviation and UAS operator training, to date, there has been no comprehensive review focusing specifically on the methodological choices throughout the ML workflows used for analysing operator-related data and CL in UAS operator training. For instance, Alreshidi et al. [28] reviewed the use of tree-based models and Support Vector Machines for pilot workload and fatigue, primarily using physiological and psychological data. Jahanpour et al. [49] examined ML-based approaches for assessing cognitive fatigue in UAS settings, finding that despite short shift durations, monotony and task intensity still lead to substantial fatigue. Other reviews have highlighted the use of eye-tracking for attention and workload [50], and heart rate variability (HRV) for mental workload and stress monitoring [34,51]. In air traffic control, Suárez et al. [52] identified a recent surge in big data analytics and real-time cognitive monitoring, with an increasing emphasis on system-level integration and human–computer interaction. With regard to aviation operator training, Shaker and Al-Alawi [53] discussed the potential of big data and AI to improve training outcomes, though real-world implementations remain limited. However, these reviews have mainly focused on the types of ML models and biometric data used. A deeper examination of methodological choices, such as data preprocessing, feature extraction, and validation design, is equally critical, as these factors directly influence model effectiveness for cognitive load assessment and ultimately determine the applicability and reliability of adaptive UAS operator training systems informed by ML assessment models.
Recognising the lack of a comprehensive review addressing the methodological choices in ML workflows for assessing CL in UAS operator training, this systematic review aims to provide an in-depth analysis of ML applications for assessing operator-related data, particularly CL, with an emphasis on the key methodological factors that influence ML assessment models’ effectiveness and applicability for enhancing training. Specifically, it examines how different ML methods have been employed to assess CL in UAS contexts and how different methodological decisions across the ML pipeline, such as data preprocessing, feature extraction, and validation design, affect model effectiveness. Furthermore, this review highlights how these methodological factors contribute to the development of adaptive, personalised training systems for UAS operators, and identifies current limitations that constrain their deployment in real-world training environments.
The following research questions are proposed to understand these ML applications for UAS operator training:
1. What ML models have been effective in assessing CL for UAS operators?
2. Which ML methodological factors influence the effectiveness of ML-based CL assessment?
3. How can ML be used to enhance the training of UAS operators?
To address these questions, a systematic review of the literature on ML applications for assessing UAS operators’ CL and enhancing UAS operator training is conducted. The structure of this review is as follows: Section 2 outlines the methodology, Section 3 presents the consolidated findings, and Section 4 discusses the results and potential further investigations.

2. Materials and Methods

This systematic literature review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [54], ensuring a transparent and replicable process from the initial search strategy through to the predefined inclusion and exclusion criteria and independent screening. This systematic approach minimises bias, enhances credibility, and enables a comprehensive literature search, thereby facilitating the understanding of current ML applications for UAS operator CL assessment and enhancing UAS operator training. It also allows for the identification of key research gaps, offering practical insights for researchers, educators, and policymakers.
The literature search covered the last 10 years, from 2014 to 2024, across four academic databases, namely Web of Science, ScienceDirect, Scopus, and PubMed. These databases were selected for their extensive coverage of ML, UAS, and research related to biometric measures. The search strategy used the following keywords: (“UAV” OR “UAS” OR “aviation”) AND (“cognitive load” OR “workload” OR “operator training” OR “pilot training” OR “biometric”) AND (“machine learning” OR “artificial intelligence”). The keywords were carefully selected to capture the breadth of research at the intersection of UAS operations, ML-enhanced training, and the assessment of operator-related data (notably CL). The literature on ML applications for UAS operator training and CL assessment was limited prior to 2014; thus, the selected period effectively captures the growth of contemporary research in this field.
The screening process involved systematically filtering publications based on their titles and abstracts, followed by a full-text review to assess relevance and reliability. This two-step process ensured that only the most relevant and high-quality studies were included in the review, as illustrated in Figure 1.
The following inclusion criteria were used to determine whether articles were included:
  • ML-based approaches used to assess CL.
  • ML-based approaches used to assess operator-related data in enhancing training.
  • Datasets related to UAS operators and sensor operators under simulated or sensor-driven environments.
  • Articles published between 1 January 2014 and 31 December 2024.
The following exclusion criteria were used to determine whether articles were excluded:
  • All articles not published in English.
  • All articles that are not peer reviewed.
  • All articles that are review studies.
  • All articles that do not include ML approaches.
  • All articles that do not use data related to training variables or CL.
  • All articles that are not in the area of aviation or UAS.

3. Results

Recent advances in machine learning (ML) have enabled increasingly sophisticated approaches for assessing UAS and sensor operators’ cognitive load (CL) and operator-related data in UAS operator training. The literature search initially identified 1611 papers; after applying the PRISMA framework for screening and eligibility checks, 38 studies were included in the final review. These studies employed a range of ML models, including Support Vector Machines (SVM), Random Forests (RFs), eXtreme Gradient Boosting (XGBoost), K-Nearest Neighbours (KNN), deep learning models, and other model types (summarised in Figure 2), to assess UAS operator CL and enhance UAS operator training.
Importantly, the effectiveness of these approaches depends not only on the type of ML model, but also on a variety of methodological choices or factors within the ML workflow. Across the reviewed studies, diverse ML workflows were implemented, each incorporating different combinations of methodological components, such as data preprocessing and engineering methods, data fusion strategies, and model engineering decisions. Therefore, when interpreting the reported results, it is important to consider these contextual differences, as methodological choices and experimental conditions can significantly impact model effectiveness and outcomes.
The general ML workflow for assessing operator-related data in these studies is summarised in Figure 3. Data were commonly collected from participants performing tasks in simulated, virtual reality (VR), or real-world sensor-driven environments, with participants wearing various biometric measurement devices and performance recording systems and completing subjective assessments such as the NASA-TLX after each task scenario. These biometric and performance/behaviour datasets are typically continuous (time series) and within-subject (individual) in nature, whereas subjective measures are discrete. Thus, these subjective assessments are not used as input features for ML models; instead, they are employed to validate data labels or evaluate ML model outputs. The actual inputs to ML models are biometric and performance measures, which, particularly in the case of raw physiological signals, often contain noise.
Preprocessing is therefore a key methodological stage in the workflow of ML-based CL assessment. The reviewed studies commonly apply artifact removal, normalisation, and sliding-window segmentation to transform raw physiological signals into features suitable for machine learning. Because physiological and behavioural data are continuously collected, segmentation divides the signal into short temporal windows (e.g., 10 s windows) labelled with cognitive load (CL) levels (e.g., low, medium, high), enabling temporal feature extraction and the continuous assessment of CL over time. Following preprocessing, ML models are trained to map features to target labels, with the careful optimisation of hyperparameters and preprocessing techniques to maximise performance. Model effectiveness or performance is typically evaluated using metrics such as accuracy, precision, recall, F1-score, and root mean squared error (RMSE). Finally, the best-performing or most effective model is often validated for generalisability to confirm its applicability for operator training optimisation.
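To make the segmentation step concrete, the following minimal Python sketch divides a continuous signal into overlapping windows; the sampling rate, window length, and overlap are placeholder values rather than settings from any particular reviewed study.

```python
# Minimal sketch of sliding-window segmentation for a continuous signal.
# All parameters are illustrative placeholders.
import numpy as np

def sliding_windows(signal: np.ndarray, fs: int, win_s: float, overlap: float) -> np.ndarray:
    """Split a 1-D signal sampled at fs Hz into windows of win_s seconds
    with fractional overlap (e.g., 0.25 = 25% overlap)."""
    win = int(win_s * fs)
    step = max(1, int(win * (1 - overlap)))
    return np.stack([signal[i:i + win]
                     for i in range(0, len(signal) - win + 1, step)])

fs = 128                                    # assumed sampling rate (Hz)
raw = np.random.randn(5 * 60 * fs)          # placeholder for a cleaned 5 min signal
X = sliding_windows(raw, fs, win_s=10, overlap=0.25)
print(X.shape)                              # (n_windows, 1280): one row per window
```

Each window would then receive a CL label (e.g., low, medium, or high) to serve as a supervised training example.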
To provide a comprehensive understanding of how ML is applied to assessing CL and related operator data in enhancing UAS training, this section is structured into three subsections. Section 3.1 presents studies that explore ML methods for assessing cognitive load (CL), which is a prerequisite for adaptive training, as the accurate and continuous assessment of CL enables the data-driven adaptation of training difficulty, feedback, and instructional design. Section 3.2 examines how ML-based assessment methods can be extended to enhance UAS operator training through adaptive and personalised mechanisms. Finally, Section 3.3 presents the evaluation approaches used to assess ML model effectiveness.

3.1. Machine Learning Methods for Assessing Cognitive Load

This section covers 28 reviewed studies that investigated ML approaches to monitor, predict, or more generally assess the CL of UAS or sensor operators. An overview of these studies is provided in Table 1. These studies employed regression, binary classification, or multi-class classification tasks. Most studies used supervised learning, drawing on CL measures such as biometric signals, performance metrics, and subjective ratings, with labels based on experimental scenarios or standardised measures such as the NASA Task Load Index (NASA-TLX). Some studies addressed CL assessment as a regression problem, using a numerical range to represent CL levels.
While all studies share the common goal of assessing CL, the direct cross-study comparison of results is not appropriate. This is because reported model effectiveness is shaped by a wide range of scenario/setting constraints and methodological factors, including differences in data sources, experimental or scenario design, data preprocessing and engineering, and model engineering. These contextual and workflow differences can substantially impact outcomes and must be taken into account when interpreting ML results.
Across these studies, most researchers compare different ML models and methodological factors to determine which configurations yield the most effective assessment of CL within their experimental conditions. Therefore, this review focuses on the ML model and methodological factors reported as the best performing or ultimately adopted in each study, as determined by experimental results. This section summarises the key results regarding these selected ML models for CL assessment, with particular attention to the scenarios in which data are collected, the data modalities, and other methodological or engineering steps shown to influence effectiveness. To facilitate a structured analysis, the best performing or most effective models from the reviewed studies are grouped according to the taxonomy described in Fundamentals of Machine Learning for Predictive Data Analytics (Second Edition), as illustrated in Figure 4: similarity-based (Section 3.1.1), information-based (Section 3.1.2), error-based (Section 3.1.3), hybrid (Section 3.1.4), and deep learning techniques (Section 3.1.5) [78].

3.1.1. Similarity-Based ML Models

Similarity-based models predict outcomes based on the proximity or resemblance between data points, making them practical for irregularly distributed biometric signal datasets. K-Nearest Neighbours (KNN), an example of this category, was reported by Wang et al. [56] and Shao et al. [57] to be the best-performing classifier for monitoring or predicting CL across simulation mission phases (take-off, cruise, landing) and ramp control aviation tasks, achieving accuracies between 88.9% and 98.6%. KNN classifies CL levels by finding the K most similar neighbours in the feature space using distance metrics, and is widely chosen for its simplicity and high performance in classification tasks.
Wang et al. [56] used KNN with PPG signal data under scenarios of simulated flight mission phases (take-off, cruising, landing), with CL labelled by NASA-TLX scores. Meanwhile, Shao et al. [57] used K-means clustering to assign objective CL labels, such as low or high CL level, using multimodal biometric data, including eye-tracking, respiration, and facial movement-derived metrics. These studies applied Random Forest (RF) to estimate feature importance for feature selection, while Principal Component Analysis (PCA) was used for dimensionality reduction by projecting features into principal components. In addition, they implemented z-score normalisation, which ensures that all features contribute equally to model training, improving model performance and stability, as well as sliding-window segmentation with overlap to address missing values and maintain data continuity for better model learning.
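The pipeline described above can be sketched in a few lines of Python; the data, component counts, and neighbour settings below are illustrative placeholders, not the configurations used by these studies.

```python
# Illustrative pipeline: z-score normalisation -> PCA -> KNN classification.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

X = np.random.randn(200, 40)                # placeholder windowed features
y = np.random.randint(0, 2, 200)            # placeholder low/high CL labels

clf = make_pipeline(StandardScaler(),       # z-score normalisation
                    PCA(n_components=10),   # project onto principal components
                    KNeighborsClassifier(n_neighbors=5))
print(cross_val_score(clf, X, y, cv=5).mean())  # mean cross-validated accuracy
```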

3.1.2. Information-Based ML Models

Information-based ML models perform classification by maximising the information gain or minimising the impurity at each decision point, without directly optimising prediction error. Methods such as Decision Trees (DT) and Random Forest (RF) were frequently used in the reviewed studies. Decision Trees recursively partition the dataset using feature-based rules, where each node corresponds to a decision criterion and branches represent possible outcomes, and are often chosen for their interpretability and ability to capture non-linear relationships. Lochner et al. [71] developed a DT model that achieved 80% accuracy in modelling the relationship between CL and trust in UAV point-to-point navigation operations using multimodal data including Galvanic Skin Response (GSR), heart rate (HR), accelerometer readings, performance data (e.g., landing metrics), subjective trust, and NASA-TLX CL scores. The preprocessing of these data included filtering to remove noise and min–max normalisation to scale features to a range of zero to one. The results revealed that low trust levels correlated with elevated CL, indicating an inverse relationship between trust and CL during UAV operations.
Single-tree models like DT are susceptible to overfitting when trained on small or noisy datasets, such as biometric data. In contrast, several studies used multiple-tree methods such as Random Forest (RF), which constructs multiple Decision Trees and combines their outputs through majority voting (for classification) or averaging (for regression), and is often chosen for the interpretability of its features. In Salvan et al. [74], RF with EEG signal data achieved the best performance, with 76% accuracy, compared with KNN and other types of ML models in classifying CL into low, medium, and high levels under simulated flight missions with varying physical loads and ambient noise conditions. The model was trained on data from 38 participants and tested on data from 9 held-out participants, preventing data leakage and demonstrating model generalisability. Here, the EEG signals were preprocessed using a notch filter to eliminate powerline noise, together with the removal of non-movement data, to improve signal quality and enable the model to focus on informative features. The study concludes that the model can provide real-time workload predictions once per second (1 Hz).
Although RF showed good performance in classifying CL, in many reviewed studies RF was also commonly used to compute feature importance scores for dimensionality reduction or feature selection, as seen in the studies by Wang et al. [56] and Shao et al. [57] with KNN. When constructing each tree, RF assigns each feature an importance score, which can then be used for feature selection or feature reduction.
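A minimal sketch of this RF-based feature screening is shown below; the data shapes and the choice of keeping the top 15 features are assumptions for illustration.

```python
# Illustrative RF-based feature screening: rank features by impurity-based
# importance and keep the top k for a downstream classifier.
from sklearn.ensemble import RandomForestClassifier
import numpy as np

X = np.random.randn(300, 50)                # placeholder feature matrix
y = np.random.randint(0, 3, 300)            # placeholder 3-level CL labels

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top_k = np.argsort(rf.feature_importances_)[::-1][:15]  # indices of top 15 features
X_reduced = X[:, top_k]                     # reduced input for the final model
```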

3.1.3. Error-Based ML Models

Error-based ML models, on the other hand, focus on minimising the prediction error by optimising decision boundaries. Support Vector Machines (SVMs) and eXtreme Gradient Boosting (XGBoost) are particularly used in the reviewed studies for CL assessment. In particular, SVM is widely used and has shown a superior performance in eight studies. SVM seeks an optimal hyperplane to separate data points by maximising the margin between classes, and is often chosen for its ability to efficiently handle high-dimensional spaces and flexibility with kernel functions and regularisation terms to learn complex relationships [29,38,58,73].
For instance, Qin et al. [58] showed that SVM outperformed KNN, achieving 91.8% accuracy in the real-time classification of CL levels (low, medium, and high) using multimodal biometric data, including HRV derived from ECG signals and eye-tracking metrics, under simulated cruising, vigilance, and attack aviation operation phases. Here, CL was objectively labelled using Toeplitz Inverse Covariance-based clustering, an unsupervised technique tailored to time series data, to capture interdependencies in signal sequences. The data were segmented using 4 s sliding windows with 25% overlap and min–max normalised. The results revealed that HRV and eye movement metrics were responsive to CL fluctuations, with a significant increase in blink rate observed during fatigue.
In contrast, Shao et al. [73] and Massé et al. [60] developed SVMs that used CL labels based on EEG signals recorded during increasingly complex tasks, achieving over 80% and 76.4% accuracy, respectively, in classifying workload. Here, the EEG signals were preprocessed using a bandpass filter to retain relevant frequency ranges, Independent Component Analysis (ICA) to remove noise, enabling the model to learn correct patterns, and min–max normalisation. These results indicate that, by selecting appropriate EEG channel combinations, high accuracy and convenience in practical applications can be achieved.
To account for individual differences in CL and model generalisation, Dell’Agnola et al. [38] employed a subject (individual)-specific SVM during UAS rescue missions at various levels of task-load complexity (low, medium, and high), and Sakib et al. [59] developed group-specific SVMs, both using multimodal physiological data. Dell’Agnola et al. [38] incorporated regularisation parameters to achieve subject specificity, allowing the model to weight features that capture individual differences. This model achieved an accuracy between 87.3% and 91.2% on unseen data, outperforming an SVM without subject-specific weights. The study utilised twenty-five filter-cleaned and normalised features selected using Recursive Feature Elimination (RFE), an embedded feature selection method that evaluates subsets of features by recursively eliminating the least important variables based on model weights and performance, thereby improving model performance. The group-specific SVM models of Sakib et al. [59] accurately predicted the CL of other groups (such as different gender and difficulty groups) in 83% of cases, showing that the model had some generalisability and adaptability.
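The RFE step can be sketched as follows; the linear kernel (needed so that RFE can rank features by the SVM weights), the final RBF classifier, and the data are illustrative assumptions rather than the published configuration.

```python
# Illustrative RFE with a linear SVM, keeping 25 features as in the text.
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
import numpy as np

X = np.random.randn(400, 60)                # placeholder normalised features
y = np.random.randint(0, 2, 400)            # placeholder binary CL labels

svm = SVC(kernel="linear")                  # linear kernel exposes weights for ranking
selector = RFE(svm, n_features_to_select=25).fit(X, y)
X_sel = selector.transform(X)               # keep the 25 surviving features
final_model = SVC(kernel="rbf").fit(X_sel, y)  # classifier trained on selected features
```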
On the other hand, both Momeni et al. [31] and Dell’Agnola et al. [77] observed that XGBoost outperformed SVM in assessing UAV operators’ CL during UAV search and rescue missions. XGBoost is a gradient boosting ensemble method that iteratively builds Decision Trees, where each tree corrects the residuals of the previous one, making it well suited to real-time CL monitoring with heterogeneous signal data. It is often used for its ability to handle complex feature relationships and for its generalisation capabilities. Momeni et al. [31] developed an XGBoost model that performs the binary classification of CL (low and high) using multimodal physiological signals (ECG, RSP, PPG, and SKT) and achieved 86% accuracy on new unseen data during UAV search and rescue operations, indicating some model generalisability. The model used RFECV (RFE with cross-validation) to select 26 optimal features and applied a sliding window with overlap. Similarly, Dell’Agnola et al. [77] employed XGBoost, also using multimodal physiological signals, and demonstrated binary CL classification with 80.2% accuracy, using 24 features selected via RFECV and SHapley Additive exPlanations (SHAP), which quantify the contribution of each feature and hence provide interpretable insights into feature importance.
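The following sketch pairs an XGBoost classifier with SHAP attribution in the spirit of these workflows; the data, hyperparameters, and feature count are placeholders, and the exact SHAP API may differ across library versions.

```python
# Illustrative XGBoost binary CL classifier with SHAP feature attribution.
import numpy as np
import xgboost as xgb
import shap

X = np.random.randn(500, 26)                # e.g., 26 selected features per window
y = np.random.randint(0, 2, 500)            # placeholder low/high CL labels

model = xgb.XGBClassifier(n_estimators=100, max_depth=4).fit(X, y)
explainer = shap.TreeExplainer(model)       # tree-specific SHAP explainer
shap_values = explainer.shap_values(X)      # per-feature contribution per sample
global_importance = np.abs(shap_values).mean(axis=0)  # rank features globally
```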

3.1.4. Hybrid/Ensemble ML Models

Several studies designed hybrid or ensemble ML models that combine multiple models to improve the learning of CL levels. Monfort et al. [75] proposed an ensemble that combined KNN, RF, and SVM to classify binary CL levels (high or low) based on multimodal data, including eye metrics, reaction time, accuracy, and error rates, collected while operators planned search routes for three uncrewed aircraft. This ensemble approach achieved 78% real-time classification accuracy, surpassing each individual model’s performance. Data cleaning involved removing segments where eyes were closed, adjusting for inter-individual variability in eye-tracking, and applying low-pass filtering and linear interpolation to fill missing values.
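A minimal sketch of such a voting ensemble is given below; the soft-voting scheme and all settings are assumptions for illustration, not the configuration reported by Monfort et al.

```python
# Illustrative soft-voting ensemble over KNN, RF, and SVM.
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import numpy as np

X = np.random.randn(300, 20)                # placeholder multimodal features
y = np.random.randint(0, 2, 300)            # placeholder high/low CL labels

ensemble = VotingClassifier(
    estimators=[("knn", KNeighborsClassifier(n_neighbors=5)),
                ("rf", RandomForestClassifier(n_estimators=100)),
                ("svm", SVC(probability=True))],  # probabilities enable soft voting
    voting="soft")                          # average the predicted probabilities
ensemble.fit(X, y)
```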
Zhang et al. [29] built a two-layer hierarchical SVM with a Gaussian kernel to classify CL into three levels and showed 88.74% accuracy using fNIRS signals collected during simulator flight equipment failure emergency scenarios, outperforming single SVM and two-layered KNN. Preprocessing functional near-infrared spectroscopy (fNIRS) sensor data involved high and low filtering, spline interpolation, and min–max normalisation. Interpolation is used to preserve the continuity of the signal by estimating missing values or outliers using the trend of nearby data points, ensuring that the model receives complete sequences for stable learning. The model results identified that right engine failure elicited the highest CL response and significant activity in the prefrontal, motor, and occipital cortex brain regions, which were correlated with elevated load.
An integrated model is more robust to outliers and noise because the predictions of multiple models can mutually correct one another, reducing the interference a single model might encounter. By combining the predictions of several different models, an ensemble exploits the strengths of each, reduces the bias and variance of any single model, and thus performs better in practical applications.
In summary, the models categorised above (Section 3.1.1, Section 3.1.2 and Section 3.1.3) are traditional ML models. Despite the simplicity and interpretability of these traditional ML models, they often struggle to model the complex relationships inherent in biometric data and typically require extracted features rather than raw signal data. In contrast, deep learning (DL) models are capable of learning complex relationships within data, thereby automatically extracting hierarchical features directly from raw input signals. Deep learning also excels in modelling the temporal and spatial dependencies embedded in time series biometric signals, making it a powerful method for CL monitoring and prediction.

3.1.5. Deep Learning ML Models

Deep learning (DL) models consist of multiple layers of interconnected computational units known as neurons. By increasing depth (i.e., adding more layers), DL models can learn increasingly complex and abstract data representations, allowing them to model non-linear and complex relationships. DL architectures such as Neural Networks (NNs), Convolutional Neural Networks (CNNs), and Long Short-Term Memory networks (LSTMs) with attention have been used in several reviewed studies and demonstrated optimal performance in assessing CL.
Neural Networks (NNs) comprise multiple fully connected (dense) layers that adjust their connection weights through backpropagation, enabling them to learn complex non-linear patterns underlying CL. Gianazza [72] reported that their NN model outperformed traditional ML models, achieving 81.9% accuracy in classifying CL levels during air traffic controller (ATC) tasks using z-score-normalised historical operation and trajectory data. Similarly, Laskowski et al. [66] developed an NN that achieved 85% accuracy using historical and task-based inputs, also collected during ATC operations. They incorporated the Delphi method, a structured expert-based approach, to assign weights to key features that influence CL, thereby enhancing the model’s interpretability and practical utility.
In another study, Yiu et al. [61] employed a variant of the NN, the Bayesian Neural Network (BNN), which treats model weights as probability distributions and infers posterior distributions using Bayes’ theorem. Using EEG data, this BNN assessed operator workload levels during simulated flying and monitoring tasks under clear or hazardous weather conditions, achieving 66.4% accuracy and outperforming traditional ML models. The EEG signals were preprocessed using a bandpass filter and ICA to clean the data. SHAP explanations provided interpretability for model predictions, supporting transparent model evaluation and model trustworthiness. The study revealed that oscillatory activity in the temporal and frontal lobes and in the parietal lobe provided key features, and showed a significant increase in CL under poor visibility conditions. However, standard NN architectures are not specifically designed to handle the temporal dependencies found in sequential or time series measures.
Convolutional Neural Networks (CNNs), on the other hand, use convolutional layers to extract localised spatial features from structured inputs such as images or multi-channel biometric measures of CL. Luo et al. [13] applied CNNs, which achieved the best performance (99.87% accuracy) in classifying CL compared with LSTM and traditional ML models, using facial expression data collected during flight simulation landings under varied weather conditions. The data were preprocessed by removing features with low sampling rates and applying z-score normalisation. The CNN results identified significant facial indicators of CL and showed that crosswind landings at night induced the highest CL. These results highlight the relevance of emotional facial cues and validate the use of CNNs for the real-time facial analysis of CL. In addition, CNNs have proven effective at automatically extracting relevant features from raw windowed signal data for CL classification. Xi et al. [64] employed a pre-trained CNN, using transfer learning to extract features from ECG signals collected during simulated flight operations with varying levels of lateral tracking, which were then used to classify CL levels. Using the extracted features, the fine-tuned CNN outperformed an SVM classifier.
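The transfer-learning pattern can be sketched as follows: a pretrained backbone is frozen and reused as a feature extractor for windowed 1-D signals, with only a small head trained on the new task. The architecture below is a stand-in, not the network used by Xi et al.

```python
# Conceptual sketch: frozen pretrained 1-D CNN backbone + trainable head.
import torch
import torch.nn as nn

backbone = nn.Sequential(                   # stand-in for a pretrained backbone
    nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
    nn.MaxPool1d(4),
    nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten())
for p in backbone.parameters():
    p.requires_grad = False                 # freeze backbone weights

head = nn.Linear(32, 3)                     # fine-tuned head for 3 CL levels
x = torch.randn(8, 1, 1280)                 # batch of 10 s ECG windows at 128 Hz
logits = head(backbone(x))                  # (8, 3) class scores
```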
Another type of advanced deep learning model, the Long Short-Term Memory (LSTM) network, is a specialised type of Recurrent Neural Network (RNN) in which current predictions draw on hidden states from past inputs, and which introduces memory cells with input, forget, and output gates to effectively learn long-term dependencies in time series data. Unlike standard NNs or CNNs, LSTMs are designed for sequential data; however, they may still suffer from vanishing gradients during backpropagation, which limits their performance on long sequences. To mitigate this, attention mechanisms have been integrated into LSTM architectures. These mechanisms assign weighted importance to each time step or input feature, improving both performance and interpretability.
Jiang et al. [62] showed that an LSTM with attention outperformed standard LSTM models by 6%, achieving 94% accuracy in classifying CL into five levels. The model was trained on multimodal data, including EEG signals, flight operation records, visual fixation data from eye-tracking, and flight performance metrics, collected during simulated five-sided take-off and landing tasks. This attention-enhanced LSTM can monitor CL in real time, processing 2 s of EEG data within 58 ms, with data segmented using 3 s sliding windows. Similarly, Zhou et al. [69] demonstrated that an LSTM attention model can classify CL levels in real time using heart rate variability (HRV) measurements during different simulated flight turning tasks (climb turn, level flight turn, and descent turn), outperforming LSTM and other traditional ML models. The model achieved a 0.9491 F1-score, with preprocessing steps including linear interpolation, min–max normalisation, and 30 s sliding windows with 40% overlap. The attention results showed that climbing turns induced the highest CL levels (followed by descent turns and level turns), characterised by an elevated heart rate and reduced HRV, whereas low-load scenarios showed higher parasympathetic activity. These studies demonstrate that LSTM models with attention mechanisms enable and improve the real-time classification of CL and enhance model transparency and reliability.
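A minimal PyTorch sketch of an attention-pooled LSTM classifier is shown below; the layer sizes, five-class output, and input shape are illustrative assumptions, not the architectures of the cited studies.

```python
# Illustrative LSTM whose hidden states are pooled by learned attention
# weights before classification into CL levels.
import torch
import torch.nn as nn

class AttnLSTM(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64, n_classes: int = 5):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)    # scores each time step
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x):                   # x: (batch, time, features)
        h, _ = self.lstm(x)                 # hidden states: (batch, time, hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over time
        context = (w * h).sum(dim=1)        # attention-weighted summary
        return self.out(context)            # class scores per window

model = AttnLSTM(n_features=8)
logits = model(torch.randn(4, 30, 8))       # e.g., 30 time steps of 8 features
```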
In conclusion, these DL and traditional ML models show effective performances in UAS and sensor operators’ CL assessment. Both categories have demonstrated their strengths in processing complex biometric and behavioural data to enable accurate CL classification and for further application in UAS operator adaptive training to enhance training.

3.2. Machine Learning Methods to Enhance UAS Operator Adaptive Training

Beyond cognitive load (CL) assessment, many studies have applied machine learning (ML) to enhance UAS operator training processes by enabling adaptive training, such as providing automated assessment, real-time feedback, or the dynamic adjustment of training levels. Table 2 presents a summary of these studies focused on informing or enhancing operator training. In these cases, ML models are employed to evaluate a range of training-related variables, including operator skill level and training performance, typically by analysing operator-related data such as physiological or biometric signals and performance metrics. The distribution of data types used in these studies is illustrated in Figure 5.
These ML approaches can classify or assess variables relevant to training adaptation, thereby facilitating the automatic adjustment of training content and feedback. To provide a structured overview, this section categorises the reviewed studies based on their primary function: Section 3.2.1 discusses approaches using physiological signals to support/inform automated training adaptation, while Section 3.2.2 focuses on methods that leverage performance-related data to support automated training evaluation and feedback.

3.2.1. ML Methods Supporting Biometric-Referenced Adaptive Training

Several studies employ ML models to assess the biometric signals of operators, providing important biometric references for adaptive training system design. For example, Wojciechowski et al. [7] proposed a Recurrent Neural Network (RNN) to monitor UAV operators’ real-time multimodal biometric signals, including EEG, ECG, blood pressure, skin temperature, facial expressions, and eye-tracking. These were collected while an operator performed a series of flight tasks in a drone flight simulator, such as manual flight control, the analysis and switching of autonomous flight modes, flight with stabilisation disabled, and first-person view (FPV) navigation. In this setup, the RNN monitors biometric states in real time, compares them to benchmark values, and, as part of an adaptive training system, adjusts the training difficulty accordingly by modifying parameters such as speed, complexity, or task duration. For instance, when operators show a calm physiological state, training levels are gradually increased, while rises in heart rate, blood pressure, or errors prompt a reduction in training speed or difficulty.
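The control logic of such a closed loop can be illustrated with a short, entirely hypothetical sketch; the model interface, thresholds, and difficulty scale below are invented for illustration and are not drawn from Wojciechowski et al.

```python
# Hypothetical adaptive-difficulty loop driven by a model-estimated
# probability of operator overload. All names and thresholds are invented.
import numpy as np

def adaptive_training_step(model, window: np.ndarray, difficulty: int,
                           low: float = 0.3, high: float = 0.7) -> int:
    """Return the next difficulty level given one window of features."""
    p_overload = model.predict_proba(window.reshape(1, -1))[0, 1]
    if p_overload > high:                   # operator likely overloaded
        difficulty = max(1, difficulty - 1) # ease speed/complexity/duration
    elif p_overload < low:                  # operator under-challenged
        difficulty += 1                     # raise task demands
    return difficulty                       # otherwise hold difficulty steady
```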
Similarly, Caballero et al. [82] proposed an AdaBoost ML model to automatically predict flight difficulty levels, with the aim of moving towards an ML virtual instructor giving automated training assessment. The AdaBoost classifier achieved the best performance in classifying flight difficulty, with an accuracy rate of 87.14% using a multimodal biometric dataset including electromyography (EMG), electrodermal signals (EDSs), ECG, and PPG, alongside operator profile and instructor feedback data. These data were collected while participants performed landing tasks in a flight simulator. AdaBoost, by iteratively re-weighting misclassified samples, builds an ensemble of weak learners and minimises the overall error rate. BorutaSHAP was used for feature selection, combining the Boruta algorithm (a wrapper method using RF for feature relevance) and SHAP values (which quantify feature contributions). The key results revealed that eye-tracking data were the most significant predictor for task difficulty classification.
In contrast, some studies focused on singular brain-related signals such as EEG and fNIRS. For example, Yuan et al. [81] proposed an SVM with a Radial Basis Function (RBF) kernel based on functional near-infrared spectroscopy (fNIRS) data, achieving 92.2% accuracy in identifying the behavioural patterns of operators performing left and right turning tasks in a flight simulator. The data were artifact-filtered, segmented into 60 s sliding windows with 25% overlap, and normalised using min–max scaling. RF was then used for feature selection, demonstrating that selected fNIRS features vary with behavioural state and can improve the generalisation ability of the model for adaptive training. Similarly, Li et al. [83] developed an SVM using EEG-based power spectral density and functional connectivity features, achieving 80.8% accuracy in identifying risky behaviours during six flight simulation phases under varying weather conditions. This provided important physiological references for automated training assessment and adaptive training.
Beyond traditional ML models, advanced deep learning models such as Transformers have been applied. Pietracupa et al. [86] proposed a Transformer model that, unlike RNNs or LSTMs, uses self-attention to enable parallel processing and efficiently model long-range dependencies. In Pietracupa et al. [86], the Transformer was used with EEG signals, heart rate, and pupil diameter data to identify pre-error states of operators when completing flight simulation tasks of three types of equipment failures. The Transformer demonstrated the best performance in terms of the lowest False Negative rate, with an F1-score of 0.610 on the open-source Flanker EEG dataset, and 0.578 when transferred to pilot EEG data, indicating some degree of generalisability. EEG signals were denoised using notch filters, ICA for artifact removal, and z-score thresholding (cutoff of 5) to eliminate abnormal outliers. Importantly, the study demonstrated that as operator performance deteriorated, the model could still detect early warning signs of error states, a feature crucial for adjusting simulator task difficulty based on operator fatigue status and improving the precision and efficiency of flight training.

3.2.2. ML Models Supporting Automated Training Evaluation

Other studies focus on the use of flight parameters, such as sensor and behaviour data, to evaluate training variables like skill level, thus enabling automatic and adaptive training assessment. For example, Yin et al. [84] employed a Dynamic Bayesian Network (DBN) to account for temporal influences in continuously collected data and to analyse UAV operators’ 15 labelled training phases using movement behaviour data such as altitude, heading angle, rate of change, and flight speed, in support of training systems. The model demonstrated a high accuracy in identifying training stages, thereby enabling automated real-time feedback to guide skill development. Similarly, Paces and Insaurralde [85] and Yang et al. [80] proposed supervised ML models to extract potential patterns from flight simulator data, construct flight mission models, and identify trainee operational behaviours or assess performance. By comparing trainees’ operations with expert operations predicted by the model, these systems provide automated, detailed, real-time feedback, helping trainees understand their own behaviour and offering quantitative, objective performance assessments.
Rather than using only supervised ML models, Rodríguez-Fernández et al. [79] adopted an unsupervised approach combining clustering and fuzzy logic for the automatic skill level assessment of UAV operators during watch and rescue simulation operations. This method automatically generates descriptions and classifications of user skills by integrating clustering (which divides users into groups) and fuzzy logic (which identifies group skill levels and manages uncertainty in the data), more closely resembling expert human assessment. The experimental results showed that this method effectively identifies skill levels and provides automated feedback during training. Guevarra et al. [48] used another type of ML, imitation learning, in which an agent learns to imitate the operations of an instructor (the expert); trainee performance is then compared against the agent, which provides real-time feedback in the flight simulator to correct operational errors. Although currently verified only for trainees completing “straight and level” flight missions in the flight simulator, this system demonstrates the potential of ML in flight training and provides a basis for future adaptive training in more complex scenarios.
In summary, these studies demonstrate that ML approaches, by leveraging biometric and performance data, can provide dynamic assessment, timely feedback, and the adaptive adjustment of UAS operator training content and difficulty. Moreover, these approaches facilitate the objective evaluation of UAS operator skills, support the early detection of at-risk trainees, and optimise candidate selection processes, thereby improving both the effectiveness and efficiency of operator training programmes.

3.3. Machine Learning Evaluation Metrics

Turning to evaluation metrics, the reviewed literature employs a variety of approaches to assess the performance of ML models in CL and operator-related data assessment. This subsection summarises the main evaluation strategies reported in studies addressing both CL assessment and adaptive training applications. Most studies adopted multiple metrics, including accuracy, F1-score, recall, precision, and area under the receiver operating characteristic curve (AUC-ROC). Accuracy is the most frequently reported metric, valued for its simplicity and direct interpretability. Accuracy is calculated as
Accuracy = (TP + TN) / (TP + TN + FP + FN)
where True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) denote the counts of each prediction outcome [69,75]. While accuracy offers a general measure of overall correctness, it may be less informative when class distributions are imbalanced. Therefore, metrics such as the F1-score are frequently used to provide a more balanced view of model performance in such cases. The F1-score is the harmonic mean of precision and recall, and a score closer to one indicates better model performance [61,69,75]. Precision measures the proportion of positive identifications that are actually correct, while recall reflects the proportion of actual positives correctly identified by the model. For regression tasks or continuous outcomes, some studies reported the Root Mean Square Error (RMSE), which measures the average difference between predicted and actual values, with a lower RMSE indicating better predictive accuracy [59,63,70]. Additionally, AUC-ROC is used to evaluate a model's ability to discriminate between classes; AUC values closer to one indicate stronger discriminatory power [61,63].
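As an illustration, the following Python sketch shows how these standard metrics are typically computed with scikit-learn; the label and score vectors are hypothetical placeholders rather than data from any reviewed study.

```python
# Minimal sketch: computing the metrics described above with scikit-learn.
# Labels are hypothetical; 1 = high CL, 0 = low CL.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                    # ground-truth CL labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                    # model predictions
y_score = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1, 0.7, 0.3]   # predicted P(high CL)

print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+TN+FP+FN)
print("Precision:", precision_score(y_true, y_pred))  # TP/(TP+FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP/(TP+FN)
print("F1-score :", f1_score(y_true, y_pred))         # harmonic mean of the two
print("AUC-ROC  :", roc_auc_score(y_true, y_score))   # needs scores, not labels
```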
Beyond performance-focused metrics, another model evaluation metric is generalisation, which assesses how well a model can apply learned patterns to new, unseen data. Many studies addressed this by employing cross-validation techniques [31], which partition the data into training and testing subsets to robustly estimate model generalisability [29,59,72,87]. In particular, leave-one-subject-out cross-validation (LOSO-CV) was frequently adopted, where each participant’s data was held out in turn as the test set [61,63,83,88]. This method is relevant for biometric and continuous CL measures collected within subjects, as it more accurately reflects the model’s ability to generalise across individuals. Similarly, leave-one-route-out validation, used in [63], involved holding out all data from a specific mission route to assess model adaptability to new operational contexts. These specialised cross-validation approaches ensure that model assessment is aligned with the practical requirements of CL monitoring and adaptive training applications.
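A minimal sketch of LOSO-CV, assuming scikit-learn and placeholder data, illustrates how each fold holds out all windows from one participant:

```python
# Sketch of leave-one-subject-out cross-validation (LOSO-CV) with a
# hypothetical feature matrix X, labels y, and per-sample subject IDs.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))            # 120 windows x 8 features (placeholder)
y = rng.integers(0, 2, size=120)         # binary CL labels (placeholder)
subjects = np.repeat(np.arange(10), 12)  # 10 participants, 12 windows each

# Each fold holds out every window from one participant, so the model is
# always evaluated on an entirely unseen individual.
scores = cross_val_score(SVC(), X, y, groups=subjects, cv=LeaveOneGroupOut())
print("Per-subject accuracies:", scores.round(2))
```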

4. Discussion

This discussion section interprets and analyses the significant insights presented in the results section. In addressing the research questions, it examines machine learning (ML) approaches applied to cognitive load (CL) assessment and their role in enhancing training through adaptive learning. It further examines both the potential applications and the current limitations of ML in supporting adaptive training, situates these findings within the broader literature, identifies methodological gaps, and outlines areas for future investigation. In addition to comparing ML models, the discussion considers wider methodological developments and ongoing challenges in the field, and it introduces a conceptual framework intended to support subsequent research and implementation. Based on the methodological factors summarised in Figure 6, this section presents considerations for designing ML-based CL assessment workflows that are transferable to real-world UAS operator training contexts.

4.1. ML Approaches for Cognitive Load Assessment

To address the first research question, the reviewed studies demonstrate that a wide range of ML models have been effectively applied to assess CL in UAS operator training contexts. However, no universally optimal ML approach has emerged. Reported effectiveness depends on methodological factors such as scenario design, data preprocessing, labelling, and validation strategy, meaning that the results across studies are not directly comparable. To evaluate these approaches, the following sections first examine the models applied (Section 4.1.1), and then turn to their intended purposes, namely real-time monitoring and prediction (Section 4.1.2).

4.1.1. ML Models Applied to Cognitive Load Assessment

As illustrated in Figure 3, the reviewed studies employ a variety of ML models, including both traditional and deep learning approaches, that effectively monitor, predict, and more generally assess the CL of UAS operators. The effectiveness or performance of these models depends on a combination of methodological factors such as data quality, input feature selection, and training–validation strategies. Consequently, the findings do not point to a single universally effective or best performing ML model for CL assessment.
Traditional classifiers, particularly Support Vector Machines (SVMs) and XGBoost, remain among the most widely adopted and are reported to be the most effective. Their popularity stems from their capacity to handle high-dimensional feature spaces and non-linear feature interactions, while maintaining a relatively low computational cost. These methods deliver high performance in binary classification tasks (e.g., distinguishing between low and high CL). Yet this reliance on binary classification is a limitation: medium-load states are frequently excluded, despite their operational importance as transitional phases where errors and lapses in situational awareness are likely to occur. Consequently, although SVM and XGBoost remain reliable baselines, their application in adaptive training systems is constrained. This highlights the need for multi-class or regression-based approaches.
Traditional ML models also require manual feature extraction, typically by transforming sliding-windowed biometric signal segments into handcrafted statistical or domain-specific features (e.g., mean, variance, frequency domain measures). The quality of these features therefore depends on subjective choices and domain expertise, which increases the risk of feature selection bias. Hence, techniques such as Random Forest (RF) importance scoring and Recursive Feature Elimination (RFE) are generally applied to help select significant features and mitigate dimensionality challenges, thereby enhancing model effectiveness; however, the process remains resource-intensive [38,56,57,77,89].
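The following sketch illustrates this pipeline under stated assumptions (synthetic signal windows and simple statistical features): handcrafted features per window, followed by RFE wrapped around a Random Forest.

```python
# Sketch: handcrafted window features followed by Random-Forest-guided RFE.
# Window contents and labels are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

def window_features(window):
    """Simple statistical features for one sliding-window signal segment."""
    return [window.mean(), window.std(), window.min(), window.max()]

rng = np.random.default_rng(1)
windows = rng.normal(size=(200, 256))          # 200 segments of 256 samples
X = np.array([window_features(w) for w in windows])
y = rng.integers(0, 2, size=200)               # placeholder CL labels

# RFE repeatedly fits the forest and drops the least important feature
# until only the requested number remains.
selector = RFE(RandomForestClassifier(n_estimators=100), n_features_to_select=2)
selector.fit(X, y)
print("Kept features (mask):", selector.support_)
```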
In contrast, deep learning models are designed to process raw or minimally preprocessed windowed data and to automatically learn relevant features, thus reducing the need for manual feature extraction. This offers a more efficient approach to data preprocessing. Advanced deep learning (DL) models, such as Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), and LSTM with attention, can excel at capturing the complex temporal dependencies and hierarchical patterns present in biometric and behavioural data. These architectures enable end-to-end learning without explicit feature extraction and are thus better suited for high-dimensional, complex tasks, particularly in multi-class CL classification scenarios [38,77]. While deep learning models typically outperform traditional ML approaches in complex tasks and warrant further investigation, their lack of interpretability remains a challenge. To address this, several studies have incorporated attention mechanisms, SHAP values, or other explainable ML approaches [13,62,69,77,82]. These enhancements allow advanced DL models to retain predictive power while increasing transparency, thereby enhancing their applicability and reliability in safety-critical domains like UAS training.
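As a minimal PyTorch sketch of such end-to-end learning (layer sizes and class count are illustrative, not drawn from any reviewed study), an LSTM classifier can consume raw windowed multichannel signals directly:

```python
# Minimal PyTorch sketch of an LSTM classifier over raw windowed
# multichannel signals; dimensions are illustrative only.
import torch
import torch.nn as nn

class CLClassifierLSTM(nn.Module):
    def __init__(self, n_channels=4, hidden=64, n_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                 # x: (batch, time, channels)
        _, (h_n, _) = self.lstm(x)        # h_n: (1, batch, hidden)
        return self.head(h_n[-1])         # logits: (batch, n_classes)

model = CLClassifierLSTM()
dummy = torch.randn(8, 200, 4)            # 8 windows, 200 steps, 4 channels
print(model(dummy).shape)                 # torch.Size([8, 3])
```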
More recently, Transformer-based models have emerged for CL assessment, building on their success in natural language processing and time series analysis. Transformers use self-attention mechanisms to capture long-range dependencies within sequential and multimodal biometric data. In the context of UAS operator training, Transformer models offer distinct advantages: they deliver fast, parallel inference, and their attention weights provide a degree of interpretability by highlighting which features or time periods most influence a prediction. This makes them suitable for real-time adaptive training systems. Pietracupa et al. [86] showed that Transformers can be effective in assessing EEG signals, outperforming CNNs, and provide fast classification, which makes them suitable for integration into adaptive training systems where continuous, personalised feedback is required. However, like other advanced neural models, their effective deployment depends on careful parameter tuning and access to sufficiently large, well-annotated datasets.
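A comparable minimal sketch of a Transformer encoder for windowed biometric input is shown below; the dimensions are illustrative and do not reproduce the architecture of Pietracupa et al. [86].

```python
# Hedged sketch of a small Transformer encoder for windowed biometric input.
import torch
import torch.nn as nn

class CLTransformer(nn.Module):
    def __init__(self, n_channels=4, d_model=32, n_classes=2):
        super().__init__()
        self.proj = nn.Linear(n_channels, d_model)       # embed each time step
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                  # x: (batch, time, channels)
        z = self.encoder(self.proj(x))     # self-attention over time steps
        return self.head(z.mean(dim=1))    # pool over time, then classify

model = CLTransformer()
print(model(torch.randn(8, 200, 4)).shape)   # torch.Size([8, 2])
```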
Despite the advantages of DL models, they may require large sample sizes to train, as they are vulnerable to overfitting on small datasets. Given that most studies involve relatively small datasets, typically with around 20 participants, the risk of overfitting in DL models remains significant. For example, Shao et al. [57] found that SVM outperformed Neural Networks in assessing CL, suggesting that the DL model may have overfitted and hence performed poorly. This indicates that careful model configuration and data preprocessing are required to ensure proper learning and generalisability rather than overfitting to limited samples. Overall, the selection of ML models for the CL assessment of UAS operators is shaped by trade-offs among task complexity, data volume, interpretability, and model performance; candidate models should therefore be carefully compared before selection.

4.1.2. ML-Based Real-Time Monitoring and Prediction of Cognitive Load

Most studies in this field develop ML models to assess CL by classifying CL levels in real time, thereby enabling continuous monitoring and prediction; however, prediction has received less emphasis and represents an area of growing interest. For example, the LSTM model by Jiang et al. [62] was able to monitor CL every 3 s using 2 s of EEG sensor data, achieving a response time of less than 1 s. Similarly, Salvan et al. [74] reported a model capable of real-time CL classification at a frequency of once per second, while Monfort et al. [75] proposed a model that could predict load every 60 s. Additionally, Pietracupa et al. [86] introduced a Transformer-based model with an inference speed of 0.01 s, further demonstrating the feasibility of real-time CL monitoring. Such findings highlight the potential for integrating these models into adaptive or automated training systems.
Studies implement window segmentation techniques (i.e., sliding windows) to divide data into short intervals (e.g., 30 s intervals), enabling models to continuously monitor and report CL by processing sequential data segments. This approach is particularly suited to biometric time series, as it leverages a sequential data structure to maintain temporal context and provide frequent, up-to-date assessments. Such real-time monitoring and reporting feedback is essential for informing adaptive training systems, which require an ongoing awareness of operators’ cognitive state to dynamically adjust training levels or difficulty.
In contrast, comparatively fewer studies explicitly address the predictive modelling of CL, i.e., predicting CL based on past time-stamped measurements. While monitoring provides valuable real-time feedback, prediction can enable early warnings of dangerous operator states, helping to address human error, a crucial consideration for enhancing efficiency and safety in UAS training. Notably, because sliding-window models are already trained on sequences of past measurements, many of the monitoring approaches reviewed here could in principle be extended to short-horizon prediction of CL. The further exploration of real-time predictive models could support early intervention, improving both efficiency and safety in UAS training.
In practice, with real-time assessment, there is also a tension between accuracy and latency. Models achieving a very high accuracy often do so at the expense of computational speed, while those optimised for speed may sacrifice robustness, and striking a balance between these competing demands remains a key challenge for real-world deployment. Finally, although many reviewed studies present technical feasibility, few explicitly test their integration into training environments. This gap underscores the need to move beyond proof-of-concept classification towards embedding ML-based CL assessment into closed-loop training frameworks. In such frameworks, the operator’s state informs adaptive instructional strategies in real time.

4.2. Key Methodological Factors in ML-Based CL Assessment

Answering the second research question, this section examines the key methodological factors, including data preprocessing methods, engineering techniques, and model development strategies, that substantially influence the effectiveness and/or performance of ML-based CL assessment for UAS operators. These factors not only determine reported model performance but also shape the extent to which findings can be generalised to real-world adaptive training systems. To structure this analysis, the discussion is organised into four subsections: data and preprocessing (Section 4.2.1), individual difference consideration (Section 4.2.2), data labelling strategies (Section 4.2.3), and evaluation/validation protocols (Section 4.2.4).

4.2.1. Data and Data Preprocessing

The effectiveness of ML models for the CL assessment of UAS operators fundamentally depends on the quality of the input data and the rigour of the preprocessing steps. This is commonly true for signal data collected from participants during simulated or real-world tasks, which are inherently noisy due to factors such as participant movement, sensor displacement, or environmental influences. To address these challenges, a variety of signal processing techniques are employed, most notably digital filtering (e.g., bandpass or notch filters) to exclude frequency ranges dominated by noise or unrelated physiological activity. Effective noise removal is essential, as unfiltered noise can introduce spurious features and degrade model performance. However, overly aggressive filtering may discard subtle but meaningful physiological changes that differentiate CL states. Conversely, insufficient filtering can allow artifacts to confound the model's learning process. Furthermore, optimal settings may vary by device type and individual physiology, reinforcing the need for systematic evaluation and transparent reporting.
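As a sketch of this filtering step (SciPy assumed; the band edges and notch frequency are examples, not recommendations), bandpass and notch filtering might look as follows:

```python
# Sketch: 1-40 Hz bandpass plus a 50 Hz notch on a synthetic EEG-like signal.
# Cutoffs are illustrative; appropriate values depend on device and signal.
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch

fs = 256                                          # sampling rate (Hz)
t = np.arange(0, 10, 1 / fs)
raw = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 50 * t)  # signal + mains hum

b, a = butter(4, [1, 40], btype="bandpass", fs=fs)   # keep the 1-40 Hz band
clean = filtfilt(b, a, raw)                          # zero-phase filtering

b_n, a_n = iirnotch(w0=50, Q=30, fs=fs)              # 50 Hz notch for mains noise
clean = filtfilt(b_n, a_n, clean)
print(clean.shape)
```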
Normalisation is especially important in multimodal contexts where each sensor produces outputs on different scales or units. The most commonly used approaches are min–max normalisation and z-score normalisation. While min–max normalisation scales all features to a range between 0 and 1, it may distort feature relationships in heterogeneous, multimodal datasets with wide dynamic ranges. In contrast, z-score normalisation (standardising each feature to zero mean and unit variance) is generally more appropriate for physiological time series, as it accommodates scale and variability, and thus reduces model sensitivity to inter-individual and inter-modality differences. Z-score normalisation, however, requires careful handling of missing values, which can be addressed with interpolation (e.g., linear interpolation) prior to normalisation [56,75].
Advanced normalisation strategies, such as day/trial-dependent or subject (individual)-dependent normalisation [38], are used to account for temporal and inter-individual variability. These approaches have been shown to improve model performance when generalising across sessions or participants. This is important as individual baseline biometric states (e.g., resting heart rate, skin conductance) and physiological responses can vary significantly, both between individuals and across days or trials for the same person. Applying a single normalisation across all participants and sessions may mask such differences, reducing the sensitivity to meaningful changes and impairing generalisation. Individualised normalisation, by adjusting for personal and session-based baselines, improves the sensitivity and robustness of downstream ML models. This is important for adaptive training applications, which depend on accurately tracking individual changes in CL over time.
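The following pandas sketch contrasts global z-score normalisation with the subject-dependent variant on a hypothetical heart rate table, applying linear interpolation to a missing value first:

```python
# Sketch: global vs subject-dependent z-score normalisation.
# The DataFrame layout and values are assumed for illustration.
import pandas as pd

df = pd.DataFrame({
    "subject": ["s1"] * 4 + ["s2"] * 4,
    "hr":      [62, 64, 70, 75, 88, 90, 95, None],   # heart rate with a gap
})

# Fill missing values by linear interpolation before normalising.
df["hr"] = df["hr"].interpolate(method="linear")

# Global z-score: one mean/std across all participants.
df["hr_z_global"] = (df["hr"] - df["hr"].mean()) / df["hr"].std()

# Subject-dependent z-score: each participant is normalised to their own
# baseline, removing inter-individual differences in resting heart rate.
df["hr_z_subject"] = df.groupby("subject")["hr"].transform(
    lambda s: (s - s.mean()) / s.std())
print(df)
```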
Another methodological consideration in preparing biometric and behavioural time series data is the choice of sliding window segmentation parameters. The reviewed studies commonly apply sliding windows ranging from 5 to 60 s, with overlap or step sizes varying between 25% and 50%. The optimal window size is dependent on the characteristics of the signal modality: shorter windows may be necessary to capture rapid fluctuations in signals like EEG or eye-tracking, whereas longer windows can be more appropriate for slower-changing signals such as heart rate or skin conductance. Smaller windows generate more training samples and may enable the model to detect transient changes in CL, but excessively short windows risk capturing noise rather than meaningful patterns and may impair the model’s ability to learn sustained cognitive states. Conversely, larger windows smooth out noise but can obscure important short-term variations and lead to less-responsive monitoring. The choice of window size and overlap, therefore, needs to be systematically validated, often through cross-validation experiments, to identify the optimal configuration for both model learning and practical application [38]. Notably, the segmentation strategy directly affects both the granularity and reliability of real-time monitoring or prediction, and thus has important implications for the effectiveness of adaptive feedback or training interventions that rely on timely and accurate CL assessment.
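A simple sketch of such segmentation, with the window length and overlap chosen purely for illustration from within the ranges reported above, is shown below:

```python
# Sketch of sliding-window segmentation with 50% overlap.
import numpy as np

def sliding_windows(signal, fs, win_s, overlap):
    """Yield fixed-length segments of `signal` sampled at `fs` Hz."""
    win = int(win_s * fs)
    step = int(win * (1 - overlap))
    for start in range(0, len(signal) - win + 1, step):
        yield signal[start:start + win]

fs = 128                                                  # e.g., a 128 Hz channel
signal = np.random.default_rng(2).normal(size=fs * 60)    # one minute of data
segments = list(sliding_windows(signal, fs, win_s=10, overlap=0.5))
print(len(segments), "windows of", len(segments[0]), "samples")  # 11 x 1280
```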
Different studies employ diverse data types and thus different preprocessing methods, all aimed at preparing and cleaning data to support effective machine learning (ML) model training and reliable results. Most commonly, studies use unimodal or multimodal biometric measures, which are continuous data. Such data enable ML models to assess cognitive load (CL) in real time. In addition to these biometric inputs, several studies also collect subjective workload ratings, most commonly the NASA-TLX, administered after task completion, providing discrete, self-reported measures of perceived workload [13,29,31,38,56,59,61,62,67,68,73,74,76,77,84]. NASA-TLX scores are not used as input data for ML models; rather, they serve as secondary measures to support CL-level labelling and validation. Specifically, they are employed to define CL-level labels or to verify task-based scenario labels, evaluating whether the collected data effectively differentiates between different CL levels. This validation is important because CL is a latent psychological construct without a direct observable ground truth. Subjective workload assessments thus act as external references to confirm that model-derived classifications or labels correspond to human-perceived workload. Incorporating such validation enhances the reliability of data labelling and improves overall data quality, thereby increasing the reliability of ML-based CL assessment.
In multimodal data fusion, studies differ in their use of unimodal versus multimodal approaches with ML models. Approximately ten studies demonstrate that ML models using a single sensor modality (for example, EEG) can effectively monitor or predict CL. However, given the multidimensional nature of cognitive workload, several studies have increasingly adopted multisensor (multimodal) data integration to capture a wider spectrum of physiological and behavioural indicators. ML models incorporating multiple biometric measures, and sometimes performance metrics, have been shown in several studies to outperform unimodal approaches in terms of classification accuracy and robustness [38,82]. Nevertheless, this improvement comes with trade-offs: richer multimodal data can support more powerful models, but the added computational and data collection burdens, together with challenges in synchronisation and processing, increase system complexity and cost and may hinder practical deployment [82].
In addition, preprocessing steps are used to address sensor heterogeneity. For multimodal physiological data, it is common for different sensor channels to operate at slightly different sampling rates or have missing values at various time points. To address these discrepancies, interpolation methods are widely used not only to fill missing values but also to align signals to a common time grid [57]. Such time alignment is important for ensuring that features extracted from different modalities are temporally synchronised, which is essential for the downstream ML model to learn the true relationships between concurrent physiological responses and CL. Failure to properly align multimodal signals may introduce artificial lags or mismatches, thereby degrading both the quality of data and the model’s performance.
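As a hypothetical illustration, two streams sampled at different rates can be aligned to a common time grid with pandas resampling and interpolation:

```python
# Sketch: aligning two sensors with different sampling rates to a common
# time grid via resampling and interpolation; rates are illustrative.
import numpy as np
import pandas as pd

idx_fast = pd.date_range("2024-01-01", periods=500, freq="10ms")   # 100 Hz
idx_slow = pd.date_range("2024-01-01", periods=20, freq="250ms")   # 4 Hz
eeg = pd.Series(np.random.randn(500), index=idx_fast, name="eeg")
hr = pd.Series(70 + np.random.randn(20), index=idx_slow, name="hr")

# Resample both streams to a shared 50 ms grid, interpolating the gaps so
# every row carries temporally synchronised values from both modalities.
grid = pd.concat([eeg, hr], axis=1).resample("50ms").mean().interpolate("time")
print(grid.head())
```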
The data sample size affects data quality in ML-based cognitive load (CL) assessment and substantially influences model effectiveness and reliability, which are critical for future real-world applications. Most studies collected biometric data while participants performed the designed operations, so the number of participants determines the sample size. The reviewed literature shows participant numbers ranging from five to forty-eight, with the most common sample size being around twenty-three. These studies indicate that ML models can be trained to assess CL using datasets of varying sizes, though their broader generalisability has yet to be established. These differences in sample size influence model reliability and applicability for training applications. Smaller sample sizes are more vulnerable to noise and outliers, which can distort learning and reduce model stability. In contrast, larger datasets enhance representativeness and allow models to generalise more effectively across unseen participants or contexts, which is crucial for adaptive training applications. Nevertheless, the relatively small sample sizes observed (five to forty-eight participants) raise questions about statistical representativeness, as such ranges may not adequately capture population diversity or individual variability. Consequently, many studies may develop seemingly effective CL assessment models based on limited statistical power, potentially resulting in overestimated accuracies and restricted external validity. Expanding both the number and diversity of participants will therefore be essential for enhancing the reliability of ML-based CL assessment models, particularly for their application in enhancing training.
Beyond the number of participants, the studies also vary in the number of trials and the duration of simulated operational tasks, which directly affect the total amount and diversity of collected data. While these factors also contribute to the overall data quality and variability, the number of participants remains the critical determinant, as it defines the statistical representativeness and generalisability of the dataset. Therefore, this discussion primarily focuses on participant sample size, while acknowledging that trial frequency and task duration are also important considerations for ensuring sufficient data volume and variability in future studies.
Another important consideration is data quality: the fidelity and realism of the simulation environment directly affect the collected data and, consequently, the reliability and applicability of ML models. Most reviewed studies collected biometric data during simulated task environments; a few studies, particularly those involving air traffic controllers (ATCs), obtained physiological data during actual operations. As most reviewed studies were conducted in simulated environments, the findings mainly reflect ML performance under controlled experimental conditions, while a few real-world studies (mainly in ATCs) demonstrate that ML approaches can also effectively assess CL in operational settings. As demonstrated by [59], no significant differences were found between simulated and real-world environments in key behavioural and psychophysiological measures, indicating that immersive, high-fidelity simulations can serve as scalable and valid alternatives to field-based data collection for ML model development.
While simulators provide controlled, repeatable conditions ideal for experimental standardisation, they may still lack the complexity, unpredictability, and operational stressors inherent in real-world missions, potentially limiting model generalisability. Therefore, the realism of simulation environments requires careful consideration. Realistic, immersive simulation environments tend to yield more accurate and ecologically valid biometric data [7], ultimately enhancing both ML model performance and reliability and improving applicability to real-world operations. Although both simulated and real-world studies demonstrate the feasibility of ML-based CL assessment, simulation-based studies report model accuracies under controlled and low-noise conditions, whereas real-world studies show greater variability but stronger ecological validity. This difference highlights a trade-off between experimental control and operational realism that should be further investigated in future work.

4.2.2. Addressing Individual Differences

Biometric signal measures are inherently individualised, influenced by a range of personal factors, for example, natural variation in heart rate ranges among individuals. To account for this, studies have developed ML models that are able to capture and identify the variability introduced by individual differences, improving the models' generalisability for real-life applications. Two main strategies have emerged in the literature for accommodating individual differences.
The first approach involves developing individual- or subject-specific models, whereby a separate model is trained for each participant [29,70,87]. This method enables the model to capture unique physiological and behavioural patterns associated with each operator's CL, leading to higher within-subject classification accuracy. These models are effective when sufficient data are available for each individual; however, they are resource-intensive and impose substantial data collection and computational overhead for each trainee. As a result, the scalability of this approach is limited, especially in large-scale or dynamic training environments where operator cohorts are continually changing.
The second approach integrates individual differences into feature engineering. For example, models may assign higher weights to features known to be sensitive to inter-individual variability, or include participant-specific metadata as input variables [38]. This individualised modelling can improve classification accuracy without the high cost of training separate models for each operator. Nonetheless, this approach may still struggle to fully capture subtle or complex individual traits and may overfit to observed differences, potentially impairing generalisation to new subjects or scenarios. This indicates the ongoing trade-off between personalisation and scalability, and the need for solutions that are both effective and adaptable for real-world training programs.

4.2.3. Data Labelling Approaches

The effectiveness of ML approaches for CL assessment is fundamentally influenced by data labelling strategies. The reviewed studies commonly employ several labelling approaches, including scenario task-based labels, subjective workload ratings, expert annotation, and unsupervised learning techniques.
Scenario-based labelling, a common method, involves labelling data based on the difficulty or structure of task scenarios, under the assumption that more complex or multitasking-demanding tasks induce a higher CL. While this approach can approximate an objective “ground truth” in controlled experimental settings, it assumes that all participants experience similar cognitive demands for the same task; however, this premise may not hold due to individual differences in skill, strategy, or familiarity. Notably, real-time or field data labelling presents a further challenge, as ground truth CL is difficult to obtain in practice. Some studies have improved label quality by incorporating expert guidance, where domain specialists annotate or validate CL levels based on behavioural observations, physiological markers, or task performance [89]. Expert-annotated labels tend to be more reliable and nuanced; however, this process is resource-intensive and not scalable, particularly for large datasets or real-time applications.
Another widely used approach relies on subjective workload ratings, such as NASA-TLX scores, in which participants provide real-time or retrospective assessments of their perceived workload. Although subjective labels are easily obtainable and provide insights into UAS operator experience, they are vulnerable to human bias, recall inaccuracies, and mood-dependent effects. Operators may not reliably distinguish between different dimensions on questionnaires, and asking them to perform ratings can be disruptive. As a result, continuous data during transitions between CL levels cannot be collected; however, such data are important for pre-error CL identification. Hence, subjective ratings are not regarded as objective ground truth, and their use may introduce noise into model training and evaluation.
On the other hand, some studies explore unsupervised or semi-supervised learning techniques to generate labels directly from the data [57,58,79]. Clustering algorithms, for example, can identify naturally occurring patterns or clusters in physiological or behavioural data, which can subsequently be mapped to CL categories in a more objective manner. These data-driven labels reduce human bias, improve consistency, and may uncover latent workload states not captured by scenario-based or self-report measures. Studies have demonstrated the feasibility of these approaches and indicated their potential to further improve data labelling quality, thereby enhancing model performance and reliability in training applications.
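A minimal sketch of this idea clusters synthetic window features with k-means and orders the clusters into CL levels by a feature assumed, for illustration, to rise with workload:

```python
# Sketch of data-driven labelling: cluster window features, then order the
# clusters into CL levels by a workload-sensitive feature (an assumption
# here, e.g., mean pupil diameter or heart rate rising with load).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (0, 2, 4)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# Rank clusters by the mean of the first feature so cluster ids map onto
# ordered CL levels (0 = low, 2 = high) instead of arbitrary labels.
order = np.argsort([X[kmeans.labels_ == k][:, 0].mean() for k in range(3)])
level = {cluster: rank for rank, cluster in enumerate(order)}
cl_labels = np.array([level[c] for c in kmeans.labels_])
print(np.bincount(cl_labels))   # roughly 50 windows per CL level
```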

4.2.4. Validation and Evaluation Protocols

Studies have applied diverse methods to evaluate and validate the effectiveness or performance of ML models for CL assessment. Performance is most commonly reported in terms of classification metrics, with accuracy being the primary metric used to identify the best performing models. This is because accuracy offers a clear and intuitive measure of overall classification effectiveness. However, in datasets with a significant class imbalance, accuracy may be misleading or insufficiently reflective of true performance. Techniques such as the Synthetic Minority Oversampling Technique (SMOTE) have also been used to balance class distributions and improve the reliability of performance metrics [86].
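For illustration, a SMOTE rebalancing step with the imbalanced-learn library (placeholder data and class ratio) might look as follows:

```python
# Sketch of rebalancing an imbalanced CL dataset with SMOTE.
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(4)
X = rng.normal(size=(110, 6))
y = np.array([0] * 100 + [1] * 10)          # rare "high CL" minority class

# SMOTE synthesises new minority samples by interpolating between a minority
# point and its nearest minority-class neighbours.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))     # {0: 100, 1: 10} -> {0: 100, 1: 100}
```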
To provide a more detailed understanding, many studies also report additional metrics such as recall (sensitivity), precision, and the F1-score, which collectively provide a more nuanced view of performance, particularly in imbalanced or multi-class settings. In addition, some studies further employ the area under the ROC curve (AUC-ROC) for a more comprehensive evaluation, especially in complex scenarios [61,63]. However, these metrics are often reported descriptively rather than interpreted. For example, recall is particularly relevant in adaptive training applications, where failing to detect a high CL (i.e., False Negatives) poses a greater operational risk than occasional false alarms. In such contexts, it is preferable to prioritise sensitivity over accuracy, following the principle that it is better to over-alert than to overlook. This highlights the need for the more systematic exploration of performance metrics beyond accuracy to ensure operational reliability and safety in training systems.
Another dimension of validation is generalisation, which is fundamentally tied to the effectiveness of a model on unseen data and its applicability in real-life adaptive training contexts. To assess generalisation, most studies adopt cross-validation strategies such as k-fold cross-validation (k-fold CV), leave-one-out cross-validation (LOOCV), or, increasingly, leave-one-subject-out (LOSO) and group-based methods. In k-fold CV, the dataset is partitioned into k subsets, and the model is iteratively trained and tested so that each subset serves as the validation set once, with the results then averaged to help mitigate bias due to random data splits. LOOCV, in which a single instance is held out as the test set for each iteration, is particularly useful for small datasets, maximising data utilisation but often at the cost of higher variance.
Importantly, biometric data in this field are typically subject- and session-dependent, exhibiting significant inter- and intra-individual variability. Consequently, an increasing number of studies adopt more specific cross-validation methods, particularly LOSO or group-based strategies, to evaluate model generalisability and guard against overfitting [61,63,83,88]. These methods are relevant for within-subject biometric data, as they provide a rigorous assessment of a model's performance on completely unseen participants or conditions after being trained on some individuals or scenarios. Despite their importance, a notable number of studies do not explicitly report their validation procedures, which makes cross-study comparisons and replication challenging. Overall, rigorous and transparent evaluation protocols are essential for establishing the reliability and generalisability of ML models in CL assessment and adaptive training systems.
In addition, the way data are split into training and validation sets significantly influences the generalisation and effectiveness of ML models in real-world applications. This is significant for biometric or time series data with within-subject (individual) patterns, because data are collected from each participant during experiments. In many studies, the data splitting method is not explicitly reported, and random shuffling is often assumed. However, random splitting can disrupt within-subject, temporal, or structural dependencies inherent in biometric time series data, thereby limiting the model's ability to generalise to real-world conditions. A high validation accuracy under random splits does not necessarily indicate that a model has effectively learned target patterns; it may simply reflect the model's capacity to fit internal data correlations.
Robust evaluation designs are therefore essential to accurately assess generalisation. For instance, ref. [70] implemented a chronological split to preserve the temporal consistency in feature encoding. However, this approach may lead to overfitting, as the training and test data could come from one participant's continuous stream, which limits generalisation. Other studies by Momeni et al. [31], Dell'Agnola et al. [38], Nittala et al. [63], Dell'Agnola et al. [77], Binias et al. [87] trained models on certain trials or mission scenarios and tested them on entirely new ones, allowing the model to learn the direct relationships between CL levels and biometric signals in varied contexts. However, these splits can be affected by trial-specific noise, which may not reflect broader, real-world variability. Moreover, studies by Yiu et al. [61], Nittala et al. [63], Salvan et al. [74], Monfort et al. [75], Li et al. [83], Pietracupa et al. [86], Alkadi et al. [88] used participant-wise splits, which avoid participant data leakage and best assess generalisation to new operators, a crucial consideration for adaptive training applications. Nevertheless, even participant-wise splitting may lead to overfitting to characteristics specific to the training group if inter-individual variability is not explicitly modelled. Even so, because participant-wise splits are the least prone to data leakage, they remain the most appropriate choice when the goal is generalisation to new operators in adaptive training.
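The contrast between random and participant-wise splitting can be sketched with scikit-learn; the data are placeholders, and GroupShuffleSplit keeps each participant's windows entirely on one side of the split:

```python
# Sketch contrasting a random split with a participant-wise split.
import numpy as np
from sklearn.model_selection import train_test_split, GroupShuffleSplit

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4))
subjects = np.repeat(np.arange(10), 10)     # 10 windows per participant

# Random split: windows from the same participant leak into both sets.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Participant-wise split: held-out participants are entirely unseen.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, groups=subjects))
print("Held-out subjects:", np.unique(subjects[test_idx]))
```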
Collectively, this demonstrates the importance of appropriate validation design in revealing a model’s true generalisation performance. Since many datasets are constructed by assigning participants to different CL scenarios, internal dependencies can arise between data segments. Splitting data without accounting for these dependencies can inflate performance metrics and limit generalisability. Therefore, carefully designed training–validation splitting strategies are essential to ensure that models perform reliably with entirely new individuals or task conditions, which is important for developing adaptive training systems.
In conclusion, a careful, transparent, and context-aware approach to ML workflows, methodological factors, and data processing is essential for advancing the reliable application of ML to assessing the CL of UAS operators and for enhancing or informing UAS operator training. These considerations collectively lay the groundwork for developing reliable, generalisable, and practically useful ML models to support effective UAS operator training and performance improvement.

4.3. ML Enhancement of UAS Operator Training

To address the third research question, the literature review demonstrates that ML approaches can enhance the effectiveness and adaptability of operator training systems by enabling real-time, data-driven assessment and personalised instruction. Compared with traditional training, which often struggles to deliver individualised feedback, is resource intensive, and may not adapt dynamically to trainee needs, ML-based systems can automatically and continuously analyse operator-related data, providing objective, comprehensive, and consistent guidance [7,90]. This data-driven approach supports adaptive training, automating assessment, delivering real-time feedback, and dynamically adjusting training complexity, which collectively improve both efficiency and effectiveness [80]. However, implementing ML assessment models in real-world UAS operator training also introduces several practical considerations that influence their reliability, scalability, and acceptance. These challenges primarily concern deployment feasibility, ethical responsibility, and cost sustainability, all of which shape the successful application and long-term effectiveness of ML-based cognitive load (CL) assessment within adaptive UAS operator training systems. Accordingly, this section is structured into four subsections covering ML's potential for enhancing operator training (Section 4.3.1), deployment considerations (Section 4.3.2), ethical considerations (Section 4.3.3), and cost considerations (Section 4.3.4).

4.3.1. ML Potential for Enhancing UAS Operator Training

ML models that assess CL provide a scientific basis for optimising UAS operator training and enhancing operator performance, thereby reducing human errors [29,59,67,81,84]. Real-time CL assessments underpin adaptive training systems that continuously assess UAS operator-related data, such as biometric data, and compare it to expert benchmarks or predefined thresholds, delivering dynamic, real-time feedback and adjusting training complexity in alignment with UAS trainees' cognitive capability [7,48,63,85,90]. This enables ML to function as a virtual instructor, replicating expert decisions while offering objective, personalised guidance during training, tailored to operators' CL or status conditions [80,82].
Such adaptive training systems can identify signs of cognitive overload or underload and recalibrate task difficulty, thus maintaining trainees within an optimal learning zone. By learning from biometric or performance data, ML models can develop a detailed understanding of operator profiles, which in turn helps instructors identify individual strengths and weaknesses. This supports the customisation of training curricula, enhances automated trainee qualification selection, and improves performance evaluations [48,79,90]. Moreover, such systems mitigate bias or inconsistency from human instructors by offering data-driven feedback, allowing trainees to progressively refine their skills through consistent reinforcement and improved overall readiness. They also help to reduce instructor workload, streamline training resources, and ultimately lower training costs [84].

4.3.2. Deployment Considerations

Deployment considerations concern the process of transitioning ML CL assessment models from controlled experimental or simulated environments into real-world UAS operator adaptive training applications. This stage determines whether models that perform effectively under laboratory conditions can maintain accuracy, efficiency, and reliability when faced with the variability and unpredictability of operational contexts. As argued by [38,59], a model’s ability to generalise beyond its training data is central to ensuring its safe and effective use in real-world applications. Therefore, deployment is not only a technical step but also a validation of whether an ML model can bridge the gap between research and operational implementation.
A key deployment concern identified across the reviewed studies lies in the generalisation capability of ML models, as only a limited number of works evaluate model performance on unseen participant or scenario datasets, as discussed in Section 4.2.4. While most studies employ cross-validation to test model generalisation [13,29,31,56,57,61,72,77,81,82,83], these evaluations are often limited to small datasets, typically ranging from 5 to 47 participants (see Section 4.2.1), which may restrict the model’s capacity to adapt to new operators or unseen tasks. As suggested by Luo et al. [13] and Xi et al. [64], models trained on small and homogeneous datasets may struggle to maintain performance stability in real-world applications, where larger and more diverse datasets can provide a broader representation of physiological and behavioural variability, enabling models to better capture inter-individual and contextual differences in cognitive load. This variability is particularly relevant for UAS operator training, where individual differences in expertise, stress tolerance, and situational awareness substantially affect cognitive responses. Therefore, dataset size and diversity are critical determinants of model generalisability and, consequently, deployment success, ensuring that ML-based cognitive load assessment models remain reliable and applicable in high-stakes UAS training contexts.
Another critical deployment consideration concerns computational efficiency, which reflects the model's real-time processing capacity for assessing cognitive load (CL) and determines whether ML models can be feasibly integrated into UAS operator training systems. Timely feedback is essential for adaptive training interventions, as it enables the continuous assessment of cognitive states and the dynamic adjustment of task difficulty or feedback intensity, implying that models should consider both predictive accuracy and inference speed. Recent studies have begun exploring architectures that address this; as noted by Pietracupa et al. [86], Transformer-based models are particularly promising in this regard due to their high degree of parallelisation and speed during training. Their ability to capture long-range dependencies enables fast, scalable processing without a significant loss of accuracy, suggesting strong potential for real-time integration in UAS operator training environments. Nevertheless, further investigation is warranted to optimise this trade-off between accuracy and computational efficiency for practical deployment.
Trust in ML-supported adaptive systems is another important consideration for deployment: for ML-based cognitive load (CL) assessment models to be effectively integrated into training workflows, both trainees and instructors must have confidence in the model's outputs [82]. Such trust, as highlighted by [55], depends largely on the transparency of the model, as users are more likely to trust and act upon ML-generated feedback when they can understand how predictions are made. As mentioned previously, deep learning (DL) approaches offer distinct advantages in automatically extracting complex, non-linear relationships across multimodal data, which can improve assessment accuracy and reduce manual feature engineering [13,62]. However, these models often operate as “black boxes”, with limited transparency regarding how CL predictions are derived, which may present challenges for adoption in practical training environments, as they lack the explainability essential for user acceptance and trust [60,82]. Explainable AI (XAI) frameworks, such as attention-based mechanisms and SHAP interpretability layers, have been proposed to improve model transparency [62,69,71,86]. Integrating such XAI techniques into ML pipelines therefore represents a key deployment requirement to enhance trust, ensure reliability, and enable ML systems to function effectively within adaptive UAS training applications.
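As an illustrative sketch (the shap package and a tree-based stand-in model are assumptions, not the setup of any reviewed study), SHAP-based feature attributions can be computed as follows:

```python
# Hedged sketch of SHAP-based transparency for a tree model; the dataset
# and feature semantics are placeholders (e.g., HRV, pupil, blink, fixation).
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # synthetic "high CL" rule

model = RandomForestClassifier(n_estimators=100).fit(X, y)
explainer = shap.TreeExplainer(model)
sv = explainer.shap_values(X)                   # per-feature contributions
sv = sv[1] if isinstance(sv, list) else sv      # per-class list in older shap
# Mean |SHAP| per feature indicates which inputs drive CL predictions.
print(np.abs(sv).mean(axis=0))
```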
On the other hand, deploying ML applications in real-world UAS training also introduces additional considerations, including cybersecurity, user acceptance, and regulatory compliance. Transitioning ML models from controlled experimental settings to operational environments involves addressing distinct challenges between military and civilian (dual-use) domains. In military contexts, system certification and data security are critical to protect sensitive biometric and operational data and maintain reliability under mission conditions. In civilian training environments, the emphasis shifts toward scalability, user accessibility, and compliance with aviation regulations. Successful deployment thus depends on developing adaptable systems that satisfy both military requirements for reliability and accountability and civilian needs for usability and regulatory approval.

4.3.3. Ethical Considerations

Ethical considerations form a central dimension in the development and deployment of an ML-based CL assessment model for UAS operator training. These concerns encompass data privacy, informed consent, and algorithmic fairness in automated decision-making. As ML increasingly guides adaptive training decisions that directly influence trainee evaluation and learning outcomes, ethical oversight becomes essential to ensure that systems operate transparently and respect human dignity and autonomy. In this regard, robust ethical governance not only protects participant rights but also strengthens model reliability and the overall integrity of AI-driven UAS training systems [91].
The collection and use of biometric and performance data, central to ML-based CL assessment, pose unique ethical challenges because such human-related data are inherently personal and, in many cases, potentially identifiable [61,68]. Ensuring ethical integrity in these ML assessment models requires robust frameworks for data ownership, consent, and protection against misuse or bias, which are essential for ensuring that ML models produce reliable and ethically sound outputs applicable to real-world UAS operator training. The reviewed studies generally obtained institutional ethics approval to ensure compliance with research governance standards [26,29,31,61,68,77,85,92], yet these protections may not be sufficient as ML systems move from controlled experiments to real-world operational training. Prior research in applied aviation psychology indicates that improper data handling or the unauthorised secondary use of biometric data can compromise not only privacy but also perceptions of fairness in training evaluation [37,38]. Future applications should therefore consider establishing comprehensive governance frameworks to regulate ownership, access, and data-sharing between training institutions and system developers, ensuring ethical integrity throughout the full data lifecycle.
Beyond data ethics, model transparency and fairness represent further ethical imperatives. Since ML models underpin adaptive feedback systems, their outputs must remain unbiased and aligned with human values; therefore, ensuring transparency is essential to detect and prevent potential bias in automated decision-making. Although explainability methods, such as SHapley Additive exPlanations (SHAP) and attention visualisation, offer partial interpretability, they often fail to capture the full reasoning process behind model predictions [13,62,69,71]. This opacity can create uncertainty about whether model outputs are fair or systematically biased, thereby introducing ethical risks and potentially resulting in inappropriate or inequitable training adjustments. To mitigate these risks, studies (e.g., [13,48,59]) suggest incorporating human expert oversight, which is necessary to verify model decisions, interpret outputs, and ensure their alignment with ethical and operational principles. This emphasises the need for maintaining a balanced human–AI collaboration, in which ML systems function as decision support tools rather than autonomous evaluators. Sustained human supervision ensures that system outputs remain consistent with operational objectives, contextual realities, and broader ethical standards, supporting both fairness and resilience, which is especially important in safety-critical military UAS training contexts.

4.3.4. Cost Constraints

Cost constraints refer to the financial and resource limitations that influence the design, development, and sustainability of ML-based cognitive load (CL) assessment models for adaptive training applications. While such ML-supported adaptive training systems have the potential to reduce long-term training expenses by automating performance assessment, reducing instructor workload, and improving overall efficiency, their practical implementation remains resource-intensive. Developing systems that are both reliable and adaptable to real-world UAS operator training environments requires substantial initial and ongoing financial and time investment to ensure their effectiveness, reliability, and long-term functionality. This highlights the need to consider the practical cost challenges involved in translating machine learning (ML) assessment models into scalable operational training solutions.
The reviewed studies indicate that building reliable ML models capable of supporting adaptive training relies heavily on access to large, high-quality datasets [61,64,83]. Achieving this typically requires recruiting larger participant cohorts, using high-fidelity simulators, and integrating advanced sensing hardware such as EEG, PPG, or eye-tracking devices [59,60,76], all of which contribute substantially to the initial development costs. For instance, Wojciechowski et al. [7] and Sakib et al. [59] emphasised that increasing simulation realism and multimodal sensing precision directly improves data quality but also raises equipment and calibration costs. Moreover, as dataset size and complexity grow, additional costs arise from establishing secure and scalable data management systems, ensuring encryption and compliance with ethical standards. Together, these factors make the early development phase both resource-intensive and time-consuming.
Beyond the initial development phase, the financial implications extend across the system’s lifecycle. As noted in prior studies (e.g., [38,55]), deploying CL assessment ML models in real-world training environments requires regular recalibration and retraining to mitigate data drift and maintain model reliability and effectiveness over time, which introduces additional expenses. These costs are further compounded by ongoing updates to sensing hardware, software maintenance, and the management of increasingly large data storage systems. These requirements reflect that adaptive ML systems are not static products but evolving tools that need ongoing technical and financial support to remain effective.
Furthermore, the adoption of such systems also requires additional investment in human resources, as instructors and training supervisors are needed to oversee ML models’ CL outputs [66]. Since these models cannot achieve complete accuracy or reliability, human experts must be trained to operate, interpret, and monitor ML-based systems, ensuring that the technology is applied appropriately and that model feedback is effectively integrated into training workflows. Balancing automation with human oversight therefore introduces an additional cost component related to professional development and system familiarisation, both of which are essential to maintain operational trust and instructional quality.

4.4. Limitations and Future Directions

There is a clear trend toward the increased adoption of ML-based cognitive load (CL) assessment models to inform and enhance adaptive training in UAS operator research. However, several methodological and practical limitations persist across the reviewed literature. Most studies have not yet validated their models in real-world or longitudinal UAS operator training contexts (e.g., [7,80]). Future investigations could therefore explore the long-term reliability and operational stability of ML-driven adaptive training systems, particularly their capacity to sustain predictive accuracy over time and adapt to evolving operator states or CL dynamics.
In addition to validation and generalisability, many reviewed studies developed ML models using relatively small and homogeneous datasets, which restricts the ability of models to generalise to diverse operators or operational conditions (e.g., [38,59]). Expanding dataset diversity to include a broader range of participant profiles and environmental factors would enhance representativeness and improve the robustness of model predictions. Similarly, simulation designs could be diversified and made more realistic to increase ecological validity and ensure stronger alignment with real-world UAS operations.
Another key limitation relates to the consideration of significant methodological factors, especially individual variability and multimodal data integration. Although individual variability in CL is documented, developing individualised models remains computationally intensive and methodologically complex. Thus, further research is needed to build scalable operator-specific architectures that can more accurately model CL in real-world UAS training environments. Additionally, the multidimensional nature of CL, which spans biometric data, performance, environmental context, and personal traits, demands efficient algorithms for signal alignment, data fusion, and dimensionality reduction to support integrated, real-time analysis.
Furthermore, the practical deployment of these systems must address computational efficiency, real-time feedback requirements, and strict data privacy and security regulations, especially given the sensitive nature of biometric data. Regular ML model recalibration and validation will be essential to address data drift, model ageing, and evolving training requirements [55]. The ongoing role of human experts is indispensable, both for ethical oversight and for ensuring that AI systems align with operational standards and organisational values. Also, the development of transparent and interpretable ML models, the standardisation of methodological protocols, and effective human–AI collaboration frameworks should be further considered for the sustainable and trustworthy adoption of ML-driven adaptive training in UAS operator contexts.
Emerging approaches such as transfer learning and reinforcement learning (RL) offer promising directions for future research. Transfer learning enables the use of knowledge from related aviation or sensor-based domains to improve data efficiency and cross-task generalisation, which is particularly valuable when large, high-quality datasets are limited. Recent studies have demonstrated the validity of applying transfer learning techniques to cognitive load (CL) assessment [64,86], supporting the potential of these methods to reduce data requirements and enhance cross-domain adaptability in UAS operator contexts. RL, in contrast, allows adaptive training systems to dynamically adjust task difficulty or scenario complexity in response to trainee performance and cognitive state, thereby personalising learning experiences in real time. Prior studies in adaptive training have shown the potential of RL to effectively optimise scenario sequencing and support skill acquisition under varying workload demands [93,94]. In parallel, human-in-the-loop (HITL) frameworks have proven effective in complex decision support and large language model (LLM) systems, where expert feedback is continuously integrated to guide model behaviour [95,96]. Adapting such frameworks to UAS operator training could ensure that ML-based adaptive systems remain transparent, ethically aligned, and operationally grounded. While most current ML-based training systems rely on supervised or static models, these emerging approaches (transfer learning, reinforcement learning, and human-in-the-loop frameworks) offer complementary strengths that address the existing limitations in data efficiency, adaptability, and trust. These directions outline a roadmap for building more flexible, intelligent, and human-aligned adaptive training environments.

5. Conclusions

The rapid development of Uncrewed Aerial Systems (UAS) has established them as an important component of technological progress in aviation. As UAS increasingly take on roles in both the commercial and defence sectors, the requirement for effective operator training becomes central. Within this context, cognitive load (CL) has been identified as a key determinant of operator performance, as elevated CL contributes to errors and reduced mission effectiveness. Addressing CL in training is therefore necessary for ensuring operational reliability. Machine learning (ML) provides adaptive capabilities that can be applied to assess and respond to CL, offering a structured approach to making UAS operator training more responsive, efficient, and tailored to individual demands.
This systematic review synthesised 38 studies from an initial pool of 1611, focusing on applications of ML in CL assessment for UAS operator training. The findings highlight that ML models, ranging from traditional classifiers such as SVM to advanced deep learning architectures such as LSTM, can support adaptive instruction, continuous monitoring, and data-driven feedback. These capabilities reduce human error, improve situational awareness, and align training demands with individual cognitive capacity. Importantly, this review also identified methodological factors, such as data preprocessing, labelling strategies, and validation protocols, that influence model effectiveness and generalisability.
A limitation of this review is its scope, which centres primarily on ML applications for CL and operator-related training factors. Broader ML contributions, including enhancements to training environments, user interfaces, and scenario realism, were not the primary focus but represent valuable directions for future inquiry. Furthermore, this review considered studies published between 2014 and 2024, a deliberately wide temporal range intended to capture the evolution of ML approaches in UAS operator training. However, given the rapid technological advancement in both UAS and AI/ML, earlier studies may no longer reflect the current state of the field in terms of algorithmic sophistication, data availability, or computational capability, and they should therefore be interpreted with their historical context in mind. Future research may accordingly focus on recent studies to reflect current ML capabilities. Remaining challenges, such as small datasets, limited cross-subject validation, and insufficiently realistic simulation designs, continue to pose barriers to operational deployment and must be addressed in future investigations.
Ultimately, the integration of ML into immersive and operational UAS operator training contexts represents a progression toward adaptive and scalable training systems. By supporting personalised, real-time, and data-driven instruction, ML can contribute to enhancing operator readiness, facilitating skill acquisition, and supporting safety in complex UAS mission environments.

Author Contributions

Conceptualisation, O.M. and G.E.; methodology, O.M., G.E. and H.E.-F.; data collection, Q.L.; visualisation, Q.L.; analysis, Q.L., O.M., H.E.-F. and G.E.; writing—review and editing, Q.L., O.M., H.E.-F. and G.E. All authors have read and agreed to the published version of the manuscript.

Funding

The research was made possible through the support of the Defence Trailblazer Program, supported by the Department of Education, the University of New South Wales, and the University of Adelaide (RG 245502/245503).

Data Availability Statement

No new data were created or analysed in this study. Data sharing is not applicable to this article.

Acknowledgments

We would like to thank all participants for their contribution to this research.

Conflicts of Interest

The authors Oleksandra Molloy and Heba El-Fiqi are employed by the University of New South Wales (Canberra), Qianchu Li is a student investigator at the University of New South Wales (Canberra), and Gary Eves is employed by the company CAE. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
ATCs: Air Traffic Controllers
BVP: Blood Volume Pulse
CL: Cognitive Load
CNN: Convolutional Neural Network
CV: Cross-Validation
DL: Deep Learning
DT: Decision Tree
EDA: Electrodermal Activity (Galvanic Skin Response, GSR)
ECG: Electrocardiogram
EEG: Electroencephalogram
EDS: Electrodermal Signals
EMG: Electromyography
fNIRS: Functional Near-Infrared Spectroscopy
HR: Heart Rate
HRV: Heart Rate Variability
ICA: Independent Component Analysis
KNN: K-Nearest Neighbours
LSTM: Long Short-Term Memory
ML: Machine Learning
NASA-TLX: NASA Task Load Index
NN: Neural Network
PCA: Principal Component Analysis
PPG: Photoplethysmography
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
RBF: Radial Basis Function
RF: Random Forest
RFE: Recursive Feature Elimination
RNN: Recurrent Neural Network
RMSE: Root Mean Squared Error
RSP: Respiration
SHAP: SHapley Additive exPlanations
SKT: Skin Temperature
SVM: Support Vector Machine
UAS: Uncrewed Aerial System
UAV: Uncrewed Aerial Vehicle
VR: Virtual Reality
XGBoost: Extreme Gradient Boosting

References

  1. Rabiu, L.; Ahmad, A.; Gohari, A. Advancements of Unmanned Aerial Vehicle Technology in the Realm of Applied Sciences and Engineering: A Review. J. Adv. Res. Appl. Sci. Eng. Technol. 2024, 40, 74–95. [Google Scholar] [CrossRef]
  2. Wanner, D.; Hashim, H.A.; Srivastava, S.; Steinhauer, A. UAV avionics safety, certification, accidents, redundancy, integrity, and reliability: A comprehensive review and future trends. Drone Syst. Appl. 2024, 12, 1–23. [Google Scholar] [CrossRef]
  3. Grindley, B.; Phillips, K.; Parnell, K.J.; Cherrett, T.; Scanlan, J.; Plant, K.L. Over a decade of UAV incidents: A human factors analysis of causal factors. Appl. Ergon. 2024, 121, 104355. [Google Scholar] [CrossRef] [PubMed]
  4. Orlady, L.M. Airline pilot training today and tomorrow. In Crew Resource Management; Elsevier: Amsterdam, The Netherlands, 2010; pp. 469–491. [Google Scholar]
  5. Qi, S.; Wang, F.; Jing, L. Unmanned aircraft system pilot/operator qualification requirements and training study. In MATEC Web of Conferences, Proceedings of the 2018 2nd International Conference on Mechanical, Material and Aerospace Engineering (2MAE 2018), Wuhan, China, 10–13 May 2018; EDP Sciences: Les Ulis, France, 2018; Volume 179, p. 03006. [Google Scholar]
  6. Stulberg, A.N. Managing the unmanned revolution in the US Air Force. Orbis 2007, 51, 251–265. [Google Scholar] [CrossRef]
  7. Wojciechowski, P.; Wojtowicz, K.; Błaszczyk, J. AI-driven method for UAV Pilot Training Process Optimization in a Virtual Environment. In Proceedings of the 2023 IEEE International Workshop on Technologies for Defense and Security (TechDefense), Rome, Italy, 20–22 November 2023; IEEE: New York, NY, USA, 2023; pp. 240–244. [Google Scholar]
  8. Gopher, D.; Donchin, E. Workload: An examination of the concept. In Handbook of Perception and Human Performance, Vol. 2. Cognitive Processes and Performance; John Wiley & Sons: Hoboken, NJ, USA, 1986. [Google Scholar]
  9. Sweller, J. Cognitive load during problem solving: Effects on learning. Cogn. Sci. 1988, 12, 257–285. [Google Scholar] [CrossRef]
  10. Mattys, S.L.; Wiget, L. Effects of cognitive load on speech recognition. J. Mem. Lang. 2011, 65, 145–160. [Google Scholar] [CrossRef]
  11. Le Cunff, A.L.; Giampietro, V.; Dommett, E. Neurodiversity and cognitive load in online learning: A systematic review with narrative synthesis. Educ. Res. Rev. 2024, 43, 100604. [Google Scholar] [CrossRef]
  12. Cummings, M.; Huang, L.; Zhu, H.; Finkelstein, D.; Wei, R. The impact of increasing autonomy on training requirements in a UAV supervisory control task. J. Cogn. Eng. Decis. Mak. 2019, 13, 295–309. [Google Scholar] [CrossRef]
  13. Luo, S.; Zhang, C.; Zhu, W.; Chen, H.; Yuan, J.; Li, Q.; Wang, T.; Jiang, C. Noncontact perception for assessing pilot mental workload during the approach and landing under various weather conditions. Signal Image Video Process. 2025, 19, 98. [Google Scholar] [CrossRef]
  14. Nguyen, T.; Lim, C.P.; Nguyen, N.D.; Gordon-Brown, L.; Nahavandi, S. A review of situation awareness assessment approaches in aviation environments. IEEE Syst. J. 2019, 13, 3590–3603. [Google Scholar] [CrossRef]
  15. Buchner, J.; Buntins, K.; Kerres, M. The impact of augmented reality on cognitive load and performance: A systematic review. J. Comput. Assist. Learn. 2022, 38, 285–303. [Google Scholar] [CrossRef]
  16. Heitmann, S.; Grund, A.; Fries, S.; Berthold, K.; Roelle, J. The quizzing effect depends on hope of success and can be optimized by cognitive load-based adaptation. Learn. Instr. 2022, 77, 101526. [Google Scholar] [CrossRef]
  17. Tugtekin, U.; Odabasi, H.F. Do interactive learning environments have an effect on learning outcomes, cognitive load and metacognitive judgments? Educ. Inf. Technol. 2022, 27, 7019–7058. [Google Scholar] [CrossRef]
  18. Kapustina, L.; Izakova, N.; Makovkina, E.; Khmelkov, M. The global drone market: Main development trends. In SHS Web of Conferences, Proceedings of the 21st International Scientific Conference Globalization and its Socio-Economic Consequences 2021, Online, 13–14 October 2021; EDP Sciences: Les Ulis, France, 2021; Volume 129, p. 11004. [Google Scholar]
  19. Borghini, G.; Astolfi, L.; Vecchiato, G.; Mattia, D.; Babiloni, F. Measuring neurophysiological signals in aircraft pilots and car drivers for the assessment of mental workload, fatigue and drowsiness. Neurosci. Biobehav. Rev. 2014, 44, 58–75. [Google Scholar] [CrossRef]
  20. Rastgoftar, H. Safe Human-UAS Collaboration Abstraction. arXiv 2024, arXiv:2402.05277. [Google Scholar] [CrossRef]
  21. Williams, K.W.; Mofle, T.C.; Hu, P.T. UAS Air Carrier Operations Survey: Training Requirements; Technical Report; U.S. Department of Transportation: Washington, DC, USA, 2023. [Google Scholar]
  22. European Commission. Commission Implementing Regulation (EU) 2019/945 of 12 March 2019 on the Requirements for the Design and Manufacture of Unmanned Aircraft Systems Intended to be Operated Under the Rules and Conditions Laid Down in Regulation (EU) 2019/947. 2019. Official Journal of the European Union, L 152, 11.6.2019, pp. 1–40. Available online: https://regulatorylibrary.caa.co.uk/2019-945-pdf/PDF.pdf (accessed on 23 October 2025).
  23. Lin, J.; Matthews, G.; Wohleber, R.W.; Funke, G.J.; Calhoun, G.L.; Ruff, H.A.; Szalma, J.; Chiu, P. Overload and automation-dependence in a multi-UAS simulation: Task demand and individual difference factors. J. Exp. Psychol. Appl. 2020, 26, 218. [Google Scholar] [CrossRef]
  24. McCarley, J.S.; Wickens, C.D. Human Factors Concerns in UAV Flight. Institute of Aviation, Aviation Human Factors Division, University of Illinois at Urbana-Champaign: Urbana-Champaign, IL, USA, 2004; Available online: https://www.researchgate.net/profile/Christopher-Wickens/publication/241595724_HUMAN_FACTORS_CONCERNS_IN_UAV_FLIGHT/links/00b7d53b850921e188000000/HUMAN-FACTORS-CONCERNS-IN-UAV-FLIGHT.pdf (accessed on 23 October 2025).
  25. Chen, J.Y.; Durlach, P.J.; Sloan, J.A.; Bowens, L.D. Human–robot interaction in the context of simulated route reconnaissance missions. Mil. Psychol. 2008, 20, 135–149. [Google Scholar] [CrossRef]
  26. Planke, L.J.; Lim, Y.; Gardi, A.; Sabatini, R.; Kistan, T.; Ezer, N. A cyber-physical-human system for one-to-many uas operations: Cognitive load analysis. Sensors 2020, 20, 5467. [Google Scholar] [CrossRef] [PubMed]
  27. Marinescu, A.; Sharples, S.; Ritchie, A.; López, T.S.; McDowell, M.; Morvan, H. Exploring the relationship between mental workload, variation in performance and physiological parameters. IFAC-PapersOnLine 2016, 49, 591–596. [Google Scholar] [CrossRef]
  28. Alreshidi, I.; Moulitsas, I.; Jenkins, K.W. Advancing aviation safety through machine learning and psychophysiological data: A systematic review. IEEE Access 2024, 12, 5132–5150. [Google Scholar] [CrossRef]
  29. Zhang, C.; Yuan, J.; Jiao, Y.; Liu, H.; Fu, L.; Jiang, C.; Wen, C. Variation of Pilots’ Mental Workload Under Emergency Flight Conditions Induced by Different Equipment Failures: A Flight Simulator Study. Transp. Res. Rec. 2024, 2678, 365–377. [Google Scholar] [CrossRef]
  30. Paas, F.; Tuovinen, J.E.; Tabbers, H.; Van Gerven, P.W. Cognitive load measurement as a means to advance cognitive load theory. In Cognitive Load Theory; Routledge: London, UK, 2016; pp. 63–71. [Google Scholar]
  31. Momeni, N.; Dell’Agnola, F.; Arza, A.; Atienza, D. Real-time cognitive workload monitoring based on machine learning using physiological signals in rescue missions. In Proceedings of the 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Berlin, Germany, 23–27 July 2019; IEEE: New York, NY, USA, 2019; pp. 3779–3785. [Google Scholar]
  32. Fernandez Rojas, R.; Debie, E.; Fidock, J.; Barlow, M.; Kasmarik, K.; Anavatti, S.; Garratt, M.; Abbass, H. Electroencephalographic workload indicators during teleoperation of an unmanned aerial vehicle shepherding a swarm of unmanned ground vehicles in contested environments. Front. Neurosci. 2020, 14, 40. [Google Scholar] [CrossRef]
  33. Haapalainen, E.; Kim, S.; Forlizzi, J.F.; Dey, A.K. Psycho-physiological measures for assessing cognitive load. In Proceedings of the 12th ACM International Conference on Ubiquitous Computing, Copenhagen, Denmark, 26–29 September 2010; pp. 301–310. [Google Scholar]
  34. Wang, P.; Houghton, R.; Majumdar, A. Detecting and Predicting Pilot Mental Workload Using Heart Rate Variability: A Systematic Review. Sensors 2024, 24, 3723. [Google Scholar] [CrossRef]
  35. Hamann, A.; Carstengerdes, N. Assessing the development of mental fatigue during simulated flights with concurrent EEG-fNIRS measurement. Sci. Rep. 2023, 13, 4738. [Google Scholar] [CrossRef] [PubMed]
  36. Borghini, G.; Aricò, P.; Di Flumeri, G.; Ronca, V.; Giorgi, A.; Sciaraffa, N.; Conca, C.; Stefani, S.; Verde, P.; Landolfi, A.; et al. Air Force Pilot Expertise Assessment during Unusual Attitude Recovery Flight. Safety 2022, 8, 38. [Google Scholar] [CrossRef]
  37. Jorna, P.G. Spectral analysis of heart rate and psychological state: A review of its validity as a workload index. Biol. Psychol. 1992, 34, 237–257. [Google Scholar] [CrossRef]
  38. Dell’Agnola, F.; Jao, P.K.; Arza, A.; Chavarriaga, R.; Millán, J.d.R.; Floreano, D.; Atienza, D. Machine-learning based monitoring of cognitive workload in rescue missions with drones. IEEE J. Biomed. Health Inform. 2022, 26, 4751–4762. [Google Scholar] [CrossRef]
  39. Debie, E.; Rojas, R.F.; Fidock, J.; Barlow, M.; Kasmarik, K.; Anavatti, S.; Garratt, M.; Abbass, H.A. Multimodal fusion for objective assessment of cognitive workload: A review. IEEE Trans. Cybern. 2019, 51, 1542–1555. [Google Scholar] [CrossRef] [PubMed]
  40. Charles, R.L.; Nixon, J. Measuring mental workload using physiological measures: A systematic review. Appl. Ergon. 2019, 74, 221–232. [Google Scholar] [CrossRef]
  41. Cheng, L.; Yu, T. A new generation of AI: A review and perspective on machine learning technologies applied to smart energy and electric power systems. Int. J. Energy Res. 2019, 43, 1928–1973. [Google Scholar] [CrossRef]
  42. Wang, S.; Gwizdka, J.; Chaovalitwongse, W.A. Using wireless EEG signals to assess memory workload in the n-back task. IEEE Trans. Hum.-Mach. Syst. 2015, 46, 424–435. [Google Scholar] [CrossRef]
  43. Ma, H.L.; Sun, Y.; Chung, S.H.; Chan, H.K. Tackling uncertainties in aircraft maintenance routing: A review of emerging technologies. Transp. Res. Part E Logist. Transp. Rev. 2022, 164, 102805. [Google Scholar] [CrossRef]
  44. Sandström, V.; Luotsinen, L.; Oskarsson, D. Fighter pilot behavior cloning. In Proceedings of the 2022 International Conference on Unmanned Aircraft Systems (ICUAS), Dubrovnik, Croatia, 21–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 686–695. [Google Scholar]
  45. Georgila, K.; Core, M.G.; Nye, B.D.; Karumbaiah, S.; Auerbach, D.; Ram, M. Using reinforcement learning to optimize the policies of an intelligent tutoring system for interpersonal skills training. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, Montreal, QC, Canada, 13–17 May 2019; pp. 737–745. [Google Scholar]
  46. Wang, F. Reinforcement learning in a pomdp based intelligent tutoring system for optimizing teaching strategies. Int. J. Inf. Educ. Technol. 2018, 8, 553–558. [Google Scholar] [CrossRef]
  47. Madeira, T.; Melício, R.; Valério, D.; Santos, L. Machine learning and natural language processing for prediction of human factors in aviation incident reports. Aerospace 2021, 8, 47. [Google Scholar] [CrossRef]
  48. Guevarra, M.; Das, S.; Wayllace, C.; Epp, C.D.; Taylor, M.; Tay, A. Augmenting flight training with AI to efficiently train pilots. Aaai Conf. Artif. Intell. 2023, 37, 16437–16439. [Google Scholar] [CrossRef]
  49. Jahanpour, E.S.; Berberian, B.; Imbert, J.P.; Roy, R.N. Cognitive fatigue assessment in operational settings: A review and UAS implications. IFAC-PapersOnLine 2020, 53, 330–337. [Google Scholar] [CrossRef]
  50. Peißl, S.; Wickens, C.D.; Baruah, R. Eye-tracking measures in aviation: A selective literature review. Int. J. Aerosp. Psychol. 2018, 28, 98–112. [Google Scholar] [CrossRef]
  51. Masi, G.; Amprimo, G.; Ferraris, C.; Priano, L. Stress and workload assessment in aviation—A narrative review. Sensors 2023, 23, 3556. [Google Scholar] [CrossRef]
  52. Suárez, M.Z.; Valdés, R.M.A.; Moreno, F.P.; Jurado, R.D.A.; de Frutos, P.M.L.; Comendador, V.F.G. Understanding the research on air traffic controller workload and its implications for safety: A science mapping-based analysis. Saf. Sci. 2024, 176, 106545. [Google Scholar] [CrossRef]
  53. Shaker, M.H.; Al-Alawi, A.I. Application of big data and artificial intelligence in pilot training: A systematic literature review. In Proceedings of the 2023 International Conference on Cyber Management and Engineering (CyMaEn), Bangkok, Thailand, 26–27 January 2023; IEEE: New York, NY, USA, 2023; pp. 205–209. [Google Scholar]
  54. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71. [Google Scholar] [CrossRef]
  55. Arico, P.; Borghini, G.; Di Flumeri, G.; Colosimo, A.; Graziani, I.; Imbert, J.P.; Granger, G.; Benhacene, R.; Terenzi, M.; Pozzi, S.; et al. Reliability over time of EEG-based mental workload evaluation during Air Traffic Management (ATM) tasks. In Proceedings of the 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Milan, Italy, 25–29 August 2015; IEEE: New York, NY, USA, 2015; pp. 7242–7245. [Google Scholar]
  56. Wang, Y.; Zhang, C.; Liu, C.; Liu, K.; Xu, F.; Yuan, J.; Jiang, C.; Liu, C.; Cao, W. Analysis on pulse rate variability for pilot workload assessment based on wearable sensor. Hum. Factors Ergon. Manuf. Serv. Ind. 2024, 34, 635–648. [Google Scholar] [CrossRef]
  57. Shao, Q.; Jiang, K.; Li, R. A numerical evaluation of real-time workloads for ramp controller through optimization of multi-type feature combinations derived from eye tracker, respiratory, and fatigue patterns. PLoS ONE 2024, 19, e0313565. [Google Scholar] [CrossRef] [PubMed]
  58. Qin, H.; Zhou, X.; Ou, X.; Liu, Y.; Xue, C. Detection of mental fatigue state using heart rate variability and eye metrics during simulated flight. Hum. Factors Ergon. Manuf. Serv. Ind. 2021, 31, 637–651. [Google Scholar] [CrossRef]
  59. Sakib, M.N.; Chaspari, T.; Behzadan, A.H. Physiological data models to understand the effectiveness of drone operation training in immersive virtual reality. J. Comput. Civ. Eng. 2021, 35, 04020053. [Google Scholar] [CrossRef]
  60. Massé, E.; Bartheye, O.; Fabre, L. Classification of electrophysiological signatures with explainable artificial intelligence: The case of alarm detection in flight simulator. Front. Neuroinform. 2022, 16, 904301. [Google Scholar] [CrossRef]
  61. Yiu, C.Y.; Ng, K.K.; Li, X.; Zhang, X.; Li, Q.; Lam, H.S.; Chong, M.H. Towards safe and collaborative aerodrome operations: Assessing shared situational awareness for adverse weather detection with EEG-enabled Bayesian neural networks. Adv. Eng. Inform. 2022, 53, 101698. [Google Scholar] [CrossRef]
  62. Jiang, G.; Chen, H.; Wang, C.; Xue, P. Mental workload artificial intelligence assessment of pilots’ EEG based on multi-dimensional data fusion and LSTM with attention mechanism model. Int. J. Pattern Recognit. Artif. Intell. 2022, 36, 2259035. [Google Scholar] [CrossRef]
  63. Nittala, S.K.; Elkin, C.P.; Kiker, J.M.; Meyer, R.; Curro, J.; Reiter, A.K.; Xu, K.S.; Devabhaktuni, V.K. Pilot skill level and workload prediction for sliding-scale autonomy. In Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 17–20 December 2018; IEEE: New York, NY, USA, 2018; pp. 1166–1173. [Google Scholar]
  64. Xi, P.; Law, A.; Goubran, R.; Shu, C. Pilot workload prediction from ECG using deep convolutional neural networks. In Proceedings of the 2019 IEEE International Symposium on Medical Measurements and Applications (MeMeA), Istanbul, Turkey, 26–28 June 2019; IEEE: New York, NY, USA, 2019; pp. 1–6. [Google Scholar]
  65. Qiu, J.; Han, T. Integrated Model for Workload Assessment Based on Multiple Physiological Parameters Measurement. In Engineering Psychology and Cognitive Ergonomics, Proceedings of the 13th International Conference, EPCE 2016, Held as Part of HCI International 2016, Toronto, ON, Canada, 17–22 July 2016, Proceedings 13; Springer: Berlin/Heidelberg, Germany, 2016; pp. 19–28. [Google Scholar]
  66. Laskowski, J.; Pytka, J.; Laskowska, A.; Tomilo, P.; Skowron, Ł.; Kozlowski, E.; Piatek, R.; Mamcarz, P. AI-Based Method of Air Traffic Controller Workload Assessment. In Proceedings of the 2024 11th International Workshop on Metrology for AeroSpace (MetroAeroSpace), Lublin, Poland, 3–5 June 2024; IEEE: New York, NY, USA, 2024; pp. 46–51. [Google Scholar]
  67. Jao, P.K.; Chavarriaga, R.; Dell’Agnola, F.; Arza, A.; Atienza, D.; Millán, J.d.R. EEG correlates of difficulty levels in dynamical transitions of simulated flying and mapping tasks. IEEE Trans. Hum.-Mach. Syst. 2020, 51, 99–108. [Google Scholar] [CrossRef]
  68. Aricò, P.; Borghini, G.; Di Flumeri, G.; Colosimo, A.; Bonelli, S.; Golfetti, A.; Pozzi, S.; Imbert, J.P.; Granger, G.; Benhacene, R.; et al. Adaptive automation triggered by EEG-based mental workload index: A passive brain-computer interface application in realistic air traffic control environment. Front. Hum. Neurosci. 2016, 10, 539. [Google Scholar] [CrossRef]
  69. Zhou, W.g.; Yu, P.p.; Wu, L.h.; Cao, Y.f.; Zhou, Y.; Yuan, J.j. Pilot turning behavior cognitive load analysis in simulated flight. Front. Neurosci. 2024, 18, 1450416. [Google Scholar] [CrossRef]
  70. Zak, Y.; Parmet, Y.; Oron-Gilad, T. Subjective Workload assessment technique (SWAT) in real time: Affordable methodology to continuously assess human operators’ workload. In Proceedings of the 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Toronto, ON, Canada, 11–14 October 2020; IEEE: New York, NY, USA, 2020; pp. 2687–2694. [Google Scholar]
  71. Lochner, M.; Duenser, A.; Sarker, S. Trust and Cognitive Load in semi-automated UAV operation. In Proceedings of the 31st Australian Conference on Human-Computer-Interaction, Fremantle, Australia, 2–5 December 2019; pp. 437–441. [Google Scholar]
  72. Gianazza, D. Learning air traffic controller workload from past sector operations. In Proceedings of the ATM Seminar, 12th USA/Europe Air Traffic Management R&D Seminar, Seattle, WA, USA, 26–30 June 2017. [Google Scholar]
  73. Shao, Q.; Li, H.; Sun, Z. Air Traffic Controller Workload Detection Based on EEG Signals. Sensors 2024, 24, 5301. [Google Scholar] [CrossRef]
  74. Salvan, L.; Paul, T.S.; Marois, A. Dry EEG-based Mental Workload Prediction for Aviation. In Proceedings of the 2023 IEEE/AIAA 42nd Digital Avionics Systems Conference (DASC), Barcelona, Spain, 1–5 October 2023; IEEE: New York, NY, USA, 2023; pp. 1–8. [Google Scholar]
  75. Monfort, S.S.; Sibley, C.M.; Coyne, J.T. Using machine learning and real-time workload assessment in a high-fidelity UAV simulation environment. In Next-Generation Analyst IV; SPIE: Bellingham, WA, USA, 2016; Volume 9851, pp. 93–102. [Google Scholar]
  76. Aricò, P.; Reynal, M.; Di Flumeri, G.; Borghini, G.; Sciaraffa, N.; Imbert, J.P.; Hurter, C.; Terenzi, M.; Ferreira, A.; Pozzi, S.; et al. How neurophysiological measures can be used to enhance the evaluation of remote tower solutions. Front. Hum. Neurosci. 2019, 13, 303. [Google Scholar] [CrossRef] [PubMed]
  77. Dell’Agnola, F.; Momeni, N.; Arza, A.; Atienza, D. Cognitive workload monitoring in virtual reality based rescue missions with drones. In Virtual, Augmented and Mixed Reality. Design and Interaction, Proceedings of the 12th International Conference, VAMR 2020, Held as Part of the 22nd HCI International Conference, HCII 2020, Copenhagen, Denmark, 19–24 July 2020, Proceedings, Part I; Springer: Cham, Switzerland, 2020; pp. 397–409. [Google Scholar]
  78. Kelleher, J.D.; Mac Namee, B.; D’Arcy, A. Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies; MIT Press: Cambridge, MA, USA, 2020. [Google Scholar]
  79. Rodríguez-Fernández, V.; Menéndez, H.D.; Camacho, D. Automatic profile generation for UAV operators using a simulation-based training environment. Prog. Artif. Intell. 2016, 5, 37–46. [Google Scholar] [CrossRef]
  80. Yang, S.; Yu, K.; Lammers, T.; Chen, F. Artificial intelligence in pilot training and education–towards a machine learning aided instructor assistant for flight simulators. In HCI International 2021—Posters, Proceedings of the 23rd HCI International Conference, HCII 2021, Virtual Event, 24–29 July 2021, Proceedings, Part II; Springer: Cham, Switzerland, 2021; pp. 581–587. [Google Scholar]
  81. Yuan, J.; Ke, X.; Zhang, C.; Zhang, Q.; Jiang, C.; Cao, W. Recognition of Different Turning Behaviors of Pilots Based on Flight Simulator and fNIRS Data. IEEE Access 2024, 12, 32881–32893. [Google Scholar] [CrossRef]
  82. Caballero, W.N.; Gaw, N.; Jenkins, P.R.; Johnstone, C. Toward automated instructor pilots in legacy air force systems: Physiology-based flight difficulty classification via machine learning. Expert Syst. Appl. 2023, 231, 120711. [Google Scholar] [CrossRef]
  83. Li, Q.; Ng, K.K.; Yiu, C.Y.; Yuan, X.; So, C.K.; Ho, C.C. Securing air transportation safety through identifying pilot’s risky VFR flying behaviours: An EEG-based neurophysiological modelling using machine learning algorithms. Reliab. Eng. Syst. Saf. 2023, 238, 109449. [Google Scholar] [CrossRef]
  84. Yin, W.; An, R.; Zhao, Q. Intelligent Recognition of UAV Pilot Training Actions Based on Dynamic Bayesian Network. J. Physics Conf. Ser. 2022, 2281, 012014. [Google Scholar] [CrossRef]
  85. Paces, P.; Insaurralde, C.C. Artificially intelligent assistance for pilot performance assessment. In Proceedings of the 2021 IEEE/AIAA 40th Digital Avionics Systems Conference (DASC), San Antonio, TX, USA, 3–7 October 2021; IEEE: New York, NY, USA, 2021; pp. 1–7. [Google Scholar]
  86. Pietracupa, M.; Ben Abdessalem, H.; Frasson, C. Detection of Pre-error States in Aircraft Pilots Through Machine Learning. In Generative Intelligence and Intelligent Tutoring Systems, Proceedings of the 20th International Conference, ITS 2024, Thessaloniki, Greece, 10–13 June 2024, Proceedings, Part II; Springer: Cham, Switzerland, 2024; pp. 124–136. [Google Scholar]
  87. Binias, B.; Myszor, D.; Cyran, K.A. A machine learning approach to the detection of pilot’s reaction to unexpected events based on EEG signals. Comput. Intell. Neurosci. 2018, 2018, 2703513. [Google Scholar] [CrossRef]
  88. Alkadi, R.; Al-Ameri, S.; Shoufan, A.; Damiani, E. Identifying drone operator by deep learning and ensemble learning of imu and control data. IEEE Trans. Hum.-Mach. Syst. 2021, 51, 451–462. [Google Scholar] [CrossRef]
  89. Taheri Gorji, H.; Wilson, N.; VanBree, J.; Hoffmann, B.; Petros, T.; Tavakolian, K. Using machine learning methods and EEG to discriminate aircraft pilot cognitive workload during flight. Sci. Rep. 2023, 13, 2507. [Google Scholar] [CrossRef]
  90. Watkins, D.; Gallardo, G.; Chau, S. Pilot support system: A machine learning approach. In Proceedings of the 2018 IEEE 12th International Conference on Semantic Computing (ICSC), Laguna Hills, CA, USA, 31 January–2 February 2018; IEEE: New York, NY, USA, 2018; pp. 325–328. [Google Scholar]
  91. Jain, N.; Gambhir, A.; Pandey, M. Unmanned aerial networks—UAVs and AI. In Recent Trends in Artificial Intelligence Towards a Smart World: Applications in Industries and Sectors; Springer: Singapore, 2024; pp. 321–351. [Google Scholar]
  92. Zhu, W.; Zhang, C.; Liu, C.; Xu, F.; Jiang, C.; Cao, W. Assessment of Pilot Mental Workload Based on Physiological Signals: A Real Helicopter Cross-country Flight Study. In Proceedings of the 2023 IEEE 5th International Conference on Civil Aviation Safety and Information Technology (ICCASIT), Dali, China, 11–13 October 2023; IEEE: New York, NY, USA, 2023; pp. 638–643. [Google Scholar]
  93. Troussas, C.; Krouska, A.; Mylonas, P.; Sgouropoulou, C. Reinforcement learning-based dynamic fuzzy weight adjustment for adaptive user interfaces in educational software. Future Internet 2025, 17, 166. [Google Scholar] [CrossRef]
  94. Sun, Q.; Xue, Y.; Song, Z. Adaptive user interface generation through reinforcement learning: A data-driven approach to personalization and optimization. In Proceedings of the 2024 6th International Conference on Frontier Technologies of Information and Computer (ICFTIC), Qingdao, China, 13–15 December 2024; IEEE: New York, NY, USA, 2024; pp. 1386–1391. [Google Scholar]
  95. Amaliah, N.R.; Tjahjono, B.; Palade, V. Human-in-the-Loop XAI for Predictive Maintenance: A Systematic Review of Interactive Systems and Their Effectiveness in Maintenance Decision-Making. Electronics 2025, 14, 3384. [Google Scholar] [CrossRef]
  96. Emami, Y.; Almeida, L.; Li, K.; Ni, W.; Han, Z. Human-in-the-loop machine learning for safe and ethical autonomous vehicles: Principles, challenges, and opportunities. arXiv 2024, arXiv:2408.12548. [Google Scholar]
Figure 1. Systematic review paper selection following the PRISMA 2020 framework [54].
Figure 2. Distribution of the best-performing reported ML models in the reviewed studies for assessing CL and operator-related data.
Figure 3. General workflow of ML for assessing CL and operator-related data.
Figure 4. Taxonomy of ML-based models applied for assessing CL.
Figure 5. Distribution of operator-related data types used in ML-based UAS operator training studies.
Figure 6. Key methodological features to consider for CL assessment.
Table 1. Summary of the 28 papers on ML/AI applications in assessing cognitive load of sensor operators.
Refs. | Input of ML Model (Data) | Output of ML Model | ML Model and Results
Arico et al. [55] | Electroencephalogram (EEG) signal data (collected from 12 subjects). | High, medium, or low workload. | Step-Wise Linear Discriminant Analysis (SWLDA).
Wang et al. [56] | Photoplethysmography (PPG) signal data and NASA-TLX (collected from 21 subjects). | High, medium, or low mental workload. | K-Nearest Neighbours (KNN) achieved the best performance with 88.9% accuracy.
Shao et al. [57] | Fatigue status, eye-tracking, and respiratory data (collected from 8 subjects). | High or low workload level. | KNN showed the best performance with 98.6% accuracy in classifying workload.
Luo et al. [13] | Facial expression via FaceReader software and NASA-TLX score (collected from 21 subjects). | High, medium, or low workload. | Convolutional Neural Network (CNN) showed the best accuracy of 99.87%.
Qin et al. [58] | Visual analogue scale (subjective measure), heart rate variability (HRV), and eye-tracking data (collected from 11 subjects). | Mental fatigue (low/medium/high). | Support Vector Machine (SVM) achieved the best performance of 91.8% in determining levels of mental fatigue.
Sakib et al. [59] | NASA-TLX, CARMA for stress level, heart rate (HR), heart rate variability (HRV), electrodermal activity (EDA), skin temperature (SKT), and behavioural data (e.g., reaction time) (collected from 25 subjects). | Cognitive load, performance, and stress levels: low, medium-low, medium, medium-high, and high. | SVM showed the best performance in classifying cognitive load.
Massé et al. [60] | Subjective level of fatigue, behavioural data, and EEG signal data (collected from 24 subjects). | Attention (alarm) detected or not. | SVM performed best with an average individual accuracy of 76.4%.
Yiu et al. [61] | EEG and NASA-TLX score (collected from 30 subjects). | Classification of weather conditions: good or poor visibility. | Bayesian Neural Network (BNN) performed best in the classification task, with an accuracy of 66.5%.
Jiang et al. [62] | EEG signal, flight operation, eye-tracking, and flight record data (collected from 6 subjects). | Classification of mental workload into 5 levels: low to high. | Long Short-Term Memory (LSTM) with attention achieved an average accuracy of 94%.
Nittala et al. [63] | HR, flight control data, flight sensor data, environmental features (day, night, clear, and cloud), and NASA-TLX score (collected from 15 subjects). | Skill level (novice or expert) and workload (high or low). | SVM achieved an AUC of 0.99.
Xi et al. [64] | ECG and NASA-TLX. | Mental workload (high/medium/low). | CNN with transfer learning classified workload levels.
Qiu and Han [65] | HR, range of motion (ROM), HRV, respiration, and NASA-TLX (collected from 6 subjects). | Operator workload level. | Four features selected by PCA showed significant inverse trends with increasing workload.
Laskowski et al. [66] | Air traffic data and task characteristics. | Operator workload levels. | Delphi (expert) method with a Neural Network model achieved 98% accuracy.
Jao et al. [67] | EEG and NASA-TLX (collected from 24 subjects). | Cognitive load (high/low). | Linear regression successfully assessed cognitive load.
Aricò et al. [68] | EEG and NASA-TLX (collected from 12 subjects). | High or low cognitive load. | Classification with automatic-stop Stepwise Linear Discriminant Analysis (asSWLDA).
Zhou et al. [69] | HRV data (collected from 28 subjects). | Cognitive load (low/medium/high). | LSTM showed the best performance, with an F1-score of 0.9491 in identifying cognitive load levels.
Dell’Agnola et al. [38] | Respiration (RSP), ECG, PPG, SKT, and NASA-TLX (collected from 24 subjects). | High or low CL. | SVM with subject-specific weights achieved the best accuracy.
Momeni et al. [31] | ECG, RSP, PPG, SKT signal data, and NASA-TLX score (collected from 24 subjects). | CL (high/low). | XGBoost achieved the best accuracy of 86% on unseen data.
Planke et al. [26] | EEG, eye-tracking data, task performance data, and NASA-TLX (collected from 5 subjects). | CL (high/medium/low). | Supervised ML model.
Zak et al. [70] | Joystick movement and NASA-TLX score (collected from 6 subjects). | Operator workload (high/low). | Lasso Regression, NN, RF, and XGBoost could all assess workload.
Zhang et al. [29] | fNIRS signals using change of oxyhemoglobin as features, and NASA-TLX score (collected from 25 subjects). | Mental workload (high/medium/low). | Hierarchical SVM achieved high accuracy (88.74%).
Lochner et al. [71] | GSR signal, acceleration signal, HR, NASA-TLX score, and system trust scale score (collected from 43 subjects). | Trust state (high/low). | Decision Tree showed an accuracy of 80%.
Gianazza [72] | Radar trajectory data and sector records. | Workload (low/normal/high). | NN showed the best performance, with 81.9% accuracy in classifying workload.
Shao et al. [73] | EEG signal data and NASA-TLX score (collected from 16 subjects). | Cognitive state (rest/low/medium/high). | SVM with gamma-wave metrics effectively captured cognitive state variations.
Salvan et al. [74] | EEG and NASA-TLX score (collected from 48 subjects). | Mental workload (high/medium/low). | RF showed the best accuracy of 76% in predicting mental workload.
Monfort et al. [75] | Eye-tracking and behaviour data (such as reaction time, accuracy, and error rates) (collected from 20 subjects). | Mental workload (low/high). | Ensemble of KNN, RF, and SVM achieved 78% accuracy in predicting UAV operator workload.
Aricò et al. [76] | ECG, GSR, EEG, blink rate, reaction time, and NASA-TLX score (collected from 16 subjects). | Workload (low/high). | Automatic-stop Stepwise Linear Discriminant Analysis (asSWLDA) successfully assessed workload.
Dell’Agnola et al. [77] | ECG, RSP, SKT, PPG, EDA, and NASA-TLX score (collected from 24 subjects). | Cognitive load (2 or 3 levels). | XGBoost showed the best performance, with 80.2% accuracy in classifying 2 levels of CL.
Table 2. Summary of studies on ML assessing operator-related data for informing/enhancing training.
Refs. | Input of ML Model (Data) | Output of ML Model | ML Model and Results
Wojciechowski et al. [7] | Electroencephalography (EEG), electrocardiography (ECG), blood pressure, skin temperature, facial expression, eye-tracking (ET), and Simulator Sickness Questionnaire (SSQ). | UAV operator’s pressure level (classification of operator state). | Recurrent Neural Network (RNN) assessed operator pressure levels.
Guevarra et al. [48] | Operator operation trajectory data (i.e., action and reaction time) and operator control inputs. | Detection of operator performance errors. | Imitation learning (behavioural cloning) applied for autonomous pilot error-state detection.
Rodríguez-Fernández et al. [79] | UAV operator operation records including features of score, cooperation, aggressiveness, and initial plan complexity. | Classification of operator skill level (low, medium, high). | Unsupervised clustering (K-means, agglomerative, divisive, and partitioning around medoids) and fuzzy logic to evaluate operator training skills.
Yang et al. [80] | Flight simulator data, including aircraft state variables (e.g., attitude, altitude, airspeed) and operator control actions. | Classification of trainee performance alignment with expert “standard operation”. | ML approach provided real-time, objective, and quantitative assessment of operator performance.
Yuan et al. [81] | Functional near-infrared spectroscopy (fNIRS) signal (collected from 25 subjects). | Classification of operator behaviour (left vs. right turning). | SVM-RBF achieved 92.6% accuracy using 60% of selected features.
Caballero et al. [82] | Electromyography (EMG), forearm acceleration (ACC), EDA, ECG, respiration (RES), eye-tracking, and PPG signal data (collected from 21 subjects). | Classification of pilot task difficulty (low or high). | AdaBoost achieved the best performance with 87.14% accuracy.
Li et al. [83] | EEG signals (collected from 20 subjects). | Classification of unsafe vs. safe behaviour. | SVM-Linear achieved 80.8% accuracy.
Yin et al. [84] | Operator movement data including altitude, altitude change rate, heading angle, heading change rate, and flight speed. | Recognition of 15 classes of training actions. | Dynamic Bayesian Network successfully recognised current training actions.
Paces and Insaurralde [85] | Flight parameters including position, navigation data, and flight attitude. | Classification of safe vs. unsafe flying behaviour. | AI algorithm effectively evaluated pilot flight performance.
Pietracupa et al. [86] | EEG signal data. | Classification of pre-error vs. non-error states. | Transformer model achieved the best F1-score of 0.578.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

