Explanation of Machine-Learning Solutions in Air-Trafﬁc Management

: Advances in the trusted autonomy of air-trafﬁc management (ATM) systems are currently being pursued to cope with the predicted growth in air-trafﬁc densities in all classes of airspace. Highly automated ATM systems relying on artiﬁcial intelligence (AI) algorithms for anomaly de-tection, pattern identiﬁcation, accurate inference, and optimal conﬂict resolution are technically feasible and demonstrably able to take on a wide variety of tasks currently accomplished by humans. However, the opaqueness and inexplicability of most intelligent algorithms restrict the usability of such technology. Consequently, AI-based ATM decision-support systems (DSS) are foreseen to integrate eXplainable AI (XAI) in order to increase interpretability and transparency of the system reasoning and, consequently, build the human operators’ trust in these systems. This research presents a viable solution to implement XAI in ATM DSS, providing explanations that can be appraised and analysed by the human air-trafﬁc control operator (ATCO). The maturity of XAI approaches and their application in ATM operational risk prediction is investigated in this paper, which can support both existing ATM advisory services in uncontrolled airspace (Classes E and F) and also drive the inﬂation of avoidance volumes in emerging performance-driven autonomy concepts. In particular, aviation occurrences and meteorological databases are exploited to train a machine learning (ML)-based risk-prediction tool capable of real-time situation analysis and operational risk monitoring. The proposed approach is based on the XGBoost library, which is a gradient-boost decision tree algorithm for which post-hoc explanations are produced by SHapley Additive exPlanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME). Results are presented and discussed, and considerations are made on the most promising strategies for evolving the human–machine interactions (HMI) to strengthen the mutual trust between ATCO and systems. The presented approach is not limited only to conventional applications but also suitable for UAS-trafﬁc management (UTM) and other emerging applications.


Introduction
In traditional air-traffic management (ATM), artificial intelligence (AI) has been used to enhance air traffic and airspace operational efficiency for decades [1][2][3][4]. A significant contemporary challenge is the integration of the unmanned aircraft system (UAS) into both current and future ATM contexts [4], as the current ATM systems are unable to cope with the envisaged increase in air traffic densities, especially in low-level operations and in urban environments. Within the same unit of airspace volume considered, the magnitude of information that UAS-traffic mangement (UTM) needs to exchange, process, and track is much higher than that of conventional ATM. If not properly addressed, these conditions will compromise the safety of airspace operations whenever unforeseen perturbations introduce major deviations from the nominal flight plans [5]. Thus, computationally efficient optimisation algorithms are needed to deal with the increasing amounts of exchanged and processed data, constraints, and objectives characterising future ATM/UTM paradigms.
In this context, future ATM and UTM decision-support systems (DSS) are expected to significantly rely on artificial intelligence (AI), which allows moving away from the limited flexibility of algorithmic logics typically found in declarative automation [4]. Increases in automation complexity and in the amount of processed information are eliciting further research in human-machine interactions (HMI) towards improving human-machine coordination and teaming [6]. ATM DSS have progressively evolved to enhance the support provided to the ATM operator's decision-making process [7][8][9]. Since each task has unique objectives and constraints, the ATM system needs different types of analytics and reasoning techniques for different tasks. Nonetheless, utilising complex AI inference with characterised by a black-box behaviour could lead to a lack of transparency and a consequent loss of the operator's trust and situation awareness. The black-box model only presents the final solution to the operator without showing the rationale behind [10], which challenges human ability to verify and understand the suggested solution. However, this verification is crucial, specifically for applications where systems need to be traceable for safety-critical operations [11]. Therefore, emerging ATM/UTM applications of AI is expected to elicit an evolution of design and certification methodologies to ensure the ongoing acceptance and trust by end-users. The research community is developing the optimal formats to present an explanation in a manner that the machine will be more likely trusted by human operator to the appropriate level, minimising both unwanted overtrust and distrust. State-of-the-art eXplainable AI (XAI) approaches are also being evaluated in terms of their support to integrated human-autonomy decision making, enhancing understandability and transparency [12]. However, the maturity of XAI algorithms and their presentation are currently limited in aerospace applications [13,14]. This paper focuses on the development of a viable solution to introduce XAI in the ATM DSS based on machine-learning (ML) inference.
With the current popularity of ML applications, various commonly used algorithms have been made available as open-source libraries. This allows researchers to rapidly integrate all the components from freely available libraries, thereby simplifying the prototyping of software and reducing barriers to access [15,16]. The paper reviews and investigates the most suitable and efficient open-source ML model that not only offers high prediction accuracy but also can be comprehended by the operator. The study investigates the feasibility and maturity of the proposed interpretable AI framework and focuses on the development of a ATM DSS functionality based on open-source algorithms and datasets. In particular, aviation occurrences and meteorological databases, which are both publicly available, are exploited to support the design of a hypothetical risk-prediction tool, which could support both present day ATM-as in the case of air-traffic advisory service in uncontrolled airspace-as well as future, performance-based autonomy concepts-for instance, by driving the inflation of avoidance volumes. The verification case study presented in this paper is restricted to conventional air-traffic data due to limited availability of UAS traffic data; however, the framework can be applied directly to UTM and urban air mobility (UAM) applications as soon as mature datasets become available.
The remaining parts of this section present a review of predictive models, XAI, and ATM HMI evolutions, whereas Section 2 introduces the theory underpinning the selected AI-prediction and explanation models. Section 3 describes the proposed implementation, datasets, model-training process, and performance metrics. The performance of the adopted AI inference engine is presented and discussed in Section 4. The explanation results and the proposed ATM HMI concept are presented in Sections 5 and 6, respectively.

Predictive Methods
Traditional predictive methods are divided into two types: deductive methods based on governing theory (e.g., physics, mathematics) and inductive methods based on data analytics. Deductive methods strictly require the availability of theoretical models. Moreover, high degrees of realism and/or long forecast periods require high computing powers and/or time often beyond the available resources [17]. On the other hand, inductive methods include both statistical methods and ML methods. Statistical methods are de-rived strictly according to mathematical methods, meaning that the overall phenomenon is modelled by characterising the correlation between estimators and variables [18]. For instance, the time-series model, one of the most commonly used quantitative methods, only captures the patterns and temporal evolution based on the data without considering the causal structure of the occurrence [19]. As an example, Jaimyoung [20] proposed a linear regression model to predict the highway driving time from a few minutes to an hour in the future. These statistical models are mainly used for reasoning and estimation. Some assumptions and restrictions need to be introduced before model development, such as fixed algorithms and features that need to be manually identified and extracted.
Contrastingly, the structure of the ML algorithm is flexible, as it needs to assume that the input data can be very limited. Hence, it is very suitable to deal with complex and highly nonlinear relationships [21]. Ensemble learning is one of the most popular and promising ML algorithms. Its fusion methods include boosting, bagging, and stacking [22]. Ensemble learning not only explains the interaction between input variables and predictive models but also identifies the relative importance of the key factors. Gradient-Boosted Decision Tree (GBDT) is an integrated learning method based on the Decision Tree (DT). Unlike a traditional DT, the advanced modification enables GBDT to iteratively train a series of weak learners and strategically generate the best tree set to improve prediction accuracy [23]. Ramon [24] adopted the GBDT model to predict the take-off time, which provided more accurate prediction results than the Enhanced Tactical Flow Management System (ETFMS). The open-source model XGBoost, developed based on the GBDT framework, is adopted in this paper. This library has many advantages in processing nonlinear data and can extract features from variables containing noise and redundant information [25]. As an example, Gu et al. [26] proposed a lane-changing decision system for autonomous vehicles based on the XGBoost model. By making full use of large-scale dataset training, the XGBoost model has much better performance and practicability than conventional lane-changing systems. Global analysis of the features' importance can explain the results of the XGBoost model to a certain extent.

Explainable AI (XAI)
Most ML models lack the explainability of the algorithm itself and the output results, which then challenges the operator to understand and accept such a solution. Henceforth, post-hoc explanations models are considered in this research to enhance the explainability of the results. According to Arrieta et al. [13], the XAI method is systematically analysed, and its importance is divided into two-step classification, as shown in Figure 1.
The first classification is between interpretable models and post-hoc explainability XAI approaches and is based on the distinction between models that are inherently transparent and models for which a post-hoc explaination can be constructed. In particular, when the ML model does not inherently offer a transparent explanation, a particular method is needed to explain its decision so an additional classification can be encountered. Model-agnostic post-hoc explanation methods can be applied to any ML models, whereas model-specific methods are developed only for a specific ML model and will not work in conjunction with other techniques. Generally, black-box AI models will require a post-hoc explanation method. There are two types of explanation: global and local. The local explanation divides the complex model into simple individuals while also analysing the individuals' relationship. The global explanation offers an overall understanding of the model, aiming to make the entire decision-making process completely transparent and comprehensive. Usually, it is necessary to utilise these two explanation methods simultaneously to satisfactorily explain the ML algorithm, which can effectively reduce the potential skew and uncertainty affecting a single method. The first classification is between interpretable models and post-hoc explainability XAI approaches and is based on the distinction between models that are inherently transparent and models for which a post-hoc explaination can be constructed. In particular, when the ML model does not inherently offer a transparent explanation, a particular method is needed to explain its decision so an additional classification can be encountered. Model-agnostic post-hoc explanation methods can be applied to any ML models, whereas model-specific methods are developed only for a specific ML model and will not work in conjunction with other techniques. Generally, black-box AI models will require a post-hoc explanation method. There are two types of explanation: global and local. The local explanation divides the complex model into simple individuals while also analysing the individuals' relationship. The global explanation offers an overall understanding of the model, aiming to make the entire decision-making process completely transparent and comprehensive. Usually, it is necessary to utilise these two explanation methods simultaneously to satisfactorily explain the ML algorithm, which can effectively reduce the potential skew and uncertainty affecting a single method.
Due to the opacity of the black-box model, the post-hoc explanation model is used to interpret the prediction model and its results. However, a single explanation method limits and biases the interpretation and understanding of the black-box model. Therefore, two different explanation models are used to simultaneously explain a black-box algorithm and prediction results, thereby enhancing the interpretability of XGBoost incident and accident prediction. The mature and freely available post-hoc explanation models considered in this study are SHapley Additive exPlanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME). SHAP is a visual local explanation method for tree models. While assigning the weight and value to each feature, the local explanation is extended to capture the features' interaction directly, and a large number of local explanations are used to understand the global structure. Besides, its built-in visual components can intuitively display the influence of complex variables [27]. The LIME model, on the other hand, is completely independent from the prediction model itself and only attempts to explain its results. In other words, LIME can explain any blackbox prediction without inspecting the model. These two XAI methods are supported by XGBoost that aims to further explain the prediction results in more detail and highlight the driven factors of the results. Due to the opacity of the black-box model, the post-hoc explanation model is used to interpret the prediction model and its results. However, a single explanation method limits and biases the interpretation and understanding of the black-box model. Therefore, two different explanation models are used to simultaneously explain a black-box algorithm and prediction results, thereby enhancing the interpretability of XGBoost incident and accident prediction. The mature and freely available post-hoc explanation models considered in this study are SHapley Additive exPlanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME). SHAP is a visual local explanation method for tree models. While assigning the weight and value to each feature, the local explanation is extended to capture the features' interaction directly, and a large number of local explanations are used to understand the global structure. Besides, its built-in visual components can intuitively display the influence of complex variables [27]. The LIME model, on the other hand, is completely independent from the prediction model itself and only attempts to explain its results. In other words, LIME can explain any black-box prediction without inspecting the model. These two XAI methods are supported by XGBoost that aims to further explain the prediction results in more detail and highlight the driven factors of the results.

Human-Machine Interactions
Current developments of the ATM system and interface technology are aimed at accommodating denser traffic. A number of HMI innovations have been proposed for ATM implementation supported by system evolutions [28]. The two common design streams are visualisation [29][30][31] and control function improvements in DSS [7][8][9]. With the increasing AI exploitation in decision-support tools, the air-traffic controller (ATCo)/airtraffic control operator (ATCO)'s concerns when working with highly automated systems have not diminished but instead have been exacerbated [32]. Experienced human operators tend to be reluctant to adopt suggested solutions from highly autonomous DSS if these are not trustworthy, traceable, and interpretable, especially in very complex situations [14]. Hence, DSS are required to adopt XAI to increase the understandability and trust of human operators. One emerging concept that aims to enhance the cognitive states of the human operator in real time during the complex and time-critical operations with a high level of automation is Cognitive Human-Machine Interfaces and Interactions (CHMI 2 ) [33,34]. A human cognitive state monitoring/enhancement by machine prevents cognitive overload and human oversight when increasing the level of autonomy in decision-support systems. The CHMI 2 framework is comprised of three main modules: sensing, inference, and adaptation. Similar to other proposed HMI concepts in the literature, the CHMI 2 sensing module also uses advanced neurophysiological sensors, which are, however, used to collect the responses in real time. The collected and pre-filtered data are passed through the inference module to estimate the cognitive state of the operator. This inferred cognitive state is the key driver to dynamically adapt the HMI formats/functions and automation behaviour [35].

Prediction and Explanation Models
This section introduces the theory underpinning the predictive AI algorithm adopted in this study and subsequently the post-hoc explanation methods that were implemented. The first part introduces the rationale and equations of the XGBoost predictive model to describe its operating principle. The second part explains the operating principles and equations of two different interpretation models, which are LIME and SHAP.

XGBoost
XGBoost is a library developed from the GBDT algorithm to combine multiple weak learners through the boosting method [36]. The basic algorithm is based on the Classification And Regression Trees (CART), which has high performance in both interpretability and transparency [37]. XGBoost has seven main hyper-parameters that can be adjusted to improve the algorithm's progress and robustness and also to reduce overfitting. The set of hyperparameters include:

•
Learning rate: in order to prevent overfitting, a shrink step is used in the update process. Each time the weight of a leaf node is updated, the learning-rate coefficient is multiplied to avoid excessive step sizes. A small learning rate can improve the robustness of the model; • Maximum depth: this value is the maximum depth of branching of the tree, which is also used to avoid overfitting. The larger the value, the easier it is for the model to learn more specific and local samples; • Minimum child weight: this parameter represents the minimum sample weight required to generate a child node. When the sum of the weights of all samples on the leaf node is less than the set value, the construction process will stop splitting. This parameter is used to avoid over-fitting. When the value is large, it can prevent the model from learning local anomalies in the training data; • Maximum number of iterations: this is the maximum number of trees generated and also the maximum number of iterations. The higher the number of trees, the better the performance, and more computing time is required; • Lambda regularisation: this is used to control L2 regularity. It is the coefficient in front of the score of the leaf node in the objective function; • Alpha regularisation: This is used to control L1 regularity. It also speeds up the algorithms in very high dimensions; • Gamma value: in order to further split the leaf nodes of the tree, the minimum loss reduction must be set. In other words, the split is determined by observing whether the loss has decreased; • Sub-sample: this parameter controls the proportion of random sampling for each tree; too large or too small a value will lead to over-fitting and under-fitting of the model.
Equation (1) illustrates the basic CART formula, where K is the number of trees, F represents all possible CART trees, and x i represents a specific CART tree.
The training of the model is essentially additive training. Figure 2 illustrates an example of a CART prediction model. The algorithm also classifies the missing values of model.
Equation (1) illustrates the basic CART formula, where is the number of trees, represents all possible CART trees, and represents a specific CART tree.
The training of the model is essentially additive training. Figure 2 illustrates an example of a CART prediction model. The algorithm also classifies the missing values of each node. The number in the leaf is the value of . The detailed mathematical formulation of the XGboost algortithm is presented in Appendix A.

Post-Hoc Explanation Model
The global explanations of prediction results from XGBoost can be enhanced by adopting two post-hoc local explanation models: SHAP and LIME. These two models are supported by XGBoost with the aim of explaining the prediction results in more detail and highlighting the factors driving the results. Table 1 summarises the differences between LIME and SHAP. Table 1. The features and differences between LIME and SHAP.

Features
Local Explaination Method; Does not interface with the algorithm inside the black box; Independently generates new samples based on each feature.

Local Explaination Method;
Based on game theory; Calculates the importance of additive features for each specific prediction.

Post-Hoc Explanation Model
The global explanations of prediction results from XGBoost can be enhanced by adopting two post-hoc local explanation models: SHAP and LIME. These two models are supported by XGBoost with the aim of explaining the prediction results in more detail and highlighting the factors driving the results. Table 1 summarises the differences between LIME and SHAP. Table 1. The features and differences between LIME and SHAP.

Features
Local Explaination Method; Does not interface with the algorithm inside the black box; Independently generates new samples based on each feature.
Local Explaination Method; Based on game theory; Calculates the importance of additive features for each specific prediction.
Advantage Model-Agnositc method; even if the prediction model changes, LIME is able to make the local explanation.
Supports multiple explain plots; Allows comparative studies between features; Can quickly implement and explain tree-based models.

Disadvantage
The explained results are not stable enough, and different interpretation models will produce different results.
Long calculation time, slower interpretation production speed.

LIME
The LIME method is used to explain the prediction of any classifier or regression model [38]. It does not interface with the algorithm inside the black box but randomly generates input data to collect aggregate outputs. The feature importance is obtained by comparing the prediction results between the artificially perturbed data and original example, thereby generating an explanation for each single prediction of any ML models. Out of the various possible explanations, LIME selects the explanation ξ(x) to be presented to the human operator based on the following minimisation problem [38]: where f represents the black-box model that needs to be explained; g and G are, respectively, the explanation model and the collection of potentially interpretable models; π x is the distance between data x in the new dataset and the original data instance; L defines the model fidelity, which is a measure of how unfaithful the explanation is in approximating the original model around π x ; and Ω is the complexity of the model, which shall be low enough to be interpretable by the human operator.

SHAP
SHAP is an alternative method for local ML explanation. It is a method based on the concept of game theory to calculate the importance of additive features for each specific prediction [39]. Similar to the feature correlation explanation, the method describes the function of the opaque model by ranking or measuring the impact, relevance, or importance of each feature in the prediction output. Lundberg et al. [27] developed TreeExplainer based on SHAP, which is a visual local explanation method. While assigning the weight and value of each feature, the local explanation is extended to capture the interaction of the features directly, and a large number of local explanations are used to understand the global structure. It calculates the Shapley value or attribution value of each feature and then measures the impact of the feature on the final output value [40,41]: where g represents the explanation model; M is the number of input features; z represents whether the feature exists; and φ is the attribution value of each feature (Shapley value). The output S of the decision tree conditioned on the feature subset is defined as f x (S). The SHAP value is based on game theory to average all possible conditional expectations. For a certain feature j, it is necessary to calculate the Shapley value for all possible feature combinations (including different orders) and then weighted summation: where S is a subset of the features used in the model; x is the vector of feature values of the sample to be explained; and p is the number of features. The working principle is to average the marginal contribution of all sequences. In other words, by entering all the sequences, the marginal revenue generated by each feature is obtained, and then all the gains are averaged, and finally, the Shapley value is calculated for each feature.
The SHAP methodology supports various types of explanation plots. In this paper, two plots are chosen to analyse and explain the results of model predictions: summary plot and dependence plot. The models fulfil the gap that other models cannot provide, which is to present how each feature matters and its effect distribution. It is vital for operator decision making when further reasoning is required. The summary plot shows the potential interactions between various features, allowing the global features to be interpreted through this local explanation. The dependence plot also explains the contribution of a single variable to the predicted value and analyses individual features in more detail.

Model Implementation and Verification Methodology
This section describes the AI explanation model implementation as well as the verification methodology. The first part explains the overall framework of prediction and explanation model integration followed by dataset selection and data preparation. Lastly, the performance metrics are discussed for model performance evaluation.

Prediction Framework
The implemented prediction framework is illustrated in Figure 3. In the initial stage, a trained XGBoost model is generated by inputting labelled training data. Successively, testing data are fed into the trained model to obtain representative prediction results. The accuracy of the prediction model is assessed by comparing the prediction results with actual labelled values. The system integrates the SHAP model and the LIME model to obtain two post-hoc explanations. The SHAP model requires testing data and the trained model, while the LIME model requires the input of testing data and forecast results. These models provide the explanation of the trained model and the explanation of the specific parameters. Finally, the two explanation modes and forecast results are integrated and displayed through the interaction module. In the verification case study presented in the following section, the XGBoost library is trained to predict the risk of incidents and accidents based on meteorological data, which allows to implement a real-time situationassessment tool for air-traffic advisory services of the kind provided in present day ICAO airspace Classes D, E, and F. The same risk-assessment functionality can also be exploited to drive the inflation of avoidance volumes in emerging performance-based ATM/UTM concepts involving autonomous separation assurance and collision-avoidance services.
in more detail.

Model Implementation and Verification Methodology
This section describes the AI explanation model implementation as well as the verification methodology. The first part explains the overall framework of prediction and explanation model integration followed by dataset selection and data preparation. Lastly, the performance metrics are discussed for model performance evaluation.

Prediction Framework
The implemented prediction framework is illustrated in Figure 3. In the initial stage, a trained XGBoost model is generated by inputting labelled training data. Successively, testing data are fed into the trained model to obtain representative prediction results. The accuracy of the prediction model is assessed by comparing the prediction results with actual labelled values. The system integrates the SHAP model and the LIME model to obtain two post-hoc explanations. The SHAP model requires testing data and the trained model, while the LIME model requires the input of testing data and forecast results. These models provide the explanation of the trained model and the explanation of the specific parameters. Finally, the two explanation modes and forecast results are integrated and displayed through the interaction module. In the verification case study presented in the following section, the XGBoost library is trained to predict the risk of incidents and accidents based on meteorological data, which allows to implement a real-time situationassessment tool for air-traffic advisory services of the kind provided in present day ICAO airspace Classes D, E, and F. The same risk-assessment functionality can also be exploited to drive the inflation of avoidance volumes in emerging performance-based ATM/UTM concepts involving autonomous separation assurance and collision-avoidance services.  In this particular activity, the main function is to process the N-dimensional array generated when inputting data and handle abnormal values. Pandas is a tool based on Numpy, which mainly solves input data analysis and exploration tasks. Usually, NumPy and Pandas are used together. Scikitlearn and XGBoost are Python ML models. In this research, XGBoost mainly provides core algorithms, while Scikit-learn mainly provides pre-processing of input data and data analysis of output results.

Datasets
In this case study, due to the limited availability of publicly available traffic datasets, open-source databases of meteorological and aviation incidents and accidents have been adopted in the verification case study are collected from the Bureau of Meteorology [42] and ATSB (Australian Transport Safety Bureau) national aviation occurrence database [43]. observation data at 9 a.m. and 3 p.m. The entire dataset is divided into five channels, as follows: • ID and date • Synthetic daily data • Observation data at 9 a.m. • Observation data at 3 p.m.

•
Incidents and accidents report To display the data and feature labels effectively, the feature names are simplified; for the operation of the model, the data of wind direction is converted from characters to numerical values; the incidents and accidents numerical value is translated into 0 and 1, corresponding to False and True, respectively. Table 2a details the full names corresponding to the abbreviations. Table 2b the codes corresponding to the wind-direction characters.

Model Preparation
The segmentation of the dataset is critical for the prediction model. The ratio of the train-test split directly affects the entire prediction model. If the testing set's size is too small or too large, it causes insufficient model accuracy or lack of generalisation ability. If the training set is too large, this increases the risk of overfitting. The conventional data-split strategy in many research activities is a ratio of 80% training to 20% testing. In this research, the split ratio is 67% to 33%: two-thirds of the training set and one-third of the testing set. The purpose is to ensure that the model has enough data for training while reducing overfitting and increasing its generalisation ability. The total input data is 518 days, and the system randomly selects 347 days of data as the training set and 171 days of data as the test set. To obtain a better prediction accuracy, the test sets up three hyperparameters to compare the results: learning rate, maximum depth, and minimum child weight. The main purpose of these three hyperparameters is to prevent overfitting but in different ways:

•
The learning rate helps to adjust the iteration step size, improving the model's overall robustness by reducing the weight of each step; • The maximum depth adjusts the number of specific samples and local samples that the model can obtain to avoid overfitting; • The minimum child weight determines the sum of the minimum leaf node sample weights, thereby preventing the model from learning local special samples.
The learning rate ranges from 0 to 1. For the test, 1, 0.1, 0.01, and 0.001 were taken. The maximum depth range is from 0 to unlimited. In the test, 3, 5, 8, and 10 were taken. For minimum child weight, it will lead to insufficient fitting if the value is too high. Therefore, four values were chosen in the test, which are 1, 3, 4, and 7. The remaining hyperparameters are set to be fixed values, and no adjustments were made in this experiment, as shown below: • The hyperparameter adjustment process has two main objectives. The first purpose is to optimise the Xgboost algorithm performance for the particular problem at hand, typically by adjusting the hyperparameters to maximise prediction accuracy. The other objective is to evaluate the influence of different hyperparameters on the prediction results. Some problems have good prediction results under certain hyperparameter settings, but once the hyperparameter settings are changed, the accuracy of the prediction results drops sharply. Therefore, by comparing the prediction results under different hyperparameter settings, it is possible to improve the versatility and usability of the system.

Performance Metrics
In regression problems, performance metrics like Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Coefficient of Determination (R 2 ) are commonly adopted. In this study, the accuracy of the predicting model is calculated based on Equation (5), where P Error means the sum of the differences between each predicted value and its corresponding actual value. N Test is the total number of test sets.
Area under the curve (AUC) is a binary model evaluation index that refers to the area under the receiver operating characteristic (ROC) curve. Equation (6) is the calculation of AUC.

Model Performance Evaluation
The prediction performance of XGBoost is analysed by comparing the results from different hyperparameter combinations, as detailed in Section 3.3. Simultaneously, without resorting to other explanation methods, XGboost explains globally the trained model through its feature correlation analysis and feature importance analysis.

Prediction Outcomes
The accuracy of all models exceeds 85%, and the highest accuracy is 92.40% where the learning rate is 0.001, and minimum child weight is 7. Besides, the lowest accuracy is 86.55% when the learning rate is 1, the maximum depth is 5, and the minimum child weight is 1 and 3. The correct classification rate is high. AUC (area under the curve) is used to measure the robustness of the estimated model against overfitting. In the test, the lowest AUC value is 0.69 when the learning rate, minimum child weight, and maximum depth are 1, 1, and 3, respectively. Additionally, the highest values are 0.82 when the learning rate and minimum child weight are 0.1 and 5 and the maximum depth is 3, which shows that the prediction model has high robustness and high accuracy. Regardless of the hyperparameter setting, reliable accuracy and good robustness are obtained as a result. Therefore, the hyperparameters used in the training model are a learning rate of 0.001, minimum child weight of 7, and maximum depth of 3. All detailed test results of the XGBoost model trained according to different hyperparameters are shown in the Supplementary Data.

XGBoost Global Explanation
To be able to comprehend the model, the analysis of data features is crucial. The correlation between variables (features) in the input data can be observed in Figure 4. The highest positive correlation between two variables is 0.98, while the highest negative is −0.69. Temperatures in each period and MSL pressure are the most relevant features in the study. Remaining features have no close correlation among themselves.  XGBoost produces its inferences based on a tree model different from linear models; tree models are naturally robust to related variables [44]. For XGBoost, the existence of relevant variables will only increase the calculation time, but it will not significantly impact the accuracy of the model. When preparing the data in this experiment, the relevant variables were not screened because there were less than 20 types of variables. Figure 5 shows the calculated importance of input variables in the adopted dataset. The graph provides a global explanation, and its calculation factors are weight, gain, and coverage (WG&C). Weight counts the number of times a variable is used as a division variable in all trees. For example, as the highest value in this calculation method, SGW XGBoost produces its inferences based on a tree model different from linear models; tree models are naturally robust to related variables [44]. For XGBoost, the existence of relevant variables will only increase the calculation time, but it will not significantly impact the accuracy of the model. When preparing the data in this experiment, the relevant variables were not screened because there were less than 20 types of variables. Figure 5 shows the calculated importance of input variables in the adopted dataset. The graph provides a global explanation, and its calculation factors are weight, gain, and coverage (WG&C). Weight counts the number of times a variable is used as a division variable in all trees. For example, as the highest value in this calculation method, SGW (speed of maximum wind gust) was used to divide the tree a total of 245 times. As shown in Figure 5, all variables can be divided into five echelons. The three variables of the first echelon-SGW, 3H and 9SW-are used as important features for tree division. Gain refers to the average value of the gain brought by the existence of a feature split node in all trees. For example, the average gain of 9SW (9 a.m. wind speed) ranked second in all trees is 3.864103. The larger the gain value, the smaller the loss function for the node and the model complexity in the subsequent round. In Figure 6, all variables can be divided into three echelons: the first and second echelons are SGW and 9SW, respectively, and the remaining variables are the third echelon. Figure 7 shows the coverage of each variable in the model, which refers to the average of the number of samples covered by this variable in all trees when a feature exists as a split node. For example, the third-place 3SW (3 p.m. wind speed) covers an average of 31.736 samples. It can be divided into two gradients; the first gradient is SGW and 9SW, the average value of which is much higher than the other values. The remaining variable is the second gradient.       Figures 8 and 9, respectively, present the total gain brought by the gain calculation method and the total number of samples obtained by the coverage calculation method. In summary, under the three calculation methods, the two variables SGW and 9SW play a very critical role in the prediction results. Therefore, it can also be seen that the occurrence of incidents and accidents mainly depends on wind speed. Other variables, such as 3H, E, and 9MSL, have a higher importance in some calculation methods. It can also be concluded that in some cases, the occurrence of incidents and accidents is also highly sensitive to humidity, evaporation, and MSL pressure.  Figures 8 and 9, respectively, present the total gain brought by the gain calculation method and the total number of samples obtained by the coverage calculation method. In summary, under the three calculation methods, the two variables SGW and 9SW play a very critical role in the prediction results. Therefore, it can also be seen that the occurrence of incidents and accidents mainly depends on wind speed. Other variables, such as 3H, E, and 9MSL, have a higher importance in some calculation methods. It can also be concluded that in some cases, the occurrence of incidents and accidents is also highly sensitive to humidity, evaporation, and MSL pressure. method and the total number of samples obtained by the coverage calculation method. In summary, under the three calculation methods, the two variables SGW and 9SW play a very critical role in the prediction results. Therefore, it can also be seen that the occurrence of incidents and accidents mainly depends on wind speed. Other variables, such as 3H, E, and 9MSL, have a higher importance in some calculation methods. It can also be concluded that in some cases, the occurrence of incidents and accidents is also highly sensitive to humidity, evaporation, and MSL pressure.

Post-Hoc Local Explanation Generation
In this chapter, the XGboost model and prediction results are explained through SHAP and LIME explanations models, which, in addition to providing strong explainability and legibility, also provide an explanation of the trained predictive model and specific predictive target, local explanation, and global explanation.

SHAP Explanation
All variables in the model are considered as contributors. The importance of the SHAP value is to assign predicted values to all contributing features. To have a better global understanding of the features in the model, the summary plot is usually analysed first. The summary plot provides a visual ordering of the results by colour and coordinate axes, as presented in Figure 10. A dot in the graph represents each piece of data for each

Post-Hoc Local Explanation Generation
In this chapter, the XGboost model and prediction results are explained through SHAP and LIME explanations models, which, in addition to providing strong explainability and legibility, also provide an explanation of the trained predictive model and specific predictive target, local explanation, and global explanation.

SHAP Explanation
All variables in the model are considered as contributors. The importance of the SHAP value is to assign predicted values to all contributing features. To have a better global understanding of the features in the model, the summary plot is usually analysed first. The summary plot provides a visual ordering of the results by colour and coordinate axes, as presented in Figure 10. A dot in the graph represents each piece of data for each variable. The redder the colour of the dot, the larger the actual value. Contrarily, the bluer the colour, the smaller the actual value. The distribution along the Y-axis represents the data's features, while the X-axis is the SHAP value composed of the colour system and the data dots. The jitter overlap of the data dots on the Y-axis combines the distribution of the SHAP value of each feature. Besides, the features are sorted on the Y-axis. On the X-axis, the model impact is sorted from high to low. For instance, the maximum wind gust and the wind speed at 9 a.m. greatly influence the forecast results. The higher the wind speed, the greater the forecast's positive impact and the higher the SHAP value. Contrastingly, low wind speed has a negative impact on the result. In addition, the higher the value of sunshine time and MSL pressure, the greater the negative impact on model prediction, while the lower the actual value, the higher the SHAP value. Furthermore, it can be seen from the distribution range of the data on the X-axis that the influence of wind speed on the prediction results is much stronger than sunshine time and MSL pressure. Therefore, a preliminary conclusion can be drawn that no matter how high the value of the sunshine time and the average sea-level pressure are, as long as there is a strong wind, there will be a high probability of incidents and accidents.  Figures 11 and 12 are SHAP dependence plots that present feature dependency. Compared with the summary graph, the dependency graph is a more specific local explanation. It focuses on the impact of a single variable on the overall inference. For each data instance, a point is drawn with a characteristic value on the X-axis, and a corresponding Shapley value is drawn on the Y-axis.  Figures 11 and 12 are SHAP dependence plots that present feature dependency. Compared with the summary graph, the dependency graph is a more specific local explanation. It focuses on the impact of a single variable on the overall inference. For each data instance, a point is drawn with a characteristic value on the X-axis, and a corresponding Shapley value is drawn on the Y-axis.  Figures 11 and 12 are SHAP dependence plots that present feature dependency. Compared with the summary graph, the dependency graph is a more specific local explanation. It focuses on the impact of a single variable on the overall inference. For each data instance, a point is drawn with a characteristic value on the X-axis, and a corresponding Shapley value is drawn on the Y-axis. Figure 11. SHAP explanation of dependence plot for speed of max wind gust. Figure 11. SHAP explanation of dependence plot for speed of max wind gust.  Figure 11 shows the SHAP feature-dependence plot for maximum wind gust speed. Taking the wind speed of 60 km/h as the dividing line, when the wind gust speed exceeds this value, the output of the SHAP characteristic value will rise rapidly. However, when it reaches 70 km/h, no matter how the wind speed value increases, the contribution to the forecast value will no longer continue to increase. In other words, wind gust speeds exceeding 70 km/h tend to cause the same hazard to aircraft. Figure 12 shows the humidity at 3 p.m. The influence of humidity is divided into two trends and three states. The first trend is that when the humidity is lower than 60%, the contribution to incidents and accidents is continuously reduced to a negative number. Once this value is exceeded, its impact will increase rapidly. The three intervals are humidity from 0 to 40%, 40% to 60% and greater than 60%. When the humidity is 0-40%, there is no impact on incidents and accidents. when the humidity is 40-60%, the impact is negative. In other words, this humidity is statistically associated with reduced occurrences of incidents and accidents. When it is greater than 60%, the risk rises sharply, and when it reaches 70% and higher values, it will contribute noticeably to the occurrence of incidents and accidents.

LIME Explanation
LIME can be used to show the explanation of the output of a single sample. Figures  13 and 14, respectively, illustrate the predicted probabilities and detailed explanations of samples 42 and 154 in the test set. As shown in Figure 13, the XGBoost model predicts that  Figure 11 shows the SHAP feature-dependence plot for maximum wind gust speed. Taking the wind speed of 60 km/h as the dividing line, when the wind gust speed exceeds this value, the output of the SHAP characteristic value will rise rapidly. However, when it reaches 70 km/h, no matter how the wind speed value increases, the contribution to the forecast value will no longer continue to increase. In other words, wind gust speeds exceeding 70 km/h tend to cause the same hazard to aircraft. Figure 12 shows the humidity at 3 p.m. The influence of humidity is divided into two trends and three states. The first trend is that when the humidity is lower than 60%, the contribution to incidents and accidents is continuously reduced to a negative number. Once this value is exceeded, its impact will increase rapidly. The three intervals are humidity from 0 to 40%, 40% to 60% and greater than 60%. When the humidity is 0-40%, there is no impact on incidents and accidents. when the humidity is 40-60%, the impact is negative. In other words, this humidity is statistically associated with reduced occurrences of incidents and accidents. When it is greater than 60%, the risk rises sharply, and when it reaches 70% and higher values, it will contribute noticeably to the occurrence of incidents and accidents.

LIME Explanation
LIME can be used to show the explanation of the output of a single sample. Figures 13 and 14, respectively, illustrate the predicted probabilities and detailed explanations of samples 42 and 154 in the test set. As shown in Figure 13, the XGBoost model predicts that the probability of incidents and accidents is 4.6%. LIME's predicted probability is 5.2%. The centre plot shows the contribution of the main features corresponding to the linear model weight to the prediction. When the wind gust speed is greater than 44 km/h and less than or equal to 59 km/h, it will have a negative impact on the probability of incidents and accidents. This is also similar to the explanation given by the SHAP feature-dependence plot. The prediction shown in Figure 14 is that incidents and accidents may occur. The XGBoost model has a prediction probability of 65%, while the LIME linear regression model has a prediction probability of only 33.3%. It shows that gusts of 74 km/h contribute the most in terms of relative weight. When 9SW is greater than 28 km/h, it has a positive contribution to the predicted result. This can also be confirmed with the SHAP explanation.  The prediction shown in Figure 14 is that incidents and accidents may occur. The XGBoost model has a prediction probability of 65%, while the LIME linear regression model has a prediction probability of only 33.3%. It shows that gusts of 74 km/h contribute the most in terms of relative weight. When 9SW is greater than 28 km/h, it has a positive contribution to the predicted result. This can also be confirmed with the SHAP explanation.

Model-Explanation Interface Concept
This section presents an HMI concept for ATM advisory DSS based on post-hoc explanation models, i.e., SHAP and LIME. The proposed interface is flexible and applicable in both ATM/UTM DSS; however, the concepts presented here are specifically designed for the case-study application documented in this article, which is the The prediction shown in Figure 14 is that incidents and accidents may occur. The XGBoost model has a prediction probability of 65%, while the LIME linear regression model has a prediction probability of only 33.3%. It shows that gusts of 74 km/h contribute the most in terms of relative weight. When 9SW is greater than 28 km/h, it has a positive contribution to the predicted result. This can also be confirmed with the SHAP explanation.

Model-Explanation Interface Concept
This section presents an HMI concept for ATM advisory DSS based on post-hoc explanation models, i.e., SHAP and LIME. The proposed interface is flexible and applicable in both ATM/UTM DSS; however, the concepts presented here are specifically designed for the case-study application documented in this article, which is the hypothetical riskprediction DSS for air-traffic advisory services. The proposed graphical user interface (GUI) to present the explanations to the operator is illustrated in Figure 15. This interface shows a detailed explanation for both global and local scales. This interface is meant to be displayed on the secondary monitor and does not clutter the tactical situation display on the primary monitor. The display layout is divided into three parts, as shown in Figure 16. The red part illustrates the global explanation of the training model, that is, explaining the weight and priority of each feature after the model is trained, which is similar to the summary plot of SHAP. This information can be viewed at any time. Moreover, ATCO can allow the viewing of more feature information by sliding up and down.
The orange area at the bottom presents the local explanation of the training model detailing the relative distribution and weight of the specified features in the training model. The display of this section is not fixed, and ATCO necessitates clicking on the feature list in the red area. Up to four local feature explanations can be displayed at the same time, and they are arranged sequentially from right to left, as data can be displayed without dragging the features.
The green area on the right explains the prediction results in individual cases. Differently from the explanation based on the training model, the explanation for individual prediction is updated in real time. On the upper right of the area is the data sequence. In addition to the results of the system model, the prediction results of the explanation model are also given. The prediction results are evaluated through two different algorithm models. In the data-list section, the feature name and the real-time value of the feature are provided. The figure on the right shows the weight coefficient of the feature under the current value.
The proposed interaction mode aims to support ATCO to inspect this new information without excessively affecting the usage habits and working environment. The design of this HMI mainly focusses on the following three key points: • Monitoring of automated processing; • Awareness of anomalies and abnormal states; and • Handling of emergencies.
Therefore, the information interaction of the hypothetical risk-prediction DSS is divided into two states and three modes. The two states are Predict Non-Incidents and Accidents (PNIA) and Predict Occur Incidents and Accidents (POIA). The three methods are as follows: • Silent mode: when the prediction of incidents and accidents is below the safety threshold; • Prompt mode: when the probability of incident or accident is above the safety threshold; and • Detailed information mode: this can be viewed regardless of status.
individual prediction is updated in real time. On the upper right of the area is the data sequence. In addition to the results of the system model, the prediction results of the explanation model are also given. The prediction results are evaluated through two different algorithm models. In the data-list section, the feature name and the real-time value of the feature are provided. The figure on the right shows the weight coefficient of the feature under the current value.  The proposed interaction mode aims to support ATCO to inspect this new information without excessively affecting the usage habits and working environment. The design of this HMI mainly focusses on the following three key points: • Monitoring of automated processing; • Awareness of anomalies and abnormal states; and • Handling of emergencies.
Therefore, the information interaction of the hypothetical risk-prediction DSS is divided into two states and three modes. The two states are Predict Non-Incidents and Accidents (PNIA) and Predict Occur Incidents and Accidents (POIA). The three methods are as follows: • Silent mode: when the prediction of incidents and accidents is below the safety threshold; • Prompt mode: when the probability of incident or accident is above the safety This state, depicted in Figure 17, is adopted by the system when no significant risks are predicted, and the advisory DSS is in silent mode. In this state and mode, the system will display "Predictive Alert: Active" in the information list area (top left of the main screen) and make predictions periodically. As long as the system predicts that there is no risk, the DSS will maintain this mode, which is characterised by no specific displayed information to hinder or distract ATCO's attention. The mode of the system is switched from silent mode to prompt mode whenever the system predicts a nonnegligible risk of incidents. Figures 18 and 19 depict the visual interaction in prompt mode. Within the information list area, "Predictive Alert" changes from "Active" to "Warning" and flashes the alert. In parallel, the tag associated with the aircraft flickers to guide the ATCO to deal with emergencies.   The mode of the system is switched from silent mode to prompt mode whenever the system predicts a nonnegligible risk of incidents. Figures 18 and 19 depict the visual interaction in prompt mode. Within the information list area, "Predictive Alert" changes from "Active" to "Warning" and flashes the alert. In parallel, the tag associated with the aircraft flickers to guide the ATCO to deal with emergencies. The mode of the system is switched from silent mode to prompt mode whenever the system predicts a nonnegligible risk of incidents. Figures 18 and 19 depict the visual interaction in prompt mode. Within the information list area, "Predictive Alert" changes from "Active" to "Warning" and flashes the alert. In parallel, the tag associated with the aircraft flickers to guide the ATCO to deal with emergencies.   The mode of the system is switched from silent mode to prompt mode whenever the system predicts a nonnegligible risk of incidents. Figures 18 and 19 depict the visual interaction in prompt mode. Within the information list area, "Predictive Alert" changes from "Active" to "Warning" and flashes the alert. In parallel, the tag associated with the aircraft flickers to guide the ATCO to deal with emergencies.   The results of the prediction system are shown in Figure 20. Based on the original information, the label is extended to include the predicted probability of incident/accident and the type of incident/accident predicted. At the same time, it also prompts ATCO to inform the pilot. In order to open the information on the label, the system needs to receive ATCO control inputs. The input method is the mouse click, which requires ATCO to click on the label of a single aircraft with the mouse to view more information and end the flashing and alarm prompts. The results of the prediction system are shown in Figure 20. Based on the original information, the label is extended to include the predicted probability of incident/accident and the type of incident/accident predicted. At the same time, it also prompts ATCO to inform the pilot. In order to open the information on the label, the system needs to receive ATCO control inputs. The input method is the mouse click, which requires ATCO to click on the label of a single aircraft with the mouse to view more information and end the flashing and alarm prompts.

Conclusions and Future Research
The foreseen proliferation of conventional air traffic and unmanned aircraft system (UAS) operations, particularly in low altitude airspace, prompts the need to increase the level of automation and also exploit the unique strengths of artificial intelligence (AI) and machine learning (ML). Due to the lack of transparency of the black-box model that characterises a wide variety of AI/ML algorithms, it is essential to develop strategies to ensure the explainability of decision-support system (DSS) resolutions. This paper presented the integration of a relatively conventional ML predictive model, XGBoost, with both global and local explanations methodologies, SHapley Additive exPlanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME). The proposed solution was verified through a representative case study addressing an hypothetical, advisory DSS for real-time risk-prediction in uncontrolled airspace, for which the relative importance of various weather factors on the probability of accidents and incidents was assessed. Due to the insufficient availability of UAS-traffic data, the presented verification addressed conventional traffic in controlled airspace with risk prediction based on meteorological data. The prediction results of the XGBoost model applied to this particular case study are well within the acceptable range and show an overall high accuracy, confirming the ML model's reliability. The SHAP and LIME algorithms are applied to the trained XGBoost model and are able to identify and illustrate very clearly the significant correlations in the dataset and how individual variables affect the prediction both locally and globally.
This research has established a feasible method to improve human-autonomy teaming by introducing explainability for AI inference processes in DSS for traditional airtraffic management (ATM) and UAS-traffic management (UTM). However, the methods described in this article have been so far verified only in the hypothetical DSS for advisory services. In our future work, we will focus on integrating AI algorithm-interpretation methods into the development stage of algorithms and systems, extending the proposed interpretation framework to other applications or use other forms of AI to improve mutual trust between humans and intelligent systems.

Conclusions and Future Research
The foreseen proliferation of conventional air traffic and unmanned aircraft system (UAS) operations, particularly in low altitude airspace, prompts the need to increase the level of automation and also exploit the unique strengths of artificial intelligence (AI) and machine learning (ML). Due to the lack of transparency of the black-box model that characterises a wide variety of AI/ML algorithms, it is essential to develop strategies to ensure the explainability of decision-support system (DSS) resolutions. This paper presented the integration of a relatively conventional ML predictive model, XGBoost, with both global and local explanations methodologies, SHapley Additive exPlanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME). The proposed solution was verified through a representative case study addressing an hypothetical, advisory DSS for real-time risk-prediction in uncontrolled airspace, for which the relative importance of various weather factors on the probability of accidents and incidents was assessed. Due to the insufficient availability of UAS-traffic data, the presented verification addressed conventional traffic in controlled airspace with risk prediction based on meteorological data. The prediction results of the XGBoost model applied to this particular case study are well within the acceptable range and show an overall high accuracy, confirming the ML model's reliability. The SHAP and LIME algorithms are applied to the trained XGBoost model and are able to identify and illustrate very clearly the significant correlations in the dataset and how individual variables affect the prediction both locally and globally.
This research has established a feasible method to improve human-autonomy teaming by introducing explainability for AI inference processes in DSS for traditional air-traffic management (ATM) and UAS-traffic management (UTM). However, the methods described in this article have been so far verified only in the hypothetical DSS for advisory services. In our future work, we will focus on integrating AI algorithm-interpretation methods into the development stage of algorithms and systems, extending the proposed interpretation framework to other applications or use other forms of AI to improve mutual trust between humans and intelligent systems.