Article

Protocol for Evaluating Explainability in Actuarial Models

by Catalina Lozano-Murcia 1,2, Francisco P. Romero 1,* and Mᵃ Concepción Gonzalez-Ramos 1

1 Department of Information Technologies and Systems, University of Castilla-La Mancha, 13071 Ciudad Real, Spain
2 Centro de Estudios en Ciencias Exactas, Escuela Colombiana de Ingeniería Julio Garavito, Bogotá 111166, Colombia
* Author to whom correspondence should be addressed.
Electronics 2025, 14(8), 1561; https://doi.org/10.3390/electronics14081561
Submission received: 5 March 2025 / Revised: 5 April 2025 / Accepted: 8 April 2025 / Published: 11 April 2025
(This article belongs to the Special Issue Advances in Information, Intelligence, Systems and Applications)

Abstract

This paper explores the use of explainable artificial intelligence (XAI) techniques in actuarial science to address the opacity of advanced machine learning models in financial contexts. While technological advancements have enhanced actuarial models, their black box nature poses challenges in highly regulated environments. This study proposes a protocol for selecting and applying XAI techniques to improve interpretability, transparency, and regulatory compliance. It categorizes techniques based on origin, target, and interpretative capacity, and introduces a protocol to identify the most suitable method for actuarial models. The proposed protocol is tested in a case study involving two classification algorithms, gradient boosting and random forest, with accuracies of 0.80 and 0.79, respectively, focusing on two explainability objectives. Several XAI techniques are analyzed, with results highlighting partial dependency variance (PDV) and local interpretable model-agnostic explanations (LIME) as effective tools for identifying key variables. The findings demonstrate that the protocol aids in model selection, internal audits, regulatory compliance, and enhanced decision-making transparency. These advantages make it particularly valuable for improving model governance in the financial sector.

1. Introduction

Actuarial science has focused on addressing the challenges related to identifying, quantifying, and managing risks using mathematical and statistical methods that enable the creation of effective models for risk management.
Globalization, technological advancements in data processing and generation, data science, and software development have optimized the work of actuaries. However, these advancements also bring challenges, ranging from adaptability issues to ethical dilemmas in handling sensitive information.
These challenges stem from the increasing efficiency of risk measurement and classification techniques, which, due to their ‘black box’ nature, hinder the detailed explanation of applied relationships. In simpler terms, what is happening inside these models?
These techniques, which are standard in other areas and derived from machine learning or artificial intelligence algorithms, might outperform conventional methods in terms of accuracy. Nevertheless, their lack of transparency, traceability, explainability, and auditability limits their application in specific contexts. This limitation stems from their inability to meet the regulatory requirements of different regions, leading to the need for less precise models.
Machine learning, a subfield of artificial intelligence, emphasizes creating algorithms capable of learning from data and generating predictions. In an environment with constant data generation, such as the financial sector, machine learning has ample problems and data for its application. However, interpreting these models in actuarial science often presents challenges [1,2], raising concerns about their accuracy and reliability. Explainable artificial intelligence (XAI) [3], with techniques such as partial dependence plots (PDP) [4,5], SHAP [6], decision trees, or variable importance (VI) [7], among many others [8], has emerged to address these challenges by enabling the development of accurate, interpretable models. These models allow complex results to be contrasted with expert judgment and facilitate informed decision-making, which is critical in the financial sector.
In recent years, multiple frameworks have been developed in the literature to improve transparency in artificial intelligence models in various areas of knowledge. For example, Lorentzen and Mayer [4] propose a framework to explain machine learning models in the actuarial context, using techniques such as PDP and Shapley additive explanations (SHAP) to facilitate the comparison of results with expert criteria. In the medical field [9], a framework has been introduced that integrates data from multiple hospital centers to optimize the interpretability and traceability of clinical prediction models, increasing confidence in diagnostic and therapeutic decisions. In cybersecurity, ref. [10] propose an AI system audit model that evaluates explainability and fairness to improve intrusion detection without compromising user privacy. In the public policy domain, Gerlings and Constantiou [11] suggest an explainable AI approach for monitoring financial transactions in regulatory compliance, providing robustness metrics that evaluate the consistency of results under variations in data. Another interesting work is the XAI used in a finance systematic literature review (SLR) that identifies 138 relevant articles from 2005 to 2022 and highlights empirical examples demonstrating XAI’s potential benefits in the financial industry [12].
These frameworks reflect the growing attention in different disciplines towards transparency and confidence in AI models, which justify the development of a specific framework in actuarial science that comprehensively addresses interpretation and compliance needs and that also facilitates the adoption of standards such as the EU’s proposed AI Act or ISO/IEC 42001 [13] focused on guaranteeing transparency in the use of AI in business terms.
Our work provides a methodology that integrates XAI techniques into the actuarial domain to address the challenges of interpretability and transparency in machine learning models. Our framework utilizes intrinsic and post hoc interpretability techniques such as PDP, SHAP, and local interpretable model-agnostic explanations (LIME) [14] to provide comprehensive insights into model behavior. By categorizing XAI methods based on their origin, application scope, and interpretability objectives, the framework ensures the selection of the most suitable explainability technique for various actuarial applications. This approach aims to enhance decision-making, regulatory compliance, and model governance within highly technical and regulated financial environments.
The proposed framework was evaluated through a case study involving predictive models for credit granting, comparing gradient boosting [15] and random forest [16] approaches. The study demonstrated the framework’s ability to identify key features, improve transparency, and support informed decision-making by applying multiple XAI techniques. The results confirmed the framework’s effectiveness in providing consistent explanations and facilitating audits of complex models. This comprehensive evaluation underscores the framework’s potential to strengthen actuarial governance and adaptability to real-world financial challenges.
The study is structured into five sections: the first provides an introduction and technical context; the second outlines the selection of XAI techniques; the third presents the proposed protocol for the selection of XAI techniques; the fourth illustrates the application of the protocol to a case study comprising two scenarios; and finally, the last section offers conclusions and suggestions for future work.

2. Context and Related Work

This section explores and provides a framework for addressing the challenges of interpretation and transparency in machine learning models for decision-making and regulatory compliance in actuarial science.

2.1. Definition: Explainable Artificial Intelligence (XAI)

XAI techniques [17] consist of methods and approaches to make artificial intelligence models and systems more transparent and understandable to various information users. The main goal is to enable people to comprehend how and why an artificial intelligence model makes certain decisions or predictions. XAI techniques also facilitate key processes like model auditing, which includes testing models for fairness and potential discrimination and ensuring compliance with ethical and legal standards. These techniques enhance transparency by enabling people to understand how data are collected and used in training artificial intelligence models, thus reinforcing the quality and transparency of the models. They are typically categorized by their evaluation goals: interpretability, origin, and application scope. Additionally, they can be grouped by their outcomes, where key categories include:
  • Data and feature visualization is an essential technique for helping people understand how input features relate to the model’s outputs [5]. Charts and diagrams can intuitively show these relationships and facilitate comprehension.
  • Feature importance techniques assess the relative importance of input features in the model’s predictions, helping to identify which variables influence the model’s decisions the most, for instance in fair risk prediction application [18].
  • Employing AI models that are inherently more interpretable, such as decision trees or linear regressions, instead of more complex black box models like deep neural networks [19], is the most standardized practice in the industry.
  • Some approaches allow AI models to generate rules or explanations describing how they made a particular decision. This can be especially useful in critical applications where justification is needed [20].
  • Natural language: generating natural language explanations that describe the model’s decisions in terms understandable by humans [21].

2.2. XAI Techniques

A wide range of XAI techniques have been developed to meet various needs arising from the implementation of AI solutions across different domains. XAI methods can be classified based on three main attributes: interpretability, model origin, and application scope. These are detailed with some examples in Table 1.

2.3. Contribution of XAI to the Actuarial Context

Explainable artificial intelligence is essential in the insurance sector as it enhances transparency, fostering stakeholder trust in AI-driven decision-making. Here are some uses of XAI in this context:
  • Interpretable models: More interpretable algorithms, such as decision trees or linear regressions. These models are easier to explain and understand than black-box models [8] like neural networks. They are commonly used in highly regulated contexts, such as technical provisions.
  • Localization of explanations: Provides specific explanations for each prediction. For instance, why was an insurance application denied [32]? What factors contributed to that decision? What surcharges could be applied to offer the requested coverage?
  • Model auditing [33]: The periodic auditing of AI models to detect any unexpected behavior or bias. This can be carried out through governance models, internal control processes, external audits, or regulatory bodies. XAI techniques reduce the need for exact replicability and allow evaluation of the robustness of the established relationships, facilitating reviews.
  • Transparency in decision-making: Clear communication about how AI models are used in underwriting, claims processing, and risk evaluation. This builds trust among decision-makers and regulators (AI Act [34]). In this context, processes are highly technical and must be communicated understandably to senior management and the board, identifying the most sensitive variables to facilitate understanding and impact evaluation.
  • Education and training: Training employees and insurance agents on how AI models work and how to interpret their results [35]. Internal understanding is essential for successful implementation, especially for models affecting commercialization that are not used by their developers.
  • Sensitivity tests: Evaluate how predictions change when certain features are modified, helping to better understand the relationships between variables and the model’s decisions. XAI techniques help identify the most influential variables, aiding in anticipating effects due to variations or potential impacts from environmental or management changes. Other relevant uses are feature selection [36] and model debugging.

3. Selection of XAI Techniques

While there are studies addressing the application of XAI techniques in actuarial models, most focus on specific aspects or particular applications. For instance, the work by [37] discusses the need for interpretability in insurance pricing and proposes a framework to explain models within this context. However, the present article stands out by offering a comprehensive methodological framework that not only evaluates various XAI techniques, but also provides a detailed protocol for their selection and application in different actuarial contexts. This holistic and practical approach represents a novel contribution, as it integrates multiple evaluation metrics and applied examples, thereby facilitating the structured and effective adoption of explainable AI in the actuarial sector.
Explainable artificial intelligence is essential for understanding and trusting AI models. However, its selection and application must adhere to the same rigor and governance as any other AI technique. Therefore, when an XAI method has to be applied, its appropriateness must be evaluated [38]. The most important attributes include the dimensions listed below.

3.1. Usability

The choice of technique must consider its usability within the context of the problem and the user. Considerations such as computational capacity, the recurring incorporation of the technique, and the usability of its results condition the use of one method over another. Therefore, it is relevant to consider whether the technique satisfies the following dimensions.

3.1.1. Model Consistency

Do the explanations inspire confidence in the AI model or raise doubts about the process or results? This aspect depends on the purpose of the XAI technique, which should be evaluated considering the goal and in comparison with other techniques.
The indicator (1) is based on the correlation of the results from the explainability method after its application to different algorithms in the same scenario [39]. It seeks to ensure the consistency of variable contribution across different instances.
$$C(E) = \frac{1}{m(m-1)/2} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \operatorname{corr}_{\mathrm{Spearman}}\left(E_{A_i}, E_{A_j}\right) \quad (1)$$
where C(E) is the score of the explainability technique E, m is the number of machine learning algorithms to which it is applied, and corrSpearman() is the Spearman correlation coefficient between the explanations EAi and EAj obtained for algorithms Ai and Aj.
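To make the computation concrete, the following minimal sketch (in Python, with illustrative attribution vectors and function names that are not part of the original protocol) computes C(E) as the average pairwise Spearman correlation of the explanations that the same technique produces for m different algorithms.

```python
# A minimal sketch of the consistency score C(E) in (1), assuming each explanation
# has already been reduced to a vector of feature-attribution scores (one vector
# per machine learning algorithm); all values are illustrative.
from itertools import combinations

from scipy.stats import spearmanr


def consistency_score(explanations):
    """Average pairwise Spearman correlation across the m attribution vectors."""
    m = len(explanations)
    pairs = list(combinations(range(m), 2))          # the m(m-1)/2 pairs
    corrs = [spearmanr(explanations[i], explanations[j])[0] for i, j in pairs]
    return sum(corrs) / len(pairs)


# Example: attributions of the same XAI technique applied to three models
print(consistency_score([[0.50, 0.30, 0.15, 0.05],
                         [0.45, 0.35, 0.15, 0.05],
                         [0.55, 0.25, 0.12, 0.08]]))
```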

3.1.2. Facilitating Decision-Making

Does the technique identify aspects relevant to interpretability, or are additional steps required? In the latter case, the complexity of the process should be compared to other alternatives. This indicator (2) can be assessed by evaluating how the decision-making process has been affected since implementing the technique, and it can be helpful in model lifecycle evaluations. A simple metric would be:
$$BDE = \frac{P_X(E) - P_{OX}(E)}{P_{OX}(E)} \quad (2)$$
where BDE is the decision-making improvement indicator from incorporating the XAI technique E, POX(E) is the user’s accuracy without the explanation, and PX(E) is the user’s accuracy with the explanation, all with values from 0 to 1.
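For instance, with purely illustrative values, if a user’s decision accuracy is 0.70 without the explanation (POX) and 0.80 with it (PX), then BDE = (0.80 − 0.70)/0.70 ≈ 0.14, i.e., an improvement of roughly 14% attributable to the explanation.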

3.1.3. Replicability in Production

The computational time and execution must be viable within the established model governance; therefore, the use of certain metrics to assess this characteristic is essential in implementation processes. Below are two representative examples.
Computational time and efficiency (TO) (3) [40] evaluates the relationship between the computational cost of the XAI technique and the quality of the explanations obtained. This aspect is particularly relevant for techniques with high computational costs, especially in cases with limited hardware resources.
$$TO(E) = w_1 \, T_{norm}(E) + w_2 \left(1 - K_{score}(E)\right) \quad (3)$$
$$K_{score}(E) = \frac{1}{m(m-1)/2} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \operatorname{corr}_{\mathrm{Spearman}}\left(E_i(X), E_j(X)\right) \quad (4)$$
$$T_{norm}(E) = \frac{T_{max} - T_E}{T_{max} - T_{min}} \quad (5)$$
where TE denotes the computation time for XAI technique E, and w1 and w2 are weights adjusted according to specific requirements, which must add up to 1.
The Kscore (4) indicates the quality of the explanation by comparing the results of the same XAI technique applied to different models: Ei(X) and Ej(X) denote the explanations obtained for models i and j over the data X, and m is the number of machine learning algorithms applied.
As an alternative to Kscore, the C(E) (1) can be used, with the weights w1 and w2 adjusted according to specific requirements.
For comparison purposes, it is recommended that the computational time (5) be scaled according to the maximum and minimum of the different techniques applied.
This equation assumes that a lower TO score is preferable, given that a lower computation time and a higher Kscore are desired. If the Kscore is perfect (equal to one), its contribution to TO is zero. Conversely, if the Kscore is poor (close to zero), its contribution increases TO [40].
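The following minimal sketch illustrates Equations (3)–(5) as stated above; the computation times, quality scores, and weights are illustrative values, not results from the case study.

```python
# A minimal sketch of the computational time and efficiency score TO in (3)-(5),
# assuming each technique already has a measured computation time and a quality
# score (Kscore or C(E)) in [0, 1]; all values below are illustrative.
def to_scores(times, quality, w1=0.5, w2=0.5):
    """times, quality: dicts mapping technique name -> seconds / quality score."""
    t_max, t_min = max(times.values()), min(times.values())
    out = {}
    for name, t in times.items():
        t_norm = (t_max - t) / (t_max - t_min)              # Equation (5)
        out[name] = w1 * t_norm + w2 * (1 - quality[name])  # Equation (3)
    return out


# Lower TO suggests a better time/quality trade-off (see text)
print(to_scores({"SHAP": 120.0, "LIME": 35.0, "PDV": 12.0},
                {"SHAP": 0.92, "LIME": 0.85, "PDV": 0.88}))
```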
Another alternative score is statistical parity difference [41], which is a commonly used metric that measures the difference in the proportion of positive outcomes for two groups. It is often employed to assess the fairness of a decision-making process when there are two groups of interest (e.g., men and women). SPD (6) is the difference between the proportion of positive outcomes for one group and the proportion of positive outcomes from the other group.
$$SPD(E) = P\left(Y = 1 \mid A = \text{minority}\right) - P\left(Y = 1 \mid A = \text{majority}\right) \quad (6)$$
where Y represents the model’s predictions, and A represents the group of the sensitive attribute. The result should be close to 0 to indicate no bias.
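A minimal sketch of the SPD computation in (6), assuming binary predictions and a binary sensitive attribute, is shown below; the data are illustrative.

```python
# A minimal sketch of the statistical parity difference (SPD) in (6), assuming
# binary predictions and a binary sensitive attribute; data are illustrative.
import numpy as np


def statistical_parity_difference(y_pred, group, minority, majority):
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    p_minority = y_pred[group == minority].mean()   # P(Y = 1 | A = minority)
    p_majority = y_pred[group == majority].mean()   # P(Y = 1 | A = majority)
    return p_minority - p_majority                  # values near 0 indicate no bias


print(statistical_parity_difference([1, 0, 1, 1, 0, 1],
                                    ["a", "a", "a", "b", "b", "b"], "a", "b"))
```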
These techniques should not generate significant delays in model production or hinder the evaluation of data structure changes. If the technique is not feasible for implementation, it should be discarded in favor of a more suitable option.

3.1.4. Fairness and Bias

It is possible to assess the presence of biases in both the model and the XAI technique by evaluating its ability to accurately identify the influence of characteristics that may introduce biases, such as gender or race, in decision-making. Additionally, the technique ensures fairness in its explanations, preventing preferential treatment of specific features for inappropriate reasons.
This indicator (7) evaluates potential bias or unfair treatment of certain groups, especially regarding sensitive attributes (e.g., positive or negative discrimination based on regulatory factors). It identifies a set of fairness criteria (e.g., age, gender, race) and assesses whether the provided explanations meet these criteria for each sensitive feature.
$$FB(E) = 1 - \sum_{i=1}^{n} w_i V_i \quad (7)$$
where n is the total number of relevant fairness criteria, wi is the weight assigned to the i-th fairness criterion based on its relevance (these weights should range from 0 to 1; if the technique generates standardized results, they can be used directly, otherwise scaling the obtained weights or applying a uniform assignment is recommended), and Vi is a binary variable indicating whether the model satisfies the i-th fairness criterion (1 if satisfied, 0 if not).
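The following minimal sketch applies Equation (7) as written; the fairness criteria, weights, and indicator values are illustrative.

```python
# A minimal sketch of the fairness indicator FB(E) in (7) as given in the text:
# weights w_i and binary indicators V_i are illustrative.
def fairness_score(weights, satisfied):
    """weights: list of w_i in [0, 1]; satisfied: list of V_i in {0, 1}."""
    return 1 - sum(w * v for w, v in zip(weights, satisfied))


# Three fairness criteria (e.g., age, gender, race) with uniform weights:
# the score is 0 when all criteria are satisfied and grows as criteria fail.
print(fairness_score([1 / 3, 1 / 3, 1 / 3], [1, 1, 0]))
```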
It is worth emphasizing that the statistical parity difference (SPD) metric (6) is particularly relevant in actuarial contexts for evaluating fairness when models must comply with regulatory standards and uphold ethical principles of non-discrimination. Furthermore, recent developments in reinforcement learning with human feedback (RLHF) [42] offer promising pathways to align AI model behavior with fairness expectations informed by human judgment.

3.2. Interpretability

This refers to how easily users can understand the explanation provided by the XAI method. Key evaluation criteria include the following:

3.2.1. Simplicity of Results

The explanation can be delivered to the model’s end-user without intervention from the developer, and it is easy to interpret. When comparing and selecting XAI techniques, one of the most intuitive indicators is related to simplicity (8), measured by the number of features involved in the explanation [27]. An XAI method that highlights fewer features is generally more interpretable.
$$S(E) = \frac{1}{n} \quad (8)$$
where n is the number of highlighted features in the explanation. The closer the value is to 1, the simpler the explanation. When comparing XAI techniques, it is recommended to scale this metric across the different techniques.
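For instance, an explanation highlighting four features yields S(E) = 1/4 = 0.25, whereas one highlighting ten features yields S(E) = 0.10 (illustrative values); after scaling, the former would be preferred on simplicity alone.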

3.2.2. Coherence

The results must make sense in the context of the problem; the contrast with expert judgment indicates whether they are suitable, insightful, or inconsistent with that context. The most representative index is feature importance (9), which determines which features significantly impact predictions and whether the importance of these features aligns with domain knowledge and expert intuition. For its application, an XAI reference technique like VI should be considered.
$$FI(E) = \frac{\left|\left\{ f_i \in R_E : f_i \text{ is among the } k \text{ principal values of } R_D \right\}\right|}{k} \quad (9)$$
where k is the number of principal features of RD, FI(E) is the explainability score for method E, RE is the feature ranking provided by method E, the features are represented as f1, f2, …, fn, and RD is the reference ranking of features. This reference can be:
  • A priori list: a top list of variables based on their relevance in the study context provided by domain experts.
  • A posteriori list: the average result from different feature importance determination techniques. At least three techniques should be applied to derive a mean ranking from the results.
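Under the reading that FI(E) measures the overlap between the top-k features of RE and RD, a minimal sketch is given below; the rankings are illustrative and do not reproduce the case-study results.

```python
# A minimal sketch of the coherence score FI(E) in (9), read as the fraction of
# the k principal features of the reference ranking RD that method E also places
# in its own top k; the rankings below are illustrative.
def coherence_fi(ranking_e, ranking_ref, k=3):
    """ranking_e, ranking_ref: feature names ordered from most to least important."""
    top_e, top_ref = set(ranking_e[:k]), set(ranking_ref[:k])
    return len(top_e & top_ref) / k


ranking_method = ["ExternalRiskEstimate", "AverageMInFile", "NumSatisfactoryTrades"]
ranking_experts = ["ExternalRiskEstimate", "NumSatisfactoryTrades", "PercentTradesNeverDelq"]
print(coherence_fi(ranking_method, ranking_experts, k=3))   # 2 of the top-3 coincide
```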

3.3. Confidence

Ensuring that the results of the XAI technique are reliable compared to the reviewed model and its outcomes is essential. Two key characteristics must be evaluated:

3.3.1. Faithfulness

This evaluates whether the explanations generated by an XAI method accurately reflect the actual behavior of the underlying model. The faithfulness (10) of an explanation E(x) generated by an XAI method for an instance x is evaluated by determining the difference between the original model prediction f(x) and the prediction g(x) obtained using only the most important features identified by the XAI method.
$$F(E) = \frac{f(x) - g(x) + 1}{2} \quad (10)$$
where g(x) is a simplified model based on the explanatory features provided by the XAI method. The closer to one, the better the result.
This metric is critical, as an explanation that does not faithfully represent the model’s logic can lead to user errors.

3.3.2. Stability and Robustness

It must be assessed whether the XAI technique accurately explains the model’s behavior and response to small input data perturbations, both locally and globally.
This score (11) evaluates the stability of the explanation across different data scenarios, using the correlation of the technique’s results when applied to different datasets or small data perturbations [39]. Robust models should produce consistent results even when input data changes.
$$R(E) = \frac{1}{n(n-1)/2} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \operatorname{corr}\left(E_{S_i}, E_{S_j}\right) \quad (11)$$
where R(E) is the robustness score of the explainability technique E, n is the number of different scenarios, and corr() is the Pearson correlation coefficient between the results obtained in scenarios Si and Sj.

4. Proposed XAI Selection Protocol for Actuarial Problems

Before selecting an XAI technique, it is important to answer the following questions: why and for what purpose is it needed? Is it to evaluate a model’s regulatory compliance? Is the aim to better understand the model or increase user confidence? Is the goal to identify relevant variables to facilitate decision-making? Different contexts may require different levels and types of explainability, and therefore, it is necessary to take several considerations into account (Table 2).

4.1. Actuarial Application

Focusing on the analysis of AI models for actuarial problems reveals specific needs of the financial industry that require the use of explainability techniques as part of model development governance, as explained in the following subsections:

4.1.1. Understanding the Model

The process of understanding the model involves the steps in Table 3.

4.1.2. Results Explanation

The explanations provided must satisfy the following characteristics.
  • Relevance and precision of explanations: Evaluate the relevance and precision of explanations about the specific actuarial problem. Explanations must be relevant and valuable for actuarial decision-making. This aspect can be evaluated with FI, in any of its versions.
  • Clarity and understandability of explanations: Assess the clarity and understandability of the explanations provided by the XAI model. The explanations should be accessible to actuaries and other end users without requiring deep technical knowledge. For example, if a pricing model combines diverse sources of information, the explanations should be evaluated according to the number of highlighted features; too many features could make the explanation difficult for insurance agents to follow. Coherence refers to whether the most relevant features remain similar for clients with similar profiles.

4.1.3. Model Inspection

Inspecting the model is one of the most important aspects of our proposal and must be focused on the following aspects:
  • Transparency in the modeling process: Evaluate the transparency of the modeling process provided by the XAI model. Actuaries must understand how the model was trained, what data were used, and how the explanations were generated. This characteristic will be a fundamental part of model governance documentation, and a similar process should allow for similar explainability results, even if the core model does not allow for exact replicability.
  • Regulatory and ethical compliance [43]: Assess whether the XAI model complies with the actuarial field’s relevant regulatory and ethical requirements. They can include data privacy, fairness, and non-discrimination, among other factors. The results from feature importance and fairness tests will facilitate this type of assessment of the models.

4.2. Proposed Protocol

Selecting explainability and interpretability (XAI) techniques for financial models is relevant given the significant impact of decisions based on these models. Below is a proposed guideline protocol for selecting XAI models, summarized in Figure 1:

4.2.1. Understanding the Problem

Understanding the problem involves several key steps. First, it requires a clear definition of the specific issue that XAI will address within the insurance sector, such as risk assessment, fraud detection, or policy pricing. Next, it is crucial to research and comprehend the relevant regulatory and legal requirements for the industry in a given region, such as GDPR, HIPAA, or specific financial regulations. Finally, understanding the context in which financial models will be applied is essential. This includes identifying whether the application relates to loans, investments, credit risk, or fraud detection, as well as recognizing specific explainability requirements, such as regulatory compliance, user understanding of the model, or building confidence in decisions.

4.2.2. Define the Explainability Goal

Based on the preliminary steps, the selection of one or more XAI techniques should be aligned with the interpretability goal, which in actuarial contexts typically includes:
  • Part of the model development cycle (technical validation):
    • Model optimization: Identify interactions and their impact on model output, enabling the evaluation of feature relevance, prioritization of variables, optimization, or identifying potential deficiencies or areas of improvement in the models.
    • Detection of biases or technical discrimination: Identify possible biases generated by AI models, ensuring balanced inputs or hyperparameters to guarantee equitable models.
    • Model selection: When multiple models are available, prediction accuracy may not be the only factor considered when implementing or deploying the final model. Bias analysis, regulatory compliance, or ease of implementation can become additional factors. In this regard, XAI techniques can be used to evaluate metrics beyond prediction accuracy.
    • Understanding established relationships in the base model: Identify interactions and their impact on model output, which allows for evaluating the effects of varying relevant characteristics, facilitating sensitivity analysis while reducing total processing costs. It also allows for the consistent evaluation of the model’s development context and economic rationale.
    • Evaluate financial models: Analyze the utilized financial models. This includes understanding their architecture, the data they use, and the relevant features for financial decision-making.
  • Decision-making:
    • Explanation to third parties: Understand how the relationships established by the model work, which features or variables influence the predictions most, and the rationale behind these relationships. It will aid in communicating with non-technical stakeholders and serve as a decision-making tool.
    • Contrasting explainability results: XAI techniques rely on training and test data, are sensitive to inputs, and are influenced by the nature of the selected model. Therefore, it is advisable to compare results with one or more models.
    • Expert review and contrast analysis: Compare the results with expert judgment to ensure coherence and validate the findings.
  • Audit process:
    • Internal control: In developing the internal review processes carried out by key functions in the second and third lines of defense, replicating complex or deep learning models is not always feasible. Implementing XAI techniques facilitates the evaluation of the models’ proposed outputs without requiring the exact recalculation of results, which may not be viable in some models.
    • Regulatory compliance: Facilitate the evaluation of compliance with regulatory requirements, such as ensuring no gender-based discrimination in insurance pricing or establishing consistency between variables and the context of the problem, allowing for bias detection.

4.2.3. XAI Model Selection

XAI model selection should ensure the effectiveness and fairness of AI models. First, model optimization plays a crucial role in identifying feature interactions and their impact on model outputs, enabling the evaluation of feature relevance, prioritization of variables, and optimization while uncovering potential deficiencies or areas for improvement. Second, detecting biases or technical discrimination is essential for ensuring equitable models by identifying imbalances in inputs or hyperparameters that could lead to biased predictions. Finally, when selecting a model from multiple options, prediction accuracy may not be the sole criterion; factors such as bias analysis, regulatory compliance, or ease of implementation also become critical. In this context, XAI techniques provide valuable insights into metrics beyond accuracy, aiding in comprehensive evaluation and decision-making. Several XAI techniques may be helpful in the financial context and are presented in Table 4. At least two of these techniques should be tested to cross-reference the results. In Table 4, the term ‘Global’ refers to techniques that provide explanations for the overall behavior of the model, while ‘Local’ refers to techniques that explain individual predictions or specific data subsets.

4.2.4. Evaluation of Results and Selection of Techniques

This encompasses understanding the relationships established in the base model and analyzing the financial models it aims to explain. This involves identifying interactions and their impact on model outputs, enabling sensitivity analysis by evaluating the effects of varying relevant characteristics while reducing processing costs. Additionally, it ensures a consistent assessment of the model’s development context and economic rationale. Furthermore, the evaluation includes analyzing the architecture of financial models, the data they utilize, and the features most relevant to financial decision-making, ensuring that the model aligns with its intended purpose and provides actionable insights.

4.2.5. Implementation of XAI Techniques in Governance Model

The implementation of explainable AI (XAI) techniques within a governance model encompasses several key considerations, including explanation to third parties, the comparison of explainability results, expert review and contrast analysis, and the audit process, which involves internal control and regulatory compliance. To ensure the effectiveness of XAI, it is crucial to document and validate the generated explanations, ensuring they are comprehensible, useful, reproducible, and reliable for end users. Additionally, proper education and training should be provided to users and professionals, enabling them to utilize the explanatory AI model effectively in their daily tasks.
Furthermore, maintaining and updating the explanatory AI model is essential to sustaining its accuracy and relevance. This includes the establishment of a structured maintenance plan that incorporates periodic performance reviews and the integration of the latest updates and data. By implementing these measures, organizations can enhance the transparency, reliability, and regulatory compliance of AI-driven decision-making processes, thereby fostering greater trust and accountability in their use.

4.2.6. Re-Evaluation of Needs

The objectives of analysis or evaluation may change according to the economic context or business dynamics; therefore, it will be necessary to assess the relevance of the technique(s) employed and, if needed, identify suitable XAI techniques.

4.3. Assessment Framework

Regarding the proposed evaluation metrics, a combination is suggested according to the general objectives of the actuarial context presented in Section 4.2. Table 5 introduces a weighted scoring system to guide the inclusion of each metric in the evaluation process, with standardized values ranging from 0 (worst) to 1 (best).
The weighting of the metrics depends on the objective and serves as a guide for including each metric in the weighted evaluation of its results; each metric is standardized so that its evaluation ranges from 0 to 1, with 0 being the worst rating and 1 the best.
The definition of this weighting scheme could incorporate expert knowledge by using structured elicitation techniques such as the Delphi method [44] and the analytic hierarchy process (AHP) [45]. The Delphi method facilitates expert consensus through iterative anonymous feedback, while the AHP derives the relative importance of metrics through structured pairwise comparisons. Combining these approaches with consensus methods ensures a balanced and informed weighting system.
The suggested weightings follow an order of priority according to the context of the explainability objective, starting from a preponderant weight V1 of 22% for the most relevant metric and decreasing by a proportion factor, as given in (12).
$$V_i = \frac{V_{i-1}}{S_{i-1} + a_i} \quad (12)$$
where Vi is the weight of the i-th metric, ordered by relevance level, Si−1 is the reference division factor associated with the previous metric, and ai is the additional weight decrease applied to the i-th factor, guaranteeing that the sum of the weights is 100%. For the proposed case, V1 = 22%, S2 = 1.05, and ai = 0.1 for all i > 2, considering eight metrics.
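A minimal sketch of one way to realize this decreasing-proportion scheme is given below; it assumes V1 = 0.22, an initial division factor of 1.05, an increment of 0.1 per step, and a final normalization so that the eight weights sum exactly to 100%.

```python
# A minimal sketch of a decreasing-proportion weighting scheme in the spirit of
# (12); the starting weight, division factor, increment, and final normalization
# are assumptions for illustration.
def metric_weights(n_metrics=8, v1=0.22, s=1.05, a=0.1):
    weights = [v1]
    divisor = s
    for _ in range(n_metrics - 1):
        weights.append(weights[-1] / divisor)   # each weight is a fraction of the previous one
        divisor += a                            # the division factor grows at each step
    total = sum(weights)
    return [w / total for w in weights]         # normalize so the weights sum to 100%


print([round(w, 3) for w in metric_weights()])
```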
In summary, according to the defined protocol, the selection of XAI techniques is determined by the specific explainability objective. For audit purposes, techniques are selected based on their usability, interpretability, and confidence performance, focusing on the attributes necessary for assessing model behavior and ensuring regulatory compliance (faithfulness, robustness, and coherence). For model selection, techniques are compared based on their ability to explain model behavior, maintain consistency in variable importance, and provide reliable support for comparing alternatives. In both cases, the protocol supports using multiple techniques to provide complementary insights and improve the evaluation process.

5. Case Study

This study is based on the application of explainability techniques for predictive models of credit granting using a FICO dataset [46]. Two explainability objectives are proposed: (1) the selection of the best model, and (2) an internal audit of the results.

5.1. Dataset

The analysis starts with anonymized data related to home equity line of credit (HELOC) applications. The primary purpose of the data, which comes from a competition, is to predict whether an applicant will make their HELOC payment in the next 24 months.
This dataset, generated for the competition, has 24 variables (1 target, 23 independent) and 10,459 records. The variables do not present the possibility of biases that go against regulation, as they do not include attributes such as gender or age; they are designed to provide relevant information on past credit behavior and the potential risk of default. The dataset includes variables related to the payment behavior of applicants, which can be grouped into three categories:
  • Delinquency and inquiries (Group 1)
  • Credit history and activity (Group 2)
  • Yield and financial risk (Group 3)
The target variable is RiskPerformance, which is binary and classifies applicants as “Good” (good payer) or “Bad” (bad payer), depending on whether they have had more than 90 days of late payment in a 24-month period.

5.2. Machine Learning and Explainability Results

As a solution to the problem, three machine learning techniques were applied because of their good results on similar problems [47]. The results are presented in Table 6, with gradient boosting being the model with the best results under all the proposed evaluation metrics.
Five XAI techniques, LIME, SHAP, PERM [31], Morris sensitivity (MS) [48], and PDV, are applied to the two models with the highest accuracy. The accuracy of DT, ANN, and KNN is outperformed by RF and GB, and the computation times of the explainability metrics are very high for KNN, which also shows considerable variability across resampled bases. Therefore, we focus on showing the results of RF and GB.
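For reference, the following minimal sketch outlines this experimental setup with scikit-learn, assuming the HELOC data are available locally (the file name is hypothetical); permutation importance (PERM) is shown as one of the techniques, and SHAP, LIME, Morris sensitivity, and PDV follow analogous workflows on the same fitted models.

```python
# A minimal sketch of the experimental setup, assuming the FICO HELOC data are in
# a local CSV file with the binary target column RiskPerformance; the file name
# is a hypothetical placeholder.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

heloc = pd.read_csv("heloc_dataset.csv")                      # hypothetical file name
X = heloc.drop(columns="RiskPerformance")
y = (heloc["RiskPerformance"] == "Good").astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {"GB": GradientBoostingClassifier(random_state=0),
          "RF": RandomForestClassifier(random_state=0)}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "accuracy:", round(model.score(X_te, y_te), 2))
    # Permutation importance (PERM) as one model-agnostic explainability technique
    perm = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
    ranking = pd.Series(perm.importances_mean, index=X.columns).sort_values(ascending=False)
    print(ranking.head(5))                                    # top-5 features per model
```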
The results are shown in Figure 2 for the gradient boosting model, and in Figure 3 for the random forest model.
ExternalRiskEstimate is the variable with the highest consistency in its relevance, ranking first in most techniques (LIME, SHAP, and PDV), and a very close position in Morris sensitivity, which indicates that it is a critical variable in the credit risk assessment and probably a robust indicator of the probability of default.
AverageMInFile and NumSatisfactoryTrades are prominent variables that hold high positions in all techniques, suggesting that data on the age of the credit file and the number of satisfactory transactions are relevant for the GB model.
On the other hand, MSinceMostRecentInqexcl7days and NetFractionRevolvingBurden show some variability across techniques, but are still among the most relevant variables. This suggests that the time elapsed since the last credit inquiry and the proportion of revolving credit in use significantly impact risk analysis, although the magnitude of this impact may vary depending on the explanatory technique used.
The Morris sensitivity technique shows more variability in ranking certain variables than the other techniques, as observed in the cases of MSinceMostRecentDelq and NetFractionRevolvingBurden. This may indicate a greater sensitivity to small perturbations in these variables, implying that the model is less robust in its response to changes in these dimensions.
The results of the explainability techniques for the random forest (RF) model allow us to appreciate some key differences in the importance assigned to several variables, which have relevant implications for the interpretation and transparency of both models.
Consistency persists in key variables such as ExternalRiskEstimate, which appears to be the most important, showing the preponderance of this variable for the problem. On the other hand, there is variability in the importance of secondary variables, such as AverageMInFile and PercentTradesNeverDelq, which are consistently positioned in the top places; this contrasts with the GB model, where some techniques, such as Morris sensitivity, assign less importance to these variables. Finally, inconsistency is evident in variables of lesser importance, such as NumInqLast6M and NumBank2NatlTradesWHighUtilization, whose positions vary between techniques in the RF model, compared to the GB model, where the ranking was more consistent. This variability in the RF model suggests that certain XAI techniques (e.g., Morris sensitivity) may be more sensitive to small variations in the data, affecting interpretability and confidence in secondary risk factor analysis.

5.3. Protocol Application

The different metrics were calculated for each XAI technique applied to the reference models, according to the defined protocol. These results are presented in Table 7 and Table 8. As mentioned, the fairness attribute is guaranteed by the structure of the dataset; therefore, it is satisfied, and no weight is assigned to it in the metrics matrix. The metrics differ for coherence, where PDV has the highest score for GB while LIME has the highest for RF. There is also a difference in simplicity, where PERM scores highest for GB and PDV for RF; however, in general, the order and the metrics are similar between models, reflecting consistency in the application of XAI techniques to the models for this particular problem.
We present the application of the protocol to the two proposed phases, applying the weights of the metrics presented for the two review approaches in Table 9.
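For illustration, the final weighted aggregation can be sketched as follows; the metric scores and weights are illustrative placeholders, not the values reported in Tables 7–9.

```python
# A minimal sketch of the final weighted scoring step of the protocol, assuming
# each XAI technique has already been scored on the standardized metrics
# (0 = worst, 1 = best); all numbers are illustrative.
import pandas as pd

scores = pd.DataFrame(
    {"consistency": [0.90, 0.85, 0.75, 0.70],
     "simplicity": [0.60, 0.55, 0.65, 0.50],
     "faithfulness": [0.85, 0.80, 0.75, 0.70],
     "stability": [0.88, 0.84, 0.80, 0.72]},
    index=["PDV", "LIME", "SHAP", "MS"])

# Objective-specific weights (in the spirit of Table 9), summing to 1
weights = pd.Series({"consistency": 0.30, "simplicity": 0.15,
                     "faithfulness": 0.30, "stability": 0.25})

ranking = scores.mul(weights, axis=1).sum(axis=1).sort_values(ascending=False)
print(ranking)   # the technique with the highest weighted score is suggested first
```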

5.4. Protocol Result: Understanding Established Relationships

  • Objective: To compare two different credit granting prediction models (random forest, gradient boosting) using XAI techniques to select the best model based on relevance and accuracy of the explanations.
  • Attribute evaluation: Attributes of usability, interpretability, and confidence in the explanations generated by each XAI technique will be evaluated. The results are presented in Table 10 for the gradient boosting model and in Table 11 for the random forest model.
Based on the above results for GB, the use of the partial dependency variance (PDV) technique is recommended for the process. The next options would be LIME and PERM, the first with good results in consistency and decision-making and the second standing out in simplicity, faithfulness, and stability and robustness.
For random forest, PDV is reaffirmed as the best option, with LIME as the second choice and SHAP as the third. Notably, for this machine learning technique, Morris sensitivity exhibits lower volatility in its results compared to the ranking observed for gradient boosting, which could indicate that this technique is more stable for RF than for GB.
While the scoring differences between techniques such as LIME (0.679) and PDV (0.690) in Table 8 may appear small, the intent of the framework is not to prescribe a single optimal technique, but rather to offer a structured comparison based on multiple technical and practical attributes. In fact, the close scores support the recommendation of using multiple XAI techniques in parallel, particularly when diverse stakeholder perspectives are involved. The combined application of LIME, PDV, or others can provide complementary views and increase the robustness of the interpretability analysis.

5.5. Protocol Result: Audit Process

  • Objective: To review the modeling process and its results, evaluating transparency, regulatory compliance, and internal audit.
  • Techniques: The same as for the first objective.
  • Attribute evaluation: Attributes of usability, interpretability, and confidence of the explanations generated by each XAI technique will be evaluated. The results are presented in Table 12 for GB and in Table 13 for random forest.
With the above results, the PDV technique is recommended for the review process, followed by LIME and SHAP. In this case, the consistency and the stability and robustness results place it in third place for GB, while for RF, LIME ranks first, followed by PDV. In general, LIME and PDV present valuable structures for the review process and reflect stability concerning the main features needed to address the problem.
The XAI analysis has highlighted that the most relevant variables are those related to credit history and activity (Group 2). However, the most relevant seems to be ExternalRiskEstimate in almost all the techniques applied to the two models.
It was possible to independently verify that the selected variables were relevant to the problem across the different metrics, which facilitated the audit process of the black box models and allowed the evaluation of potential improvements to the process itself.
As can be seen, the explanatory techniques offer different perspectives on the importance of each variable. LIME and SHAP present more consistency and stability in their ranking of variables, which helps obtain a reliable and stable explanation of the main risk factors. Stability in the ranking of key variables may give greater confidence in the applicability of these models in credit contexts, where transparency and consistency in decision-making are essential for regulatory compliance.
On the other hand, Morris sensitivity and PDV reveal certain differences in the importance of secondary variables, which can be useful in identifying sensitivity to changes in data or detecting possible complex interactions. However, this variability can also complicate interpretation if a single stable view of the risk factors is sought, so using more than one technique and applying the suggested protocol allows a more complete view, according to the desired approach to the problem.

6. Discussion and Conclusions

This work sought to develop a framework for the application of XAI techniques to common actuarial problems. The developed framework proposes including explainability techniques in the model development and governance process, subject to the user’s objective and the problem, together with criteria for their selection and evaluation, thus generating a robust process for different cases that facilitates the use of AI techniques in this context.
The use of XAI techniques facilitates the interpretation and comparison of models. If desired, it allows addressing the challenges of auditability, understanding, comparison, and the generation of new modeling approaches, as well as evaluating the logic of the established relationships, supporting efficient risk management in the financial sector, and explaining results to non-technical counterparties.

6.1. Evaluation of the Framework

The protocol proposed in this study provides a structured and systematic framework for the assessment of explainability in actuarial models. Its main advantage is that it allows an objective comparison between different XAI techniques, applying suggested metrics for the consistency and robustness of each method. This approach facilitates the selection of the most appropriate explainability techniques for each model based on the importance of the variables and the model’s sensitivity to small changes in the data. By applying this protocol, risk analysts can consistently identify the most critical variables and assess the impact of secondary variables on the prediction, thus improving decision-making.
The proposed framework enables actuaries to better understand the key variables influencing predictions, such as ExternalRiskEstimate and the credit history and activity variables, while providing tools to audit complex models effectively. This capability supports transparency in decision-making processes and fosters confidence among stakeholders, including regulators and non-technical audiences. Additionally, the framework aids in identifying model biases and optimizing feature prioritization, contributing to ethical and equitable model development.

6.2. Practical Implications

In a risk management context, which is the main challenge in actuarial science, consistency in the importance of key variables, such as ExternalRiskEstimate and AverageMInFile in the case study, is relevant to maintaining model transparency and reliability. Differences in the importance of secondary variables across techniques and models suggest that a combination of XAI methods could be beneficial. This would allow both the robustness of the main variables to be captured and a more detailed view of the less important variables to be obtained, providing a more complete risk analysis and allowing techniques to be applied according to the analysis objective.
In the practical case developed, significant limitations are found both in the application context and in the interpretation of the characteristics of the variables, which stem from their creation for a competition; this makes it difficult to contrast the results obtained with expert criteria. However, part of the challenge addressed by the proposed framework is to overcome these barriers through model-agnostic techniques and, therefore, explanations that remain conclusive without expert judgment.

6.3. Limitations and Challenges

The protocol has some limitations, such as the definition and weighting of the metrics. However, it meets the main objective of facilitating the implementation and testing of AI models in the actuarial context for the developer, the reviewer, and the decision-maker. It is important to mention that strengthening it requires carrying out several exercises with different problems, and open datasets suitable for this type of exercise are not easy to find.
Finally, it is possible to see the potential of this type of model to open up black boxes, creating a window for future work focused on evaluation and replication exercises within risk management processes in the actuarial environment, such as technical risk management and commercial or auditing processes for daily supervision in the financial sector. This constitutes a clear path for strengthening model governance in the daily work of the actuary.

7. Future Work

The proposal for a framework for the evaluation of XAI techniques for actuarial problems is an initiative requiring permanent analysis, given the dynamics of this area of knowledge and the multiplicity of potential applications.
One of the main lines of future work is the definition of problem-technique pairs that facilitate the development of the process, together with the exclusion or inclusion of additional techniques according to the needs of the modeling process or the context.
Another highly relevant topic will be the integration of XAI aligned with regulatory documents [49] such as the proposed EU AI Act and standards such as ISO/IEC 42001, which emphasize transparency, fairness, and accountability in AI-driven decisions [50]. Future research should focus on adapting the proposed framework to meet these regulatory standards by exploring the implications of evolving regulations on model governance. This includes assessing the role of XAI in addressing issues such as bias detection and ensuring fairness in actuarial practices.
On the other hand, future research should focus on expanding the proposed XAI protocol to incorporate fairness-aware techniques, thereby addressing not only interpretability but also bias detection and mitigation. This extension would ensure that efforts toward model transparency are accompanied by mechanisms that promote equitable treatment, in line with evolving regulatory frameworks and ethical standards in the financial industry.

Author Contributions

Conceptualization: F.P.R. and C.L.-M.; methodology: C.L.-M.; validation: F.P.R.; experimentation: M.C.G.-R.; writing: C.L.-M. and F.P.R.; visualization: M.C.G.-R.; supervision: F.P.R.; project administration: F.P.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been partially supported by FEDER and the State Research Agency (AEI) of the Spanish Ministry of Economy and Competition under grant SAFER: PID2019-104735RB-C42 (AEI/FEDER, UE).

Data Availability Statement

Publicly available datasets were analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Richman, R. AI in Actuarial Science—A Review of Recent Advances; Cambridge University Press: Cambridge, UK, 2020. [Google Scholar]
  2. Richman, R. AI in actuarial science—A review of recent advances—Part 2. Ann. Actuar. Sci. 2021, 15, 230–258. [Google Scholar] [CrossRef]
  3. Barredo Arrieta, A.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115. [Google Scholar] [CrossRef]
  4. Lorentzen, C.; Mayer, M. Peeking into the Black Box: An Actuarial Case Study for Interpretable Machine Learning; Swiss Association of Actuaries SAV: Stockholm, Sweden, 2020. [Google Scholar]
  5. Greenwell, B.M. pdp: An R Package for Constructing Partial Dependence Plots. R J. 2017, 9, 421–436. [Google Scholar] [CrossRef]
  6. Matthews, S.; Hartman, B. mSHAP: SHAP Values for Two-Part Models. Risks 2022, 10, 3. [Google Scholar] [CrossRef]
  7. Greenwell, B.; Boehmke, B.; McCarthy, A. A Simple and Effective Model-Based Variable Importance Measure. arXiv 2018, arXiv:1805.04755. [Google Scholar]
  8. Owens, E.; Sheehan, B.; Mullins, M.; Cunneen, M.; Ressel, J.; Castignani, G. Explainable Artificial Intelligence (XAI) in Insurance. Risks 2022, 10, 230. [Google Scholar] [CrossRef]
  9. Yang, G.; Ye, Q.; Xia, J. Unbox the black-box for the medical explainable AI via multi-modal and multi-centre data fusion: A mini-review, two showcases and beyond. Inf. Fusion 2022, 77, 29–52. [Google Scholar] [CrossRef] [PubMed]
  10. Zhang, Z.; Al Hamadi, H.; Damiani, E.; Yeun, C.Y.; Taher, F. Explainable Artificial Intelligence Applications in Cyber Security: State-of-the-Art in Research. IEEE Access 2022, 10, 93104–93139. [Google Scholar] [CrossRef]
  11. Gerlings, J.; Constantiou, I. Machine Learning in Transaction Monitoring: The Prospect of xAI. In Proceedings of the 56th International Conference on System Sciences, Maui, HI, USA, Virtual Event, 4–7 January 2022. [Google Scholar]
  12. Černevičienė, J.; Kabašinskas, A. Explainable artificial intelligence (XAI) in finance: A systematic literature review. Artif. Intell. Rev. 2024, 57, 216. [Google Scholar] [CrossRef]
  13. Golpayegani, D.; Pandit, H.; Lewis, D. Comparison and Analysis of 3 Key AI Documents: EU’s Proposed AI Act, Assessment List for Trustworthy AI (ALTAI), and ISO/IEC 42001 AI Management System. In Irish Conference on Artificial Intelligence and Cognitive Science; Springer: Cham, Switzerland, 2023; pp. 189–200. [Google Scholar]
  14. Salih, A.; Raisi-Estabragh, Z.; Galazzo, I.B.; Radeva, P.; Petersen, S.E.; Lekadir, K.; Menegaz, G. A Perspective on Explainable Artificial Intelligence Methods: SHAP and LIME. arXiv 2024, arXiv:2305.02012v3. [Google Scholar] [CrossRef]
  15. Fauzan, M.A.; Murfi, H. The accuracy of XGBoost for insurance claim prediction. Int. J. Adv. Soft Comput. Its Appl. 2018, 10, 159–171. [Google Scholar]
  16. Lin, W.; Wu, Z.; Lin, L.; Wen, A.; Li, J. An ensemble random forest algorithm for insurance big data analysis. IEEE Access 2017, 5, 16568–16575. [Google Scholar] [CrossRef]
  17. Haque, A.K.M.B.; Islam, A.K.M.N.; Mikalef, P. Explainable Artificial Intelligence (XAI) from a user perspective: A synthesis of prior literature and problematizing avenues for future research. Technol. Forecast. Soc. Change 2023, 186, 122120. [Google Scholar]
  18. Ning, Y.; Li, S.; Ng, Y.Y.; Chia, M.Y.C.; Gan, H.N.; Tiah, L.; Mao, D.R.; Ng, W.M.; Leong, B.S.-H.; Doctor, N.; et al. Variable importance analysis with interpretable machine learning for fair risk prediction. PLoS Digit. Health 2024, 3, e0000542. [Google Scholar] [CrossRef] [PubMed]
  19. Lipton, Z.C. The Mythos of Model Interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue 2018, 16, 31–57. [Google Scholar] [CrossRef]
  20. Ghorbani, A.; Berenbaum, D.; Ivgi, M.; Dafna, Y.; Zou, J. Beyond Importance Scores: Interpreting Tabular ML by Visualizing Feature Semantics. arXiv 2021, arXiv:2111.05898. [Google Scholar] [CrossRef]
  21. Danilevsky, M.; Qian, K.; Aharonov, R.; Katsis, Y.; Kawas, B.; Sen, P. A Survey of the State of Explainable AI for Natural Language Processing. In Proceedings of the 10th International Joint Conference on Natural Language Processing, Suzhou, China, 4–7 December 2020. [Google Scholar]
  22. Wachter, S.; Mittelstadt, B.; Russell, C. Counterfactual Explanations without Opening the Black Box: Automated Decisions and the GDPR. Harv. J. Law Technol. 2018, 31, 842–889. [Google Scholar] [CrossRef]
  23. Ben-Haim, Y.; Tom-Tov, E. A Streaming Parallel Decision Tree Algorithm. J. Mach. Learn. Res. 2010, 11, 849–872. [Google Scholar]
  24. Goldstein, A.; Kapelner, A.; Bleich, J.; Pitkin, E. Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation. J. Comput. Graph. Stat. 2015, 24, 44–65. [Google Scholar] [CrossRef]
  25. Apley, D.W.; Zhu, J. Visualizing the effects of predictor variables in black box supervised learning models. J. R. Stat. Soc. Stat. Methodol. Ser. B 2020, 82, 1059–1086. [Google Scholar] [CrossRef]
  26. Friedman, J.H.; Popescu, B.E. Predictive Learning via Rule Ensembles. Ann. Appl. Stat. 2008, 2, 916–954. [Google Scholar] [CrossRef]
  27. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should I trust you?” Explaining the predictions of any classifier. arXiv 2016, arXiv:1602.04938. [Google Scholar]
  28. Andrews, R.; Diederich, J.; Tickle, A.B. Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowl.-Based Syst. 1995, 8, 373–389. [Google Scholar] [CrossRef]
  29. Saltelli, A.; Ratto, M.; Andres, T.; Campolongo, F.; Cariboni, J.; Gatelli, D.; Saisana, M.; Tarantola, S. Global Sensitivity Analysis. The Primer; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2007. [Google Scholar]
  30. Lundberg, S.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. arXiv 2017, arXiv:1705.07874. [Google Scholar]
  31. Molnar, C. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. Available online: https://christophm.github.io/interpretable-ml-book/ (accessed on 31 January 2025).
  32. Delcaillau, D.; Ly, A.; Papp, A.; Vermet, F. Model Transparency and Interpretability: Survey and Application to the Insurance Industry. Eur. Actuar. J. 2022, 12, 443–484. [Google Scholar] [CrossRef]
  33. Zhang, C.A.; Cho, S.; Vasarhelyi, M. Explainable Artificial Intelligence (XAI) in auditing. Int. J. Account. Inf. Syst. 2022, 46, 100572. [Google Scholar] [CrossRef]
  34. European Union. European Union Official Journal. Available online: https://digital-strategy.ec.europa.eu/es/policies/regulatory-framework-ai (accessed on 31 January 2025).
  35. Koster, O.; Kosman, R.; Visser, J. A Checklist for Explainable AI in the Insurance Domain. arXiv 2021, arXiv:2107.14039. [Google Scholar]
  36. Zacharias, J.; von Zahn, M.; Chen, J.; Hinz, O. Designing a feature selection method based on explainable artificial intelligence. Electron. Mark. 2022, 32, 2159–2184. [Google Scholar] [CrossRef]
  37. Kuo, K.; Lupton, D. Towards Explainability of Machine Learning Models in Insurance Pricing. arXiv 2020, arXiv:2003.10674. [Google Scholar]
  38. Holzinger, A.; Carrington, A.; Müller, H. Measuring the Quality of Explanations: The System Causability Scale (SCS). KI—Kunstliche Intell. 2020, 34, 193–198. [Google Scholar] [CrossRef]
  39. Alvarez-Melis, D.; Jaakkola, T.S. On the robustness of interpretability methods. arXiv 2018, arXiv:1806.08049. [Google Scholar]
  40. Lozano, C.; Romero, F.; Serrano, J.; Olivas, J.A. A comparison between explainable machine learning methods for classification and regression problems in the actuarial context. Mathematics 2023, 11, 3088. [Google Scholar] [CrossRef]
  41. Pagano, T.; Loureiro, R.; Lisboa, F.; Peixoto, R.; Guimaraes, G.; Cruz, G.; Araujo, M.; Santos, L.; Cruz, M.; Oliveira, E.; et al. Bias and Unfairness in Machine Learning Models: A Systematic review on datasets, tools, fairness metrics, and identification and mitigation methods. Big Data Cogn. Comput. 2023, 7, 31. [Google Scholar] [CrossRef]
  42. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.L.; Mishkin, P.; et al. Training language models to follow instructions with human feedback. arXiv 2022, arXiv:2203.02155. [Google Scholar]
  43. Martin, M.; Christopher, P.H.; Martin, C. Creating ethics guidelines for artificial intelligence and big data analytics customers: The case of the consumer European insurance market. Patterns 2021, 2, 100362. [Google Scholar]
  44. Tredger, E.R.W.; Lo, J.T.H.; Haria, S.; Lau, H.H.K.; Bonello, N.; Hlavka, B.; Scullion, C. Bias, guess and expert judgement in actuarial work. Br. Actuar. J. 2016, 21, 545–578. [Google Scholar] [CrossRef]
  45. Tsang, H.; Lee, W.; Tsui, E. AHP-Driven Knowledge Leakage Risk Assessment Model: A Construct-Apply-Control Cycle Approach. Int. J. Knowl. Syst. Sci. (IJKSS) 2016, 7, 1–18. [Google Scholar] [CrossRef]
  46. Fair Isaac Corporation. FICO Community. Available online: https://community.fico.com/s/explainable-machine-learning-challenge (accessed on 31 January 2024).
  47. Moscato, V.; Picariello, A.; Sperlí, G. A benchmark of machine learning approaches for credit score prediction. Expert Syst. Appl. 2021, 165, 113986. [Google Scholar] [CrossRef]
  48. Morio, J.; Balesdent, M. Estimation of Rare Event Probabilities in Complex Aerospace and Other Systems; Elsevier: Amsterdam, The Netherlands, 2015. [Google Scholar]
  49. AAE Artificial Intelligence and Data Science Working Group. Explainable Artificial Intelligence for C-Level Executives in Insurance; Discussion Paper; Actuarial Association of Europe, 2024. Available online: https://actuary.eu/wp-content/uploads/2024/08/AAE-Discussion-Paper-on-XAI-DEF.pdf (accessed on 7 April 2025).
  50. Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.-I. Explainable AI for Trees: From Local Explanations to Global Understanding. arXiv 2019, arXiv:1905.04610. [Google Scholar] [CrossRef]
Figure 1. Protocol for implementing XAI techniques in actuarial problem contexts.
Figure 2. Feature ranking by XAI technique for the gradient boosting model.
Figure 3. Feature ranking by XAI technique for the random forest model.
Table 1. Main XAI techniques.
Technique | Interpretability | Origin | Scope
Counterfactual explanations [22] | Post hoc | Agnostic | Local
Decision trees [23] | Intrinsic | Specific | Global
Feature importance [7] | Post hoc | Agnostic | Global
Individual conditional expectation (ICE) [24] | Post hoc | Agnostic | Local
Accumulated local effects (ALE) [25] | Post hoc | Agnostic | Global
Interaction measure (H-statistic) [26] | Post hoc | Agnostic | Local
LIME [27] | Post hoc | Agnostic | Local
Partial dependence plot [5] | Post hoc | Agnostic | Global
Sensitivity analysis [29] | Post hoc | Agnostic | Local and global
Rule extraction [28] | Post hoc | Agnostic | Global
SHAP (Shapley explanations) [30] | Post hoc | Agnostic | Local and global
Surrogate models [31] | Post hoc | Agnostic | Local and global
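As an illustration of how the model-agnostic, post hoc techniques in Table 1 are typically invoked in practice, the following sketch applies partial dependence (a global view) and LIME (a local view) to a fitted gradient boosting classifier. It assumes a Python environment with scikit-learn and the lime package and uses a synthetic dataset with placeholder feature names; it is not the pipeline of the case study.

```python
# Illustrative sketch (not the paper's exact pipeline): applying two of the
# model-agnostic, post hoc techniques from Table 1, partial dependence (global)
# and LIME (local), to a fitted gradient boosting classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence
from lime.lime_tabular import LimeTabularExplainer

# Synthetic stand-in for an actuarial dataset (feature names are placeholders).
X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
feature_names = [f"x{i}" for i in range(X.shape[1])]

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Global view: average partial dependence of the model output on feature x0.
pd_result = partial_dependence(model, X, features=[0], kind="average",
                               grid_resolution=5)
print("Average partial dependence on x0:", np.round(pd_result["average"][0], 3))

# Local view: LIME explanation for a single applicant/policyholder.
explainer = LimeTabularExplainer(X, feature_names=feature_names,
                                 mode="classification")
explanation = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
print(explanation.as_list())
```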
Table 2. Key evaluation criteria for XAI techniques.
Criteria | Key Points
Model Type | Different XAI methods work better with certain models (e.g., deep learning vs. linear/decision trees); agnostic techniques are model-independent.
Available Methods | A variety exists (rule-based, post-processing, visualization); selection depends on computational resources and specific explainability goals.
Multiple Techniques | Combining techniques can yield complementary insights and enhance interpretability.
Evaluation | Assess based on the quality of information, ease of interpretation, computational cost, and robustness/sensitivity to input variations.
Table 3. XAI evaluation criteria according to objectives.
Aspect | Key Point/Example
Interpretation of Key Features | Explain critical variables (e.g., income and age in credit models).
Consistency and Stability | Ensure explanations are robust and consistent (e.g., high C(E) values).
Feedback | Incorporate actuary feedback for improvements (e.g., surcharges based on age/activity).
Ease of Use | Seamlessly integrate into workflows and assess computational efficiency (using TO formula).
Table 4. XAI techniques according to objectives.
Goal | Technique
Technical Validation:
Model optimization | Global–local
Detection of biases or technical discrimination | Global
Model selection | Feature importance; partial dependence plot; rule extraction; surrogate models
Understanding relationships in the base model | Local
Decision-Making:
Explanation to third parties | SHAP; PDP
Contrasting explainability results | Global
Expert review and contrast analysis | SHAP; PDP
Audit Process:
Internal control and/or regulatory compliance | Global and local: Shapley, LIME, PDP
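The goal-to-technique rows of Table 4 can be encoded as a simple lookup that the protocol's selection step consults before the evaluation phase. The sketch below is illustrative only: the dictionary keys and the function name are not taken from the paper, and rows of Table 4 that indicate a scope (global/local) rather than concrete techniques are omitted.

```python
# Minimal sketch (names are illustrative, not from the paper): encoding the
# goal-to-technique rows of Table 4 as a lookup so the protocol's selection
# step can propose candidate XAI methods for a stated objective.
CANDIDATE_TECHNIQUES: dict[str, list[str]] = {
    "model_selection": ["feature importance", "partial dependence plot",
                        "rule extraction", "surrogate models"],
    "explanation_to_third_parties": ["SHAP", "PDP"],
    "expert_review_and_contrast": ["SHAP", "PDP"],
    "internal_control_or_regulatory_compliance": ["SHAP (Shapley)", "LIME", "PDP"],
}

def suggest_techniques(goal: str) -> list[str]:
    """Return the candidate XAI techniques listed in Table 4 for a given goal."""
    return CANDIDATE_TECHNIQUES.get(goal, [])

print(suggest_techniques("internal_control_or_regulatory_compliance"))
# ['SHAP (Shapley)', 'LIME', 'PDP']
```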
Table 5. Suggested evaluation framework by XAI attributes (sub-criteria grouped under the Usability, Interpretability, and Confidence attributes).
Explainability Goal | Consistency | Decision-Making | Replicability | Fairness | Coherence | Simplicity | Faithfulness | Stability and Robustness
Consistency and stability of explanations | 18% | 5% | 7% | 3% | 14% | 10% | 21% | 22%
Feedback and continuous improvement | 22% | 7% | 21% | 3% | 10% | 5% | 14% | 18%
Ease of use and practical applicability | 5% | 21% | 22% | 3% | 7% | 18% | 10% | 14%
Relevance and accuracy of explanations | 10% | 7% | 5% | 3% | 21% | 14% | 22% | 18%
Clarity and comprehensibility of explanations | 21% | 7% | 5% | 3% | 10% | 22% | 14% | 18%
Transparency of the modeling process | 14% | 5% | 3% | 18% | 10% | 7% | 22% | 21%
Regulatory compliance | 10% | 3% | 7% | 22% | 14% | 5% | 21% | 18%
Table 6. FICO models results.
Model | AUC | Accuracy | Precision | Recall | F1
Decision tree (DT) | 0.71 | 0.71 | 0.71 | 0.67 | 0.69
KNN | 0.70 | 0.70 | 0.67 | 0.73 | 0.70
Artificial neural networks (ANN) | 0.71 | 0.71 | 0.73 | 0.70 | 0.71
Random forest (RF) | 0.79 | 0.71 | 0.72 | 0.65 | 0.69
Gradient boosting (GB) | 0.80 | 0.73 | 0.73 | 0.69 | 0.70
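For readers who wish to approximate the baseline in Table 6, the following hedged sketch trains the two strongest models, gradient boosting and random forest, and reports the same metrics. The file name heloc.csv and the target column RiskPerformance are assumptions about a local copy of the public FICO HELOC dataset [46]; without the exact preprocessing and tuning of the case study, the resulting figures will only be close to, not identical with, those in Table 6.

```python
# Hedged sketch of how Table 6-style metrics (AUC, accuracy, precision, recall, F1)
# could be obtained for the two strongest models. "heloc.csv" and the target column
# "RiskPerformance" are assumptions about the public FICO HELOC data [46]; the
# paper's exact preprocessing is not reproduced here.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

df = pd.read_csv("heloc.csv")                      # assumed local copy of the dataset
y = (df["RiskPerformance"] == "Bad").astype(int)   # assumed binary target encoding
X = df.drop(columns=["RiskPerformance"])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)

for name, clf in [("GB", GradientBoostingClassifier(random_state=0)),
                  ("RF", RandomForestClassifier(n_estimators=300, random_state=0))]:
    clf.fit(X_tr, y_tr)
    proba = clf.predict_proba(X_te)[:, 1]
    pred = clf.predict(X_te)
    print(name,
          "AUC", round(roc_auc_score(y_te, proba), 2),
          "Acc", round(accuracy_score(y_te, pred), 2),
          "Prec", round(precision_score(y_te, pred), 2),
          "Rec", round(recall_score(y_te, pred), 2),
          "F1", round(f1_score(y_te, pred), 2))
```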
Table 7. Evaluation results of technical attributes in XAI FICO gradient boosting (sub-criteria grouped under the Usability, Interpretability, and Confidence attributes).
Technique | Consistency | Decision-Making | Replicability | Fairness | Coherence | Simplicity | Faithfulness | Stability and Robustness
LIME | 0.8298 | 0.500 | 0.739 | 1.000 | 0.667 | 0.333 | 0.500 | 0.908
SHAP | 0.9418 | 0.600 | 0.245 | 1.000 | 0.500 | 0.500 | 0.495 | 0.705
PERM | 0.6905 | 0.200 | 0.727 | 1.000 | 0.500 | 1.000 | 0.500 | 0.962
MS | 0.6379 | 0.200 | 0.879 | 1.000 | 0.500 | 0.250 | 0.490 | 0.416
PDV | 0.8653 | 0.500 | 0.696 | 1.000 | 0.667 | 0.667 | 0.485 | 0.912
Table 8. Evaluation results of technical attributes in XAI FICO random forest (sub-criteria grouped under the Usability, Interpretability, and Confidence attributes).
Technique | Consistency | Decision-Making | Replicability | Fairness | Coherence | Simplicity | Faithfulness | Stability and Robustness
LIME | 0.8298 | 0.500 | 0.706 | 1.000 | 0.900 | 0.500 | 0.492 | 0.945
SHAP | 0.9418 | 0.600 | 0.211 | 1.000 | 0.857 | 0.714 | 0.482 | 0.635
PERM | 0.6905 | 0.200 | 0.689 | 1.000 | 0.818 | 0.455 | 0.497 | 0.970
MS | 0.6379 | 0.200 | 0.732 | 1.000 | 0.600 | 0.333 | 0.492 | 0.573
PDV | 0.8653 | 0.500 | 0.691 | 1.000 | 0.800 | 1.000 | 0.482 | 0.962
Table 9. Weights of the metrics for the case study.
XAI Attribute | Metric | Evaluation of Prediction | Models Audit
Usability | Consistency | 24.75% | 14.00%
Usability | Decision-making | 18.00% | 4.00%
Usability | Replicability | 22.00% | 10.00%
Interpretability | Coherence | 7.00% | 18.00%
Interpretability | Simplicity | 4.00% | 7.00%
Confidence | Faithfulness | 14.00% | 24.75%
Confidence | Stability and robustness | 10.00% | 22.00%
Table 10. Evaluation results of technical attributes in XAI FICO models—Phase 1 (evaluation of prediction), GB.
Attribute | LIME | SHAP | PERM | MS | PDV
Usability | 0.458 | 0.395 | 0.367 | 0.387 | 0.457
Interpretability | 0.060 | 0.055 | 0.075 | 0.045 | 0.073
Confidence | 0.161 | 0.140 | 0.166 | 0.110 | 0.159
Results | 0.679 | 0.590 | 0.608 | 0.543 | 0.690
Table 11. Evaluation results of technical attributes in XAI FICO models—Phase 1 (evaluation of prediction), RF.
Attribute | LIME | SHAP | PERM | MS | PDV
Usability | 0.451 | 0.388 | 0.359 | 0.355 | 0.456
Interpretability | 0.083 | 0.089 | 0.075 | 0.055 | 0.096
Confidence | 0.163 | 0.131 | 0.166 | 0.126 | 0.164
Results | 0.697 | 0.607 | 0.600 | 0.536 | 0.716
Table 12. Evaluation results of technical attributes in XAI FICO models—Phase 2 (models audit), GB.
Attribute | LIME | SHAP | PERM | MS | PDV
Usability | 0.210 | 0.180 | 0.177 | 0.185 | 0.211
Interpretability | 0.143 | 0.125 | 0.160 | 0.108 | 0.167
Confidence | 0.324 | 0.278 | 0.335 | 0.213 | 0.321
Results | 0.677 | 0.583 | 0.673 | 0.506 | 0.698
Table 13. Evaluation results of technical attributes in XAI FICO models—Phase 2 (models audit), RF.
Attribute | LIME | SHAP | PERM | MS | PDV
Usability | 0.207 | 0.177 | 0.174 | 0.171 | 0.210
Interpretability | 0.197 | 0.204 | 0.179 | 0.131 | 0.214
Confidence | 0.330 | 0.259 | 0.336 | 0.248 | 0.331
Results | 0.733 | 0.640 | 0.689 | 0.550 | 0.755
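The relationship between the attribute-level scores (Tables 7 and 8), the objective-specific weights (Table 9), and the aggregated results (Tables 10, 11, 12 and 13) can be summarised as a weighted sum per attribute group. The sketch below is a minimal reconstruction under that assumption, with illustrative variable names; using the LIME row of Table 7 and the evaluation-of-prediction weights of Table 9, it reproduces the first column of Table 10 up to rounding.

```python
# Minimal sketch of the weighted aggregation implied by Tables 9-13: each XAI
# technique's attribute scores are combined with objective-specific weights to
# give per-attribute and overall scores. Variable names are illustrative.
WEIGHTS_PREDICTION = {            # Table 9, "Evaluation of Prediction" column
    "consistency": 0.2475, "decision_making": 0.18, "replicability": 0.22,   # usability
    "coherence": 0.07, "simplicity": 0.04,                                   # interpretability
    "faithfulness": 0.14, "stability": 0.10,                                 # confidence
}

LIME_GB_SCORES = {                # Table 7, LIME row (fairness is not weighted here)
    "consistency": 0.8298, "decision_making": 0.500, "replicability": 0.739,
    "coherence": 0.667, "simplicity": 0.333,
    "faithfulness": 0.500, "stability": 0.908,
}

GROUPS = {
    "usability": ["consistency", "decision_making", "replicability"],
    "interpretability": ["coherence", "simplicity"],
    "confidence": ["faithfulness", "stability"],
}

def aggregate(scores, weights):
    """Weighted sum per attribute group plus the overall result."""
    per_group = {g: sum(scores[m] * weights[m] for m in metrics)
                 for g, metrics in GROUPS.items()}
    per_group["result"] = sum(per_group.values())
    return per_group

print({k: round(v, 3) for k, v in aggregate(LIME_GB_SCORES, WEIGHTS_PREDICTION).items()})
# {'usability': 0.458, 'interpretability': 0.06, 'confidence': 0.161, 'result': 0.679}
```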
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
