XAI for Churn Prediction in B2B Models: A Use Case in an Enterprise Software Company

The literature related to Artificial Intelligence (AI) models and customer churn prediction is extensive and rich in Business to Customer (B2C) environments; however, research in Business to Business (B2B) environments is not sufficiently addressed. Customer churn in the business environment and more so in a B2B context is critical, as the impact on turnover is generally greater than in B2C environments. On the other hand, the data used in the context of this paper point to the importance of the relationship between customer and brand through the Contact Center. Therefore, the recency, frequency, importance and duration (RFID) model used to obtain the customer's assessment from the point of view of their interactions with the Contact Center is a novelty and an additional source of information to traditional models based on purchase transactions, recency, frequency, and monetary (RFM). The objective of this work consists of the design of a methodological process that contributes to analyzing the explainability of AI algorithm predictions, Explainable Artificial Intelligence (XAI), for which we analyze the binary target variable abandonment in a B2B environment, considering the relationships that the partner (customer) has with the Contact Center, and focusing on a business software distribution company. The model can be generalized to any environment in which classification or regression algorithms are required.


Introduction
In a global market, where customers can change their preferences and buy from competitors, it is necessary to adopt strategies that encourage customer-brand engagement, for example, by proposing alternatives to strategically profitable customers, who could have a positive tendency to abandon the relationship with the brand, or by letting the less profitable ones go [1,2]. Through the RFID model [3] based on the recency, frequency, importance, and duration of interactions between the customer and Contact Center, a metric is proposed that makes it possible to determine the value of the customer from the perspective of after sales services, and therefore to design the most recommendable actions in order to build customer loyalty. However, these decisions, which could be left in the hands of black box algorithms, must be subject to interpretability to avoid discriminatory biases and to be able to make explainable decisions, thus constituting the cornerstone of this document.
Traditionally, customer churn studies are closely related to Business to Customer (B2C) environments. In fact, customer characteristics and behavior can vary considerably depending on whether the relationships are Business to Business (B2B) or B2C [4]. Although companies that base their business model on relationships with other companies tend to have fewer customers, these customers make larger and more frequent purchases compared to their counterparts in a B2C environment [5], and their retention is seen as fundamental in the development of sustainable business relationships [6,7], hence the importance of fields, there are a limited number of studies that report a research process applied to the real world. Secondly, studies related to customer churn are very numerous in the B2C environment, where machine learning (ML) or deep learning (DL) procedures are applied with a significant degree of prediction, but without applying interpretability (XAI). Finally, we use a complementary model to the traditional RFM and LRFMP as a predictive analysis criterion, the RFID model, to respond to potential customer churn, as we have not found works that use the variables that typify customer service (RFID) in the prediction of abandonment.
In the rest of this paper we will develop and apply the XAI model, according to the following structure: in Section 2, we will review the current status of the use of XAI methodologies and their application scenarios, contrasting the GAP between the use of ML algorithms and the use of explainability in relation to the customer churn rate; in Section 3, we will address the methodological framework that we will use in the prediction and explainability of churn; in Section 4, we will detail the proposed model; in Section 5, we will implement the XAI model applied to the customer churn rate in a B2B model and within a business environment dedicated to the distribution and implementation of software licenses; and finally, in Sections 6 and 7 we will present the conclusions and future work.

Related Work on Customer Churn B2C and B2B
Customer churn has been one of the main topics of attention for researchers and companies, with abundant literature in B2C environments (Figure 1), as the loss of a customer has a direct impact on the bottom line of any company, in addition to the loss of brand image, and since attracting a new customer is substantially more costly financially than retaining existing customers [16].
The following graph shows the studies related to customer churn in B2C environments until September 2022.
Research production in recent years has been mainly oriented towards the telecom, commerce, banking, and insurance sectors (Table 1). Some significant examples are related to churn in the telecommunication industry [17], in the banking sector [18], in the insurance sector [19], in the retail sector [20], and in e-commerce [21].
B2B models have received less attention than B2C models, and there is a total of 17 articles published from 1999 to August 2022 (Figure 2, Table 2). The characteristics of the B2B business, with a lower impact in number of customers, but with much higher transactional values, make these models acquire special connotations, since the loss of any customer can have a very negative impact on turnover and brand image [22,23]. In addition, customer churn in B2B scenarios has been studied mainly from the perspective of resource allocation for business development, or in the analysis and prediction of current and future customer profitability [24]. Figure 2 below shows the publications per year related to customer churn in B2B environments, and Table 2 shows these papers.
The strategies followed in these works are related to the use of transactional data based on the RFM model [9,28,29,31]; to the relationship between supplier and customer over time in different phases, before, during, and after the purchase process, known in marketing as the Customer Journey Map [25]; or to the collection of sales and interaction data [19,27,30,36–38]. One study [26] uses a metric based on the benefit implied by correctly classifying a customer and the cost associated with those who are incorrectly classified, and another uses the quality of service to determine the subscription rate [33]. As a general rule, all these studies are based on a combination of interpretable (white box) and non-interpretable (black box) predictive models, but without applying interpretability techniques; only one study proposes a customer segmentation that combines predictive analysis with interpretability [32].

Related Work XAI
Decisions based on ML algorithms are having an increasingly significant social impact; however, most of these systems are based on black box algorithms, i.e., models whose rules are not understandable to humans [39].
Since its beginnings, AI research has been characterized by the development and implementation of predictive models. The first steps in interpretable models, however, were taken between the 1970s and the 1990s through initiatives such as MYCIN [40], which sought an explanation in the diagnosis of infectious diseases, and GUIDON [41], in the elaboration of computer-assisted learning; systems based on alternative lines of reasoning (TMS) and neural networks applied to the healthcare field were also developed. Since 2010, the concern derived from bias in decision making has led to more focus on the development of Explainable Artificial Intelligence (XAI) models. Explainability requires interpretability, but explainability has to do with the need for the explanation to be deep enough to be audited [42].
According to Miller, "Interpretability is the degree to which a human can understand the cause of a decision." [43]. It is essential to understand why a given prediction was made by the model in question.
Features that should be incorporated in interpretable models [14] can be enumerated as follows: explanations should be contrasting [44], why a certain prediction was made rather than another. In addition, explanations are selected: we are interested in selecting the criteria that fit as most probable in the elaboration of the explanation. Explanations should be social, i.e., an explanation is linked to the explainer and the receiver of the explanation. Explanations focus on the abnormal [45], i.e., causes that are attributed with high potential but low probability. Explanations are true, so the event should be predicted with the highest possible probability. The explanations are consistent with previous beliefs: this is what is called confirmation bias, devaluing those explanations that do not agree with our beliefs [46].
The first formula to achieve interpretability is to use interpretable ML algorithms, including linear regression, logistic regression, decision trees, RuleFit, and Naive Bayes [14], thus deducing correlations between features that allow defining and interpreting the model at a global level [47].
Another option is to extract knowledge from a black box model by approximating it to interpretable models [48,49].
Finally, we have agnostic methods, whose implementation does not depend on the ML model used [50]. A review of agnostic models according to their global/local character is presented in Figure 3.
The current trend is to focus on model-independent interpretation tools [14,50,51]. The following is a list of studies related to interpretability methods applied to black box ML models (Table 3). In the following studies in Table 4, the interpretability models applied to the churn rate are explored in more detail:
Table 4. XAI, (TS = (CUSTOMER CHURN) AND TS = (XAI)) AND (PY = (1999-2022)).
As can be seen, interpretability applied to customer churn prediction is a technique that is in the process of research and practical application, especially in B2B models. In this paper, we provide a set of interpretability techniques applied to real data, corresponding to a management software manufacturing company, using the RFID model by aggregating the interactions between customer and supplier in a predetermined period.

Methodology
To achieve our goals, we propose a methodology based on knowledge discovery in databases (KDD) and the cross-industry standard process for data mining (CRISP-DM) [62]. Figure 4 shows the stages and the models used in each of them.

RFID Model
The RFID model is based on the parameters of recency, frequency, importance, and duration of interactions between the customer and Contact Center during a defined period of time [3]. This model helps us to determine the value of the customer from the point of view of their interactions with the Contact Center, as well as providing us with a segmentation and a strategy of actions to be carried out for each group of customers.
From the ticket information stored in a conventional operational CRM, the model obtains two types of recommendations for customers based on the history of their interactions with the customer service: individualized and grouped. The model is parameterized with the information provided by customer service experts. These same users are also in charge of determining and implementing the final strategies for the treatment of marketing campaigns and/or interaction with customers (Figure 5). The process is detailed below:

1. Obtaining data from the CRM, which correspond to the set of tickets opened by each customer.
2. Pre-processing of the information: the period of analysis is defined, and an initial exploratory data analysis (EDA) is carried out.
3. Information aggregation process: for each customer, and for the period considered, the values of the recency, frequency, duration, and importance of the interactions are obtained. A process of information aggregation is carried out, so that an aggregate value is obtained for each customer for each of the characteristics that make up the RFID model.
4. Application of the 2-tuple model [63] on the data obtained in the previous step, the aim of which is to bring all the information into the same working domain. The 2-tuple model allows working with heterogeneous information, unifying this information in linguistic evaluations, expressed in a basic set of S linguistic terms. In this way, all the heterogeneous information based on numerical, interval, or linguistic ranges can be unified in a fuzzy set, through an aggregation process (Figure 6).
5. Obtaining the global valuation of each client, by applying the AHP model [65] to each of the features that make up the RFID model (Table 5). In our model, we will use the AHP method to establish the weights of each of the criteria that will determine the total score of each customer, after the aggregation and ranking process using the RFID 2-tuple model.

Table 5. Saaty's scale [65].

| Intensity | Definition | Explanation |
|---|---|---|
| 1 | Equal importance | The comparative weighting of the criteria i and j is the same. |
| 3 | Moderate importance | The weighting of the criteria compared is moderately higher for the criterion i over the criterion j. |
| 5 | Strong importance | The weighting of the criteria compared is strongly higher for the criterion i over the criterion j. |
| 7 | Very strong importance | The weighting of the criteria compared is very strong for the criterion i over the criterion j. |
| 9 | Extreme importance | The weighting of the criteria compared is extremely strong for the criterion i over the criterion j. |
| Reciprocals | | If criterion i compared to criterion j is associated with one of the preceding numbers, then j has a reciprocal when compared to i. |

The vector of weights for each of the criteria, $w$, is constructed using the eigenvector method through the following equation:

$$PW_i \cdot w_i = \lambda_{\max} \cdot w_i$$

where $\lambda_{\max}$ is the maximum eigenvalue of $PW_i$ and $w_i$ is the normalized eigenvector associated with the principal eigenvalue of $PW_i$. This approach provides the best priority weightings for each criterion or sub-criterion (Figure 7). A review of the AHP method and its applications can be found in the following references [66,67].
6. Establishment of an individualized recommendation strategy.
7. Customer clustering, according to the k-means model [68].
8. Obtaining a recommendation strategy by groups.

Mathematics 2022, 10, x FOR PEER REVIEW
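The eigenvector weighting used in step 5 can be illustrated with a small sketch. It approximates the principal eigenvector of a pairwise comparison matrix by power iteration, using only the Python standard library. The function name `ahp_weights` and the judgements in the matrix `P` are hypothetical and chosen only for illustration; they are not the values used in the study.

```python
def ahp_weights(pairwise, iterations=100, tol=1e-12):
    """Approximate the normalized principal eigenvector of a positive
    pairwise comparison matrix by power iteration."""
    n = len(pairwise)
    w = [1.0 / n] * n  # start from uniform weights
    for _ in range(iterations):
        nxt = [sum(pairwise[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(nxt)
        nxt = [v / total for v in nxt]  # normalize so the weights sum to 1
        converged = max(abs(a - b) for a, b in zip(nxt, w)) < tol
        w = nxt
        if converged:
            break
    # Estimate lambda_max as the mean ratio (P w)_i / w_i.
    pw = [sum(pairwise[i][j] * w[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(pw[i] / w[i] for i in range(n)) / n
    return w, lambda_max


# Hypothetical judgements on Saaty's scale, in the order
# recency, frequency, importance, duration (illustrative only).
P = [
    [1,     3,     1 / 3, 5],
    [1 / 3, 1,     1 / 5, 3],
    [3,     5,     1,     7],
    [1 / 5, 1 / 3, 1 / 7, 1],
]
weights, lambda_max = ahp_weights(P)
for name, w in zip(["recency", "frequency", "importance", "duration"], weights):
    print(f"{name}: {w:.3f}")
print(f"lambda_max: {lambda_max:.3f}")
```

For a perfectly consistent matrix, the estimated maximum eigenvalue equals the number of criteria; the further it lies above that number, the less consistent the expert judgements are.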
In our study, we will transform the data into a numerical domain, integrating the variable abandonment, and then develop the predictive model and analyze its interpretability.

XAI
From the interpretability point of view, some authors distinguish two types of models [69]: white box models, which allow one to establish correspondence between input and output; and black box models, in which the rules on which they base their decision making must be interpretable [69]. In this other study, the interpretability of white box models is questioned [70]. An inverse correspondence between interpretability and accuracy can be seen in Figure 8.
Interpretation methods for machine learning can be classified according to several criteria [70].
• Intrinsic or post hoc? Interpretability is either inherent to the learning model (intrinsic), or it allows for analysis after model training (post hoc).
• Specific or agnostic? Interpretability is achieved in a specific way by applying specific models for the object of study. Agnostic models are independent of the ML algorithm.
• Local or global? It is necessary to respond to individual or global predictions. Global methods describe the average behavior of the ML model, and they are very useful when you want to analyze the overall mechanism of the data. Local methods, however, explain individual predictions.
The main XAI techniques are shown in Table 6. These techniques will be applied to the RFID dataset to analyze the interpretability of the ML algorithms used for the detection of technology partner abandonment in the proposed business case. In our study, we will apply a set of ML algorithms and try to analyze the interpretability of the algorithm with higher accuracy and a higher ROC AUC curve score.

Application of XAI to Customer Churn Rate
One of the strengths of the Contact Center is to try to maximize customer satisfaction, and an important variable in this regard is the degree of satisfaction of the Contact Center staff [72].
In this methodological guide, we will approach a working procedure whose objective is to analyze the binary target variable abandonment, based on the values obtained by the RFID model, and in future works the proposed model will be extended to Contact Center staff turnover.
Following the KDD and CRISP-DM methodology [62] (Figure 4):
• In Section 4.1, we will review the problem domain, create a target dataset based on the RFID model to which we will add the customer cancellation request variable (abandonment), then pre-process and transform the data to a numeric domain.
• In Section 4.2, a set of pre-model techniques will be applied to obtain the first knowledge provided by the dataset. Subsequently, the ML algorithms detailed in Table 7 will be applied to obtain the optimal algorithm for the case study.
• In Section 4.3, we will interpret the results and assess whether or not interpretability is needed, in which case we will apply the global and local agnostic algorithms seen in Section 3 of this paper; finally, conclusions will be drawn.

Data Collection
The data are collected from the CRM, and we will execute the process described in Section 3.1. In this case, we are interested in the value of recency, frequency, importance, and duration in numerical format. We will carry out the aggregation process and from there we will store this information to determine the binary target variable abandonment in relation to the rest of the criteria.
The type of incident recorded in the CRM indicates whether or not it is a request for cancellation by the customer: this is the case when the CRM attribute Type = "Cancellation Request". For each customer $u_e$, with $t_2$ the end date of the analysis period, the following values are obtained:
• $r_e$: corresponds to the number of days since the last service request by the customer $u_e$ (using the end date of the analysis period as a reference). Therefore, $r_e = \mathit{diffdays}(t_2 - \max(\mathit{ticket\_date}_i))$, where $\mathit{diffdays}$ is a function that returns the difference in days between two dates, and $\max$ is a function that returns the latest of the incoming dates.
• $f_e$: is the number of times the customer has made a service request, i.e., the number of different ticket codes, $\mathit{ticket\_id}_i$.
• $i_e$: is the average importance. This is a linguistic variable that must be transformed into a numerical variable, $i_e = x_e[\mathit{ticket\_importance}_i]$.
• $d_e$: contains the total duration in days of all the customer's tickets.
• The abandonment variable contains the value of the service type, in this case the customer's service cancellation request.
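The aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the ticket tuple layout, the sample records, the numeric coding of importance, and the computation of ticket duration as closing date minus opening date are all hypothetical, since the actual CRM schema is not given in the text.

```python
from collections import defaultdict
from datetime import date

# Hypothetical ticket records:
# (customer_id, ticket_id, opened, closed, importance, type)
tickets = [
    ("c1", "t1", date(2022, 1, 10), date(2022, 1, 12), 3, "Incident"),
    ("c1", "t2", date(2022, 3, 1), date(2022, 3, 6), 5, "Cancellation Request"),
    ("c2", "t3", date(2022, 2, 15), date(2022, 2, 16), 1, "Incident"),
]
t2 = date(2022, 3, 31)  # end date of the analysis period


def aggregate_rfid(tickets, t2):
    """Aggregate tickets into one RFID record per customer."""
    by_customer = defaultdict(list)
    for row in tickets:
        by_customer[row[0]].append(row)

    result = {}
    for customer, rows in by_customer.items():
        last_date = max(r[2] for r in rows)
        result[customer] = {
            # r_e: days between the end of the period and the last request
            "recency": (t2 - last_date).days,
            # f_e: number of distinct ticket codes
            "frequency": len({r[1] for r in rows}),
            # i_e: average importance (already coded numerically here)
            "importance": sum(r[4] for r in rows) / len(rows),
            # d_e: total duration in days (assumed: closed - opened)
            "duration": sum((r[3] - r[2]).days for r in rows),
            # binary target: 1 if any ticket is a cancellation request
            "abandonment": int(any(r[5] == "Cancellation Request" for r in rows)),
        }
    return result


rfid = aggregate_rfid(tickets, t2)
for customer, values in rfid.items():
    print(customer, values)
```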
The next step will be to perform a data cleaning, that is, we will check if the information collected requires some kind of debugging, for example, outliers. It usually happens that incidents can be opened without a specific customer, with these being imputed to generic customers, growing in number above the average, and thus distorting the information collected and therefore the analysis.
Once the data cleaning is done, a normalization process is carried out. The machine learning algorithms work best when the numerical input variables fall within a similar scale. In this case, we will normalize in the range (0,5).
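As an illustration of this normalization step, the sketch below applies min-max scaling to map a feature onto the interval from 0 to 5. The source does not state which scaling method was used, so min-max scaling is an assumption, and the function name `scale_to_range` and the sample values are hypothetical.

```python
def scale_to_range(values, low=0.0, high=5.0):
    """Min-max scale a list of numbers onto [low, high]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant feature carries no information; map it to the lower bound.
        return [low for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]


recency = [30, 44, 2, 120]  # hypothetical recency values in days
scaled = scale_to_range(recency)
print([round(v, 3) for v in scaled])
```

Each RFID feature would be scaled independently, so that recency, frequency, importance, and duration end up on a comparable scale before training.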

Customer Churn Prediction
Once the above steps are completed, some of the pre-model, data visualization, and exploration techniques are applied to explore, interpret, and gain initial insight into the dataset and thus predict churn or non-churn. The application of these techniques will help to identify the key features of the model and, being model-independent, they are applicable to any dataset and prior to any initial selection of the chosen ML model.
The first technique used is the univariate analysis, through histograms. Secondly, a multivariate analysis allows us to establish a correlation map between variables [73], as well as the distribution of outcomes, and thus obtain an initial data analysis.
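A correlation map of this kind can be sketched with a plain Pearson correlation matrix. The choice of the Pearson coefficient is an assumption (the text does not name the coefficient), and the column values below are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences.
    Assumes neither sequence is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping column name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical normalized values for five customers.
columns = {
    "recency": [0.5, 1.0, 2.0, 4.0, 5.0],
    "frequency": [5.0, 4.0, 3.0, 1.0, 0.5],
    "abandonment": [0, 0, 0, 1, 1],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```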
Once the first approximation and evaluation of the dataset has been made, we can divide it into training and test, considering that the variable x corresponds to the RFID criteria (recency, frequency, importance, and duration) and y is the variable to be predicted, i.e., customer abandonment data (yes/no).
An analysis will be carried out using the algorithms shown in Table 7 to determine which of them best fits the predictive model. Each of the models described in Table 7 is evaluated through a K-fold cross-validation process, analyzing the receiver operating characteristic (ROC) curve and the area under the curve (AUC), together with the mean accuracy [74,75]. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1. The ROC curve can be seen in Figure 9, where the y-axis shows the true positive rate (TPR) and the x-axis shows the false positive rate (FPR). Accuracy is obtained as the quotient of the number of correct predictions by the total number of predictions.
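The evaluation metrics described above can be computed by hand on a small example. The sketch below derives accuracy, the (FPR, TPR) points of the ROC curve, and the AUC by the trapezoidal rule; the labels and scores are hypothetical, and the 0.5 decision threshold is an assumption.

```python
def accuracy(y_true, y_pred):
    """Correct predictions divided by the total number of predictions."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_points(y_true, scores):
    """(FPR, TPR) points obtained by lowering the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical labels (1 = abandonment) and predicted churn probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

points = roc_points(y_true, scores)
print("accuracy:", accuracy(y_true, y_pred))
print("ROC points:", points)
print("AUC:", auc(points))
```

In a K-fold setting, these metrics would be computed on each held-out fold and then averaged across folds.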

Table 7:
Logistic Regression
Random Forest
Support Vector Machine
K-nearest neighbors
Decision Tree Classifier
Gaussian NB
XGBoost
Figure 9. Example of ROC/AUC curve applied to customer abandonment.
Other algorithms could also have been used in the predictive process, such as deep neural networks [76], CatBoost, or LightGBM [77,78]. The objective is not so much to seek accuracy as to generate an ML model-independent explainability methodology.
Once the different machine learning models have been tested, we will discuss the explainability of each one of them versus the predictive capacity. We will keep the model that best meets the predictive expectation, and we will use interpretability in case it is necessary.

ML Interpretability Analysis
Next, this section will analyze the interpretability of the ML models described in Section 4.2. The methodology designed in this paper is extensible to any case in which we must predict a variable (classification or regression) based on the rest of the characteristics (Figure 10).

We will start by applying the partial dependency plot (PDP), which shows the effect that one or two features have on the prediction result of an ML model [53]. The diagram allows us to work with univariate and bivariate graphs, and it will help us to determine the correlation between variables.
The PDP is an average of the lines of an ICE diagram. In the next step, we therefore work with individual conditional expectation (ICE) curves, which offer a local view by focusing on individual data instances [56].
Next, ELI5 is used to measure the importance of the features; it helps us to see when our model may produce counterintuitive results. ELI5 allows us to fit the model using the XGBoost library and then analyze the importance of each feature within the applied model [55].
In the next step, LIME approximates the black box model locally through explainable models (linear regression, decision tree) to make its predictions interpretable [52].
Finally, we will apply SHAP, which tells us which characteristics were most influential when the model rated a customer with a low or high probability of abandonment [54].
In addition to the methods indicated above, the reliability of this study could be complemented with other methods of measurement by contrast, such as the Gini index, analysis of variance, Chi-squared test, regression t-test, and variance test.
All these evaluations will give us a global vision of the selected model and will explain which characteristics are determinant in customer abandonment, and thus guide the necessary compensatory actions to mitigate it.

Proposed Model Applied in an Enterprise Software Company
In this section, we present an example of the application of the methodological guide developed in Sections 3 and 4 of this paper. We will try to predict whether a partner (customer) abandons the relationship with the software company, based on the valuation of its relationship with the Contact Center. For a total of 200,615 partners, 198,493 remain after 3 years (2018-2020) and 2122 have left the partner relationship.
In the clustering process (k-means) applied to the above dataset, five clusters of partners are anticipated (Table 8), with the drop-out rates represented in Figure 11.
Table 8. Results of the k-means algorithm expressed in the 2-tuple model.
Following KDD and CRISP-DM methodology [62] (Figure 4), and once the problem domain has been revised, we add the cancellation request (churn) variable, Type = "Cancellation Request" (Figure 12), to the RFID dataset, thus obtaining the set RFIDT. Next, the data are cleaned and transformed to a numeric domain between 0 and 5, using the Python function MinMaxScaler. Since the data are unbalanced, we will then adjust the datasets to avoid this problem. After that, we will use different ML classification algorithms (Table 7) to analyze the relationship between accuracy and interpretability, applying interpretability techniques when accuracy is high but explainability is low. Finally, we will apply the global and local agnostic algorithms described in Sections 3 and 4 of this paper, and obtain and analyze the conclusions.


Data Acquisition, Processing, and Transformation
The data are collected from the CRM platform; once processed and transformed, we have the following description of the model (Table 9). As part of the process described in Sections 3 and 4, we will apply some of the pre-model, data visualization, and exploration techniques necessary to explore, interpret, and gain initial knowledge of the dataset. They help us to identify the key features of the model and, being model-independent, they are applicable to any dataset and prior to any initial ML model selection.
The following is a univariate analysis (Figure 13). For both cases, the number of partners is represented on the y-axis, and on the x-axis there is the time interval in days for recency and the number of interactions in the case of frequency.
In addition, for duration and importance we obtain (Figure 14):
Figure 14. Importance and duration of transactions by partner.
The following step displays the correlation matrix (Figure 15). There is a correlation between recency and frequency, but there is hardly any correlation between the criteria and abandonment. Other metrics can help to measure the nonlinear relationship of the characteristics, such as distance correlation, mutual information, and the maximum information coefficient. For the case study, we will use Pearson's correlation.
Next, in Table 10, we transform the data to the range [0,5] using the scaling function introduced above.
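To make the correlation step concrete, the sketch below computes a Pearson correlation matrix without external libraries. The column values are invented for illustration and do not come from the study's dataset; `pearson` and `correlation_matrix` are hypothetical helper names.

```python
import math


def pearson(xs, ys):
    """Pearson's linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict with the pairwise Pearson correlation of every column pair."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical values already scaled to [0, 5]; churn is the binary target.
data = {
    "recency": [0.5, 1.0, 2.0, 3.5, 4.5],
    "frequency": [0.4, 1.2, 2.1, 3.0, 4.8],
    "churn": [1, 0, 0, 1, 0],
}
matrix = correlation_matrix(data)
```

Pearson's coefficient only captures linear association, which is why the nonlinear alternatives mentioned in the text can be a useful complement.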

ML Algorithms Evaluation
Next, to analyze whether to apply the set of interpretability algorithms described in Section 3, a set of pre-model techniques will be applied to obtain the first knowledge of the dataset. Subsequently, each of the models described in Table 7 is evaluated through a cross-validation process (K-fold), and the receiver operating characteristic (ROC) and area under the curve (AUC), and accuracy mean are analyzed, to obtain the optimal algorithm for the case study.
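The K-fold procedure mentioned above can be sketched as follows. This is a simplified illustration under our own assumptions: the folds are not stratified, the "model" is a hypothetical majority-class baseline, and the function names are ours rather than those of any library used in the study.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(fit, predict, X, y, k=5, seed=0):
    """Mean accuracy over k folds; each fold is used once as the test set."""
    folds = k_fold_indices(len(X), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        hits = sum(1 for j in test_idx if predict(model, X[j]) == y[j])
        scores.append(hits / len(test_idx))
    return sum(scores) / k


# Hypothetical baseline: always predict the majority class of the training fold.
def fit_majority(X_train, y_train):
    return 1 if sum(y_train) * 2 > len(y_train) else 0


def predict_majority(model, x):
    return model


X = [[i] for i in range(20)]
y = [0] * 16 + [1] * 4
mean_acc = cross_val_accuracy(fit_majority, predict_majority, X, y, k=5)
```

Any classifier from Table 7 could be plugged in through the `fit` and `predict` arguments; with a strongly unbalanced target, stratified folds would be the safer choice.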

The results obtained can be analyzed in Table 11. The model selected according to the procedure described is XGBoost, a black box model, which calls for the use of interpretability techniques.
XGBoost is used in supervised learning problems, where the objective is to predict a target variable $y_i$ from a set of variables $x_i$. A common example of supervised learning is linear regression, where the prediction $\hat{y}_i$ of a variable $y_i$ is obtained as $\hat{y}_i = \sum_k \beta_k x_{ik}$, i.e., the characteristics making up the input are weighted by the weights $\beta_k$.
Training a model means adjusting the parameters $\beta$, for which we need to define the objective function that best fits the training data $x_i$ and produces the best fitted value to $y_i$. A notable feature of objective functions is that they consist of two parts, the training loss and the regularization term:
$$\mathrm{obj}(\beta) = L(\beta) + \Omega(\beta)$$
where $L$ is the training loss function and $\Omega$ is the regularization term, which measures the complexity of the model. In this case, the model is defined as follows [79]:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$
where $K$ is the total number of trees and $f_k$ is a function in the function space $F$. The objective is to combine classification trees and determine which combination is best for our model.
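As a complement, the two-part objective can be written per training sample and per tree. The following is a sketch in the notation of the XGBoost documentation, not a formula taken from this paper: the symbols $l$, $n$, $T$, $w_j$, $\gamma$, and $\lambda$ do not appear in the text above and are introduced here only for illustration.

```latex
\mathrm{obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}
```

Here $l$ is a per-sample loss, $T$ is the number of leaves of a tree, $w_j$ is the score assigned to leaf $j$, and $\gamma$ and $\lambda$ are regularization hyperparameters; larger values penalize more complex trees.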

Data Unbalancing
Our churn class has very few samples in relation to the majority class (no churn = partner). This imbalance makes the training of the model deficient, since the model responds in a biased way when detecting the dropout pattern to be predicted.
In the first analysis performed using XGBoost, the model gave us the following results (Figure 16). As can be seen, the model presents an apparently extraordinary result, with almost 99% overall accuracy, but this is based on an accuracy of 100% in the majority class (no churn = 0) and 0% in the minority class (churn = 1). It is therefore essential to perform a data balancing process, and we must try to increase the degree of prediction of the minority class.
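The effect described here can be reproduced with simple arithmetic on the partner counts reported for this dataset. The sketch assumes a degenerate classifier that always predicts "no churn"; it is an illustration of why overall accuracy is misleading, not a reproduction of the model behind Figure 16.

```python
n_remaining = 198_493   # partners that stayed (class 0), as reported in the text
n_churned = 2_122       # partners that left (class 1), as reported in the text
n_total = n_remaining + n_churned

# Assumed degenerate classifier: every partner is predicted as "no churn".
overall_accuracy = n_remaining / n_total
majority_recall = n_remaining / n_remaining   # every stayer is classified correctly
minority_recall = 0 / n_churned               # no churner is ever detected
```

The overall accuracy is high only because the majority class dominates the sample; the per-class figures reveal that the churn pattern is not detected at all.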
To deal with the data imbalance in the dropout class, we modified the XGBoost training by setting the hyperparameter scale_pos_weight, which is designed to adjust the behavior of the algorithm in unbalanced classification problems. A suitable value for this parameter is the inverse of the class distribution, i.e., the ratio of majority to minority samples. For example, in a dataset where the ratio between the minority and majority class is 1 to 100, it is appropriate to apply a value of scale_pos_weight = 100 [80].
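The rule of thumb above can be written as a one-line computation. The sketch below applies it to the partner counts reported for this dataset; whether the study used exactly this ratio is not stated, so the resulting value should be read as an assumption. The helper function is hypothetical and merely shares its name with the XGBoost hyperparameter.

```python
def scale_pos_weight(n_negative, n_positive):
    """Ratio of majority (negative) to minority (positive) samples."""
    return n_negative / n_positive


# The 1-to-100 example from the text.
spw_example = scale_pos_weight(100, 1)

# Counts reported for the case study: 198,493 remaining and 2,122 churned partners.
spw_case = scale_pos_weight(198_493, 2_122)
churn_rate = 2_122 / (198_493 + 2_122)
```

In XGBoost, the resulting number would be passed through the `scale_pos_weight` parameter of the classifier.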
In addition, scale_pos_weight has been combined with the SMOTE-Tomek process [81], which consists of simultaneously applying a subsampling and an oversampling algorithm to the dataset. One of the resulting model trainings yields the following best result (Figures 17 and 18).
In exchange for losing precision in the majority class, we gain it in the minority class. The next step, because XGBoost is a black box algorithm, consists of developing the interpretability process described in Sections 3 and 4 of this paper.

Interpretability Techniques Application
Next, the interpretability model detailed in Sections 3 and 4.3 of this paper is applied, and a study of the results obtained will be carried out.

Partial Dependency Diagram (PDP)
When the model involves several variables, it is useful to analyze the partial dependence of one or two of them in relation to the prediction of the response variable. The PDP diagram allows this type of analysis; the shaded area represents the confidence interval [53]. As can be seen in the graphs, normalization has been carried out between 0 and 5.
The diagrams in Figures 19-21 show the influence of recency, frequency, and duration on the prediction of abandonment, and the diagram in Figure 22 shows the degree of correlation between recency and frequency. In the graphs presented, importance has not been considered, since the value is biased towards the mean value (M).
Finally, we include in this section the ICE plots, which are similar to the PD plots but offer a more detailed view on the behavior of nearly similar clusters around the mean curve of the PD plot. The ICE algorithm provides insight into the various variants of conditional relationships estimated by the black box (Figure 23).
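The relationship between ICE curves and the PDP can be sketched in a few lines. The scoring function below is a hypothetical stand-in for the trained black box, chosen only so that the example runs; the instances and grid are likewise invented.

```python
def ice_curves(predict, X, feature, grid):
    """One curve per instance: predictions as `feature` sweeps over `grid`."""
    curves = []
    for row in X:
        curve = []
        for value in grid:
            modified = list(row)
            modified[feature] = value   # change only the feature of interest
            curve.append(predict(modified))
        curves.append(curve)
    return curves


def partial_dependence(curves):
    """The PDP is the point-wise average of the ICE curves."""
    n = len(curves)
    return [sum(column) / n for column in zip(*curves)]


# Hypothetical black box over [recency, frequency], both scaled to [0, 5].
def toy_churn_score(row):
    recency, frequency = row
    return max(0.0, min(1.0, 0.9 - 0.1 * frequency - 0.05 * recency))


X = [[0.5, 0.5], [1.0, 3.0], [4.0, 4.5]]
grid = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ice = ice_curves(toy_churn_score, X, feature=1, grid=grid)
pdp = partial_dependence(ice)
```

Each ICE curve shows how a single partner's score would change with frequency, while the PDP summarizes the average trend across partners.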


Feature Importance (ELI5)
The concept of feature importance is simple: the importance of a given characteristic is assessed by calculating the increase in prediction error after permuting the values of this characteristic [82].
To make tree-based predictions more interpretable, each model prediction can be presented as a sum of feature contributions (plus bias), showing how the features lead to a particular prediction. ELI5 does this by showing the weights of each feature, indicating their influence on the final prediction decision across all trees. This is a good step in the direction of agnostic interpretation of the model, but it is not fully agnostic, as we will see later using LIME. The results obtained are shown in Table 12.
According to the results obtained, the importance criterion has the greatest weight in the evaluation of the characteristics, followed by frequency, duration, and recency. However, as we have seen, importance is biased towards the mean values within the whole sample.
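The permutation idea described at the start of this section can be sketched as follows. This is a generic, dependency-free illustration and does not use ELI5 itself; the threshold model and the data are hypothetical.

```python
import random


def error_rate(predict, X, y):
    """Fraction of instances whose prediction differs from the label."""
    return sum(1 for row, t in zip(X, y) if predict(row) != t) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in error after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = error_rate(predict, X, y)
    importances = []
    for feature in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[feature] for row in X]
            rng.shuffle(column)
            permuted = [row[:feature] + [v] + row[feature + 1:]
                        for row, v in zip(X, column)]
            increases.append(error_rate(predict, permuted, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical model: predicts churn when the first feature (scaled frequency)
# is below 1.5 and ignores the second feature entirely.
def toy_model(row):
    return 1 if row[0] < 1.5 else 0


X = [[0.2, 3.0], [0.8, 1.0], [1.2, 4.0], [2.5, 0.5], [3.5, 2.0], [4.8, 4.5]]
y = [toy_model(row) for row in X]
importances = permutation_importance(toy_model, X, y)
```

A feature the model ignores receives zero importance, because shuffling it never changes a prediction, whereas shuffling the feature the model relies on raises the error.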

Local Substitute (LIME)
LIME is a local model that works by checking what happens to the predictions when variations are introduced in the input data [52]. For this purpose, LIME generates new datasets with these variations and obtains the corresponding sets of predictions. The results applied to the model under study can be seen below.
In the first case (Figure 24), a record has been chosen for which non-abandonment is predicted with a probability of 99.64%. In Figure 25, the predicted probability of abandonment is 75.49%, which corresponds to a partner who has left the partner channel.
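The perturb-and-refit idea behind a local surrogate can be sketched as follows. This is a simplified illustration and not the LIME library: the Gaussian perturbations, the exponential proximity kernel, its `width`, and the `black_box` function are all assumptions made for the example.

```python
import math
import random


def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def local_surrogate(predict, instance, n_samples=500, width=0.75, noise=0.5, seed=0):
    """Fit a proximity-weighted linear model around `instance`.

    Returns [intercept, coef_1, ..., coef_d] of the local linear approximation.
    """
    rng = random.Random(seed)
    d = len(instance)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        sample = [v + rng.gauss(0.0, noise) for v in instance]
        dist2 = sum((s - v) ** 2 for s, v in zip(sample, instance))
        rows.append([1.0] + sample)                    # leading 1.0 is the intercept column
        targets.append(predict(sample))
        weights.append(math.exp(-dist2 / width ** 2))  # closer samples count more
    # Weighted normal equations: (X^T W X) beta = X^T W y
    A = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows)) for j in range(d + 1)]
         for i in range(d + 1)]
    b = [sum(w * r[i] * t for w, r, t in zip(weights, rows, targets))
         for i in range(d + 1)]
    return solve_linear_system(A, b)


# Hypothetical black box over [recency, frequency].
def black_box(row):
    recency, frequency = row
    return 1.0 / (1.0 + math.exp(-(1.0 - 0.8 * frequency + 0.3 * recency)))


coefs = local_surrogate(black_box, instance=[1.0, 0.5])
```

The sign and size of each coefficient indicate how the corresponding feature pushes the prediction in the neighborhood of the chosen record, which is the kind of reading offered by Figures 24 and 25.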

SHAP Values
The objective of the SHAP interpretability model is to be able to provide an explanation for an instance x based on the contribution of each of the characteristics to the prediction (Figure 26) [54].
As for the importance of the characteristics, the model is a priori more reliable than ELI5, since the importance measure is not a significant parameter for the model we are studying, remembering that the default value is 0.5 in most of the samples.
Figures 27 and 28 show the prediction of SHAP values for the partners represented, respectively, in Figures 24 and 25 by the LIME model.
It is significant to note that, in the segmentation study, the partner represented in Figures 25 and 27 would be identified with cluster #1; it is therefore a recently incorporated partner that needs to take its first steps, with a high risk of abandonment, and in fact it abandoned. On the contrary, the partner represented in Figures 24 and 28 would be identified with cluster #4. The profile corresponds to a partner with a large installed base of partners that uses the Contact Center to solve specific problems and, therefore, has a low risk of abandonment.
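The contribution-based explanation that SHAP provides rests on Shapley values, which can be computed exactly when there are only four features. The sketch below enumerates all coalitions instead of using the SHAP library; the scoring function, the instance, and the baseline (features outside a coalition are replaced by a reference value) are hypothetical choices made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values; features outside a coalition take their baseline value."""
    d = len(instance)

    def value(coalition):
        row = [instance[i] if i in coalition else baseline[i] for i in range(d)]
        return predict(row)

    phi = []
    for i in range(d):
        others = [j for j in range(d) if j != i]
        contribution = 0.0
        for size in range(d):
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi


# Hypothetical churn score over [recency, frequency, importance, duration];
# importance is deliberately unused by this toy function.
def toy_score(row):
    recency, frequency, importance, duration = row
    return 0.6 - 0.08 * frequency - 0.04 * recency + 0.01 * duration * frequency


instance = [1.0, 0.5, 2.5, 3.0]
baseline = [2.5, 2.5, 2.5, 2.5]
phi = shapley_values(toy_score, instance, baseline)
```

The contributions add up to the difference between the prediction for the instance and the prediction for the baseline, which is what allows a single partner's score to be decomposed feature by feature.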


Skater
Because of its interest in the use case under study, we have introduced Skater, since it allows both global and local interpretation; for global explanations, it is based on the use of PDP, and for local explanations it is based on LIME. It corresponds to a unified framework recently introduced and under development [83].
The results obtained are shown below.
In the graph in Figure 29, the forecasts obtained with Skater fit with those of the SHAP model in Figure 26. Analyzing in more detail, we obtain the dependence graphs (Figure 30), where the relationship between the classification variable and each of the characteristics can be seen.
The graphs in Figure 30 confirm the trend seen in the PDP model. The tendency to abandon is centered on those partners with medium and low recency values (L, M), low frequency values (L), and with low, medium, and high duration of incidents (L, M, H). As mentioned above, they correspond to partners who have recently entered the channel and have not managed to mature sufficiently to be able to market and implement the software's solutions.

Discussion
The objective of this study is to complete a working methodology for the analysis of the interpretability in ML models, using agnostic models (global and local) that have been used for the analysis of the explainability of partner churn in a B2B environment.
Applying the methodological procedure designed in the RFID model, in a typical B2C environment, it is possible to achieve high levels of interpretability in the customer churn rate, since there is a direct dependence between the frequency, recency, duration, and importance of incidents and the churn rate. In the B2B model that has been proposed, as mentioned above, the relationship between the partner (distributor = customer) and the software company is very close. That is to say, the partner has had to develop a whole line of business and investment in its relationship with the software company, which translates into:

• Adaptation to the business plans established by the software company: number of people trained, sales commitment, and annual turnover.
• Highly demanding training process for each partner's technical, commercial, and pre-sales personnel.
• As the company grows, it becomes necessary to hire specialized personnel, with the consequent related economic cost.
Therefore, as the partner grows in sales, the relationship with the software company becomes closer and, consequently, the abandonment rate of that partner is lower (see Figure 11 and Table 8). On the other hand, recently incorporated partners that do not complete the maturation process described above tend to abandon before their investment in the business model proposed by the software company becomes greater. Because of the above, and for the business case in question, the more interactions that take place between the partner and the Contact Center, the lower the probability of abandonment (high frequency and recency). Logically, the partner who is more established in the channel has a greater number of customers to serve and, therefore, a greater number of interactions with the software company.
Regarding the working methodology applied to solve the problem of interpretability in ML models, an in-depth study of agnostic interpretability models has been carried out and applied to the context of the problem to be addressed. This methodology helps us to interpret the decisions made by black box algorithms, and uses agnostic interpretability models, not dependent on the ML model, and therefore its flexibility and applicability to any type of learning model is guaranteed.
The innovation of this paper is based on three differentiating factors:
• Use of a customer assessment based on the RFID model, and on the set of interactions between the customer and the brand through the Contact Center.
• Application of ML models oriented to customer churn rate prediction in B2B environments. According to the literature review, in Section 2, there is not much research production in B2B environments.
• Application of a working methodology that provides an agnostic interpretability procedure, extensible to any predictive model in B2B or B2C environments in which we must predict a variable (classification or regression) based on the rest of the characteristics.

Conclusions and Future Work
In conclusion, the present work develops a completely new line of evaluation of the prediction of customer abandonment, from the point of view of the customer's interactions with the Contact Center. Based on the RFID model, it is possible to determine the brand's evaluation of the customer through these interactions and, consequently, to analyze each of the characteristics that make up the model and their weight in the final evaluation of the binary target variable abandonment. An additional novelty is the application of the model to a B2B environment, for which the literature is scarce, in models that determine the prediction of abandonment, and consequently in the application of interpretability (XAI).
As an example, we have applied the implemented model to a dataset from a software manufacturing company with a large network of partners (customers) distributed around the world. A set of predictive models has been applied on unbalanced data; the churn rate represents 1.06% of the total sample. Therefore, data balancing techniques had to be applied, adjusting the behavior of the algorithm in unbalanced classification problems, in addition to simultaneously applying subsampling and oversampling algorithms to the dataset.
Then, the described working procedure has been applied, consisting of the application of successive interpretability techniques. The results obtained through the implemented interpretability methodology reveal that the conclusions are aligned with the clustering implemented with the RFID model. The more the partners interact with the Contact Center, the less propensity they have to abandon the relationship with the software company. The clustering (k-means) developed through the RFID model classifies partners into five groupings, and the abandonment prediction fits perfectly with the clusters in which the partner has a lower rating on the recency and frequency variables.
As future work, we propose the following:
• One metric of concern for Contact Centers is the employee attrition rate; turnover in the Contact Center is very high, mainly due to work and emotional demands [84]. Using the procedure designed in this paper to analyze, predict, and interpret Contact Center staff attrition rates would be a major challenge.
• Extend the model to any industry and any B2B and B2C environment, with a focus on retail, insurance, banking, and service delivery.
• Finally, consolidate the model with what customers think, i.e., contrast the model with the net promoter score (NPS), the customer's assessment of the brand or of each of the interactions through the customer satisfaction score (CSAT), or the customer effort score (CES) [13]. In addition, other factors can be added, such as the metrics introduced in the customer engagement value (CEV) model [85].
• In certain fields, such as decision making in image detection processes, it will be necessary to adapt the interpretability model described by incorporating the use of artificial neural networks [76,86,87]; in future works, an extension of the XAI model will be proposed by adapting these improvements.

Conflicts of Interest:
The authors declare no conflict of interest.