A Novel Model Structured on Predictive Churn Methods in a Banking Organization

A constant in the business world is the frequent movement of customers joining or abandoning companies’ services and products. The customer is one of the company’s most important assets. Reducing the customer abandonment rate has become a matter of survival and, at the same time, the most efficient way to maintain the customer base, since the replacement of dropouts by new customers costs, on average, 40% more. Aiming to mitigate the churn (customer evasion) phenomenon, this study compared predictive models to discover the most efficient method to identify customers who tend to drop out in the context of a banking organization. A literature review of related works on the subject found the neural network, decision tree, random forest and logistic regression models were the most cited, and thus the models were chosen for this work. Quantitative analyses were carried out on a sample of 200,000 credit operations, with 497 explanatory variables. The statistical treatment of the data and the developments of predictive models of churn were performed using the Orange data mining software. The most expressive results were achieved using the random forest model, with an accuracy of 82%.


Introduction
The behavior of consumers and their relationships with companies are being affected by the profound changes brought about by access to and speed of information, technological advances, and fierce competitiveness, making them more demanding and less loyal to companies.
Successful companies have satisfactory and long-term relationships with their customers (Zhu et al. 2017), allowing the institution to focus its efforts on the customer's needs and opening up new service possibilities (Mehta et al. 2016). Searching for new customers is an essential activity for companies, but more costly than maintaining a relationship with an existing customer. Higher attrition rates typically characterize new customers. In return, a satisfied customer indicates and recommends the company to other potential customers (Depren 2018).
Keeping a customer cost approximately five times less than acquiring a new one (He et al. 2014). According to Reichheld and Sasser (1990), in the case of bank branches, a study showed that a 5% reduction in the abandonment rate-a metric that indicates how much your company has lost in revenue or customers-is capable of increasing profits by 85%. Analyzing customer behavior enables strategies that can be used for retention; however, this activity is a task that requires the analysis of large volumes of data. Tools like data mining can help the company to understand the customer's need according to what it can offer.
Data mining is a technique to extract data from the histories stored in computational devices, analyze them, and transform them into useful information, with the objective of more accurate strategic decision-making (Steiner et al. 1999). Data mining technology can detect important clues behind large amounts of available information, enabling the creation of a predictive model of customer churn.
CRM emerged as an essential tool to win and increase customer loyalty and has become necessary, considering the increasingly different demands on companies. With the use of information technology, CRM started to be used as a strategic tool to maintain and improve the relationship with the customer and, mainly, prevent them switching to a competitor.
'Churn' is customers' action in giving up operating with a company and migrating to its competitors (Ghorbani and Taghiyareh 2009). The loss of clientele is an increasing concern and must be addressed as a priority by organizations, as it can impact revenue and, consequently, companies' survival.
The retention rate is represented by the number of customers who remain with the company, calculated on the entire (Jahanzeb and Jabeen 2007) customer base. The churn rate can be extracted as follows: C0-number of customers at the beginning of the period A1-new customers who joined in the given period C1-number of customers at the end of the period.
To handle and manage the churn rate, it is essential to create a predictive model. In most cases, the techniques are based on the client's history, through variables with greater influences on the churn phenomenon. The models analyze, identify, and classify each customer's probabilities of evasion, individually, aiming at customer retention measures (Idris et al. 2012).
Effective predictive models analyze customer history through a mass of data extracted from the computing devices that support the applications used by customers.
This data undergoes mining that transforms it into relevant information, identifying customer behavior and enabling decision making (Adebiyi et al. 2016).
This article aims to compare the four main machine learning algorithms according to the literature to find the best predictive churn model to address preventively customer evasions in a banking organization.
In Section 2, the most popular algorithms in predictive churn models will be shown through bibliometric research. In Section 3, the processes of the CRISP-DM methodology for the construction of the model will be described. Section 4 will outline the results obtained and, finally, Sections 5-8 will report the discussions, conclusions, limitations of the article, and suggestions for future work, respectively.

Materials and Methods
To investigate the current state of research on churn predictive models, a literature review was performed. This approach makes it possible to provide an overview of previous research on the subject and a holistic perspective on the common knowledge bases. In addition, the review contributes to the development of the machine learning field concerning predictive models, identifying different research streams with different approaches, and using the most varied algorithms. A review approach is characterized by thoroughness and rigor, leading to the legitimacy and objectivity of the results (Creswell and Creswell 2017;Jesson et al. 2011;Tranfield et al. 2003).
As shown in Figure 1, in a first step, the research field was accessed, obtaining an overview of the relevant works in the field of machine learning that deals with the churn phenomenon. The objective was to identify articles from relevant journals referring to churn predictive models.
To be included in the review, the title or keywords of an article should contain the string: ("churn*") AND ("customer*" OR "client*") AND ("predict* OR "machine learning*" OR "model*"). The terms were connected using Boolean logic, a method by which the database can search for certain combinations of keywords. Using an asterisk with the search term ensured that variations (e.g., client and clients or model and modeling, or predict and prediction) were included in the results of the bibliographic searches. As an additional inclusion criterion, the type of publication was defined as a journal article, in English and published between 2000, when the study of the topic began, until 2021.
A search in the Web of Science (WoS) indexing database resulted in 288 published articles. After applying the refinement to the category "Computer Science Artificial Intelligence", the result dropped to 102 articles.
The same query was performed on the SCOPUS indexing base, returning 276 published articles. After refining the subject area to "Computer Science", the result reduced to 225 articles.
A f urther refinement was performed, removing all articles that were not cited as references by any other research, leaving 186 articles.
For the 186 articles, only 160 documents were publicly available, which served as a basis for pre-processing, and they were submitted for further refinement to group related terms. To identify the most recurrent predictive models in these publications, we chose to use a bibliometric analysis of the texts. According to Pritchard (1969), bibliometrics can be defined as the application of statistical and mathematical methods to books and other media, with its focus on measuring the tangible output of scientists through books, articles, patents, and others.
The analysis of terms was performed with the support of software called VOSviewer, which is a tool for building and viewing bibliometric networks. It allows the creation of networks 107 of citation relationships, bibliographic coupling, cocitation, or co-authorship (Van Eck and Waltman 2010).
VOSviewer is also a tool we have developed over the past few years that relatively easily provides the basic functionality needed to view bibliometric networks (Van Eck and Waltman 2014).
Keywords are nouns or phrases that reflect the core content of a publication. Content is studied through keyword distribution analysis. Bibliometric data showed that there were 3174 terms, however, only terms that appeared at least ten times in the analyzed texts were selected, returning 69 keywords, and of these, only 60% (41 terms) were selected. As can be seen in Figure 2, in the network, the view items are represented by their label and a circle. The item's weight determined the label size and circle of an item. The more significant the weight of an item, the larger (Yang et al. 2017) the item label and circle will be. In this study, the software identified three clusters, composed of terms that tend to be mentioned together. To choose the clusters, VosViewer runs an algorithm that maximizes the strength of association between the pairs of terms and minimizes the number of nodes in the cluster. The color of an item was determined by the cluster the item belonged to, and the lines between the items represent links in Table 1    Through the bibliometric analysis of the collected studies, it was possible to verify through the network map shown in Figure 2 and detailed in Table 1, produced by the VosViewer tool, that the most used algorithms in the consulted works that deal with predictive model churn, in order of relevance, were:

Artificial Neural Network
Artificial neural networks are models of a computational system that mimic the way the human brain operates, in a simplified form. The processing of information is performed by neurons that work with electrical signals passing through them, applying the flip-flop logic; that is, opening and closing the means of transmission for the passage of signals, in such a way that the neuron only transmits information capable of surpassing a certain threshold.
The artificial neuron is made up of input terminals (x1, x2, ..., xn) and their function is analogous to that of the dendrites of the biological neuron. Dendrites are structures that receive nerve stimuli from the environment (or other neurons) and transmit these stimuli to the neuron's body (Ciaburro and Venkateswaran 2017;Potdar et al. 2017).
The artificial neuron is made up of input terminals (x1, x2, ..., xn) and their function is analogous to that of the dendrites of the biological neuron. Dendrites are structures that receive nerve stimuli from the environment (or other neurons) and transmit these stimuli to the neuron's body (Jain et al. 2021).
It has a central adder plus an activation function that can be compared to the cell body of the biological neuron in which the stimulus coming from the dendrites is processed.
Similarly, artificial neural networks work in such a way that their output is a function of the weighted sum (weighs) of the inputs added to a bias. Each neuron is responsible for performing a very simple task, which is to become active if the received signal exceeds a predetermined threshold, fixed by its activation function, which could be linear, unit step, sigmoid, hyperbolic tangent, among others. However, as the sigmoid function is able to support the nonlinearities of the data and deal with high complexity, it is a type of activation function widely used for artificial neural networks (Ciaburro and Venkateswaran 2017; Vinicius 2017).
Its neurons are organized in layers. The first layer contains the network's input neurons. Training observations are presented to the network through these entries. The final (or output) layer contains ANN's predictions for any case presented in its input neurons. In between, we usually have one or more layers called hidden, intermediate, or hidden (Burger 2018).
Building an ANN consists of establishing an architecture for the network. This involves choosing the number of hidden layers, number of neurons in each layer, and which activation function to use. Then, use an algorithm to find the weights of connections between neurons (Montgomery et al. 2021).

Decision Tree
Decision tree algorithms are currently among the most popular and most used since they are intuitive for the typical user (Burger 2018).
This model can be used both to perform regression and classification. Thus, when a tree model has an objective variable that fits into a discrete set of values, there is a classification tree, which is the object of this article. In these structures, the leaves represent the labels of the objective classes (in this study, "evaded" and "not evaded") and the branches represent the divisions of predictor variables that lead to the objective classes.
Therefore, by default, the process of forecasting a tree is carried out through the stratifications of the space of independent variables, which, in a simplified way, can be performed in two steps: (a) The space of the predictor variables, that is, the set of possible values for X1, X2, ..., Xp, is divided into "j" distinct and non-overlapping regions, entitled R1, R2, ..., Rj (James et al. 2013). (b) For each observation that falls in the "Rj"' region, the same prediction must be made, which is simply the classification, or even the probability of classification, among the possible discrete response values for the training observations in "Rj"in the case of this study, "evaded" or "not evaded" (James et al. 2013).
For the configuration of a tree, it is necessary to define some measures used in the construction of a DT, such as: information concept, system entropy (or simply entropy), entropy of each attribute and information gain.


Information: when classifying something that can take two values (binary classification), the information for the class C = ci is defined as: where p(C = ci) is the possibility to choose the class ci = 1.2 (Harrington 2012).  System entropy: is a measure of the heterogeneity (or lack of regularity) of a dataset. Measures how organized or unorganized a dataset is (Kelleher et al. 2020).
where Equation (4) and T represents the training dataset. The dataset is partitioned into subsets through the results of each (Flach 2012) attribute.
 Entropy of each attribute: For the subsets formed from the successive divisions in the training dataset we have new entropy values, which we will denote by Ti (i represents the i-th division). The entropy of a given attribute at denoted by Hat(T), is calculated as the weighted average of the subsets' entropies as follows (Kubat 2017): where Equation (6) is the probability that a training example is in T and |T| is the number of examples in the entire training dataset.


Information gain: The information gain when taking into account a given at attribute given by: Information gain is a measure of the relevance of an attribute on a scale of 0 as being least applicable to 1 as most beneficial (Burger 2018).
It is possible to prove that Iat(T) cannot be negative, and that information can only be obtained, never lost by considering a specific attribute (Kubat 2017).
Applying Equation (7) to each attribute, we can find out which one provides the maximum amount of information and this one will be chosen to be the root node in the construction of the DT.
Briefly, decision trees have the following advantages: they allow straightforward interpretation, and they are flexible because they are not sensitive to data distribution and use relevant attributes in tree modeling.

Random Forest
The random forest is a collection of different decision trees (DT) (Burger 2018). They are an example of a set of models (ensemble models); that is, a model that is formed by a set of simpler models (Mishra and Reddy 2017).
The name random is given as a function of selecting a random sample that uses the attributes necessary for the construction of each decision tree from the set of generated trees. This set is called a forest.
Each tree is learned using a bootstrap sample from the training dataset. With each of the generated samples, a tree is modelsined (Torgo 2016).
For each bootstrap sample generated, there is a random sampling of characteristics (explanatory variables or attributes) that will be used in the creation of each tree. Therefore, different trees will have different features when generating the decision nodes.
The creation of each tree is done analogously to the AC algorithm including all the entropy and information gain calculations involved.
The process of classifying new observations is as follows: each RF tree classifies the observations of the test set into one or another category; that is, if the RF has N trees, a given observation will have N votes. The response variable that receives the most votes in all trees is the prediction of the set (Torgo 2016).

Logistic Regression
It can be said that logistic regression is a statistical model that, instead of modeling the value of the Y response directly, as is done in linear regression, models the probabilities that Y belongs to a specific class (James et al. 2013;Yanfang and Chen 2017). Usually, this class, in its most basic states, is binary and all possibilities fall under "yes", which can be represented by the binary variable 1 (one), or "no", which can be represented by the binary variable 0 (zero).
Therefore, to ensure that the model's output values are within the scale that goes from 0 to 1, the probability of the output "p(x)" must be calculated using a function that provides the outputs within this pattern for all possible values of the "n" predictor variables "x". Although many functions meet this description, in logistic regressions, the logistic function defined by Equation ( 8) is used.
With a little manipulation of Equation (8), one can find the generalized formula for the logistic regression, as shown in Equation (9): The dependent variable is to be interpreted in the form of probability of occurrence of the analyzed events, with probability limited between 0 and 1. Buckinx and Poel (2005) consider logistic regression adequate for predicting churn for five main reasons: (a) It is conceptually simple and is often used in marketing, especially at the individual consumer level. (b) Ease of understanding and interpretation of the model. (c) Modeling has been shown to provide good and robust results in general predictive studies and has broad support in the theoretical framework. (d) Although simple, it has been demonstrated by several authors that modeling can even surpass more robust and sophisticated methods.

Models
CRISP-DM (Cross Industry Standard Process for Data Mining) was developed by Chapman et al. (2000) and is a robust and consolidated methodology in data science, which focuses on solving business problems, providing a structured approach to planning a data mining project in a cyclical and flexible way.
To address the problem with data mining techniques, due to the large amount of data that has to be dealt with; it is recommended that the mining process be iterative and phased (Camilo and Silva 2009). The CRISP-DM methodology was used in this research and served as a reference for data processing. Although the evaluation phase is after modeling, according to the methodology flow, it will be carried out constantly, as it serves to validate the actions that are taken throughout the process.
The problem is to indicate whether the customer is about to evade or remain a customer of the banking organization, considering the customer's historical data and the scenarios in which he/she is inserted. Hidden patterns are captured with machine learning techniques and used to predict customer behavior.

Business and Data Understanding
The data were made available in CSV format, containing 200,000 instances, with 497 attributes, from 29 tables, being a class to identify the customer's situation (active or evaded). The variables are numerical, categorical, text and time types, and were retrieved from the databases of the different systems that support the business of the banking organization. The data were heterogeneous and provide information on: • Client's data; • Segment data; • Credit manager details; • Segment risk assessment; • Customer behavior; • Customer segment behavior.
Because it is sensitive data to the banking organization under the prism of competition and competitiveness, and from the point of view of bank secrecy, to allow use of the data in this research, the attribute names were renamed to "Axxx", where xxx is a sequences from 000 to 496, as can be seen in Table 2.

Data Preparation
Data were obtained from several tables and various systems of the organization and consolidated into a single file in CSV format, maintaining the client's referential integrity.
For the import and processing of data, modeling of predictive algorithms, and evaluations of results, the Orange data mining software (Version 3.28) was used, an opensource data visualization and analysis tool. Orange has a drag and drop interface, similar to tools such as SPSS Modeler and Azure ML from IBM and Microsoft, respectively. Data mining is done through visual programming or Python script. The tool was developed in the bioinformatics laboratory at the Faculty of Computer Science and Technologies, University of Ljubljana, Slovenia, in conjunction with a supporting community open source (Demšar et al. 2013).
When importing the data, if the parameter "auto-discover categorical variables" is checked, the widget can identify whether data are numeric or categorical, based on the values used in the columns. However, classification errors can occur, especially when dealing with discrete values, making it necessary to revisit each attribute to adjust the typification. After reclassification of the 497 columns, 407 were attributes (8 categorical and 399 numeric) and 90 metadata (data on data, not used for statistical inference), 7 categorical, 20 numeric, and 63 texts.
The select column widget was used to identify which attributes were of the metadata type, which ones to use as a class and which ones to use/ignore for the elaboration of the models. In a previous analysis, it was possible to verify that some attributes did not contribute (without data or with the same values for all instances) and, therefore, they were disregarded, as shown in Figure 3. Considering involuntary churn, in which the organization has no interest in keeping the customer due to default issues, the attributes that indicate damage to the operation were also disregarded. After reclassifying the attributes, the configuration was as follows: 77 ignored, 10 meta, and 410 valid attributes. Selecting a subset of attributes, using its relevance as a criterion, is a way to reduce the dimensionality of the features space, which in this study, due to the large number of variables available for selection, is necessary to reduce the computational and analysis complexity, considering that the increasing number of attributes does not always translate into better result accuracy.
To select the most significant attributes, the rank widget was used, which scores the most relevant variables according to the correlation between them, whether discrete or numerical. Functionality can use algorithms such as: 1. Information gain: the expected amount of information (entropy reduction) (Mohammad 2018). 2. Gain rate: a ratio between the gain of information and the intrinsic information of the attribute, which reduces the tendency for multivalued resources that occur in the gain of information. 3. Gini: the inequality between the values of a frequency distribution. 4. ANOVA: the difference between the mean values of the resource in different classes.
5. X 2 : dependency between resource and class measured by chi-square statistics. 6. ReliefF: the ability of an attribute to distinguish between classes in similar data instances. 7. FCBF (fast correlation based filter): an entropy-based measure, which also identifies redundancies due to peer correlations between resources.
For classification models, a matrix is used where the predicted results are tabulated against the original classes, known as the confusion matrix. This matrix seeks to identify the relationship between successes and errors in the model's classification. Table 3   From the confusion matrix, it is possible to assess whether given data was classified correctly or not, against known data, for a given class. Traditionally, the metrics used to validate ranking models are: • F1: is the harmonic mean between sensitivity and accuracy; that is, it summarizes the information of these two metrics; • The ROC curve (receiver operating characteristic) is a metric derived from the confusion matrix, in the sensitivity and specificity dimensions.
To find the best ranking method, the machine learning chosen was the random forest algorithms, as it is less sensitive to data distribution. It was used with the configuration default and the widget test and score to provide the evaluation result. The sampling method chosen was cross-validation with 10 folds and stratified (Yadav and Shukla 2016). For comparison purposes, the metric area under the ROC curve (AUC), F1, accuracy (CA), precision, recall, and specificity were adopted. Therefore, the wrapper (forward stepwise selection) technique was adopted, with the construction of a model using random forest, as in the approaches already described, comparing the metrics resulting from the selection of the best n variables. For these selections, the best ranking algorithm (FCBF) was adopted. Table 4 and Figure  4 show the results of tests performed with 1 to 25 best ranked attributes. As can be seen, from the 20th variables onwards, the improvements achieved were insignificant, with the greater use of 20 variables proving innocuous.  Regarding the data preparation stage, it is concluded that, for the studied data, the best performing algorithm to filter the most relevant attributes was the FCBF (fast correlation based filter). Through the result obtained through the random forest, without optimization, it was possible to select 20 attributes (from the initial 496), returning classification hits with levels close to 70%.

Modeling
As already mentioned, in Orange the model is represented by means of a flowchart containing the entire structure of the learning process with all the configuration, visualization and operation steps to which the data are submitted, including, but not limited to the application of specific algorithms, visualization of data at a certain stage of the flow, presentations of results and statistics on modifiers applied to the data, segregation of sampling for training and testing, among several other steps inherent to the mining process. Figure 5 illustrates the proposed model contemplating the steps of preprocessing, training, testing and validation. For model validation purposes, it was decided to divide the data sample into training, testing and validation, in the proportion of 60%, 20%, and 20%, so two data sampler widgets were used. The first separates the data into validation (20% of the data, equivalents to 40,000 instances) and training/test (80% of the data, that is 160,000 instances), and the second for training (75% of the 80% of the data, making a total of 120,000 cases) and testing (25% of the 80% of the data, making a total of 40,000 instances). Training and testing are connected to the test and scores widgets while the validation data (separated in the first data sampler widgets were used to perform the prediction of the models, in order not to use the same values of the instances used to train and test the models, which could result in biased results.

Neural Network
The neural network widget uses the multi-layer perceptron algorithm that can learn nonlinear as well as linear models. The algorithm is based on the skleran library and can be configured as follows: a. Neurons in hidden layers; b. Activation; c. Optimizers; d. Alpha; e. Iterations.
To identify the best configuration of the neural network hyperparameters, the random search method was used, gradually changing each parameter, and collecting the resulting AUC and F1 metrics for each adjusted value. To mitigate the risk of an overfitted model, the model was subjected to 10 different samples, with the same number of instances and in a stratified way, with data not used in training and testing. For each sample, the values of AUC and F1 were obtained and with these values the means (x) and standard deviation (s 2 ) for each configuration were calculated, according to Table 5.
Twenty-three sets were tested and the line in bold (6) was the one that obtained the best AUC and F1 value, respecting the mean and standard deviation range.

. Decision Tree
The widget tree is an algorithm that divides the data into nodes to find the desired class. It can handle both discrete and continuous datasets and can be used for classification and regression. Here are the tuning settings for algorithm optimization: a. Minimum number of instances in sheets; b. Do not separate subsets; c. Maximum depth limit.
With an approach analogous to that u s e d in neural networks, each hyperparameter of the decision tree was tested and after 20 attempts it was possible to find the ideal configuration for model optimization, without the occurrence of overlapping, according to Table 6. The bold line denotes the best setting, with the AUC and F1 metrics within the mean and standard deviation of the 10 validation samples. As a result, after optimizing the hyperparameters, the decision tree had 22,623 nodes and 11,312 leaves.

Random Forest
The random forest widget implements a set of decision trees, which in turn is trained with a bootstrap sample set. The final model is based on the majority of votes from the individual trees in the forest. As it is an extension of the decision tree seen in the 3.3.2 items, it can also be used for classification and regression. Here are the tuning settings for algorithm optimizations: a. Number of trees; b. Number of attributes considered in each division; c. Balancing the distribution of classes; d. Individual tree depth limit; e. Do not separate subsets.
The same approach as used in the models previously studied was used to identify the best configuration of the random forest hyperparameters. There were 30 attempts, but only in line 28 (in bold) was the best result of the metrics found within the validation range (mean ± standard deviation), as shown in Table 7.

Hyperparameters Training and Testing Validation
No.

Tests (a) (b) (c) (d) (e) AUC F1 Accuracy Accuracy Sensitivity Specificity AUC
x ± s 2 F1 x ± s 2 1 10 5 Desl. The logistic regression widget implements the model of the same name which is characterized by learning a logistic regression model from the data. It is used for sorting only. Here are the tuning settings for algorithm optimization: a. Regularization; b. Cost; c. Balancing.
As with the other models already seen, the choice of hyperparameters followed the same approach as in the previous models. Twenty tests were performed, but in the seventh configuration the best result was found. Table 8 shows the values of the hyperparameters and the metrics found.

Results
The way to carry out the comparison between models and to visualize the result for the different classes of the dependent variable is the confusion matrix and the ROC curve. For this purpose, the confusion matrix and ROC analysis widgets were used in the proposed flowcharts, with all study models, respectively.
Tables 9 and 10 contain the confusion matrix for the trained models and proportion for model prediction, by class. Values in bold represent the number of instances correctly sorted. Table 11 displays instances sorted correctly (true positive and negative) and incorrectly (false positive and negative). Values marked in bold represent the best results obtained in the studied models.
Through Table 12, it is also possible to define the measures of: (1) AUC rate; (2) F1 rate; (3) overall accuracy rate of the model; (4) the accuracy rate of positive values; (5) sensitivity rate, known as the true positive rate; (6) specificity rate, called the true negative rate; (7) Matthews correlation coefficient; and ( 8) training and testing times of each model measured in seconds. Values in bold express the best values obtained for each metric. Figure 6 shows the ROC curves of the class 'N' for each predictive model of churn.  Table 11. Classification hits of the predictive models of churn.

Models Classification Neural Network Decision Tree Random Forest Logistic Regression
Correctly 27  Through a quick observation, it is easily possible to verify the superiority of the random forest model relative to the predictive capacity, which is reflected in all the evaluated metrics. Although it did not excel in any metrics, the decision tree ranked just below the random forest technique. At an intermediate level, with satisfactory results, were the neural network and, lastly, logistic regression.
The ROC curve was obtained by calculating for each cut-off point (represented on the main diagonal) the value of specificity and sensitivity. This curve confirms the good predictive capacity of the model (high sensitivity and specificity), since the more significant the area between the ROC curve and the main diagonal, the better the model's performances. The model using the random forest technique presented the result of 0.843, which means that there is 84.3% probability that the model correctly classifies each customer, a significant influence when the possible outcomes are dichotomous.

Discussion
The main reason for applying analytics and data mining techniques to a corporate churn prevention strategy is to optimize the use of available monetary or personnel resources. With limited funds, in a mass prevention campaign, the amounts to invest on each client are limited. On the other hand, resources can be wasted on customers who are not, in fact, at risk of being lost. The application of data mining techniques allows you to reduce waste and increase the value you invest in your target customers.
Since this work aimed to identify which customers tend to churn, notably, we have sensitivity as the main metric for evaluating the models, as this expresses the amount of positives correctly classified by the model. From this perspective, the model that used the random forest technique reached 83.2%, followed by neural network and decision tree with 75.6% and 75.5%, respectively, and, well below the others, logistic regression with only 65.4%. For the random forest, this means practically saying that the model manages to classify 83 customers out of every 100 customers.
Of the expected customers who will evade, according to Table 10, the random forest model presented the best accuracy, with 78.4% accuracy. If, for example, a preventive retention action is structured aimed at customers expected to evade by the model, 78 out of every 100 selected customers avoided. The other 22 out of 100 were false positives, wrongly considered evaded, who would be covered by the withholding action unnecessarily. Similarly, the model was also able to correctly identify approximately 85 correctly out of every 100 customers that would remain active. Therefore, the higher the accuracy, the lower the level of waste. In the context of a banking organization, business cancellations due to customer abandonment can represent billions of dollars in a year for medium to large banks. Thus, the random forest precision, with low waste, brings operational efficiencies to the banking organization, reducing the loss or loss of revenue.

Conclusions
According to the theoretical framework, there is a need for companies to be focused on the customer. For this purpose, it is essential to understand the relationship stages, with methods and processes that enable monitoring and management. To increase profitability or even to remain active in the market, the company must avoid reducing its customer base. Predicting which customers are about to evade, or migrate to a competitor, with the intention of providing mechanisms to avoid this situation is an issue that can be solved through predictive analytics methodologies, allowing for proactive management by organizations. In the context of Brazilian financial institutions, especially in banking organizations, there are few theoretical studies on predictive approaches, which, among other factors, may be characteristic of an incipient culture of predictive approaches to support retention. Consequently, after applying the churn model, it was possible to draw the profile of the customers most likely to drop out, as well as those least likely. Another important factor to consider is that for each customer who fails to evade, there is a reduction in the risk of the customer making negative comments about the company (risk management). In addition, keeping a client is on average five times cheaper than acquiring a new one (financial management).
Through bibliometric analysis of the studies collected in the Web Of Science and SCOPUS indexing bases, it was possible to verify that the most used algorithms in the consulted works dealing with churn predictive models are artificial neural network, decision tree, random forest, and logistic regression algorithms, in that order.
The results obtained show evidence that the proposed models can be efficient in determining the risk of customer abandonment, based on relevant predictive variables. In the base analyzed, of the 497 attributes analyzed, with only 20 was it possible to reach levels close to 70% accuracy in the classification. Table 11 showed that the overall hit rate of the random forest model is 83% and the ROC curve presented a hit rate of 84.3%. The model was also superior in precision and sensitivity metrics (82.7% and 83.2%), being also ideal in relation to neural network, decision tree and logistic regression.
Although the decision tree model presented a 9% lower predictive capacity in the ROC curve, compared to the random forest model, the former more easily presents and interprets the results, and can be read by people without advanced technical knowledge, providing clarity on which variable should be addressed to mitigate dropout, making it a good option for the organization depending on the level of acceptance of classification accuracies.
It is possible that in more robust databases, which allow overcoming the weighted limitations in relation to the number of customers in the sample and the verified heterogeneity, the performance of all models will probably evolve, notably the random forest and artificial intelligence neural networks algorithms, adherent to the levels of results observed in the academic literature dealing with the subject.

Method Limitations
One of the limitations of this study is that the entire database was extracted from a specific operating segment of the banking organization, and the results obtained may not be the same for the other products/segments, which have different dynamics in the life cycle of the credit transaction.
The reduction in turnover is not always due to churn. In many cases, the reduction or cancellation of products results from the loss of interest/need by the customer. There are clients who redeem their investments to invest in the acquisition of a property, for example, others settle credit operations because they no longer need it. At the level of quantitative analysis, with the data collected, we do not see ways to distinguish these occurrences, which is also a limitation of the method.

Future Works
In carrying out this work, the predictor variables used were based on the context of the client, behavior, and risk, without evaluating factors external to the bank. However, considering that involuntary churn is due to various factors, such as the customer's ability to pay or rates that are no longer attractive, it essential to assess whether macroeconomic variables, such as inflation, levels of prices, growth rate, national income, gross domestic product (GDP), and variation in unemployment rates have a significant influence on the dropout rate.
It is also suggested to apply other data mining techniques for the classification tasks with meta classifier algorithms, such as, for example, AdaBoostM1, Bagging and Logit Boost, or the use of a combination of mining algorithms in the classification task using different classifier committee methods, such as stacking and vote, or machine learning with multicriteria. Finally, performing a comparative analysis of the performance of these techniques with the others used in this research could be the focus of another work (Silva et al. 2017). Acknowledgments: Plácido Rogério Pinheiro is grateful to the National Council for Scientific and Technological Development (CNPq) for developing this project.

Conflicts of Interest:
The authors declare no conflicts of interest.