Regularization of Autoencoders for Bank Client Profiling Based on Financial Transactions

Predicting if a client is worth giving a loan, known as credit scoring, is one of the most essential and popular problems in banking. Predictive models for this goal are built on the assumption that there is a dependency between the client's profile before the loan approval and their future behavior. However, circumstances that cause changes in the client's behavior may not depend on their will and cannot be predicted from their profile. Such clients may be considered "noisy", as their eventual belonging to the defaulters class results from random factors rather than from predictable rules. Excluding such clients from the dataset may be helpful in building more accurate predictive models. In this paper, we report preliminary results on testing the hypothesis that a client can become a defaulter in two scenarios: intentionally and unintentionally. We verify our hypothesis by applying data-driven regularized classification using an autoencoder to client profiles. To model intention as a hidden variable, we propose a specially designed regularizer for the autoencoder. The regularizer aims to obtain a representation of defaulters that includes a cluster of intentional defaulters, with unintentional defaulters as outliers. The outliers were detected by our model and excluded from the dataset. This improved the credit scoring model and confirmed our hypothesis.


Introduction
One of the most difficult and highly prioritized banking tasks is assessing the creditworthiness of clients, which is known as the credit scoring task. The credit score is typically used to predict the probability of loan default. The absence of an explicit dependency between a client profile and the client's reliability level makes the task challenging. Currently, all existing solutions are based on various forecasting models, which are usually constructed on prior information about the client. Although many banks worldwide spend heavily on developing novel scoring models and improving the existing ones, this task remains topical, since even a small improvement in the quality of a scoring model can significantly increase profits Abellán and Castellano (2017).
Scoring models are built on collected datasets containing information about borrowers' profiles and the target variable, which defines if a client managed to pay off a loan or became a defaulter. Building (training) such models is grounded on the assumption that there is some dependency between borrowers' behavior before they were given a loan and their behavior afterwards. There are almost unchangeable factors, such as psychological traits Karlan et al. (2012); Lea (2020); Ranyard et al. (2017) or even the spelling error ratio Lee and Singh (2020), that were shown to affect the probability of borrowers' default. On the other hand, many unpredictable factors may also affect this probability: the global macroeconomic situation, pandemics, the death of a relative, sudden disease, or heavy distress. The presence of clients affected by such factors in the training data makes the dependency less clear and makes the model harder to train Silver (2012).
In this research, we focus on studying the behavior of clients who have already been given a loan. We know which of them were 30 days late in a payment and which were not. Their behavior is represented by their financial transactions performed with a given credit card. Individual spending is a highly informative trace of human behavior that may have a close relationship with social behavioral traits that are important for predicting personal risk. We assume that investigating such behavior may contribute to understanding why loans are not paid off and to improving scoring models.
We hypothesize that there are two groups of clients who default on loans: the first miss payments intentionally, while the second miss them due to some serious reasons. Distinguishing such defaulters is critical for the decision-making process as only the first type of clients should be filtered out while training a predictive model. This paper presents an approach to distinguish defaulters without any knowledge of the ground truth (which of the two categories they belong to) in an unsupervised manner. To solve the task, we propose an autoencoder model with regularization. As the paper reports on work in progress, only preliminary results with a trivial client representation are described, which are still promising.
The research questions are the following:
• if the posterior information about clients' behavior after receiving a loan can be used to improve the credit scoring model based on prior information about clients;
• if such an improvement can be achieved using transactional data features.
The contributions of this paper are the following:
1. We hypothesize how defaulters can be distinguished with respect to their transaction profile, formulate a verifiable consequence, and test it.
2. We propose a neural network (NN) model, based on this hypothesis, that is capable of profiling clients from their transactions.
3. Finally, we propose an NN-based method for filtering defaulters to detect outliers, who may degrade the scoring model quality.
The rest of the paper has the following structure. We briefly review related works in Section 2. In Section 3, we describe the dataset we use. In Section 4, we present the hypotheses on borrowers' behavior and formulate a verifiable consequence for testing them. The proposed methods for the autoencoder regularization are described in Section 5, followed by the results and discussion in Section 6. The conclusion is given in Section 7.

Related Work
Optimization of scoring models is a crucially important problem for any creditor. The development of the banking system and the increase in the number of clients have resulted in the need to automate the solution to this problem. Since the credit scoring problem has been known for a long time and is well researched, there are many existing solutions based on completely different approaches. State-of-the-art methods of indirect scoring take into account various parameters of the client, such as life situation, credit history, transactional data, etc. Numerous mathematical models have been developed to predict the repayment or to determine clients' behavior patterns. Due to advances in machine learning, this task has received a new round of investigation. One possible statement of the problem, mentioned before, is to predict whether the client makes a loan repayment on time. This task is solved based on a client profile, which includes not only general information such as gender and age but also financial factors, for example, the payment history. Several publicly available datasets, such as German Credit Data Asuncion and Newman (2007) or Australian Credit Approval Quinlan (1987), became standard benchmarks both for the credit scoring task and for classification in general. There are many papers published on credit scoring, thus we refer only to some of them: Chen and Huang (2003); Desai et al. (1996); Hand and Henley (1997); Lee and Chen (2005); Lee et al. (2002); Steenackers and Goovaerts (1989); West (2000).
Information about clients' financial transaction profiles is typically used to detect fraudulent transactions Brown and Pariseau (2009); Gordon (2003); Paulsen et al. (2008); Thakur et al. (2012). Despite the evidence that transactional data could be used for scoring models Zoldi (2013), no academic literature is available, as accessing this type of data may be complicated.

Dataset
In this research paper, we use a dataset from a local Russian bank. This dataset contains transaction records of 70,000 anonymized clients. Each of the clients was approved by the bank to take out a loan. Thus, the clients in the dataset are known to be creditworthy with respect to the bank's decision model.
Clients are considered defaulters if a loan is not repaid within the 30-day period, and they are labeled with "1". Clients who have made the required payments on time are labeled with "0". A record is a single credit card transaction of a client and is described by one of 16 Merchant Category Codes (MCCs) that the transaction corresponds to, as well as its size (amount of money), date and time, and other specific details.
From the types of a client's purchases, their location, or the amount of expenses, a typical portrait, or profile, of a defaulter can be learned to prevent giving loans to people with a similar model of financial behavior.
The dataset has specific features:
1. The data has no information about interest rates, periods, or amounts of loans issued;
2. We are not able to fully assess the client's income or creditworthiness from the data, because the provided transactions do not reflect all of the proceeds to the client's account, including the client's salary.
In this study, we squeeze the transaction records to form, for each client, an aggregated MCC vector of spending amounts. Thus, our dataset consists of 70,000 weighted vectors of 16 elements and binary labels. A synthesized sample of the dataset is available at http://genome.ifmo.ru/files/papers_files/Risks2020/dataset_example.csv (accessed on 13 March 2021). Each line represents the MCC vector of the client's spending and the binary flag "defaulter", which is equal to one if the client is a defaulter. We use these vectors as client features. Standard normalization and scaling were applied to the data. The dataset is imbalanced and contains only 7000 defaulters. To handle this, we oversample the minority class in our experiments, which is described in Section 6.2. The dataset feature importance plot with the titles of MCCs is presented in Figure 1.
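The aggregation step described above can be illustrated with a minimal sketch. It assumes transactions arrive as three parallel arrays (client index, MCC index, amount); the function names, the toy values, and the choice of zero-mean, unit-variance scaling are assumptions made for illustration, since the text does not specify the exact normalization.

```python
import numpy as np

N_MCC = 16  # number of MCC categories reported in the text


def aggregate_mcc(client_ids, mcc_idx, amounts, n_clients, n_mcc=N_MCC):
    """Sum transaction amounts per (client, MCC category) pair."""
    features = np.zeros((n_clients, n_mcc), dtype=float)
    # np.add.at accumulates correctly even when (client, MCC) pairs repeat.
    np.add.at(features, (client_ids, mcc_idx), amounts)
    return features


def standardize(features, eps=1e-12):
    """Zero-mean, unit-variance scaling per feature (an assumed choice)."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / np.maximum(std, eps)


# Hypothetical toy transactions: client index, MCC index, amount.
client_ids = np.array([0, 0, 1, 1, 1, 2])
mcc_idx = np.array([3, 3, 0, 3, 15, 0])
amounts = np.array([10.0, 5.0, 7.0, 1.0, 2.5, 4.0])

X_raw = aggregate_mcc(client_ids, mcc_idx, amounts, n_clients=3)
X = standardize(X_raw)
```

Each row of `X_raw` is one client's 16-element spending vector; categories with no spending stay at zero, and constant columns are mapped to zero by the scaling.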

A Hypothesis on Defaulters and Verifiable Consequence
First, we introduce a hypothesis on the reasons why borrowers can become defaulters.
Hypothesis 1. Clients can become defaulters because (1) they took a loan and had no plans to repay it (we call such clients intentional defaulters, IDs); or (2) something happened that made them unable to pay it off (we call such clients unintentional defaulters, UDs).
Consider a client who suddenly lost their job, had to cover medical costs, got divorced, or got married. All these events are hard to predict in advance by exploring the client's profile when they apply for a loan. However, they significantly affect the client's financial stability in the future and lead to categorizing them as a defaulter. Moreover, unexpected life problems usually do not depend on UDs' prior spending; therefore, these behavior patterns are not reflected in transactions. Hence, the UDs generally behave like dutiful borrowers who pay the loan off on time and in full (as opposed to defaulters). It is important to note that this is a simplified and stereotypical description of the two groups. In reality, we suppose that clients show a tendency to follow one pattern or the other. The motivation behind this tendency is a subject of further research; it may be different from what we described.
Although these two types of clients are indistinguishable for the bank, this difference is very important for predictive models that banks use to decide on a loan approval for the client; unintentional defaulters complicate client analysis.
Hypothesis 2. The behavior of IDs is generally more similar, while the reasons why clients become UDs may not depend on personality but are caused by various external factors.
Following this hypothesis, in this study, we perform behavioral profiling via clustering. We assume that the IDs would form a single cluster due to similar behavior patterns, while UDs behave irregularly compared to IDs, thus, they can be referred to as the outliers of the IDs cluster.
Verifiable Consequence of Hypothesis 2. Eliminating UDs from the dataset can improve the separability of the defaulting and dutiful borrowers.
In our experiments, the separability of two borrower types is measured based on the classification score. To demonstrate the separability improvement, we suggest comparing the classification scores of the full initial dataset and the filtered dataset. The second one is obtained using client profiling, followed by the detection and elimination of outliers, described in detail in Section 5.
It is worth noting that the hypothesis is not restricted to specific names of behavior patterns, because the ground truth labels for these two groups of clients are essentially unknown.

Learning the Hypothesis-Driven Representation of Clients with Autoencoder
The described hypotheses involve measuring the similarity of clients, which requires representing clients in some metric space. Although the clients are described by their transactional profile, it makes sense to learn the clients' representation in a unified vector space. As we have no ground truth about which of the two groups of defaulters each client belongs to, this is an unsupervised representation learning task. In machine learning, autoencoders are typically used to solve it Le (2015). A crucial benefit of applying autoencoders is that we can guide their training by introducing a special regularization term that pushes the data representation toward the form stated in Hypothesis 2.

Autoencoders for Transactional Profiling
To reduce the dimensionality of the data, we use an autoencoder with one hidden layer. The dimension of the hidden layer determines the number of features that we want to extract for describing clients. The autoencoder architecture is presented in Figure 2. In this work, we use the simplest model to have as much control of the learning process as possible. The autoencoder consists of an encoder and a decoder that are trained together: the encoder is trained to approximate a mapping from the input space X to a space Z of the desired object representation, and the decoder is trained to approximate the inverse mapping. It is worth noting that the mapping is not known in advance; therefore, training can be viewed as searching for the most appropriate space Z of lower dimensionality to map the objects to.
The vanilla autoencoder is trained using some reconstruction loss $L$ representing how well the autoencoder can restore its input. The learning process can be described as solving the following optimization problem:
$$\theta^* = \arg\min_{\theta \in \Theta} L(a_\theta, D), \tag{1}$$
where $a_\theta$ is the autoencoder with weights $\theta$ from some parameter space $\Theta$ and $D$ is a training set. In this work, the loss function used is the mean squared error:
$$L(a_\theta, D) = \frac{1}{|D|} \sum_{(x_i, y_i) \in D} \lVert x_i - a_\theta(x_i) \rVert^2, \tag{2}$$
where $x_i$ are data points and $y_i$ are labels, which do not enter the loss. We formulated expectations on how the objects behave, and we can guide the autoencoder to find a space in which the objects behave in the expected way while still allowing the input to be restored. This can be done by introducing a regularization term $R$ representing how well the learned representation satisfies our expectations. The learning process can now be described as solving the following optimization problem:
$$\theta^* = \arg\min_{\theta \in \Theta} \left[ L(a_\theta, D) + \alpha R(a_\theta, D) \right], \tag{3}$$
where $\alpha$ is a regularization coefficient. We use $\alpha = 0.1$ to make the regularization term of the same order as the value of the loss function $L$. $R$ is determined by the specific regularization method described in Sections 5.2 and 5.3. Since we guide the learning process only by our assumptions and do not use any labels to train the autoencoders, this scheme is not subject to overfitting. It is also worth noting that:
• Although the regularization depends only on the output of the hidden layer, we train all the layers with the corresponding loss.
• Weights of the autoencoder are updated iteratively, so any relationship between objects in the target representation space can change dramatically.
The proposed methods aim to identify similar behavior patterns of clients, stated in Section 4. The regularization can be considered to be data driven. In the remaining subsections we describe three methods to define this regularization term.
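To make the structure of the regularized objective concrete, the following NumPy sketch evaluates a reconstruction loss plus a weighted regularization term for a one-hidden-layer autoencoder. It is only an illustration under stated assumptions: the tanh activation, the weight initialization, and `placeholder_regularizer` (mean distance to the centroid of the encoded points) are hypothetical stand-ins and are not the regularizers proposed in this paper.

```python
import numpy as np


def init_params(n_in=16, n_hidden=15, seed=0):
    """Small random weights for a one-hidden-layer autoencoder."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.normal(scale=0.1, size=(n_in, n_hidden)),
        "b_enc": np.zeros(n_hidden),
        "W_dec": rng.normal(scale=0.1, size=(n_hidden, n_in)),
        "b_dec": np.zeros(n_in),
    }


def encode(params, X):
    # tanh is an assumed activation; the text does not name one.
    return np.tanh(X @ params["W_enc"] + params["b_enc"])


def decode(params, Z):
    return Z @ params["W_dec"] + params["b_dec"]


def reconstruction_loss(params, X):
    """Mean over samples of the squared reconstruction error."""
    X_hat = decode(params, encode(params, X))
    return float(np.mean(np.sum((X - X_hat) ** 2, axis=1)))


def placeholder_regularizer(Z):
    """Hypothetical stand-in for R: mean distance to the centroid."""
    return float(np.mean(np.linalg.norm(Z - Z.mean(axis=0), axis=1)))


def objective(params, X, regularizer, alpha=0.1):
    """Reconstruction loss plus alpha times a regularizer on the hidden layer."""
    Z = encode(params, X)
    return reconstruction_loss(params, X) + alpha * regularizer(Z)


X = np.random.default_rng(1).normal(size=(32, 16))
params = init_params()
total = objective(params, X, placeholder_regularizer)
```

The regularizer sees only the encoder output, yet the combined value depends on all weights, which is why a gradient step on this objective updates every layer.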

Neighbor-Based Minimization Method
To make the cluster of IDs well separable, we need to minimize the distance between points in the cluster and maximize the distance between the cluster points and other points, i.e., outliers. This should result in a single cluster of IDs, while all the outliers should be considered to be UDs.
The neighbor-based minimization method is designed to solve the task using the density concept. We want the points corresponding to IDs to form a tightly packed cluster. First, for each point $p$ we calculate the sum of Euclidean distances to its $k$ nearest neighbors, $n_p$. After that, we sort these values: $n_{(1)} \le n_{(2)} \le \dots$. Finally, we evaluate $d_i = n_{(i+1)} - n_{(i)}$. We can guide the autoencoder to obtain a representation in which only one cluster is very tight, while all other points are more distant from each other (or form small clusters). For the points of such a tightly packed cluster, the values $n_p$ would be very small and similar to each other, while $n_q$ for other points would be distinguishably higher, which means that the corresponding values of $d$ would follow the same pattern.
Here, the regularizer for data-driven autoencoder training is a distance relation function defined over these values (Equation (4)). The minimum of the distance relation function may be achieved on the boundary of the cluster. Therefore, the usage of this function causes the regularizer to maximize the distance between the cluster boundary and outliers and to make the cluster tighter. However, we must note that this representation is not robust, as it involves re-evaluation of such distances after each update of the autoencoder weights. Using this regularization is also not safe, as the minimum can be achieved at other points, for instance, between the most distant point and all the other points.
As a result, the representation of the cluster with well-separated outliers can be obtained (Figure 3). Algorithm 1 presents a pseudocode of the method.
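The quantities used by the neighbor-based method (the sums $n_p$, their sorted order, and the differences $d_i$) can be computed as in the sketch below. It is a brute-force illustration on hypothetical toy points and does not reproduce the regularizer of Equation (4); the function names and the value $k = 2$ are assumptions.

```python
import numpy as np


def knn_distance_sums(Z, k):
    """For each point, sum the Euclidean distances to its k nearest neighbors."""
    diff = Z[:, None, :] - Z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nearest = np.sort(dist, axis=1)[:, :k]
    return nearest.sum(axis=1)


def sorted_gaps(n):
    """Sort the sums and return them with the consecutive differences d_i."""
    n_sorted = np.sort(n)
    return n_sorted, np.diff(n_sorted)


# Hypothetical encoded points: a tight cluster of five plus two outliers.
Z = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
    [5.0, 5.0], [-6.0, 4.0],
])
n = knn_distance_sums(Z, k=2)
n_sorted, d = sorted_gaps(n)
largest_gap = int(np.argmax(d))
```

On this toy input, the largest difference separates the five small sums of the cluster points from the two large sums of the outliers. The pairwise distance matrix needs memory quadratic in the number of points, so a spatial index would be preferable for thousands of clients.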

Barycenter-Based Minimization Methods
The barycenter-based minimization method also aims to form the cluster of IDs and separate the UDs from it. To do this, the "densest" point, which we refer to as the barycenter, should be found first. This is the point whose summed distance to its $k$ nearest neighbors is minimal. This point can be considered to be the center of the cluster. Here, the regularizer is expected to bring the cluster points closer to the center and push outliers away. For this purpose, we evaluate the Euclidean distances from each point to its $k$ nearest neighbors and sum these distances. The point with the minimum sum is selected as the barycenter $b$. Then, the Euclidean distances $\rho(b, p)$ from the barycenter $b$ to every other point $p$ are evaluated. We sort them, $\rho_{(1)} \le \rho_{(2)} \le \dots$, and evaluate the differences $d_i = \rho_{(i+1)} - \rho_{(i)}$ to make the series more robust. The minimization function is defined in the same way as in the previous minimization method. The minimum is reached on the cluster boundary, determining its radius. Thus, the ratio of the cluster radius to the distance of the outlier closest to the cluster is minimized. In our experiments, we constrain the cluster size, restricting its infimum and supremum so that the cluster contains 20-50% of all the points.
This method is more robust than the first one because its values depend only on distances to a single point, which do not change as drastically after weight updates. However, the stability may be affected when, after a weight update, a new point is chosen as the barycenter, which cancels the effect of training the representation with respect to our expectation. A similar issue arises when another point becomes the boundary point, so that part of the learning progress is also lost.
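The barycenter selection and the size-constrained choice of a cluster boundary described above can be sketched as follows. This is an illustration only: picking the boundary at the largest admissible gap is an assumed stand-in for the paper's ratio-based regularizer, and the function names, toy points, and the 20% and 80% bounds are hypothetical.

```python
import numpy as np


def find_barycenter(Z, k):
    """Index of the point with the smallest summed distance to its k nearest neighbors."""
    diff = Z[:, None, :] - Z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    knn_sums = np.sort(dist, axis=1)[:, :k].sum(axis=1)
    return int(np.argmin(knn_sums))


def sorted_distances_from(Z, b):
    """Sorted Euclidean distances from point b to every other point."""
    rho = np.linalg.norm(Z - Z[b], axis=1)
    return np.sort(np.delete(rho, b))


def select_boundary(rho, min_frac, max_frac):
    """Sorted index of the boundary point, chosen at the largest admissible gap."""
    n_points = len(rho) + 1           # the barycenter itself is included
    gaps = np.diff(rho)               # d_i = rho_(i+1) - rho_(i)
    sizes = np.arange(len(gaps)) + 2  # cluster size if the boundary is at index i
    admissible = (sizes >= min_frac * n_points) & (sizes <= max_frac * n_points)
    if not admissible.any():
        raise ValueError("no cluster size satisfies the constraints")
    idx = np.flatnonzero(admissible)
    return int(idx[np.argmax(gaps[idx])])


# Hypothetical encoded points: a tight cluster of five plus two outliers.
Z = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
    [5.0, 5.0], [-6.0, 4.0],
])
b = find_barycenter(Z, k=2)
rho = sorted_distances_from(Z, b)
boundary = select_boundary(rho, min_frac=0.2, max_frac=0.8)
radius = float(rho[boundary])
cluster_size = boundary + 2
```

Points farther from the barycenter than `radius` would be treated as outliers in this sketch; the size constraint prevents the trivial solutions of a one-point cluster or a cluster containing every point.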
To overcome the described problems, we introduce the stabilized barycenter-based method, aimed at increasing the stability of the cluster radius using an improved regularizer function. In this modification, the cluster center is fixed over the forward passes of the autoencoder, making the radius dependent on it. In addition, the minimization of the radius ratio here depends on the previous value of the cluster radius, which is provided by the multiplier. Thus, the regularizer is represented as a hysteresis function in which $r_b$ is the cluster radius at the previous training step. If the difference between radii is small, this multiplier value is close to 1; otherwise, it is greater than 1, thereby increasing the product value. The method pseudo-code is presented below (Algorithm 2).
Algorithm 2 Algorithm for the stabilized barycenter-based method
for i in points do
  for j in points do

Experimental Setup
In this research paper, the following experimental pipeline is used:
1. First, various predictive models for credit scoring are trained and evaluated on the full initial dataset. The best classifier is chosen as the baseline.
2. The autoencoder models are trained on the full dataset (as opposed to the filtered one). Three proposed regularization methods result in three trained models.
3. After that, these models are used for defaulters filtering (i.e., the forward pass is performed for them only) based on the encoder output.
4. To evaluate the encoder dimensionality reduction impact, principal component analysis (PCA) is applied to the full dataset.
5. To compare the proposed filtering approach, the dataset random filtering is performed as well. The filtered datasets should be identically distributed.
6. Finally, the obtained datasets are used to train and evaluate the best baseline classifier for further results comparison.
In our experiments, a 70/30 train/test ratio and 5-fold cross-validation are used.
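The data splitting can be sketched as below. How the 70/30 split and the 5-fold cross-validation are combined is not stated in the text; this sketch assumes the folds are built inside the training part, and the function names and seed are hypothetical.

```python
import numpy as np


def train_test_indices(n, test_frac=0.3, seed=0):
    """Shuffle indices and hold out a fraction of them for testing."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_test = int(round(n * test_frac))
    return perm[n_test:], perm[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (train, validation) index arrays for k-fold cross-validation."""
    folds = np.array_split(indices, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val


train_idx, test_idx = train_test_indices(100)
folds = list(k_fold_indices(train_idx, k=5))
```

A stratified variant that preserves the share of defaulters in each fold would be a natural refinement for imbalanced data.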

Performance Evaluation and Implementation Details
The evaluation is based on the weighted $F_1$-score and the AUC score as metrics, which are defined as follows.
$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
where Precision is the percentage of objects assigned to a class by the model that are classified correctly, and Recall is the percentage of objects of a class that are recognized correctly. We use the weighted $F_1$-score, which weights the score of each class by the number of samples in that class.
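As a worked illustration of the weighted score, the following standard-library sketch computes the per-class F1 values and averages them with weights proportional to class support. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def f1_for_class(y_true, y_pred, cls):
    """F1-score of a single class, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Average the per-class F1-scores, weighted by class support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        f1_for_class(y_true, y_pred, cls) * count / total
        for cls, count in support.items()
    )


y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]
score = weighted_f1(y_true, y_pred)  # 0.6 for this toy example
```

Here class 0 has F1 = 2/3 with support 3 and class 1 has F1 = 1/2 with support 2, so the weighted score is (2/3)(3/5) + (1/2)(2/5) = 0.6.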
The area under the curve, or AUC, is an aggregated characteristic of the classification score and is based on the ROC (Receiver Operating Characteristic) curve. In turn, the ROC curve plots the dependence between the True Positive Rate (TPR) and the False Positive Rate (FPR). AUC varies between 0 and 1. The higher the AUC value, the better the predictive model.
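The AUC can equivalently be read as the probability that a randomly chosen positive object receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise formulation; it is a simple quadratic-time illustration with hypothetical names and toy scores, not the routine used in the experiments.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ranked correctly
```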
As the dataset is highly imbalanced, we first perform its balancing using the imblearn library (https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html accessed on 13 March 2021). We use the SMOTE oversampling method, which allows preserving the data distribution. This method synthesizes new samples of the minority class based on the k nearest neighbors of a randomly selected instance of the minority class. A new point is taken between this instance and its randomly selected neighbor. We use k = 5. All the reported results are obtained on the balanced datasets.
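The interpolation idea behind SMOTE can be illustrated with the simplified NumPy sketch below. It is not the imblearn implementation used in the experiments; the function name, the seed, and the toy data are assumptions, and the sketch requires more than k minority points.

```python
import numpy as np


def smote_sample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by neighbor interpolation."""
    rng = np.random.default_rng(seed)
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # exclude each point from its own neighbors
    neighbors = np.argsort(dist, axis=1)[:, :k]
    # Randomly selected minority instances and one random neighbor for each.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    # Position of the new point on the segment between the two, in [0, 1).
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[picked] - X_min[base])


X_min = np.random.default_rng(1).normal(size=(20, 16))
X_new = smote_sample(X_min, n_new=30, k=5)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already occupied by the minority class.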
All the methods are implemented with TensorFlow2/Keras and scikit-learn Python libraries. For experiments, we used GeForce GTX 1080 Ti GPU.

The following predictive models are considered for credit scoring:
1. Logistic regression (LR). A simple classification technique based on the logistic function to model a dependence.
2. Decision trees (DT). The idea is to recursively split a feature space with a feature value until some criterion is reached (e.g., a tree leaf has a minimum number of target classes).
3. Feedforward neural network (NN). This is the simplest type of artificial neural network that includes fully connected layers.
4. K-nearest neighbors (kNN). A metric classification technique, which assigns a class to an object based on the classes of its k nearest neighbors.
5. Random forest (RF). An ensemble method, in which several independent models make predictions; after that, the final prediction is formed by voting in the case of the classification problem.
6. Gradient boosting (GB). This is an ensemble method, which minimizes the training error of a linear composition of classifiers based on gradient descent.

Results
Following the experimental pipeline, we first classify the full dataset to obtain the baseline credit scoring model. The results are presented in Table 1. The optimal hyperparameters of the classifiers are found using Grid Search, which is a full parameter enumeration method.
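Full parameter enumeration can be sketched with the standard library as follows. The parameter names, the grid values, and `toy_score` are hypothetical; in practice the scoring function would train a classifier with the given parameters and return its cross-validated score.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and keep the best-scoring one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Hypothetical score surface with its maximum at depth 3 and rate 0.1.
    return -((params["max_depth"] - 3) ** 2) - (params["learning_rate"] - 0.1) ** 2


param_grid = {"max_depth": [2, 3, 4], "learning_rate": [0.01, 0.1, 1.0]}
best_params, best_score = grid_search(param_grid, toy_score)
```

The cost grows as the product of the grid sizes, which is the price of exhaustive enumeration.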
The best result is provided by the Gradient Boosting Machine (GB). To demonstrate the impact of the proposed methods, we need to compare them to the baseline scoring model as follows. We train the three autoencoder models with different regularizers. For each model, the encoded features are obtained from the hidden layer of the autoencoder. These features are clustered with a certain method, and the outliers are identified. As a result, the outliers are filtered out with the trained model, yielding a new dataset. On the new datasets, the best classifier (GB) is trained, and the results of the new scoring models are compared with the baseline. With this approach, we test the verifiable consequence of Hypothesis 2.

Neighbor-Based Minimization Method Results
The neighbor-based minimization method has not clustered the IDs as expected. Firstly, as this method does not fix the center of the cluster over the training steps, the center may change, resulting in the cluster being "tightened" to different centers at every training step. This instability does not allow obtaining the tight IDs cluster, which affects the model training. Secondly, with the cluster center unfixed, the minimization according to Equation (4) causes the method to find small clusters and consider the distance ratio with respect to them, instead of forming a single cluster. The visualized process of clustering training is represented in Figure 4. The model is not able to detect the outliers. Hence, it is excluded from further comparison with the baseline.

Barycenter-Based Minimization Method Results
The barycenter-based minimization method obtains a well-clustered representation of the IDs using the fixed cluster center. Fixing the center allowed us to stabilize the optimization process using the regularizer from Equation (6). In our experiments, we test both method modifications: with the stabilizing multiplier and without it. Figure 5 demonstrates the clustering results during the model training. Only this method (original and stabilized) participates in the further comparison.

Results Comparison and Discussion
For the comparison of the results, we train GB on the data with three different feature sets: (1) the original 16 features, (2) the features obtained by the encoder, and (3) the features obtained by the PCA algorithm. In addition, to ensure that the obtained result does not depend on the distribution of the dataset (the ratio of defaulting and dutiful borrowers), we conduct experiments with random filtering of the same number of defaulters from the full dataset as in the dataset obtained by the proposed filtering method. Thus, we train GB on datasets of three different sizes: the full dataset, the dataset filtered randomly, and the dataset filtered using the proposed method.
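The two filtered datasets being compared can be built as in the sketch below, which removes detected outliers in one case and the same number of randomly chosen defaulters in the other. The outlier mask is assumed to come from the trained autoencoder; the function names, the seed, and the toy data here are hypothetical.

```python
import numpy as np


def filter_outliers(X, y, outlier_mask):
    """Drop the objects flagged as outliers by the profiling model."""
    keep = ~outlier_mask
    return X[keep], y[keep]


def filter_random_defaulters(X, y, n_remove, seed=0):
    """Drop n_remove randomly chosen defaulters (label 1) as a control."""
    rng = np.random.default_rng(seed)
    defaulters = np.flatnonzero(y == 1)
    removed = rng.choice(defaulters, size=n_remove, replace=False)
    keep = np.ones(len(y), dtype=bool)
    keep[removed] = False
    return X[keep], y[keep]


# Hypothetical toy data: 20 clients, 8 of them defaulters.
X = np.random.default_rng(1).normal(size=(20, 16))
y = np.array([1] * 8 + [0] * 12)
outlier_mask = np.zeros(20, dtype=bool)
outlier_mask[[1, 4, 6]] = True  # three defaulters flagged as outliers

X_method, y_method = filter_outliers(X, y, outlier_mask)
X_random, y_random = filter_random_defaulters(X, y, int(outlier_mask.sum()))
```

Both filtered datasets end up with the same size and the same class ratio, so any difference in the classifier score can be attributed to which defaulters were removed rather than to how many.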
To choose the dimensionality of the encoder, we conducted experiments with a full Grid Search over the number of neurons from 2 to 15 (for the results, see Figure 6). The best result was achieved using 15 neurons in the encoder output layer. Hence, the PCA output dimensionality was also set to 15 components. The cross-validated results of GB-based credit scoring applied to eight datasets are presented in Tables 2-4. In this study, the PCA method was not applied to the filtered dataset obtained by the proposed method, because this would change the data distribution. Two ranges of the cluster size were tested for the barycenter-based minimization method: from 20% to 80%, and from 50% to 89% of the initial number of points. Here, the upper limit ensures that the cluster does not include all the points. Two configurations of the barycenter-based minimization method were tested: with the stabilizer and without the stabilizer.
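For reference, the PCA baseline with 15 components can be sketched through a singular value decomposition of the centered data. This is a generic illustration with hypothetical names and random toy data, not the library routine used in the experiments.

```python
import numpy as np


def pca_transform(X, n_components):
    """Project centered data onto its leading principal directions."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


X = np.random.default_rng(0).normal(size=(200, 16))
Z_pca = pca_transform(X, n_components=15)
```

Unlike the encoder, this projection is linear and is fitted without any regularization term, which makes it a natural reference point for the dimensionality reduction step alone.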
Our experiments have demonstrated that the proposed data-driven regularization method is superior to the baseline model on all the datasets tested. The best result is achieved using the stabilized barycenter-based method, which indicates the importance of taking into account the cluster size from the previous training step. The developed method improves the solution of the credit scoring problem and provides better performance than models without client transaction profiling. As we can see, filtering out those who were considered to be outliers by the autoencoder has improved the predictive quality of the scoring model compared with the original dataset and with the dataset of the same size with randomly filtered objects. As the autoencoder regularization is based only on the defaulters, its representation is not overfitted towards discrimination of defaulters and non-defaulters. We therefore conclude that the verifiable consequence is confirmed. This is good evidence confirming Hypothesis 2, which, in turn, is evidence confirming Hypothesis 1.
However, we must note that stronger evidence is needed to confirm Hypothesis 2 and, especially, Hypothesis 1. There may be other reasons why filtering out objects in this way improves the predictive accuracy. Training a representation in which we leave only a tightly packed cluster of objects of one class may ease the process of separating these objects from objects of another class.

Discussion and Conclusions
In this paper, we proposed a competitive methodology for the problem of binary credit scoring based on borrowers' financial transactions. We formulated a hypothesis for profiling bank clients who took a loan, using their transactional data. It was assumed that a client can become a defaulter either intentionally or unintentionally, and that identifying unintentional defaulters as outliers for further elimination can enhance the credit scoring model.
To find the optimal client representation, we attempted to reduce the feature space dimensionality using autoencoders. As for the profiling approach, we developed three methods of autoencoder regularization. The proposed regularizers are used to cluster clients in the output space of the encoder and are aimed at obtaining a tightly packed cluster of intentional defaulters, with the unintentional defaulters as outliers.
We showed that the proposed profiling methods efficiently filter clients, which allows the classification model for credit scoring to generalize better. The scoring model applied after the proposed filtering method outperformed the same model applied to the identically distributed, randomly filtered dataset. Additionally, our experiments showed that the proposed encoder model is superior to the PCA method with the same number of components. The obtained results support the hypotheses stated in this research and open a new direction in approaches to the problem of credit scoring.
In our study, we used only one hidden layer of the autoencoder, leaving the deeper models for further research. We also used the reduced, or squeezed, client representation, disregarding the transactions as sequences. However, considering temporal connections may have a significant impact on the task solution. As a direction for future work, recurrent models, namely LSTMs Gers et al. (2002), as well as Transformers Vaswani et al. (2017) should be investigated.