Towards Repayment Prediction in Peer-to-Peer Social Lending Using Deep Learning

Abstract: Peer-to-Peer (P2P) lending transactions take place when lenders choose a borrower and lend money. It is important to predict whether a borrower can repay, because the lenders must bear the credit risk when the borrower defaults, but it is difficult to design feature extractors for the very complex information about borrowers and loan products. In this paper, we present a deep convolutional neural network (CNN) architecture for predicting repayment in P2P social lending that extracts features automatically and improves performance. CNN is a deep learning model for classifying complex data, which extracts discriminative features automatically through convolution operations on the lending data. We classify the borrower’s loan status by capturing robust features and learning the patterns. Experimental results with 5-fold cross-validation show that our method automatically extracts complex features and is effective for repayment prediction on Lending Club data. In comparison with other machine learning methods, the standard CNN achieves the highest performance, 75.86%. Exploiting various CNN models such as Inception, ResNet, and Inception-ResNet yields a state-of-the-art performance of 77.78%. We also demonstrate the quality of the features extracted by our model by projecting the samples into the feature space.


Introduction
Peer-to-Peer (P2P) lending is a FinTech service that directly matches lenders with borrowers through online platforms, without the intermediation of financial institutions such as banks [1]. P2P lending has grown rapidly, attracting many users and generating huge amounts of transaction data. For example, the total loan issuance of the Lending Club reached about $31 billion in the second half of 2017.
When a borrower applies for a loan on the platform, multiple lenders select the borrower and lend money. Lenders suffer a financial loss when borrowers fail to repay, or repay only partially, within the repayment period; they bear the risk of borrower default [2]. To reduce this financial risk, it is important to predict defaults and assess the creditworthiness of borrowers [3].
Since P2P social lending is processed online, large and varied data are generated, and P2P lending platforms provide much information on borrowers' characteristics to address problems such as information asymmetry and transparency [4,5]. The availability and prevalence of transaction data on P2P lending have attracted many researchers' attention. Recent studies mainly address issues such as assessing credit risk, optimizing portfolios, and predicting default.
They extract features from information on borrowers and loan products of transaction data and solve the problems using machine learning methods with extracted features [6]. Most studies design feature extractors based on statistical methods [7], and extract hand-crafted feature representations [8].
However, these approaches face problems of scale and variety. Conventional machine learning is difficult to train and test on large data [9], and high-performing tree-based classification methods require many features [10]. Moreover, statistical and hand-crafted methods are limited in their ability to capture the relationships between the complex variables inherent in such varied data.
In the case of the Lending Club in the United States, it provides a total of about 1.3 million records, consisting of 42,535 in 2007-2011, 188,181 in 2012-2013, 235,629 in 2014, 421,095 in 2015, and 434,407 in 2016 (March 2017, https://www.lendingclub.com). The amount of data in P2P lending is increasing, and the data structure is very large and complex. Table 1 shows the statistics of the data from the Lending Club, and Table 2 shows the description of some attributes.

Figure 1 shows some of the correlation plots for the loan status of the samples after normalizing the raw data. As can be seen in the figure, the "charged off" and "fully paid" classes have very similar correlation plots. These loan status classes can be easily confused with each other, and it is difficult to extract discriminative features for the loan status.

Deep learning, which has become a huge tide in the fields of big data and artificial intelligence, has made significant breakthroughs in machine learning and pattern recognition research. It provides predictive models for large and complex data [11], automatically extracting non-linear features by stacking layers deeply. In particular, the deep convolutional neural network (CNN), one of the deep learning models, extracts local features hierarchically through weighted filters [12]. Researchers have studied CNNs mainly for recognizing patterns in images [13], video [14], speech [15], text [16], and other datasets [17]. CNNs have also been applied to other problems, such as recognizing people's emotions [18,19] and predicting power consumption [20].
In this paper, we exploit a deep CNN approach for repayment prediction in social lending. The CNN is well known as a powerful tool for image classification, but it has not been widely explored for general data mining tasks. We aim to extend the edge of CNN applications to large-scale data classification. The social lending data contain specific patterns for the borrowers and the loan products. The convolutional layers capture various features of borrowers from the transaction data, and the pooling layers merge similar features into one. By stacking several convolutional and pooling layers, basic features are extracted in the lower layers, and complex ones are derived in the higher layers. The deep CNN model can classify the loan status of borrowers by extracting discriminative features and learning the patterns in the lending data.
We confirm the potential of CNN for the social lending problem by designing a one-dimensional CNN, analyzing the extracted features and the performance on Lending Club data, and evaluating whether the feature representation generalizes to other lenders. We show how various convolutional architectures affect the overall performance, and how systems that do not require hand-crafted features outperform other machine learning algorithms. We provide empirical rules and principles for designing deep CNN architectures for repayment prediction in social lending.

This paper is organized as follows. In Section 2, we discuss the related work on social lending. Section 3 explains the proposed deep CNN architecture. Section 4 presents the experimental results, and Section 5 concludes this paper.

Related Works
Milne et al. stress that P2P lending platforms are increasing in many countries around the world, and that the probability of increased defaults and the resulting potential problems are important [21]. As shown in Table 3, there are many studies on borrower default and credit risk in P2P social lending. Most researchers have used relatively small amounts of data and few attributes, extracting features with statistical or hand-crafted methods, and have presented default prediction and credit risk assessment models using machine learning. Lin et al. proposed a credit risk assessment model using Yooli data from a P2P lending platform in China [7]. They extracted the features affecting defaults by analyzing the demographic characteristics of borrowers with a nonparametric test: ten variables, including gender, age, marital status, and loan amount, were extracted, and a credit risk assessment model was established using logistic regression. Malekipirbazari and Aksakalli assessed credit risk in social lending using random forest [8]. Data pre-processing and manipulation tasks were used to extract 15 features, and the performance was evaluated according to the number of features: as the number of features grew, performance improved, and their method achieved higher performance than the compared methods.
All of these studies have hand-designed their own features, which makes it difficult to compare them on common experimental grounds [29]. As the amount of data and the number of attributes increase, it becomes difficult to extract discriminative features of the borrower. However, because big data brings new opportunities for discovering new value [30], it is important to use all the information about the borrower to predict repayment accurately.
On the other hand, there have been studies using large amounts of data. Kim and Cho used semi-supervised learning to improve performance by leveraging the unlabeled portion of the Lending Club data [23]: they predicted borrower default using a decision tree with unlabeled data. Vinod Kumar et al. analyzed credit risk by relabeling all data as "Good" or "Bad" across classes such as "current," "default," and "late," including the data of borrowers who were "fully paid" and "charged off" [10]. However, these studies also require a feature extraction process. In this paper, we show that a deep CNN can overcome the problem of default prediction by using all the data and attributes.

Figure 2 shows the overall architecture for social lending repayment prediction using deep CNN. We train the deep CNN using the formulations defined below. The key idea is to learn a feature space that captures inherent properties, such as the characteristics of the borrower or the loan product, using the data of many borrowers. We then train classifiers to model the characteristics of each borrower using this feature space.

Deep Learning for Repayment Prediction
The learned network is used to project the social lending data into the representation space learned by CNN and predict the repayment of the borrowers through the softmax classifier. The network can easily predict the repayment of borrowers by extracting features and by capturing the characteristics of the borrower through the convolution layers and the pooling layers.

Convolutional Neural Network
Convolutional neural networks perform convolution operations instead of matrix multiplication [17]. In the continuous case, the convolution of two functions f and g is defined as

(f ∗ g)(t) = ∫ f(τ) g(t − τ) dτ. (1)

In the discrete case, the integral is replaced by a summation:

(f ∗ g)[n] = Σ_m f[m] g[n − m]. (2)

Discriminative features are extracted from the information about borrowers and loan products in the Lending Club data through local connections leveraging these convolution operations. Let x = [x_1, x_2, ⋯, x_n] be the preprocessed lending data. The output y_{i,j}^l is obtained from the input vector through the convolution layer as in Equation (3). Several feature maps are generated from the lending data using the trained convolution filters, and complex features of the lending data are captured by the following activation function:

y_{i,j}^l = σ( Σ_{m=1}^{M} w_{m,j}^l x_{i+m−1}^{l−1} + b_j^l ), (3)

where y_{i,j}^l is calculated from the output vector x^{l−1} of the previous layer and the convolution weights w^l, l is the index of the layer, M is the filter size, j is the index of the filter, b_j^l is a bias term, and σ is the activation function. Here, we use ReLU for the activation function: σ(x) = max(0, x).
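As a minimal NumPy sketch of Equation (3) (the input, filter, and bias values below are toy numbers, not actual Lending Club inputs):

```python
import numpy as np

def relu(x):
    # ReLU activation: sigma(x) = max(0, x)
    return np.maximum(0.0, x)

def conv1d(x, w, b):
    """'Valid' 1D convolution of input x with filter w plus bias b,
    followed by ReLU, as in Equation (3)."""
    M = len(w)                       # filter size
    n_out = len(x) - M + 1           # one output per window position
    y = np.empty(n_out)
    for i in range(n_out):
        y[i] = x[i:i + M] @ w + b    # local weighted sum over the window
    return relu(y)

# Toy example: a length-6 input and a size-3 filter (hypothetical values)
x = np.array([0.2, 0.8, 0.1, 0.9, 0.4, 0.6])
w = np.array([1.0, -1.0, 0.5])
y = conv1d(x, w, b=0.0)
print(y.shape)   # (4,)
```

Each feature map in the network is produced by sliding one such learned filter over the input vector; negative responses are zeroed by ReLU.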
In the pooling layer, the semantically similar features extracted from the convolution layer are merged into one [31]. The pooling layers are used to extract representative values of features from the Lending Club data. The maximum of local patches in each feature map is computed to reduce dimensionality and distortion. Equation (4) represents the process of extracting the maximum in the l-th pooling layer:

p_{i,j}^l = max_{r ∈ R} y_{i×s+r, j}^l, (4)

where R denotes the pooling size of a certain range, and s denotes the stride by which the pooling window moves.
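The max-pooling of Equation (4) can be sketched in the same toy setting (the input vector here is illustrative):

```python
import numpy as np

def max_pool1d(y, pool_size, stride):
    """Max pooling of Equation (4): take the maximum of each local
    patch of width pool_size, moving by stride."""
    n_out = (len(y) - pool_size) // stride + 1
    return np.array([y[i * stride:i * stride + pool_size].max()
                     for i in range(n_out)])

y = np.array([0.0, 1.15, 0.0, 0.8])
p = max_pool1d(y, pool_size=2, stride=1)  # windows [0,1.15], [1.15,0], [0,0.8]
print(p)   # maxima of the three windows
```

With pooling size 2 and stride 1 (the settings used later in Section 4), the output keeps one representative value per overlapping pair of neighbors.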
Several convolution and pooling layers are stacked up, performing hierarchical feature extraction on the lending data. They extract informative and discriminative representations from the data, with more complex features appearing from the bottom up [9].
The feature maps generated by repeating several convolution and pooling layers on the lending data are connected one-dimensionally through the fully-connected layer, and the data are classified into loan statuses using the activation function. The combination of a fully-connected layer and a softmax classifier is used to predict the repayment of the borrower. The features extracted from the convolution and pooling layers are flattened to form the feature vector f = [f_1, f_2, ⋯, f_k], where k denotes the number of units in the last pooling layer. This is used as the input of the fully-connected layer. Equation (5) shows the computation of the hidden node h_j in the fully-connected layer:

h_j = σ( Σ_i w_{ij} f_i + b_j ), (5)

where σ is an activation function, w_{ij} is the weight connecting nodes i and j, and b_j is a bias term.
The output of the last layer through the softmax classifier is the loan status (charged off, fully paid). In Equation (6), L is the index of the last layer, and T is the total number of classes:

P(c | f) = exp(h_c^L) / Σ_{t=1}^{T} exp(h_t^L). (6)
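A minimal sketch of Equations (5) and (6) with random toy weights (the layer sizes and values are illustrative, not the trained model):

```python
import numpy as np

def dense(x, W, b, activation=None):
    # Fully-connected layer of Equation (5): h_j = sigma(sum_i w_ij x_i + b_j)
    h = W @ x + b
    return np.maximum(0.0, h) if activation == "relu" else h

def softmax(z):
    # Softmax classifier of Equation (6): P(c) = exp(z_c) / sum_t exp(z_t)
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
f = np.array([1.15, 1.15, 0.8, 0.3])          # flattened feature vector (toy)
h = dense(f, rng.normal(size=(5, 4)), np.zeros(5), activation="relu")
p = softmax(dense(h, rng.normal(size=(2, 5)), np.zeros(2)))
print(p)   # probability over the two classes {charged off, fully paid}
```

The class with the larger softmax probability is taken as the predicted loan status.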
Forward propagation is performed using Equations (3)-(6), yielding the error of the network. The deep CNN weights are updated using the backpropagation algorithm with the RMSProp optimizer [32], which minimizes the categorical cross-entropy over mini-batches of the lending data. RMSProp adapts the step size of each parameter using an exponential moving average of its recent squared gradients. We set the learning rate to 0.001 and the number of samples per batch to 512.
E[g²]_t = γ E[g²]_{t−1} + (1 − γ) g_t²,
θ_{t+1} = θ_t − η ∇_θ J(θ_t) / √(E[g²]_t + ε),

where E[g²]_t is the exponentially decaying average of squared gradients, η is the learning rate, γ is the decay (momentum) term applied to every parameter at every time step t, and ∇_θ J(θ) is the gradient of the objective function. When the stopping criterion is satisfied, forward and backward propagation is stopped.
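The RMSProp update above can be sketched on a toy one-dimensional objective (J(θ) = θ², not our CNN loss; the hyperparameter values mirror common defaults and our learning rate of 0.001):

```python
import numpy as np

def rmsprop_step(theta, grad, avg_sq, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp update: keep an exponentially decaying average of
    squared gradients and divide each parameter's step by its root."""
    avg_sq = gamma * avg_sq + (1.0 - gamma) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(avg_sq) + eps)
    return theta, avg_sq

# Minimize J(theta) = theta^2 (gradient 2*theta) starting from theta = 3
theta, avg_sq = 3.0, 0.0
for t in range(5000):
    theta, avg_sq = rmsprop_step(theta, 2.0 * theta, avg_sq)
print(abs(theta))   # close to the minimum at theta = 0
```

Because the step is normalized by the running gradient magnitude, the effective step size stays near the learning rate regardless of the raw gradient scale.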

Architecture and Hyperparameters
A deep CNN can take many structures depending on the combination of hyperparameters. Hyperparameters affect the process of extracting features, the learning time, and the performance [33]. To determine the optimal architecture of a deep CNN, including its hyperparameters, it is necessary to understand the domain; in our case, this means classifying repayment in P2P social lending.
Unlike images, the Lending Club data do not have strong relationships between adjacent attributes, so a small window size and a small stride should be used to minimize the loss of information in the convolution and pooling layers. An activation function such as the rectified linear unit (ReLU) should also be used to extract nonlinear patterns [34]. We design the network as shown in Table 4. The input of the network is 1D data of size 1 × 72. The Lending Club data go through the convolution and pooling layers, followed by two fully-connected layers.
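As a rough sanity check of the layer sizes (the depth of four conv+pool stages follows the best configuration reported in Appendix A, but treat this as an illustrative sketch rather than the exact Table 4 layout), the feature-map length after each stage for the 1 × 72 input can be computed as:

```python
def conv_out(n, kernel, stride=1):
    # output length of a 'valid' 1D convolution
    return (n - kernel) // stride + 1

def pool_out(n, pool, stride):
    # output length of 1D max pooling
    return (n - pool) // stride + 1

n = 72                       # the 1 x 72 preprocessed lending vector
for stage in range(4):       # four conv+pool stages (illustrative depth)
    n = conv_out(n, kernel=3)           # kernel size 3, stride 1
    n = pool_out(n, pool=2, stride=1)   # pooling size 2, stride 1
    print(f"stage {stage + 1}: feature-map length {n}")
```

With kernel size 3 and pooling size 2 at stride 1, each stage shortens the feature map by only 3 positions, which keeps information loss small, as intended.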

Dropout
Overfitting occurs as the layers get deeper or the network becomes more complex [35]: the model fits the training data too closely, resulting in low accuracy on new data. To prevent overfitting, we can use regularization, dropout, or data augmentation; in this paper, we choose dropout.
Dropout is a regularization technique that avoids overfitting by omitting a portion of the network [36]. It deletes hidden nodes, except input and output nodes, and uses only some of the weights contained in each layer, thereby allowing robust features to be learned without relying on particular neurons [37]. Dropout is governed by an inclusion probability and is performed independently for each node and each training sample of the lending data. The dropout probability affects the performance: overfitting or underfitting can occur if it is too small or too large. We set the value to 0.25 and apply it before the last fully-connected layer.
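A minimal sketch of (inverted) dropout as described above, using toy activations; the rescaling by the keep probability is one common convention and is an implementation choice, not stated in the text:

```python
import numpy as np

def dropout(h, rate, rng, training=True):
    """Inverted dropout: zero each hidden unit with probability `rate`
    and rescale the survivors so the expected activation is unchanged."""
    if not training:
        return h                      # dropout is disabled at test time
    keep = 1.0 - rate
    mask = rng.random(h.shape) < keep
    return h * mask / keep

rng = np.random.default_rng(42)
h = np.ones(10000)                    # toy hidden activations
out = dropout(h, rate=0.25, rng=rng)  # our setting: dropout probability 0.25
print(out.mean())                     # close to 1.0; ~25% of units are zeroed
```

Each training sample draws a fresh mask, so no hidden unit can rely on any particular co-activated neighbor.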

Lending Club Dataset
In this paper, we use the data from Lending Club, the biggest P2P lending company in the US. A total of 855,502 records were collected for 2015-2016, with 110 predictor attributes such as loan amount, payment amount, and loan period; 143,823 records with 63 attributes were finally used. The attributes ruled out include those that cannot be used for prediction, such as the borrower ID, URL, and the description of loans; those with more than 80% missing values; and those that are filled in only after the borrower starts to repay [10].
As the input of the CNN ranges in [0, 1], we preprocess the 63 attributes used for prediction. The categorical variables were converted into binary dummy variables, and the continuous variables were normalized to [0, 1] by min-max scaling after removing 1% of the outliers.
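A hedged sketch of this preprocessing (the column names and values are toy examples, and the exact outlier rule is an assumption; the paper only states that 1% of the outliers are removed before scaling):

```python
import numpy as np

def minmax_clip(x, pct=1.0):
    """Clip the extreme `pct` percent at each tail, then min-max scale
    the column to [0, 1]."""
    lo, hi = np.percentile(x, [pct, 100.0 - pct])
    x = np.clip(x, lo, hi)
    return (x - lo) / (hi - lo)

def one_hot(categories):
    """Encode a categorical column as binary dummy variables."""
    levels = sorted(set(categories))
    return np.array([[1.0 if c == lv else 0.0 for lv in levels]
                     for c in categories])

loan_amnt = np.array([1000.0, 5000.0, 15000.0, 35000.0, 400000.0])  # toy values
grade = ["A", "B", "A", "C", "B"]                                   # toy values
scaled = minmax_clip(loan_amnt)
dummies = one_hot(grade)
print(scaled.min(), scaled.max())   # 0.0 1.0
```

The scaled continuous columns and the dummy columns are then concatenated into the 1 × 72 input vector of the network.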

Result and Analysis
Experimental results with the proposed method are described in this section. First, we show the experimental results on the validation set used to design the architecture of the proposed method: we evaluate the performance with various loss functions and hyperparameters, and compare it with other methods. Afterwards, we analyze the misclassification cases in the confusion matrix and the deep CNN models using t-distributed stochastic neighbor embedding (t-SNE) [38].
The hyperparameters are adjusted while maintaining the best configuration based on the hyperparameters mentioned in Section 3.2. The size of the input vector is 1 × 72, and the ranges of the parameter values are given in Table 5. The hyperparameters were tuned over 100 epochs on a validation set, and the model achieving the highest performance was saved.
The parameters that most affected the performance were the stride of the pooling layer and the batch size; the parameter with the least effect was the number of filters in the convolution layer. The highest performance was obtained when the number of filters was 32, the kernel size was 3, the pooling window size was 2, the pooling stride was 1, the dropout rate was 0.25, the hidden size was 512, the batch size was 512, and the number of training epochs was 100. Appendix A presents the experimental results for hyperparameters and loss functions.

Comparison of performance with other methods. We present a comparison of accuracy with other methods on the test set, using the best-performing deep CNN model described in Appendix A. The number of hidden nodes for the multi-layer perceptron is set to 15; the kernel, C, and gamma for the SVM are set to RBF, 300, and 1.0; the k-nearest neighbor classifier uses k = 3; the depth of the decision tree is set to 25; and the depth and number of trees of the random forest are set to 30 and 200. All the hyperparameters of the compared methods were optimized over several experiments.
We obtained an accuracy of 75.86%, higher than the conventional machine learning methods. Figure 3 shows the comparison of the proposed method with the other methods. 5-fold cross-validation is performed to verify the usefulness of the proposed method: the deep CNN showed the highest performance among the machine learning methods, followed by random forest, decision tree, and multi-layer perceptron. Figure 4 shows a comparison of accuracy under 5-fold cross-validation.
Preprocessing and feature extraction are important steps in developing a classification system. Table 6 presents a comparative study with the preprocessing, feature selection, and extraction methods. The feature selection methods employ mutual information, information gain, and chi-square statistics, and the features are selected in order of importance. Recently, the restricted Boltzmann machine (RBM) has been exploited to extract effective features [39]. The base classifier for this experiment is a softmax classifier with two fully-connected layers (#features × 512 × 2). Without preprocessing, the model tends to fail to learn and classifies all the data into one class. The feature extraction methods produced higher performance than the feature selection methods, and the RBM achieved performance almost similar to that of the CNN.

We further compare the performance with a variety of CNN models by running additional experiments with the Inception-v3 [40], ResNet [41], and Inception-ResNet [42] models. Table 7 shows the performance of each model. The improved CNN models achieved higher performance and demonstrate the potential of CNNs for predicting repayment in social lending.

Analysis of misclassification cases. Table 8 shows the confusion matrix of the deep CNN model. Our model tends to classify the repaid borrowers well but often fails to classify the non-repaid borrowers. This problem arises because the number of non-repaid borrowers is much smaller than the number of repaid borrowers. Emekter et al. found variables significant for borrower repayment, such as interest rate, home ownership, revolving line utilization, and total amount funded, in a delinquency prediction model using the Lending Club data [43]. We compared the well-classified data with the misclassified data based on these variables. Figure 5a shows the distribution of well-classified samples (A, B in Table 8), and Figure 5b shows the distribution of misclassified samples (C, D in Table 8).
The misclassified data show a tendency opposite to that of the well-classified data: they show the opposite distribution not only on the variables mentioned above but also on other variables such as the loan period, the verification status, and dti. In Figure 5, dti denotes the debt-to-income ratio, calculated from the borrower's total monthly debt payments on the total debt obligations.

Analysis of the deep CNN model. We analyze the feature space of the learned model by projecting the samples in the validation set using t-SNE, to verify that our model extracts discriminative features. t-SNE is a dimensionality reduction technique that helps visualize deep features that are otherwise hard to interpret, maintaining the local structure of the data while revealing important global structure [44]. The more separable the samples of different classes are in the map, the better the features perform.
We use the model saved above to extract features from 10,000 samples by forward propagation and project them into two-dimensional space at the layer before classification. Figure 6a shows the t-SNE results projected in two dimensions. The distributions of the repaid and non-repaid borrowers are well clustered. On the other hand, there are clusters that are mixed, like the marked parts. Three samples are selected at random from each such cluster, as shown in Figure 6b. These samples have similar feature patterns even though they belong to different classes. Nevertheless, the trained model with the extracted features performs repayment prediction very well.

Conclusions
We have presented an architecture of deep CNN for repayment prediction in P2P social lending. The deep CNN model is confirmed to be very effective in repayment prediction compared to feature engineering and conventional machine learning algorithms, and it can thus support lenders' choices. The visualization analysis reveals that the feature space is clustered well depending on the success of repayment, verifying that the features extracted by the deep CNN are effective for the prediction.
In addition, we have analyzed the features extracted by the deep CNN model together with the misclassification cases in the confusion matrix, which reveals the problem of the skewed class distribution.
To solve this problem, we need more data from borrowers who have not repaid. In reality, however, such data are difficult to collect, because there are far fewer borrowers who did not repay than borrowers who did. This problem can be addressed by giving more weight, or a larger loss when misclassified, to the data on the less observed side (non-repaid borrowers). In addition, a deeper CNN architecture can be established by extracting dense and sparse information at the same time using filters of various sizes, which may better capture the features of borrowers who did not repay. This problem remains as future work. We also need more effort to automatically find the various parameters of the deep CNN, such as the number and order of layers, in addition to the basic parameters such as the number of filters and the kernel size, in order to determine the optimal architecture. For a fairer comparison, we also need to adopt more sophisticated classifiers such as gradient boosting trees.

In Table A1, the deeper the red, the higher the performance; the darker the blue, the lower the performance. Empirically, the performance was highest when stacking four layers and decreased as more layers were stacked beyond that; it varied from 54% to 75% depending on the parameter settings. In this paper, we therefore use the four-layer configuration that shows the best performance.

In the case of dropout, the performance decreases as its probability increases, and it gets worse as more layers are stacked. The experiments show that 0.25 is an optimal choice in terms of performance and efficiency. When we removed the dropout layers, we observed a performance degradation of 2% on average, because dropout prevents overfitting. However, as the dropout probability increases further, underfitting occurs and performance tends to decline.
We also experimented to set the optimal parameters of the pooling layer, which showed the most significant performance differences depending on the chosen values. The smaller the pooling size and the stride, the higher the performance; in particular, increasing the stride from 1 to 5 decreased performance by about 7%. Since the relationships between adjacent variables in our data are weak, a larger stride causes a greater loss of information, so setting the optimal pooling size and stride is essential to minimize it.

Table A2 shows the effect of various optimizers and activation functions on CNN training. The loss function was categorical cross-entropy for all optimizers, and the learning rates were set to 0.01 and 0.001. The choice of optimizer makes a performance difference of about 7%: RMSProp is the highest, while SGD is the lowest and did not learn well. The activation function also makes a difference of about 7%: both ReLU and LeakyReLU [45], which extract nonlinear relations, show higher performance than the other functions.