Peer-To-Peer Lending: Classification in the Loan Application Process

Abstract: This paper studies peer-to-peer lending and the loan application processing of LendingClub. We tried to reproduce the existing loan application processing algorithm and to find the features used in this process. Loan application processing is considered a binary classification problem. We used the area under the ROC curve (AUC) to evaluate algorithms. Features were transformed with splines to improve the performance of algorithms. We considered three classification algorithms: logistic regression, buffered AUC (bAUC) maximization, and AUC maximization. With only three features, Debt-to-Income Ratio, Employment Length, and Risk Score, we obtained an AUC close to 1. We performed both in-sample and out-of-sample evaluations. The codes for cross-validation and for solving problems in Portfolio Safeguard (PSG) format are in the Appendix. The calculation results with the data and codes are posted on the website and are available for downloading.


Introduction
Peer-to-peer lending, also known as person-to-person lending, social lending, or P2P lending, commonly abbreviated as P2PL, is usually the online practice of individuals lending money to other individuals without going through a traditional financial intermediary. Basically, it involves people with extra money (investors), people who need money (borrowers), and a platform (website) that facilitates P2PL (see LendingClub (2006)).
After a borrower receives the money, monthly payments are made back to lenders through the platform. Some fraction of these payments is taken by the P2PL platform for providing the service. The most important task of a platform is to distinguish loan applicants who will pay money back from those who will default. This activity is called loan application processing.
This paper studies the loan application process of a P2PL company's platform and has two main objectives.
Objective 1. The main objective of this paper is to help investors using P2PL companies to understand the loan approval/classification process. With a practical example, we study classification algorithms for selecting loans. The case study was done for the LendingClub platform, which is a leading global P2P lending company. LendingClub was selected because its historical loan data are publicly available. The comparison of the loan selection process of LendingClub with other companies is beyond the scope of this paper. While investors use ratings set by P2P lending companies for investment decisions, those companies do not provide detailed information on how ratings are set, so investors lack the information needed to interpret them. We want to reduce this gap and show with a case study that loan selection decisions are based on only a few simple features (factors).
Objective 2. We want to demonstrate new classification techniques using spline transformation of features in combination with logistic regression, maximization of area under the ROC curve, and maximization of Buffered AUC.
We considered the loan application process as a binary classification problem and used the AUC to evaluate algorithms. For conducting a case study, we used the open data of LendingClub and, in particular, features that were available for both approved and declined loans. Features were transformed with splines to improve classification algorithm performance. We conducted logistic regression with original and transformed features. Then, we maximized bAUC and AUC with transformed features. We applied the Portfolio Safeguard (PSG) 1 package for spline fitting and optimization. With only three features, Debt-to-Income Ratio, Employment Length, and Risk Score, we got an AUC close to 1.
Below are several of the main findings of the paper. A lot of information is collected and available for the evaluation of loans. However, most of this information is not used, and decisions are frequently based on only a few key features (in the considered case study, only three features: Debt-to-Income Ratio, Employment Length, and Risk Score). We also found that popular simple technologies, such as logistic regression, are used for classification decisions. In some cases, additional improvements over these basic technologies can be obtained by using advanced algorithms, such as spline transformation of features and direct maximization of AUC and bAUC. A potential benefit of this research for companies is that expensive departments selecting loans can be substituted by relatively cheap, commonly used technologies (e.g., logistic regression with some additional innovations such as spline transformation of features). For P2PL investors this is also an important finding because it provides information about classification decisions used in practice. One more insight is that a standard PC can handle quite large datasets (with hundreds of thousands of observations), and advanced numerical capabilities (such as parallel processing) are not needed for loan selection.
The remaining part of the paper is structured as follows. Section 2 gives background information about P2PL. Section 3 introduces the loan application process and performance metric. This section also provides mathematical problem statements. Section 4 presents a case study. Section 5 concludes the paper.

Peer-To-Peer Lending Companies
A P2PL company plays the same role in the P2PL market as the stock exchange in a stock market, and often has an online platform. ZOPA 2 , the first company in the world that offered P2P loans, was founded in Britain in 2005. The name ZOPA stands for "zone of possible agreement," a negotiating term identifying the bounds within which agreement can be reached between two parties (see Lai and Turban (2008)). Smith (1999) stated that the SEC issued its formal cease-and-desist letter, explaining that PROSPER is a seller of securities and should be regulated by the SEC.
LendingClub was launched at first as a Facebook application. Within a couple of months, it emerged as a standalone website 4 . LendingClub was the first P2PL company that registered its offerings as securities with the SEC and offered loan trading on a secondary market (run through a company called Foliofn 5 ). Currently, it is the world's largest P2PL platform.

How Does It Work?
When someone needs a loan, they submit an application to a P2PL platform and become a potential borrower. The application includes information about the loan and the borrower, such as the amount requested, employment status, and social security number.
The platform assesses the status of a potential borrower using the Fair Isaac Credit Organization (FICO) score, debt-to-income ratio (DTI), home ownership, employment status, and other information. The platform decides whether to approve or decline a loan and sets an interest rate based on this information. The decision process is called loan application processing.
Once a loan is approved, potential lenders have 14 days to review the loan information and make an investment decision. The loan is issued if it receives enough funding within this period. The borrower receives money and makes monthly payments until they pay off the loan. According to Lending Academy (2010), the lenders collect these payments minus a fee to the platform.

Loan Application Processing
The loan application process is a crucial procedure for a platform. This paper considers the loan application processing for LendingClub. The company provides public access to approved/declined loan data and statistics. Initially, the loan applications go through a credit screening procedure. The applications passing the initial screening are evaluated by LendingClub's proprietary scoring models. The scoring model provides each applicant with a score, which is combined with the FICO score and other features. LendingClub considers about 180 features to decide whether to approve or decline a loan. For more details, we refer to LendingClub (2006).

Related Works on Peer-To-Peer Lending
As a novel financial model, P2PL has been extensively studied in the past two decades. Hulme and Wright (2006) focused on online social lending and provided an in-depth exploration of social lending from multiple perspectives, while Wang et al. (2009) provided an overview of the concept of P2PL and discussed different P2PL marketplace models. Berger and Gleisner (2009) analyzed the role of a P2PL platform and found that market participants act as financial intermediaries and significantly improve borrowers' credit conditions by reducing information asymmetries. Lin (2009) investigated the role of "hard credit information" and "soft credit information" in the P2PL market. He found that loan applications with lower credit scores are less likely to be funded and more likely to default. Further, Iyer et al. (2009) found that a third of the variation in creditworthiness captured by the borrower's credit score can be inferred from available information. Puro et al. (2010) introduced a borrower decision-aid system that helps to formalize the decision-making process. Collier and Hampshire (2010) found that both the loan amount and the debt-to-income ratio of a borrower influence the final interest rate of a loan. Wu and Xu (2011) proposed a decision-support system providing individual risk assessment, eligible lender search, lending combination, and loan recommendation. Lin et al. (2013) studied friendship networks and information asymmetry in online P2PL and concluded that friendships increase the probability of successful funding, decrease interest rates, and are associated with lower ex post default rates. Chen et al. (2014) empirically tested data from PPDai and showed that trust in borrowers and trust in intermediaries are both significant factors influencing lenders' lending intention. Tsai et al. (2014) employed four machine-learning algorithms to classify and optimize peer lending risk and found that logistic regression outperformed LibSVM, Naïve Bayes, and random forest. Emekter et al. (2015) stated that higher interest rates charged to high-risk borrowers are not enough to compensate for the higher probability of default and claimed that, in order to sustain the business, LendingClub must attract borrowers with a high FICO score and high income. Ma et al. (2017) studied different pricing mechanisms in the peer-to-peer lending market, taking lenders' risk appetite into consideration. They included the borrower pricing mechanism (BPM), the auction pricing mechanism (APM), which was Prosper's pricing mechanism before 2010, and the platform pricing mechanism (PPM), which Prosper used after a regime change. They claimed that, as long as the loan is profitable, the BPM and PPM are incentive-compatible mechanisms, while the APM is not. Taking soft factors into consideration, Mi et al. (2018) established a model called SoFa to help investors estimate default risks. Jiang et al. (2018) studied loan defaults by incorporating soft information extracted from descriptive text in online P2PL. Ding et al. (2018) investigated the lending transactions on RRDAI 6 and found a reputation mechanism: borrowers with better historical performance have a higher probability of obtaining loans, and at lower cost. Yu et al. (2018) studied the underlying neural basis of herding behavior in online P2P lending at the decision-making stage and the feedback stage. By introducing event-related potentials (ERPs), they stated that the herding decision in P2PL is an evaluation of potential risk and that it is effective for P2P platforms to optimize disclosure interfaces.
4 LendingClub, https://www.lendingclub.com/.
5 Foliofn, https://www.folioinvesting.com/folioinvesting/home/.

Methods and Performance Metric
The loan application process is a binary classification procedure that classifies a given set of loan applications into two classes (approved and declined loans).
Let $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ be a set of $m$ labeled loans, where $x_i \in R^k$ is the vector of features for a loan application $i$, and $y_i \in \{0, 1\}$ is the binary label of the loan application, where $y_i$ equals 1 for approved and 0 for declined loans. Note that the dimension $k$ of $x_i$ is the number of features (also called credit attributes) provided by an applicant and a credit bureau. The loan application process can then be modeled as an estimation of a function $f : R^k \to \{0, 1\}$, which is called a binary classifier, by using the existing labeled loan data.

Logistic Regression-A Benchmark
Logistic regression is a popular binary classifier suggested by Cox in 1958, see Freedman (2009). It assumes that, given a feature vector $x \in R^k$, the probability that label $y = 1$ is given by
$$\Pr\{y = 1 \mid x\} = \frac{1}{1 + e^{-S(x)}},$$
where $S$ is a function on $R^k$, called a score function. A simple example of $S$ is a linear function, $S(x) = w^\top x + b$, where $w \in R^k$ and $b \in R$ are parameters. These parameters can be estimated by the maximum likelihood method (see, for instance, Habermann 1979 and Hosmer et al. 2013). The case study in Section 4 uses logistic regression as a popular benchmark method.
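As a minimal sketch of the benchmark, the snippet below evaluates the logistic probability for a linear score and the log-likelihood that maximum likelihood estimation would maximize. The function names and the two-loan toy data are hypothetical illustrations, not part of the PSG code used in the paper.

```python
import math

def approval_probability(x, w, b):
    """Probability that y = 1 under the linear score S(x) = w.x + b."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-score))

def log_likelihood(X, y, w, b):
    """Sum over loans of y*ln(p) + (1 - y)*ln(1 - p)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = approval_probability(xi, w, b)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

# Hypothetical toy data: one feature, one approved and one declined loan.
X = [[1.0], [-1.0]]
y = [1, 0]
print(approval_probability([0.0], [2.0], 0.0))  # 0.5
# A weight that separates the two loans has a higher likelihood than w = 0.
print(log_likelihood(X, y, [2.0], 0.0) > log_likelihood(X, y, [0.0], 0.0))  # True
```

In practice the parameters would be found by a numerical optimizer rather than by comparing two candidate weights by hand.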

Our Approach
In addition to a simple logistic regression, we considered the following two-step procedure:
1. Each feature is transformed for finding the nonlinear dependence of the likelihood of loan approval using a feature-wise spline regression.
2. Nonlinear score functions are estimated by applying logistic regression or maximizing the Buffered AUC or AUC with transformed features.

Transforming Features via Cubic Spline Regression
We used a spline transformation to capture the nonlinearity of every feature $x_j$. We suppose that, for every feature $x_j$, an interval $[a, b]$ and a partition $a = t_0 < t_1 < \cdots < t_n = b$ are given. A spline is a piecewise function
$$s(x) = P_i(x), \quad x \in [t_{i-1}, t_i],$$
where $P_i$ is a nonlinear function on $[t_{i-1}, t_i]$, $i = 1, 2, \ldots, n$. Points $t_i$, $i = 0, \ldots, n$, are called knots and are specified by splitting the interval $[a, b]$ into subintervals containing (approximately) an equal number of observations (see Ahlberg et al. 2016 for a typical definition of a spline). In this case study, we considered cubic splines, where $P_i$ are cubic polynomials. Cubic splines are commonly used in practice. The fitting procedure can be reduced to convex programming and easily implemented. For every feature, we used the PSG package to maximize the likelihood function for a cubic spline of the data. PSG is a general-purpose nonlinear optimization package for solving optimization and statistical problems. Nonlinear transformations, such as ln(.), exp(.), or polynomial, are commonly used for transforming features. However, with the splines implemented in PSG, it is possible to perform an optimal transformation rather than using a trial-and-error approach. A mathematical description of a one-dimensional spline used as an argument of the logistic regression likelihood function is given here 7 . The mathematical programming problem for finding optimal parameters of the spline is a convex nonlinear programming problem that can be solved very efficiently with standard nonlinear optimization algorithms. The PSG code for finding splines is in Appendix A.
Each feature vector $x \in R^k$ is transformed via $k$ separate logistic regression problems, where the $j$-th logistic regression problem only utilizes the $j$-th feature $x_j$ to perform a univariate prediction. After estimating splines, say $s_j$, for every feature $x_j$, $j = 1, \ldots, k$, we employ the transformed features $s_j(x_j)$ as the new features of the score function, $S(x) = \sum_{j=1}^{k} w_j s_j(x_j) + b$. This transformation does not affect the complexity of the subsequent optimization problem, such as the maximization of Loglikelihood (1) with logistic regression.
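The knot placement and the piecewise-cubic structure can be sketched as follows. This is an illustration under stated assumptions: it uses the textbook truncated-power basis (continuous value, first, and second derivative at the knots), which is not necessarily the parameterization used inside PSG, and the helper names and the feature values are hypothetical. With five pieces the basis has eight functions, so a spline written in this basis has eight free coefficients.

```python
def quantile_knots(values, n_pieces):
    """Interior knots that split the sorted sample into n_pieces groups
    containing (approximately) the same number of observations."""
    ordered = sorted(values)
    m = len(ordered)
    return [ordered[(i * m) // n_pieces] for i in range(1, n_pieces)]

def cubic_basis(x, knots):
    """Truncated-power basis of a cubic spline evaluated at x:
    1, x, x^2, x^3 and (x - t)^3_+ for every interior knot t."""
    return [1.0, x, x ** 2, x ** 3] + [max(x - t, 0.0) ** 3 for t in knots]

def spline_value(x, knots, coefficients):
    """s(x) as a linear combination of the basis functions."""
    return sum(c * phi for c, phi in zip(coefficients, cubic_basis(x, knots)))

dti = [float(v) for v in range(100)]    # hypothetical feature values 0..99
knots = quantile_knots(dti, 5)
print(knots)                            # [20.0, 40.0, 60.0, 80.0]
print(len(cubic_basis(12.5, knots)))    # 8
```

The coefficients passed to `spline_value` would come from a univariate logistic regression fit on the basis columns, one fit per feature.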

AUC and Optimization
AUC is a popular criterion in classification. AUC performs quite well for datasets with unbalanced class sizes. Accuracy, in contrast, can be misleading for such datasets: it is easy to achieve 99% accuracy on a dataset where 99% of objects are in the same class.
When a score function of a classifier exceeds a threshold, it is considered that the label equals 1; otherwise, the label equals 0. The receiver operating characteristic (ROC) curve is useful for visualizing and evaluating classifiers.The ROC curve is a two-dimensional plot of classifier performance.It is obtained by plotting the true positive rate (TPR) vs. the false positive rate (FPR) for every possible classification threshold.
This section directly maximizes the AUC performance metric. AUC is considered as the objective function for finding a classifier. AUC, by definition, is the area under the ROC curve, see Hanley and McNeil (1982); Bradley (1997); and Fawcett (2006). AUC values range from 0 to 1, since it is a portion of the unit square. A reasonable classifier should have an AUC greater than 0.5, because random guessing generates the diagonal line between (0, 0) and (1, 1) and gives AUC = 0.5.
AUC has an important statistical property: it is equivalent to the Wilcoxon test of ranks, see Hanley and McNeil (1982). The AUC of a classifier is the probability that the rank for a randomly chosen positive instance is higher than the rank for a randomly chosen negative instance.
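The construction of the ROC curve and the contrast between accuracy and AUC on unbalanced data can be sketched as below. The sample of one approved and 99 declined loans is hypothetical, and the threshold rule uses "score at least the threshold" as an assumption for handling ties.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the threshold over all scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical unbalanced sample: 1 approved loan, 99 declined loans.
labels = [1] + [0] * 99
constant_scores = [0.0] * 100           # scores that carry no information
accuracy = sum(1 for y in labels if y == 0) / len(labels)  # "decline everyone"
print(accuracy)                                             # 0.99
print(auc_trapezoid(roc_points(constant_scores, labels)))   # 0.5
informative_scores = [1.0] + [0.0] * 99
print(auc_trapezoid(roc_points(informative_scores, labels)))  # 1.0
```

The uninformative scores reach 99% accuracy by declining every application, yet their ROC curve is the diagonal with AUC 0.5.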
Let us denote the score function by $S(x) = w^\top s(x) + b$, where $s(x) := (s_1(x_1), \ldots, s_k(x_k))^\top$. Note that the case $s_j(x_j) = x_j$ for all $j$ corresponds to the case where no spline transformation is applied. We suppose that samples with label 1 have higher scores. By ordering scores $S(x_1), \ldots, S(x_m)$ and setting a threshold, we get a binary classifier. Further, we provide a probabilistic definition of AUC, see (5). Let $L$ be a loss function vector defined by:
$$L(w) := \big(L_{ij}(w)\big)_{x_i \in I_1,\ x_j \in I_0}, \qquad L_{ij}(w) := -\big(S(x_i) - S(x_j)\big) = -w^\top \big(s(x_i) - s(x_j)\big), \tag{2}$$
where
$$I_1 := \{x_i : y_i = 1\}, \quad I_0 := \{x_j : y_j = 0\}, \quad m_1 := \sum_{i=1}^{m} 1_{\{y_i = 1\}}, \quad m_0 := \sum_{i=1}^{m} 1_{\{y_i = 0\}},$$
where $1_C$ is the indicator function of $C$. We assume that each $x_i \in I_1$ has probability $1/m_1$ and each $x_j \in I_0$ has probability $1/m_0$. With the cumulative distribution function (CDF) of a real-valued random variable $X$, $F(x) := \Pr\{X \le x\}$, the probability of exceedance (POE) (see Hoblit 1988) is defined as
$$p_z(X) := \Pr\{X > z\} = 1 - F(z).$$
The AUC of the classifier can then be expressed as
$$\mathrm{AUC}(w) = 1 - p_0(L(w)). \tag{5}$$
We want to maximize the AUC of the classifier, i.e., solve the following problem:
$$\max_{w \in R^k} \ \mathrm{AUC}(w). \tag{6}$$
The PSG package has a precoded probability of exceedance function that can be used for optimization of large datasets (millions of observations) with a standard PC. Since AUC according to Equation (5) can be expressed through the probability of exceedance, we can directly maximize AUC using PSG without writing a special optimization program. The PSG subroutine for minimization of probability is similar to the subroutine for minimizing quantiles (Value-at-Risk in finance) described in Larsen et al. (2002). Note that the probability of exceedance is an inverse function of the Value-at-Risk.
It can be verified that $\mathrm{AUC}(\lambda w) = \mathrm{AUC}(w)$ for any $\lambda > 0$, i.e., the function $\mathrm{AUC}(w)$ is positive homogeneous (of degree zero). Therefore, if $w$ is an optimal solution vector of Problem (6), then $\lambda w$ is also an optimal solution vector of this problem. We observed that this leads to instability of the PSG optimization algorithms when the optimal point tends to zero.
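The probabilistic view of AUC and its invariance to positive rescaling of the weights can be checked numerically. The sketch below assumes that the loss of a pair is the negated score difference between an approved and a declined loan (the intercept cancels in the difference), and the transformed feature vectors are hypothetical.

```python
def pairwise_losses(w, approved, declined):
    """Loss -(w.x_i - w.x_j) for every approved loan i and declined loan j."""
    def score(x):
        return sum(wj * xj for wj, xj in zip(w, x))
    return [-(score(xi) - score(xj)) for xi in approved for xj in declined]

def auc(w, approved, declined):
    """AUC as one minus the probability that the pairwise loss exceeds zero."""
    losses = pairwise_losses(w, approved, declined)
    poe_at_zero = sum(1 for loss in losses if loss > 0) / len(losses)
    return 1.0 - poe_at_zero

approved = [[1.0, 0.0], [3.0, 1.0]]   # hypothetical transformed features
declined = [[0.0, 0.0], [2.0, 2.0]]
w = [1.0, 0.0]
print(auc(w, approved, declined))                      # 0.75
print(auc([10.0 * v for v in w], approved, declined))  # 0.75, scaling by 10
print(auc([-v for v in w], approved, declined))        # 0.25, sign flipped
```

Only the signs of the pairwise losses matter, which is why any positive multiple of `w` gives the same value while a negative multiple does not.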
To make sure that the optimal point is not close to zero, Problem (6) can be equivalently reformulated as follows:
$$\max_{w} \ \mathrm{AUC}(w) \quad \text{subject to} \quad \|w\| = 1, \tag{7}$$
where $\|w\| := \sqrt{w^\top w}$ denotes the Euclidean norm of a vector $w$. The constraint $\|w\| = 1$ in Problem (7) is nonconvex.
Further, we formulate a problem equivalent to Problem (6), but with a linear constraint (which is a convex constraint). Let us assume that some vector $w^0$ is known such that $(w^0)^\top w^* > 0$ for some optimal vector $w^*$ of Problem (6). Such a vector $w^0$ can be a solution of a proxy for Problem (6). For instance, logistic regression can be considered a good proxy problem. Let us consider the following AUC maximization problem with a linear constraint:
$$\max_{w} \ \mathrm{AUC}(w) \quad \text{subject to} \quad (w^0)^\top w = 1. \tag{8}$$
The constraint in Problem (8) is imposed to make sure that an optimal solution vector is not equal to 0. Further, we formulate a theorem about the relation of Problems (6) and (8).
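Theorem 1. Let $w^0$ be a vector such that $(w^0)^\top w^* > 0$ for some optimal solution $w^*$ of Problem (6), and let $\lambda^* := 1 / \big((w^0)^\top w^*\big)$. (1) Then $\lambda^* w^*$ is an optimal solution of Problem (8). (2) In addition, if $\hat{w}$ is an optimal solution of Problem (8), then $\hat{w}$ is an optimal solution of Problem (6).
Proof. Let us prove Statement (1) of the theorem. AUC is a positive homogeneous function; therefore, $\mathrm{AUC}(\lambda w) = \mathrm{AUC}(w)$ for $\lambda > 0$. Consequently, $\mathrm{AUC}(\lambda^* w^*) = \mathrm{AUC}(w^*)$ and $\lambda^* w^*$ is an optimal point of Problem (6). Point $\lambda^* w^*$ is a feasible point of the constraint in Problem (8). Indeed,
$$(w^0)^\top (\lambda^* w^*) = \frac{(w^0)^\top w^*}{(w^0)^\top w^*} = 1.$$
Since point $\lambda^* w^*$ is feasible for Problem (8) with the constraint and optimal for Problem (6) without the constraint, it is also optimal for Problem (8) with the constraint. Statement (1) is proved. Let us prove Statement (2) of the theorem. Since $\hat{w}$ and $\lambda^* w^*$ are optimal solutions of Problem (8), $\mathrm{AUC}(\hat{w}) = \mathrm{AUC}(\lambda^* w^*)$. Moreover, $\mathrm{AUC}(\lambda^* w^*) = \mathrm{AUC}(w^*)$; therefore, $\hat{w}$ is also an optimal solution of Problem (6).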
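The rescaling step used in the proof can be illustrated with a few lines of code. This is a sketch with hypothetical vectors: `w_star` stands for a maximizer of the unconstrained problem and `w0` for a proxy solution (for instance, from logistic regression) whose inner product with `w_star` is positive.

```python
def rescale_to_constraint(w_star, w0):
    """Return lambda * w_star with lambda = 1 / (w0 . w_star), so that the
    result satisfies the linear constraint w0 . w = 1."""
    inner = sum(a * b for a, b in zip(w0, w_star))
    if inner <= 0:
        raise ValueError("w0 must have a positive inner product with w_star")
    return [v / inner for v in w_star]

w_star = [3.0, 1.0]   # hypothetical unconstrained maximizer
w0 = [1.0, 1.0]       # hypothetical proxy solution
w = rescale_to_constraint(w_star, w0)
print(w)                                    # [0.75, 0.25]
print(sum(a * b for a, b in zip(w0, w)))    # 1.0
```

Because the multiplier is positive, the rescaled vector ranks every pair of loans exactly as `w_star` does, so its AUC is unchanged.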
Since AUC equals one minus a probability of exceedance, as shown in Equation (5), the AUC maximization Problem (8) can be converted to the following probability minimization problem:
$$\min_{w} \ p_0(L(w)) \quad \text{subject to} \quad (w^0)^\top w = 1, \tag{9}$$
where $p_0(L(w)) = \Pr\{L(w) > 0\}$ and $L$ is defined in Equation (2). AUC is discontinuous and nonconvex, which makes it difficult to solve Problem (9) to global optimality. However, PSG has a quite efficient algorithm for minimizing probability with convex constraints. The algorithm used in PSG is similar to the optimization of Value-at-Risk, as described in Larsen et al. (2002). PSG code for solving Problem (9) is included in Appendix A. The main motivation for including Theorem 1 in this section is to explain the reduction of Problem (7) to (9).

bAUC and Optimization
Several tractable approximation methods have been developed for AUC maximization (see Doucette and Heywood 2008; Miura et al. 2010; and Aiolli 2014). In these methods, AUC was approximated by some surrogate function, and this function was maximized. Norton and Uryasev (2016) suggested a new approach to deal with this complicated problem. They defined Buffered AUC (bAUC) and reduced maximization of bAUC to convex and linear programming problems. The bAUC is the best quasi-concave lower bound of AUC. The bAUC characteristic can be maximized efficiently with convex programming. PSG has a precoded analytical function for bAUC.
The bAUC concept is based on the so-called Buffered Probability of Exceedance (bPOE), defined in Norton and Uryasev (2016), and also in Mafusalov and Uryasev (2018).For references to several papers using bPOE concept in various areas, see Davis and Uryasev (2016); Norton et al. (2017); and Shang et al. (2018).Further, based on one-dimensional minimization representation of the bPOE, Mafusalov et al. (2018) studied statistical properties of empirical estimates of the bPOE.To explain bPOE and bAUC, we give below formal definitions of Value-at-Risk (VaR), Conditional Value-at-Risk (CVaR), and bPOE.
For $\tau \in (0, 1)$, the $\tau$th quantile $q_\tau(X)$ of $X$ is defined by
$$q_\tau(X) := \inf\{x : F(x) \ge \tau\}.$$
In finance, the quantile is known as VaR (see, e.g., Artzner et al. 2002 and Einhorn and Brown 2008). Rockafellar and Uryasev (2000) considered Conditional VaR (CVaR) as an alternative measure of risk which has better mathematical properties compared to VaR (see, e.g., Artzner et al. 2002 and Rockafellar and Royset 2018). CVaR is also called the $\tau$-superquantile for general use outside the financial context. CVaR is defined as follows:
$$\bar{q}_\tau(X) := \min_{c} \Big\{ c + \frac{1}{1 - \tau} E[X - c]^+ \Big\},$$
where $[\cdot]^+ = \max\{\cdot, 0\}$. For a continuously distributed random variable $X$, the CVaR ($\tau$-superquantile) equals a conditional expectation of the tail exceeding the quantile, namely,
$$\bar{q}_\tau(X) = E[X \mid X > q_\tau(X)].$$
There are two slightly different variants of bPOE: Upper and Lower bPOEs. In this paper we use Upper bPOE, which is defined as
$$\bar{p}_z(X) := \min_{a \ge 0} E[a(X - z) + 1]^+. \tag{10}$$
Formula (10) is considered in Norton and Uryasev (2016) and Mafusalov and Uryasev (2018) as a property of bPOE, but it is convenient to use it as a definition. Further on, Upper bPOE will be called bPOE (without mentioning that it is Upper bPOE). It has been proved in Mafusalov and Uryasev (2018) that bPOE equals $1 - \tau$ on the interval $E[X] < z < \sup X$, where $\tau$ is an inverse function of CVaR, i.e., a unique solution of the equation
$$\bar{q}_\tau(X) = z,$$
where $\sup X$ is the essential supremum 8 of the random variable $X$. Therefore, bPOE equals the probability, $1 - \tau$, of the tail such that CVaR for this tail is equal to $z$. Formula (10) looks quite unusual; the expression (10) does not immediately come across as a probability of some event.
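The relation between CVaR and bPOE can be checked on a small empirical distribution. The sketch below assumes the standard minimization formulas for CVaR (minimum over $c$ of $c + E[X - c]^+ / (1 - \tau)$) and for Upper bPOE (minimum over $a \ge 0$ of $E[a(X - z) + 1]^+$); both objectives are piecewise linear for a finite sample, so it is enough to evaluate them at their kinks. The four-point sample is hypothetical.

```python
def cvar(sample, tau):
    """CVaR of an equally weighted sample; the minimum over c is attained
    at one of the sample points."""
    m = len(sample)
    return min(c + sum(max(x - c, 0.0) for x in sample) / (m * (1.0 - tau))
               for c in sample)

def bpoe(sample, z):
    """Upper bPOE of an equally weighted sample; the objective is piecewise
    linear in a, so only a = 0 and the kinks a = 1 / (z - x) are checked."""
    m = len(sample)
    candidates = [0.0] + [1.0 / (z - x) for x in sample if x < z]
    return min(sum(max(a * (x - z) + 1.0, 0.0) for x in sample) / m
               for a in candidates)

sample = [1.0, 2.0, 3.0, 4.0]                  # hypothetical outcomes
poe = sum(1 for x in sample if x > 3.5) / len(sample)
print(cvar(sample, 0.5))    # 3.5, the average of the worst half
print(bpoe(sample, 3.5))    # approximately 0.5, i.e., 1 - tau with tau = 0.5
print(poe)                  # 0.25, never larger than bPOE
print(bpoe(sample, 2.0))    # 1.0, since the threshold is below the mean 2.5
```

The threshold 3.5 is exceeded by one outcome in four, while bPOE counts the worst half of the sample, whose average equals 3.5.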
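The maximization of bAUC can be formulated as:
$$\max_{w} \ \mathrm{bAUC}(w), \qquad \mathrm{bAUC}(w) := 1 - \bar{p}_0(L(w)). \tag{11}$$
Norton and Uryasev (2016) reduced the bAUC maximization (11) to a convex optimization problem:
$$\min_{w} \ E[L(w) + 1]^+ = \min_{w} \ \frac{1}{m_1 m_0} \sum_{i} \sum_{j} \big[L_{ij}(w) + 1\big]^+, \tag{12}$$
and gave a linear reformulation:
$$\min_{w,\, z} \ \frac{1}{m_1 m_0} \sum_{i} \sum_{j} z_{ij} \quad \text{subject to} \quad z_{ij} \ge L_{ij}(w) + 1, \quad z_{ij} \ge 0,$$
where $L_{ij}(w)$ denotes the component of the loss vector $L(w)$ for the pair of an approved loan $i$ and a declined loan $j$, and $m_1 m_0$ is the number of such pairs. In contrast to the AUC maximization (6) through (9), which is a nonconvex optimization problem, the bAUC maximization (12) (with a linear constraint of the form as in Problems (8) or (9)) can be reduced to a linear program. PSG code for maximization of bAUC using bPOE is included in Appendix A. The PSG code is based on convex programming and directly utilizes representation (12).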
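A toy version of the convex reduction can be worked out for a single feature. The sketch assumes the hinge-type surrogate, the mean of $[L_{ij}(w) + 1]^+$ over approved/declined pairs with $L_{ij}(w) = -w\,(x_i - x_j)$, as the form of the convex problem; the data and names are hypothetical, and the one-dimensional minimum is found by checking the kinks of the piecewise-linear objective instead of calling an LP solver.

```python
def hinge_objective(w, diffs):
    """Mean of [1 - w * d]^+ over all pairs, where d is the feature value of
    an approved loan minus the feature value of a declined loan."""
    return sum(max(1.0 - w * d, 0.0) for d in diffs) / len(diffs)

approved, declined = [1.0, 3.0], [0.0, 2.0]   # hypothetical one-feature data
diffs = [p - n for p in approved for n in declined]

# The objective is convex and piecewise linear in w, so its minimum is
# attained at w = 0 or at one of the kinks w = 1 / d.
candidates = [0.0] + [1.0 / d for d in diffs if d != 0]
best_w = min(candidates, key=lambda w: hinge_objective(w, diffs))
bauc = 1.0 - hinge_objective(best_w, diffs)
auc = sum(1 for d in diffs if d > 0) / len(diffs)
print(best_w, bauc, auc)   # 1.0 0.5 0.75
```

On this sample the maximal bAUC is 0.5, below the AUC of 0.75, in line with bAUC being a lower bound of AUC.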

Case Study
This section reports a case study, examining the fitting of a simple regression model-based framework to a P2PL application process by applying the methods described above. 9

Data Preparation
LendingClub has provided open access to sets of loan data since 2007, when the company started operation. The company updates the status of the loans currently listed in downloadable data on a monthly basis and adds new loan data quarterly. The complete loan data are posted for all issued loans. Declined loan data contain the list and details of loan applications that did not meet LendingClub's credit underwriting policy.
The number of reported features changes over time. LendingClub cut open data by 50% in November 2014. At that time, the company removed half of the features of borrowers, as well as all the data of loans with "Policy Code" equal to 2 (new products that are not publicly available). That part accounts for 25% of LendingClub's total issuances.
We considered data for three calendar years, 2012, 2013, and 2014.The loan data contain fifty-one features, while the sets of declined loan data have only nine features, as shown below.

• Amount Requested: The total amount requested by the borrower.
• Application Date: The date when the borrower applied for the loan.
• Notes Offered by Prospectus/Loan Title 10 : The loan title or purpose description provided by the borrower.
• Risk Score: For applications prior to November 5, 2013, the risk score is the borrower's FICO score. For applications after November 5, 2013, the risk score is the borrower's Vantage score.
• Debt-To-Income Ratio (DTI): A ratio calculated using the borrower's total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower's self-reported monthly income.
• Zip Code: The first three digits of the zip code provided by the borrower in the loan application.
• State: The state provided by the borrower in the loan application.
• Employment Length: Employment length in years. Values are between 0 and 10, where 0 means "less than one year" and 10 means "ten or more years".
• Policy Code: Policy codes 1 and 2 correspond to publicly available products and to new products that are not publicly available, respectively.
8 The essential supremum of the random value X is the smallest number a such that the probability of the set {X > a} equals zero.
9 The case study presented in this section (data, codes, and calculation results) is posted at this link http://www.ise.ufl.edu/uryasev/research/testproblems/financial_engineering/%20classification-in-loan-application-process%20/.
In order to calibrate classification methods, we needed values of features and approved/declined labels for a training set of loan applications. We used only the nine features listed above and ignored the descriptive information in "Notes Offered by Prospectus" and in "Loan Title". We did not use the feature Policy Code, since it equals 0 for every declined loan and 1 for every publicly available loan. Moreover, the information for loans with Policy Code = 2 has not been available since November 14, 2014. To make the process consistent, we did not consider the Application Date, State, and Zip Code.
Therefore, we considered only four features: Debt-To-Income Ratio, Amount Requested, Risk Score, and Employment Length.
There are a few loan applications with very high values of Debt-To-Income Ratio or Amount Requested, which we regarded as outliers. Following the definition of Tukey (1977), an outlier is a value outside the interval
$$[\,Q_1 - 1.5\,(Q_3 - Q_1),\ Q_3 + 1.5\,(Q_3 - Q_1)\,],$$
where $Q_1$ and $Q_3$ are the first and the third quartiles, respectively. We projected outliers to the nearest boundary.
There are many loans with N/A (not available) values for some features.We considered only loans with complete data that do not contain N/A values.
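The two cleaning steps above can be sketched as follows. The records are hypothetical, `None` stands for an N/A value, quartiles are computed by linear interpolation between order statistics (one of several common conventions, chosen here as an assumption), and incomplete records are dropped before the fences are computed.

```python
def quartile(sorted_values, q):
    """Quantile by linear interpolation between order statistics."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])

def clip_outliers(values):
    """Project values outside Tukey's fences onto the nearest fence."""
    ordered = sorted(values)
    q1, q3 = quartile(ordered, 0.25), quartile(ordered, 0.75)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [min(max(v, low), high) for v in values]

# Hypothetical (DTI, risk score) records; None marks a missing value.
records = [(1.0, 700), (2.0, 710), (3.0, None), (3.0, 690), (4.0, 720),
           (5.0, 680), (6.0, 705), (7.0, 715), (8.0, 695), (100.0, 650)]
complete = [r for r in records if None not in r]
dti = clip_outliers([r[0] for r in complete])
print(len(complete))   # 9
print(dti[-1])         # 13.0, the value 100.0 projected onto the upper fence
```

Here the quartiles of the nine complete DTI values are 3 and 7, so the fences are -3 and 13.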

Numerical Results
We used PSG for conducting numerical experiments. Classification algorithms were evaluated with the AUC criterion.

Spline Transformation of Features
For every feature, we performed a nonlinear spline transformation. Optimal parameters of splines are found by maximizing the logistic regression likelihood of observed labels. Maximization was done with the PSG package (see code in PSG Text format in Appendix A; the code is also available as a MATLAB or R subroutine, see link in Appendix A). We considered five cubic pieces with continuous first, second, and third derivatives at spline nodes. Every spline piece contains about the same number of observations (this is the rule for setting nodes of the spline). Every spline has only eight degrees of freedom ((five pieces) * (four parameters in a cubic polynomial) − (four nodes) * (three constraints) = 5 × 4 − 4 × 3 = 20 − 12 = 8). The dataset contains 380,465 observations in 2012. With so many observations and only eight degrees of freedom per spline, there is no overfitting with the spline optimization procedure.
Figure 1 shows the spline-transformed DTI feature in 2012. The spline shows the likelihood of approval as a function of the DTI feature. We see that loan applications with DTI in the 5-30 range have a relatively high approval likelihood. The graph provides valuable information for the decision maker about the dependence of the approval rate on the feature value.
Figure 2 shows the spline-transformed Employment Length feature in 2012. The loan applications with Employment Length equal to 0 have a fairly low chance to be approved, as shown by the circled point in Figure 2. The applications with Employment Length greater than 0 have similar chances to be approved.
The spline-transformed Risk Score feature is shown in Figure 3. Loan applications with a Risk Score around 770 are likely to be approved. It is interesting to observe that applications with a Risk Score higher than 770 have a decreasing chance to be approved. This is a counterintuitive fact. A possible explanation is that people with a very high Risk Score may be overloaded with loans and may have a high DTI (which results in a higher chance of declining the loan).
We ranked features with the MATLAB 8.1.0.604 (R2013a) subroutine "Rankfeatures" with ROC as the criterion. See MATLAB documentation for the description of the "Rankfeatures" subroutine 11 . For all years (2012, 2013 and 2014), we get the same features ranking, shown in Table 1.
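An ROC-based feature ranking can be illustrated with a simplified stand-in. This sketch is not the MATLAB subroutine: as an assumption, it scores each feature by how far its single-feature AUC (the probability that an approved loan has a larger feature value than a declined loan, with ties counted as one half) is from 0.5. The column values and labels are hypothetical.

```python
def single_feature_auc(values, labels):
    """Probability that an approved loan has a larger feature value than a
    declined loan (ties count one half)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_features(columns, labels):
    """Feature names sorted by |AUC - 0.5|, most informative first."""
    strength = {name: abs(single_feature_auc(col, labels) - 0.5)
                for name, col in columns.items()}
    return sorted(strength, key=strength.get, reverse=True)

labels = [1, 1, 0, 0]                       # hypothetical approvals
columns = {
    "amount": [5.0, 20.0, 10.0, 15.0],      # uninformative on this sample
    "dti": [10.0, 12.0, 30.0, 35.0],        # low DTI goes with approval
}
print(single_feature_auc(columns["dti"], labels))   # 0.0
print(rank_features(columns, labels))               # ['dti', 'amount']
```

A feature with AUC near 0 is as informative as one with AUC near 1, since reversing its sign reverses the ranking, which is why the distance from 0.5 is used.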

In-Sample Evaluations
This section contains in-sample classification results. First, we conducted the logistic regression with the original and spline-transformed features. Data, codes, and calculation results for one instance of logistic regression are in PROBLEM 0 (with original features) and PROBLEM 2 (with spline-transformed features) 12 . Then, we maximized bAUC with the spline-transformed features, as in (11), see PROBLEM 3 11 . Finally, we maximized AUC with the spline-transformed features, where the initial point is set to the solution of the bAUC maximization problem, see PROBLEM 4 11 . PSG codes for Logistic Regression, bAUC maximization, and AUC maximization are in Appendix A.

Model with One Feature: Debt-To-Income Ratio
According to the features ranking, we first used only DTI to perform classification. The resulting AUC values are in Table 2.
The last column shows that with only one feature, we got an AUC over 0.72. Further, we included the Employment Length, which is the second feature in the ranking in Table 1.

Model with Two Features: Debt-To-Income Ratio and Employment Length
With two spline transformed features, DTI and Employment Length, we got a fairly high AUC exceeding 0.93 with all considered approaches (see columns 3, 4, and 5 in Table 3).Model with 3 Features: Debt-to-Income Ratio, Employment Length, and Risk Score We got even higher AUC exceeding 0.95 with three spline transformed features: DTI, Employment Length, and Risk Score (see columns 3, 4, and 5 in Table 4).This table shows that the standard logistic regression (without features transformation) provides quite a high AUC, exceeding 0.93 (see column 2).Further, this result was improved by the spline transformation of features.For 2014, the AUC of logistic regression was improved from 0.937744 to 0.982057.We conducted bAUC maximization and AUC maximization with one, two, and three spline-transformed features (see columns 4 and 5 in Tables 2-4).However, the improvement was insignificant compared to the logistic regression.
We note an inconsistency in this procedure: the feature transformation was fitted with logistic regression, and the same transformed features were then used for maximizing bAUC and AUC. However, since we obtained a near-perfect classification (AUC close to 1), fitting the transformation under the bAUC and AUC criteria would not bring additional benefits. Moreover, transforming features under the bAUC and AUC criteria is a much more complicated problem than transforming them with logistic regression.
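To illustrate why a nonlinear transformation of a feature can raise the AUC of logistic regression, here is a minimal sketch under stated assumptions. It uses synthetic data in which the positive class occupies the middle of the feature range, and it swaps the cubic splines of this paper for simpler degree-1 ("hat") spline basis functions fitted by plain gradient descent. All names (`hat_basis`, `fit_logistic`, `auc`) and the data are hypothetical; this is not the PSG implementation.

```python
import math


def hat_basis(x, knots):
    """Degree-1 B-spline ("hat") features of a scalar x on equally spaced knots."""
    step = knots[1] - knots[0]
    return [max(0.0, 1.0 - abs(x - t) / step) for t in knots]


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def fit_logistic(X, y, steps=300, lr=1.0):
    """Gradient descent on the mean logistic loss log(1 + exp(-y * w.x))."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            # Derivative of the loss with respect to the linear score w.x
            coef = -yi / (1.0 + math.exp(yi * dot(w, xi)))
            for j, xj in enumerate(xi):
                grad[j] += coef * xj / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def auc(scores, labels):
    """Empirical AUC over all (+1, -1) pairs; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Synthetic data (an assumption for illustration): label +1 only for mid-range x.
n = 200
xs = [(i + 0.5) / n for i in range(n)]
ys = [1 if 0.25 < x < 0.75 else -1 for x in xs]
knots = [0.0, 0.25, 0.5, 0.75, 1.0]

X_raw = [[1.0, x] for x in xs]              # intercept + original feature
X_hat = [hat_basis(x, knots) for x in xs]   # transformed feature (hats sum to 1)

w_raw = fit_logistic(X_raw, ys)
w_hat = fit_logistic(X_hat, ys)
auc_raw = auc([dot(w_raw, x) for x in X_raw], ys)
auc_hat = auc([dot(w_hat, x) for x in X_hat], ys)
print(auc_raw, auc_hat)
```

With the original feature, the fitted score is a monotone function of x, so by the symmetry of this toy dataset its AUC stays near 0.5. The piecewise-linear score can rise and fall, which lets it separate the two classes. The real data need not be this extreme, so the sketch shows the mechanism, not the magnitude of the improvement reported in Tables 2-4.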

Out-of-Sample Evaluations
We also performed out-of-sample validations. We considered the model with three features: DTI, Employment Length, and Risk Score. We performed fourfold cross-validation for 2012. The code for generating the fourfold cross-validation data and solving the four logistic regression problems in PSG text format is in Appendix A (see also "PROBLEM 4" 13).
The in-sample dataset contains over 380,000 observations. Because the dataset is so large, there was very little difference between the in-sample and out-of-sample results (the in-sample and out-of-sample AUC coincided to four-digit precision).
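The generation of fourfold cross-validation data can be sketched as follows. This is an illustrative Python version, not the PSG code from Appendix A; the function name `k_fold_indices`, the seed, and the tiny sample size are assumptions made for the example.

```python
import random


def k_fold_indices(n_obs, k=4, seed=0):
    """Shuffle 0..n_obs-1 and return k (train, test) index splits.

    Every observation lands in exactly one test fold.
    """
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]   # k nearly equal parts
    splits = []
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        splits.append((train, test))
    return splits


# Tiny example: 10 observations instead of the full dataset.
splits = k_fold_indices(10, k=4)
for train, test in splits:
    print(len(train), len(test))
```

In each of the four runs, the model would be fitted on the `train` indices and its AUC computed on the held-out `test` indices.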
A more meaningful cross-validation is to use an optimal solution from previous years for classification in forthcoming years. We took an optimal solution ω obtained by AUC maximization for 2012 and calculated the AUC for 2013. This AUC for 2013 equaled 0.959837, not far from the in-sample AUC obtained by AUC maximization for 2013, which is 0.960204 in column 5, Table 4. We repeated a similar procedure for 2014. We took an optimal solution ω obtained by AUC maximization for 2013 and calculated the AUC for 2014. The out-of-sample AUC for 2014 was 0.981841, which is very close to the in-sample AUC obtained by AUC maximization, which is 0.982097 in column 5, Table 4. This out-of-sample verification shows that the same model was used for several years without significant modifications.
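The year-to-year check described above amounts to scoring the next year's observations with a fixed weight vector ω and computing the AUC of those scores. The sketch below is a hedged illustration: the names `rank_auc` and `forward_auc`, the weights, and the data are hypothetical. It uses the rank-sum (Mann-Whitney) identity so that the computation scales as n log n, which matters for datasets with several hundred thousand observations.

```python
def rank_auc(scores, labels):
    """AUC via the rank-sum identity; tied scores receive average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0            # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    m1 = sum(1 for y in labels if y == 1)    # number of +1 labels
    m0 = len(labels) - m1                    # number of -1 labels
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - m1 * (m1 + 1) / 2.0) / (m0 * m1)


def forward_auc(w, X_next, y_next):
    """Score next-period observations with fixed weights w and return the AUC."""
    scores = [sum(wj * xj for wj, xj in zip(w, x)) for x in X_next]
    return rank_auc(scores, y_next)


# Hypothetical weights "from the previous year" and toy "next year" data.
w_prev = [1.0, -2.0]
X_next = [[3.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 1.0]]   # scores: 1, -1, 2, -2
y_next = [-1, 1, 1, -1]
auc_next = forward_auc(w_prev, X_next, y_next)
print(auc_next)  # 3 of the 4 (positive, negative) pairs are ordered correctly: 0.75
```

Comparing this out-of-sample value with the in-sample AUC for the same year gives the kind of comparison reported in the paragraph above.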

Conclusions
The objective of this paper was to obtain an understanding of loan classification problems solved by P2PL companies. We considered LendingClub because of the availability of historical data describing the classification decisions. We found that, with three spline-transformed features, Debt-to-Income Ratio, Employment Length, and Risk Score, we obtained an AUC exceeding 0.95 with all three considered methods (see columns 3, 4, and 5 in Table 4). In this case, the commonly used logistic regression (without transformation of features) also performed fairly well and delivered an AUC exceeding 0.93 (see column 2 in Table 4). For some years, e.g., for 2014, the spline transformation of features significantly improved the result (for the logistic regression, from 0.937744 to 0.982057). These results indicate that the lending process can be well-approximated with only a few features after an adequate nonlinear transformation of those features. We think that the paper provides insight into the loan application process at LendingClub and suggests an approach that can be used for a similar analysis of other companies.
Regarding methodological improvements, we found that the spline transformation of features may improve the performance of the classification algorithms. We compared three classification optimization procedures: logistic regression, maximization of AUC, and maximization of bAUC. For the considered examples, these methodologies provided quite similar results. Since AUC was used as a criterion, the maximization of AUC outperformed other approaches. However, this outperformance was not significant.
The main conclusion of this paper is that, although various factors are available for loan selection at LendingClub, these factors are not used. Lending decisions are based only on a few factors that can be processed with standard popular approaches, such as logistic regression. We recommended some improvements, such as the spline transformation of factors, but these improvements do not have a significant impact on the overall quality of the classification process. Out-of-sample cross-validation confirmed our conclusions. The case study with data and codes was posted on the website and is available to other researchers.
denote the index sets with labels $-1$ and $+1$, respectively. Let us denote by $m_0 := |I_0|$ and $m_1 := |I_1|$ the cardinality of $I_0$ and $I_1$. AUC is given by

Figure 2. Employment Length feature transformed with the cubic spline.The first observation with distinctively lower value is circled.

Figure 3. Risk Score feature transformed with the cubic spline.
The first P2PL company in the United States, PROSPER Marketplace 3, a San Francisco, California-based company, was founded in 2005. PROSPER began operations in February 2006 and was the only P2PL company in the U.S. until LendingClub was founded in May 2007 in San Francisco, California. PROSPER was temporarily shut down in 2008 because of Securities and Exchange Commission (SEC) scrutiny. The majority of PROSPER investors got negative returns, mainly because of a poor loan evaluation model. PROSPER issued loans to anyone who had an interest in getting a loan.

Table 2. Results for models with one feature: DTI.

Table 3. Results for models with two features: DTI and Employment Length. Column 2 shows AUC for logistic regression with the original features without transformation; all other columns show results with spline-transformed features. Column 2 shows that with two features, the logistic regression gives AUC over 0.9. For 2014, AUC maximization with spline-transformed features gives AUC over 0.973 (see column 5).

Table 4. Results for models with three features: DTI, Employment Length, and Risk Score.