The Advantage of Case-Tailored Information Metrics for the Development of Predictive Models, Calculated Profit in Credit Scoring

This paper compares model development strategies based on different performance metrics. The study was conducted in the area of credit risk modeling using diverse metrics, including the general-purpose Area Under the ROC curve (AUC), the problem-dedicated Expected Maximum Profit (EMP) and the novel case-tailored Calculated Profit (CP). The metrics were used to optimize competing credit risk scoring models based on two predictive algorithms that are widely used in the financial industry: Logistic Regression and extreme gradient boosting (XGBoost). A dataset provided by the American agency Fannie Mae was utilized to conduct the study. In addition to the baseline study, the paper also includes a stability analysis. In each case examined, the proposed CP metric allowed us to achieve the most profitable loan portfolio.


Introduction
Predictive model development is a multifaceted task involving steps such as preparing a dataset, selecting an appropriate algorithm, optimizing hyperparameters and determining the optimal cut-off point. While these stages usually receive considerable attention [1][2][3], an often overlooked issue is the selection of the information metric against which the model is optimized. This issue turns out to be no less important than the other steps, especially when dealing with an imbalanced dataset or a case in which the misclassification costs differ significantly between false positive and false negative predictions [4]. The most commonly used metrics, such as the Area Under the ROC curve and the Kolmogorov-Smirnov test [5], unfortunately do not account for this disparity and are not best suited for credit scoring [6].
The problem of misclassification costs is not completely ignored in credit scoring. Both in the literature and in real business applications, there are proposals for metrics that take this phenomenon into account [6,7]. However, most model development methods incorporating this problem focus only on the final stage of selecting a cut-off point such that the proportion of errors is inversely proportional to their costs [8,9]. At the same time, using a problem-suited metric throughout the whole model tuning process may allow obtaining a model better suited to the problem [10].
The overarching goal of building predictive models for credit scoring is to increase a credit institution's profitability by differentiating profitable (often referred to as good) and lossy (bad) customers. In this context, the primary purpose of this work is to demonstrate that synthetic metrics, commonly used for credit risk models' development, can lead to suboptimal solutions in business applications. The paper aims to show that using a problem-tailored metric, Calculated Profit, developed within this research, for the whole model development process, allows one to achieve a significant advantage in terms of overall credit portfolio profitability.
The study is organized as follows: Section 2 is devoted to the issue of classification model errors and methods for evaluating their performance. It also presents the CP metric proposed in the paper for the optimization of credit risk models. Section 3 introduces the methodology and results of the main study, including a description of the utilized dataset. Section 4 is dedicated to a sensitivity analysis of the compared metrics. The conclusions and possible further development directions are presented in Section 5.

Credit Risk Modeling
The ability to distinguish between a potentially good and bad customer is of tremendous value to a financial institution, dramatically impacting corporate profits and the riskiness of a loan portfolio. Ideally, it would be possible to lend to all profitable customers while rejecting applications of the risky and potentially lossy ones. To some extent, such insight can be provided by developing a credit risk model. The task of such a model in the field of information theory is to reconstruct information about the riskiness of a loan using signals, which are random variables (features) describing the characteristics of that loan. These signals are usually significantly distorted, so the role of a good model is to extract only those pieces of the information that allow for a correct assignment of riskiness for a given set of loan features. The results of such a model are then used to decide whether or not to grant credit to a particular customer. Considerable attention has been given to this issue of risk modeling [11][12][13] and a broad spectrum of econometric and machine learning methods have been used for the described task [14,15].

Model Performance Metrics
Credit risk modeling is usually done by determining the probability of default for a given customer, in the range from 0 to 1. Then, based on a preset or determined value, usually called a threshold or cut-off point, customers with a risk below that value are granted credit, whereas those with a probability above it are rejected. As a whole, this process is referred to as binary classification, meaning labeling customers as good ("0") or bad ("1"). When dealing with this issue, four model results are possible in terms of prediction correctness. They are customarily presented by means of a contingency table (Table 1). Two of the results, true positives and true negatives, stand for correct identifications of bad and good credits, respectively. The two remaining outcomes, false positives and false negatives, marked in gray in Table 1, represent incorrect classifications, which, in the business application of the model, are usually associated with specific costs for the company. Moreover, those costs are usually significantly different for the two types of errors [16]. Relative to an ideal discriminator, the model payoff in a credit scoring application can be represented as shown in Table 2:

        Credit denied                               Credit granted
Bad     cost = 0                                    cost = loss on default
Good    cost = loss of profit on good credit        cost = 0

In general, the goal of model building is to achieve the lowest possible prediction error, and there are many widely used metrics for evaluating models [17]. Although these measures are widely and readily used, they have numerous drawbacks that eliminate them from application in the complex development of advanced models. First of all, they are not suitable for imbalanced datasets [18]. Secondly, they depend on the choice of a cut-off point, which is used for transferring from the probability plane, returned by the predictor, to the binary classification. This is problematic, as this point is usually selected only at the last stage of model development. For those reasons, more universal metrics, such as the Area Under the ROC Curve (AUC) (Figure 1) [19], which measures the discriminative power of a model through the whole spectrum of possible cut-off points and is partially insensitive to the class imbalance problem, are most commonly used for the model development process [20]. Similar characteristics are offered by the Kolmogorov-Smirnov test (KS-statistic) [21] and the Area Under the Precision-Recall Curve (AUPRC) [22]. However, all these measures are still not ideal for evaluating models used in business applications because they do not consider misclassification costs, assuming that they are equal for FP and FN predictions. In the example of crediting institutions, if the bankruptcy cost exceeds the profit on a good credit, out of two models with identical accuracy or AUC, the one that achieves a smaller percentage of false negative classifications is preferred.
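The effect described above can be illustrated with a minimal sketch: two hypothetical classifiers with identical accuracy can imply very different portfolio payoffs under the cost scheme of Table 2. All numbers below are illustrative assumptions, not values from the study.

```python
import numpy as np

def portfolio_payoff(y_true, y_pred, loss_on_default, lost_profit):
    """Payoff relative to an ideal discriminator (Table 2 scheme):
    a FN (granting credit to a bad client) costs the loss on default,
    a FP (denying credit to a good client) costs the forgone profit."""
    fn = np.sum((y_true == 1) & (y_pred == 0))  # bad client accepted
    fp = np.sum((y_true == 0) & (y_pred == 1))  # good client rejected
    return -(fn * loss_on_default + fp * lost_profit)

# Two hypothetical classifiers, both 90% accurate on 100 loans (10 defaults):
y_true  = np.array([1] * 10 + [0] * 90)
model_a = np.zeros(100)                                              # misses all 10 defaults
model_b = np.concatenate([np.ones(10), np.ones(10), np.zeros(80)])   # catches defaults, rejects 10 good
acc_a = np.mean(model_a == y_true)  # 0.9
acc_b = np.mean(model_b == y_true)  # 0.9
# With loss_on_default = 5 and lost_profit = 1, model B is far cheaper:
print(portfolio_payoff(y_true, model_a, 5, 1))  # -50
print(portfolio_payoff(y_true, model_b, 5, 1))  # -10
```

Accuracy cannot distinguish the two models, while the cost-weighted payoff clearly prefers the one that avoids the expensive false negatives.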

Cost-Sensitive Methods
Many attempts have been made to account for the cost aspect [6][7][8][23,24]. One of the simplest methods uses a graph of the Receiver Operating Characteristic (ROC) (Figure 2) [25]. In this approach, the optimal cut-off is determined as the point at which the ROC curve is tangent to a line whose slope equals the quotient of the false negative and false positive costs. This approach should theoretically ensure that the proportion of each type of false prediction in a loan portfolio is inversely proportional to its cost. However, the presented method allows for including misclassification costs only in the final step of selecting the cut-off point; the whole training process of the model in this case is performed solely using AUC or another similar metric. Thus, misclassification costs are not fully taken into account, and it has been shown that an additional advantage can be gained by using metrics that consider misclassification costs throughout the whole optimization stage [26]. Some authors have already addressed this issue in the area of credit scoring [27,28]. There have also been proposals for new metrics incorporating cost aspects: Hand [6] presented the M2 measure, and Calabrese [8] introduced the MEL curve. One of the newest and most sophisticated, on which we will focus in this study, is the Expected Maximum Profit (EMP) metric, described by Verbraken [7].
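The tangent rule above can be sketched with sklearn's ROC utilities. In ROC space, the tangency point maximizes the intercept of an iso-cost line; the slope used below follows the standard Bayes-cost derivation, which also weights the cost quotient by the class priors (an assumption beyond the plain cost ratio stated in the text; with balanced classes, the two coincide).

```python
import numpy as np
from sklearn.metrics import roc_curve

def tangent_cutoff(y_true, scores, cost_fn, cost_fp):
    """Cut-off where the ROC curve touches an iso-cost line.
    Slope m = C_FP * (1 - p) / (C_FN * p), the standard cost-sensitive
    form; p is the prior of the positive (bad) class."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    p = np.mean(y_true)
    m = cost_fp * (1 - p) / (cost_fn * p)
    # The tangency point maximizes the intercept TPR - m * FPR:
    best = np.argmax(tpr - m * fpr)
    return thresholds[best]
```

For example, with false negatives five times costlier than false positives, the selected threshold shifts so that more applicants are flagged as risky.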
The main idea behind EMP is to estimate the costs of false positive and false negative predictions as proportions of the Initial Principal Balance (IPB), namely, the size of the loan:

Cost(FN) = LGD, Cost(FP) = ROI = I / IPB,

where I = total interest, LGD = Loss Given Default as a proportion of the IPB and ROI = Return on Investment. The LGD distribution is assumed to come from a point-continuous distribution, with point masses at LGD = 0 and LGD = 1 and a continuous part in between. In addition to considering the costs of false classification, EMP also provides an optimal cut-off point, determined as the argument that maximizes the internal expected profit function. However, the downside of this method is that both the value of the Return on Investment (ROI) and the LGD distribution are assumed to be constant and need to be arbitrarily defined. In reality, these values can change over time and vary depending on specific loan characteristics. Therefore, it is a common practice to create models predicting the LGD for each credit separately [29,30]. Additionally, the assumption of a particular form of the LGD distribution creates the possibility of a systematic estimation error when this condition is not met.

Calculated Profit
Effective use of the most common metrics comes with limitations, such as the assumptions of a balanced dataset and equal classification error costs, and the need to accept a representation of the model's effectiveness only for a particular cut-off point. Even using more advanced metrics tailored to a specific problem means coming to terms with some simplification, as shown for the EMP.
To eliminate the problems associated with the use of synthetic metrics described above, a novel, strictly business-oriented metric called Calculated Profit (CP) is proposed. Calculated Profit may be defined as:

CP(t) = Σ_{i: score_i < t} Credit Profit/Loss_i,

where Credit Profit/Loss_i = the empirical profit or loss on a single credit, score_i = the predictor's score for a specific credit and t = the cut-off point.
Determining the CP metric requires computing the cut-off point while maximizing its value; therefore, the optimal value of this threshold, t* = argmax_t CP(t), is simultaneously supplied by the metric. CP is calculated as the sum of the profit (or loss) on every credit admitted to the portfolio. A metric constructed this way has many advantages. To start with, it is insensitive to the class imbalance problem, which affects basic metrics such as accuracy. Secondly, it incorporates possible differences in the misclassification costs of FP and FN predictions. Compared to the EMP, it has two significant advantages in this field. While calculating the CP, we make no assumptions about the ROI or the shape of the LGD distribution; thus, it is resilient to a constant bias or an incorrect determination of those values. It also allows the misclassification costs to be time-variant and correlated with the features of the credit itself. Therefore, CP should be more efficient and robust, while simultaneously being easier to apply, as it does not require any predetermined parameters of the dataset. Finally, it represents the direct objective of the company; thus, a model optimized with it should be most closely aligned with the business problem. At a theoretical level, CP is both simpler and more flexible than the presented synthetic metrics. One inconvenience of using CP is the need to possess information about the actual profit on each loan. However, this is rarely a problem in real-world applications; where unavailable, the profit can be estimated, for example, based on the reference interest rate.
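The definition above translates directly into code: CP at a given cut-off is the total empirical profit of the admitted credits, and the metric's own optimum is found by scanning candidate cut-offs. A minimal sketch (toy numbers are illustrative):

```python
import numpy as np

def calculated_profit(scores, profits, cutoff):
    """CP at a given cut-off: total empirical profit/loss over the
    credits admitted to the portfolio (score below the cut-off,
    since a higher score means higher estimated risk)."""
    return profits[scores < cutoff].sum()

def optimal_cp(scores, profits):
    """CP supplies its own threshold: scan candidate cut-offs and
    return the maximal portfolio profit together with its argmax."""
    candidates = np.unique(scores)
    # add a cut-off above every score so the full portfolio is considered
    candidates = np.append(candidates, candidates[-1] + 1e-9)
    values = [calculated_profit(scores, profits, t) for t in candidates]
    best = int(np.argmax(values))
    return values[best], candidates[best]

# Toy portfolio: three profitable loans and one large default loss.
scores  = np.array([0.1, 0.2, 0.6, 0.9])
profits = np.array([10.0, 12.0, 8.0, -40.0])
cp, t = optimal_cp(scores, profits)
print(cp, t)  # best portfolio admits the first three loans: CP = 30 at t = 0.9
```

Note that the returned threshold excludes exactly the loss-making loan, which no score-only metric could guarantee.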

Metric Comparison
The purpose of the next stage of the paper is to compare the CP with other metrics used in credit risk modeling. The comparison will be made by developing separate scoring models, which are then compared in terms of total portfolio profitability on a validation set. Apart from the CP, two more previously described metrics, namely, AUC and EMP, differing in the degree of fit to the problem under study, were selected for the comparison.

Methodology
In order to compare model development strategies based on different metrics, two algorithms widely used in actual business applications were utilized. The first of them was Logistic Regression (LR) with a Weight of Evidence (WoE) variable transformation [31]. Logistic Regression is a classical generalized linear method for binary classification. Despite recent advancements in the area of machine learning, it is still prevalent because of its stability on smaller samples, low likelihood of overfitting and explainability [32,33]. The latter is especially important due to regulations imposed on the financial sector by international institutions [34], which constrain crediting companies to the explicit use of fully explainable models for credit scoring, making Logistic Regression a still widely used tool. Additional methods, such as the WoE transformation, diminish the limitations of linear models and create predictive tools with performance comparable to those prepared using modern machine learning algorithms [35]. Weight of Evidence was specifically chosen for this study, as it is a method that originated in and is still widely used in the preparation of credit risk models [36].
For the second approach, XGBoost [37] was chosen as a representative of the modern, state-of-the-art class of machine learning algorithms. Although a relatively new solution, XGBoost has already been implemented in all kinds of applications, including the less restricted areas of the financial industry [38,39]. Using these two algorithms, the models were optimized with three different metrics:
• AUC is the most universal and basic metric of those presented. For the model optimized by this method, the cut-off point was selected by minimizing the KROC value (Figure 3), which is the distance from the ROC curve to the point (0,1) of its graph [25]. When selecting the cut-off point, KROC favors intermediate solutions with a similar level of FP and FN errors. This approach thus equalizes both misclassification costs and was chosen for the study as a popular reference solution.
• EMP, using a built-in optimal cut-off point recommendation, represents an advanced, problem-dedicated but synthetic solution. The scoring function parameters were estimated empirically, where T_0 = the moment of loan acceptance for a portfolio, UPB = the credit Unpaid Principal Balance, T_F = the moment of a loan foreclosure and F_LGD = the empirical LGD distribution. p_0 and p_1 were calculated as the proportions of credits with LGD values below 10% or above 90%, respectively.
• CP is a novel, strictly goal-oriented metric, described in depth in Section 2.4.
All models went through an optimization process of feature selection, hyperparameter tuning (for XGBoost) and cut-off point calculation, using the respective metrics. The feature selection stage was conducted using the permutation feature importance method [40]. Features that improved the model up to the fifth significant digit (according to the relevant metric) were chosen for the final model. In the case of XGBoost, the optimal hyperparameters were determined using 100 iterations of random search. The entire optimization process was performed using 3-fold cross-validation. Considering the two alternative default flags, which will be introduced in the next chapter, 12 models were compared at this stage of the study.
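The optimization loop described above can be sketched as follows: a profit-based score is evaluated in 3-fold cross-validation and used as the selection criterion during a random search. The dataset, the profit values and the search over the regularization strength C are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def cp_score(scores, profits):
    """Best Calculated Profit over candidate cut-offs (see Section 2.4)."""
    cuts = np.append(np.unique(scores), 2.0)  # 2.0 admits the full portfolio
    return max(profits[scores < t].sum() for t in cuts)

def cv_cp(model, X, y, profits, folds=3):
    """Mean CP across stratified folds -- the selection criterion used
    in place of AUC during tuning and feature selection (sketch)."""
    total = 0.0
    for tr, te in StratifiedKFold(folds, shuffle=True, random_state=0).split(X, y):
        model.fit(X[tr], y[tr])
        total += cp_score(model.predict_proba(X[te])[:, 1], profits[te])
    return total / folds

# Hypothetical random search over the regularization strength C:
X = rng.normal(size=(300, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)
profits = np.where(y == 1, -5.0, 1.0)  # defaults lose money, good loans earn
best = max((cv_cp(LogisticRegression(C=c), X, y, profits), c)
           for c in rng.uniform(0.01, 10.0, size=5))
```

The same loop works for XGBoost hyperparameters: only the model constructor and the sampled parameter dictionary change.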

Fannie Mae Multifamily Loan Performance Data
Data used in the study comes from an American agency called Federal National Mortgage Association (FNMA), commonly known as Fannie Mae. It is a private, publicly-traded company working in cooperation with the USA government (Government Sponsored Enterprise). By special law, it is authorized to provide credits and credit guarantees. The corporation is not directly financed by the government, although it operates on preferential terms compared to other enterprises of this type. Fannie Mae's primary focus is to provide liquidity on the secondary mortgage market by securitizing mortgage loans [41].
The database [42] describes 54,160 credits from a "multifamily" category which includes loans granted to enterprises and secured by multi-apartment buildings. Reported monthly data consist of credits acquired between 1 January 2000 and 31 December 2020, representing about 85% of all multifamily credits acquired by Fannie Mae in this period. The file contains 3,607,268 rows and 57 columns. On average, one credit is present in the dataset for about 67 months. By the thematic criterion, we divided features into: identifying credit/transaction (4), loan terms at origination (11), customer repayment capabilities at origination (2), collateral property (9), loan parameters at the acquisition (8), current loan information (8), loan repayment history (1) and describing credit events (21).
The raw dataset was put through a standard data preparation process. Firstly, credits for which information was available for less than 12 months were removed as not representative. Then, the average interbank lending rate [43] was added as a reference credit rate. On its basis, and through transformations of the original variables, 17 additional predictors were added. Auxiliary features irrelevant to predictions (i.e., the credit id) were removed. The full feature list after this step can be found in Table A1 in Appendix A. Data gaps were marked as a separate category when their source was known or when they represented a significant proportion of the data (over 10%). For the remaining instances, K-Nearest Neighbors (KNN) imputation with parameter k = 5 was applied. With a view to using the Logistic Regression algorithm, categorical variables were binarized. Finally, highly collinear features, with a Variance Inflation Factor (VIF) statistic of over 100, were removed.
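The two final cleaning steps can be sketched with standard tooling: KNN imputation (k = 5) via sklearn and iterative VIF-based pruning via statsmodels. This is a generic sketch, not the paper's exact pipeline; column handling is assumed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from statsmodels.stats.outliers_influence import variance_inflation_factor

def prepare(df, vif_limit=100.0):
    """Fill remaining gaps with KNN imputation (k = 5), then iteratively
    drop the feature with the highest VIF until all VIFs are below the limit."""
    X = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(df),
                     columns=df.columns)
    while X.shape[1] > 1:
        vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= vif_limit:
            break
        X = X.drop(columns=X.columns[worst])
    return X
```

Dropping one collinear feature at a time matters: removing a single member of a correlated group can bring the VIFs of the remaining members back under the threshold.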

Profit Calculation
For the purpose of using the CP metric, it is essential to obtain information about the profitability of each finalized credit. This is usually not a problem in a business application; however, publicly available databases usually do not share this kind of insight. In the case of the dataset used in the study, the profit was estimated by subtracting capital costs (the interbank reference rate) from installments, calculated based on the current unpaid balance of the credit and the credit rate. Both the credit rate and the default cost were available in the unprocessed database as separate variables. All cash flows were discounted based on the annual inflation in the USA [44] at the date of observation. Furthermore, administrative costs were considered in the form of 0.1% of the Unpaid Principal Balance (UPB) as the credit portfolio admission cost, and 5% of the UPB when the loan entered default. The final credit profit was calculated separately for matured and for foreclosed credits, where UPB(t) = the credit Unpaid Principal Balance at moment t, T_M = the credit maturity moment, r_C(t) = the credit rate at moment t, r_R(t) = the reference rate at moment t and π(k) = inflation at moment k.
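A hedged sketch of this computation, reconstructed from the description above (the paper's exact formulas are not reproduced here): the monthly margin of the credit rate over the reference rate on the outstanding balance, discounted by inflation, less the 0.1% admission cost and, on default, the 5% servicing cost plus the realized loss. Rates are monthly fractions; all argument names are assumptions.

```python
import numpy as np

def credit_profit(upb, r_credit, r_ref, inflation, defaulted=False,
                  loss_on_default=0.0):
    """Discounted profit on a single credit: installments' margin over the
    capital cost, discounted by inflation, minus 0.1% of initial UPB as the
    admission cost; on default, minus 5% of UPB servicing cost and the loss."""
    upb = np.asarray(upb, dtype=float)
    margin = upb * (np.asarray(r_credit) - np.asarray(r_ref))
    discount = np.cumprod(1.0 + np.asarray(inflation))  # cumulative deflator
    profit = np.sum(margin / discount) - 0.001 * upb[0]
    if defaulted:
        profit -= 0.05 * upb[-1] + loss_on_default
    return profit
```

For example, two months of a 1% credit rate against a 0.5% reference rate on a balance of 100, with zero inflation, yield 1.0 of margin less the 0.1 admission cost.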

Dependent Variable-Default Flag
In the next step, the default flag was prepared as the dependent variable. In order to test different approaches, the entire study was conducted using two scenarios:
• 60M flag: dependent variable with a value of "1" when the credit falls into a delinquency of over 90 days within the next 60 months. Such a flag follows the common definition of default in the financial industry, proposed by the Basel II regulations [34]. Although the standard observation period is 12 months, a longer 60-month window was chosen in this study because of the long-term character of the modeled instruments and to increase the observed default ratio.
• Lifetime flag: dependent variable with a value of "1" when information about the credit foreclosure is available anywhere in the credit history.
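Construction of the 60M flag on a monthly panel can be sketched as follows: for each record, the flag is "1" if the loan first crosses 90 days of delinquency within the following 60 months. Column names (loan_id, month, days_delinquent) are assumptions, not the dataset's actual schema.

```python
import pandas as pd

def add_60m_flag(df):
    """Flag each monthly record '1' if the credit reaches 90+ days of
    delinquency within the following 60 months."""
    first = (df.loc[df["days_delinquent"] > 90]
               .groupby("loan_id")["month"].min())   # first default event
    fe = df["loan_id"].map(first)                    # NaN if never defaults
    out = df.copy()
    out["flag_60m"] = ((df["month"] < fe)
                       & (fe - df["month"] <= 60)).astype(int)
    return out

records = pd.DataFrame({
    "loan_id":         [1, 1, 1, 2, 2],
    "month":           [1, 2, 3, 1, 2],
    "days_delinquent": [0, 0, 120, 0, 0],
})
print(add_60m_flag(records)["flag_60m"].tolist())  # [1, 1, 0, 0, 0]
```

Records at or after the event itself stay at 0 here; in the study, loans already delinquent at observation were excluded anyway.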
Based on the default flag, the datasets were then separated. For both target variables, observations with an unknown target were removed, meaning that either the credit repayment had not yet ended or, in the case of the 60M flag, the credit's lifespan was shorter than 60 months. Next, for the lifetime flag, only the first record of each credit was chosen for the final dataset because of its lifecycle approach. In contrast, in the case of the 60M flag, one record for every 60 months was selected for the modeling dataset. Such treatment allowed for larger training and testing samples while avoiding overlapping observation windows. Model overfitting was not expected to be a problem, as the average number of repetitions of a unique credit in the studied set is around 2. Finally, credits delinquent at the time of observation were excluded, and both datasets were divided into training and testing subsamples in 70%/30% proportions. To avoid time interference and unequal case distribution, samples were drawn in a shuffled and stratified way with respect to the dependent variable. The main characteristics of the modeling-ready datasets are presented in Table 3. An interesting note from Table 3 is that for the 60M flag, the average loss on a bad client is relatively small, almost zero. This phenomenon is partially explained by the credit profit histograms (Figure 4). As one can see, for the 60M flag, many loans designated as "defaults" will still make a positive profit. This situation may be due to the not entirely commercial nature of Fannie Mae: loans entering default may, for example, get favorable debt refinancing terms, and thus a high cure rate is achieved for the entire base. In the case of the lifetime flag, the average loss on a bad customer is still lower than the average gain on a good customer, but this time, the disproportion is much smaller.
In addition, there is a sharper separation between profitable and lossy customers for the lifetime flag.

Results
The following section is devoted to discussing the results of models trained according to the methodology presented above. All approaches were tested on the same samples. The parameters of the models were chosen in 3-fold cross-validation, and the results refer to models trained on the entire training subset and validated on the test subset. The main calculations were conducted using Python with sklearn, statsmodels and xgboost packages. A dedicated package [45] in R was used to examine the EMP value.
Analyzing the results of the individual models (Figure 5), we can observe, not surprisingly, that both methods incorporating different misclassification costs for FP and FN predictions produced significantly better results in terms of profit than the general approach using AUC. In each study, the CP metric achieved the highest profit. The largest difference was approximately 4%, compared to the EMP model, in the case of the lifetime default flag and Logistic Regression. This number may seem small, but a 4% gain achieved by using the more tailored model development metric would add significant value to a company's overall financial account in an actual business application. As can be seen in Table 4, the most prominent feature of the models developed by CP is the selection of a significantly higher cut-off point compared to the other metrics, subsequently allowing the largest number of loans to be included in the portfolio in each of the study variants examined. An in-depth analysis of the level of the cut-off point and the average profit per loan indicates that the advantage of the CP metric lies not only in the more appropriate choice of the cut-off point. In the case of the lifetime flag and the LR model, we observed that despite including more loans in the portfolio, the average profit per single loan was higher than for the EMP. This example illustrates the point of using a goal-oriented metric for the entire model development process, as in this case, the CP-based model was able to catch the differentiating characteristics of the profitable loans. Additionally, the CP metric, in all cases, chose a significantly smaller number of features, which is usually a beneficial factor in terms of the stability and computational complexity of a model.
Moreover, the maximum difference in performance between two algorithms was 6.56% in favor of Logistic Regression in the case of the 60M flag and AUC metric, and that was far smaller than the maximum difference between metrics, which amounted to 21.3% in favor of CP over AUC for XGB and also the 60M flag. That example illustrates that the correct selection of model development metric might be much more beneficial than the choice of predictive algorithm. It is also worth noting that Logistic Regression models boasted equal or better total profit results, compared to XGBoost, for the best models for a given default flag.
The obtained results in terms of the comparison between EMP and AUC are in line with those presented in previous studies [7]. Additionally, in that case, EMP was characterized by a higher number of accepted credits and a higher total profit at the price of a lower average profit per individual customer, in relation to AUC.
As observed above, the difference in performance between the CP and EMP metrics was notable for the lifetime flag but only slight in the case of the 60M flag. The source of this phenomenon was that the EMP, as described in the previous chapter, makes specific assumptions about the LGD distribution when estimating the cost of a false positive classification.
The 60M flag is generally an unusual case, for which a great proportion of bad clients are still profitable. The EMP does not incorporate this in the estimated loss on a defaulted client. As the LGD is assumed to take values from 0 to 1, EMP has to utilize its transformed distribution, as in Figure 6 on the right. However, in the case under examination, thanks to the specific shape of the transformed LGD distribution, EMP more accurately approximates the cost of a bad customer as close to zero. Nevertheless, it is still overestimated. In order to ensure that the obtained results do not depend on the specific characteristics of the dataset under study, an analysis of the stability of the results is presented in the next section.

Stability Analysis
The following section is devoted to analyzing the sensitivity of solutions as a function of test conditions and dataset characteristics. The study addresses two issues. First, the effects of the level of additional administrative costs on the performances of individual metrics are tested, as these values were taken arbitrarily and were not directly derived from empirical data. Next, the impacts of the LGD distribution in the study sample on the EMP and CP metrics performance are examined.

Different Levels of Service Costs
In the baseline study, the administrative costs were assumed to be 0.1% and 5% of the current UPB, respectively, for every granted loan and for serving a delinquent credit, though these can vary significantly depending on the nature of the lending institution's business. These values may be lower for banks working with large, multi-year loans. Simultaneously, they will be much higher for small, short-term cash loans. That is the reason why this chapter is dedicated to testing the sensitivity of the results as a function of the level of additional costs.

Low Costs
First, a check was conducted for the lower level of costs, where the expense of taking on the portfolio was zero, and the cost of servicing the loan in delinquency amounted to 1% of UPB.
The reduced cost level did not significantly affect the profit on good customers for either default flag (Table 5). However, there were interesting changes for customers marked as bad: for the 60M flag, an average profit of $118,050 was made on them, and for the lifetime flag, the average loss dropped by about 10%.
In spite of that, the low-cost variant did not noticeably change the visual profit distributions for "good" and "bad" clients, in either the 60M or the lifetime flag case (Figure 8). Reducing the level of additional costs did not dramatically affect the results of the individual metrics, but the changes are noticeable (Figure 9). First, there were slight decreases in the advantages of the CP and EMP over AUC for the lifetime flag. For the 60M case, profit gains on the XGBoost model dropped significantly, by about 10%, for both the EMP and CP metrics. An interesting result of this change is that for both flag variants, the most significant gains from using the problem-tailored metric were achieved by models based on LR. Despite the changes, the models developed with CP again resulted in the most profitable portfolios for each study variant. A more profound analysis of the details (Table 6) provided similar conclusions as in the case of the baseline study. Again, the CP metric selected the smallest number of variables on average and admitted the largest number of credits into the portfolio. Again, the maximum profit difference between the two algorithms for the same metric and data subset (6.23% for the lifetime flag, AUC) was much smaller than the difference between metrics with regard to the same algorithm and subset (15.95% for LR, 60M flag).

High Costs
Next, the models were compared when the cost of admission was 10%, and the expense of servicing the delay amounted to 20% of UPB.
This time, the cost alteration had a bigger impact on both subsamples (Table 7). The average profits on good clients decreased by 29.9% and 36.3%, respectively, for the 60M and lifetime flags. Additionally, the loss on bad clients increased drastically for the 60M flag, by 1699% to 1.45M USD, almost equalizing the profit. For the lifetime flag, the increase amounted to 72%. In Figure 10, we observe that the profit histograms in both cases shifted noticeably to the left, and the loss distributions on defaults were stretched. For the lifetime flag, the profit on a portion of "good" customers took a negative turn.
The advantage of the EMP and CP metrics over the AUC also declined in this study variant (Figure 11). This is particularly evident for the models based on the XGBoost algorithm and, in the case of the lifetime subsample, for the Logistic Regression model developed with EMP. Again, models based on the LR algorithm achieved greater gains for both flags. As in the previous studies, CP-based models produced the largest gains for each scenario tested. On close inspection (Table 8), a pattern familiar from the previous studies is repeated. The CP metric selects the lowest number of variables and grants the most credits. However, it is interesting to note that in the case of the XGBoost algorithm, CP selected a lower absolute cut-off point level than EMP. This may indicate a better fit of the model developed through CP to the problem under study. A disturbing sign may be that in each case, CP recorded the lowest level of average profit per credit, which resulted from granting the largest number of credits while also admitting those with mildly worse parameters. However, the loss to AUC, which on average boosted the portfolio with the highest average return per loan, did not exceed 3.5% in any case. Once again, the maximum profit difference between the two algorithms for the same metric and data subset (4.89% for the lifetime flag, AUC) was significantly smaller than the difference between metrics with regard to the same algorithm and subset (10.92% for LR, 60M flag). Even if we compare CP with EMP, the difference for the lifetime flag and LR (5.69%) is still bigger than the maximum difference between the two algorithms.

Cost Comparison Summary
Regardless of the level of costs, which influenced the distribution of profit on the credits, a similar pattern of results was observed: the CP metric outperformed EMP and AUC in terms of total profit, and the difference between metrics was bigger than the gap between algorithms. However, compared to the baseline study, both problem-suited metrics provided smaller gains over the AUC in terms of total profit, particularly for the XGBoost-based models. On average, CP chose the smallest number of features and granted credits to the highest number of borrowers, accepting a slightly lower average profit per customer. It is worth mentioning that these results were very robust, repeating across all variants of the cost levels. Finally, there was no clear dependence between the level of costs and the relative performance of the EMP and CP metrics.

LGD Distributions
As shown above, a major drawback of the EMP metric is its assumption of a specific, discrete-continuous LGD distribution, which limits the applicability of the metric to cases where this assumption is met. Additionally, EMP assumes a constant ROI and expected LGD value for all the loans in the sample. In reality, both ROI and LGD can depend on the characteristics of the loan itself, which, if captured by the model, can yield additional gains over the portfolio. Here, the advantage of the CP metric over EMP is that it makes no assumptions about the distributions of gains and losses, which gives the model additional degrees of freedom to adjust to the characteristics of profitable or loss-making loans. This section explores how much deviation from the assumed LGD distribution affects EMP performance.
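The distribution-free character of CP can be illustrated with a minimal sketch. The helper names below (`calculated_profit`, `best_cutoff`) and the exact form of the computation are our assumptions for illustration, not the paper's implementation: CP is evaluated here simply as the sum of realized profits and losses of the loans a model would accept at a given cut-off, with no parametric model of the LGD.

```python
import numpy as np

def calculated_profit(scores, profits, cutoff):
    """Empirical portfolio profit: sum the realized profit (positive)
    or loss (negative) of every loan accepted below the score cutoff.
    No distributional assumptions about LGD or ROI are made."""
    accepted = scores <= cutoff
    return profits[accepted].sum()

def best_cutoff(scores, profits, grid=None):
    """Pick the cutoff maximizing empirical CP over a grid of
    candidate score thresholds."""
    grid = np.linspace(0.0, 1.0, 101) if grid is None else grid
    cps = [calculated_profit(scores, profits, c) for c in grid]
    i = int(np.argmax(cps))
    return grid[i], cps[i]
```

Because the metric is evaluated directly on realized profits, the same function can drive variable selection, hyperparameter tuning and cut-off determination without ever estimating an LGD distribution.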

Methodology
In order to demonstrate CP's advantage in the described setting, a study on bootstrapped samples of credits with pre-defined LGD distributions was conducted. Bootstrapping served two purposes. First, the base dataset is specific, having a very low default ratio and a relatively small cost of granting credit to a bad borrower, which directly favors models that admit more loans into the portfolio. To present a more general case, the modeling sample was resampled to a default ratio of 20%. Secondly, bootstrapping was applied to obtain specific LGD distributions. Four configurations were tested in the study:
• Uniform distribution between 0 and 1.
• EMP-assumed, discrete-continuous probability distribution with point masses at the ends of the interval (p0 = 0.5 and p1 = 0.25) and a uniform distribution on the interval (0, 1).
• Right-skewed beta distribution with parameters alpha = 3 and beta = 8.
• Left-skewed beta distribution with parameters alpha = 8 and beta = 3.
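The four LGD configurations above can be sampled as follows. The distribution parameters are taken from the text; the seed and sample size are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200  # number of defaulted loans per bootstrapped sample

# 1. Uniform distribution between 0 and 1.
lgd_uniform = rng.uniform(0.0, 1.0, n)

# 2. EMP-assumed discrete-continuous mixture: point mass p0 at LGD = 0,
#    point mass p1 at LGD = 1, uniform on (0, 1) for the remainder.
p0, p1 = 0.5, 0.25
u = rng.uniform(0.0, 1.0, n)
lgd_emp = np.where(u < p0, 0.0,
          np.where(u < p0 + p1, 1.0, rng.uniform(0.0, 1.0, n)))

# 3. Right-skewed beta(3, 8): mass concentrated near 0 (mean 3/11).
lgd_right_skewed = rng.beta(3, 8, n)

# 4. Left-skewed beta(8, 3): mass concentrated near 1 (mean 8/11).
lgd_left_skewed = rng.beta(8, 3, n)
```

The two beta cases are the ones that violate EMP's assumed mixture: their expected LGD (about 0.27 and 0.73) differs markedly from the 0.375 implied by p0 = 0.5, p1 = 0.25 with a uniform remainder.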
Example drawn samples are presented in Figure 12. The study results were averaged over 100 bootstrapped samples. To reduce subsample instability, the 10% of loans with the highest UPB at the time of acquisition were excluded from the draw. Without this step, single-sample profits were largely determined by whether or not the model accepted a few dominating credits with a large UPB, requiring many more draw repetitions to achieve reproducible results for the whole study. A single sample consisted of approximately 200 defaulted and 800 non-defaulted credits. Analogously to the baseline study, EMP and CP were used for variable selection, hyperparameter optimization and cut-off point determination.
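The resampling step described above can be sketched as follows. This is a hypothetical reconstruction of the procedure (the function name and interface are ours, and the authors' implementation may differ): high-UPB loans are excluded first, then defaults and non-defaults are drawn with replacement to hit the 20% target default ratio.

```python
import numpy as np

def bootstrap_sample(is_default, upb, target_dr=0.20, n_total=1000, rng=None):
    """Draw one bootstrapped sample (array of loan indices) with a
    fixed default ratio, excluding the top-decile UPB loans."""
    rng = np.random.default_rng() if rng is None else rng
    idx = np.arange(len(is_default))
    # Exclude the 10% of loans with the highest UPB to stabilise
    # single-sample profits.
    keep = upb <= np.quantile(upb, 0.90)
    bad = idx[keep & is_default]
    good = idx[keep & ~is_default]
    n_bad = int(round(target_dr * n_total))   # ~200 defaults
    n_good = n_total - n_bad                  # ~800 non-defaults
    return np.concatenate([rng.choice(bad, n_bad, replace=True),
                           rng.choice(good, n_good, replace=True)])
```

Averaging a study over 100 such draws then amounts to calling this function with 100 different seeds and evaluating each metric on every resulting sample.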

Bootstrapping Results
As can be seen in Figure 13, the results of the EMP metric are comparable to those of CP for the uniform and EMP-assumed distributions. Such an outcome was expected because, for these examples, the actual expected value of the LGD matches the one estimated by EMP. CP's advantage is revealed for both beta distributions. In the case of a right-skewed distribution, EMP overestimates the cost of default and admits fewer loans into the portfolio than is optimal. For a left-skewed distribution, the empirical LGD is significantly higher than the one assumed by EMP, so too many loans are granted to loss-making customers. As in previous studies, LR-based models repeatedly outperformed XGBoost. Table 9 shows that the EMP distribution parameters (p0, p1) were correctly estimated for all variants of the test. However, their values did not allow us to properly approximate the properties of the samples for both beta distributions. Once again, the CP metric selected the fewest predictors for the model in all the cases considered.

Conclusions
We aimed to present a new empirical, business-oriented metric for evaluating predictive credit risk models, Calculated Profit, and to compare it with other metrics used for this purpose (AUC, EMP). The study was based on the "Multifamily" loan database acquired from Fannie Mae. In the research, Calculated Profit outperformed the problem-tailored EMP metric by up to 4% in terms of total portfolio profit on the test subset, and the outperformance was even greater compared to the more generic AUC metric. The sensitivity analysis demonstrated that this advantage is stable across different dataset characteristics. The results indicate that applying CP to the entire development process gives the model additional degrees of freedom to better adapt to the nature of the problem, yielding additional gains. Furthermore, in the studied case, the differences between the compared metrics proved to be significantly bigger than the differences between the predictive algorithms used (Logistic Regression and XGBoost), highlighting the importance of choosing the right information metric for the model development process. A conclusion of the study is that in real applications, the use of artificial, even problem-dedicated, assessment methods for model development can lead to suboptimal results. This study demonstrated that top-down assumptions negatively affect the performance and overall applicability of model evaluation metrics. The general definition of the CP metric makes it both more efficient and more flexible, making it suitable also for applications other than credit risk.
It is also worth mentioning that credit scoring is a multidimensional process that is evaluated not only through the perspective of direct profits. Financial institutions must also manage portfolio risk with at least equal care. In this respect, CP may lead to a riskier result, as in many cases it admits more credits into the portfolio. However, this issue does not undermine the concept of CP as a profit-driven metric. It would be intriguing to develop this method into a multidimensional approach, so that other factors, such as the described risk perspective, could be incorporated into the model development process. In addition, it would also be valuable to expand the study to a comparison of CP's performance with other metrics, based on other datasets and using other predictive algorithms.

Data Availability Statement: The dataset used in the study was derived from: https://capitalmarkets.fanniemae.com/credit-risk-transfer/multifamily-credit-risk-transfer/multifamily-loan-performance-data (accessed on 18 March 2021).

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: