Credit Decision Support Based on Real Set of Cash Loans Using Integrated Machine Learning Algorithms

Abstract: One of the important research problems in the context of financial institutions is the assessment of credit risk and the decision whether to grant or refuse a loan. Recently, machine learning-based methods have been increasingly employed to solve such problems. However, selecting an appropriate feature selection technique, sampling mechanism, and/or classifier for credit decision support is very challenging and can affect the quality of the loan recommendations. To address this challenging task, this article examines the effectiveness of various data science techniques for credit decision support. In particular, a processing pipeline was designed that consists of methods for data resampling, feature discretization, feature selection, and binary classification. We suggest building decision models that leverage pertinent methods for binary classification and feature selection, as well as data resampling and feature discretization. The feasibility of the selected models was analyzed through rigorous experiments on real data describing clients' ability to repay loans. During the experiments, we analyzed the impact of feature selection on the results of binary classification, and the impact of data resampling combined with feature discretization on the results of feature selection and binary classification. After experimental evaluation, we found that the correlation-based feature selection technique and the random forest classifier yield superior performance in solving the underlying problem.


Introduction
Nowadays, banks and financial institutions carefully analyze the credit risk of their clients [1]. The current world situation, i.e., the COVID-19 pandemic, affects not only people's lives, but also has a negative impact on economic factors, especially those related to the payment of liabilities by potential borrowers [2]. For this reason, such organizations need credit scoring systems [1] in order to select the most promising clients to work with and to offer well-tailored services to them. These models are particularly suited for financial institutions due to their ability to assess a numerical score for individual customers, which determines their loan repayment probability [3]. Under the hood, the final decision is made as to whether granting the loan is justified or not. Most often, credit risk is assessed on the basis of historical data, using mainly statistical or machine learning methods [4], among them, e.g., rough sets [5], usually combined with probability theory [6] or fuzzy approaches. The main contributions of this work are:
• creation of decision models using different binary classifiers, feature selection methods, as well as data resampling and feature discretization methods;
• evaluation of the models on a dataset containing real data on cash loans.
It is important to note that the presented research is a significant extension of our earlier works, in which we examined only selected classifiers and feature selection methods [22] as well as a rough set approach [23].
Section 2 discusses the problem of credit risk assessment and reviews the literature on the subject. Section 3 presents a review of the methods for the classification task, feature selection, data resampling, and feature discretization incorporated in the study, as well as measures for assessing classification models. Section 4 contains a description and explanation of the adopted test procedure. The general results of the research carried out are included in Section 5, while more detailed results are included in Appendices A-G. The paper is summarized with conclusions and proposals for further research presented in Section 6.

Literature Review
The subject of interest of authors dealing with financial issues is often credit risk, generally defined as the risk that a business partner will not fully meet its obligations on time or will avoid meeting them altogether [24]. Credit risk can also be understood as the risk of changes in the value of the company's equity as a result of changes in the creditworthiness of its debtors. It is noted that in recent years a lot of attention has been paid to methods and algorithms for assessing financial credit risk. This was due, among others, to the occurrence of global financial crises, but also to the need for a thorough assessment of such threats and for forecasting business failures. It should be added that the above-mentioned factors have an impact on the functioning of the economy and on the financial decisions made by societies [25].
Since financial credit risk indicates a risk related to financing, its assessment is aimed at solving the following two categories of problems: credit rating or scoring, and predicting bankruptcy or forecasting a financial crisis of enterprises. Historically, research on financial credit risk assessment was initiated in the 1930s [26] and continued over the years, with considerable success in the 1960s [27]. Nowadays, apart from taking into account the achievements obtained with the use of traditional statistical methods, research focuses primarily on the use of advanced machine learning methods. This approach, without the need to follow strict assumptions, improves the accuracy of results obtained in the conventional manner. At the same time, it is impossible to indicate a single effective method that is superior to the others. The most recently used intelligence techniques include: artificial neural networks (ANNs), fuzzy set theory (FST), decision trees (DTRs), case-based reasoning (CBR), support vector machines (SVMs), rough set theory (RST), genetic programming (GP), hybrid learning, and ensemble computing [25].
The traditional approach to credit risk assessment focuses on obtaining the optimal linear combination of the input explanatory variables. It is expected that, thanks to these variables, it will be possible to model, analyze and predict the risk of corporate insolvency. Their use is driven by popularity, but attention is paid, for example, to the fact that they do not take into account complex relationships between variables. To assess credit risk with statistical models, linear discriminant analysis (LDA), logistic regression (LR), multivariate discriminant analysis (MDA), quadratic discriminant analysis (QDA), factor analysis (FA), risk index models, and conditional probability models are used, among others [25]. Among the works pointing to the dominance of statistical methods over other approaches are [28,29].
The group of methods that combine the traditional and intelligent approaches are semiparametric methods, which are characterized by greater flexibility of the model structure, interpret the modelled process clearly, and show greater accuracy. More information on this can be found in [30,31]. In the literature on the subject, there are many interesting combinations of parametric, non-parametric and semi-parametric models, for example, the Klein and Spady model [32], or the Logit model combined with the CART model [33]. Another proposal is the integration of a parametric binary logistic regression model (BLRM) with non-parametric models (e.g., SVM, DTR) [34].
Many publications report good results obtained with the use of artificial neural networks [35][36][37]. The feature of networks that makes them useful for the assessment of credit risk is their ability to process non-linear data and approximate most functions. In this way, internal patterns can be found in complex financial data [38]. There are also some limitations to their use, such as the difficulty of explaining the black-box algorithm, time-consuming learning, no guarantee of optimal solutions, and overfitting to the training data.
Another proposal for credit risk assessment are SVMs, which transform non-linear input vectors into a multidimensional feature space. This is possible with the use of kernel functions, which means that the data can then be separated by linear models. The interest in SVMs is due to their good performance and their ability to generalize from a small set of high-value data [39]. Their effectiveness is noticeable when the input data are non-linear and non-stationary, which results in models that support credit decisions well [40].
The classical classification approach is represented by decision trees. In the case of credit risk, their usefulness results from easy interpretation of the obtained results, non-linear estimation, non-parametric form, accuracy, applicability to both continuous and categorical variables, as well as the indication of significant variables. In this field, for example, ID3, C4.5, CART, CHAID, MARS and ADTree [33] can be used.
In the literature on the subject [25], it is possible to note the use of CBR for credit risk. This approach makes it possible to solve problems by recalling similar past experiences. All activities are based on the principle of k-nearest neighbors (kNN), which in the case of classification assigns the identified object to the class to which most of its k nearest neighbors belong. It is suggested to use CBR in the case of small data sets, although it is less precise than other methods used for this type of problem, and improvements to it have been proposed [41].
There have been many interesting publications on credit risk assessment recently. Wang et al. (2020) [42] presented the results of a study on the assessment of credit risk in the online supply chains of commercial banks. The authors used the literature induction method and the non-linear LS-SVM model, and compared the obtained results with those of a logistic regression model. They found that the LS-SVM evaluation model had higher classification accuracy than the logistic regression model. In addition, they found that it has strong generalization capacity, can comprehensively identify credit risk, provides sound, scientific analysis, and is an effective tool supporting the credit risk assessment of small and medium-sized enterprises.
The article by Arora and Kaur (2020) [43], which confirmed the usefulness of modern data mining and machine learning techniques, is also worth mentioning. According to the authors, these methods show precision in predicting credit risk and support taking appropriate decisions. Bolasso (Bootstrap-Lasso) was used in the research. In order to test predictive accuracy, the features obtained by Bolasso were applied to the following classification algorithms: Random Forest (RF), Support Vector Machine (SVM), Naïve Bayes (NB), and kNN. The authors concluded that the Bolasso-enabled Random Forest algorithm (BS-RF) provides the best credit risk assessment results.
Other conclusions were reached by Froelich and Hajek (2019) [44], who proposed in their previous studies to automate credit risk assessment by using systems based on machine learning methods. The authors concluded that the obtained results are difficult to interpret and do not fully take into account the expert knowledge. In the next step, they applied multi-criteria group decision making methods (MCGDM) to simulate the assessment process performed by a team of credit risk experts. According to the authors, standard MCGDM methods do not take into account high uncertainty and are not effective in the case of a significant impact of the assessed credit risk criteria. Therefore, they proposed an MCGDM model that combines fuzzy sets and fuzzy cognitive maps with the traditional TOPSIS approach. In turn, Heidary Dahooie et al. (2021) [45] proposed a combination of Data Envelopment Analysis (DEA) with the dynamic multi-attribute decision-making method (DMADM), considering it an innovative dynamic decision-making method for assessing loan applications. The credit performance criteria were distinguished on the basis of a literature review and expert opinion. In contrast, the criteria weights were calculated using the dynamic approach to the common set of DEA weights. Then, candidates were prioritized using five Gray MADM methods (including SAW-G, VIKOR-G, TOPSIS-G, ARAS-G and COPRAS-G). In the final study, a new method called the correlation coefficient and standard deviation (CCSD) was used to determine the aggregate rank.
In the summary of the review of credit risk assessment methods, it should be added that in recent years, in line with the observations of Bellacos (2018) [46], efforts to improve the traditional approach to credit scoring have not always been successful. Compared to traditional credit models, the data used in the new credit models is much more precise, comprehensive and holistic. These data, combined with modern machine learning (ML) algorithms and artificial intelligence (AI), provide much better calibrated risk assessment models. On the other hand, when comparing ML and AI methods with expert credit risk assessment, it should be noted that modern methods take into account many more decision-making factors than a human can do. The expert has knowledge based on his previous experience, but classification models have much more knowledge. The knowledge of classifiers is also based on previous experiences, in this case written as a set of training cases, but their ability to process information is much greater than that of an expert who has limited perception. Moreover, ML methods, unlike humans, do not get tired, do not get sick, etc. Additionally, in the literature, the advantage of machine learning and data mining methods over expert assessment in complex problems requiring the processing of many data is noticed [47]. On the other hand, there are still areas where the expert outweighs ML and AI methods [48].
The banking sector already has some characteristics such as: advanced computerization (available computing power, modern analytical tools), large amounts of transaction data, financial history of customers, which make it the preferred field for implementing credit risk assessment models based on machine learning and artificial intelligence. The content of the Digital Banking report (2021) [49] presenting current trends and priorities in retail banking shows that most banking institutions know what is needed, and many of them even know how to face the current challenges. The problem, however, is that current banking standards keep organizations from doing this. In the area of credit decisions, this applies to solutions with a very complicated, difficult or even impossible explanation mechanism. An example is neural networks seen as black boxes. What is happening inside such a network cannot be fully explained. Banks in Poland refuse to use such tools, as it is difficult to justify a specific credit decision made on their basis before the Polish Financial Supervision Authority (PFSA). PFSA is sympathetic to traditional scoring and other methods whose results are intuitive, easily interpreted, and easy to argue and explain.

Classification Methods
Machine learning can be used for various tasks, among others, in classification problems, which consist in predicting the membership of an object in a certain class on the basis of well-defined characteristics of this object. Usually, discrimination of a selected object is based on earlier training of the classifier, during which the classification algorithm attempts to "learn" what the real classes of the training objects are and what features determine whether the objects belong to specific classes [47,50]. Methods for the classification task are, e.g., the C4.5 decision tree (C4.5), random forest (RF), decision table (DT), naive Bayes (NB) classifier, logistic regression (LR), and the k-nearest neighbors (kNN) algorithm. The characteristics of the selected classification methods are presented in Table 1.

C4.5 [51,52]. C4.5 builds a decision tree by recursively splitting the training data on the attribute that yields the highest normalized information gain.
Disadvantages:
• Assigns one value to the dependent variable.
• The predicted value can change significantly when the value of one of the features changes only slightly.

RF [53][54][55][56]. RF is a complex classifier consisting of multiple decision trees. Each tree is grown on a randomly drawn subset of the training data, sampled with replacement; a decision tree is then created for the selected subset. Training finishes when the number of trees reaches its maximum or the error on the testing set stops decreasing.
Advantages:
• Possibility of parallel computation for each tree, due to the independence of the trees.
• More stable than a single decision tree model, providing improved classification accuracy.
• Addresses several frequent issues: incomplete data, irrelevant and redundant explanatory variables, and a sophisticated, large dependency structure among features.
Disadvantages:
• Loss of interpretability of the trained classifier model.
• High computational complexity.

DT [57,58]. DT is an accurate method for numeric prediction derived from decision trees: an ordered set of If-Then rules that has the potential to be more compact, and therefore more understandable, than a decision tree. Learning a DT consists in selecting the right attributes to include; usually this is done by measuring the cross-validation performance of the table for different subsets of attributes and choosing the best-performing subset.
Advantages:
• One of the simplest hypothesis spaces possible; usually easy to understand.
• A simpler, less compute-intensive algorithm than decision-tree-based approaches.
• Leave-one-out cross-validation is very cheap for this kind of classifier.
Disadvantages:
• Very rarely achieves above-average classification accuracy.
• There is always the same number of evaluation conditions and actions to be performed in the decision table.
• Does not depict the flow of logic of the solution to a given problem.

NB [59][60][61]. NB is a family of algorithms based on the common assumption that the value of a given feature is independent of the value of any other feature, given the class variable. The purpose of the NB algorithm is to estimate the conditional probability of events.
Disadvantages:
• Assumes that all features are independent, which rarely happens in practice (this limits the applicability of the algorithm).
• The "zero frequency" problem: NB assigns zero probability to a categorical value that appears in the test set but was not present in the training set.

LR [62,63]. LR is a classification method used when each sample is assigned to one of two classes (binary classification). The model estimates the probability that the dependent variable equals 1.
Advantages:
• Takes into account all significant variables and excludes all irrelevant features from the model.
• The resulting model is easy to interpret, because each feature has a single weight assigned.
Disadvantages:
• Does not model interactions between independent variables, and the data must not be collinear.
• Efficiency deteriorates markedly in the presence of outliers, which should therefore be removed before the analysis.

kNN [64,65]. kNN is a nonparametric method. It assumes that similar objects belong to the same class, and the class of a new object is predicted by comparison with a set of prototype objects.
Advantages:
• Can be used for both regression and classification tasks.
Disadvantages:
• Treats all attributes of the feature space as equally important, which increases the risk of irrelevant or redundant features dominating significant ones, leading to inferior classification; to avoid this, an appropriate set of features should be selected [39].
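To make the trade-off between a single decision tree and a random forest concrete, the following sketch trains both on a synthetic dataset with scikit-learn. The data and all parameters here are illustrative, not those of the study.

```python
# Illustrative comparison of a single decision tree vs. a random forest
# (synthetic data; seeds and hyperparameters are arbitrary choices).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print(f"single tree accuracy:   {tree.score(X_te, y_te):.3f}")
print(f"random forest accuracy: {forest.score(X_te, y_te):.3f}")
```

On data like this, the forest typically matches or beats the single tree while remaining harder to interpret, mirroring the advantages and disadvantages listed in Table 1.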

Feature Selection Methods
One of the basic issues in the classification task is the multidimensionality of the object to be assigned to a specific class. This is a serious obstacle that decreases the accuracy of classification algorithms, known as the "curse of dimensionality" [66]. Dimensionality reduction of the feature space lowers computational and data collection costs, which eventually improves predictions [67]. Tools that can be used for this task are called feature selection methods.
The feature selection process focuses on identifying relevant features in the dataset as significant and rejecting redundant features [68]. For this purpose, various algorithms are used to assess the importance of particular features in the classification task. Feature selection methods are divided into three categories: filters, wrappers, and embedded methods [69]. Filters and wrappers are usually composed of four elements (steps): generation of a feature subset, evaluation of the subset, a stopping criterion, and result validation [70]. By describing the individual elements of the feature selection methods, it is possible to point out significant differences between these groups of methods.
Filters are based on the independent evaluation of features using general data characteristics. For example, Pearson correlation coefficients between each input and the selected output can be used. The feature subset is determined by defining a threshold for the minimum value of correlation, or a particular number of features to be selected, before training the machine learning algorithm [71].
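The Pearson-based filter just described can be sketched as follows. The function name, threshold value, and synthetic data below are ours, chosen purely for illustration.

```python
# Minimal sketch of a Pearson-correlation filter: each feature is scored
# independently against the target, and features below an (assumed)
# threshold are discarded before any classifier is trained.
import numpy as np

def pearson_filter(X, y, threshold=0.2):
    """Return indices of features with |corr(feature, y)| >= threshold."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    return np.where(scores >= threshold)[0]

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
relevant = y + 0.5 * rng.normal(size=500)   # strongly correlated with the class
noise = rng.normal(size=(500, 3))           # irrelevant features
X = np.column_stack([relevant, noise])

print(pearson_filter(X, y))   # the correlated feature should survive
```

Note that, as a filter, this runs once before training and never consults a classifier, which is exactly what distinguishes it from the wrappers discussed next.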
Wrappers evaluate individual feature subsets using the machine learning algorithm that will eventually be used in the classification or regression task. In this case, the training algorithm is included in the feature selection procedure; therefore, cross-validation on the set of training cases is usually used to estimate the accuracy of the classifier with a specific feature subset [72].
Embedded methods are similar to wrappers in that they use classification to perform the task of feature selection. The main difference between wrappers and embedded methods is the "embedding" of the selection procedure into the selected classifier. In other words, the dimensions of the training objects subject to classification are reduced while the classifier model is built [73]. For instance, in decision trees, unnecessary features are eliminated by pruning and by defining the minimum number of objects in a node.
Wrappers differ only in the applied machine learning algorithms, so, as in the case of embedded methods, the results obtained with them depend solely on the quality of the machine learning algorithm and its fit to a specific classification task. Wrappers and embedded methods analyze the features of the objects contained in the training set only in terms of obtaining the maximum number of correct classifications, omitting other characteristics of the features. Meanwhile, the general characteristics of the features seem so important that they should affect the selection of the individual features that describe the training and test cases. Therefore, filtration procedures that determine the significance of individual attributes using measures other than the classifier's accuracy seem more interesting. Filter methods use various measures to assess the relevance of each feature, e.g., distance functions and different correlation measures.
A popular filter technique that uses a distance function is ReliefF [74]. On the other hand, the most numerous group of filters are correlation procedures, among which the most promising are: Symmetrical Uncertainty (SU) [75], Correlation-based Feature Selection (CFS) [76], Fast Correlation-Based Filter (FCBF) [77], and Significance Attribute (SA) [78]. The basic characteristics of each method are presented in Table 2.
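Of the correlation filters listed above, Symmetrical Uncertainty is the easiest to sketch: SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a normalized mutual information that ranges from 0 (independent) to 1 (fully dependent). The toy implementation below, with our own variable names and data, illustrates the idea for discrete variables.

```python
# Toy computation of Symmetrical Uncertainty for discrete variables:
# SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)).
import numpy as np

def entropy(values):
    """Shannon entropy in bits; 2-D input counts joint symbols row-wise."""
    values = np.asarray(values)
    if values.ndim == 1:
        _, counts = np.unique(values, return_counts=True)
    else:
        _, counts = np.unique(values, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def symmetrical_uncertainty(x, y):
    joint = entropy(np.column_stack([x, y]))          # H(X, Y)
    mutual_info = entropy(x) + entropy(y) - joint     # I(X; Y)
    denom = entropy(x) + entropy(y)
    return 2 * mutual_info / denom if denom > 0 else 0.0

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
perfect_feature = y.copy()                            # identical to the class
independent_feature = np.array([0, 1, 0, 1, 0, 1, 0, 1])

print(symmetrical_uncertainty(perfect_feature, y))     # 1.0
print(symmetrical_uncertainty(independent_feature, y)) # 0.0
```

A filter would rank all candidate features by this score and keep the top of the ranking.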

Resampling Methods
In binary classification, when the number of classes in the training set is unbalanced, i.e., the class distribution is strongly skewed, conventional classifiers maximizing their accuracy usually build models that tend to classify all objects as belonging to the majority class. This results in low accuracy for the minority class, whose objects are underrepresented in the training set, whereas this class is often of utmost importance [84]. To overcome this issue, resampling methods are commonly applied to the training set. The two most popular, yet very simple, techniques in machine learning are random undersampling and random oversampling [20]. In addition to the resampling methods already mentioned, another interesting approach is the Synthetic Minority Over-sampling Technique (SMOTE) [85]. Table 3 lists the main advantages and disadvantages of each of these approaches.
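Random undersampling, the simpler of the two techniques, can be sketched in a few lines of NumPy. The function name, seed, and the 8:2 toy imbalance are illustrative; the study's actual loan data is not reproduced here.

```python
# Sketch of random undersampling for a binary training set: majority-class
# rows are dropped at random until both classes are equally represented.
import numpy as np

def random_undersample(X, y, seed=0):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority_size = counts.min()
    keep = []
    for cls in classes:
        idx = np.where(y == cls)[0]
        keep.append(rng.choice(idx, size=minority_size, replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)          # 8:2 class imbalance
X_bal, y_bal = random_undersample(X, y)
print(np.bincount(y_bal))                # both classes now equally sized
```

Because the dropped rows are chosen at random, repeated runs with different seeds yield different balanced sets, which is precisely why the study repeats each undersampling scenario three times and averages the results.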

Discretization Methods
Some classification algorithms improve their performance by using feature discretization. Moreover, certain classifiers cannot work without data discretization. Such methods bin continuous features, dividing them into ranges or intervals, which converts numerical data into nominal data. The main issue with feature discretization is the appropriate choice of cutpoints, because continuous data can be discretized in an infinite number of ways. A perfect discretization method should find a relatively small number of cutpoints, dividing the data into relevant bins. Discretization techniques comprise supervised and unsupervised methods. The results of the first group are superior to those of the second, because supervised methods use the class distribution of the objects as additional information. A great number of methods perform discretization based on class entropy, which is a measure of uncertainty over a finite range of classes. Entropy is calculated for different splits and compared to the entropy of the dataset without splits; this is run recursively until the search stop criterion is met [86]. For instance, the heuristic method of the Minimal Description Length Principle (MDLP) can be used here. This technique determines whether or not to accept the current cut-off point candidate, thus stopping the recursion if the specified condition is not met [87]. Entropy-based discretization with the MDLP stop criterion is considered one of the best supervised discretization methods [71]. It measures the information gain of a possible cutpoint by comparing entropy values: for each considered cutpoint, the entropy of the input interval is compared to the weighted sum of the entropies of the two output intervals. There are several different criteria for the MDLP stopping condition, including the Fayyad criterion [88] and the Kononenko criterion [89].
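The core step that these supervised methods share, scanning candidate cutpoints and scoring each by information gain, can be illustrated as follows. The MDLP stopping rule itself is omitted here for brevity; only the gain-maximizing split is shown, and the data and names are ours.

```python
# Toy entropy-based cutpoint search: pick the boundary between sorted
# values that maximizes information gain (MDLP stopping rule omitted).
import numpy as np

def class_entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_cutpoint(x, y):
    order = np.argsort(x)
    x, y = x[order], y[order]
    base = class_entropy(y)                 # entropy before any split
    best_gain, best_cut = -1.0, None
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue                        # no boundary between equal values
        cut = (x[i] + x[i - 1]) / 2
        left, right = y[:i], y[i:]
        weighted = (len(left) * class_entropy(left)
                    + len(right) * class_entropy(right)) / len(y)
        gain = base - weighted
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_cut, best_gain

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_cutpoint(x, y))   # a cut near 6.5 separates the classes perfectly
```

A full MDLP discretizer would apply this search recursively to each resulting interval and accept a cut only if its gain clears the Fayyad or Kononenko threshold.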

Classification Evaluation Metrics
The quality of the classification can be evaluated by, e.g., Receiver Operating Characteristic curve (ROC), Area Under Receiver Operating Characteristic curve (AUROC) and Gini coefficient (GC). Another interesting measure is Precision-Recall Curve (PRC).
ROC is a graphic representation of the predictive model's effectiveness, made by plotting the quantitative characteristics of the binary classifiers derived from the model using a variety of cut-off points. It shows the relationship between the True Positive Rate (TPR) and the False Positive Rate (FPR). TPR is calculated as Equation (1) [85]:

TPR = TP / (TP + FN), (1)

where TP indicates the number of true positives, i.e., the model predicts the positive class correctly, and FN indicates the number of false negatives, i.e., the model predicts the negative class incorrectly. In turn, FPR is defined as Equation (2) [85]:

FPR = FP / (FP + TN), (2)

where FP indicates the number of false positives, i.e., the model predicts the positive class incorrectly, and TN indicates the number of true negatives, i.e., the model predicts the negative class correctly. AUROC measures the classifier's accuracy. It can be interpreted as the probability that the classifier ranks a randomly chosen positive object higher than a randomly chosen negative one. Geometrically, it is the area below the ROC. The higher the value of AUROC, the better the classification results of the model, where AUROC < 0.5 means an invalid classifier, i.e., worse than random, AUROC = 0.5 means a random classifier, and AUROC = 1 means an ideal classifier [85].
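The two rates can be computed directly from a confusion matrix; the sketch below uses scikit-learn with made-up label vectors.

```python
# TPR and FPR from a confusion matrix (labels are illustrative only).
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)   # TPR = TP / (TP + FN)
fpr = fp / (fp + tn)   # FPR = FP / (FP + TN)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

Sweeping the decision threshold of a scoring classifier and recording (FPR, TPR) at each cut-off point traces out the ROC curve itself.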
GC is a measure of the model's quality, interpreted as the degree of ideality of the classifier. GC is calculated from AUROC by Equation (3):

GC = 2 * AUROC - 1. (3)

The higher the value of GC, the better the classifier, where GC = 0 means a random classifier and GC = 1 means an ideal classifier [90].
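Under the common definition GC = 2 * AUROC - 1 (which is consistent with the stated endpoints, GC = 0 for a random classifier and GC = 1 for an ideal one), both measures can be obtained with scikit-learn. The labels and scores below are made up for illustration.

```python
# AUROC and the Gini coefficient derived from it (illustrative scores).
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auroc = roc_auc_score(y_true, scores)
gini = 2 * auroc - 1
print(f"AUROC = {auroc:.2f}, GC = {gini:.2f}")
```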

PRC shows the dependence between precision (Positive Predictive Value, PPV) and recall (TPR) for the classifier, where the former is calculated as Equation (4) [91]:

PPV = TP / (TP + FP). (4)
A big area under the PRC (AUPRC) represents both high precision and high recall, where high precision corresponds to a low false positive rate and high recall corresponds to a low false negative rate. High scores for both precision and recall indicate that the classifier returns accurate results and detects most of the positive objects [91]. PRCs are often zigzag curves with oscillations; consequently, they tend to cross each other much more than ROCs do, which makes comparison difficult for the researcher. It is recommended to use PRCs in addition to ROCs to obtain a complete overview when evaluating and comparing classifier models [92].
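The precision-recall curve and a summary of its area can also be computed with scikit-learn; average precision is one standard estimator of AUPRC. The labels and scores below are illustrative.

```python
# Precision-recall curve and average precision as an AUPRC summary
# (illustrative labels and scores).
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

precision, recall, thresholds = precision_recall_curve(y_true, scores)
auprc = average_precision_score(y_true, scores)
print(f"AUPRC = {auprc:.3f}")
```

Plotting `recall` against `precision` would show the characteristic zigzag shape mentioned above.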

Research Procedure
The dataset on which the experiment was conducted describes anonymized data about loan repayment and borrowers. This set consists of 91,759 records described by 272 conditional attributes (features) and the decision attribute. It was divided in a 70/30% proportion into a training set (64,230 records) and a testing set (27,529 records) [93].
The final research was preceded by a series of preliminary tests, during which the following were selected: • the most promising and diverse filter methods for feature selection; • different classifiers, bearing in mind their core algorithm, way of knowledge representation, and ability to explain the classification of cases.
During the preliminary tests, it was noticed that one of the models with outstanding classification results could be the random forest; therefore, its more detailed examination allowed us to select optimal parameters, i.e., number of iterations = 239 and maximum tree depth = 13 [22].
In this research study, it was assumed that various combinations would be tested, consisting of filter methods (SU, FCBF, CFS, SA, ReliefF), classifier models (C4.5, DT, kNN, LR, NB, RF, optimized random forest (ORF)), resampling methods (without resampling, random undersampling, SMOTE) and feature discretization (without discretization, Fayyad criterion, Kononenko criterion). Taking into account the number of methodological approaches considered in each group, this gives 315 different scenarios and the same number of classification models supporting credit decisions. In practice, this number was smaller, because the number of conducted scenarios was limited by omitting selected resampling and discretization algorithms. The following heuristic was used: if a specific preprocessing method, i.e., resampling or discretization, does not give satisfactory results, then there is no reason to include it in a subsequent scenario. Moreover, due to its high computational complexity, some scenarios did not use ReliefF. It should be noted that, in the case of a large training dataset, this method generally performed time-consuming calculations without yielding acceptable results. Therefore, all scenarios included at least four filter methods (SU, FCBF, CFS, SA) and all seven classifiers. Additionally, it should be clarified that, in the case of random undersampling, each scenario was repeated three times, building three different classification models and eventually averaging the results. This approach was followed in order to minimize the impact of the random selection of training cases on the classification results. The research study was divided into four general scenarios, each applying a different combination of the above methods.
Furthermore, at the beginning, classification was performed without using filter methods, i.e., scenario 0. The results of this study served as the reference for subsequent scenarios in which filter methods were used. With this approach, the research scenarios allowed us to determine:
• the effect of feature selection on classification;
• the effect of data resampling on classification with feature selection;
• the effect of feature discretization on classification with feature selection;
• the effect of data resampling combined with feature discretization on classification with feature selection.
Figure 1 depicts the research study that was carried out. As Figure 1 shows, the processing techniques, including feature discretization and feature selection, were applied to the training set and their results were then used on the testing set. This step was necessary to ensure full consistency between the training set and the testing set. For instance, binning was performed on the training data and then the same bins were applied to the testing data. Likewise, the selection of relevant features was done on the basis of the training set, and the redundant features were removed from the testing set. The only processing method applied to the training cases but not to the testing cases was data resampling.
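The fit-on-training, apply-to-testing discipline described above is what a scikit-learn Pipeline enforces automatically. The sketch below uses stand-in components (KBinsDiscretizer for discretization, SelectKBest with mutual information for a filter, RandomForestClassifier for the model); all parameters and the synthetic data are illustrative, not the study's actual configuration.

```python
# Sketch of the discretize -> select features -> classify pipeline, where
# every preprocessing step is fitted on the training set only and then
# reused unchanged on the testing set (components are stand-ins).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

X, y = make_classification(n_samples=1000, n_features=30, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

pipe = Pipeline([
    ("discretize", KBinsDiscretizer(n_bins=5, encode="ordinal",
                                    strategy="quantile")),
    ("select", SelectKBest(mutual_info_classif, k=10)),
    ("classify", RandomForestClassifier(n_estimators=100, random_state=0)),
])
# fit() learns the bins and the feature subset from the training set only;
# scoring the test set reuses those exact bins and features.
pipe.fit(X_tr, y_tr)
print(f"test accuracy: {pipe.score(X_te, y_te):.3f}")
```

Resampling is the one step deliberately absent from such a pipeline at prediction time: as in Figure 1, it is applied to the training cases only.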

Results and Discussion
The full results of the conducted research study are presented in Appendices A-G, while this section shows only the best results from each considered scenario. Table 4 presents the four top classification results from each scenario. From Table 4 it can be seen that the best classification results are obtained by the RF model (with optional optimization), and that the feature selection method most often yielding top classification results is CFS. It should also be noted that an outstanding overall result was achieved by RF on the full dataset of 272 features. Nevertheless, dimensionality reduction of such data is necessary, both because the classification would otherwise be difficult to explain and because a great amount of information would have to be collected in order to classify a new case. Assuming feature selection is performed without resampling or discretization, the best classification results were obtained by ORF. However, if both feature selection and classification accuracy are important, then the RF model should be supported by data resampling, which balances the class distribution. Moreover, for RF, as well as for LR and DT, undersampling provides better classification results than discretization (cf. Appendices C, E and F); the opposite holds for NB, kNN and C4.5. Furthermore, RF and LR, both with undersampling, yield better results than with the combination of undersampling and discretization; on the other hand, this combination improves the quality of classification for NB. Additionally, in order to obtain acceptable results using LR or NB, it is necessary to employ the methods mentioned above, while for the RF model they can be omitted entirely. Moreover, the randomness of the applied undersampling algorithm also plays a vital role: it has a serious impact on the obtained feature sets and, thus, on the classification results. Nevertheless, the conclusions drawn here hold for each research case performed during the study.
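Random undersampling, used above to balance the class distribution before training, can be sketched in a few lines. This is an illustrative implementation on synthetic imbalanced data (roughly 10% positives, a hypothetical stand-in for loan defaults), not the paper's code; the three independent draws mirror the study's repetition-and-averaging procedure:

```python
import numpy as np

def undersample(X, y, rng):
    """Randomly drop majority-class cases so each class keeps as many
    cases as the smallest class (plain random undersampling)."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.1).astype(int)   # imbalanced: ~10% positives

# Three independent draws, as in the study, so that results can be averaged
# to dampen the randomness of the selected training cases.
draws = [undersample(X, y, np.random.default_rng(seed)) for seed in (1, 2, 3)]
```

Each draw keeps all minority cases and a different random subset of the majority class, which is exactly why the randomness of the draw influences the selected feature sets downstream.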
It should be noted that, in order to maximize classification accuracy, it is recommended to carry out several draws and select the set of training cases that yields the best results on the testing cases. On the other hand, if selecting the smallest possible feature set is of great importance, then FCBF should be used. Table 5 presents the four top classification results from each scenario in which feature sets were obtained by this method. From Table 5 it can be seen that feature sets consisting of five or six features do not provide acceptable classification results. Bearing in mind that both the minimum number of features and the maximum accuracy are essential, the results of RF in scenario 2 and NB in scenario 4 are worth noting. DT also achieves relatively good classification results compared to other models, mainly due to its built-in feature selection, i.e., DT automatically reduces the feature space. If the input feature set is relatively large, this can cause a deterioration of classification compared to other models; with a low number of features, however, no additional reduction is performed, so there is no negative impact on the final results.
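FCBF ranks features by symmetrical uncertainty, SU(X, Y) = 2·I(X; Y)/(H(X) + H(Y)), a normalized mutual information lying in [0, 1]. A minimal sketch of this measure on synthetic discrete data (illustrative only; the variables below are not from the paper's dataset) is:

```python
import numpy as np

def entropy(x):
    """Empirical Shannon entropy (in bits) of a discrete variable."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def joint_entropy(x, y):
    """Empirical joint entropy H(X, Y) of two discrete variables."""
    pairs = np.stack([x, y], axis=1)
    _, counts = np.unique(pairs, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mutual_info = max(hx + hy - joint_entropy(x, y), 0.0)  # guard FP negatives
    return 2 * mutual_info / (hx + hy)

rng = np.random.default_rng(1)
label = rng.integers(0, 2, size=500)
# A feature agreeing with the label 80% of the time vs. pure noise:
relevant = np.where(rng.random(500) < 0.8, label, 1 - label)
irrelevant = rng.integers(0, 2, size=500)
```

FCBF keeps features whose SU with the class exceeds a threshold and then discards features that are more strongly correlated (in SU terms) with an already-kept feature than with the class, which is why it tends to produce the smallest feature sets.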

Conclusions
The article deals with the problem of credit decisions based on machine learning methods. In particular, the effects of applying feature selection, data resampling, and feature discretization together with various classifiers were verified in the processing of a real credit dataset. Summarizing the results of the conducted research study, it is possible to indicate the following premises related to the use of the individual methods, i.e., feature selection, binary classification, data resampling, and feature discretization:
• if the classification result is the priority, then RF returns good results on the full set of data;
• if both feature selection and classification accuracy are important, then acceptable results are obtained by undersampling with CFS and RF;
• if both a minimum number of features and classification accuracy are important, then fair results are achieved by the following approaches: (1) CFS with RF, (2) undersampling with FCBF and RF, (3) discretization with CFS and LR or NB, (4) undersampling with discretization, FCBF and NB.
Of course, the above heuristics do not exhaust the topic of choosing an appropriate approach to the credit scoring problem. In some business cases, apart from the classification result and the size of the feature set, the ability to explain the classification may also be important, which gives certain models an advantage. Moreover, when restricting attention to classification accuracy alone, it is not possible to clearly determine whether it is better to use AUROC, AUPRC or GC. Essentially, the selection of a classification model consists in seeking a trade-off between the inherent features of the classifiers. Therefore, further research is targeted at the selection of a specific classifier-based approach to credit decision support for stakeholders (e.g., banks), depending on their individual needs (i.e., actual requirements and preferences). The assessment of the various approaches is here a multi-criteria decision problem; thus, multi-criteria decision analysis [94] will be involved.