To carry out this review of the literature researching the application of machine learning in bank risk management, two sets of keywords were used to search for related papers. The search was conducted using the scholar.google.com, SSRN and ProQuest databases. It was largely focused on papers published after 2007, to capture developments since the global financial crisis; however, papers prior to that period were also included if they were referenced in other recent papers.
The search and review were limited to conference papers, journal articles and selected theses (postgraduate or doctoral). The review has not considered articles, white papers, vendor papers or web articles that merely mention machine learning without providing details on how it is applied or without referencing any specific algorithm, though many such items did come up in the search. In particular, a large number of articles, web and magazine pieces, and other publications present machine learning as a solution or as a generic recommendation without providing further details on how a given problem can be addressed.
The review has looked only at papers that have analysed the topic with some depth, namely, by referring to specific algorithms or by providing a design or model for how ML can be implemented. Articles, papers or conference proceedings that make only a cursory or general reference to the application of ML in the risk management space have not been considered for this research. It is noted that there are many references available in which the authors or speakers propose that ML or AI can be applied in the management of risk; however, many of them stop short of specifying which algorithms, or fail to provide examples of how ML/AI has been applied in a test or industry setup.
The methodological framework for this research was determined by analysing the various problem areas related to machine learning and risk management in banks. The articles were classified to understand: (i) the risk area they focused on; (ii) the risk management tool or risk management framework component they targeted; or (iii) the algorithms that were applied, studied or proposed. The survey also sought out papers that focused more on risk assessment and measurement.
Risk areas such as cybersecurity and fraud risk have been dealt with widely; however, the focus of this review has been only on cases that specifically relate to banking risk management use cases. Papers focused on operational matters, such as credit risk management solutions that address the operational process of credit review and approval, or tools that support traders and trading risk managers in the order and trade management process, have not been considered. Additionally, operational risk management solutions that fit within the operational process to mitigate operational events/incidents (e.g., robotic process automation, STP, anomaly detection) have not been researched.
3.1. Credit Risk
The assessment of credit risk remains an important and challenging research topic in the field of finance, with initial efforts dating back to the last century. On the back of the global financial crisis events and the consequent increased regulatory focus, the credit risk assessment process has seen an increased interest within the academic and business community. The general approach to credit risk assessment has been to apply a classification technique on past customer data, including on delinquent customers, to analyse and evaluate the relation between the characteristics of a customer and their potential failure. This could be used to determine classifiers that can be applied in the categorisation of new applicants or existing customers as good or bad (
Wang et al. 2005).
Credit risk evaluation occupies an important place within risk management. Techniques such as logistic regression and discriminant analysis are traditionally used in credit scoring to determine the likelihood of default. Support vector machines have been successful in classifying credit card customers who default. They were also found to be competitive in discovering the features that are most significant in determining the risk of default when tested against the traditional techniques (
Bellotti and Crook 2009). Credit risk modelling for the calculation of credit loss exposure involves the estimation of the Probability of Default (PD), the Exposure at Default (EAD) and the Loss Given Default (LGD), as emphasised by the Basel II accord. The predominant methods for developing PD models are classification and survival analysis, with the latter estimating both whether the customer would default and when the default could occur. Classifier algorithms were found to perform significantly more accurately than standard logistic regression in credit scoring. Advanced methods such as artificial neural networks were also found to perform extremely well on credit scoring data sets, performing better than the extreme learning machine (
Lessmann et al. 2015).
Through the Basel accord requirements, the need to allocate capital in an efficient and profitable manner has led FIs to build credit scoring models to assess the default risk of their customers. Again, SVM has been shown to yield significantly better results in credit scoring (
Van Gestel et al. 2003). An accurate prediction of the estimated probability of default delivers more value to risk management than a binary classification of clients as either credible or non-credible. A number of techniques are used in credit scoring, such as discriminant analysis, logistic regression, the Bayes classifier, nearest neighbour, artificial neural networks and classification trees. Artificial neural networks have been shown to perform classifications more accurately than the other five methods (
Yeh and Lien 2009).
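As an illustration of this kind of comparison, the sketch below fits a logistic regression and a small neural network to synthetic applicant data using scikit-learn. The data, class proportions and model settings are assumptions for illustration only, not those of the cited studies.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for applicant attributes; ~20% "bad" (default) class
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X_tr, y_tr)

acc_logit = accuracy_score(y_te, logit.predict(X_te))
acc_ann = accuracy_score(y_te, ann.predict(X_te))

# Both models also yield an estimated probability of default per applicant,
# rather than only a binary good/bad label
pd_estimates = logit.predict_proba(X_te)[:, 1]
```

Note that both fitted models expose `predict_proba`, so the same setup supports either a good/bad classification or a PD-style estimate, a distinction several of the reviewed papers emphasise.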
Methods and models are being constantly developed to address a significant issue at banks, namely, the correct classification of customers and the estimation of credit risk. The various approaches applied in these methods seek to increase the accuracy of creditworthiness predictions that could lead to a bigger and profitable loan portfolio. Neural networks have proven to be of significant value in the credit risk decision process, and their application in company distress predictions was reported to be beneficial in credit risk evaluation (
Wójcicka 2017).
While credit risk is the most researched and evaluated risk area for the application of machine learning, this is not a new phenomenon. As far back as 1994, Altman and colleagues conducted an analysis comparing traditional statistical methods of distress and bankruptcy prediction with an alternative neural network algorithm, and concluded that a combined approach of the two improved accuracy significantly (
Aziz and Dowling 2018).
Hand and Henley (
1997) argued that “credit scoring is the term used to describe formal statistical methods which are used for classifying applicants for credit into ‘good’ and ‘bad’ risk classes”. Credit scoring models are multivariate statistical models applied to economic and financial indicators to predict the default risk of individuals or companies. The indicators are assigned weights reflecting their relative importance in prediction and are combined to arrive at an index of creditworthiness. This numerical score serves as a measure of the borrower’s probability of default. The support vector machine technique was concluded to be the most widely applied in credit risk evaluations. Hybrid SVM models have been proposed to improve performance by adding methods for reducing the feature subset. These, however, only classify, and do not provide an estimation of the probability of default (
Keramati and Yousefi 2011).
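The weighted-indicator idea behind such scorecards can be sketched in a few lines. The indicator names, weights, intercept and the logistic link used here are purely illustrative assumptions, not figures from any cited model.

```python
import math

# Hypothetical indicator weights (illustrative only): negative weights
# lower the creditworthiness index, positive weights raise it
WEIGHTS = {"debt_to_income": -2.5, "years_employed": 0.4, "prior_defaults": -1.8}
INTERCEPT = 1.0

def credit_index(indicators):
    """Weighted sum of indicator values -> index of creditworthiness."""
    return INTERCEPT + sum(WEIGHTS[k] * v for k, v in indicators.items())

def probability_of_default(indicators):
    """Logistic link: a higher index maps to a lower probability of default."""
    return 1.0 / (1.0 + math.exp(credit_index(indicators)))

applicant = {"debt_to_income": 0.6, "years_employed": 5, "prior_defaults": 0}
pd_value = probability_of_default(applicant)  # roughly 0.18 for these inputs
```

This minimal scorecard shows why the score can be read as a measure of default probability: the weighted index feeds a monotone transform into [0, 1].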
Zhou and Wang (
2012) propose allocating weights to decision trees for better prediction, putting forward an improved random forest algorithm. During aggregation, the algorithm allocates to the decision trees in the forest weights that are calculated from out-of-bag errors observed in training. They address the binary classification problem, and their experiment shows that the proposed algorithm beats the original random forest and other popular classification algorithms (SVM, KMM, C4.5) in terms of balanced and overall accuracy metrics.
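The out-of-bag weighting idea can be illustrated conceptually as follows. This is a simplified sketch on synthetic data, not Zhou and Wang's exact algorithm: each tree is scored on the observations left out of its bootstrap sample, and that score weights its vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
n = len(X)

trees, weights = [], []
for _ in range(25):
    boot = rng.integers(0, n, size=n)          # bootstrap sample indices
    oob = np.setdiff1d(np.arange(n), boot)     # observations left out of the bag
    tree = DecisionTreeClassifier(random_state=0).fit(X[boot], y[boot])
    oob_acc = (tree.predict(X[oob]) == y[oob]).mean()
    trees.append(tree)
    weights.append(oob_acc)                    # the tree's vote weight

def weighted_predict(x):
    """Aggregate OOB-accuracy-weighted votes across the forest."""
    votes = np.zeros(2)
    for tree, w in zip(trees, weights):
        votes[int(tree.predict(x.reshape(1, -1))[0])] += w
    return int(votes.argmax())

preds = np.array([weighted_predict(x) for x in X])
train_acc = (preds == y).mean()
```

The design point is that trees which generalise poorly (low OOB accuracy) contribute less to the aggregated vote than in a plain random forest, where every tree counts equally.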
Some papers focus on a comparison against traditional statistical methods to highlight the efficiency in applying machine learning algorithms.
Galindo and Tamayo (
2000) conduct a comparative analysis of statistical and machine learning classification techniques on institutions’ credit portfolios in search of accurate predictions of individual risk. They built more than 9000 models as part of the study and ranked the performance of the various algorithms, showing that CART decision tree models provided the best estimates for default, with neural networks coming second.
Hamori et al. (
2018) studied and compared the prediction accuracy and classification ability of bagging, random forest, boosting with neural network methods in analysing default payment data. They found boosting to be superior among the studied machine learning methods.
A number of researchers have also evaluated the application of hybrid techniques and ensemble methods to study credit scoring (
Bastos 2014;
Hamori et al. 2018;
Raei et al. 2016). In a hybrid system, one technique is employed for the final prediction after the use of several heterogeneous techniques in the analysis (
Chen et al. 2016). In dealing with credit scoring problems, ensemble learning, using regularised logistic regression, can be applied. A method of applying clustering and bagging algorithms to balance and diversify the data, followed by lasso-logistic regression ensemble to evaluate credit risks, was found to outperform many popular credit-scoring models (
Wang et al. 2015).
Khandani et al. (
2010), to improve classification rates of credit card holder delinquencies and defaults, constructed a nonlinear, non-parametric forecast model. The consumer credit risk model was able to identify subtle non-linear relationships in massive datasets. These relationships were reportedly difficult to detect using standard consumer credit-default models such as logit, discriminant analysis or credit scores. This allows for credit line risk management, the forecasting of aggregate consumer credit delinquencies and the forecasting of the consumer credit cycle.
Yu et al. (
2016) propose a novel multistage deep belief network-based extreme learning machine as a promising tool for credit risk assessment. The framework of multistage ensemble learning paradigms, working in three stages, is shown to outperform typical single classification techniques and similar multistage ensemble learning paradigms, with high prediction accuracy.
“Support Vector Machine” (SVM) is a supervised machine-learning algorithm; while it is widely used in classification problems, it is relatively new to credit scoring. In this algorithm, each data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Classification is performed by finding the hyperplane that forms the frontier segregating the two classes (
Ray 2015). The SVM has been applied as is or in some varied form to design a credit risk evaluation and credit scoring models (
Bellotti and Crook 2009;
Cao et al. 2013;
Van Gestel et al. 2003;
Huang et al. 2007;
Lai et al. 2006).
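The geometric intuition described above can be demonstrated with a minimal example on toy two-dimensional data; scikit-learn's `SVC` is used here as an assumed implementation, and the "customer" points are invented.

```python
import numpy as np
from sklearn.svm import SVC

# Two toy "customer" groups as points in 2-dimensional feature space
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],    # class 0 ("good")
              [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])   # class 1 ("bad")
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)

# The fitted hyperplane w.x + b = 0 is the frontier between the two classes
w, b = clf.coef_[0], clf.intercept_[0]
label = clf.predict([[2.0, 2.0]])[0]   # lands on the class-0 side
```

New applicants are classified simply by which side of the learned hyperplane their feature vector falls on.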
Harris (
2013) compares SVM-based credit scoring models built using broad (<90 days past due) and narrow (>90 days past due) definitions of default, the latter being the more traditional approach. It was found that models built using the broader definition were more accurate, allowing for improvements in prediction accuracy.
Wang et al. (
2005) propose a new “fuzzy support vector machine”. The algorithm seeks to discriminate good creditors from bad ones through more generalisation while preserving the ability of the fuzzy SVM to be insensitive to outliers. They present a bilateral weighted fuzzy SVM with results that show promising application in credit analysis.
Huang et al. (
2007) constructed a credit scoring model to evaluate an applicant’s credit score from input features based on a hybrid SVM constructed using three strategies.
Yeh and Lien (
2009), in their paper, have acknowledged that forecasting the probability of default (PD) is a challenge facing practitioners and researchers, and it needs more study. A few papers have the objective of going beyond just classification, so as to predict the probability of default (PD) or recovery rates (RR) (
Bastos 2014;
Raei et al. 2016). ThelLeast squares vector machine technique, when incorporated into a two-state model, followed by a regression step, for predicting recovery rates, was also reported to show improvement compared to traditional statistical regression models (
Yao et al. 2017). Support vector regression techniques could also be applied to increase the predictive ability of loss given default models for corporate bonds, outperforming statistical models (
Yao et al. 2015). These papers stop short of providing a quantitative value, as they seem to approach the problem more from a classification perspective.
Raei et al. (
2016) research a new hybrid model for estimating the probability of default of corporate customers in a commercial bank. They present the hybrid model as one that can address the ‘black box’ criticism of neural networks, namely that the obtained model is not understandable in terms of its parameters. The research takes a two-stage approach, combining the comprehensibility of logit models with the predictive power of non-linear techniques such as neural networks. The overall accuracy of this hybrid model was shown to outperform both of the base models. Low default portfolios (LDPs) are those that are considered very low risk. LDPs present a class imbalance problem, as defaulters occur in far smaller numbers than good payers. Gradient boosting and random forest classifiers were found to perform well in dealing with samples that exhibited a class imbalance problem (
Brown and Mues 2012).
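The low-default-portfolio setting can be reproduced on synthetic data. The sketch below, an assumption-laden illustration rather than the cited experiment, fits a class-weighted random forest and a gradient boosting classifier to an imbalanced sample and reports recall on the minority defaulter class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Strongly imbalanced sample: ~5% defaulters, as in a low default portfolio
X, y = make_classification(n_samples=3000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                            random_state=0).fit(X_tr, y_tr)
gb = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Recall on the minority (defaulter) class is the figure of interest here,
# since overall accuracy is dominated by the majority class
rf_recall = recall_score(y_te, rf.predict(X_te))
gb_recall = recall_score(y_te, gb.predict(X_te))
```

Stratified splitting and minority-class recall, rather than plain accuracy, are the standard precautions when defaulters are this rare.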
Banks seek to develop efficient models that can assess the likelihood of counterparty defaults.
Barboza et al. (
2017) test machine learning models to predict bankruptcy one year prior to the event, comparing their performance with results from traditional methods. They report significant predictive accuracy being achieved, and suggest that ML techniques can readily be applied to obtain substantial classification accuracy relative to traditional mechanisms. Despite concerns around the explanatory ability of the model, given the complexity of bankruptcy models, machine learning could prove to be an important aid.
Yang et al. (
2011) also explore a novel method to predict bankruptcy, proposing a combined method of partial least squares (PLS)-based feature selection with SVM for information fusion. A bank could benefit from the model’s ability to select the financial indicators that are most relevant to the prediction process, and also from its high level of prediction accuracy.
There are also a few papers that research the area of stress testing in credit risk management (
Islam et al. 2013). Stress testing requires modelling the link between macro-economic developments and banking variables to determine the impact of extreme scenarios on a bank. More frequently, bottom-up approaches are used, in which predictions about future profits/losses are made at mostly disaggregated portfolio levels, making the exercise data intensive and the exact drivers of losses difficult to identify. Predictions on an aggregated portfolio using a top-down method can complement this process. A supervised learning algorithm that does not need a pre-specified model is the Least Absolute Shrinkage and Selection Operator (Lasso) method. A more involved version of the Lasso is the adaptive Lasso, which possesses attractive convergence properties. The adaptive Lasso can be used in the absence of theoretical models, as in the case of top-down stress testing, to discover a parsimonious top-down model from thousands of possible specifications. It was shown to give sparse, approximately unbiased solutions by searching for the variables that best describe the behaviour of credit loss rates, resulting in a parsimonious description of the relation between the macro-economy and credit loss rates. A key issue is the need for substantial amounts of data to train a model (
Blom 2015).
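The adaptive Lasso's variable-selection behaviour can be sketched as follows. The synthetic macro series, penalty level and ridge-based initial weights are illustrative assumptions, not details from the cited work; the point is that only a sparse subset of candidate variables survives.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 200, 30                        # modest sample, many candidate series
X = rng.normal(size=(n, p))           # candidate macro-economic variables
beta_true = np.zeros(p)
beta_true[:3] = [1.5, -2.0, 1.0]      # only three truly drive losses
y = X @ beta_true + rng.normal(scale=0.5, size=n)   # "credit loss rate"

# Adaptive Lasso: penalise each variable by the inverse of an initial
# ridge estimate, so weakly supported variables are shrunk to zero faster
init = Ridge(alpha=1.0).fit(X, y).coef_
w = 1.0 / (np.abs(init) + 1e-6)
lasso = Lasso(alpha=0.05).fit(X / w, y)
beta_hat = lasso.coef_ / w            # undo the rescaling

selected = np.flatnonzero(np.abs(beta_hat) > 1e-3)  # the parsimonious model
```

The rescaling trick (dividing columns by the weights, then dividing the coefficients back) is a common way to obtain adaptive-Lasso estimates from a standard Lasso solver.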
Model selection and forecasting have become a challenge as stress scenarios become more comprehensive, encompassing an increasing number of primary variables. Machine learning techniques for identifying patterns and relationships in data can facilitate model selection and forecasting, yet these techniques do not seem to be widely applied in stress testing. When there are a large number of potential covariates and the number of observations is small, Lasso regressions are found to be well suited to building forecasting models. They are likely to outperform traditional statistical models in forecasting the performance indicators required in applied stress testing. An advantage of Lasso-type estimators is that they can handle complications arising from the high-dimensional nature of stress tests (
Chan-Lau 2017). Multivariate Adaptive Regression Splines (MARS), a machine learning technique, can be viewed as a generalisation of stepwise linear regression and of the classification and regression tree (CART) method. In stress testing, statistical regression models such as Vector Autoregression (VAR) are a common modelling approach, but are known to be unable to explain the phenomenon of fat-tailed distributions. An empirical test of these models found that the MARS model exhibited greater accuracy in model testing and superior out-of-sample performance, producing more reasonable forecasts (
Jacobs 2018). Probabilistic graphs may be used for the modelling and assessment of credit concentration risk, with a tree-augmented Bayesian network providing a better understanding of the risk. This was also found to be suitable for stress testing analyses, with the ability to provide estimates of the risk of losses consequent to changes in a borrower’s financial condition (
Pavlenko and Chernyak 2009). The EMA workbench is a software toolbox developed by a team at the TBM Faculty (Policy Analysis Section) of Technische Universiteit Delft. Particular machine learning algorithms and advanced visualisation tools are used to perform multiple experiments and analyse the results, providing the ability to explore possible uncertainties and identify causes based on the inputs.
Neural networks, support vector machines and random forests appear to be the most researched algorithms in the credit risk management area.
3.4. Operational Risk
Machine learning is also applied in operational areas that enable the mitigation of risk, i.e., detection and/or prevention of risks. In the area of operational risk, aside from cyber security cases, machine learning is predominantly focused on problems related to fraud detection and suspicious transactions detection.
Khrestina et al. (
2017), in their paper, propose a prototype for the generation of a report that allows for the detection of suspicious transactions. The prototype uses a logistic regression algorithm. Notably, they also include a survey of six software solutions currently implemented at various banks for the automation of suspicious transaction detection and monitoring processes. While the authors make reference to algorithms, it is unclear whether these products apply machine learning techniques, and if so, with which algorithms. No further research was done on these products as this was not in the scope of the paper.
One such area where an intelligent system based on machine learning is known to add value is in the defence against spammers, whose attack techniques constantly evolve. Losses from spam potentially include lost productivity, disrupted communications, malware attacks and theft of data, including financial loss. Proofpoint’s MLX technology uses advanced machine learning techniques to provide comprehensive detection that guards against the threat of spam. Millions of messages can be analysed by the technology, which also automatically refines the detection algorithm to identify newer threats (
Proofpoint 2010). While it was not in scope for this research, being more of an operational control to manage risk, it has been highlighted as a case of how machine learning is used in managing cybersecurity risks.
In money laundering, criminals route money through various transactions, layering them with legitimate transactions to conceal the true source of the funds. The funds typically originate from criminal or illegal activities, and can be further used in other illegal activities including the financing of terrorist activities. There has been extensive research on detecting financial crimes using traditional statistical methods, and more recently, using machine-learning techniques. Clustering algorithms identify customers with similar behavioural patterns and can help to find groups of people working together to commit money laundering (
Sudjianto et al. 2010). A major challenge for banks, given the large volume of transactions per day and the non-uniform nature of many, is to be able to sort through all the transactions and identify those that are of suspicious nature. Financial institutions utilise anti-money laundering systems to filter and classify transactions based on degrees of suspiciousness. Structured processes and intelligent systems are required to enable the detection of these money laundering transactions (
Kannan and Somasundaram 2017).
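The clustering approach mentioned above can be sketched with k-means on invented behavioural features; the feature set, the data and the cluster count are assumptions for illustration only, standing in for the customer transaction profiles a bank would use.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Per-customer behaviour: [monthly txn count, avg amount, cash ratio]
normal = rng.normal([20, 500, 0.1], [5, 100, 0.05], size=(300, 3))
unusual = rng.normal([80, 9500, 0.9], [10, 300, 0.05], size=(15, 3))
X = StandardScaler().fit_transform(np.vstack([normal, unusual]))

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# The small cluster groups customers sharing the atypical pattern; an
# analyst could review these accounts together
minority = np.bincount(km.labels_).argmin()
flagged = np.flatnonzero(km.labels_ == minority)
```

Because the method is unsupervised, no labelled money-laundering cases are needed; the output is a shortlist of behaviourally similar accounts for human review rather than a verdict.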
Money laundering is another area that poses a significant challenge to financial institutions, given the high volumes and complexity of transactions, coupled with the dynamic and fast-evolving nature of financial crimes and the need to detect them in real-time data sets. In the area of financial crime detection, there has been a significant amount of research into the application of statistical learning and data mining for developing classification models to flag suspicious transactions. In one study, a C5.0 algorithm was applied to predict risk levels based on different customer potential risk factors and to create the set of rules for cluster allocation, with the key factors used to characterise transaction profiles. The model reportedly provided a 99.6% correct classification rate on the test data, and the number of alerted cases was reported to have fallen from close to 30% of transactions to less than 1% (
Villalobos and Silva 2017).
Credit card fraud is increasing significantly each year, costing consumers and the industry billions of dollars, and fraudsters are constantly finding newer techniques to perpetrate the crime. In order to manage the increasing fraud risk and minimise losses, banks have fraud detection systems in place, oriented towards increasing the detection rate while minimising the false positive rate. In supervised detection methods, models are estimated based on samples of fraudulent and legitimate transactions, while in unsupervised detection methods, outliers or unusual transactions are identified as potential cases of fraud. Both seek to predict the probability of fraud in a given transaction. Some reported challenges in credit card fraud detection are the non-availability of real data sets, unbalanced data sets, the size of the data sets and the dynamic behaviour of fraudsters. Bayesian algorithms, k-nearest neighbour, support vector machines (SVM) and the bagging ensemble classifier based on decision trees have been variously used in fraud detection systems. A comparative evaluation showed that the bagging ensemble classifier based on decision tree algorithms works well, as it is independent of attribute values and is also able to handle class imbalance (
Zareapoor and Shamsolmoali 2015). False alarms, namely transactions labelled as fraudulent that are in fact legitimate, are significant, causing concern for customers and delaying the detection of actual fraudulent transactions. Large Canadian banks rely heavily on NN scores determined by neural network algorithms, ranging from 1 to 999, with 1 indicating the lowest chance of a fraudulent transaction. Reportedly, only 20% of transactions with a NN score greater than or equal to 990 are fraudulent, causing fraud analysts to spend time inefficiently investigating legitimate transactions. A meta-classifier (a multiple-algorithm learning technique) applied on top of the neural network output was shown to provide quantifiable savings, with a larger percentage of fraudulent transactions being caught (
Pun and Lawryshyn 2012).
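A bagging ensemble built on decision trees, of the kind the comparison above favoured, can be sketched as follows on synthetic transactions. The data and settings are assumptions; scikit-learn's `BaggingClassifier` uses a decision tree as its default base estimator, which matches the technique named in the cited evaluation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Imbalanced synthetic "transactions": roughly 2% fraudulent
X, y = make_classification(n_samples=5000, n_features=12,
                           weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# BaggingClassifier's default base estimator is a decision tree, i.e., a
# bagging ensemble classifier based on decision trees
bag = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

fraud_prob = bag.predict_proba(X_te)[:, 1]   # per-transaction fraud probability
accuracy = bag.score(X_te, y_te)
```

The per-transaction probability, rather than a hard label, is what allows a detection system to rank alerts and trade detection rate against the false positive rate.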
Also, in the area of operational risk, there are a few papers on fraud risk detection in credit cards and online banking. These concern credit card fraud detection in domains not specifically related to bank risk management or the banking industry. The algorithms they refer to are SVM, KNN, the naïve Bayes classifier and the bagging ensemble classifier based on decision trees (
Dal Pozzolo 2015;
Pun and Lawryshyn 2012;
Vaidya and Mohod 2014).