Machine Learning Applied to Banking Supervision: A Literature Review

Machine learning (ML) has revolutionised data analysis over the past decade. Like many other industries that rely heavily on accurate information, banking supervision stands to benefit greatly from this technological advance. The objective of this review is to provide a comprehensive walk-through of how the most common ML techniques have been applied to risk assessment in banking, with a focus on the supervisory perspective. We searched the Google Scholar, Springer Link, and ScienceDirect databases for articles including the search terms “machine learning” and (“bank” or “banking” or “supervision”). No language, date, or journal filter was applied. Papers were then screened and selected according to their relevance. The final article base consisted of 41 papers and 2 book chapters, 53% of which were published in top-quartile journals in their field. Results are presented in a timeline according to publication date and categorised by time slots. The highlighted topics are credit risk assessment and stress testing, as well as other risk perspectives, with some references to surveys of ML applications. The most relevant ML techniques encompass k-nearest neighbours (KNN), support vector machines (SVM), tree-based models, ensembles, boosting techniques, and artificial neural networks (ANN). Recent trends include developing early warning systems (EWS) for bankruptcy and refining stress testing. One limitation of this study is the paucity of contributions using supervisory data, which justifies the need for additional investigation in this field. However, there is increasing evidence that ML techniques can enhance data analysis and decision making in the banking industry.


Introduction
Decision support systems had their genesis in the 1960s (Burstein et al. 2008). Perhaps because of the exposure risk and magnitude of revenues generated, the financial sector has been a particularly avid driver for developing these technologies.
Predicting how financial institutions will perform and whether they will create value is key for every stakeholder in this field: financial institutions, central banks, consultancy companies, and academia. Consequently, the use of new technology and methods to support risk assessment tasks (fintech) is a rising trend in this sector (Milian et al. 2019). In recent years, machine learning (ML) methods and, to some extent, deep learning (DL) have been used to assess credit risk and, more broadly, to predict bank failures. Traditional statistical methods are still commonly used for this purpose. Nevertheless, machine learning techniques are surpassing traditional approaches by allowing practitioners to model past decisions, exploit them in other scenarios, and predict future chaotic phenomena.
This review intends to provide a comprehensive picture of how machine learning techniques have been used so far in risk assessment from a central bank's perspective.
Thus, the scope of this work encompasses credit institutions and investment firms since those are the ones the European Banking Authority (EBA) regulation focuses on (European Banking Authority 2013). Henceforth, the term institutions will be used to refer to both.
The above-mentioned regulation establishes the standardisation of reporting requirements under the Single Supervisory Mechanism (SSM) (European Commission 2015). As a consequence, this study focuses on the European banking sector. Although we are aware of the importance of insurance, pension funds, securities, and markets in the financial sector, these are subject to different regulations and would benefit from a dedicated study. This work intends to contribute to several stakeholders in the supervisory landscape:
1. Institutions can gain a comprehensive perspective on which risk assessment approaches are available and how they can evaluate their own exposures.
2. Central banks can acquire an integrated view of several validated methodologies for risk assessment. These can become the pillars of their next decision support systems by laying down the technologies supporting risk assessment processes. Furthermore, this work may also encourage surveys and case studies on the use and adoption of ML at central banks.
3. Consultancy companies will benefit from a compendium of ML techniques and risk measures to better support their clients.
4. Academia receives an important contribution that gathers an extensive number of papers on risk assessment and collates the identified methodologies from a supervisory perspective. This will hopefully serve as a stepping stone for future developments in this area and provide a baseline for testing new methodologies.
This paper is organised as follows: it starts by justifying the methodology and describing how the references were selected. The results section identifies commonalities across the published literature and presents the most relevant works that influence this field. The last section discusses lessons learned and future work.

Methodology
This research was conducted through a series of exploratory steps on the topics of machine learning, banking, risk assessment, and banking supervision. The initial objective was to evaluate how machine learning techniques were being used at central banks. Additionally, we intended to analyse how these methods were informing the analytical capabilities of supervisors. We then refined a search query broad enough to return a set of articles we could work on. The following subsections describe a step-by-step guide for the reference search and selection.

Engines
This literature review relies on three search engines, Springer Link, ScienceDirect, and Google Scholar, queried until June 2021. The first two are widely recognised for their trustworthiness and for returning results from top journals. The last one provides broad coverage of articles published in English (Gusenbauer 2019).

Query
Through extensive addition and diversification of search terms, we refined the search query to the following: "machine learning" and ("bank" or "banking" or "supervision").
The underlying reasoning is that machine learning techniques are the focal point of this review article. The added value comes from analysing their potential applications to the banking sector, specifically banking supervision. No limitation concerning the year of publication was applied. Overlapping results are addressed in our secondary analysis. Furthermore, no filter regarding type or place of publication was applied, since the journal of each included paper was evaluated and classified after screening. Additionally, to keep up with new publications, we set up a Google Scholar alert with this query. Finally, we monitored Mendeley's alerts for articles related to the set gathered in this review.
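As an illustration of the query logic only, the boolean expression above can be read as a filter over each record's text. The sketch below assumes each record is a plain string (for example, title plus abstract) and uses simple case-insensitive whole-word matching; the helper name `matches_query` is hypothetical, and the actual search engines apply their own tokenisation, stemming, and field selection.

```python
import re

# Hypothetical helper illustrating the review's boolean query:
#   "machine learning" AND ("bank" OR "banking" OR "supervision")
PHRASE = re.compile(r"\bmachine learning\b")
ANY_OF = {"bank", "banking", "supervision"}


def matches_query(text: str) -> bool:
    """Return True if the text contains the phrase and at least one of the terms."""
    lowered = text.lower()
    has_phrase = PHRASE.search(lowered) is not None
    words = set(re.findall(r"[a-z]+", lowered))  # whole words only
    return has_phrase and not ANY_OF.isdisjoint(words)


records = [
    "Machine learning for bank failure prediction",
    "Machine learning in insurance pricing",
    "Banking supervision with logistic regression",
]
print([matches_query(r) for r in records])  # [True, False, False]
```

Because matching is on whole words in this sketch, a record mentioning only "bankruptcy" would not satisfy the "bank" term.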

Steps
The following subsections detail every step of the selection process, summarised in the PRISMA diagram in Figure 1.

Identification
The research query identified 85 articles and two books from the three search engines. All the papers were published in English, in several different journals, and span the period from 2000 to 2021. This first step involved title and abstract analysis and excluded 14 articles for lack of relevance.

Screening
In this phase, the main topics of each article were analysed, resulting in the exclusion of 21 papers based on the following criteria:
• Dataset: papers using data from sectors other than banking were discarded. We are aware that applications of ML to the stock market are a trendy topic in the literature, and that the insurance and pension funds sector is of great importance in the Eurozone. Nevertheless, their regulation is substantially different, and they would merit a separate study and approach;
• Methodology: risk assessment exercises are historically based on quantitative data combined with expert judgment. Furthermore, quantitative data hold the largest amount of information regarding risk exposure practices. Therefore, we focus our analysis on quantitative methods for which a risk assessment classification has already been assigned (leveraging previous knowledge through supervised learning). We thus excluded works concerning unsupervised learning methods or (qualitative) sentiment analysis;
• Region: this criterion is closely related to the first, since regulation changes according to geography. We chose to focus mainly on works based on institutions operating in the Eurozone. Nonetheless, relevant works by other central banks were considered eligible.

Eligibility
The next step required a thorough analysis of each paper to verify its sources and classify the journal it was published in (impact quartile). Papers were analysed from 2021 backwards to identify any overlapping results or new or improved methodologies, resulting in the exclusion of ten more articles: nine related to personal loans and one duplicate result.
The scope of this review is the application of ML techniques to risk assessment from a supervisory perspective, which extends, at most, to how institutions address their own risk assessment exercises. The data and predictors used to evaluate an individual credit application (personal loan) differ substantially from the data used by banks from a corporate perspective, and even more from the data collected in the regulatory context. As such, works regarding credit risk for individual applicants were also excluded.

Considered Papers
The final article base consists of 41 papers and two books, published from 2000 until 2021, selected through the steps mentioned. In the next section, we will describe the similarities among the papers, as well as the methods applied and respective banking areas. Table 1 lists the selected papers, providing a single-sentence summary of their content.

Distribution
Based on the reviewed works from the previous section, the following paragraphs describe how machine learning techniques have been used in the banking sector. Our research intends to provide a future reference on how these technologies address and support the risk assessment process, in particular from a central bank's perspective. These results solely reflect the analysis of the papers selected for this review. They represent neither the total of publications throughout these years nor the distribution of topics for all publications. Table 2 summarises the selected articles, referenced by author, year of publication, affiliation and number of citations. Additionally, Table 3 lists the journals from the selected articles.
Table 1. Short summary of each analysed paper, referenced by authors and year.

Authors | Year | Summary Sentence
Galindo et al. | 2000 | CART decision trees outperform statistical methods for credit risk assessment, using a commercial bank loans dataset.
Hillegeist et al. | 2004 | The Black-Scholes-Merton option-pricing model is a better indicator of bankruptcy probability than the Z-Score and O-Score.
Min et al. | 2005 | Motivated by the increasing use of machine learning techniques, this paper aims to outperform classical statistics in bankruptcy prediction. An optimised SVM model performs better than MDA, logit, and BPN for bankruptcy prediction.
Angelini et al. | 2008 | Regulation-imposed capital requirements increase the need for precise credit risk assessment systems. This paper shows ANNs' very good results in predicting the default tendency of a borrower.
Boyacioglu et al. | 2009 | Multi-layer perceptrons and learning vector quantization are the most successful models for predicting bank failure as a classification problem, in a Turkish case.
Chaudhuri et al. | 2011 | Fuzzy-SVM satisfies Basel II demands for detecting bankruptcy probability, outperforming other approaches. This algorithm also proved to have better clustering capabilities than PNN.
Hammer et al. | 2012 | Logical analysis of data (LAD) is able to reverse-engineer Fitch risk ratings of banks, showing better results than support vector machines and logistic regression when evaluating the creditworthiness of banks.
Ribeiro et al. | 2012 | This study establishes the limitations of using exclusively quantitative financial data when developing default risk models. The authors propose a new approach that includes contextual knowledge in an SVM model, showing better predictability performance t
López Iturriaga et al. | 2015 | Profiling distressed banks using self-organising maps and modelling failure detection with a multi-layer perceptron outperforms traditional models of bankruptcy prediction. The resulting model detects 96% of failures, up to 3 years before the bankruptcy ev
Ala'raj et al. | 2016 | The proposed hybrid ensemble model improves predictive capability compared to base classifiers, using 7 real-world datasets. It uses a classifier consensus system to compare this new approach with traditional combination methods.
Abellán et al. | 2017 | Selection of the best base classifier in ensemble methods for credit scoring problems. The individual performance of classifiers is not the only criterion for ensemble schemes.
Chakraborty et al. | 2017 | An overview of the applications of machine learning to financial problems, the most popular modelling approaches, and three case studies of relevant works for central banks. This study also establishes that machine learning models usually outperform tradi
Pompella et al. | 2017 | An EWS is proposed to detect likely-to-fail banks. This method is compared with risk agencies' ratings and detects possibly wrongly rated banks. The authors suggest the adoption of this EWS by regulators.
Xia et al. | 2017 | The credit scoring problem is addressed using an XGBoost model with Bayesian hyper-parameter optimisation, not only obtaining better accuracy than baseline models, but also providing feature importance and a decision chart for interpretability.
Alessi et al. | 2018 | The use of random forest to predict banking crises secondary to excessive credit growth, using credit and real estate predictors.
Broeders et al. | 2018 | A survey on the use of innovative technologies in financial supervision, the challenges faced by supervisory agencies, and the need for a clear suptech strategy. Additionally, the experience of early adopters is described.
Chang et al. | 2018 | The development of a credit risk model using an XGBoost classifier to address the heterogeneous nature of financial data. An under-sampling method is applied to deal with the imbalanced data.
Gogas et al. | 2018 | A stress-testing tool based on a support vector machine model forecasts bank failures, outperforming Ohlson's score. The adopted methodology defines a clear boundary between solvent and insolvent banks.
Jagtiani et al. | 2018 | The impact of machine learning in banking supervision in terms of new possible analytical solutions and the risks involved in those new approaches.
Kupiec et al. | 2018 | Addressing the need for validation of bank stress test models by emphasising model forecast accuracy. A Lasso model shows the best forecasting capabilities for determining capital requirements under stressful conditions.
Le et al. | 2018 | Artificial neural networks and k-nearest neighbour methods are more accurate for predicting bank failure than traditional statistics.
Petropoulos et al. | 2018 | Predicting the probability of default of Greek banks using data mining techniques to reduce dimensionality, with XGBoost emerging as the best model. The authors aim to fully capture the information within these large datasets to better support the overall
Tavana et al. | 2018 | Addressing liquidity risk assessment through a model that uses neural networks and Bayesian networks. The models were capable of distinguishing the most critical factors in liquidity risk measurement.
Climent et al. | 2019 | Using XGBoost to identify the best predictors of bank failure and develop a classification model to label failed and non-failed banks in the Eurozone. The data used in this study is composed of 25 annual financial ratios for commercial banks in the Eurozo
Dwivedi et al. | 2019 | Expert contributors identify and compile a series of opportunities, impacts, and research topics raised by the rapid adoption of AI. The financial sector shows enormous potential in robo-advisory, automation, and bankruptcy prediction.
Hohl et al. | 2019 | A survey of activities within the scope of suptech, classifying the degree of technological development and the strategies in place to implement them, highlighting the experimental nature of these initiatives and the need for international coordination.
Kolari et al. | 2019 | Successfully passing European bank stress tests depends largely on the risks a bank is exposed to, as opposed to being prepared for specific adverse scenarios. Using Bankscope data, the developed model accurately predicts 90% of the failing banks.
Kou et al. | 2019 | A survey depicting the most common methodologies to assess systemic risk in the financial system, using machine learning, big data analysis, network analysis, and sentiment analysis. The paper showcases current research on the use of machine learning in
Leo et al. | 2019 | A literature review evidencing machine learning use for risk management purposes in the banking industry, while also noting the experimental nature of most approaches.
Milian et al. | 2019 | A literature review aiming to find consensus on a fintech definition, showing how banks and supervisory agencies are using these innovative technologies and dealing with the risks involved.
Soui et al. | 2019 | Using evolutionary algorithms to address credit risk assessment as an optimisation (rule-based) search problem: minimise complexity, maximise accuracy and weight (rule importance).
Alonso et al. | 2020 | Comparing machine learning models for credit default prediction. A structured strategy for assessing ML models is necessary to increase transparency in the use of these technologies and to promote innovation in the financial industry.
Dastile et al. | 2020 | A systematic literature review on how statistical and machine learning techniques have been used to address the credit scoring problem. Although machine learning is often incapable of explaining predictions, these models consistently outperform the classic
Filippopoulou et al. | 2020 | Developing an EWS to detect systemic banking crises based on the ECB Macroprudential database. Most of the risk indicators used in the dataset are key to forecasting a systemic risk crisis 1 to 4 years before the event.
Giudice et al. | 2020 | Developing an automatic classification system for the sector of economic activity of Italian companies, using a multi-step classifier with gradient boosting and support vector machine models. The developed model is already being used in a production envi
Lee et al. | 2020 | A study on types of machine learning applications, exploring the accuracy-interpretability trade-off, with three use cases in the financial industry.
Alonso et al. | 2021 | Predicting credit default probability with machine learning surpasses traditional statistical methods, potentially leading to savings of up to 17% in regulatory capital requirements.
Antunes | 2021 | Establishing the need for supervisory on-site inspection by comparing the results of two machine learning models, one based on the banks' own risk assessment and the other based on the findings from previous on-site inspections.
Doerr et al. | 2021 | Policy brief showing that central banks are relying on big data for daily tasks, and identifying a clear need for specialised knowledge on how to adequately use machine learning and extract greater value from that data.
Huang et al. | 2021 | This study is developed under the assumption that the intricate nature of financial data cannot be properly explored through traditional methods. An advanced deep learning model addresses the complex and hierarchical features of financial data, outperforming traditional methods and other advanced approaches.
Wang et al. | 2021 | A random forest based EWS outperforms the classic logit approach as the predictive tool to prevent systemic banking crises. This paper shows an expert voting approach to model the multivariate nature of systemic risk assessment data.

The most common topic in these papers is credit risk (nearly 34% of references), as shown in Figure 2.
The next most frequent categories are "ML application" (surveys, fintech, and suptech, the latter being, as per the division suggested by Broeders and Prenio (2018), the use of innovative technologies by supervisory agencies to support their processes) and "stress tests". The remaining results focus either on "bank risk" more broadly or on specific supervisory topics such as liquidity risk and other banking risk perspectives. Another relevant aspect is the publication date of these articles, which range from 2000 to 2021 and are distributed as shown in Figure 3. Importantly, although ML has been applied to the financial sector since 2000, interest in the intersection of these knowledge areas grew sharply from 2015 onward. This translated into an increasing number of publications in this field, with the majority of the relevant articles in this study published from 2017 onward. Table 4 lists the machine learning methods applied by each author, as well as the datasets that supported each study.
Table 4. Machine learning methods applied in each paper and the respective dataset, referenced by authors.

Authors | ML Methods | Dataset
Abellán et al. | AdaBoost, bagging, random subspace, DECORATE, rotation forest | public: Australian, German, and Japanese datasets obtained from the UCI Machine Learning Repository; Iranian dataset from "A comparison between statistical and data mining methods for credit scoring in case of limited available data. (2007)"; Polish datase

Evolution
The selected papers were organised by date of publication. Publication intervals were defined based on relevant events in the banking sector, technological evolution, and the number of papers per interval. The first slot, ranging from 2000 to 2011, encompasses the effects of the financial crises of 1999 and 2008. The second range (2012 to 2016) still reflects several studies based on the 2008 crisis, but with more mature insight. This period also shows a growing use of ANN models. The third slot encompasses the years 2017-2018, which show a significant increase in publications intersecting ML and the banking sector.
The final interval (2019 to date) depicts important ML applications to the financial market in general. Studies in this period reveal increased consideration of the uses and impacts of machine learning in banking supervision, with several publications from banking authorities.

2000-2011
Six papers were identified from this period. They mostly focus on stress tests, although three of them address the topics of credit risk and default risk.
Early in this period, Galindo and Tamayo (2000) identified the risk assessment task as crucial for an efficient use of resources. They used an error curve methodology to compare model precision and concluded that tree-based models outperform ANNs, KNN, and probit. This supports the finding that tree-based models are more appropriate than ANNs for structured data. Hillegeist et al. (2004) proposed a new method for assessing bankruptcy probability. Based on the Black-Scholes-Merton option-pricing model, this method was compared with the well-known Z-score (Altman 1968) and O-score (Ohlson 1980), obtaining superior results. These authors stressed the need for a standardised risk assessment measure, mainly for comparability purposes. Min and Lee (2005) compared statistical and artificial intelligence methods, with the latter outperforming the former in bankruptcy classification. Although this study focuses on credit risk assessment for heavy industry firms in Korea, we included it in our sample for a compelling reason: it is a clear example of machine learning methods outperforming conventional statistics, and it uses a set of predictors (financial ratios) that are easily mapped to regulatory financial reporting, since they are based on balance sheet entries. Angelini et al. (2008) based their work on the Basel II capital requirements and the need for a system to assess credit risk. The main objective of this work is to evaluate the possibility of using neural networks to estimate the probability of default of a borrower (Italian small companies). Although some works in this period used ANNs, the most recurrent approach was to compare classic machine learning models with conventional statistical methods. Furthermore, the risk definition used to evaluate the datasets was based on the probability of default.
This is explained by the fact that the datasets are mostly from loan applications, either from small and medium enterprises or personal loans (housing included). The favourable ANN results reported in this period contradict Galindo and Tamayo (2000), as well as more recent developments in this area: ANNs have been shown to excel in time-series analysis and image and voice recognition, but not to the same extent on structured data.
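To make the comparison by Hillegeist et al. (2004) described above more concrete, the sketch below computes the Altman (1968) Z-score with its commonly cited coefficients, alongside a Merton-style default probability of the form N(-(ln(V_A/X) + (mu - delta - sigma_A^2/2)T) / (sigma_A sqrt(T))), where V_A is the asset value, X the face value of liabilities, mu the expected asset return, delta the dividend rate, sigma_A the asset volatility, and T the horizon. This is a simplified illustration, not the authors' procedure: the hard part of their method, estimating V_A and sigma_A from equity data, is not shown, and inputs are simply assumed. The function names and figures are hypothetical. Note also that the Z-score was originally designed for manufacturing firms, not banks.

```python
from math import log, sqrt
from statistics import NormalDist


def altman_z(working_capital, retained_earnings, ebit, market_value_equity,
             sales, total_assets, total_liabilities):
    """Altman (1968) Z-score with its commonly cited coefficients (hypothetical helper)."""
    x1 = working_capital / total_assets
    x2 = retained_earnings / total_assets
    x3 = ebit / total_assets
    x4 = market_value_equity / total_liabilities  # book value of total liabilities
    x5 = sales / total_assets
    return 1.2 * x1 + 1.4 * x2 + 3.3 * x3 + 0.6 * x4 + 1.0 * x5


def merton_default_probability(asset_value, liabilities, mu, delta, sigma_a, horizon=1.0):
    """P(asset value < liabilities at the horizon) under lognormal asset dynamics:
    N(-(ln(V_A / X) + (mu - delta - sigma_a**2 / 2) * T) / (sigma_a * sqrt(T))).
    Asset value and volatility are assumed given; estimating them is not shown."""
    distance = (log(asset_value / liabilities)
                + (mu - delta - 0.5 * sigma_a ** 2) * horizon) / (sigma_a * sqrt(horizon))
    return NormalDist().cdf(-distance)


# Hypothetical firm figures, for illustration only.
z = altman_z(20, 30, 10, 80, 120, total_assets=100, total_liabilities=50)
print(round(z, 2))  # 3.15
print(round(merton_default_probability(120, 100, 0.05, 0.0, 0.25), 3))
```

The design contrast is the one the original comparison rests on: the Z-score is a fixed linear combination of accounting ratios, whereas the option-pricing approach turns market-based asset value and volatility into an explicit probability.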
Additionally, some articles used financial ratios and the CAMELS rating model (an international rating system used by regulatory banking authorities to rate financial institutions) to assess an institution's performance (stress testing and bankruptcy prediction). Assessing the health of a bank is crucial to prevent its failure and to contain the systemic risk that its failure or losses represent. Boyacioglu et al. (2009) frame this assessment as a classification problem. The authors use the CAMELS method to select the most relevant predictors. Using this method, neural networks were shown to outperform multivariate statistical methods in a Turkish banking sector use case. Chaudhuri and De (2011) consider the Basel II definition of risk to select features for their models. In this case, ANNs are not as frequently used as other conventional ML techniques, such as support vector machines and k-nearest neighbours. As a consequence, the authors focus on optimising those models for the problem at hand (i.e. the nature of the dataset).
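Framing bank health as a classification problem can be illustrated with a minimal k-nearest-neighbours sketch. The two features (a capital ratio and a non-performing loan ratio) and all figures below are hypothetical, chosen only to show the mechanics: a bank is labelled by majority vote among the k most similar banks in the training set. In practice, features would be scaled first, because Euclidean distance is sensitive to units.

```python
from collections import Counter
from math import dist

# Hypothetical training data: (capital ratio, non-performing loan ratio) -> outcome.
TRAIN = [
    ((0.15, 0.02), "solvent"),
    ((0.14, 0.03), "solvent"),
    ((0.16, 0.01), "solvent"),
    ((0.05, 0.12), "failed"),
    ((0.04, 0.15), "failed"),
    ((0.06, 0.10), "failed"),
]


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest neighbours (Euclidean)."""
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


print(knn_predict(TRAIN, (0.13, 0.04)))  # solvent
print(knn_predict(TRAIN, (0.05, 0.11)))  # failed
```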

2012-2016
In this period, articles mostly reflect the first insights gained from the 2008 financial crisis.
Having identified the lack of a comprehensive method to incorporate contextual aspects into bank default risk predictive models, Ribeiro et al. (2012) reported that SVM+ outperformed methods that did not include non-financial information. Hammer et al. (2012) reverse-engineered Fitch risk ratings to show that Logical Analysis of Data (LAD) is an accurate method. The authors stated that LAD can be used as a Basel-compliant internal rating system.
López Iturriaga and Sanz (2015) took a different approach to this matter. First, they used self-organising maps (SOM) to profile distressed banks. This unsupervised learning method relies on competitive learning: map units compete to represent each input, so the map gradually organises itself around recurring patterns, in this case the profile of a bank heading towards bankruptcy. Afterwards, the authors applied multi-layer perceptrons to assess a bank's risk over several time frames, obtaining very promising results in predicting bankruptcy for commercial banks. This two-step approach is the first in this selection of papers to recognise the benefits of a pre-processing phase to map the bankruptcy profile of a bank. Although previous research had shown better results using conventional ML, the success of this perceptron model suggests it is adequate for modelling the evolution of quantitative data over time.
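The competitive-learning step behind self-organising maps can be sketched with a one-dimensional map in plain Python. This is a simplified, hypothetical illustration, not the configuration used by López Iturriaga and Sanz (2015): each input is won by its best-matching unit, which moves towards it together with its grid neighbours, while the learning rate and neighbourhood radius shrink over training. The two-feature "healthy" and "distressed" profiles are invented and assumed to be pre-scaled to [0, 1].

```python
import random
from math import dist, exp


def train_som(data, n_units=4, epochs=50, lr0=0.5, radius0=1.5, seed=0):
    """Train a 1-D self-organising map; returns one weight vector per map unit."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        decay = 1 - epoch / epochs               # linear decay towards zero
        lr = lr0 * decay
        radius = max(radius0 * decay, 1e-3)
        for x in rng.sample(data, len(data)):    # visit inputs in random order
            # Competitive step: the best-matching unit (BMU) wins the input.
            bmu = best_unit(weights, x)
            for i in range(n_units):
                # Cooperative step: units near the BMU on the grid also move towards x.
                h = exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                weights[i] = [w + lr * h * (xi - w) for w, xi in zip(weights[i], x)]
    return weights


def best_unit(weights, x):
    """Index of the unit whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: dist(weights[i], x))


# Hypothetical, pre-scaled two-feature profiles.
healthy = [(0.90, 0.10), (0.85, 0.15), (0.95, 0.05), (0.88, 0.12)]
distressed = [(0.10, 0.90), (0.15, 0.85), (0.05, 0.95), (0.12, 0.88)]
som = train_som(healthy + distressed)
print(best_unit(som, (0.9, 0.1)), best_unit(som, (0.1, 0.9)))
```

After training, banks resembling the distressed profile are expected to map to different units than healthy ones, which is the kind of profiling a second-stage supervised model can then build on.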
Ala'raj and Abbod (2016) proposed a new approach to credit scoring using an ensemble model. These authors combine several data filtering and feature selection methods before evaluating model performance, and compare their method with the most traditional classifiers. The results are validated on several public datasets, and their accuracy is assessed under several measures: average accuracy, area under the curve (AUC), H-measure, and Brier score. This is the first paper in our sample showing that ensembles outperform single models for classification problems.
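Three of the evaluation measures listed above can be computed in a few lines. The sketch assumes binary labels (1 = default) and predicted default probabilities, and the data are hypothetical. AUC is computed through its rank interpretation: the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter, with ties counted as half. The H-measure is omitted because it additionally requires a distribution over misclassification costs.

```python
def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties = 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(y_true, p_pred, threshold=0.5):
    """Share of cases whose thresholded prediction matches the outcome."""
    return sum((p >= threshold) == (y == 1) for y, p in zip(y_true, p_pred)) / len(y_true)


y = [1, 0, 1, 0]            # hypothetical outcomes (1 = default)
p = [0.9, 0.2, 0.6, 0.7]    # hypothetical predicted default probabilities
print(round(brier_score(y, p), 3), auc(y, p), accuracy(y, p))  # 0.175 0.75 0.75
```

Accuracy depends on the chosen threshold, AUC only on the ranking of scores, and the Brier score on how well calibrated the probabilities are, which is why studies such as this one report several measures side by side.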

2017-2018
These two years showed an increase of more than 60% in publications at the intersection of ML and the banking sector. As highlighted by Strydom and Buckley (2019), technological evolution allowed for the development of deep learning (DL) models, as well as new ensemble methods such as extreme gradient boosting (XGBoost). Although DL re-emerged in 2012 (Zhang et al. 2020), its application to financial risk only came to light in 2016-2017.
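The boosting idea behind extreme gradient boosting can be sketched with plain gradient boosting of decision stumps under squared loss: each round fits a stump to the current residuals (the negative gradient of the loss) and adds a shrunken copy of it to the model. This is a deliberate simplification and not XGBoost itself, which adds second-order gradient information, regularisation, and deeper trees, and would typically use a logistic loss for default prediction. The one-feature data and the helper names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump minimising squared error on the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # splitting at the largest value leaves the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Plain gradient boosting with stumps under squared loss (xs needs >= 2 distinct values)."""
    base = sum(ys) / len(ys)
    stumps, preds = [], [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient of squared loss
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Hypothetical data: one risk ratio per borrower and a 0/1 default flag.
ratios = [1, 2, 3, 4, 5, 6]
defaults = [0, 0, 0, 1, 1, 1]
model = gradient_boost(ratios, defaults)
print(round(model(2), 3), round(model(5), 3))  # 0.003 0.997
```

The learning rate trades speed for robustness: each stump only corrects a fraction of the remaining error, so many weak learners are combined into a strong one.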
Traditional ML and classical statistical approaches are still the cornerstones of most of these articles. However, the use of ANN-based models is noticeably increasing, mainly due to bigger datasets and enhanced computing power. Abellán and Castellano (2017) build on their previous work showing that ensembles achieve better results in credit risk assessment than single models, validating the findings of Ala'raj and Abbod (2016). The authors note that individual model performance is not the only criterion for selecting the base classifier of an ensemble. Although the authors emphasise their own tree-based model (Credal Decision Tree, CDT), the main finding of their work is the corroboration of the hypothesis that ensembles outperform single classifiers.
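The intuition behind ensembles outperforming single classifiers, and behind individual accuracy not being the only selection criterion, can be shown with hard majority voting. In this hypothetical example, each of three models is wrong on a different case, so the vote corrects every individual error; if the models made the same mistakes, voting would not help. The helper names are illustrative only.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one list by hard (majority) voting."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def accuracy(y_true, y_pred):
    """Share of cases where the predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


truth = [1, 0, 1, 1, 0]          # hypothetical outcomes (1 = default)
models = [
    [1, 0, 0, 1, 0],             # model A: wrong on the third case
    [1, 1, 1, 1, 0],             # model B: wrong on the second case
    [0, 0, 1, 1, 0],             # model C: wrong on the first case
]
print([accuracy(truth, m) for m in models])      # [0.8, 0.8, 0.8]
print(accuracy(truth, majority_vote(models)))    # 1.0
```

With an even number of models, ties are resolved in favour of the label encountered first, since `Counter.most_common` keeps first-seen order for equal counts.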
Prompted by the 2008 Global Financial Crisis and the need to foresee signals of financial instability, Italian authors Pompella and Dicanio (2017) developed an Early Warning System (EWS) to help uncover signs of distress in banks. This credit risk model allows users to discriminate stable from likely-to-fail banks and might be useful in adjusting the ratings assigned by rating agencies. The authors suggest its adoption by regulators to support the supervisory process. Xia et al. (2017) present an extreme gradient boosting model (XGBoost, by Chen and Guestrin (2016)) that consistently outperforms baseline models. The authors stress the importance of model-based feature selection, as well as the use of Bayesian hyper-parameter optimisation, to achieve better predictive results. Although personal credit risk is not the main topic of interest in this review, this study shows the advantages of boosting techniques and the importance of an interpretable model for decision making. Models of this type have won several Kaggle competitions and consistently show excellent results on structured data. Chakraborty and Joseph (2017), from the Bank of England, introduce a central bank perspective on machine learning and its applications. The authors provide an overview of machine learning models and model validation to support the presentation of three case studies. As a final note, this work acknowledges the amount of available data as an important vector in decision support systems based on machine learning at central banks and other offices. As previously stated, agency papers such as this one are paramount for understanding the use of machine learning in these contexts, providing use cases and areas of interest for future work. Alessi and Detken (2018) contribute another EWS to detect excessive credit growth. This phenomenon is usually at the root of systemic risk to financial stability, and its early detection can help avoid cases of bankruptcy.
The authors use a random forest classifier with credit and real estate predictors. Their work is pioneering in the domain of risk assessment from the perspective of central banks, setting a path for peer practitioners. Moreover, the work reinforces that ensembles consistently outperform single models. Other authors successfully use extreme gradient boosting to develop a credit risk model for financial institutions (Chang et al. 2018). These tools promise significant support (i.e. a low error rate) for risk assessment in loans.
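A simplified random forest can be sketched by combining bootstrap sampling, random feature selection, and majority voting. To keep the example short, each tree is a depth-one stump on a single randomly chosen feature; real random forests grow deeper trees and sample a subset of features at every split. The credit-growth and house-price-growth values below are hypothetical stand-ins for the credit and real estate predictors mentioned above, and the helper names are illustrative only.

```python
import random
from collections import Counter


def fit_class_stump(rows, labels, feature):
    """Depth-one classification tree on one feature, minimising misclassifications."""
    best = None
    for t in sorted({row[feature] for row in rows})[:-1]:
        left = [y for row, y in zip(rows, labels) if row[feature] <= t]
        right = [y for row, y in zip(rows, labels) if row[feature] > t]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    if best is None:  # constant feature in this bootstrap sample: predict the majority label
        majority = Counter(labels).most_common(1)[0][0]
        return lambda row: majority
    _, t, left_label, right_label = best
    return lambda row: left_label if row[feature] <= t else right_label


def random_forest_of_stumps(rows, labels, n_trees=25, seed=0):
    """Bootstrap rows, pick one random feature per tree, combine trees by majority vote."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample (with replacement)
        feature = rng.randrange(n_features)         # random feature subset of size one
        trees.append(fit_class_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return lambda row: Counter(tree(row) for tree in trees).most_common(1)[0][0]


# Hypothetical predictors: (credit growth, house price growth) -> crisis flag.
rows = [(0.10, 0.15), (0.20, 0.10), (0.15, 0.30), (0.30, 0.20),
        (0.80, 0.90), (0.90, 0.70), (0.75, 0.85), (0.95, 0.80)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = random_forest_of_stumps(rows, labels)
print(forest((0.05, 0.05)), forest((0.95, 0.95)))  # 0 1
```

Bootstrapping and random feature choice make the individual trees differ from one another, which is what lets the vote average out their errors.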
The Bank of Greece also provides a thorough analysis based on post-2008 crisis loan data from Greek banks (Petropoulos et al. 2018). This study sets a milestone for the use of advanced ML techniques from a supervisory perspective. Furthermore, it leverages the resulting model to create an EWS that supports subsequent loan approval decisions. Similar to what López Iturriaga and Sanz (2015) have shown, modelling how variables evolve over time is where neural networks (in this case deep neural networks, DNNs) excel. Another important result is that DNNs can perform just as well as XGBoost, showing how well deep learning models can adapt to structured data. Tavana et al. (2018) present a study that directly addresses liquidity risk, which is the most rapidly devastating risk a bank is exposed to. In this paper, the authors combine an artificial neural network (ANN) with a Bayesian network (BN) to assess liquidity risk, using solvency as a proxy. The ANN models the liquidity risk indicator, while the BN models its probability of occurrence. The results show that this approach identifies the most critical factors for liquidity in the dataset studied. Broeders and Prenio (2018) compile the experience of early users of innovative technology in financial supervision (sup-tech). The authors propose a definition of sup-tech and show how it is used for data collection and analytics. These two applications tend to be initiated differently within supervisory agencies: data collection projects usually stem from management decisions, whereas analytics projects usually start as research questions or analysis queries from supervision units. A common thread across all use cases is the experience shared by early adopters and the impact these technologies are having on their organisations.
Studies of this kind, such as the one by Chakraborty and Joseph (2017), are essential for compiling, sharing, and contrasting the approaches taken across central banks and other agencies.
The Federal Reserve provides a broader perspective, analysing how the use of machine learning and big data will affect compliance (Jagtiani et al. 2018). The authors also stress the need to identify the risks these technologies carry when applied to financial markets. Gogas et al. (2018) propose a machine learning methodology that separates solvent from failed banks. The authors present an alternative stress-testing tool that outperforms the O-score. Their approach is based on a support vector machine (SVM) that defines a boundary between solvent and insolvent banks, thereby framing failure prediction as a classification problem. Kupiec (2018) presents a related study that stresses the need for new methodologies to validate conventional bank stress tests.
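To show what framing bank failure as a classification problem looks like in practice, the following minimal sketch trains a linear SVM in plain Python by full-batch subgradient descent on the regularised hinge loss. It assumes a linear kernel, two hypothetical standardised ratios (capital and non-performing loans) and invented toy data. It is not a reproduction of the model of Gogas et al. (2018).

```python
# Minimal linear SVM trained by full-batch subgradient descent on the primal objective
#   J(w, b) = (lam / 2) * ||w||^2 + (1 / n) * sum_i max(0, 1 - y_i * (w . x_i + b))
# Labels (hypothetical convention): +1 = solvent, -1 = insolvent.


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # inside the margin or misclassified: hinge term is active
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (solvent) or -1 (insolvent) depending on the side of the hyperplane."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Invented toy data: [standardised capital ratio, standardised non-performing loan ratio]
X = [[1.5, -1.0], [2.0, -1.5], [1.0, -0.5], [1.8, -1.2],
     [-1.2, 1.0], [-1.5, 1.8], [-0.8, 1.2], [-2.0, 0.9]]
y = [1, 1, 1, 1, -1, -1, -1, -1]
w, b = train_linear_svm(X, y)
```

The learned hyperplane w · x + b = 0 is the boundary between the two groups. In practice a library implementation (for example scikit-learn's `SVC`) would normally be preferred, and non-linear kernels can capture more complex boundaries.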
As a final reference for this period, Le and Viviani (2018) also tackle bank failure prediction using machine learning and classical financial ratios. An important aspect of this work is that the authors use ratios from five risk perspectives: Loan quality, Capital quality, Operations efficiency, Profitability, and Liquidity. This work confirms once again that machine learning methods outperform traditional statistical methods. However, the authors do not explore ensembles, which other works have shown to be top performers in classification problems.

2019-2021
Managing credit and banking risks is essential for a balanced economy, and preventing their systemic repercussions is considered of the utmost importance. As in earlier periods, these risks hold a privileged spot in research. Still, the most significant increase in publications was in ML applications, which suggests a demand for coordination and for a global perspective on the progress achieved so far in this area. Leo et al. (2019) produce a thorough review of how machine learning has been used in banks for risk assessment. This paper contrasts industrial and academic claims about ML applications with real-life practice, highlighting several areas of risk management where ML has been poorly applied. Climent et al. (2019) develop an insightful study that aims to identify the set of financial predictors that best model a bank's financial distress. To this end, the authors apply an XGBoost-based model to a set of indicators that might predict bank failure in the Eurozone. The selected indicators (Total assets, Loan loss provisions/net interest revenue, Equity/net loans and Interbank ratio) are shown to be the most helpful for regulators monitoring financial distress in those banks. From a technical perspective, this work reinforces the choice of XGBoost for classification problems on structured data. A recent study by Wang et al. (2021) questions the use of logit as the base classifier of EWS designed to predict banking crises. Instead, the authors use a random forest classifier to simulate expert decisions, obtaining a generalisation performance above 80% area under the curve (AUC). Kou et al. (2019) compare several ongoing lines of research on applying machine learning methods to detect systemic risk events, that is, financial distress phenomena that affect several markets or geographic regions. They also propose the use of big-data analysis to assess systemic risk. Soui et al. (2019) address the comprehensibility of machine learning models for credit risk assessment. Interestingly, this study mentions interpretability as one of the barriers to adopting ML models in day-to-day decision making. To circumvent this problem, the authors develop an evolutionary algorithm that treats credit risk assessment as an optimisation problem: minimising complexity while maximising accuracy.
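The AUC reported by Wang et al. (2021) has a simple interpretation: it is the probability that a randomly chosen crisis observation receives a higher score from the model than a randomly chosen non-crisis observation. An AUC of 0.8 therefore means that about 80% of such pairs are ranked correctly. The minimal sketch below computes the AUC directly from this definition; the labels and scores are invented for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via its probabilistic interpretation:
    the share of (positive, negative) pairs in which the positive example
    (label 1, e.g. a crisis) gets the higher score; ties count as 1/2."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented out-of-sample crisis labels and model scores
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.6, 0.2, 0.4, 0.1]
auc = roc_auc(labels, scores)  # 8 of the 9 (crisis, non-crisis) pairs are ranked correctly
```

The pairwise loop is quadratic in the sample size; for large samples, rank-based formulas or library functions such as scikit-learn's `roc_auc_score` are preferable.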
A recent review by Dastile et al. (2020) comparing statistical and ML models for credit scoring showed that ensembles outperform single classifiers, confirming the results of the works mentioned above. The authors identify model explainability and the ability to deal with imbalanced datasets as the main issues when modelling credit risk. Deep learning models also show promising results, although they have not been extensively explored for credit risk assessment; the authors single out the lack of interpretability as the main barrier to their adoption in this domain.
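Dastile et al. (2020) point to imbalanced datasets as a core issue: defaults are usually far rarer than non-defaults, so a classifier can look accurate while ignoring the minority class. One simple remedy, shown here as an assumption rather than as the method used in the reviewed works, is random oversampling of the minority class; alternatives include class weighting and synthetic sampling. The function name and toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each minority class until all classes
    match the size of the largest class. Use on the training split only."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = [list(row) for row in X], list(y)
    for label, count in counts.items():
        rows = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(list(rng.choice(rows)))  # copy so later edits do not alias the original row
            y_out.append(label)
    return X_out, y_out


# Invented toy data: one feature, 8 non-defaults (0) and 2 defaults (1)
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2
X_res, y_res = random_oversample(X, y)
```

Oversampling should be applied only to the training split; otherwise duplicated rows leak into the evaluation data and inflate the reported performance.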
Banco de España (Alonso and Carbo 2021) published a comparison of several well-known machine learning algorithms for credit default prediction, showing significant improvements over logit. The authors estimate that implementing XGBoost-based assessment could lead to savings of up to 17% of capital requirements under current ECB regulation. Antunes (2021) from the Central Bank of Brazil presents a solid argument for maintaining supervisory on-site inspections. The author compares two machine learning models, one trained on portfolio ratings assessed by the banks themselves and the other on past ratings obtained through on-site inspections. The results show that overall performance is consistently higher when using data retrieved through inspections.
This is the period in which we identified the most ML application papers (9 out of 13). They range from insights into how AI will continue to revolutionise industries and change social behaviour (Dwivedi et al. 2021) to more practical approaches to incorporating ML in financial services (Lee and Shin 2020). Milian et al. (2019) also compare fin-tech definitions and discuss how fin-tech is supported by digital transformation, as well as the financial risks associated with the use of ML.
A comprehensive study by di Castri et al. (2019) focuses on the definition of sup-tech and highlights the need for a more precise notion of what counts as "innovative technology" at the service of a financial authority. It presents several use cases and classifies the technologies into maturity levels (called "generations" in the paper), concluding that the identified initiatives (applications of innovative technologies to support the activities of financial regulators and authorities) are mostly experimental. The authors call for international coordination and alignment to create synergies that foster sup-tech development.
The Bank of Italy presented a use case for a classification problem, namely deducing a company's institutional sector code from its characteristics (Massaro et al. 2020). Although this work is not related to risk assessment, it provides an excellent example of a production-ready application of ML to supervisory tasks. Alonso and Carbo (2020) from Banco de España stress the need for a joint strategy to assess ML models in order to increase transparency and promote the adoption of this technology. The authors conclude that ML models increase the predictive capability of a credit default classifier by 20%. The study also identifies factors in credit risk management that might increase supervisory costs.
Driven by recent progress in financial technology, Huang et al. (2021) acknowledge the complex and hierarchical nature of financial data and the technological barriers encountered when using statistics and classic ML. The authors then apply advanced deep learning methods and use multiple graphics processing units (GPUs) to speed up computation.
As a final remark regarding ML applications, Doerr et al. (2021), from the Bank for International Settlements, presented a policy brief through the European Money and Finance Forum, evaluating to what extent central banks are making use of ML and big data. The authors conclude that although central banks are acquainted with big data, there is a persistent need for specialised knowledge on how to use ML throughout these organisations.
Stress tests also feature in this period. Kolari et al. (2019) hypothesise that stress tests are more an assessment of a bank's ability to deal with the risks it is exposed to. This challenges the common conception of stress tests as a measure of a bank's resilience to adverse alternative macroeconomic scenarios. To test this, the authors develop an early warning system to assess how European banks will perform in stress tests. They suggest that passing stress tests depends largely on the underlying risk dimensions of individual banks. Moreover, this paper reaffirms boosting techniques as winning solutions, both for this sort of classification problem and for structured data more generally. As future work, the authors recommend a similar approach using regulatory data.
Along the same line of research, Filippopoulou et al. (2020) developed an EWS to predict systemic banking risk in the Eurozone. This study starts by analysing the importance of commonly used indicators and presents a model that detects a systemic crisis one to four years in advance. Despite relying on a classic multivariate binary logistic regression model, the methodology adopted for this EWS shows promising results and can serve as a reference for future developments in this area.
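For comparison with the machine learning approaches above, a binary logistic regression EWS can be sketched in a few lines: the model maps a vector of indicators to the probability that a systemic crisis follows within the chosen horizon, and a warning is issued when that probability exceeds a threshold. The minimal sketch below fits such a model by batch gradient descent on the log-loss. The indicator names, toy data and threshold are hypothetical, and it does not reproduce the specification of Filippopoulou et al. (2020).

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logit(X, y, lr=0.5, epochs=2000):
    """Binary logistic regression fitted by batch gradient descent on the mean log-loss.
    X: rows of standardised indicators; y: 1 if a crisis followed within the horizon, else 0."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # gradient of the log-loss w.r.t. the linear score is (predicted probability - label)
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def crisis_probability(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def issue_warning(probability, threshold=0.5):
    """Hypothetical signalling rule: warn when the estimated crisis probability exceeds the threshold."""
    return probability > threshold


# Invented toy data: [standardised credit-to-GDP gap, standardised house price growth]
X = [[-1.0, -0.5], [-0.8, -1.0], [-0.5, 0.0], [0.0, -0.5], [0.2, 0.6],
     [0.5, 0.2], [1.0, 1.2], [1.5, 0.8], [0.8, 1.5], [-0.2, 0.9]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
w, b = fit_logit(X, y)
```

In a real EWS the threshold is usually tuned to balance missed crises against false alarms rather than fixed at 0.5.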

Datasets
Most central banks and supervisory agencies do not make their datasets available, for confidentiality reasons. This applies to several types of data, such as credit register data and supervisory data (European Banking Authority 2013).
As depicted in Figure 4, regardless of the research topic, most datasets used in these papers are public. The main reason is that most researchers cannot access validated supervisory data. Another relevant aspect is that central banks and supervisory agencies have only recently begun to engage in programmes with ML development strategies in place. These developments are starting to surface, as shown by the growing number of titles under the "ML applications" topic. Table 4 lists the datasets used in each paper. Some rating agencies, central banks and other institutions provide datasets to support research projects. A good example is Banco de Portugal's BPLIM (Banco de Portugal 2021), a micro-data research laboratory that provides up-to-date anonymised datasets to national and international researchers. Another example is Moody's DataHub (Moody's 2021), a cloud-based platform that provides eligible data alongside data from affiliated third-party participants.

Related Work
In this research, we found few papers strictly addressing the use of machine learning techniques for supervisory risk assessment. Consequently, we broadened our research question to include banking risk assessment and machine learning in the financial sector. This reasoning is presented in detail in Section 2.2.
Nonetheless, we found some works that support the purpose of this review. di Castri et al. (2019) is a survey that summarises the activities that can be considered applications of innovative technology for supervisory purposes. The authors also present a series of use cases, mostly experimental and originating in supervisory agencies. Kou et al. (2019) list the most common methodologies for addressing systemic risk in the banking sector: ML, big data analysis and sentiment analysis. Last, and closest to this research, Leo et al. (2019) contribute a literature review that brings to light how machine learning is currently used in the banking sector. The authors stress that, contrary to what might be expected given the magnitude of the financial consequences involved, these sophisticated technologies are in fact under-used and poorly developed in real-life practice.
The authors' specific knowledge of the banking context, namely of projects within Banco de Portugal and the European Central Bank, allowed them to propose a reliable proxy for the scarce published work on this topic. To establish the ideal perspective, we evaluated how risk assessment is carried out in the banking industry and by central banks in the SSM. We also investigated how ML is being used for risk assessment in banks. Additionally, we referenced various surveys from central banks to depict and support our statements regarding the use of innovative technologies for supervisory purposes.
In this sense, although this review relies on a proxy and related works from a central bank perspective are scarce, the authors propose it as a starting point for researchers and industry stakeholders. We aggregate relevant contributions to support and encourage the use of ML in risk assessment exercises from a central bank or supervisory agency perspective.

Global Analysis
The set of papers identified in this review includes diverse approaches to risk assessment. We selected some works that use a specific bankruptcy indicator (such as the Altman Z-score or the O-score). However, most authors start from a set of financial ratios and, knowing the outcome, try to model that knowledge through supervised learning. Most of these approaches turn the problem into a classification task, for example, "failure" versus "no failure" of a bank.
Another interesting aspect is how the datasets are designed. Most of these works use public datasets to validate a given approach, some of which were specifically collected to depict financial crises. The features available in these datasets often reflect a particular industry perspective on risk assessment. For instance, many datasets focus on credit and profitability ratios, since these capture two crucial aspects for the industry: the exposure arising from a bank's main business model (credit) and how the bank performs (profitability).
As a final remark, although most of the selected works come from academia, we would like to highlight the five papers published by central banks from 2017 onwards. The paper by Alessi and Detken (2018), from the European Central Bank (ECB) and the European Commission, has received a significant number of citations (135 by the end of 2020) and presents an important EWS that can support everyday processes. Chakraborty and Joseph (2017) from the Bank of England also make a major contribution with a broad view of what is being done with ML in this context. By presenting several use cases, they also turn the spotlight on the successes of these approaches. The Bank of Greece presents an insightful use case for credit risk analysis (Petropoulos et al. 2018).
From a more strategic point of view, Jagtiani et al. (2018) from the Federal Reserve describe the impacts, roles and possible risks of using ML at central banks.
Although not related to risk assessment, a recent study by Massaro et al. (2020) from the Bank of Italy presents a production-ready application of ML techniques to everyday central bank tasks. It is one of the most recent works and shows how ML can make a difference in day-to-day work.

Conclusions
This review provides a comprehensive picture of how machine learning techniques have been used so far in risk assessment from a central bank's perspective. It is organised by timeline and topic. All of the presented topics relate to some extent to supervisory activity and to dimensions of analysis that are part of day-to-day processes. As a consequence of the SSM legislation and the EBA reporting requirements, this work focuses on the European banking sector.
The majority of the selected papers address the credit scoring problem. This stems largely from the fact that granting loans is the core business of most commercial banks. Stress testing in the form of bankruptcy prediction is also in the spotlight, since it is strongly connected with regulatory compliance. A bank is exposed to several other risks that require their own studies, such as liquidity or operational risk. However, focusing on those risks is more a compliance issue than a business model concern.
Some studies are more structured and clear than others, which aids comparability. The more structured studies state which problem they address (a measure of risk and its perspective, a stock index, portfolio pricing, etc.), which ML techniques were applied, and which variables were considered. They also describe the datasets they are based on and clarify the methods used to assess the models' accuracy and predictive capability. The lack of such an organised approach in some articles made it harder to review and condense the information published across the broad spectrum of expertise found. As a consequence, interpreting data originating in different geographies and from banks with diverse business models proves challenging. International consensus must be established on terminology, analysis methods and result reporting in this field. The authors advocate a universal risk assessment methodology that classifies bank risk according to preset parameters and based on the same data, regardless of the bank's location or business model. To this end, and taking advantage of the central bank perspective, the authors suggest using the Supervisory Review and Evaluation Process (SREP), specifically one of its pillars, the Risk Assessment System (RAS). This methodology is used by the ECB and applied, to some extent, to every institution in the SSM. By applying such a broad methodology, analysis results and ML applications become more comparable with an already established practice.
Another relevant aspect is the paucity of work published from a supervisory perspective. The reviewed papers mainly focus on credit risk and stress tests using public data. Although useful for assessing the financial health of a credit institution, these studies seldom use data collected through supervisory directives. Scenario testing, sometimes used as a synonym for stress testing, is another decision support system that greatly increases supervisors' analytical capabilities. The authors emphasise the importance of landmark publications such as the EWS proposed by Filippopoulou et al. (2020), which uses data from the European Central Bank Macroprudential Database gathered in the aftermath of the 2008 financial crisis. Such systems are especially relevant since they work as a daily tool for analysts and benefit strongly from supervisory data. The EWS developed by Alessi and Detken (2018) has also had a large impact on the literature by presenting a random-forest-based solution for anticipating banking crises.
As a final remark, many of these studies rely on public datasets. This often means the data are not as recent as desired and may not include more recent events. For instance, a dataset covering 2005 to 2011 captures market behaviour before the crisis, the crisis itself, and part of the subsequent market decline. It would be useful to model the behaviour of institutions under the new regulation, as well as during the economic recovery observed up to 2019.

Limitations and Future Work
This study set out to select and review the literature on applications of machine learning to banking supervision. However, since this is a rather specific topic and regulation underwent a thorough revision after the 2008 financial crisis, we found few papers addressing this issue exclusively. Some literature has been published by central banks and other agencies, but these works are mostly surveys, adoption assessments, or definitions of new concepts. As a consequence, the research query was broadened to include works from other perspectives:
• Assessment of credit defaults (the topic most explored in the reviewed literature);
• New stress test methodologies;
• Systemic risk detection;
• Other surveys regarding fin-tech and sup-tech.
All these topics are pillars of financial analysis and, as such, relate directly and crucially to proper supervision. Nevertheless, they are all peripheral aspects and do not correspond to the core of the supervisory process itself.
Another aspect worth mentioning is that our work is not a detailed review of the literature cited within it. Due to the heterogeneous structure of the included literature, we opted for a broader approach when comparing works. Each topic would merit an individual in-depth analysis and review, which was beyond the scope of this article. The authors believe this review will provide a stepping stone for supervisors, analysts, consultants, and academics who wish to further explore machine learning as a tool for banking risk assessment.