Modeling Credit Risk: A Category Theory Perspective

Abstract: This paper proposes a conceptual modeling framework based on category theory that serves as a tool to study common structures underlying diverse approaches to modeling credit default that at first sight may appear to have nothing in common. The framework forms the basis for an entropy-based stacking model to address issues of inconsistency and bias in classification performance. Based on the Lending Club's peer-to-peer loans dataset and the Taiwanese credit card clients dataset, relative to individual base models, the proposed entropy-based stacking model provides more consistent performance across multiple data environments and less biased performance in terms of default classification. The process itself is agnostic to the base models selected, and its performance is superior regardless of which models are chosen.


Introduction
Credit risk assessment is a critical component of a lender's loan approval, monitoring and pricing process. It is achieved through the application of statistical models that provide estimates of the probability of default (PD) of the borrower, usually over a one-year period. Default risk is typically treated as a dichotomous classification problem, distinguishing potential defaulters (non-payers) from non-defaulters (payers), with information about default status contained within a set of features of the parties involved in the transaction. Altman (1968) provided the first formal approach to corporate default modeling, reconciling accounting-based ratios often used by practitioners with rigorous statistical techniques championed by researchers. He applied a statistical technique called Multivariate Discriminant Analysis (MDA) to construct discriminant functions (axes) from linear combinations of the selected covariates. A major drawback of MDA is the large number of unrealistic assumptions imposed, which frequently results in biased significance tests and error rates (Joy and Tollefson 1978; Mcleay and Omar 2000). This has led many researchers to propose logistic models as the next best alternative, requiring fewer restrictive assumptions and allowing for more general usage without loss in performance (Altman and Sabato 2007; Lawrence et al. 1992; Martin 1977; Ohlson 1980). Whilst there have been several attempts to put the field of credit risk modeling on a more concrete theoretical foundation (Asquith et al. 1989; Jonkhart 1979; Santomero and Vinso 1977; Vassalou and Xing 2004), the literature has more recently moved, supported by advances in computing power, to techniques employed in the field of machine learning (ML). Essentially, ML consists of statistical models that require less restrictive assumptions regarding the data, providing more flexibility in model construction and usage. This flexibility has made ML the fastest growing research area in credit risk modeling. Among supervised machine learning methods, the Artificial Neural Network (ANN) has received the most attention, offering improved prediction accuracy, adaptive capability and robustness (Dastile et al. 2020; Tam 1991).
Since the introduction of ML to credit risk modeling, the number of studies using ML techniques has increased nearly exponentially, focusing primarily on benchmarking state-of-the-art individual classifiers. Lessmann et al. (2015) is the first study to benchmark a wide range of supervised ML classifiers not investigated previously. It has become a key reference for other researchers on model comparison. Also notable is the study of Teply and Polena (2020), which applies ML to a peer-to-peer loan dataset provided by the Lending Club. Of particular interest has been the construction of ensembles of credit risk models (Abellán and Mantas 2014; Ala'raj and Abbod 2016b; Finlay 2011; Hsieh and Hung 2010) and meta-classifiers trained on combined outputs of groups of base models (Doumpos and Zopounidis 2007; Lessmann et al. 2015; Wang et al. 2018; Wolpert 1992; Xia et al. 2018).
Despite the increasing sophistication in how individual base models are put together in an ensemble (stacking) or how the various outputs are combined to achieve a final prediction, all of these approaches face a critical issue. Essentially, there is a lack of a sound conceptual framework to guide the ensemble or stacking process. Each study specifies its own method of selecting base models for combination and generating combined outputs. As a result, the recommendations made have been highly sensitive to the data environments examined, making it difficult to perform sound comparative performance analysis. This explains why each study tends to conclude that its combination method is the best performer among competing models.
Motivated by a lack of consistency in model selection, this paper outlines a conceptual framework concerned with the design of structures in credit risk modeling within a classification context. Based on the framework, various computational approaches are proposed that solve the above noted problem of inconsistency in results. First, category theory is introduced to help design common structures underlying seemingly unrelated credit risk models. These structures reveal deep connections between seemingly unrelated models, thus providing a powerful tool to study their relationships without being distracted by details of their implementation. Second, a stacking model is constructed to address issues of inconsistent and biased performance in model benchmarking. Typically, a model's predictive value exhibits inconsistent performance when there are changes in data scope within an environment or changes in the environment itself, with the underlying model essentially remaining unchanged. Complicating this issue is a tendency for models to be biased in their prediction due to the subjective selection of performance criteria. It is not unusual to observe a model delivering an impressive overall performance while failing to detect any credit default at all. In order to address this issue, two new structures, Shannon's information entropy and enriched categories, are introduced. The focus of attention is on demonstrating the benefit of having a sound conceptual framework to enable optimal construction of models that minimise performance inconsistency and bias.
The proposed modeling framework is applied to the Lending Club's peer-to-peer loans dataset from 2007-2013 as well as to the Taiwanese credit card clients dataset for 2005. The empirical results show that the proposed entropy-based stacking approach results in more consistent performance across multiple data environments as well as less biased performance. The process itself is agnostic as to which base model is selected. The conceptual framework developed provides an explanation as to why various ensemble and stacking models proposed in the literature arrive at different conclusions regarding classification performance: they are caught in an equivalence trap. Ensemble models, despite their seemingly sophisticated assembling process, fuse the outputs of base models either by majority voting or by some type of linear weighted combination. In doing so, no new instance of the data structure is created; all that has been achieved is an extension of the operation to cover the output combination process. As a result, the categorical structure of the modeling approach is the same as that of any other credit risk model with equivalent performance. This paper is organized as follows. Section 2 describes in detail the modeling framework proposed, including key elements of a category and how the representation of current approaches to modeling credit risk can be built within the context of frames. Section 3 presents the data, whilst Section 4 presents the empirical results. A discussion of the empirical findings is presented in Section 5. Finally, Section 6 concludes the paper.

Categorial Equivalence
Whilst at first glance the many statistical approaches to credit risk modeling may seem radically different from one another, with each model constructing its own relation between the various covariates, common features exist which can be integrated into a conceptual framework that captures the essence of the credit modeling process. This framework can be built on the concept of category theory, which is the abstract study of process first proposed by Eilenberg and MacLane (1945). Category theory concerns itself with how different modeling approaches relate to one another, and the manner in which they relate is expressed through the functions between them. Instead of focusing on a particular credit risk modeling approach A and asking what its elements are, category theory asks what all the morphisms from A to other modeling approaches are. Arguably, this mindset could be extremely useful as it suppresses unimportant details, allowing the modeler to focus on the important structural components of credit risk assessment.
J. Risk Financial Manag. 2021, 14, x FOR PEER REVIEW
The structure of the credit risk modeling process underlying current approaches is represented in Figure 1 (for the key definitions in category theory see Appendix A). Object D represents a data structure that forms the basis on which specific data are collected, processed, analyzed and used in both the testing and training process. Object M represents model choice, with the morphism m between D and M defined by a computational process that optimally maps the specific training dataset to a unique model (Aster et al. 2018). Object C represents modeling outcomes, with the morphism c between M and C defined by a two-stage process: (i) the testing dataset is applied to the model to obtain predictions of default; and (ii) these predictions are compared to the actual outcomes observed in the data and the results are mapped into a compressed structure such as a confusion matrix or vectors of PDs from which various performance metrics are constructed (Dastile et al. 2020). Object P represents performance criteria, i.e., agreement between prediction and observation. This measurement process defines the morphism p in the structure above. The morphisms m, c and p are well-defined computational processes in the sense that they are finite and generate unique results. Consequently, m, c and p are injective morphisms.
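The chain D → M → C → P described above can be sketched as three composable functions. This is a minimal illustration only: the one-feature threshold "model", the toy data, and the choice of accuracy and recall as the contents of P are assumptions made for the example and are not taken from the paper.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[float, int]  # (single feature value, default flag: 1 = default)


def m(train: List[Row]) -> Callable[[float], float]:
    """Morphism m: D -> M. Deterministically fit a toy one-feature model."""
    mean_default = sum(x for x, y in train if y == 1) / sum(1 for _, y in train if y == 1)
    mean_non_default = sum(x for x, y in train if y == 0) / sum(1 for _, y in train if y == 0)
    cut = (mean_default + mean_non_default) / 2  # midpoint between class means

    def model(x: float) -> float:
        # Crude PD estimate: high above the cut, low below it (hypothetical values).
        return 0.9 if x > cut else 0.1

    return model


def c(model: Callable[[float], float], test: List[Row]) -> dict:
    """Morphism c: M -> C. Predict on test data and compress into PDs plus a confusion matrix."""
    outcome = {"PD": [], "TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for x, y in test:
        pd_hat = model(x)
        outcome["PD"].append(pd_hat)
        pred = 1 if pd_hat >= 0.5 else 0
        if pred == 1 and y == 1:
            outcome["TP"] += 1
        elif pred == 1 and y == 0:
            outcome["FP"] += 1
        elif pred == 0 and y == 1:
            outcome["FN"] += 1
        else:
            outcome["TN"] += 1
    return outcome


def p(outcome: dict) -> Dict[str, float]:
    """Morphism p: C -> P. Turn the confusion matrix into performance metrics."""
    tp, fp, fn, tn = (outcome[k] for k in ("TP", "FP", "FN", "TN"))
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # share of actual defaults captured
    return {"accuracy": accuracy, "recall": recall}


# Hypothetical training and testing instances of the data structure D.
train = [(0.2, 0), (0.3, 0), (0.4, 0), (0.8, 1), (0.9, 1)]
test = [(0.1, 0), (0.5, 0), (0.6, 1), (0.7, 0), (0.95, 1)]

metrics = p(c(m(train), test))  # the composite p . c . m
```

Because each function is deterministic, the same training and testing data always yield the same model, outcome structure and metrics, which mirrors the text's requirement that the morphisms generate unique results.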
At this stage, four more morphisms, denoted $id_D$, $id_M$, $id_C$ and $id_P$, are introduced into the structure, as shown in Figure 2. They essentially send each object to itself, thus representing the objects' identity morphisms. For example, the replacement operator, which replaces one instance of an object with another instance, can be used as an identity morphism. The resulting category R represents the process underlying current approaches to credit risk modeling.
From this structure, a specific approach to credit risk modeling is just a C-Instance of the category R (R-instance $I_1$), represented by four elements and seven morphisms as shown in Figure 3. [...] are structured together. $C_1$ consists of the modeling results with structure $(P^{I}_{1}, TP^{I}_{1}, FP^{I}_{1}, FN^{I}_{1}, TN^{I}_{1})$, where P is a set of PDs obtained during the testing process and TP, FP, FN, TN are elements of the confusion matrix (see Table A1) generated by the testing process $c_1$: TP is a true positive, FP is a false positive, FN is a false negative and TN is a true negative. $P_1$ consists of performance metrics $(m^{I}_{1}, \ldots, m^{I}_{l})$ generated by the morphism $p_1$, which is a specific implementation of p. Since the morphisms always generate unique results, they serve as the functional mapping between the sets $D_1$, $M_1$, $C_1$ and $P_1$. Figure 4 summarises the set-valued functor $I_1$, performing the mapping process of the first R-instance.
Now suppose there is a second approach to credit risk modeling that can be represented as another R-instance $I_2$, again with four elements and seven morphisms (Figure 5). Assume the set-valued functor $I_2$ performs the mapping set out in Figure 6.
Figure 6. The mapping process of the second R-instance.
Since both R-instances have unique objects and morphisms that share exactly the same structure, it follows that there is a natural transformation between them. Essentially, this natural transformation can be constructed as a term rewriting operation T that replaces specific elements of one object in $I_1$ with a corresponding object in $I_2$ and that satisfies condition (1). The existence of T is warranted by the fact that any modeling approach would result in the same structure of its corresponding category, together with the uniqueness of $m_1$, $c_1$ and $p_1$, while the operation T ensures that the naturality condition holds for both $I_1$ and $I_2$. More specifically, there is a natural isomorphism between the two instances $I_1$ and $I_2$ (Figure 7).
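For reference, the textbook naturality condition for a transformation T between two set-valued functors on R can be written as below. The component names $T_D$, $T_M$, $T_C$, $T_P$ are notation introduced here for illustration, and reading this standard form as the condition labelled (1) is an assumption.

```latex
% Standard naturality condition, assuming I_1, I_2 : R -> Set and components T_X for X in {D, M, C, P}.
% For every morphism f : X -> Y of R:
\[
  T_Y \circ I_1(f) \;=\; I_2(f) \circ T_X .
\]
% Spelled out for the three non-identity morphisms, writing I_k(m) = m_k, I_k(c) = c_k, I_k(p) = p_k:
\[
  T_M \circ m_1 = m_2 \circ T_D, \qquad
  T_C \circ c_1 = c_2 \circ T_M, \qquad
  T_P \circ p_1 = p_2 \circ T_C .
\]
% T is a natural isomorphism when every component T_X is a bijection of sets.
```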
The beauty of category theory thus comes from its design-as-proof feature. That is, given a proposition regarding relations between objects, as soon as a structure is properly constructed, the structure itself becomes a proof. The power lies in its capability to construct simple representations that capture the essence of credit risk modeling in a single concrete formalization (category), which may yield powerful insights into credit risk modeling that are difficult to identify using the traditional comparative analysis of individual models usually seen in the literature. That is, different models and their underlying processes are just instances of the same modeling structures represented by a category. As a result, there is an equivalence between the various modeling processes that creates a performance boundary: generalization power has meaning only within the categorical frame representing the modeling process. Consequently, two different credit risk models having the same categorical structure will on average deliver the same result if tested over all possible instances of the category. In practice, this process could go on indefinitely as new datasets would create new instances. Thus, representing credit risk modeling as a category yields a compact method to arrive at the equivalence concept without the burden of going through all possible empirical verifications.

Model Combination
A natural consequence of categorial equivalence is that combining different types of models can result in better and more consistent forecasting performance. Empirically, this has been observed in the literature (Dastile et al. 2020). Conceptually, for model combination to be effective, two conditions must be satisfied. First, since an instance of D determines C, the combination process must generate a new data instance having a structure different from the data initially used in the combination process. Second, the classification method adopted in the combination process must have a categorical structure different from the modeling process without combination. In the resulting category (Figure 8), M is decoupled from C. Instead, it is connected to D by two morphisms. The first morphism, m, describes the usual process of individual model construction. The second morphism, d, represents the process of generating a new data structure by using model combination. Performance is measured by applying a new morphism c, which is essentially a computational process that maps the new data structure in D to C without going through any specific model. The morphism d does not necessarily generate a unique instance of D, since its construction depends on how the outputs of the individual models are combined, thus reducing the likelihood of categorical equivalence.
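The remark that d need not produce a unique instance of D can be illustrated with a small sketch. The two constructions below are hypothetical examples of how base-model outputs might be turned into a new data instance; neither is claimed to be the construction used in this paper, and the PD values are invented.

```python
from typing import List


def d_stack_pds(base_pds: List[List[float]]) -> List[List[float]]:
    """One hypothetical d: each loan becomes the vector of base-model PDs."""
    return [list(row) for row in zip(*base_pds)]


def d_stack_labels(base_pds: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Another hypothetical d: each loan becomes the vector of base-model class labels."""
    return [[1 if q >= threshold else 0 for q in row] for row in zip(*base_pds)]


# Invented outputs of two base models for three loans (one list per model).
base_pds = [
    [0.8, 0.2, 0.55],  # base model 1
    [0.3, 0.4, 0.6],   # base model 2
]

new_data_a = d_stack_pds(base_pds)     # continuous features, one row per loan
new_data_b = d_stack_labels(base_pds)  # binary features, one row per loan
```

Both results are legitimate new data instances built from the same base-model outputs, yet they differ in content and type, which is the sense in which the outcome depends on how the outputs are combined.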
From a practical point of view, the main purpose of combining models based on the categorical framework is to address inconsistency and bias in classification performance. Inconsistency arises when models are sensitive to changes in the data structure, with their performance being valid only within specific contexts shaped by the structure and scope of the data. Bias is a result of the credit risk models used being sensitive to imbalance in default classes in the data. More specifically, models tend to be biased towards non-default prediction, generating performance that at first glance seems satisfactory overall but is poor in terms of capturing actual default outcomes. Bias is also a result of the tendency of modelers to focus on good overall prediction outcomes, with more attention paid to non-default outcomes and less attention to stability in performance (Abdou and Pointon 2011; Dastile et al. 2020; Lessmann et al. 2015). Unfortunately, it is common to find models showing high accuracy while failing to capture actual default outcomes.
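A short worked example makes the accuracy problem concrete. The portfolio below (950 non-defaults, 50 defaults) is invented for illustration, and the "model" is the degenerate rule that always predicts non-default.

```python
# Hypothetical imbalanced portfolio: 0 = non-default, 1 = default.
actual = [0] * 950 + [1] * 50

# Degenerate classifier that never predicts default.
predicted = [0] * len(actual)

tp = sum(1 for a, q in zip(actual, predicted) if a == 1 and q == 1)
tn = sum(1 for a, q in zip(actual, predicted) if a == 0 and q == 0)
fp = sum(1 for a, q in zip(actual, predicted) if a == 0 and q == 1)
fn = sum(1 for a, q in zip(actual, predicted) if a == 1 and q == 0)

accuracy = (tp + tn) / len(actual)  # overall share of correct predictions
default_recall = tp / (tp + fn)     # share of actual defaults that were detected
```

Here accuracy is 0.95 while default recall is 0.0: the rule looks strong on the overall metric even though it identifies none of the 50 defaults.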
The conceptual framework based on category theory provides an explanation as to why various ensemble (stacking) models proposed in the literature arrive at different conclusions regarding classification performance. Essentially, these models are caught in an equivalence trap. Ensemble models, despite their seemingly sophisticated assembling process, fuse the outputs of the base models either by majority voting or some type of linear weighted combination. In doing so, no new instance of the data structure D is created; all that has been achieved is an extension of the operation of the morphism c to cover the output combination process. As a result, the categorical structure remains the same as that of any other credit risk model with equivalent performance. In contrast, the stacking model proposed in this paper creates a new data structure D and at the same time a new instance of model choice M as a meta-classifier. It is the creation of M that effectively provides stacking models with a categorical structure that is identical to that of the typical credit risk model. However, the concept of an equivalence trap also applies in this situation, as shown in Figure 9.
It is a category representing the equivalence trap often observed in typical stacking models. Essentially, the new data instance created by d can be used to train a new meta-classifier $M_S$, which in turn brings the combination process back to the original structure of the modeling process.
The combination process proposed addresses this issue through two key considerations. First, combining models, as the theoretical framework suggests, should first transform the initial feature space into a new data instance D with a structure different from the initial dataset, whilst still capturing information representing outcomes in the initial modeling phase. Second, the new data instance D should be transformed into PDs in a coherent and transparent manner without creating any new classifiers that put the process into an equivalence trap. These considerations are supported by two conceptual constructs: Shannon's information entropy and enriched categories, which are discussed next.

Shannon's Information Entropy
Shannon (1948) proposed a concept called entropy to measure the amount of information created by an ergodic source and transmitted over a noisy communication channel. Noise here reflects uncertainty in how signals arrive at the destination and, for finite discrete signals, is represented by a set of probabilities $p_1, p_2, \ldots, p_n$. Entropy H is defined as follows:

$$H(p_1, p_2, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log p_i$$

Judged by its construction, Shannon's information entropy captures uncertainty in the communication as it deals with noise. Shannon (1948) considered this uncertainty to be the amount of information contained in the signals, thus conceptually establishing a link between uncertainty and information. Essentially, the entropy value tells us how much uncertainty must be removed by some process to obtain information regarding which signals arrive at the destination. Thus, it can be said that the amount of information received from progressing through the process results from the removal of the uncertainty that existed before the modeling process began. The notion of the communication channel can be generalized to a finite event space that consists of n mutually exclusive and exhaustive events with their probabilities. The connection between entropy and information enables the creation of structures that effectively capture the information contained in the modeling process, a feature that will be exploited in the stacking model as a new data structure D used to enhance prediction. Other studies that similarly exploit the concept of entropy in risk assessment are Gradojevic and Caric (2016); Lupu et al. (2020) and Pichler and Schlotter (2020).
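As a minimal illustration of the entropy definition, the following Python sketch computes H for a finite event space. The function name `shannon_entropy` and the default of base-2 logarithms are assumptions made for illustration only; the definition itself does not fix the base of the logarithm.

```python
import math


def shannon_entropy(probs, base=2.0):
    """Return H(p_1, ..., p_n) = -sum_i p_i * log(p_i) for a finite distribution."""
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Events with p_i = 0 are skipped: p * log(p) tends to 0 as p tends to 0.
    return sum(-p * math.log(p, base) for p in probs if p > 0)


if __name__ == "__main__":
    for dist in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
        print(dist, shannon_entropy(dist))
```

For two events, the value is largest when both are equally likely and falls to zero when one of them is certain, which is the sense in which H measures the uncertainty to be removed.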

Enriched Categories
Another important construct used in the post-stacking classification process is the concept of enriched categories (Kelly [1982] 2005). Enriched categories replace the category of sets and mappings, which play a crucial role in ordinary category theory, by a more general symmetric monoidal closed category, allowing the results of category theory to be translated into a more general setting. Enriched categories are potentially an important analytical tool for classifying default outcomes. Essentially, the paired data of entropy value H and prediction output for the training data can be separated into two groups according to the classification class associated with each output. Each group can be viewed as a set of objects that is enriched in (B, ≤, true, ∧), with their hom-object values defined by whether they belong in the same group or not. A computational process is then constructed to obtain a borrower's PD from two quantities: $L_d$, the likelihood that the applicant belongs to the default group, and $L_{nd}$, the likelihood that the applicant belongs to the non-default group. Both $L_d$ and $L_{nd}$ are computed using the Hamming and Manhattan distances. Thus, the combination model not only provides a new classification process but also a new method of estimating PD.

The Stacking Process
With the concepts of information entropy and enriched categories defined above, the stacking model is constructed as follows (see Figure A1 for a flow chart). First, several of the nine classifiers, consisting of a logistic regression and eight of the most popular supervised ML methods, are selected as the base models. Second, during the training and testing phases, the estimated PDs are used to compute the classifiers' Shannon information entropy (H). The entropy values and the default classifications generated are then paired to form a new (restructured) training dataset ($D_2$). Next, employing the concept of enriched categories, final predictions are formed by assigning new testing samples to either the default group or the non-default group just constructed. Finally, the performance results of the stacking model are subjected to location tests to check for consistency and bias.
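One possible reading of the second step can be sketched in Python as follows. The sketch assumes that each base model's entropy is that of the two-outcome distribution (PD, 1 − PD) and that a sample is classified as a default when its PD is at least 0.5; both are illustrative assumptions, as are the names `binary_entropy` and `restructure`. It covers only the construction of the paired dataset, not the subsequent assignment of test samples to groups.

```python
import math


def binary_entropy(pd_hat):
    """Entropy in bits of the two-outcome distribution (PD, 1 - PD)."""
    return sum(-p * math.log2(p) for p in (pd_hat, 1.0 - pd_hat) if p > 0)


def restructure(pd_matrix, threshold=0.5):
    """Pair entropy values with default classifications for each sample.

    pd_matrix[i][k] is the PD that hypothetical base model k assigns to
    sample i. The threshold used to turn a PD into a class is an assumption.
    Returns one (entropies, classes) pair per sample.
    """
    rows = []
    for pds in pd_matrix:
        entropies = tuple(binary_entropy(p) for p in pds)
        classes = tuple(1 if p >= threshold else 0 for p in pds)  # 1 = default
        rows.append((entropies, classes))
    return rows


if __name__ == "__main__":
    # Two samples scored by three hypothetical base models.
    for row in restructure([[0.8, 0.6, 0.9], [0.1, 0.4, 0.2]]):
        print(row)
```

Under these assumptions, a PD close to 0.5 yields entropy close to one bit, so the entropy component records how uncertain each base model was about its own classification.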
Several considerations differentiate the entropy-based stacking model proposed in this paper from the stacking models proposed by others (Doumpos and Zopounidis 2007;Wang et al. 2018). First, instead of selecting and processing the datasets carefully before training and testing a model only once on the dataset, as is usually done by others, the performance of the proposed entropy-based stacking model is assessed repeatedly on small randomly chosen non-overlapping subsets of the original dataset. Inherent class imbalance is utilized to make model comparison more realistic (Lessmann et al. 2015), enabling the construction of different data environments, and thus tests of performance inconsistency and bias. The data process ensures that each subsample will have a different structure regarding class ratio (default/non-default) and feature availability, especially categorical features. Further, performing many simulations allows for significance testing, which is preferred over making ad hoc judgements about average performance outcomes over limited rounds of tests. Thus, significance tests are a necessary complement to the usual average performance results reported by others. Statistical analyses of model performance are also proposed in Lessmann et al. (2015), but their non-parametric tests are performed on a sample of just 10.
The second consideration concerns model selection. Typically, the combination models proposed in the literature carefully select base models according to their performance on some testing data. Some combination of these models will then be benchmarked against all other models. The fact that their selection greatly determines the combination model's overall performance suggests that the base model selection process is more critical than the combination process itself. In contrast, the entropy-based stacking model proposed in this paper seeks to show that the combination process offers more consistent and less biased performance results, regardless of which base models are selected. In order to achieve this goal, the simulation process is carried out over 100 different data environments, with a different number of base models used in each simulation. Moreover, in each scenario, each sample is trained and tested on a different set of base models. Thus, the only element that remains invariant in each simulation is the reasoning process underlying the stacking model.
A final consideration is to demonstrate how a sound conceptual framework may enable quality model combination that both improves consistency and reduces bias in performance. It follows that the method can be applied to various situations without having to worry about the selection of the base models employed. Essentially, the approach avoids making any a priori judgments as to which combination of base models performs best. Comparisons of this type often have little meaning since each study has its own unique data and optimization process (hyperparameters), both of which are difficult to replicate across data environments.

Base Models
Nine classifiers are used as the base models in the stacking process. Whilst not exhaustive, the models chosen are currently the most popular ones in the literature, covering most aspects of statistical and ML approaches, either as a standalone classifier or as part of a combination framework (Dastile et al. 2020;Lessmann et al. 2015;Teply and Polena 2020). Their key structures are discussed next.

i. Artificial Neural Networks
An Artificial Neural Network (ANN) is essentially a nested construct with each layer being represented by the same or a different function. In mathematical form, a typical ANN model can be defined as follows (Barboza et al. 2017):

$$y = f_o(f_n(f_{n-1}(\cdots f_1(x) \cdots)))$$

where n is the number of layers that transform the input feature x into a final set of output features from which classification results are obtained by using the operation of $f_o$. Typically, the inner nested function possesses the following form:

$$f_i(z) = g_i(W_i z + b_i)$$

where i is the layer index spanning from 1 to n and z is the input to layer i. The $g_i$ function is called an activation function, which usually has a non-linear form. Gradient descent techniques are used to obtain the parameter matrix $W_i$ and the vector $b_i$ through optimization processes constrained by some cost function (such as mean squared error). The function $f_o$ is usually a scalar or a vector function that transforms the previous layers' output into the final classification results.

ii. Support Vector Machine
Support Vector Machine (SVM) is a parametric method that essentially puts the input features into a multi-dimensional space and separates them into classes by a hyperplane $wx - b = 0$, where w is the parameter vector and x is the feature vector. The classification decision has the following construct (Cortes and Vapnik 1995):

$$y = \operatorname{sign}(wx - b)$$

under the constraint of maximizing the distance, or margin, between the closest examples of the two classes. In order to achieve this, the Euclidean norm of w, which is $\sqrt{\sum_{i=1}^{n} w_i^2}$, must be minimized, where n is the number of features.

iii. Logistic Regression
In a logistic regression model, the PD is computed as (Altman and Sabato 2007):

$$PD_i = \frac{1}{1 + e^{-W_i}}$$

where $W_i = \theta_0 + \sum_{j=1}^{n} \theta_j x_{ij}$, with $x_{ij}$ representing feature j in feature vector i and θ the set of the model's parameters obtained by maximum likelihood estimation on the training dataset.
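The logistic mapping from the linear score W to a PD can be sketched as follows. The function name `logistic_pd` and the parameter values in the usage lines are hypothetical; parameter estimation by maximum likelihood is not shown.

```python
import math


def logistic_pd(theta, x):
    """Return PD = 1 / (1 + exp(-W)) with W = theta[0] + sum_j theta[j] * x[j-1]."""
    if len(theta) != len(x) + 1:
        raise ValueError("theta needs one intercept plus one weight per feature")
    w = theta[0] + sum(t * v for t, v in zip(theta[1:], x))
    # Use the algebraically equivalent form for negative scores so that
    # math.exp never overflows.
    if w >= 0:
        return 1.0 / (1.0 + math.exp(-w))
    e = math.exp(w)
    return e / (1.0 + e)


if __name__ == "__main__":
    theta = [-2.0, 0.8, 1.5]  # hypothetical intercept and two weights
    print(logistic_pd(theta, [1.0, 0.5]))
    print(logistic_pd(theta, [3.0, 2.0]))
```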

iv. Decision Trees
A decision tree is a kind of acyclic graph in which splitting decisions are made at each branching node, where a specific feature of the feature vector is examined. The left branch of the tree will be followed if the value of the feature is below a specific threshold; otherwise, the right branch will be followed. At each split, the process calculates two entropy values (Safavian and Landgrebe 1991), described as follows:

$$H(S_+) = -f_D^{S_+} \log f_D^{S_+} - (1 - f_D^{S_+}) \log(1 - f_D^{S_+}), \qquad H(S_-) = -f_D^{S_-} \log f_D^{S_-} - (1 - f_D^{S_-}) \log(1 - f_D^{S_-})$$

where $S_+$ and $S_-$ are two sets of split labels and $f_D$ is the decision tree with the initial value defined as $f_D^{S} = \frac{1}{|S|} \sum_{(x,y) \in S} y$. For each case, the process will go through all pairs of features and thresholds and it will choose the ones that minimize the split entropy:

$$H(S_-, S_+) = \frac{|S_-|}{|S|} H(S_-) + \frac{|S_+|}{|S|} H(S_+)$$

which is the weighted average entropy at a leaf node. The classification will then be made using the average value of the chosen labels along the selected nodes.
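The threshold search described in this subsection can be sketched for a single feature as follows. Labels are assumed to be coded 0/1 so that their mean plays the role of the initial tree value on a set of labels; natural logarithms and the function names are illustrative choices, and a full tree would repeat the search over every feature and recurse on each branch.

```python
import math


def label_entropy(labels):
    """Entropy of a list of 0/1 labels, using their mean as the class share."""
    if not labels:
        return 0.0
    f = sum(labels) / len(labels)
    return sum(-p * math.log(p) for p in (f, 1.0 - f) if p > 0)


def split_entropy(left, right):
    """Size-weighted average of the entropies of the two sides of a split."""
    n = len(left) + len(right)
    return len(left) / n * label_entropy(left) + len(right) / n * label_entropy(right)


def best_threshold(values, labels):
    """Return (threshold, split entropy) minimizing split entropy on one feature."""
    best = None
    for t in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v < t]    # below threshold
        right = [y for v, y in zip(values, labels) if v >= t]  # otherwise
        if not left or not right:
            continue  # a split must send samples down both branches
        h = split_entropy(left, right)
        if best is None or h < best[1]:
            best = (t, h)
    return best


if __name__ == "__main__":
    print(best_threshold([1, 2, 3, 4], [0, 0, 1, 1]))
```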

v. Random Forest
This is essentially an ensemble of decision trees, with each tree built on bootstrapped samples of the same size (Breiman 2001). Each tree works on a set of features chosen randomly and classes are then generated for these features. The overall classification is obtained through majority voting of the trees' decisions. This approach reduces the likelihood of correlation between the trees since each tree works on a different set of features. Lower correlation in turn makes majority voting more effective. By using multiple samples of the original dataset, the variance of the final model is reduced. As a result, overfitting is also reduced.

vi. Gradient Boosted Tree
This method uses an adaptive strategy that starts with a simple and weak model, learns about its shortcomings and then addresses them in the next model, which is often more sophisticated (Chen and Guestrin 2016). Examples incorrectly classified by the previous classifier are assigned larger weights in the next classifier. The classifiers' outputs are then ensembled in the following construct to yield the final classification result:

$$F(x) = \sum_{i=1}^{n} \alpha_i \varphi_i(x)$$

where n is the total number of classifiers and $\alpha_i$, which is learned during the training process, is the weight of the classifier $\varphi_i$.
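The combination step alone can be sketched as a weighted vote. The sketch assumes labels coded as 1 (non-default) and −1 (default) and assumes that the final class is the sign of the weighted sum; the thresholding, the toy classifiers and the weights are all hypothetical, and the training procedure that learns the weights is not shown.

```python
def weighted_vote(classifiers, alphas, x):
    """Return the sign of sum_i alpha_i * phi_i(x) as a label in {-1, 1}."""
    if len(classifiers) != len(alphas):
        raise ValueError("one weight is required per classifier")
    score = sum(a * phi(x) for phi, a in zip(classifiers, alphas))
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    # Three hypothetical weak classifiers acting on a two-feature input.
    stumps = [
        lambda x: 1 if x[0] > 0 else -1,
        lambda x: 1 if x[1] > 0 else -1,
        lambda x: -1,
    ]
    alphas = [0.2, 0.5, 0.4]
    print(weighted_vote(stumps, alphas, (1.0, 1.0)))
    print(weighted_vote(stumps, alphas, (1.0, -1.0)))
```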

vii. Naïve Bayes
In a default classification problem, Naïve Bayes (NB) is essentially a decision process based on the following construct (Rish 2001):

$$\hat{y} = \arg\max_{y \in \{1, -1\}} p(y \mid x)$$

where 1 represents non-default status and −1 default status. The conditional probability is computed according to the Bayesian rule with p(x|y = 1) and p(x|y = −1), which are assumed to follow a normal distribution with mean and covariance matrices computed on the default and non-default sample groups constructed from the training dataset. The model assumes that the features are mutually independent.

viii. Markov model
In a Markov model, each feature vector x is treated as a member of a sequence and the probability distribution for the feature vectors given a credit classification class can be estimated from the training data as follows:

$$P(x_i \mid X_i, c)$$

where $x_i$ is the feature vector that requires probability estimation, $X_i$ is the set of the feature vectors preceding $x_i$ and c is a credit classification class. The cardinality of $X_i$ determines how far the model looks back to obtain information for the next prediction. In this context, a cardinality of n results in a so-called n-gram Markov model (Brown et al. 1992). If n = 0, a Naïve Bayes model is generated, as discussed above. At test time, the probability for each class given a feature vector is computed according to Bayes' theorem, $P(c \mid x_j) \propto P(x_j \mid c) P(c)$, where $P(x_j \mid c)$ is computed from the Markov model just derived in the training process and P(c) is the class prior defined before the start of the modeling process.

ix. k-Nearest Neighbor
This is a non-parametric method in the sense that no functional form needs to be constructed for the classification purpose (Henley and Hand 1996). The process learns how to assign a new sample point to a group of known examples and then generates a classification based upon a majority vote of the classes observed in the group. The modeling process is built on $\lVert \{x_i\}_{i=1}^{n} - x \rVert$, which denotes the distance between the elements in the group and the new example. Typically, the Euclidean or Mahalanobis distance is used in the model.
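A minimal sketch of this procedure, using the Euclidean distance and a plain majority vote, is given below. The function name `knn_classify`, the choice k = 3 and the toy data are assumptions for illustration; an odd k avoids tied votes in the two-class case.

```python
import math
from collections import Counter


def knn_classify(train_x, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training examples."""
    if k < 1 or k > len(train_x):
        raise ValueError("k must be between 1 and the number of training examples")
    # Sort the known examples by Euclidean distance to the new example.
    neighbours = sorted((math.dist(xi, x), yi) for xi, yi in zip(train_x, train_y))
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    xs = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
    ys = [0, 0, 1, 1, 1]  # hypothetical coding: 1 = default, 0 = non-default
    print(knn_classify(xs, ys, (0.2, 0.1)))
    print(knn_classify(xs, ys, (5.0, 5.2)))
```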

Method of Comparison
Before discussing the relative performance of the proposed stacking model, it is desirable to consider an appropriate method of gauging agreement between prediction and observation. The first performance metric employed is the Matthews Correlation Coefficient (MCC). It is the preferred benchmarking criterion for binary confusion matrix evaluation as it avoids issues related to asymmetry, loss of information and bias in prediction (Matthews 1975). MCC is computed as follows:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives in the confusion matrix. A key advantage of MCC is that it immediately provides an indication as to how much better a given prediction is than a random one: MCC = 1 indicates perfect agreement, MCC = 0 indicates a prediction no better than random, whilst MCC = −1 indicates total disagreement between prediction and observation.
In addition to MCC, Accuracy is employed as an overall classification performance metric that captures consistency of the model in terms of overall predictive capability. It is computed as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

This metric avoids the class asymmetry issue by looking at overall prediction performance, but often suffers from prediction bias caused by the imbalance problem, with non-default predictions likely to account for most of the results. A very high TN with low TP results in high Accuracy without accurately capturing poor prediction outcomes for the default class.
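The contrast between the two metrics can be made concrete with a short sketch. The function names and the confusion matrix counts are hypothetical; setting MCC to zero when its denominator vanishes is a common convention and is assumed here.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an empty row or column of the matrix gives MCC = 0.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def accuracy(tp, tn, fp, fn):
    """Share of all predictions that agree with the observed outcome."""
    return (tp + tn) / (tp + tn + fp + fn)


if __name__ == "__main__":
    # Hypothetical model that predicts non-default for all 100 loans,
    # 15 of which actually default (positive = default).
    counts = dict(tp=0, tn=85, fp=0, fn=15)
    print("Accuracy:", accuracy(**counts))
    print("MCC:", mcc(**counts))
```

In this hypothetical case Accuracy is high because non-defaults dominate the sample, whereas MCC is zero because not a single default is captured.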
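The final performance metric is Extreme Bias, which captures the situation in which a model fails to generate a correct classification of a credit class. It is described as follows:

$$\text{Extreme Bias} = \sum_{i=1}^{N} C_i, \qquad C_i = \begin{cases} 1, & \text{MCC} = 0 \text{ in the } i\text{th simulation} \\ 0, & \text{otherwise} \end{cases}$$

where N is the number of simulations.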
Essentially, the Extreme Bias of a model is the number of times the model generates an MCC = 0 (no better than random). This measure reveals situations in which mean Accuracy is high, but the prediction is extremely biased.
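Counting Extreme Bias over a set of simulation results is straightforward, as the following sketch shows. The function name `extreme_bias`, the optional tolerance `tol` and the MCC values are hypothetical.

```python
def extreme_bias(mcc_values, tol=0.0):
    """Count the simulations whose MCC equals zero (within an optional tolerance)."""
    return sum(1 for m in mcc_values if abs(m) <= tol)


if __name__ == "__main__":
    simulated_mcc = [0.0, 0.12, 0.0, -0.03, 0.30]  # hypothetical results
    print(extreme_bias(simulated_mcc))
```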

Data
Credit risk analysis is performed on two major datasets (see Table 1). The first is the peer-to-peer loans dataset of the Lending Club (Lending Club 2020). The scale of the platform's dataset and the maturity of loan portfolios (212,280 loans from 2007 to 2013) make it an ideal sample for testing various types of credit risk models (Chang et al. 2015;Malekipirbazari and Aksakalli 2015;Teply and Polena 2020;Tsai et al. 2009). The second, although much smaller in size (~30,000 loans for 2005), is the credit card clients dataset from Taiwan (Yeh 2006) used by Yeh and Lien (2009) to benchmark the predictive power of various credit classification models.

Empirical Results
Tables 2 and 3 summarize the relative performance of the proposed stacking model for the Lending Club's peer-to-peer loans dataset and the Taiwanese credit card clients dataset, respectively. Reported are the mean values of MCC and Accuracy as well as Extreme Bias count (zero value MCC count) over 100 simulations. Also reported is the standard deviation of the MCC values, giving an indication of performance consistency, and the significance test of differences (p < 10%) in mean MCC values (equal to or greater than) between the stacked model and the base models selected. The prediction statistics reported for the stacking models are for two to nine base models, where both the subsets of the original dataset and the base models are chosen at random (non-overlapping).
Distinctly, the proposed stacking model delivers better performance in default prediction, relative to the individual base models, and for both datasets. The mean MCC is always higher for the stacking model than for the individual base models, with significance tests strongly supporting this conclusion. Most notably, the stacking model achieves consistently better performance across the various data environments as indicated by the low standard deviation of MCC. In contrast, the performance of the individual base models is highly inconsistent, as indicated by the high standard deviation of MCC. Amongst the nine individual base models, Naïve Bayes provides the best average prediction performance (MCC = 0.14) for the Lending Club peer-to-peer loans dataset, whilst Random Forest provides the best average performance (MCC = 0.33) for the Taiwanese credit card clients dataset.
Compared to the individual base models, the stacking model provides the best overall performance, with the mean MCC value exceeding that of any of the individual base models selected, and with an overall agreement between prediction and observation twice as high for the Taiwanese credit card clients dataset as for the Lending Club's peer-to-peer loans dataset. While in a few cases the performance of the stacking model appears similar to that of the base model selected (as indicated by the mean MCC value), the individual base models always experience high Extreme Bias. For example, for the Taiwanese credit card clients dataset, while the mean MCC (about 0.32) for the Random Forest model is similar to that of the proposed stacking model, the Random Forest model experiences high Extreme Bias (4-8%), with the prediction of the base model no better than random.
Again, in terms of Accuracy, the stacking model delivers high and consistent performance across all data environments. Mean Accuracy of the stacking model tends to fluctuate close to 0.79 across all data environments. In contrast, for the individual base models, mean Accuracy fluctuates significantly between 0.66 and 0.84. None of the individual base models shows consistency in performance across the data environments.
Whilst the stacking model does not provide the highest mean Accuracy in all cases, in all cases it experiences the lowest Extreme Bias. This renders the Accuracy measure somewhat inapt in terms of judging prediction performance. At best, Accuracy should be used as a complement to MCC, with its usefulness viewed in terms of satisfactory consistency. That is, a good model should deliver relatively stable Accuracy.

Discussion
The computational effort in this paper has been in running a large number of simulations to capture different data environments. The results of the simulations presented in the previous section support the proposed stacking model in terms of providing more consistent performance across data environments and less biased performance in terms of default classification. Unlike previous studies, which have been unable to settle which base model exhibits superior default classification performance across multiple data environments (Ala'raj and Abbod 2016a; Lessmann et al. 2015;Li et al. 2018;Xia et al. 2018), this paper shows that careful selection of base models is not necessary. The performance of the proposed stacking model remains high and consistent despite changes in the number and type of base models used or in the data on which the model is trained. In other words, the reasoning process itself is somewhat agnostic as to which base model is selected, thus enabling replication of the stacking method in a wide range of situations and allowing meaningful comparative analysis across multiple data environments.
In essence, the power of the conceptual construct based on category theory lies in its capability to construct simple representations that capture the essence of credit risk modeling in a single concrete formalization (a category). It yields powerful insights into credit risk modeling that are difficult to identify using the traditional comparative analysis of individual base models frequently adopted in the literature. That is, different models and their underlying processes are just instances of the same modeling structures represented by a category. As a result, there is an equivalence (trap) between the various modeling processes, creating a performance boundary. That is, generalization power has meaning only within the categorical frame representing the modeling process. Consequently, two seemingly different credit risk models that have the same categorical structure will on average produce identical results if tested over all possible instances of the category. In practice, this process could continue indefinitely as new datasets create new instances. This has been clearly demonstrated by the empirical results, showing poor performance persistence of the base models selected across different data environments. It follows that representing credit risk modeling as a category yields a compact method to arrive at the equivalence concept without the burden of having to go through all possible empirical verifications, as revealed by the literature.

Conclusions
Two motivations underlie the application of category theory to credit risk modeling. First, it serves as a powerful tool to construct an inward view of our own reasoning processes in credit risk modeling. By using this view, invariant structures emerge and form a basis on which relationships between seemingly unrelated models can be constructed. Furthermore, category theory enables these structures to form relationships with new conceptual constructs in fields unrelated to credit risk modeling. This unique capability enlarges the space of potential modeling solutions, resulting in improved default prediction performance. Second, categorical constructs result in a new perspective on the meaning of risk beyond PDs. From this perspective, credit risk is not just a quantification of specific features but also a property emerging out of a network of relationships between various modeling processes represented by enriched categories. Thus, credit risk assessment is no longer an endeavor carried out with an isolated model; it has become a network phenomenon. Creating the theoretical framework is, therefore, a novel contribution to the current body of literature.
By focusing on credit risk through these structures, the equivalence implication was better understood and a stacking model was introduced with two new structures, enriched categories and information entropy. The empirical results showed that the stacking framework's performance remained robust despite changes in data environments and selection of the base models, thus enabling more objective replication. The conceptual structures, seemingly disconnected, turned out to be perfect companions in the stacking model.
That said, there are some limitations to the paper. The first issue relates to substantial computational overhead associated with implementing the proposed stacking model. Whilst there is no doubt that keeping the per-unit processing cost low is an important concern to credit providers, advances in supercomputing are likely to push computational costs down considerably soon. The second issue relates to the performance of the stacking model which could be tested more extensively by application to more datasets and by comparing with a larger number of base models, including deep learning and unsupervised learning. This could not only create a more dynamic testing environment but also provide more transparency for replication purposes. A unified stacking and dynamic model selection framework would enable more extensive statistical tests of performance, an objective that has so far been absent from the literature but could be a fruitful avenue for further research. A final issue of concern is that the focus on constructing classification models has value only at the time of application. The focus of risk managers is undoubtedly on the development of credit risk models that provide lenders with on-going predictive diagnosis of clients' credit risk status. However, this would require a richer dataset.
As the approach embraced in this paper is essentially exploratory in nature, it is likely to raise more questions than it answers regarding sound credit risk modeling.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Key Definitions in Category Theory
Definition A1. A category $\mathcal{C}$ has the following elements: A collection of objects denoted as $\mathrm{Ob}(\mathcal{C})$; For every two objects c and d, a set $\mathcal{C}(c, d)$ that consists of morphisms from c to d, written $f : c \to d$; For every object $c \in \mathrm{Ob}(\mathcal{C})$, a morphism $\mathrm{id}_c \in \mathcal{C}(c, c)$, called the identity morphism on c (for convenience, $c \in \mathcal{C}$ is used instead of $c \in \mathrm{Ob}(\mathcal{C})$); For every three objects $c, d, e \in \mathrm{Ob}(\mathcal{C})$ and morphisms $f \in \mathcal{C}(c, d)$ and $g \in \mathcal{C}(d, e)$, a morphism $f \bullet g \in \mathcal{C}(c, e)$, called the composite of f and g. These elements are required to satisfy the following conditions: For any morphism $f : c \to d$, $\mathrm{id}_c \bullet f = f$ and $f \bullet \mathrm{id}_d = f$, which is called the unitality condition; For any three morphisms $f : c_0 \to c_1$, $g : c_1 \to c_2$ and $h : c_2 \to c_3$, the following are equal: $(f \bullet g) \bullet h = f \bullet (g \bullet h)$. This is called the associativity condition.
Definition A2. The category $\mathbf{Set}$ is defined as follows: $\mathrm{Ob}(\mathbf{Set})$ is the collection of all sets; If X and Y are sets, then $\mathbf{Set}(X, Y) = \{ f : X \to Y \}$, where f is a function; For each set X, the identity function $\mathrm{id}_X : X \to X$ is given by $\mathrm{id}_X(x) := x$ for each $x \in X$; Given $f : X \to Y$ and $g : Y \to Z$, their composite function is $(f \bullet g)(x) := g(f(x))$. Since these elements satisfy the unitality and associativity conditions, $\mathbf{Set}$ is indeed a category.
Definition A3. A functor between two categories C and D, denoted F : C → D, is defined as follows: for every object c ∈ Ob(C), there is an object F(c) ∈ Ob(D); for every morphism f : c_0 → c_1 in C, there is a morphism F(f) : F(c_0) → F(c_1) in D. These elements are required to satisfy the following conditions: for every object c ∈ Ob(C), F(id_c) = id_{F(c)}; for any three objects c_0, c_1, c_2 ∈ C and two morphisms f : c_0 → c_1 and g : c_1 → c_2, the equation F(f • g) = F(f) • F(g) holds in D.
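As an illustration of Definition A3, the sketch below treats Python lists as a functor from Set to Set: a set c is sent to the lists with entries in c, and a function f is sent to the function that applies f to every entry. This example is an assumption introduced here for illustration and does not appear in the original paper; the helper name `fmap` is hypothetical.

```python
def compose(f, g):
    """Diagrammatic composite f • g: apply f first, then g."""
    return lambda x: g(f(x))


def identity(x):
    return x


def fmap(f):
    """Action of the list functor on a morphism: F(f) applies f to each entry."""
    return lambda xs: [f(x) for x in xs]


# Hypothetical morphisms and a sample element of F(c_0).
f = lambda n: n + 1      # f : c_0 -> c_1
g = lambda n: n * 10     # g : c_1 -> c_2
sample = [1, 2, 3]

# F(id_c) = id_{F(c)}, checked on the sample.
preserves_identity = fmap(identity)(sample) == identity(sample)

# F(f • g) = F(f) • F(g), checked on the sample.
preserves_composition = fmap(compose(f, g))(sample) == compose(fmap(f), fmap(g))(sample)

print(preserves_identity, preserves_composition)
```

The check is performed on a single sample list, so it illustrates the two functor conditions rather than proving them.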
Definition A4. A C-instance of the category C is a functor I : C → Set.
Definition A5. Let C and D be categories and F, G : C → D be functors. A natural transformation α : F → G is defined as follows: for each object c ∈ Ob(C), there is a morphism α_c : F(c) → G(c) in D, called the c-component of α. These components must satisfy the following naturality condition: for every morphism f : c → d in C, the following equation holds.
F(f) • α_d = α_c • G(f).
A natural transformation α : F → G is called a natural isomorphism if each component α_c is an isomorphism in D. The naturality condition can be represented as a commutative square: the two paths from F(c) to G(d), namely F(f) followed by α_d and α_c followed by G(f), coincide.
The concept of natural transformation plays an important role in understanding relations between two categories. It describes how the two functors F and G can be used as two representations of the category C inside D, with the natural transformation α connecting these two representations using the morphisms in D.
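The naturality condition of Definition A5 can be illustrated with two functors from Set to Set. In the sketch below, which is an assumption introduced for illustration and not taken from the original paper, F sends a set to lists over it, G sends a set to its finite subsets, and each component α_c converts a list into the set of its entries. The names `list_map`, `set_map` and `alpha` are hypothetical.

```python
def compose(f, g):
    """Diagrammatic composite f • g: apply f first, then g."""
    return lambda x: g(f(x))


def list_map(f):
    """F(f): apply f to every entry of a list."""
    return lambda xs: [f(x) for x in xs]


def set_map(f):
    """G(f): apply f to every element of a finite set."""
    return lambda s: frozenset(f(x) for x in s)


def alpha(xs):
    """Component of α at any object: forget order and multiplicity of a list."""
    return frozenset(xs)


# Hypothetical morphism f : c -> d and a sample element of F(c).
f = lambda n: n % 3
sample = [1, 2, 3, 4, 4]

lhs = compose(list_map(f), alpha)(sample)   # F(f) • α_d
rhs = compose(alpha, set_map(f))(sample)    # α_c • G(f)

naturality_holds = lhs == rhs
print(naturality_holds)
```

Both paths around the square yield the set of values of f on the entries of the list, which is why the two sides agree for any choice of f and sample.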
In order to arrive at enriched categories, the following definitions apply. It is trivial to show that this structure forms a symmetric monoidal structure.
Definition A10. Let V = (V, ≤, I, ⊗) be a symmetric monoidal preorder. A V-category X has the following two elements: a set Ob(X), elements of which are called objects; for every two objects x, y ∈ Ob(X), an element X(x, y) ∈ V, called the hom-object. These elements must satisfy the following two properties: for every object x ∈ Ob(X), I ≤ X(x, x); for every three objects x, y, z ∈ Ob(X), X(x, y) ⊗ X(y, z) ≤ X(x, z). In this case, X is said to be enriched in V.
Notes: "Positive (P)" is the term used to describe a prediction of default and "Negative (N)" a prediction of non-default. "True (T)" means the actual data agrees with the prediction, whilst "False (F)" means the data does not agree with the prediction.
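The two properties in Definition A10 can be checked on a small example. The sketch below assumes, purely for illustration, the symmetric monoidal preorder V = ([0, ∞], ≥, 0, +), in which the order of V is the reverse of the usual order on numbers, the unit I is 0 and ⊗ is addition. The objects are hypothetical points on the real line and the hom-object X(x, y) is taken to be the distance |x − y|. None of these choices come from the original paper.

```python
from itertools import product

# Assumed symmetric monoidal preorder V = ([0, inf], >=, 0, +).
UNIT = 0.0


def leq(a, b):
    """Order of V: a is 'below' b exactly when a >= b as numbers."""
    return a >= b


def tensor(a, b):
    """Monoidal product of V: addition."""
    return a + b


# Hypothetical objects: three points on the real line.
objects = [0.0, 1.5, 4.0]


def hom(x, y):
    """Hom-object X(x, y) in V: the distance between x and y."""
    return abs(x - y)


# Property 1: I <= X(x, x) for every object x.
identity_law = all(leq(UNIT, hom(x, x)) for x in objects)

# Property 2: X(x, y) ⊗ X(y, z) <= X(x, z) for every x, y, z.
composition_law = all(
    leq(tensor(hom(x, y), hom(y, z)), hom(x, z))
    for x, y, z in product(objects, repeat=3)
)

print(identity_law, composition_law)
```

Under these assumptions the second property is the triangle inequality, so a V-category of this kind behaves like a set of points equipped with a distance.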