Risk Assessment of Polish Joint Stock Companies: Prediction of Penalties or Compensation Payments

Corporate misconduct is a widespread problem in the economy. Many companies make mistakes that force them to pay penalties or compensation to other businesses, and some of these cases are serious enough to take a toll on a company’s financial condition. The purpose of this paper was to create and evaluate an algorithm that can predict whether a company will have to pay a penalty and to discover which financial indicators may signal it. The author addresses these questions by applying several supervised machine learning methods. Such an algorithm may help financial institutions such as banks decide whether to lend money to companies that are not in good financial standing. The research is based on information contained in the financial statements of companies listed on the Warsaw Stock Exchange and NewConnect. Finally, the different methods are compared, and methods based on gradient boosting are shown to achieve higher accuracy than the others. The conclusion is that the values of financial ratios can signal which companies are likely to pay a penalty in the following year.


Introduction
The assessment of an enterprise's activities is especially important, not only from the point of view of management but also of counterparties, of investors ready to commit their capital, and of other interdependent companies. Before deciding to grant a loan, many financial institutions, such as banks, are obliged to confirm the credibility of both individual and corporate customers. To support the analysis of company activities, advanced credit scoring algorithms are created. Their main objective is to streamline the evaluation process and minimize the potential losses that business entities incur from erroneous and costly decisions.
The evaluation of companies has been an important and widely studied topic for decades. A vast number of algorithms have been proposed (Beaver 1966; Altman 1968; Ohlson 1980; Zmijewski 1984; Betz et al. 2014; Mselmi et al. 2017; Pisula 2017; Shrivastav and Ramudu 2020). Nowadays, given the recent growth of big data, the most popular approach is the implementation of machine learning techniques (Barboza et al. 2017; Le and Viviani 2018; Petropoulos et al. 2020; Jabeur et al. 2021; Pham and Ho 2021). In creating evaluation algorithms, it is also important to discover why one company is riskier than another and which variables affect the final prediction results. With advanced machine learning methods, it is difficult to obtain such information directly. Model explainability is therefore a significant aspect of modelling. Many researchers focus only on the effectiveness of models, not their explainability. This approach makes the models more effective, but not easily interpretable.
Companies are evaluated in many ways, but most often through the analysis of the probability of events such as bankruptcy or losses (Jabeur et al. 2021; Pham and Ho 2021; Pisula 2017), as the literature amply demonstrates. The main purpose of this paper was to present an original solution for the evaluation of companies in terms of predicting a negative event: the payment of penalties or compensation. This study takes a different approach by focusing on a different, but also significant, dependent variable. An in-depth analysis of the methods used to assess companies led to the development of the following hypothesis: the values of financial indicators signal, one year in advance, the occurrence of a negative event in the form of penalties or compensation, which is reflected in the financial situation of the business entity. Financial problems may translate into delays in the delivery of products and services, which in turn leads to sanctions. For example, in 2018 and 2019, the Polish company Elektrobudowa SA had to pay large penalties and compensation to another Polish company because of delays in the fulfilment of a contract. Consequently, this was one of the factors that contributed to the drop in the company's financial indicators and its financial collapse, which eventually led to its bankruptcy. To verify this hypothesis, machine learning methods were used. The problem under investigation is one of classification, and thus supervised learning methods were implemented. Another goal of this paper was to propose a solution that points out the financial indicators which signal the occurrence of the analyzed negative event. The final results could support firms in decision-making processes.
Section 2 of the article reviews the literature and presents the latest scientific achievements related to the use of classic statistical or machine learning methods in assessing the activities of entities, companies or individual bank customers and the likelihood of negative events occurring in their activities. This section also presents evaluations of individual credit applicants in order to demonstrate that similar algorithms are used and thus the same methods can be implemented to assess both corporate and individual clients. Among other things, it shows which methods were used in the early years of machine learning and which have recently gained in importance. The review also focuses on the data used in the studies in question, which showed that this is a global issue rather than a regional one. This is followed by a comparison of the use of variables. Based on the literature review, it can be concluded that there is a trend toward creating models based on a limited set of popular dependent variables. The same publicly available datasets are used, available on platforms such as Kaggle or the UCI Machine Learning Repository (Marqués et al. 2012; Tsai and Wu 2008). This work attempts to include a dependent variable that is not commonly used and to demonstrate that information on penalties and compensation paid is a valid dependent variable in scoring modelling.
The third section presents the research methodology. It describes the individual stages of the research procedure, i.e., the selection of variables for the model and the correlation analysis between them, the sampling for modelling and its balancing, as well as the presentation of the chosen supervised learning methods and the techniques for evaluating the developed models. Section 4 presents the results of the conducted research, with a description of the chosen dataset, which contains indicators calculated mostly from financial information in companies' financial statements. These indicators are also used, among others, in fundamental analysis. Since this information was readily available, data on business entities listed on the Warsaw Stock Exchange and NewConnect were used, and an analysis of their descriptive statistics and distributions was carried out. The results of the study are presented in both tabular and graphic forms. To select a set of the most important variables that help to predict the studied phenomenon, the SHAP approach was used, which is based on Shapley values originating from game theory. All experimental analyses were implemented using Python.
The final sections include a summary of the study and proposals for further research. The topic covered in this article has not been exhausted. Each year, more and more data are collected that ought to encourage the use of new and more advanced analytical techniques. The assessment of entities also attracts the interest of other researchers.

Literature Review
Methods for evaluating the probability of entities experiencing a negative occurrence appeared in the literature as early as the twentieth century. One of the first models, which is still in use, was created by Altman and is known as the Z-score (Altman 1968). It was built using discriminant analysis and is nowadays used to predict corporate bankruptcy. Out of the initial twenty-two variables, five were selected for its construction (Altman 1968). The model is still being developed and is used to predict the insolvency of companies. Almamy et al. (2016) used the Altman model to estimate the probability of such an event among British companies during the financial crisis. Despite the passage of years, it has proven to remain precise. The Altman model encouraged other researchers to take up the evaluation of selected aspects of business activities. Over time, more and more advanced analytical methods were developed. This, in turn, attracted more interest in the use of algorithms to assess entities.
In the twenty-first century, machine learning has played an increasingly important role in the construction of such algorithms. Colloquially, machine learning is defined as the ability of machines to learn without being programmed directly, a definition coined by Arthur Samuel in 1959 (Awad and Khanna 2015). Evaluation algorithms are based on a branch of machine learning known as supervised learning, which resembles learning with the help of a teacher: the model learns to assign labels to new input data from examples previously labelled by humans (Chollet 2018). However, this does not exclude the use of other methods in scoring algorithms, such as unsupervised learning, whose main purpose is to discover dependencies and patterns in data (Chollet 2018). Such methods are the basis for segmentation and for the construction of recommendation systems. An example of the use of unsupervised learning in the assessment of taxpayers is a publication by Colombian researchers (de Roux et al. 2018). They analyzed declarations of the Urban Delineation tax in Bogota to detect under-reporting taxpayers, basing their calculations on a sample of 1367 tax declarations. They divided the declarations into smaller groups using a spectral clustering technique, then marked the declarations that stood out from the others in each of the created clusters. Finally, they submitted the selected declarations for expert verification (in-depth analysis).
Among supervised learning methods, a very popular trend in entity evaluation models is the use of logistic regression (Barboza et al. 2017; Mselmi et al. 2017; Le and Viviani 2018; Zhou 2013; Zhao et al. 2009; Zizi et al. 2020) and support vector machines (Barboza et al. 2017; Geng et al. 2015; Harris 2015; Mselmi et al. 2017; Xia et al. 2018; Zhou 2013; Shrivastav and Ramudu 2020). Neural networks are also used. Tsai and Wu (2008) followed this path in their research, using neural networks to predict bankruptcy and evaluate creditworthiness with credit data from three countries: Germany, Japan and Australia. Neural networks also appeared in (Zhou 2013), except that the focus was on American (1981-2009) and Japanese (1989-2009) non-financial companies. In recent years, however, there has been an increase in the use of ensemble classifiers, such as random forests (Ala'raj and Abbod 2016; Barboza et al. 2017), as well as boosting-based methods, such as gradient boosting (Tian et al. 2020; Pham and Ho 2021), adaptive boosting, i.e., AdaBoost (Sun et al. 2020; Marqués et al. 2012; Pham and Ho 2021), extreme gradient boosting, i.e., XGBoost (Chang et al. 2018; Xia et al. 2018), or the increasingly popular categorical boosting, i.e., CatBoost (Jabeur et al. 2021). These methods combine several weaker classifiers; as a result, a stronger classifier is created, which, by design, increases accuracy (Bequé and Lessmann 2017). Similar methods have been used to assess both types of entities: companies and individuals. In both cases, the effectiveness of the models was satisfactory. Given the increasing attention paid to the assessment of companies, advanced machine learning methods have recently grown in importance.
In the literature, the use of datasets from different parts of the world can be observed, which shows that this is a global topic. Pisula (2017) used and compared different ensemble classifiers to assess the phenomenon of production companies going bankrupt in a Polish region, based on a sample of 144 records. Harris (2015) compared the results of machine learning methods using two historical credit scoring datasets. In both cases, the information concerned credit applicants with and without creditworthiness. The author used a sample of 1000 observations from Germany with 20 variables and a credit union dataset from Barbados with 21,620 observations and 20 variables. In their empirical studies, Spanish researchers (Marqués et al. 2012) used six datasets. Like the aforementioned Tsai and Wu, they used credit datasets from Germany, Japan and Australia and supplemented their calculations with information from the United States, Iran and Poland.
The common denominator of the analyzed publications is the use of financial indicators to predict the occurrence of negative events in companies (Sahin et al. 2013; Pham and Ho 2021; Patel and Prajapati 2018; Park et al. 2021; Monedero et al. 2012; Harris 2015; Zizi et al. 2021). Individual customer scoring was instead built on information about a person's life, such as their gender, marital status and location.
The literature review has shown that individuals were mainly assessed with regard to their creditworthiness. For companies, the dependent variable was based on whether the business entity declared bankruptcy or made a profit or loss. The results presented in the literature indicate that a very important aspect has been neglected. The dependent variable could be based on other information, for example, whether the business has had to pay a penalty or compensation to another firm. The purpose of this article was to verify whether this aspect is significant in the assessment of a company's performance. Information regarding the payment of significant penalties or compensation may indicate relevant intentional or accidental actions of the enterprise, affect its credibility and signal whether it can successfully navigate the current reality and regulations. This is particularly important for companies listed on the stock exchange, where such information can have an impact on investor relations and the transparency of businesses. Moreover, it may undermine their financial liquidity, especially if penalties are imposed by supervisory institutions such as The Polish Financial Supervision Authority or The Office of Competition and Consumer Protection in Poland.

Analysis of Variables
The first step was the analysis of variables. Basic statistics were calculated, and variables were visualized using histograms and quantile-quantile plots. The normality of distributions was tested with the Shapiro-Wilk test, in which the null hypothesis states that the studied distribution is normal, while the alternative hypothesis states that it is not. In analyzing the distribution of variables, it was also essential to compare the values of the calculated skewness coefficients. Variables with extremely high or low values have to be properly transformed. Because some values were negative, the logarithmic transformation applied in some empirical studies (Feng et al. 2014) was rejected, and another method had to be found. Finally, if a significant right-hand or left-hand skewness was detected, the transformation defined by Formula (1) was used, in which x denotes the variable undergoing transformation.
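As an illustration of this step, the sketch below applies a sign-preserving log transform to a small skewed sample and checks that the skewness coefficient drops. The transform shown is a hypothetical stand-in for the paper's Formula (1), chosen only because, unlike a plain logarithm, it is defined for negative values; the sample numbers are invented.

```python
import math

def signed_log(x):
    """Hypothetical stand-in for Formula (1): a sign-preserving log
    transform that, unlike log(x), is defined for negative values."""
    return math.copysign(math.log1p(abs(x)), x)

def skewness(values):
    """Sample skewness coefficient (third standardized moment)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n

# A small right-skewed sample containing a negative value,
# as often happens with financial ratios.
ratios = [-0.4, 0.1, 0.2, 0.3, 0.5, 0.8, 1.1, 9.0, 25.0]
before = skewness(ratios)
after = skewness([signed_log(v) for v in ratios])
print(before > after)  # the transform reduces the right-hand skewness
```

The same pattern, applied column by column, reproduces the transformation step described above.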
The next step was the analysis of the correlations between variables. Three measures were used: the Spearman rank correlation, the Phi coefficient (φ) and the Fisher test. The Spearman rank correlation was chosen to explore the relationships between continuous independent variables; the Phi coefficient (φ) measured the association between the (dichotomized) continuous independent variables and the binary dependent variable; and the Fisher test examined the association between the binary independent variable and the other (continuous) independent variables. Since the Phi coefficient (φ) and the Fisher test apply to pairs of dichotomous variables, the continuous variables first had to be dichotomized by classifying the observations into two new categories. After this transformation, both methods were applied.
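Two of these measures can be sketched in pure Python: Spearman's coefficient is the Pearson correlation of the ranks, and the Phi coefficient is computed from the 2×2 contingency table of two 0/1 variables. The variable values below are invented for illustration.

```python
import math

def ranks(xs):
    """Average ranks (1-based), handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def phi(x, y):
    """Phi coefficient for two dichotomous (0/1) variables."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

liquidity = [1.2, 0.8, 2.5, 0.9, 3.1]  # invented ratio values
debt      = [0.5, 1.4, 0.2, 1.1, 0.1]
print(round(spearman(liquidity, debt), 2))  # -1.0 (perfectly opposite rankings)
print(phi([1, 1, 0, 0], [1, 1, 0, 0]))     # 1.0 (perfect association)
```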

Sample Selection for the Modelling Process
Because the data on the financial activity of companies span three years and include businesses with varying sales volumes, it was decided that, instead of simple random sampling, it would be more appropriate to stratify the sample with respect to other variables or additional information. Stratification avoided a situation where only large companies ended up in the training set and small companies in the test set, which could cause the model to learn well on the training set but perform poorly on the test set. The collected data was divided into several smaller subsamples, and an adequate percentage of "good" and "bad" examples was selected from each subsample and assigned to the training set or the test set. The process was divided into three parts, which are described in detail in Section 4.4. This division resulted in an unbalanced sample, with one class containing more records than the other. Used as-is, such samples could lead to incorrect results and render the models useless. One of the most popular remedies cited in the literature (Wang et al. 2020; Sun et al. 2020; Ng et al. 2021; Maldonado et al. 2019; Zizi et al. 2021) is SMOTE (Synthetic Minority Over-sampling Technique). Zhou (2013) showed the superior effectiveness of models based on sets artificially balanced using SMOTE in comparison to other balancing methods such as under-sampling. The purpose of this technique is to generate an appropriate, fixed number of "synthetic" minority class records. In the space of variables, a point from the class to be enlarged is selected; its k-nearest neighbors from the same class are determined, and an additional point is generated at a random location on the line between the chosen point and one of those neighbors (Chawla et al. 2002).
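The over-sampling step just described can be sketched as follows. This is a minimal, pure-Python illustration of the SMOTE idea (Chawla et al. 2002), not the library implementation used in practice, and the sample points are invented.

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Minimal SMOTE sketch: for each synthetic record, pick a random
    minority point, pick one of its k nearest minority neighbours, and
    interpolate at a random position on the segment between them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        # k nearest neighbours of p within the minority class (excluding p)
        neighbours = sorted(
            (q for q in minority if q is not p),
            key=lambda q: math.dist(p, q),
        )[:k]
        q = rng.choice(neighbours)
        t = rng.random()  # random location on the line between p and q
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
    return synthetic

# "Bad" class with far fewer records than the "good" class.
bad = [(0.1, 1.0), (0.2, 1.1), (0.3, 0.9), (0.15, 1.05)]
new_points = smote(bad, n_new=4)
print(len(bad) + len(new_points))  # 8 records after over-sampling
```

Because each synthetic point is a convex combination of two minority points, all generated records stay inside the region already occupied by the minority class.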

Supervised Learning
In this paper, six supervised learning algorithms were used to make predictions and assign labels to observations using a model trained on labelled input data. The approach uses classic methods, namely logistic regression and a decision tree, as well as boosting-based methods: gradient boosting, extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM) and categorical boosting (CatBoost). These methods are briefly described below. All of them rely on the following libraries implemented in Python: Scikit-learn, xgboost, catboost and lightgbm. The choice of methods is not accidental: in many publications, boosting-based methods have been used to assess the probability of negative financial events in companies (Jabeur et al. 2021; Pham and Ho 2021). It was decided to verify their effectiveness in predicting the payment of penalties or compensation by Polish joint stock companies.
The first method used was logistic regression. It can be applied to predict both dichotomous and multiclass variables. Its assumptions are less rigorous than those of linear regression, but it is important to avoid independent variables that are strongly correlated among themselves or collinear (Jabeur et al. 2021). In this method, coefficients are not interpreted directly. The most informative quantities are the odds ratios, derived from the odds, i.e., the ratio of the probability of success to the probability of failure. In addition to finance, logistic regression is also widely used in medical research, for example to assess the likelihood of re-infection with a specific disease or of recovery from an illness (Fawcett and Provost 2014). The second method was a decision tree, which is applied to solve both classification and regression problems and is often used in the development of decision support tools (Al-Hashedi and Magalingam 2021). Decision trees also serve as base classifiers in ensemble methods, on top of which more powerful classifiers are built. A decision tree consists of a root node, branches and leaf nodes: each node represents a feature, each branch a decision, and each leaf an outcome in categorical or continuous form (Patel and Prajapati 2018). Along with logistic regression, it is one of the most easily interpretable algorithms used in fraud detection (Monedero et al. 2012; Sahin et al. 2013). This paper uses the CART technique with the Gini impurity measure. In machine learning, ensemble classifiers play a significant role. They rely on combining the results of single methods, the so-called base classifiers (Marqués et al. 2012). While choosing the individual techniques is important, so is knowing how to combine them (Sesmero et al. 2021); multiple techniques are combined to improve accuracy.
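Before turning to the ensembles, the odds-ratio interpretation of a logistic regression coefficient mentioned above can be illustrated briefly; the coefficient value below is hypothetical.

```python
import math

def odds(p):
    """Odds of an event: probability of success over probability of failure."""
    return p / (1 - p)

def odds_ratio(beta):
    """Odds ratio implied by a logistic regression coefficient beta:
    a one-unit increase in the variable multiplies the odds by exp(beta)."""
    return math.exp(beta)

# A hypothetical coefficient for a debt ratio: beta = 0.7 means that a
# one-unit increase multiplies the odds of paying a penalty by about 2.01.
print(round(odds_ratio(0.7), 2))  # 2.01
```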
It has been established that combining methods produces more accurate prediction results than single base classifiers (Dietterich 1997). In short, the process can be described as combining weak classifiers in order to obtain a more powerful one. It should be mentioned that this comes at the cost of interpretability: ensemble classifiers are not as easy to interpret as logistic regression or a single decision tree. There are three main approaches to building ensemble classifiers: bagging, boosting and stacking. This research used algorithms based on boosting: gradient boosting, extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM) and categorical boosting (CatBoost), all of which use a decision tree as the base classifier. The last three are recognized as very effective in predicting events in disciplines other than finance, as confirmed by the many winning models in competitions on the Kaggle website that use these algorithms (Sagi and Rokach 2021). The first boosting-based algorithm is gradient boosting, also known as the gradient boosting machine (GBM). It can be used to solve both classification and regression problems and works by reducing the prediction errors made by previous classifiers. It was first proposed by Friedman (Friedman 2002). Many algorithms, including XGBoost, LightGBM and CatBoost, build on the gradient boosting method to improve scalability (Jabeur et al. 2021). XGBoost was developed by Chen and Guestrin (Chen and Guestrin 2016). This technique adds a regularization component to the loss function, so that model complexity is taken into account at each split. In addition, XGBoost can counter excessive overfitting through the tuning of multiple hyperparameters (Sagi and Rokach 2021).
LightGBM, unlike many methods based on decision trees, relies on a leaf-wise tree growth algorithm rather than a depth-wise one. Not only are the resulting, more complex trees more accurate, but the method has also proven to be faster than gradient boosting (Ke et al. 2017). One disadvantage of the leaf-wise approach is the possibility of overfitting on smaller datasets. The CatBoost ensemble classifier is the youngest of the algorithms mentioned in this paper. It is a modification of the standard gradient boosting algorithm, proposed by Yandex employees, who continue to develop it (Prokhorenkova et al. 2018). It copes well with both numerical and categorical data and can also be used on small datasets (Jabeur et al. 2021).
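A minimal sketch of such a model comparison is shown below, using scikit-learn only and a synthetic stand-in for the financial-ratio dataset; the xgboost, lightgbm and catboost libraries expose the same fit/predict interface, so swapping them in is mechanical. The dataset parameters echo the paper's sample (928 records, 19 variables) but the data itself is randomly generated.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dataset: 928 records, 19 variables,
# roughly 30% minority ("bad") class.
X, y = make_classification(n_samples=928, n_features=19,
                           weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree (CART, Gini)": DecisionTreeClassifier(criterion="gini"),
    "gradient boosting": GradientBoostingClassifier(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```

On real data, hyperparameter tuning and cross-validation would of course precede any such comparison.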

Model Evaluation and Interpretation (SHAP Approach)
To evaluate the created models, two measures were used: AUC (Area Under the Curve) and Cohen's kappa, the latter being based on the confusion matrix. These two popular evaluation techniques are used in classification models and for model comparison; the higher their value, the more accurate the model. AUC values range from zero (a very bad model) to one (an ideal model), and a value of 0.5 indicates that the model's prediction is random (Rachakonda and Bhatnagar 2021). The higher this metric, the more accurately the model ranks a random element of the positive class above a random element of the negative class. Cohen's kappa takes values between −1 and 1: values close to zero indicate agreement no better than chance, while values below zero indicate a classification worse than random assignment to classes.
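Cohen's kappa can be computed directly from the observed and chance-expected agreement; a minimal sketch with invented labels:

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between predicted and true labels,
    corrected for the agreement expected by chance; ranges from -1 to 1."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    labels = set(y_true) | set(y_pred)
    # chance agreement from the marginal label frequencies
    expected = sum(
        (y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels
    )
    return (observed - expected) / (1 - expected)

y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # invented "bad"/"good" labels
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(round(cohens_kappa(y_true, y_pred), 2))  # 0.47
```

In practice this is equivalent to `sklearn.metrics.cohen_kappa_score`.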
As the volume of data increases, so does the use of advanced and more complex predictive algorithms. This is linked to the difficulty of interpreting models due to their "black-box" character. The SHAP (SHapley Additive exPlanations) approach is becoming increasingly popular and has appeared in many publications (Lundberg and Lee 2017; Mangalathu et al. 2020; Futagami et al. 2021; Dumitrescu et al. 2020; Bakouregui et al. 2021; Severino and Peng 2021). It is based on the Shapley value, which is derived from game theory; its purpose is to distribute the profit or pay-out among players depending on their contribution to the final result of a given game (Bakouregui et al. 2021). This approach was applied to the interpretability of machine learning algorithms by Lundberg and Lee (2017). In this context, its purpose is to assess the contribution of each analyzed variable to the final prediction result (Futagami et al. 2021).
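The underlying idea can be illustrated with an exact Shapley computation on a toy two-feature "prediction game". The payoff numbers and feature names are invented; for real models the shap library approximates these values efficiently.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over all orderings in which the coalition can be formed."""
    contrib = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = []
        for p in order:
            before = value(frozenset(coalition))
            coalition.append(p)
            contrib[p] += value(frozenset(coalition)) - before
    return {p: c / len(orderings) for p, c in contrib.items()}

# Toy "prediction game": the model output obtained when only a subset of
# the features X1 (liquidity) and X2 (debt) is known; numbers are invented.
payoff = {
    frozenset(): 0.10,              # baseline prediction
    frozenset({"X1"}): 0.30,
    frozenset({"X2"}): 0.50,
    frozenset({"X1", "X2"}): 0.60,  # full-model prediction
}
shap = shapley_values(["X1", "X2"], lambda s: payoff[s])
print(shap)  # the contributions sum to 0.60 - 0.10 = 0.50
```

Each variable's SHAP value is thus its fair share of the gap between the baseline prediction and the full-model prediction.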

Description of Data
All of the necessary information came from the financial statements of businesses. In addition, the dependent variable was constructed on the basis of the lists of businesses given penalties that are published on the official website of The Polish Financial Supervision Authority. Based on the acquired data, dependent and independent variables were built. Since penalties or compensation are a consequence of earlier actions, it was decided that the independent variables were to precede the dependent variable by one year. Independent variables were constructed based on data from the years 2016-2018, while the dependent variable was based on data from the years 2017-2019. The total number of collected observations was 928. Table 1 contains information about the total number of companies in each year relative to the dependent variable. The dependent variable determines whether the company has paid a significant penalty or compensation to another firm ("bad") or not ("good"). It is based on the amount of the penalty or compensation compared to the revenue from the same period. In some financial statements, penalties are classified under "other operating expenses", which, apart from the amount paid, also include other items, such as the cost of court proceedings or administrative costs. However, since these amounts were insignificant, it was decided to build the dependent variable on the basis of "penalties/compensation paid" without considering "other operating costs".
The first stage involved calculating the ratio of penalties paid to revenues. If the sales revenues were zero, the ratio was also equal to zero. There were no cases where a business entity paid penalties and had sales revenues equal to zero in the same year. Due to the high value of the skewness coefficient (greater than 14), it was decided to transform this variable according to Formula (1), where x is the quotient of the amount of penalty/compensation divided by the sales revenues generated in the year when the penalty/compensation was paid.
Based on the transformation according to Formula (1), it was decided that the final shape of the dependent variable could be expressed using Formula (2). In addition, the amount of penalties was taken into consideration: all observations where the amount of the penalty or compensation was lower than or equal to 10,000 PLN were ignored and classified as "good". In other words, penalties below 10,000 PLN were arbitrarily classified as irrelevant. In Formula (2), z is the transformed quotient of the penalty paid divided by the sales revenues according to Formula (1), while Amount refers to the amount of the penalty or compensation paid to another company. A total of 284 records labelled as "bad" were obtained. Table 2 contains information about the number and percentage of records classified as "bad" by year.
On the basis of the collected financial information, a total of nineteen independent variables were built (see Table 3). Many of them are financial indicators used, for example, in fundamental analysis, and they are mainly continuous variables. Only one of them, a characteristic indicating whether a company has made a profit or a loss, was classified as dichotomous. These variables were chosen because of their important role in assessing a company's financial standing. They describe significant aspects such as liquidity, profitability, investments, sales and debt. They have often been used to predict negative events such as bankruptcy or financial distress (Barboza et al. 2017; Mselmi et al. 2017), so it was decided to check their utility in predicting penalty or compensation payments.
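The dependent-variable labelling rule described above can be sketched as follows. The 10,000 PLN amount threshold and the zero-revenue convention come from the text; the sign-preserving log transform standing in for Formula (1) and the cut-off on the transformed ratio z are illustrative assumptions, not the paper's exact Formula (2).

```python
import math

def transform(x):
    """Hypothetical stand-in for Formula (1): sign-preserving log transform."""
    return math.copysign(math.log1p(abs(x)), x)

def label(penalty_amount_pln, sales_revenue_pln, z_cutoff=0.0):
    """Sketch of the labelling rule behind Formula (2); the cut-off on the
    transformed ratio z is an illustrative assumption."""
    if penalty_amount_pln <= 10_000:
        return "good"  # penalties below 10,000 PLN were deemed irrelevant
    # Convention from the text: zero revenue gives a ratio of zero.
    ratio = penalty_amount_pln / sales_revenue_pln if sales_revenue_pln else 0.0
    z = transform(ratio)
    return "bad" if z > z_cutoff else "good"

print(label(5_000, 1_000_000))    # good: below the 10,000 PLN threshold
print(label(250_000, 2_000_000))  # bad: significant relative to revenue
```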

Analysis of Independent Variables
In the dataset, one variable was measured on a dichotomous scale: it determines whether a firm has made a profit (indicated as "1") or a loss or a result equal to zero (marked as "0") from operations in a given year. Table 4 shows the number of occurrences of each category. Less than 29% of observations recorded a loss in a given year, while over 71% recorded a profit. Most of the companies were analyzed three times because their data was collected over three subsequent years; some companies recorded both a profit and a loss, depending on the year. For continuous variables, descriptive statistics were calculated (see Table 5). The first stage of the analysis focused mainly on two metrics: the coefficient of variation and the skewness coefficient. The former provided information on a variable's diversification and the latter on the shape of its distribution. Table 5 shows that each of the variables was characterized by a high variability, exceeding 100% in absolute value. This is a desirable phenomenon, so all variables were taken into account in the next stage.
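The coefficient of variation used here can be sketched briefly; the ratio values below are invented for illustration.

```python
import math

def coefficient_of_variation(values):
    """Coefficient of variation: standard deviation relative to the mean,
    expressed as a percentage, as reported in Table 5."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return 100 * sd / mean

# A hypothetical profitability ratio with strong dispersion across companies.
ratio = [0.02, 0.05, -0.40, 0.10, 1.80, 0.03]
print(round(coefficient_of_variation(ratio), 1))  # 264.6, well above 100%
```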
However, what is not desirable is the significant right- and left-sided skewness, which is quite common for economic (financial) data. Each of the variables had an absolute skewness coefficient greater than 2. Because many values were negative, the popular logarithmic conversion could not be applied directly; instead, the same transformation as in the case of the dependent variable, given by Formula (1), was used. Table 6 contains the skewness coefficient values for the continuous variables after transformation, while Figure 1 visualizes their distributions after the change. After transformation, it was noticed that the absolute value of the skewness coefficient increased for one variable, labelled X10 (the debt to equity ratio), instead of decreasing. It was nevertheless concluded that the transformed version of this variable should be included in the subsequent stages of the analysis. The distributions of this variable before and after conversion are visualized in Figure 2. The increase in the skewness coefficient, despite the transformation, can be explained by the fact that the distribution of the transformed X10 is a mixture of two distributions. The subsequent stages of this research used the variables modified according to Formula (1).
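Formula (1) itself is not reproduced in this section. A commonly used transform of this kind, which preserves the sign of negative values while compressing both tails, is the signed logarithm sketched below; it is shown purely as an assumption for illustration, not as the study's exact formula.

```python
import math

def signed_log(x):
    """Sign-preserving logarithmic transform that accepts negative and zero values.

    NOTE: this is an assumed stand-in for Formula (1), which is defined
    elsewhere in the paper; shown for illustration only.
    """
    return math.copysign(math.log1p(abs(x)), x)

# Both tails are compressed symmetrically, unlike a plain log, which
# would be undefined for the negative ratio values in this dataset.
print(signed_log(100.0), signed_log(-100.0))
```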
To investigate the normality of the distribution of each variable after transformation, quantile-quantile graphs were used (Figure 3). Based on these graphs, it was concluded that the distributions of all variables were not normal. This fact was also confirmed by the Shapiro-Wilk test results, where the null hypothesis had to be rejected in favor of the alternative hypothesis for each variable due to the very low p-values (p < 0.01). It could be observed that despite the transformations, the variable distributions were not close to normal. The lack of normality is not a serious problem for the supervised learning methods mentioned in this paper, because they can cope with it. It was decided that the transformed variables were to be used in the modelling process, because using variables with very high skewness coefficient values could worsen the model's performance.
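In practice the Shapiro-Wilk test is usually run via `scipy.stats.shapiro`. The quantile-quantile comparison itself can be sketched with the standard library alone: each ordered sample value is paired with the standard-normal quantile at the corresponding plotting position, and a straight-line pattern would indicate normality. The sample values below are hypothetical.

```python
from statistics import NormalDist

def qq_points(sample):
    """Pairs of (theoretical standard-normal quantile, ordered sample value),
    using the plotting positions (i + 0.5) / n."""
    ordered = sorted(sample)
    n = len(ordered)
    nd = NormalDist()
    return [(nd.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(ordered)]

# A hypothetical transformed ratio; the outlier 3.5 would bend the Q-Q plot upward.
pts = qq_points([0.1, 0.4, -0.2, 3.5, 0.0, -0.1])
```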

Correlation Analysis
The first stage of correlation analysis between variables consisted of calculating the values of Spearman rank correlation coefficients for all pairs of continuous independent variables. The results are presented in graphical form (Figure 4). To study the dependencies between the independent variables and the dependent variable, the Phi coefficient (φ) was used after each continuous variable had been divided into two ranges. Because of their low correlation with the dependent variable, the X3 (return on equity) and X4 (return on assets) variables were removed: not only were their relationships exceptionally weak compared to the other variables, but they were also found to have an insignificant impact on the dependent variable. In subsequent iterations, variables were eliminated based on adopted threshold values of the Spearman rank correlation coefficient; variables were considered strongly correlated when the coefficient was lower than −0.7 or greater than 0.7. The following independent variables were eliminated in subsequent steps:

• Step 1: the X2 variable, like X12, correlated beyond the adopted threshold with three other variables. There was also a strong correlation between X2 and X12, but based on the value of the Phi coefficient, X2 had a lower impact on the dependent variable than X12;
• Step 2: the X7 and X12 variables correlated beyond the adopted threshold with two other variables;
• Step 3: X9 correlated with one variable (X10), but its correlation with the dependent variable, based on the value of the Phi coefficient, was weaker than that of X10.

To investigate the correlation of all independent variables with the single binary independent feature (X1), Fisher's exact test was conducted after the continuous variables had been transformed into dichotomous ones. The results are presented in Table 7. The X1 variable was strongly correlated with most of the variables, as evidenced by very low p-values: at a significance level of 0.05, the null hypothesis of the independence of the examined variables was rejected, so the X1 variable was eliminated. Finally, a set of variables was selected for further processing: X5, X6, X8, X10, X11, X13, X14, X15, X16, X17, X18 and X19.
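The threshold-based elimination above relies on the Spearman rank correlation, which is the Pearson correlation computed on (tie-aware) ranks. A minimal sketch, with hypothetical ratio values standing in for the real X2 and X12 series:

```python
def ranks(values):
    """Average 1-based ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean rank of the tied block
        for idx in order[i:j + 1]:
            result[idx] = avg_rank
        i = j + 1
    return result

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

STRONG = 0.7  # threshold adopted in this study
# Hypothetical values of two ratios across firm-years:
x2 = [1.2, 0.8, 2.5, 3.1, 0.4]
x12 = [1.0, 0.9, 2.2, 3.5, 0.6]
strongly_correlated = abs(spearman(x2, x12)) > STRONG
```

When `strongly_correlated` holds for a pair, the member with the weaker Phi coefficient against the dependent variable is the one dropped, as in Steps 1-3 above.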

Division of Data into a Training Set and a Test Set
The sample was stratified using additional information: the year of payment of the penalty or compensation, the revenues in that year and the dependent variable. The whole process was divided into three stages, the final objective being to obtain a training set and a test set.
Stage 1: Division of the sample according to the year of payment of the penalty/compensation. The sample was divided into three subsamples according to the year in which the penalty or compensation was paid. Table 8 shows the number of records in each new subsample.
Stage 2: Division of data according to the sales revenues in the year when the penalty was paid. Each of the subsamples created in Stage 1 was divided into quartile-based groups according to the amount of revenue obtained in the year when the penalty/compensation was paid. Tables 9-11 show the number of observations grouped by category after the first two stages of the sample division process.
Stage 3: Division of data based on the dependent variable. A total of 75% of the records from each category ("good", "bad") within a specific subgroup created in Stage 2 were assigned to the training set and the remaining 25% to the test set. Table 12 contains information about the number of records in each category in each set. Based on the information contained in Table 12, it is clear that the created sets were unbalanced, as the number of observations in one category exceeded the number of records in the other. It was therefore decided to use one of the set-balancing methods described in Section 3.2, namely SMOTE. In this analysis, as in other publications (Maldonado et al. 2019), the value of the parameter k was 5. According to the adopted modelling principles, this process was carried out only on the training set, with the test set remaining unchanged. Table 13 shows the number of observations in each set by category after applying the SMOTE method.
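SMOTE with k = 5 generates synthetic minority-class records by interpolating between an existing minority observation and one of its five nearest minority neighbours. In practice one would use the `SMOTE` class from the imbalanced-learn library; the sketch below is a minimal stdlib illustration of the core idea, with entirely hypothetical two-feature observations.

```python
import math
import random

def smote(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic minority-class samples by interpolating
    between a random minority point and one of its k nearest minority
    neighbours (the core idea of SMOTE)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # k nearest neighbours of a within the minority class, excluding a itself
        neighbours = sorted((p for p in minority if p is not a),
                            key=lambda p: math.dist(a, p))[:k]
        b = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

# Hypothetical two-feature observations from the under-represented ("bad") class.
minority = [(0.2, 1.0), (0.3, 1.2), (0.25, 0.9), (0.4, 1.1), (0.35, 1.3), (0.15, 0.8)]
new_points = smote(minority, n_new=4, k=5)
```

As in the study, such balancing is applied only to the training set; the test set is never resampled.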

Supervised Learning
The whole procedure, consisting of drawing observations for the training and test sets, running a particular method and calculating the evaluation indicators, was performed ten times. This process is a modification of cross-validation. In classic n-fold cross-validation, the dataset is divided into n subsamples; in each round, one subsample serves as the evaluation set and the remaining n-1 subsamples form the training set, so each of the n groups becomes, in a sense, a test set. In the approach applied here, a training set and a test set were instead created from scratch in each iteration and the given algorithm was run; both procedures were performed k times. This process can be described as sampling with replacement, so one record could be drawn for the test set several times. The parameter k was set to 10, because 10-fold cross-validation is often used in the literature.
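The two evaluation indicators used in each iteration, AUC and Cohen's kappa, can be computed as follows. The AUC is implemented via the rank (Mann-Whitney) formulation; the scores and labels are hypothetical placeholders for one test-set draw.

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation:
    the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def cohens_kappa(pred, true):
    """Agreement between binary predictions and labels, corrected for chance."""
    n = len(true)
    po = sum(p == t for p, t in zip(pred, true)) / n   # observed agreement
    p_pred, p_true = sum(pred) / n, sum(true) / n
    pe = p_pred * p_true + (1 - p_pred) * (1 - p_true)  # chance agreement
    return (po - pe) / (1 - pe)

# Hypothetical model scores and true labels from one of the k = 10 test-set draws;
# the per-iteration values would then be averaged into the final result.
scores = [0.91, 0.75, 0.62, 0.33, 0.20]
labels = [1, 1, 0, 1, 0]
print(auc(scores, labels), cohens_kappa([1, 1, 1, 0, 0], labels))
```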
Each result in each iteration was saved and the average score was treated as the final result (Table 14). In terms of the value of AUC, the best results were obtained using the CatBoost method, whereas the worst outcome was produced by the decision tree. Interestingly, all the boosting-based algorithms proved to be more efficient than classic, easily interpretable methods such as logistic regression or the decision tree. The same was true of Cohen's kappa indicator: again, CatBoost was the best and the decision tree the worst, confirming once more the superiority of boosting over the other methods. Table 14 reports the mean values of the AUC and Cohen's kappa coefficients for the applied supervised learning methods after a 10-fold division of the sample into a training set and a test set and running the specified method.
In order to analyze the stability of the individual methods, standard deviations were calculated for the results obtained in the ten iterations; they are presented in Table 15. This time, the most stable method with respect to AUC was gradient boosting, whereas the least stable outcomes were produced by the decision tree. The results were similar in the case of the Cohen's kappa index values: once again, a boosting-based algorithm proved to be the best, but in this instance it was XGBoost, and the worst results were again produced by the decision tree. However, it is worth noting that in terms of the Cohen's kappa indicator, the second least stable method was LightGBM. This might stem from this technique's poor performance on small datasets, as was the case in this research; its stability was worse than that of logistic regression. In the case of the AUC values, LightGBM also produced the least stable results of the four boosting-based algorithms and was the only one whose standard deviation exceeded 0.02. Table 15 reports the standard deviations of the AUC and Cohen's kappa coefficients for the applied supervised learning methods after a 10-fold splitting of the sample into a training set and a test set and running the specified method.
Furthermore, the SHAP values, based on the Shapley value, of each iteration were averaged for each of the variables and a ranking was created (Figure 5). Detailed charts of the SHAP approach for each method and each iteration were analyzed; Figure 6 shows an example of this approach for one method (based on gradient boosting) in one of the iterations. These graphs can help to determine which variables affect the dependent variable and in what manner.
The results of the logistic regression differ essentially from those of the other methods. Figure 7 shows the Spearman rank correlation coefficient values for the scores of each algorithm. The effects of logistic regression are most correlated with the decision tree results, with a value of 0.66. The other techniques are correlated with each other at a level of at least 0.75, which is a strong, significant dependence. This should not be surprising, because the gradient boosting methods are based on decision trees.
Bearing in mind the logistic regression effects, the most important characteristic influencing the prediction of the examined undesirable phenomenon is the current ratio (X6): the lower the values of this feature, the more the model leans towards the "bad" class. In the case of the other methods, the results were relatively similar. An importance ranking omitting the logistic regression effect was created; it is presented in Table 16. The most important variables signaling the payment of penalties or compensation by a company in the following year were: the long-term debt to equity ratio (X11), the receivables to payables coverage ratio (X17) and the basic earning power ratio (X14).
A deeper analysis of these independent variables for the applied methods showed that the classification of a business as "bad" is supported by high long-term debt to equity ratio (X11) values, high receivables to payables coverage ratio (X17) values and lower basic earning power ratio (X14) values, as illustrated in Figure 6. The higher the long-term debt to equity ratio or receivables to payables coverage ratio, the greater the SHAP value in the positive direction and thus the larger the probability that the company will pay a significant penalty or compensation the following year; lower basic earning power ratio values likewise mean that a company is more likely to do so.
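The per-variable importance ranking described above averages absolute SHAP values over all iterations and observations. In practice the `shap` library produces these matrices; the aggregation step itself can be sketched as follows, with purely hypothetical SHAP matrices for three of the features.

```python
def importance_ranking(shap_runs, feature_names):
    """Mean absolute SHAP value per feature, averaged over all iterations
    and observations, sorted from most to least important."""
    totals = [0.0] * len(feature_names)
    count = 0
    for run in shap_runs:        # one matrix of SHAP values per iteration
        for row in run:          # one row per observation
            for i, v in enumerate(row):
                totals[i] += abs(v)
            count += 1
    means = [t / count for t in totals]
    return sorted(zip(feature_names, means), key=lambda p: p[1], reverse=True)

# Hypothetical SHAP matrices from two iterations for three features.
runs = [[[0.4, -0.9, 0.1], [0.2, 0.7, -0.3]],
        [[-0.5, 0.8, 0.2], [0.3, -0.6, 0.1]]]
ranking = importance_ranking(runs, ["X11", "X17", "X14"])
```

Positive SHAP values push a prediction towards the "bad" class and negative ones towards "good"; the ranking uses absolute values only, so it measures strength of influence, not direction.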

Discussion
The financial statements of companies provide a great deal of information about their financial activities. In this paper, a dependent variable was created based on information about the payment of penalties or compensation to other companies. This is an original approach to assessing companies; previous studies have not used the payment of penalties or compensation as a dependent variable. I decided to compare the modelling effects presented in other publications, focusing mainly on models that assess the probability of negative financial events, in particular the prediction of bankruptcy or insolvency of companies. It is worth highlighting that the analyzed research methodologies differed in the size and kind of the population, the set of variables and the period under investigation. Only those methods which were also used in this study were compared. In the case of (Jabeur et al. 2021), the evaluated data was not older than three years before company failure. The closer a company was to bankruptcy, the higher the AUC was for all the created models. The best algorithm in each period was CatBoost, for which the AUC ranged from 0.764 to 0.994. For XGBoost, the AUC value was between 0.715 and 0.931, while for gradient boosting, it ranged from 0.718 to 0.951. For logistic regression, the AUC values ranged from 0.744 to 0.919, and the results were better than those of XGBoost or gradient boosting when predicting bankruptcy three years in advance. In the study by (Pham and Ho 2021), the boosting-based algorithms were compared; the AUC value for XGBoost and gradient boosting was 1, which is the ideal state, although we do not know whether model overfitting occurred in this case. Pisula (2017) compared the results of several methods for an unbalanced and a balanced sample; for example, the decision tree was used both as a stand-alone classifier and as a base classifier for the ensemble model. For each of these iterations, the AUC value was greater than 0.9, while Cohen's kappa was above 0.8. This demonstrates the high predictive power of the decision tree.
These results confirm what has been stated before in the literature on algorithm accuracy: ensemble classifiers are more accurate than logistic regression or an individual decision tree. It would be preferable for future studies to concentrate on these algorithms; for example, it might be worth trying AdaBoost, another boosting-based algorithm. Future research should also concentrate on selecting hyperparameters whose optimization may help to increase the values of model evaluation metrics.
The results of this research indicate which financial measurements could signal the future occurrence of negative events that exacerbate the financial situation of a business entity. This is a very important subject in business management, as it allows managers to focus on those measures which reduce the enterprise's rating. Such information is valuable in the context of a company analysis or financial statement evaluation; it makes the management of a company more effective and helps to minimize the risk of mistakes resulting in financial losses. According to the results of this study, the most important variables in the analysis of a business are the long-term debt to equity ratio, the receivables to payables coverage ratio and the basic earning power ratio, as determined by ensemble classifiers, as well as the current ratio, as shown by logistic regression effects. This information could also be helpful for investors interested in buying company shares. First of all, they could concentrate on these specific indicators instead of conducting a comprehensive fundamental analysis, saving time, which is crucial in decision-making processes on the stock market. However, they should not limit their analysis to the indicated ratios, as this could obscure the overall picture of a company's financial condition. This model could also help investors to reduce risk in their investment portfolio. The model's predictions pertain to companies listed on the stock exchange, which is why it could be helpful in building a portfolio of assets.
High values of the long-term debt to equity ratio or the receivables to payables coverage ratio, as well as low values of the basic earning power ratio or the current ratio, make a company riskier. Reduced financial liquidity, as measured by the current ratio, means that a given company cannot cope with the repayment of its current liabilities. In addition, growing long-term debt causes a company to become over-indebted, which means it may fail to fulfill signed contracts. This situation can lead to the imposition of penalties or compensation and, in the end, may even lead to a company's bankruptcy. At first sight, high receivables to payables coverage ratio values may appear to be a good thing in the context of a given company's financial standing, but in the long term, they may raise doubts. Companies with many debtors may, after a certain period of time, have problems collecting these receivables due to the financial problems of their debtors; receivables which the customers have not paid are known as bad debts. This can leave companies unable to pay their own liabilities due to the lack of the financial resources they were expected to receive from their counterparties. Secondly, uncollected receivables reduce the company's profit.
It is important to evaluate many aspects of a business. The developed model can be used as a part of one comprehensive scoring algorithm. So far, no publications have been found that include a dependent variable based on the information about significant penalties or compensation paid to other companies and therefore it is difficult to directly compare this model with similarly devised ones. In general, this type of algorithm can be classified as one that assesses the activities of businesses. This category of model mainly focuses on the prediction of bankruptcy. In such algorithms, the dependent variable is based on information about companies that have failed. A value of one is assigned to those businesses that went bankrupt during the analyzed period and a value of zero to other businesses. Such models have a higher accuracy than those proposed in this publication, i.e., those with a dependent variable based on information about the payment of a significant penalty or compensation to another company.
This study can serve as a prelude to future research. However, consideration should be given to increasing the sample size and expanding the criteria for selecting data to include capital companies and business entities with a different legal form. Moreover, other machine learning methods could be incorporated or the information about penalties or compensation may be combined with another variable and included as a new dependent variable. Such information could be regarded as complementary when calculating the score of a company. A model based on a function which determines the payment of a penalty or compensation should be one of the components of a scoring algorithm. It is also essential to take into account other independent variables which are based on other financial indicators.

Conclusions
The assessment of the activities of a business is extremely important nowadays. With the huge growth in available data, new opportunities are emerging that make the construction of advanced algorithms possible. In this study, the research hypothesis formulated in the introduction was confirmed: the values of financial indicators signal, one year in advance, the occurrence of a negative event (penalties or compensation payments) that is reflected in the financial situation of the business entity. An example of such a negative effect is the Polish company Elektrobudowa SA, described in Section 1; in this case, such penalties and compensation exacerbated the company's problems with other counterparties and shareholders. This information is valuable, for example, for stock exchange investors, who could make a decision to buy or sell a company's shares based on the values of such measures as the long-term debt to equity ratio, the receivables to payables coverage ratio, the basic earning power ratio or the current ratio. The same applies to people who manage business entities. It is important that such individuals examine a large number of indicators more broadly, but it would be reasonable for them to focus on information about those metrics that could reveal the likelihood of negative developments. For businesses, such information may accelerate decision-making processes. This paper shows that ensemble classifiers based on decision trees produced better results in terms of accuracy and stability than a single decision tree: the combination of weaker classifiers had a greater effect than one weak classifier. Compared to logistic regression, too, the boosting-based methods produced better final scores. The logistic regression differed from the other methods in terms of the importance of variables.
In this case, the current ratio was the most important feature signaling the occurrence of a penalty or compensation paid to another company in the following year; for the other methods, these variables were the long-term debt to equity ratio, the receivables to payables coverage ratio and the basic earning power ratio. It is recommended that researchers continue to use and compare machine learning methods, while also taking into account other independent variables. Managers or investors who would like to apply the obtained results should first analyze a company's financial standing using the indicators highlighted as significant in the context of predicting penalties or compensation payments, and only after that focus on other indicators.

Conflicts of Interest:
The author declares no conflict of interest.