1. Introduction
The global COVID-19 pandemic has caused a health and economic crisis in individuals and in the various economic and banking sectors (
Corredera-Catalán et al. 2021;
Hidayat et al. 2021; Ya
Liu et al. 2021;
Luo 2021;
Norden et al. 2021), the latter due to the impact on credit risk on the part of companies (
Yin et al. 2022). This is why many countries worldwide have intervened and worked to combat the economic crisis caused by the coronavirus pandemic, especially with small- and medium-sized enterprises (SMEs) being the most vulnerable and important in the business ecosystem (
Corredera-Catalán et al. 2021), which is why a great demand for credit has been generated for this sector (
Yang et al. 2021) with the support of governments, who have influenced the allocation of credit by banks because the survival of the economy depends on them (
Massoc 2021).
Recently, several studies have been carried out on credits and their risk in the face of the global problem caused by COVID-19. In this sense, risk mitigation can be achieved through letters of credit, and financing instruments that provide guarantees for commercial partner services (
Crozet et al. 2022). However, the risk of default is latent in enterprises and credit risk is mitigated after the epidemic has been controlled (
Yan et al. 2022), while it has been shown that state-owned banks in the periods of the pandemic outbreak managed to reduce their borrowing capacity of SMEs (
Yun Liu et al. 2022). It is in this scenario that monetary interventions are associated with lower levels of trade credit, while fiscal interventions increase due to the use of trade credit (
Al-Hadi and Al-Abri 2022). This implies that, COVID-19, produced by SARS CoV-2, significantly affects credit risk and is related to bank capital, total loans, and bank profitability (
Riani 2021).
The differential effects of different types of creditor claims on the probability of default and loss of default can show significant intertemporal variation (
Heitz and Narayanamoorthy 2021). Meanwhile, companies with higher operational risk tend to adjust trade credit around the target more quickly than those with lower risk exposure (
Luo 2021). In the shareholder scenario, the COVID-19 shock was able to increase the credit default swap (CDS) spread, thereby reducing shareholder value for those companies with higher debt rollover risk, however, it is stronger in non-financial, financially constrained, and highly volatile companies (Ya
Liu et al. 2021). However, it is the degree of missing data matching, the number of contract defaults, the enforcement rate, the level of business concentration and the amount of administrative penalties that influence SME credit risk, in addition to transactional credit and reputational monitoring (
Yang et al. 2021). Therefore, the capacity of governments will depend on the capacity of banks to grant credit to companies (
Massoc 2021) and the latter to meet their credit obligations.
Peruvian companies were economically affected due to the social isolation measures established by the executive in the second week of March 2020, in order to face the health care emergency generated by COVID-19, being that companies were subjected to an enormous risk in the continuity of their payment chain (
Sampén et al. 2021), so the government established business programs to alleviate the economic havoc wrought by COVID-19 in companies at the national level. One of these programs that had the greatest acceptance and disagreement was the Reactiva Perú Program enacted by
Decreto Legislativo No. 1455 (
2020) and extended by
Decreto Supremo No. 335-2020-EF (
2020), which was aimed at companies affected by the COVID-19 health emergency, with the intention of promoting financing to companies facing payments and obligations with their collaborators and suppliers, with the promise of safeguarding the continuity of the payment chain in the country (
Decreto Legislativo No. 1455 2020).
So, why should we determine the credit risk in the framework of the Reactiva Peru Program? This question arises due to the existing problem of credit risk for the companies benefiting from the program, given that the financing fund amounted to PEN 60 billion T(8% of GDP), specifically destined to guarantee loans from the Financial System Entities (ESF), administered by the Development Finance Corporation (COFIDE). However, these guarantees may be at risk, since they more reflect the identity of the borrower who determines the value of the loan, but it is the risk of the lender (companies benefited by Reactiva Peru) that may be limited to the borrower’s willingness to pay or their inability to meet their obligations, which may be reflected in the short-term in the borrowers (
Yan et al. 2022). In this sense, the beneficiary companies are the ones that take the credits and they are the ones that must adequately manage these working capital funds, although their scope of financing is limited to the acquisition of assets, the purchase of shares, bonds, monetary assets, the payment of overdue obligations and not to use it as capital contribution; on the other hand, they have the responsibility and obligation to reactivate the Peruvian economy (
COFIDE 2020;
Martinez and Pérez 2020).
In the Peruvian case, studies indicate that the companies that have benefited from the Reactiva Perú Program have shown a positive improvement in liquidity to continue with their activities and meet their short-term obligations (
Martinez and Pérez 2020;
Riani 2021). Likewise, it has been stated that the Reactiva Peru Program has a positive impact on working capital, allowing them to continue with their daily commercial operations of buying and selling (
Sudario 2021); in addition, it has been identified in the gray literature that interest rates have been reduced by up to 4.3% and the supply of credit has had an increase of 38% for a certain sector (
Quispe 2020), assuming that the program has benefited companies so that they do not go bankrupt and stop generating employment (
Monzón et al. 2021). However, the knowledge of and access to the Reactiva Peru Program has been revealed, where it was found that 20% had insufficient knowledge, followed by 75% with an average level of knowledge and access, 33.8% had access to a loan, and 3.8% benefited from two loans financed by Reactiva Peru, while 36.3% were not able to apply for financing and therefore their economy was affected and disrupted by the impact of COVID-19 (
Bocanegra et al. 2021).
A published report stated that 501,298 companies in total (first credit and second credit) were benefited by the Reactiva Peru Program as of 30 October 2020, with a loan amount totaling PEN 57,863,747,358.00 and a covered amount of PEN 52,158,699,017.00 distributed among companies located in 25 departments of Peru (
COFIDE 2020;
MEF 2020). This is a calculated risk of the Peruvian government, which is currently revealing the effects of the Reactiva Peru Program (
Cuadros 2022) and being that many companies are not paying due to the way in which the credits of the Reactiva Peru Program were given, that is, of those companies that were mostly benefited (
La República 2022). Therefore, it is worth determining the credit risk through a predictive model of machine learning by means of the Lasso and Ridge multiple regression models, which can be verified with algebraic mathematics by the least squares from the list of beneficiary companies published by the Ministry of Economy and Finance, which allows public decision makers of the credit risk granted to generate strategies and minimize the risks on the part of the beneficiary companies.
2. Methodology
This section presents the quantitative analysis of the dataset, using multivariate regression analysis with machine learning techniques under the Lasso and Ridge regression models (
Dalgaard 2008;
Tan et al. 2019) and verification with algebraic least squares mathematics to compare and validate the models. For this purpose, analysis software such as SPSS, Jamovi, R Studio, and MATLAB were considered, under the CRISP-DM method, especially for data mining projects, which determined the ten-phase approach to determine the best model that predicts the credit risk of the Reactiva Peru Program (
Figure 1).
2.1. Download Database
The problem was identified, and the information was verified on the website of the Ministry of Economy and Finance (MEF), where it was possible to access the statistical reports issued and the list of companies benefiting from the Reactiva Peru Program, updated to 30 October 2020. This gave way to downloading the publicly available Excel database. The data can be found at:
https://bit.ly/ListadeempresasRP-2020 (accessed on 5 February 2022).
2.2. Data Quality Control
A total of 501,298 companies benefited from the Reactiva Peru Program, grouped into microenterprises, small enterprises, medium-sized enterprises, and large enterprises, benefiting 2,561,236 employees (
MEF 2020) (See
Table 1).
After downloading the list of beneficiary companies from the MEF’s web portal, a copy of the data was made in Microsoft Excel for efficient quality control. In this process, eight variables were identified: the name of the company, RUC/DNI (Single Register of Taxpayers/National Identity Document), the economic sector, name of the entity granting the loan, name of the second entity granting the loan (companies that received a second loan), loan amount (s/), amount covered (s/), and departments. Of the eight variables identified, four were eliminated: the name of the firm, RUC/DNI, name of the second lending institution, and amount of the loan (s/), because they do not contribute to the main objective of the study. The name of the company and the RUC/DNI were equivalent, and were eliminated because there was no variability (few companies took out two loans), and the name of the second lending institution was eliminated because the study only focused on the level of risk of the companies that took out the first loan granted. The variable “amount of the loan” could have been considered in the present study, but was not taken into account since the amount covered is the most important data for predicting credit risk, so these variables did not contribute to the main objective of the study. It should be noted that prior to the elimination, an attempt was made to analyze these variables so they went through a normalization process, but they lacked this assumption, and since most of them could not be transformed, they were not considered. The following tables show the percentage behavior of the most relevant variables in relation to the beneficiary companies.
Table 2 shows the total number of companies that benefited according to the economic sector. The commerce sector had the highest loan coverage (47.48%), followed by the transportation, storage, and communications sector with 11.90%, equivalent to 59,661 beneficiary companies. Of the 14 economic sectors, the electricity, gas, and water sectors benefited the least from the Reactiva Peru Program, reaching a coverage of 0.14%, equivalent to 717 companies.
Figure 2 shows the percentages of the beneficiary companies by department, where companies in Lima benefited the most, covering 30.57% of companies, followed Puno with 7.46%, and the fewest companies covered was in Huancavelica, where only 1728 companies had access, representing 0.34% of companies.
Table 3 shows the percentage results of the companies that accessed credit according to the list of lending institutions. The financial institution that provided the most loans to companies was Mibanco, with a total of 255,671, equivalent to 51.002%, followed by Banco de Crédito BCP with 12.933% of companies that benefited, and the bank with the fewest companies that benefited was Santander Perú S.A., with only nine companies benefiting from the Reactiva Perú Program loan.
2.3. Identification of Regressor and Predictor Variables
After quality control of the data and descriptive exploration of the variables, four regressor variables were identified: economic sector, the name of the entity granting the loan, the amount covered (PEN), and department (See
Table 4). In this step, a correlative numerical value was assigned to the qualitative variables according to the alphabetical order of their categories, which can be seen in parentheses in
Table 2 and
Table 3, and in
Figure 2.
A logical transformation of the quantity covered was used to generate an ordinal interval variable (considering the levels according to SBS) and create a dummy variable (
Pérez 2004;
Tsuchiya et al. 2021). The minimum (229.32), maximum (8,500,000), quartile 1 (4890.2), quartile 2 (11,760), and quartile 3 (30,079.7) were considered, which allowed for the categorization of the predictor variable, equivalent to 1 = With potential problems, 2 = Deficient, 3 = Doubtful, and 4 = Lost, according to the levels pre-established by the Superintendency of Banking, Insurance, and AFP (
SBS 2019).
An analysis of the level of risk by category in relation to the unstandardized amount covered was carried out by using the multinomial regression technique, resulting in a risk level of 26.12% with potential problems and 25.000% with a loss, followed by 24.951% with a doubtful level, and finally, 23.931% with a deficient level (See
Figure 3).
2.4. Normalization of Variables
The present research used data mining with unscaled variables. In this regard,
Shanker et al. (
1996) suggests that when using data mining and with the application of automatic learning techniques, it is necessary to normalize the characteristics of the variables, since they produce better results in general. In addition, the requirement of the algorithms require the normalization of the data (
Atlas et al. 1990), in this case, of the regressors and predictors identified. For this, we proceeded to normalize the data using the min–max normalization technique in order to ensure homogeneity in the variables concentrated in a continuous interval [0; 1] (
M-Dawam and Ku-Mahamud 2019) by considering Equation (1):
2.5. Training Dataset Creation (70%) and Testing (30%)
The normalization of the variables allowed for the process of creating the training dataset equivalent to 70% (called train) and 30% of the test dataset (called test1) to be used for risk level prediction. These two databases were imported into the R Studio software for their respective analysis, initially verifying the descriptive training data were equal to 350,909 companies, and test1 was equal to 150,389, identifying five study variables in both files.
2.6. Selection and Formulation of Lasso and Ridge Predictive Models
2.6.1. Lasso Model Prediction Measures
At the end of the last decade of the last century, the Lasso (Least Absolute Shrinkage and Selection Operator) model was proposed as a method to estimate linear models, with the purpose of minimizing the residual sum of squares, conditional on the sum of the absolute value of the coefficients being less than a constant (
Tibshirani 1996). Therefore, small coefficients can be reduced to zero, thus eliminating them from the model, or a small subset can be larger and non-zero (
Friedman et al. 2010). This works when the number of variables tends to be large, or in cases when the number of variables is larger than the sample (
Hair et al. 2018). Lasso regression and recursive estimations were also performed and the penalty coefficient “
λ” was selected at each recursive step on the basis of cross-validation, focusing on the mean square error (
Friedman et al. 2010). However, we defined lambda (
λ) as the weight or regularization parameter assigned to the Lasso and Ridge models (
Hastie et al. 2016).
The Lasso regression model represented mathematically (
Hastie et al. 2016) has the equation as follows, in addition to the complementary results in
Appendix A:
2.6.2. Ridge Model Prediction Measures
Ridge regression is a particular adaptation of least squares and allows one to address the estimation problem by producing a biased estimator but with small variances (
Crocker and Seber 1980). In addition, it allows for data analysis to be performed when multicollinearity exists and helps to avoid over-fitting (i.e., during the procedure, it removes part of the variance in exchange for a small bias, producing more useful coefficient estimates when such multicollinearity is present) (
Frost 2019).
From another point of view, unlike Lasso regression, Ridge regression reduces the coefficients of the correlated predictors, which allows them to borrow the strength of the others. From a Bayesian perspective, the penalty of the Ridge model is appropriate in cases where there are several predictors and they all have non-zero coefficients (i.e., they are drawn from a Gaussian distribution) (
Friedman et al. 2010). Furthermore, it is a priority to consider the properties of the Ridge regression mean square error such as the variance and bias of the estimator, the theorem on the mean square function, and the comments made on the mean square error function in the analysis (
Crocker and Seber 1980;
Hoerl and Kennard 1970). Therefore, the Ridge regression model can be mathematically represented in the following equation:
4. Discussion
This study determined the level of credit risk of the Reactiva Peru Program through a Lasso regression model with an optimal penalty coefficient
λ60 equal to 0.00038 with a precision of 0.36. In this regard,
Yang et al. (
2021), under the application of the Lasso-logistic model with a precision equal to 0.96, showed that the factors that influence the credit risk of small and medium enterprises (SMEs) are the degree of coincidence of missing data, the ratio of contract compliance and the number of defaults of these as well as the degree of business concentration and the number of administrative sanctions. In contrast,
Luo (
2021) noted that firms with higher operational risk tended to adjust trade credit around the target more quickly than those with lower risk exposure. Therefore, the amount covered by firms benefiting from the Reactiva Peru Program has a higher risk, especially those that obtained a larger amount. In fact, financial institutions should focus on these factors when granting and assessing the level of credit risk in order to make better decisions.
It should be noted that the credit portfolios of banks are often large and complex to visualize; in this sense,
Neuberg and Glasserman (
2019) mentioned that proper regularization of the portfolio contributes to significantly improve performance, moreover, the application of these methods to credit default swaps allows for margin requirements of the clearing portfolio to be set, and the Lasso method is suitable for estimating the market structure.
Liu et al. (
2021) noted that the advent of COVID-19 and the shock generated by it have led to an increase in credit default swaps (CDSs), with a significant effect on shareholders, especially in non-financial firms, financially constrained firms, and highly volatile firms. In contrast, in a recent study,
Jiang (
2022) showed that equity risk has risen to be an important determinant of credit risk.
The comparison and validation of the Lasso regression and Ridge regression models under validation with algebraic mathematics allowed for the validation of the best risk level prediction model, with Lasso being the best prediction model. Contrasting results were found by
Wang et al. (
2015) when assessing credit risks with the Lasso logistic regression and showed that the proposed algorithm outperformed the most popular credit scoring models such as decision tree, Lasso logistic regression, and random forests. Similarly,
Dai et al. (
2021) used several models including using Lasso and recursive feature elimination to predict the bank’s credit rating, finding that the SVM model obtained the best accuracy of 86% on the validated dataset and was able to identify that zero and negative revenue days can affect the firm’s credit rating. Similarly,
Yan et al. (
2020) were able to compare machine learning models and found different results to ours, where they mentioned that models incorporating indicator data in multiple time windows conveyed more information in terms of predicting the financial distress compared to existing single-time window models.
In comparison to the aforementioned opposing results,
Zhou et al. (
2021) agreed with our results, in the sense that they confirmed that the Lasso feature selection method demonstrated a remarkable improvement and outperformed other classifiers. Therefore, they pointed out that the credit score modeling strategy could be used to develop policies, progressive ideas, and operational guidelines for effective credit risk management of loans and other financial institutions. In addition,
Ahelegbey et al. (
2019) mentioned that the Lasso logistic model for credit scoring led to better identification of the meaningful set of relevant financial characteristics variables, thus producing a more interpretable model, primarily when combined with population segmentation through the factor network approach. However, they emphasized that while the results are promising, they are certainly not definitive.
A similar study on credit risk by
Brownlees et al. (
2021) proposed a credit risk model, where the interdependence of the default intensity is induced by the exposure to common factors. In contrast,
Rao et al. (
2020) identified 21 characteristics as a function of rating accuracy, which constitutes the credit risk assessment index system for borrowers in “three rural areas”, therefore, considering the owners of these three areas for their rating reduces the volatility of the characteristics and the probability of selection preference and effectively identifies the characteristics that affect the risk rating. In this regard, in the Peruvian case, a large part of the beneficiaries of the Reactiva Peru Program are from rural areas; therefore, it is important to effectively identify the characteristics that may be affecting the non-payment of the loan granted.
However, studies revealed the effectiveness of the Reactiva Peru Program on liquidity to continue its activities and meet their short-term obligations (
Martinez and Pérez 2020;
Riani 2021). In addition, it has had a positive impact on working capital, allowing them to continue with their daily commercial operations of buying and selling (
Sudario 2021) by significantly reducing interest rates by up to 4.3% and increasing the supply of credit by up to 38% in certain sectors (
Quispe 2020), with the aim of preventing companies from going bankrupt and ceasing to generate employment (
Monzón et al. 2021). However, there are adverse factors such as political, social, and economic factors that may affect the continuity of many companies from benefiting from this program due to the various local and global events that are directly and indirectly related. Therefore, having a model such as the one proposed in this study provides a warning about the level of credit risk, especially for Peruvian companies benefiting from the Reactiva Peru Program.
Based on the above, new lines of research and gaps that arise should be answered with future studies. First, it is necessary to carry out a study with the companies that benefited from a second loan and determine whether they may have a higher level of risk than those that accessed a single loan, under a benchmarking methodology, by considering the data from the first loan in relation to the second block of companies that accessed the second loan. Second, an investigation could be carried out by comparing the credits granted to the Reactiva Peru Program with other programs developed by the Peruvian government such as the Business Support Fund Program for SMEs in the Tourism Sector (FAE-Tourism), the Business Support Program for Micro and Small Enterprises (PAE-Mype), and the National Government Guarantee Program for the Financing of Agricultural Enterprises (FAE-Agro), whose program has had an extension of the credit granting period, under a machine learning methodology with the K-nearest-neighbor (KNN) algorithm, the elastic net model, or consider the computationally efficient lava prediction model, whose method structure is based on penalization (
Chernozhukov et al. 2017).
5. Conclusions
The research found that the commerce sector had a loan coverage of 47.48%, followed by the transport, storage, and communications sector with 11.90% and the electricity, gas, and water sector benefited the least with a coverage of 0.14%. In terms of departments, Lima benefited the most with 30.57% of the companies covered, followed by the department of Puno with 7.46%, and the least benefited was Huancavelica with 0.34%. On the other hand, among the financial institutions that granted loans, Mibanco stood out with 51.002% of companies covered, followed by Banco de Crédito BCP with 12.933%, and the bank with the fewest companies was Santander Perú S.A., with only nine companies benefiting from the Reactiva Perú program. The results are conclusive in the sense that companies with lower amounts presented potential problems with a risk level of 26.119%, and companies with amounts higher than 11,761 presented a risk level of 25% and 24.95% with potential and doubtful risks, respectively.
According to the comparison and validation of the model estimators, it can be concluded that all of the models presented in the Results Section have their merits in predicting the level of risk, but the Lasso model was the optimal model for predicting the level of credit risk of the Peruvian companies benefited by the Reactiva Peru Program as its prediction error was better (RMSE = 0.3573685; R2 = 0.07975), being lower than the Ridge model (RMSE = 0.3573812; R2 = 0.07973) with a difference of 0.0000127 and a precision error of 0.00036%. Therefore, the partial regression coefficient of the amount covered was 1.29671437 (i.e., holding all other variables constant will reflect an increase in the amount covered by a company, which is accompanied by an increase in the average risk level of about PEN 129.67144 per loan granted).
Public policies and strategies should be established, considering the Lasso prediction model to control and minimize the risks of the credits granted, in order to minimize the risk of non-payment by the beneficiary companies of the Reactiva Peru Program. The importance of the study lies in presenting the best machine learning predictive model, which is the Lasso model, and encouraging the use and applicability of these models by financial institutions and government agencies to enable them to make better decisions in the economic and business field.