Domain Knowledge Features versus LASSO Features in Predicting Risk of Corporate Bankruptcy—DEA Approach

Abstract: Predicting the risk of corporate bankruptcy is one of the most important challenges for researchers dealing with the issue of financial health evaluation. The risk of corporate bankruptcy is most often assessed with the use of early warning models. The results of these models are significantly influenced by the financial features entering them. The aim of this paper was to select the most suitable financial features for bankruptcy prediction. The research sample consisted of enterprises conducting a business within the Slovak construction industry. The features were selected using the domain knowledge (DK) approach and Least Absolute Shrinkage and Selection Operator (LASSO). The performance of VRS DEA (Variable Returns to Scale Data Envelopment Analysis) models was assessed with the use of accuracy, ROC (Receiver Operating Characteristics) curve, AUC (Area Under the Curve) and Somers' D. The results show that the DK+DEA model achieved slightly better AUC and Somers' D compared to the LASSO+DEA model. On the other hand, the LASSO+DEA model shows a smaller deviation in the number of identified businesses on the financial distress frontier. The added value of this research is the finding that the application of DK features achieves significant results in predicting businesses' bankruptcy. The added value for practice is the selection of predictors of bankruptcy for the analyzed sample of enterprises.


Introduction
Research shows that no company can be sure of its future, even in times of peace and prosperity. The risk of corporate bankruptcy is highly relevant today and is being addressed by many researchers. Interest in the problem has accelerated because of the events of the last few years (COVID-19, the war in Ukraine), especially in Europe. Early signals of bankruptcy need to be caught, and business managers should pay increased attention to them in order to prevent bankruptcy. Various methods of selecting bankruptcy prediction features, as well as various bankruptcy prediction models, are suitable for this purpose. Domain knowledge has been shown to play a significant role in this process and, when combined with a suitable prediction method, can provide significant results. This is confirmed by the studies of several authors: Veganzones and Severin (2021) selected features based on their popularity in the prior literature, Min and Lee (2008) used expert opinion, and Zhou et al. (2015) applied a domain knowledge approach. Altman's (1968) features are often used in bankruptcy prediction. They were used in the study of Hu (2009) and that of De Andrés et al. (2011). Barboza et al. (2017) combined the features of Altman (1968) with the features of Carton and Hofer (2006), which have a greater impact on financial performance models in the short term. Similarly, Du Jardin (2015) applied financial ratios traditionally used in the literature since Altman (1968). These ratios were chosen based on the main financial dimensions which govern bankruptcy. Tseng and Hu (2010) used features inspired by the research of Lin (1999) and Lin and Piesse (2004).
Several studies (Kirkos 2015; Zvarikova et al. 2017; Kovacova et al. 2019) have examined the occurrence of individual features in bankruptcy prediction models. We followed up on the results of Kovacova et al. (2019), who reviewed the most often used bankruptcy prediction features in the Visegrad Group countries.
Based on the above, the research question was as follows: Which way of selecting financial features for a DEA model ensures higher performance of the model: the domain knowledge approach or one of the data mining techniques, LASSO regression?
This paper follows previous research aimed at finding the most appropriate method of selecting features for DEA models. Previous studies rarely compare domain knowledge and data mining techniques when selecting features; the two approaches are mostly considered individually. This study is focused on filling this gap in the research. The LASSO+DEA approach is applied, and its performance is compared with that of DEA using features selected on the basis of expert opinion (the DK+DEA approach).
In line with the above mentioned, the aim of this paper was to select the most suitable financial features for bankruptcy prediction based on the comparison of the performance of DEA prediction models.
The remainder of the paper is structured as follows: The Literature Review Section presents different approaches to defining bankruptcy risk and lists studies dealing with methods and features applied in bankruptcy prediction. The Materials and Methods Section describes the research sample and methods used for feature selection and bankruptcy prediction. The Results Section offers the results of feature selection with the use of the domain knowledge and LASSO methods and uses them to create VRS DEA models. The Discussion Section compares the results of the DK+DEA and LASSO+DEA models and discusses them from the point of view of their performance and applied features. The Conclusion Section presents the contributions, added value, limitations and future direction of this research.

Literature Review
Determining corporate bankruptcy risk is one of the main challenges of economic and financial research as well as one of the most important issues for investors and decision-makers (Korol 2019). Predicting, measuring and assessing the risk of bankruptcy of a company is of particular interest to investors before investing their capital, as the optimization of risk is a prerequisite for the maximum capital profit of the investment, which will ensure payment of dividends. However, value maximization can only occur if capital providers selectively choose a profitable and sustainable business from which they can obtain the maximum share of business income (Agustia et al. 2020). The risk of bankruptcy is an important topic in many scientific articles, which is primarily reflected in the implications for the stakeholders' decisions (Lukason and Camacho-Miñano 2019). Bankruptcy risk (insolvency) can be understood as "the company's inability to meet maturing obligations resulting either from current operations, whose achievement conditions the continuation of activity, or from compulsory levies" (Bordeianu et al. 2011, p. 250). According to Achim et al. (2012), the risk of business bankruptcy is closely related to economic and financial risk. While financial risk is determined by the level of indebtedness, economic risk is dependent on the ratio of fixed and variable costs. In general, knowledge of these risks makes it possible to quantify the risk of bankruptcy of the company. Bankruptcy risk is the risk of a company no longer being able to meet its debt obligations. This risk is also referred to as the risk of failure or insolvency (Campbell 2011).
Bankruptcy risk represents a constant threat to businesses, which determines how long they will survive (Khan et al. 2020). If a business goes bankrupt, the probability of bankruptcy in connected businesses increases (Battiston et al. 2007), which can have a negative effect on the entire economy. Therefore, predicting the risk of bankruptcy is the subject of many research studies searching for the most suitable bankruptcy prediction model as well as the features that best describe bankruptcy.
Research on bankruptcy prediction dates back to Fitzpatrick (1932), who was the first to examine the financial conditions of bankrupt and non-bankrupt firms by comparing the values of their financial ratios. He found that there are significant differences between bankrupt and non-bankrupt companies, especially in liquidity, debt and turnover indicators (Fejér-Király 2015). In the early days of the development of bankruptcy prediction models, discriminant analysis (DA) was very popular. Beaver (1966) applied univariate discriminant analysis to investigate the predictive ability of 30 financial ratios. The best discriminating factor was identified as the working capital/debt ratio. The second one was the net income/total assets ratio (Gameel and El-Geziry 2016). Despite the criticism, this method was a starting point for the development of other models. The most famous bankruptcy-risk-scoring model, known as the Z-score, was published by Altman in 1968 (Voda et al. 2021). This model was developed with the use of multiple discriminant analysis. Since the introduction of Altman's model, many other authors (Deakin 1972; Altman et al. 1977; Norton and Smith 1979; Taffler 1983) have developed their models based on multiple discriminant analysis. In the 1980s, logistic regression analysis was developed, followed by probit analysis. The first logistic regression model intended to predict the financial situation of businesses was developed by Ohlson (1980). In the next period, many authors (Kim and Gu 2006; Mihalovic 2016; Barreda et al. 2017; Khan 2018; Affes and Hentati-Kaffel 2019) compared the accuracy of the multiple discriminant analysis model and the logistic regression model. These two models were the most used parametric models in bankruptcy prediction (Fejér-Király 2015). Probit analysis has not been as widely used as logistic regression. The first probit model was developed by Zmijewski (1984), followed by Zavgren (1985). Since the 1990s, the development of computer science has enabled the use of more computationally demanding methods in bankruptcy prediction. These methods are mainly non-parametric. Within them, Mousavi et al. (2023) identify two main groups: machine learning and artificial intelligence, and operations research. The most used methods within the machine learning and artificial intelligence group include artificial neural networks, such as those used by Messier and Hansen (1988), Odom and Sharda (1990), Atiya (2001) and Abid and Zouari (2002), decision trees (Frydman et al. 1985; Chen et al. 2011; Stankova and Hampel 2018), Bayesian models (Sarkar and Sriram 2001; Aghaie and Saeedi 2009; Cao et al. 2022), genetic algorithms (Kingdom and Feldman 1995; Alfaro-Cid et al. 2007; Bateni and Asghari 2020), modeling based on rough sets (Ahn et al. 2000; Wang and Wu 2017) and support vector machines (Huang et al. 2004; Olson et al. 2012).
The main method within operations research is Data Envelopment Analysis. It was first used to predict corporate failure by Simak (1997). In his master's thesis, he compared the results of DEA with the results of Altman's Z-score. In recent years, numerous models based on Data Envelopment Analysis have been developed to predict bankruptcy, and their results were compared with the results achieved with other techniques. Cielen et al. (2004) found that DEA outperformed a discriminant analysis model and a rule induction (C5.0) model in terms of classification accuracy. Ouenniche and Tone (2017) proposed the out-of-sample evaluation of decision-making units by applying DEA. The out-of-sample framework was based on an instance of case-based reasoning methodology. They found that "DEA as a classifier is a real contender to Discriminant Analysis, which is one of the most commonly used classifiers by practitioners" (Ouenniche and Tone 2017, p. 249). Premachandra et al. (2009) compared the results of an additive DEA model with the results of a Logit model. They found that DEA outperformed the Logit model in evaluating bankruptcy out of sample. Condello et al. (2017, p. 2186) found that DEA has "a greater capacity for bankruptcy prediction, while Logit Regression and Discriminant Analysis perform better in non-bankruptcy and overall prediction in the short term". Janova et al. (2012) achieved similar results. They found that the additive DEA model seems to perform well in correctly identifying bankrupt agricultural businesses. On the other hand, it is less powerful when identifying non-bankrupt agricultural businesses. The performance of DEA models is assessed mainly with the use of sensitivity, specificity, or overall accuracy. In this regard, Premachandra et al. (2011) pointed out that the cut-off point of 0.5 traditionally used to classify bankrupt and non-bankrupt businesses may not be appropriate for the DEA model. According to these authors, "depending on the precision with which predictions for bankrupt and non-bankrupt businesses need to be done, the decision maker has to determine an appropriate cut-off point" (Premachandra et al. 2011, p. 623). Stefko et al. (2020) determined the optimal cut-off of the additive DEA model at the point at which the sum of sensitivity and specificity is the highest. Stankova and Hampel (2023) selected an optimal threshold by applying the Youden index and distance from the corner. They found that "selecting a suitable threshold improves specificity visibly with only a small reduction in the total accuracy" (Stankova and Hampel 2023, p. 129).
In the development of the above-mentioned models, the variables included in the model are as important as the method applied (Nurcan and Köksal 2021). In order to select appropriate variables from high-dimensional datasets, various dimensionality reduction methods can be applied. Depending on whether the original features are transformed into new features or not, feature extraction methods and feature selection methods are differentiated (Wang et al. 2016; Li et al. 2020). Feature extraction methods transform existing features into a lower-dimensional space (new set of features) while preserving the original relative distance between the features (Subasi and Gursoy 2010; Li et al. 2020). Well-known feature extraction methods often used in current research include Principal Component Analysis (Adisa et al. 2019; Karas and Reznakova 2020), Multidimensional Scaling (Tang et al. 2020) and Isometric Mapping (Gao et al. 2020). Since the new set of features is different from the original ones, it may be difficult to interpret them (Wang et al. 2016). When using feature selection methods, the original features are sorted according to specific criteria and features with the highest ranking are selected to form a subset (Li et al. 2020). Among the feature selection methods, we can differentiate between filter, wrapper, embedded and combined methods (Liu et al. 2018). Filter methods examine each feature independently while ignoring the individual performance of the feature in relation to the group. Within filter methods, researchers frequently use the t-test (Chandra et al. 2009; Xiao et al. 2012), correlation analysis (Zhou et al. 2012) and stepwise methods (Lin et al. 2010). Wrapper methods use machine learning algorithms to evaluate the performance of selected feature subsets. Within them, decision trees (Ratanamahatana and Gunopulos 2003), Naive Bayes (Chen et al. 2009), artificial neural networks (Ledesma et al. 2008) and genetic algorithms (Amini and Hu 2021) are often used. The results of wrapper methods are often superior to the results of filter methods; however, the computational cost of wrapper methods is high. Embedded methods integrate feature selection and learning procedures. Important embedded techniques are regularization approaches, which have recently attracted growing interest, for example, LASSO (Fonti and Belitser 2017; Cao et al. 2022; Paraschiv et al. 2021) and Elastic net (Jones et al. 2016; Amini and Hu 2021). Combined methods include different types of feature selection measures, such as filter and wrapper.
Various methodologies have been applied to select features for DEA models. Cielen et al. (2004) used variables according to their efficiency to predict bankruptcy in prior research. Similarly, Psillaki et al. (2010) focused on financial ratios which appeared to be most successful in previous studies. Premachandra et al. (2009) approached this issue in the same way. When creating DEA models, they used ratios which were applied in past bankruptcy literature, and some of them were the same as the ratios used by Altman (1968) and Cielen et al. (2004). The ratios selected by Premachandra et al. (2009) were later applied in the study of Condello et al. (2017) and other studies. Min and Lee (2008) combined expert opinion and factor analysis when selecting features for DEA models. The resulting set of indicators contained the most relevant financial classification dimensions, while taking into account the mathematical relationships among ratios as well. Sueyoshi and Goto (2009) applied Principal Component Analysis to reduce the number of financial factors in order to reduce the computational burden of the DEA-DA model. Stefko et al. (2021) used Principal Component Analysis and Multidimensional Scaling when selecting inputs and outputs for DEA models. Huang et al. (2015) selected variables for DEA models based on gray relational analysis. They proved this method to be an effective technique for obtaining variables for DEA models. Gray relational analysis was later used in this way by Nurcan and Köksal (2021) as well. Lee and Cai (2020) dealt with the curse of dimensionality in DEA. They proposed the LASSO variable selection technique and combined it with sign-constrained convex nonparametric least squares (SCNLS) to support estimating the production function using DEA for small datasets. They also proved that this approach provides useful guidelines for DEA with small datasets. Chen et al. (2021) were inspired by their approach and proposed a simplified two-step LASSO+DEA approach to handle the dimensionality of data entering the DEA models via LASSO. They used standard cross-validation LASSO to select an optimal number of regressors. These regressors were used in the DEA model. As an important advantage of this approach over the study of Lee and Cai (2020), Chen et al. (2021) state that the tuning parameter λ was not chosen manually but was determined by optimizing the classical cross-validation criterion to optimally select the relevant variables.

Materials and Methods
In this paper, two approaches to feature selection were compared. The first was domain knowledge; the second was feature selection based on LASSO. Based on these two approaches, two sets of variables were chosen. With these data, VRS DEA models were formed, and their results were assessed and compared using accuracy, the ROC curve, AUC and Somers' D.
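As a minimal sketch of two of the evaluation measures named above, the following pure-Python function computes AUC by pairwise comparison and derives Somers' D from it through the identity D = 2·AUC − 1, which holds for a binary outcome. The function name, the toy scores and the convention that a higher score means a higher risk of bankruptcy are illustrative assumptions, not taken from the study.

```python
def auc_and_somers_d(scores, labels):
    """Return (AUC, Somers' D) for binary labels (1 = bankrupt, 0 = healthy).

    AUC is the share of (bankrupt, healthy) pairs in which the bankrupt
    business has the higher score; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    concordant = sum(1 for p in pos for q in neg if p > q)
    ties = sum(1 for p in pos for q in neg if p == q)
    auc = (concordant + 0.5 * ties) / (len(pos) * len(neg))
    return auc, 2.0 * auc - 1.0


# Hypothetical scores for six businesses (not data from the paper).
toy_scores = [0.95, 0.90, 0.60, 0.85, 0.40, 0.30]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc, toy_d = auc_and_somers_d(toy_scores, toy_labels)
```

In the toy data, eight of the nine bankrupt/healthy pairs are ordered correctly, so the AUC is 8/9 and Somers' D is 7/9.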

Description of the Research Sample
The input database for the prediction of financial distress of companies operating under SK NACE 41-Construction of buildings consisted of data from the financial statements of 2660 companies. The database of the financial statements of these companies was provided by CRIF-Slovak Credit Bureau, s.r.o. (CRIF 2023).
In order to prepare the research sample for analysis, businesses with zero sales and incomplete records were removed. Since the DEA method is sensitive to outliers, it was necessary to identify and remove them from the analyzed sample. For this purpose, kernel density estimates (Scott 1992) were created for all analyzed indicators using the Epanechnikov kernel function, which was applied in studies by Produit et al. (2010), Gyamerali et al. (2019) and Moraes et al. (2021). After excluding the outliers, we continued to work with a sample of 1349 businesses. In order to use the DEA method, the assumption of bankruptcy was established. The analyzed businesses were divided into prosperous and non-prosperous ones based on the criteria reflecting valid Slovak legislation and practice. Non-prosperous businesses included businesses which fulfilled the following criteria: negative EAT (earnings after taxes), equity to liabilities ratio lower than 0.08, and current ratio lower than 1 (Valaskova et al. 2017). The analyzed sample contained 1282 prosperous and 67 non-prosperous businesses.
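The labelling rule above can be sketched as a small predicate. This is only an illustration: it assumes that all three criteria must hold at the same time, it treats a business with non-positive liabilities as not meeting the equity-to-liabilities criterion, and the function name and the three example firms are hypothetical.

```python
def is_non_prosperous(eat, equity, liabilities, current_ratio):
    """Flag a business as non-prosperous under the three stated criteria.

    Assumption: the criteria are combined with a logical AND.
    """
    if liabilities <= 0:
        # Assumption: the equity-to-liabilities ratio is undefined here,
        # so the business cannot satisfy that criterion.
        return False
    return eat < 0 and (equity / liabilities) < 0.08 and current_ratio < 1.0


# Hypothetical firms (values are illustrative, not from the sample).
firms = {
    "A": dict(eat=-120.0, equity=30.0, liabilities=900.0, current_ratio=0.7),
    "B": dict(eat=-50.0, equity=400.0, liabilities=600.0, current_ratio=0.9),
    "C": dict(eat=80.0, equity=20.0, liabilities=950.0, current_ratio=0.6),
}
firm_labels = {name: is_non_prosperous(**f) for name, f in firms.items()}
```

Only firm A is flagged: B has an equity-to-liabilities ratio well above 0.08, and C has positive EAT.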
The construction industry was chosen because it is one of the few industries that can have a stabilizing effect on the economy; this segment is an indicator of economic development and affects the development of other industries and segments of the economy (MTSR 2019; PS Stavby 2021).Therefore, it is necessary to pay attention to the prediction of financial difficulties of companies operating in this industry and to identify possible risks these companies have to face.
The selection of input parameters for predicting bankruptcy was carried out using the domain knowledge approach. Financial features were selected based on the research of Kovacova et al. (2019), as follows: the three most frequently used features were selected from each group of indicators mentioned in this research (see Table 2). To avoid the occurrence of highly correlated features within the selected set, the correlation matrix was applied. From pairs of highly correlated indicators with a correlation coefficient higher than 0.9 (Delina and Packova 2013), the indicator with a higher frequency of usage was selected.

Table 2.
Group of indicators | Features
Activity ratios | Total revenues to total assets; Total asset turnover ratio; Cash flow to total assets
Liquidity ratios | Current ratio; Quick ratio; Working capital to total assets
Profitability ratios | Return on assets with EAT; Return on equity; Return on assets with EBIT
Debt ratios | Liabilities to total assets; Equity to total assets; Cash flow to liabilities
Source: Kovacova et al. (2019).
The second set of financial features was selected with the use of LASSO penalized logistic regression. A logistic regression model is determined by the probability of success of the dependent variable, where this category is coded as 1 and the other category is coded as 0. For k independent variables, the probability that the dependent variable is equal to 1 is expressed as follows (1) (Wu et al. 2009; Rabaca et al. 2023):

$$P(y_i = 1) = \frac{\exp\left(\beta_0 + \sum_{j=1}^{k} \beta_j x_{ji}\right)}{1 + \exp\left(\beta_0 + \sum_{j=1}^{k} \beta_j x_{ji}\right)} \qquad (1)$$

where $y_i$ is the response for observation i, $x_{ji}$ is the j-th predictor for observation i, $\beta_j$ is the regression coefficient for the j-th predictor, and $\beta_0$ is the intercept. The logit is expressed by the logarithm of the odds as follows (2) (Rabaca et al. 2023):

$$\ln\frac{P(y_i = 1)}{1 - P(y_i = 1)} = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ji} \qquad (2)$$

LASSO is a particular case of penalized least squares regression with an L1 penalty function (Muthukrishnan and Rohini 2016). The LASSO penalized log-likelihood function that needs to be maximized can be written as follows (3) (Rabaca et al. 2023; in Hastie et al. 2009):

$$\sum_{i=1}^{n}\left[ y_i\left(\beta_0 + \sum_{j=1}^{k} \beta_j x_{ji}\right) - \ln\left(1 + \exp\left(\beta_0 + \sum_{j=1}^{k} \beta_j x_{ji}\right)\right)\right] - \lambda \sum_{j=1}^{k} \left|\beta_j\right| \qquad (3)$$

where λ is the penalty parameter, and n is the number of observations. The penalty parameter λ ≥ 0 controls the amount of regularization applied to the estimate (Zhao and Yu 2006). The optimal value of λ (λ_min) is usually determined with the use of a 10-fold cross-validation method (Liu et al. 2021). The advantage of LASSO is that it improves the prediction accuracy and interpretability of the model by combining the good properties of ridge regression and subset selection. If there is a high correlation in the group of predictors, LASSO selects only one of them and shrinks the rest to zero (Muthukrishnan and Rohini 2016).
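To illustrate how the L1 penalty sets coefficients exactly to zero, the following is a minimal pure-Python sketch of L1-penalized logistic regression fitted by proximal gradient descent (soft-thresholding). It is not the procedure or software used in the study: it assumes centred predictors, an unpenalized intercept, a log-likelihood averaged over the n observations (which only rescales λ), and a fixed λ instead of one chosen by 10-fold cross-validation. All names and the toy data are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(z, t):
    """Proximal operator of t * |z|: shrink z towards zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_logistic(X, y, lam, step=0.5, n_iter=2000):
    """Fit an L1-penalized logistic regression by proximal gradient descent.

    Returns (intercept, coefficients). The intercept is not penalized.
    """
    n, k = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * k
    for _ in range(n_iter):
        # Residuals p_i - y_i of the current model.
        resid = [
            sigmoid(b0 + sum(b * x for b, x in zip(beta, row))) - yi
            for row, yi in zip(X, y)
        ]
        grad0 = sum(resid) / n
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(k)]
        b0 -= step * grad0
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return b0, beta


# Toy data: the first predictor is informative, the second is pure noise.
X_toy = [[-1.5, 0.5], [-1.0, -0.5], [-0.5, 0.5], [-0.5, -0.5],
         [0.5, -0.5], [0.5, 0.5], [1.0, -0.5], [1.5, 0.5]]
y_toy = [0, 0, 0, 1, 0, 1, 1, 1]

_, beta_small = lasso_logistic(X_toy, y_toy, lam=0.05)  # mild penalty
_, beta_large = lasso_logistic(X_toy, y_toy, lam=0.5)   # strong penalty
```

With the mild penalty only the informative predictor keeps a non-zero coefficient, while the strong penalty shrinks every coefficient to zero. This mirrors the behaviour described above, where only a subset of the candidate ratios retains non-zero coefficients at the selected λ.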

Method Used for Bankruptcy Prediction
To identify businesses that are threatened with bankruptcy, the DEA method was applied. The DEA model was built in the DEAFrontier software (Zhu 2023). Since this software cannot work with negative values, in accordance with the approach of the software creator, a positive constant was added to the values of indicators (Seiford and Zhu 2002). According to Silva Portela et al. (2004), if the research sample contains negative values, it is necessary to use a model with variable returns to scale. In accordance with this approach, the VRS DEA model was applied in this paper. This model assumes variable returns to scale, and it was developed by Banker et al. (1984). The dual input-oriented VRS DEA model can be written as follows (4):

$$\begin{aligned}
\min \quad & \theta_q - \varepsilon\left(\sum_{i=1}^{m} s_i^{-} + \sum_{k=1}^{r} s_k^{+}\right) \\
\text{subject to} \quad & \sum_{j=1}^{n} \lambda_j x_{ij} + s_i^{-} = \theta_q x_{iq}, \quad i = 1, \dots, m, \\
& \sum_{j=1}^{n} \lambda_j y_{kj} - s_k^{+} = y_{kq}, \quad k = 1, \dots, r, \\
& \sum_{j=1}^{n} \lambda_j = 1, \\
& \lambda_j \ge 0, \quad s_i^{-} \ge 0, \quad s_k^{+} \ge 0
\end{aligned} \qquad (4)$$

where $\theta_q$ is the value of the objective function, ε is the non-Archimedean infinitesimal value, $x_{ij}$ and $y_{kj}$ are the inputs and outputs of DMU$_j$, $x_{iq}$ and $y_{kq}$ are the inputs and outputs of DMU$_q$, m and r are the number of inputs and outputs, respectively, n is the number of DMUs, $\lambda_j$ is the convex coefficient, and $s_i^{-}$ and $s_k^{+}$ are the input and output slack variables. With the use of the VRS DEA model, businesses which possibly have financial difficulties were identified. In line with the approach of Premachandra et al. (2009), the two sets of features were divided into inputs and outputs as follows: financial ratios whose smaller (inferior) values could possibly cause financial failure were considered to be input variables, whereas ratios whose larger (superior) values could cause financial failure were classified as output variables. In this approach, businesses that possibly have financial difficulties form the financial distress frontier. The scores of these businesses are equal to 1. Financially healthy businesses are then expected to lie inside the financial distress possibility set, which is shaped by the financial distress frontier. Premachandra et al. (2011) pointed to the need to find a suitable cut-off value of the DEA model at which the classification accuracy of the given model is optimal. In this paper, the Youden index was used to determine the optimal cut-off, which is calculated as follows: Youden index = Sensitivity + Specificity − 1 (Hajian-Tilaki 2018). The optimal cut-off point is the point at which Sensitivity + Specificity reaches its maximum (Youden 1950; in Yin and Tian 2014).
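The cut-off selection described above can be sketched in a few lines. The example assumes that a business is classified as bankrupt when its DEA score is greater than or equal to the cut-off, since scores of 1 lie on the financial distress frontier. The function name and the ten toy scores are hypothetical.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, J, sensitivity, specificity) maximizing the Youden index.

    labels: 1 = bankrupt, 0 = non-bankrupt.
    A business is predicted bankrupt when score >= cutoff (assumption).
    """
    best = None
    for c in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= c)
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < c)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < c)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= c)
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        j = sens + spec - 1.0
        if best is None or j > best[1]:
            best = (c, j, sens, spec)
    return best


# Hypothetical DEA scores: four bankrupt and six non-bankrupt businesses.
dea_scores = [1.00, 0.95, 0.90, 0.70, 0.92, 0.60, 0.55, 0.40, 0.35, 0.20]
dea_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
cutoff, j_max, sens_at_cut, spec_at_cut = youden_cutoff(dea_scores, dea_labels)
```

For the toy data the index peaks at a cut-off of 0.70, where all four bankrupt businesses and five of the six non-bankrupt ones are classified correctly.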

Results
From the results presented in Table 3, it is clear that the analyzed sample of companies achieved the required liquidity values, which means that most of the companies are able to pay their liabilities. Since liquidity is one of the representatives of the financial risk of companies and its low values can put companies in a state of financial distress, these results can be evaluated positively. Equally good results are indicated by the median of the indicator net working capital to current assets, which is 21%. This value is not optimal, but it can be considered acceptable from the point of view of financial risk. The results of the profitability indicators can also be evaluated positively, as their medians are positive and range from 2% to 9%. The costs ratio (0.98) is consistent with this. This value of the indicator gives companies room for profit creation.
The total asset turnover ratio reaches a value of (1.46 or 1.43), which can be considered an adequate turnover rate considering the subject of business activity.
Less favorable results are indicated by the indebtedness values. The share of liabilities in total assets is up to 68%, 56% of which are short-term liabilities. Liabilities are 1.69 times higher than the company's equity and, thus, the liabilities to equity ratio does not reach the required optimal value. It is precisely the indebtedness of businesses that can be considered a weak point of the analyzed sample, which represents a risk of financial distress for them. Based on the research of Kovacova et al. (2019) and the procedure described in Section 3.2, Selection of financial features, the selected DK features were as follows: Total revenues to total assets, Current ratio, Net working capital to total assets, Return on assets with EAT, Return on equity, Netto cash flow to liabilities and Liabilities to total assets. These features were selected with the use of the correlation matrix.
The most relevant predictors according to LASSO penalized logistic regression were identified by optimizing the value of λ_min using 10-fold cross-validation. At the optimal lambda value of 0.0071, 7 financial ratios out of 26 exhibit non-zero coefficients (see Table 4). These indicators are as follows: Liabilities to total assets, Return on costs, Return on equity, Short-term liabilities to total assets, Net working capital to total assets, Netto cash flow to total assets, and Total asset turnover ratio. The coefficients of the rest of the indicators were shrunk to zero. A similar approach was used in the study of Chen et al. (2021), who used LASSO with the tuning parameter λ selected by optimizing the cross-validation criterion. In this way, the authors selected the relevant variables optimally before deploying DEA on these variables. A simplified LASSO+DEA approach was also used by Lee and Cai (2020). However, these authors chose to tune the parameter λ manually. Features selected with the use of the DK approach and LASSO penalized logistic regression were used as inputs and outputs for the VRS DEA models. Two VRS DEA models were formulated: the model with the application of DK features (DK+DEA) and the model with the application of LASSO features (LASSO+DEA). Their results are compared in Table 5. In the case of the DK+DEA model, 41 businesses lie on the financial distress frontier. The LASSO+DEA model identified 13 fewer businesses lying on the financial distress frontier. In the case of DK+DEA, the most numerous group of enterprises is located in the efficiency interval between 0.8 and 0.9; in the case of LASSO+DEA, by contrast, the largest number of identified enterprises is in the interval between 0.4 and 0.5. For better comparability of the results, the optimal cut-offs for both models were determined with the use of the Youden index.
The optimal cut-off of the LASSO+DEA model was determined at the level of 0.59. The classification accuracy for bankrupt businesses at this cut-off was 79.10% (see Table 6). The classification accuracy for non-bankrupt businesses achieved a higher value, 86.66%. The overall classification accuracy of the LASSO+DEA model at a cut-off of 0.59 was 86.29%. In the case of the DK+DEA model, the optimal cut-off was determined at the level of 0.89. At this cut-off, the DK+DEA model achieved high classification accuracy for bankrupt businesses, 97.01%, and lower classification accuracy for non-bankrupt businesses, 78.72% (see Table 7). The overall classification accuracy of the DK+DEA model was 79.63%. A slightly higher overall classification accuracy of the DEA model with the application of DK features (85.1%) was achieved by Cielen et al. (2004). DEA models using DK features developed by Premachandra et al. (2009) achieved an overall classification accuracy of 74-86%. Similar to our results, these models achieved higher classification accuracy for bankrupt businesses. Based on the results presented in Tables 6 and 7, we can conclude that the DK+DEA model performs better when identifying bankrupt businesses. It means that features selected via the DK approach are more suitable for bankruptcy prediction. The selection of DK features, with the application of which the DEA model with a higher classification accuracy was created, represents the fulfillment of the aim set in this paper.
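As a short arithmetic check of the figures above, the overall accuracy can be recovered as the class-size-weighted average of the two class-wise accuracies, and the Youden index can be computed at each reported cut-off. The sketch uses only the percentages and sample sizes reported in the text (67 non-prosperous and 1282 prosperous businesses); the function and variable names are illustrative.

```python
def overall_accuracy(sens, spec, n_bankrupt, n_healthy):
    """Overall accuracy as the class-size-weighted mean of the class accuracies."""
    return (sens * n_bankrupt + spec * n_healthy) / (n_bankrupt + n_healthy)


def youden_j(sens, spec):
    """Youden index = sensitivity + specificity - 1."""
    return sens + spec - 1.0


N_BANKRUPT, N_HEALTHY = 67, 1282  # sample sizes reported in the text

# Class-wise accuracies reported at the optimal cut-offs.
lasso = {"sens": 0.7910, "spec": 0.8666}
dk = {"sens": 0.9701, "spec": 0.7872}

acc_lasso = overall_accuracy(lasso["sens"], lasso["spec"], N_BANKRUPT, N_HEALTHY)
acc_dk = overall_accuracy(dk["sens"], dk["spec"], N_BANKRUPT, N_HEALTHY)
j_lasso = youden_j(lasso["sens"], lasso["spec"])  # about 0.658
j_dk = youden_j(dk["sens"], dk["spec"])           # about 0.757
```

The weighted averages come out at roughly 0.863 and 0.796, matching the reported overall accuracies to within rounding. The lower overall accuracy of DK+DEA follows from the large share of non-bankrupt businesses in the sample, while its Youden index at the chosen cut-off is higher than that of LASSO+DEA.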
This result is also confirmed by the ROC curve (see Figure 1). The results show that both DEA models achieved excellent classification accuracy; however, the classification accuracy of the DK+DEA model was slightly higher.

Discussion
The summary of the research results shows interesting findings. By applying different features, the models achieved different classification accuracies. Table 8 and Figure 2 show the comparison of the bankruptcy prediction results achieved using the DEA model when applying features selected via DK and LASSO.

Discussion
The summary of the research results shows interesting findings. By applying different features, the models achieved different classification accuracies. Table 8 and Figure 2 compare the bankruptcy prediction results achieved using the DEA model when applying features selected via DK and LASSO. The analysis shows that when applying DK features, the VRS DEA model confirmed the assumption of bankruptcy in 44 businesses, which is 13 more businesses than when applying LASSO features. This is 23 fewer businesses than the assumption of bankruptcy. When applying LASSO features, however, it is 36 fewer businesses than the assumption of bankruptcy.
On the other hand, in the case of LASSO, only 3 of 31 businesses were incorrectly identified, compared with 16 of 44 in the case of DK (see Table 8). These results indicate that the DK+DEA model has a better classification accuracy in relation to the assumption of bankruptcy. However, LASSO+DEA shows a smaller deviation in the number of identified businesses on the financial distress frontier.
Based on the above results, the application of feature selection using the LASSO method appears to be more appropriate. However, it is necessary to continue the analysis and apply other procedures and methods.
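The counts discussed above follow from simple set arithmetic, sketched below. The business IDs are hypothetical and chosen only to reproduce the reported numbers. The total of 67 businesses assumed to be bankrupt is not stated directly but is implied by 44 + 23 and 31 + 36. The sketch does not claim that the same 28 businesses were matched under both feature sets.

```python
from typing import Dict, Set


def frontier_summary(frontier: Set[int], assumed_bankrupt: Set[int]) -> Dict[str, int]:
    """Compare businesses on the financial distress frontier with the
    businesses assumed to be bankrupt."""
    return {
        "on_frontier": len(frontier),
        # on the frontier and also assumed bankrupt
        "match": len(frontier & assumed_bankrupt),
        # on the frontier but not assumed bankrupt
        "difference": len(frontier - assumed_bankrupt),
        # how many fewer businesses the frontier holds than the assumption
        "shortfall_vs_assumption": len(assumed_bankrupt) - len(frontier),
    }


# Hypothetical IDs: 0..66 are the businesses assumed to be bankrupt.
assumed = set(range(67))
dk_frontier = set(range(28)) | set(range(100, 116))  # 28 matched + 16 others
lasso_frontier = set(range(28)) | {200, 201, 202}    # 28 matched + 3 others

dk_summary = frontier_summary(dk_frontier, assumed)
lasso_summary = frontier_summary(lasso_frontier, assumed)
```

Read this way, DK places more businesses on the frontier (44 against 31), while LASSO places fewer businesses there that fall outside the assumption of bankruptcy (3 against 16).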
To analyze the results in more detail, it is necessary to specify the features used in both cases. They are presented in Table 9. Agreement in the selection of indicators occurred in the case of three indicators, highlighted in italics in Table 9. However, the selection based on the experience of experts seems to be more relevant, as it also includes Current ratio and Insolvency ratio (Netto cash flow to liabilities). Many authors consider these indicators to be important predictors of bankruptcy. This can be confirmed by the definitions of financial health of several authors. Szilagyi (2004) defined a financially healthy business, and, within his definition, he pointed out that such a business is not expected to become insolvent and does not show any sign of a threat to its existence, and it is even able to adequately cover the risks related to indebtedness. The importance of ability to pay was also pointed out by Koh et al. (2015), who defined financial distress as a situation when a business cannot pay the amount owed on the due date. Platt et al. (1995) argue that financial distress occurs when the total value of a company's assets is lower than the total value of creditors' claims. In the long term, this situation can lead to forced liquidation or bankruptcy. For this reason, financial distress is often referred to as a harbinger of bankruptcy and is related to the availability of liquid funds and credit (Hendel 1996). Gestel et al. (2006) characterize financial distress and financial failure as the result of chronic losses that cause a disproportionate increase in liabilities accompanied by a loss of assets' value. It is possible to mention other authors who talk about the ability to repay obligations as an important predictor of bankruptcy (Campbell 2011; Achim et al. 2012).
This means that a financially healthy company is able to pay its obligations and fulfills the purpose of its existence, which is to be profitable. Therefore, the inclusion of the indicators Current ratio, Netto cash flow to liabilities and Return on assets among the DK features is justified. This selection seems to be much more relevant than the LASSO selection.
On the other hand, it should be pointed out that the Current ratio was used as one of the criteria when establishing the assumption of bankruptcy. This indicator was selected as one of the DK features as well. This fact could affect the results of the DK+DEA model.
Table 10 shows the overall performance of the constructed DEA models. We can see that the DK+DEA model achieved slightly better AUC and Somers' D compared to the LASSO+DEA model. On this basis, we can conclude that the selection of DK features is more appropriate than the selection of LASSO features when predicting the bankruptcy of businesses. Compared with the literature, the results of previous studies are slightly different. Zhou et al. (2015) found that there is no significant difference between the classification performance of models with feature selection guided by data mining techniques and that of those guided by domain knowledge. The findings of Lin et al. (2014), who revealed that a model with LASSO-based feature selection achieved a slightly higher performance in terms of accuracy as well as AUC compared to DK, are also slightly different. However, the comparability of these studies depends on several factors, e.g., the research sample and the model used.
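The two summary measures in Table 10 are closely linked: for a binary outcome, Somers' D equals 2 × AUC − 1. A minimal sketch follows, which assumes that higher scores indicate a higher risk of bankruptcy. The function names and toy data are hypothetical and do not reproduce the values in Table 10.

```python
from typing import Sequence


def auc(scores: Sequence[float], bankrupt: Sequence[bool]) -> float:
    """AUC as the share of (bankrupt, non-bankrupt) pairs in which the
    bankrupt business has the higher score; ties count as one half."""
    pos = [s for s, y in zip(scores, bankrupt) if y]
    neg = [s for s, y in zip(scores, bankrupt) if not y]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def somers_d(scores: Sequence[float], bankrupt: Sequence[bool]) -> float:
    """Somers' D for a binary outcome, computed as 2 * AUC - 1."""
    return 2.0 * auc(scores, bankrupt) - 1.0


# Hypothetical toy data (not the paper's sample).
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
bankrupt = [True, True, True, False, False, False]
toy_auc = auc(scores, bankrupt)
toy_d = somers_d(scores, bankrupt)
```

Because one measure is a linear transformation of the other, a model that ranks higher on AUC necessarily ranks higher on Somers' D as well. The two measures therefore express the same ordering of the DK+DEA and LASSO+DEA models.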

Conclusions
In this research, the features for DEA models were selected with the use of the domain knowledge approach and the LASSO approach. According to DK, the following bankruptcy prediction indicators were chosen: Total revenues to total assets, Current ratio, Net working capital to total assets, Return on assets with EAT, Return on equity, Netto cash flow to liabilities, and Liabilities to total assets. LASSO identified the following predictors of bankruptcy: Liabilities to total assets, Return on costs, Return on equity, Short-term liabilities to total assets, Net working capital to total assets, Netto cash flow to total assets, and Total asset turnover ratio. Subsequently, the performance of the DK+DEA and LASSO+DEA models was compared. Performance was different for both selections at different cut-offs. For the selection of features according to LASSO, the optimal cut-off was 0.59, which means that from this value, businesses were identified as bankrupt. In the case of selecting features based on DK, the optimal cut-off value was at the level of 0.89. Based on this fact, it can be concluded that in the case of DK feature selection, more indicators were identified as predictors of businesses' bankruptcy. Important predictors of bankruptcy found with the DK application include Current ratio, Insolvency ratio (Netto cash flow to liabilities) and Return on assets, which are missing in features selected via LASSO. These features are significant predictors of bankruptcy that are applied in many bankruptcy prediction studies (Reznakova and Karas 2014; Lin et al. 2014; Pavlicko and Mazanec 2022).
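The overlap between the two feature sets listed above can be checked with a short sketch. The feature names are copied from the text and the variable names are illustrative. Matching is by exact name, so related ratios with different names count as distinct.

```python
# Features selected via domain knowledge (DK), as listed in the text.
dk_features = {
    "Total revenues to total assets",
    "Current ratio",
    "Net working capital to total assets",
    "Return on assets with EAT",
    "Return on equity",
    "Netto cash flow to liabilities",
    "Liabilities to total assets",
}

# Features selected via LASSO, as listed in the text.
lasso_features = {
    "Liabilities to total assets",
    "Return on costs",
    "Return on equity",
    "Short-term liabilities to total assets",
    "Net working capital to total assets",
    "Netto cash flow to total assets",
    "Total asset turnover ratio",
}

shared = dk_features & lasso_features      # selected by both approaches
dk_only = dk_features - lasso_features     # selected by DK only
lasso_only = lasso_features - dk_features  # selected by LASSO only
```

The three shared names agree with the three indicators on which both selections coincide, and the DK-only set contains Current ratio, Netto cash flow to liabilities and Return on assets with EAT.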
The contribution of the paper is the application of DK and LASSO features and the VRS DEA model in the evaluation of the financial failure of businesses. The results revealed that the DK+DEA model achieved higher classification and prediction accuracy compared to the LASSO+DEA model. On the other hand, there is a smaller deviation in the number of identified businesses on the financial distress frontier in the LASSO+DEA model.
The added value of this research lies in pointing out the importance of the indicators Current ratio and Return on assets, which were the criteria used to establish the assumption of bankruptcy. Since these indicators entered the DK+DEA model as well, this model achieved higher classification accuracy compared to LASSO+DEA. Therefore, it is necessary to pay more attention to the selection of criteria for determining the assumption of bankruptcy and subsequently to the selection of features based on DK. In terms of managerial implications, this research enables companies and managers in the construction industry to focus on those features that are decisive for evaluating the financial health of companies.
A limitation of the given research was missing and insufficient data. Another limitation was the occurrence of a relatively large number of outliers. Future research will be focused on confirming the significance of selected indicators for predicting the financial failure of companies, and especially on the Current ratio and its use in identifying prosperous and non-prosperous businesses.

Figure 1. Comparison of ROC curves of DK+DEA and LASSO+DEA models.
Summary rows (values for DK+DEA; LASSO+DEA):
in financial distress possibility set: 1305; 1318
Match in the number of businesses on financial distress frontier: 28; 28
Difference in the number of businesses on financial distress frontier: 16; 3
Difference in the number of businesses on financial distress frontier compared to the assumption of bankruptcy: 23; 36
Source: authors.

Figure 2. Comparison of classification accuracy of DK+DEA and LASSO+DEA models. Source: authors.

(equity + long-term liabilities) / fixed assets; Return on costs (ROC); short-term liabilities / total assets; Return on assets with EAT (ROA_EAT)

Table 2. Most frequently used bankruptcy prediction features in V4 countries.

Table 3. Descriptive statistics of financial features.

Table 4. Coefficients of the indicators for λ min.

Table 5. Efficiency results of DEA models.

Table 6. Results of LASSO+DEA model at optimal cut-off 0.59.

Table 7. Results of DK+DEA model at optimal cut-off 0.89.

Table 8. Summary of DK+DEA and LASSO+DEA results.

Table 9. DEA inputs and outputs in two sets of financial features.

Table 10. Performance of DEA models.