Machine Learning for Chinese Corporate Fraud Prediction: Segmented Models Based on Optimal Training Windows

Chang Chuan Goh; Yue Yang; Anthony Bellotti; Xiuping Hua

doi:10.3390/info16050397

¹ School of Computer Science, University of Nottingham Ningbo China, Ningbo 315100, China

² Department of Finance, Accounting and Economics, Nottingham University Business School China, University of Nottingham Ningbo China, Ningbo 315100, China

³ UNNC-NFTZ Blockchain Laboratory, University of Nottingham Ningbo China, Ningbo 315100, China

* Author to whom correspondence should be addressed.

Information 2025, 16(5), 397; https://doi.org/10.3390/info16050397

This article belongs to the Section Artificial Intelligence

Abstract

We propose a comprehensive and practical framework for Chinese corporate fraud prediction which incorporates classifiers, class imbalance, population drift, segmented models, and model evaluation using machine learning algorithms. Based on a three-stage experiment, we first find that the random forest classifier has the best performance in predicting corporate fraud among 17 machine learning models. We then implement the sliding time window approach to handle population drift, and the optimal training window found demonstrates the existence of population drift in fraud detection and the need to address it for improved model performance. Using the best machine learning model and optimal training window, we build general model and segmented models to compare fraud types and industries based on their respective predictive performance via four evaluation metrics and top features using SHAP. The results indicate that segmented models have a better predictive performance than the general model for fraud types with low fraud rates and are as good as the general model for most industries when controlling for training set size. The dissimilarities between the top features set of the general and segmented models suggest that segmented models are useful in providing a better understanding of fraud occurrence.

Keywords: machine learning; corporate fraud; population drift; segmented model; fraud type; industry

1. Introduction

Corporate fraud is defined as the use of one’s occupation for personal enrichment through the deliberate misuse or misapplication of the employing organization’s resources or assets [1]. According to the 2023 Global Fraud Survey, corporate fraud cases worldwide have incurred a total loss of more than 3.1 billion USD, which is estimated to be equivalent to 5% of a company’s revenue [2]. Due to its huge financial impact, corporate fraud has attracted interest from researchers and regulators since the mid-20th century. The history of corporate fraud detection dates back to the early 1950s, when Cressey [3] proposed the famous fraud triangle framework. This framework remains a classic model to date, and it is often combined with quantitative methods to carry out fraud prediction. With the emergence of artificial intelligence in the past two decades, research in the domain of corporate fraud detection has shifted toward the application of data analytics and machine learning models.

In China, corporate fraud is regulated by the China Securities Regulatory Commission (CSRC). Between 2000 and 2020, the CSRC revealed more than 5000 violation cases, with the number of violations showing an upward trend in recent years. This is due to the shift in the CSRC regulatory focus from market reform to strengthened supervision following the Chinese stock market crash in 2015 [4]. The rising fraud rate has drawn the attention of researchers to the occurrence of corporate fraud in China. Despite the increasing number of studies on this topic over the past five years, some important aspects are still not properly addressed in the literature. In particular, we find that none of the extant research considers the population drift problem, which is one of the vital issues in credit scoring [5]. In addition, although the data contain firms from 19 different industries, fraud analysis based on industry sector is rarely seen in the literature, with only a few studies focusing on specific large industries [6,7]. Furthermore, most of the existing studies utilizing Chinese data focus on financial statement frauds [8,9]. Even though there are more than ten different fraud types, only a few studies consider all of them [10,11]. We believe that industry and fraud type can be used to extract useful insights.

Complementing the above literature, we propose a comprehensive framework for Chinese corporate fraud prediction which incorporates the unaddressed or rarely addressed aspects in the literature. We first collected observations for all fraud types from 2000 to 2020 and extracted yearly corporate governance information and financial indices from firms’ annual reports as features. These data were combined as firm-year observations to form the full dataset. To accommodate segmented models, the full dataset was broken down into data by fraud type and data by industry. The experiments were divided into three stages, each serving a different purpose. In the first stage, we selected the best machine learning model using the full data. Five machine learning classifiers were selected and combined with three resampling techniques that handle the class imbalance problem to form 15 classifier–resampling combinations. We further employed two ensemble-based resampling approaches to make up a total of 17 candidate models. We found that the random forest classifier without resampling performed best among the 17 models according to the average ranking of four performance measures.

The second stage dealt with population drift: the optimal time window for training data was selected for the general model based on the full data and for the fraud segmented models based on the data by fraud type. Our results indicate that predictive performance increased gradually as older data were discarded from the training sample, up to the point where the number of training examples became too small and performance began to decline. In particular, when predicting fraud occurrences in 2016 to 2017, the general model reported the best performance when training data prior to 2012 were discarded. For the same test window, all the optimal time windows found in the fraud segmented models excluded training data before 2010. This suggests that population drift exists in our data, possibly caused by factors such as changing corporate behavior, socio-economic conditions, and regulatory policy. This highlights the need to balance the tradeoff between using the most recent data and retaining sufficient data in order to obtain the best model performance.

Thirdly, using the best model and optimal time window, we built a general model and segmented models for fraud types and industries. The forecast performance results of the general model, evaluated by four metrics, are comparable to the model performance reported in the previous literature. In terms of feature importance, the risk level and solvency features emerged as top features in the general model. Next, we compared the general model with each segmented model according to predictive performance and top features associated with fraud occurrence. Our findings revealed that five out of nine segmented models for fraud type achieved a better performance than their respective predictions using the general model. Among these fraud types, four of them have low fraud rates (<2%), and two of them show a significant discrepancy in terms of top features. On the other hand, only 2 out of 12 segmented models for industry performed better than their respective predictions in the general model. Our further investigation indicates that this is due to the insufficient training sample for most of the industries. In particular, we found that when we controlled the training size, the segmented models were as good as, or even better than the general model in terms of the AUROC for 11 of the 12 industries. These results suggest that segmented models are more informative than the general model for fraud types with low fraud rates and most industries given sufficient training data and that they should be employed to investigate fraud occurrence for important fraud types and industries.

This paper contributes to the literature on Chinese corporate fraud detection in the following ways. Firstly, to the best of our knowledge, our study is the first paper to address the population drift problem in corporate fraud prediction. This is a common yet important subject in credit scoring, and methods have been proposed to handle this in multiple credit scoring scenarios [12,13,14]. However, none of the previous research considered population drift when building models to predict corporate fraud. We fill this gap by incorporating the selection of an optimal time window for training data before the main model is built. Our results shed new light on the impact of handling population drift in corporate fraud detection.

Secondly, our research provides new insights on the use of segmented models in corporate fraud prediction by introducing new dimensions for segmentation according to fraud type and industry. This is rarely seen in the literature, as previous studies focus on models for specific fraud types and large industries. Fraud predictions for less prevalent fraud types and small industries are understudied. We built segmented models for nine fraud types and 12 industries with sufficiently large sample sizes. The segmented models allow us to (1) compare the predictive performances of the general model and the segmented models; (2) understand which fraud types and industries are more difficult to predict; and (3) investigate the main risk features associated with each fraud type and industry. This offers a new perspective on corporate fraud prediction to researchers, practitioners, and regulators.

Thirdly, our paper complements the growing literature on Chinese corporate fraud detection by providing a comprehensive framework to predict corporate fraud. In particular, we include all observations and fraud types in our model. The features employed in our model can be easily extracted from readily available annual reports and directly used without sophisticated feature engineering approaches. We have also proposed a clear three-stage experimental design that takes into account data splitting, machine learning algorithms, evaluation metrics, and methods to handle the class imbalance and population drift problems, which are common in fraud prediction. Furthermore, we introduce segmented models for fraud type and industry to carry out comparisons between general and segmented models based on predictive performance and top features. We argue that we have taken into account most if not all aspects of corporate fraud prediction in our framework. Our work may serve as a practical reference for future research.

The remainder of the paper is organized as follows. Section 2 reviews the existing research on corporate fraud detection. Section 3 describes the data used in our study, how the data are split to carry out the experiments, and the methodology, which includes methods to handle the population drift problem, segmented models, and the three-stage experimental design. Section 4 reports the results for each stage, with discussion and analysis to draw meaningful insights. Section 5 provides concluding remarks.

3. Materials and Methods

3.1. Data

The data employed in our experiments were taken from the China Stock Market and Accounting Research (CSMAR) database [37]. The following information was retrieved from the CSMAR website:

Company profile such as company name, industry, and establishment date.
Annual corporate governance information for each company.
Annual or quarterly financial indices for each company.
Information on detection of frauds based on China Securities Regulatory Commission (CSRC) enforcement actions.

The initial sample consisted of data spanning from 2000 to 2020. CSMAR records the fraud revelation year and fraud occurrence year for 14 different fraud types and firms from 19 different industries. Following Lu et al. [11], we dropped two fraud types which were deemed as less severe, namely, delayed disclosure and false information disclosure. We also dropped firms from the finance industry. This is a common practice in the literature due to the significant differences in the definition of financial statements and accounting methods for financial firms [30]. The target variable $\mathit{Fraud}$ for year t is constructed for each firm based on fraud occurrence year, which takes the value of 1 if a firm commits any of the 12 frauds in year t and 0 otherwise. Multiple frauds committed by a firm in the same year are treated as a single observation so that there is only one observation for each firm-year.

To carry out fraud prediction, we matched frauds from year t with corporate governance and financial information from year $t - 1$. For quarterly financial reports, the average value across quarters was taken to represent the financial indices for a particular year. With this, we were left with data from 2000 to 2019, where data in year 2000 represent corporate governance and financial information from 2000 paired with fraud occurrence in 2001, and so forth. After all the calculations and merging, we obtained a total of 37,125 firm-year observations.
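The target construction and the one-year lag described above can be illustrated with a minimal pandas sketch. The table layouts, column names (`firm`, `fraud_year`, `roa`), and all values below are hypothetical and are not taken from CSMAR; only the logic (one flag per firm-year, features from year t - 1 matched with frauds in year t) follows the text.

```python
import pandas as pd

# Hypothetical enforcement records: one row per (firm, occurrence year, fraud type).
frauds = pd.DataFrame({
    "firm": ["A", "A", "B"],
    "fraud_year": [2010, 2010, 2012],
    "fraud_type": ["P2501", "P2502", "P2501"],
})

# Hypothetical yearly features (quarterly indices assumed already averaged per year).
features = pd.DataFrame({
    "firm": ["A", "A", "B", "B"],
    "year": [2009, 2010, 2011, 2012],
    "roa": [0.05, 0.02, 0.07, 0.01],
})

# Multiple frauds by one firm in the same year collapse into a single flag.
fraud_flags = frauds[["firm", "fraud_year"]].drop_duplicates().assign(Fraud=1)

# Features from year t-1 are matched with fraud occurrence in year t.
features = features.assign(fraud_year=features["year"] + 1)
panel = features.merge(fraud_flags, on=["firm", "fraud_year"], how="left")
panel["Fraud"] = panel["Fraud"].fillna(0).astype(int)

print(panel[["firm", "year", "fraud_year", "Fraud"]])
```

In this toy panel, firm A's 2009 features are labelled fraudulent because the firm committed fraud in 2010, while its 2010 features are labelled non-fraudulent.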

The observations were split based on the train–validation–test approach. Since our data consist of firm-year observations across a number of firms spanning over 20 years, they are considered as panel data. We utilized the method proposed by Granger and Huang [38] to split the data. In particular, the data were split into the in-sample, out-sample, and post-sample based on their respective firm and year. This arrangement simulates the real deployment of a fraud detection system. The in-sample and out-sample selections were the training and test set for model development, and the post-sample selection simulated the operational use of model post-development.

The in-sample set contains 75% of the firm-year observations from 2000 to 2017. This served as the training sample to carry out model building. The remaining 25% of the firm-year observations went into the out-sample set. We split the data in a way that the firms in in-sample and out-sample did not intersect, i.e., a firm’s observations were either entirely in in-sample or entirely in out-sample. The out-sample served as the validation sample, which was used for model selection. However, we only kept data from 2016 to 2017 in the out-sample set, and we discarded the remaining data. By doing so, we obtained a validation sample which had (1) the closest possible time period and trend to the post-sample; (2) a comparable violation rate as the post-sample; and (3) the same time duration (2 years) as the post-sample. These characteristics would allow us to select an optimal model for forecast prediction. Finally, the post-sample contained all firm-year observations from 2018 to 2019 and served as the test sample for our experiment.
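A firm-disjoint split of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the panel is synthetic, the column names are hypothetical, and the 75% share is applied to firms, which only approximates 75% of firm-year observations when firms have different numbers of observations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical balanced panel: 40 firms observed every year from 2000 to 2019.
panel = pd.DataFrame(
    [(f"firm_{i}", y) for i in range(40) for y in range(2000, 2020)],
    columns=["firm", "year"],
)

# Post-sample: every firm-year observation from 2018 to 2019.
post_sample = panel[panel["year"] >= 2018]
dev = panel[panel["year"] <= 2017]

# Assign whole firms, not single rows, to the in-sample or the out-sample.
firms = rng.permutation(dev["firm"].unique())
n_in = int(round(0.75 * len(firms)))
in_firms = set(firms[:n_in])

in_sample = dev[dev["firm"].isin(in_firms)]
# Only 2016-2017 observations of the held-out firms are kept for validation.
out_sample = dev[~dev["firm"].isin(in_firms) & dev["year"].between(2016, 2017)]
```

Splitting by firm prevents the same company from appearing in both the training and validation data, and restricting the out-sample to 2016 to 2017 keeps it close in time to the post-sample.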

Table 2 presents the number of observations, fraud count, and fraud rate for each sample. The in-sample contains 22,865 firm-year observations, and 2808 (12.28%) of the observations involve fraud. The out-sample consists of 1374 observations with 223 (16.23%) fraud occurrences. The post-sample contains 6700 observations, of which 1166 (17.40%) are fraudulent. We observe that the fraud rates in recent years are higher, and this necessitates a validation set which contains recent data to ensure the model selected is more precise. The breakdown of the number of observations and fraud rate by year is presented in Table A1 in Appendix A.

Table 2. Number of observations and fraud rate for full sample.

Data preprocessing was carried out after the data split. To minimize information leakage, all preprocessing parameters were estimated on the in-sample only and then applied to all three sample sets. We first winsorized numerical features at the top and bottom 1% to limit the influence of outliers and then standardized them. Categorical features were one-hot encoded into binary features. We adopted a standard approach to handle missing values, where the mean (or mode) was imputed for numerical (or nominal) features [39]. We further included indicator variables to represent whether a feature was missing. The processed data produced 241 features. Feature selection and feature engineering can be usefully applied for corporate fraud detection [10,32]. In our study, we followed Chen and Zhai [30] by using all original features in the model build. The reason for this is that machine learning models tend to internalize feature selection, transformation, and interaction effects through built-in mechanisms such as the LASSO penalty in logistic regression, the feature number in tree-based models, and non-linear transformations in kernel-based models and neural networks. Hence, all available features are retained to preserve the original effects of the features. These features were categorized into four groups according to the file they were drawn from in the CSMAR database, namely, corporate governance, profitability, risk level, and solvency. Features that belonged to none of these categories were grouped into a fifth category, others. Table 3 provides a brief description for each category and presents the number of features in each category.
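To make the leakage-avoidance idea concrete, the following sketch fits winsorization bounds, imputation means, and scaling statistics on a toy in-sample and reuses them on unseen data. The function names, the feature names (`roa`, `lev`), and the exact order of the steps are assumptions made for illustration; the text does not specify the order, and one-hot encoding of categorical features is omitted.

```python
import numpy as np
import pandas as pd

def fit_preprocessor(train: pd.DataFrame) -> dict:
    """Learn winsorization bounds, means and scales from the in-sample only."""
    lower = train.quantile(0.01)
    upper = train.quantile(0.99)
    clipped = train.clip(lower=lower, upper=upper, axis=1)
    return {
        "lower": lower,
        "upper": upper,
        "mean": clipped.mean(),
        # Guard against constant columns to avoid division by zero.
        "std": clipped.std(ddof=0).replace(0, 1.0),
    }

def transform(df: pd.DataFrame, params: dict) -> pd.DataFrame:
    """Apply in-sample parameters to any sample (in-, out- or post-sample)."""
    missing = df.isna().astype(int).add_suffix("_missing")  # missing indicators
    clipped = df.clip(lower=params["lower"], upper=params["upper"], axis=1)
    filled = clipped.fillna(params["mean"])                 # mean imputation
    scaled = (filled - params["mean"]) / params["std"]      # standardization
    return pd.concat([scaled, missing], axis=1)

# Toy data with hypothetical numerical features.
in_sample = pd.DataFrame({"roa": [0.1, 0.2, np.nan, 5.0], "lev": [1.0, 2.0, 3.0, 4.0]})
post_sample = pd.DataFrame({"roa": [100.0, np.nan], "lev": [2.5, 2.5]})

params = fit_preprocessor(in_sample)
post_processed = transform(post_sample, params)
```

Because the post-sample is transformed with bounds and means learned from the in-sample, an extreme value such as 100.0 is clipped to the in-sample 99th percentile instead of influencing the statistics.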

Table 3. Number of features and description for each feature category.

3.2. Dealing with Population Drift

A vital issue in credit scoring is modeling how the population evolves and changes over time [5]. This scenario is commonly known as population drift and often occurs as a result of economic pressures and a changing environment. Population drift can be classified into abrupt and gradual drifts based on its rate of change, where the former represents a sudden change in distribution at a certain time point, while the latter represents a steady change over time [40]. As mentioned in Section 3.1, the violation rate is higher in recent years than in earlier years, and this suggests population drift in our data. This may indicate an abrupt shift due to the change in regulatory policy [4], a gradual shift as a result of economic recession [26], or even a combination of both. A common approach to deal with the problem is to carry out re-estimation of the classifier’s parameters at different intervals based on a recent subset of data [35]. However, this is difficult and time-consuming, since the occurrence of population drift is unpredictable. Nevertheless, the population drift problem needs to be addressed, as it may cause a deterioration in classification performance. In our experiment, we dealt with population drift by selecting the optimal time window for training data.

Selection of Optimal Time Window for Training Data

To address the population drift problem, we have to minimize the impact of irrelevant data in our model. Trends change as time evolves, and this causes older data to become less relevant. Therefore, they should be excluded in model building to ensure that the model is able to learn the correct trend. This brings us to the next task, which is the identification of an optimal time point such that all data before the time point are removed. According to Qian et al. [41], this can be done by selecting the optimal time window for the training set. The optimal time window is defined as the time period which contains the subset of training data that gives the best model performance in the validation set among different time periods. Once the optimal time window is determined, the subset of training data within the time window is kept for subsequent models, whereas all the remaining training data outside the time period are discarded.

3.3. Segmented Models

Segmentation in credit scoring is the process of identifying homogeneous populations with respect to their predictive relationships [42]. Our data consist of firms from different industries committing different types of fraud over a span of 20 years. It is hence interesting to study whether the fraud patterns are different for each fraud type and industry and to what extent they differ from the general trend. To achieve this, we considered segmented models for the fraud types and industries which have sufficiently large samples. This allowed us to (1) compare the predictive performances of the general model and the segmented models; (2) understand which fraud types and industries are more difficult to predict; and (3) investigate the main risk features associated with each fraud type and industry.

3.3.1. Segmented Model for Fraud Type

There are 12 fraud types in our sample. For each fraud type i, we define the target variable $\mathit{Fraud}_{i,t}$, which takes the value of 1 if a firm commits fraud i in year t and 0 otherwise. For example, if a firm commits fraud P2501 but not fraud P2502 in year 2010, we have $\mathit{Fraud}_{\mathrm{P2501},2010} = 1$, whereas $\mathit{Fraud}_{\mathrm{P2502},2010} = 0$ for the firm. Since fraud is committed, the target variable $\mathit{Fraud}$ for the particular firm-year observation takes the value of one in the general model.
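The relationship between the per-type targets and the general target can be shown with a short pandas sketch. The records below are invented for illustration; only the fraud codes P2501 and P2502 appear in the text, and firm-years without any fraud (which would take the value 0 for every target) are omitted for brevity.

```python
import pandas as pd

# Hypothetical fraud records: (firm, occurrence year, fraud type code).
records = pd.DataFrame({
    "firm": ["X", "X", "Y"],
    "year": [2010, 2011, 2010],
    "fraud_type": ["P2501", "P2502", "P2502"],
})

# One indicator column per fraud type, one row per firm-year with any fraud.
by_type = (
    pd.crosstab([records["firm"], records["year"]], records["fraud_type"])
    .clip(upper=1)
    .add_prefix("Fraud_")
)

# General-model target: 1 if any fraud type occurs in the firm-year.
by_type["Fraud"] = by_type.max(axis=1)
print(by_type)
```

For firm X in 2010, the indicator for P2501 is 1 and the indicator for P2502 is 0, while the general target is 1.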

The fraud count and fraud rate for each fraud type in all three samples are presented in Table A2 in Appendix A. We used the same data split for the segmented models so that the number of observations in each sample would be consistent across the general model and segmented models. However, the fraud counts for the individual fraud types do not sum to the fraud count in Table 2, as some firms committed multiple frauds in the same year.

We observed that false record and material omission were the two dominant fraud types over the years, and some fraud types had extremely low fraud rates. In particular, there were three fraud types which recorded no fraud observation in both the out-sample and post-sample, namely, fraudulent listing, insider trading, and stock price manipulation. Since we would not be able to carry out validation and testing, segmented models were not built for these fraud types. Finally, we were left with nine segmented models for fraud type.

3.3.2. Segmented Model for Industry

According to the 2012 CSRC industry code, firms in China are classified into 19 different industries and assigned letters A to S to represent their respective industry. The breakdown of observations and fraud rate by industry in the in-sample and post-sample is presented in Table A3 in Appendix A. Firms from the finance industry (Industry J) were dropped, whereas firms from the service industry (Industry O) are not present in the data, making a total of 17 industries in our data. The breakdown for the out-sample is not reported because it was not employed in segmented models for industry. This is further discussed in Section 3.4.

We observe that more than half of the observations are from the manufacturing industry, whereas the agriculture industry and the healthcare industry have the highest fraud rates among all industries. Since some industries have few observations, segmented models were only built for industries with more than 300 observations in in-sample. As a result, we only built segmented models for 12 industries, dropping accommodation and catering (Industry H), research and technology (Industry M), education (Industry P), healthcare (Industry Q), and others (Industry S).

3.4. Experimental Setup

After data splitting and preprocessing, the setup of the experiment was divided into three stages. The first stage involved the selection of machine learning model; the second stage focused on the selection of optimal time window for training data; and the third stage involved forecast prediction and identification of top features based on the selections in the first two stages. Figure 1 presents a flow chart as a brief overview of the experiments and data involved at each stage.

Figure 1. Experimental design flow chart. *** In Stage 2, the selection of the optimal time window is not carried out for segmented models for industry due to insufficient sample size. Apart from the manufacturing industry (Industry C), the full sample from 2000 to 2017 for a particular industry is used to train the segmented model.

3.4.1. Stage 1: Selection of Best Machine Learning Model

The primary goal of the first stage is to select the best machine learning model for the subsequent stages. Based on the literature, we selected five machine learning classifiers as candidates. These included logistic regression, which is widely regarded as the typical model for credit scoring, and support vector machine, which has been proven to be effective in the literature [15,16]. We also employed two ensemble classifiers, namely, random forest and gradient boosting, which have been found to outperform traditional models in several studies [10,22,30]. In particular, for gradient boosting, we employed the XGBoost variant, since several recent studies documented that it performed best in corporate fraud detection [11,32,33]. Finally, we employed an artificial neural network model with two hidden layers, since it has been shown in the literature that, generally, deep learning models do not perform well for credit scoring tasks [39]. As fraud prediction involves imbalanced data, apart from the original data without resampling, we employed the Synthetic Minority Oversampling Technique (SMOTE) [43] and cost-sensitive learning to deal with the class imbalance problem. In particular, cost-sensitive learning is achieved by including an extra hyperparameter class_weight in model building in Python 3.11. We followed the combinatorial approach presented in Yang et al. [44] by combining the machine learning classifiers and resampling techniques to obtain 15 (5 × 3) classifier–resampling combinations. Since the use of ensemble-based resampling approaches is also popular [8,9,34], we further employed two ensemble approaches, namely, the RUSBoost classifier and the balanced random forest classifier, in our experiments. Since resampling is integrated in these classifiers, they are considered as individual models and were not combined with resampling techniques. These classifiers were added to the 15 classifier–resampling combinations to make up a total of 17 machine learning candidate models.
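The candidate set can be enumerated in a few lines. The string labels below are placeholders chosen for this sketch and do not correspond to specific library class names.

```python
from itertools import product

# Five classifiers crossed with three ways of handling class imbalance.
classifiers = ["logistic_regression", "svm", "random_forest", "xgboost", "neural_network"]
imbalance_handling = ["none", "smote", "cost_sensitive"]
combinations = [f"{clf}+{res}" for clf, res in product(classifiers, imbalance_handling)]

# Two ensemble approaches with built-in resampling are added as stand-alone models.
ensemble_resamplers = ["rusboost", "balanced_random_forest"]
candidates = combinations + ensemble_resamplers

print(len(combinations), len(candidates))
```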

For each of the 17 candidate models, we carried out hyperparameter tuning via grid search to select the best hyperparameter set based on the average AUROC across five-fold cross-validation on in-sample data. With the best hyperparameter set, the models were trained using the entire in-sample data and tested on out-sample data. This was repeated three times with different random states to ensure the stability and robustness of the models. Most extant research in the literature has utilized a few performance measures to assess the model performance. Lessmann et al. [45] also mention that it is beneficial to consider multiple metrics to obtain a comprehensive performance evaluation. We employed four metrics, namely, accuracy, area under the receiver operating characteristic curve (AUROC), F1-score, and area under the precision–recall curve (AUPRC), to evaluate the model performance. As a method to handle class imbalance, we implemented threshold optimization via the Youden index (J) [46] to determine the optimal threshold for accuracy and F1 based on the in-sample. The Youden index summarizes the effectiveness of a cutoff point on the ROC curve and is computed as the difference between the true positive rate (sensitivity) and the false positive rate (1 − specificity). The optimal threshold of a model can be determined by selecting the cutoff point with the highest Youden index. To determine the best machine learning model, we adopted the average rank ($\mathit{AvgR}$) approach proposed by Lessmann et al. [45]. The model performance was ranked across all models for each evaluation metric, and the average value of the four rankings was computed. The model with the highest average ranking was selected as the best model and employed in subsequent stages.
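The two selection rules in this stage, the Youden-index threshold and the average rank, can be sketched with NumPy and pandas. The labels, scores, and metric values below are made up for illustration, and the helper name `youden_threshold` is hypothetical. The sketch assumes that rank 1 denotes the best model, so the "highest" average ranking corresponds to the smallest mean rank.

```python
import numpy as np
import pandas as pd

def youden_threshold(y_true, scores):
    """Return the cutoff maximizing J = TPR - FPR over the observed scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos, neg = (y_true == 1), (y_true == 0)
    best_j, best_t = -1.0, None
    for t in np.unique(scores):
        pred = scores >= t                       # predict fraud at or above t
        tpr = (pred & pos).sum() / pos.sum()     # sensitivity
        fpr = (pred & neg).sum() / neg.sum()     # 1 - specificity
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, t
    return best_t, best_j

# Toy in-sample labels and predicted fraud probabilities.
y = [0, 0, 0, 1, 0, 1, 1]
p = [0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.9]
threshold, j = youden_threshold(y, p)

# Hypothetical out-sample metrics for three candidate models.
metrics = pd.DataFrame(
    {"accuracy": [0.70, 0.72, 0.68], "auroc": [0.75, 0.74, 0.70],
     "f1": [0.40, 0.42, 0.35], "auprc": [0.33, 0.36, 0.30]},
    index=["model_a", "model_b", "model_c"],
)
# Rank each metric separately (1 = best), then average the four ranks.
avg_rank = metrics.rank(ascending=False).mean(axis=1)
best_model = avg_rank.idxmin()
```

In the toy example the threshold of 0.35 captures all three fraud cases with one false positive out of four non-fraud cases, giving J = 0.75, and `model_b` obtains the best average rank (1.25) even though `model_a` has the best AUROC.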

3.4.2. Stage 2: Selection of Optimal Time Window for Training Data

The main goal of the second stage is to select the optimal time window for training data. For this task, we started with the whole in-sample data, which contained firm-year observations from 2000 to 2017. At each run, we carried out stratified sampling and randomly selected an equal number of observations from each year to make up a training sample of 5000 observations. This was done to ensure that data from all years were equally represented in the training sample. The best machine learning model obtained in Stage 1 was trained on the selected data and tested on the out-sample data. After that, data from the oldest year were discarded and replaced by an equal number of observations from each of the remaining years such that the size of the training sample remained unchanged at 5000 observations. The substitute data were drawn via stratified sampling from the unused training data pool. A new model was then trained on the updated sample and tested on the same out-sample data. The ‘data update’ step was repeated to obtain training samples for all 17 time windows; the training sample of the last time window contained 2500 observations each from 2016 and 2017.

To compensate for the randomness arising from data sampling, the process was repeated for 100 iterations. In other words, there were 100 training samples, trained models, and out-sample outcomes for each of the 17 time windows. Similar to the previous stage, the outcomes were assessed based on four evaluation metrics, and the average values of the metrics over 100 runs were calculated. These values were separately ranked across all time windows, and the time window with the highest average ranking across four metrics was selected as the optimal time window for training data.
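The window construction can be approximated with the sketch below. It is a simplification under stated assumptions: each window is redrawn independently with an equal number of observations per year, whereas the procedure in the text keeps the earlier draw and tops it up from the unused pool; the helper name `sample_window` and the synthetic data are hypothetical. When 5000 is not divisible by the number of years, the sample is slightly smaller than 5000.

```python
import numpy as np
import pandas as pd

def sample_window(in_sample, start_year, end_year=2017, size=5000, rng=None):
    """Draw an equal number of observations per year from start_year..end_year."""
    rng = np.random.default_rng() if rng is None else rng
    years = range(start_year, end_year + 1)
    per_year = size // len(years)
    parts = []
    for y in years:
        pool = in_sample[in_sample["year"] == y]
        idx = rng.choice(pool.index, size=min(per_year, len(pool)), replace=False)
        parts.append(in_sample.loc[idx])
    return pd.concat(parts)

rng = np.random.default_rng(0)

# Synthetic in-sample: 3000 observations per year, random fraud labels.
in_sample = pd.DataFrame({"year": np.repeat(np.arange(2000, 2018), 3000)})
in_sample["Fraud"] = rng.integers(0, 2, len(in_sample))

# One training sample per window: 2000-2017, 2001-2017, ..., 2016-2017.
windows = {start: sample_window(in_sample, start, rng=rng) for start in range(2000, 2017)}
```

In a full run, a model would be trained on each window and scored on the out-sample, the whole loop would be repeated 100 times, and the metric averages would be ranked across the 17 windows as described above.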

The selection of optimal time window was also implemented for segmented models for fraud types. The same process which included data sampling and data update for 17 time windows was repeated for 100 iterations to select the optimal time window for each of the nine fraud segmented models, with the only difference being the data employed. Instead of the full in-sample data, we used in-sample data by fraud type, as mentioned in Section 3.3, to carry out data sampling. Similarly, the trained models were tested and evaluated on out-sample data by fraud type. On the other hand, we did not implement selection of optimal time window in segmented models for industry type due to limited sample size, with the exception of the manufacturing industry, where the same process mentioned above was conducted to select the optimal time window using observations whose firms were from the manufacturing industry.

3.4.3. Stage 3: Forecast Prediction and Top Features

In the third stage, we carried out forecast prediction and determined the top features for the general model and each segmented model. Based on the experiments in the first two stages, we trained the general model and each segmented model for fraud type using all available observations within the optimal time window determined in Stage 2. The same was done for the segmented model for the manufacturing industry. For the other industry segmented models, all the observations in the in-sample which belonged to the industry were used for model training. The trained models were then used to carry out forecast prediction on the post-sample data. In particular, the post-sample data were separated into three categories, i.e., full data, data by fraud type, and data by industry. We used the general model to predict the occurrence of fraud on all three categories, whereas the segmented models were applied to their respective groups. For the general model and each segmented model, we extracted the top ten features associated with model predictions and grouped them into categories based on Table 3. Based on the extant literature, we implemented SHAP (Shapley additive explanations) to determine the top features for each model. This allows us to conduct global interpretations of feature importance and the features’ interactions with the target variable. Comparisons were carried out between the general model and segmented models based on two aspects: prediction outcome evaluated using four metrics, and top features associated with each fraud type and industry.
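One common way to turn per-observation SHAP values into a global ranking is to average their absolute values per feature. The sketch below assumes this aggregation, which the text does not explicitly confirm, and uses a synthetic SHAP matrix with placeholder feature names; in practice the matrix might come from a tree explainer applied to the fitted random forest.

```python
import numpy as np

feature_names = [f"feature_{i}" for i in range(6)]

# Synthetic SHAP matrix: rows are observations, columns are features.
# Each column is a scaled copy of the same pattern, so importance follows the scale.
scales = np.array([0.1, 2.0, 0.5, 1.0, 0.05, 3.0])
shap_values = np.outer(np.linspace(-1.0, 1.0, 50), scales)

def top_features(shap_values, names, k=10):
    """Rank features by mean absolute SHAP value (global importance)."""
    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(importance)[::-1][:k]
    return [(names[i], float(importance[i])) for i in order]

top3 = top_features(shap_values, feature_names, k=3)
print(top3)
```

The returned names could then be mapped to the feature categories of Table 3 to compare the general model with each segmented model.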

4. Results and Discussion

4.1. Best Machine Learning Model

In this section, we report the best machine learning model based on the model performances on the out-sample data evaluated using four metrics. All 17 candidate models were trained on the full in-sample data based on the best hyperparameter set determined using grid search. The best hyperparameter set with its respective candidate values for each model is presented in Table A4 in Appendix A. Figure 2 provides a visual representation of the performances of each machine learning model, as reported in Table A5 in Appendix A. In particular, Figure 2a illustrates the accuracy, AUROC, F1, and AUPRC of the models when tested on out-sample data. As mentioned in Section 3.4, all the models were trained for three runs to ensure their stability and robustness. The values for each run are presented in bars, while the lines represent their means across three runs. These means were separately ranked across all models for each metric, and the average rankings across four metrics are presented in Figure 2b.

Figure 2. (a) Out-sample accuracy (top left), AUROC (top right), F1 (bottom left), AUPRC (bottom right), and (b) the average ranking for each machine learning model.
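The average-ranking procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the model names and scores are made up, and tied scores are assumed to share the average rank, which the text does not specify.

```python
def rank_descending(values):
    """Return 1-based ranks (1 = highest value); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def average_ranking(scores):
    """scores: {model: {metric: mean score}} -> {model: mean rank over metrics}."""
    models = list(scores)
    metrics = list(next(iter(scores.values())))
    totals = {m: 0.0 for m in models}
    for metric in metrics:
        ranks = rank_descending([scores[m][metric] for m in models])
        for m, r in zip(models, ranks):
            totals[m] += r
    return {m: totals[m] / len(metrics) for m in models}

# Hypothetical mean scores over three runs for three placeholder models.
scores = {
    "A": {"accuracy": 0.80, "AUROC": 0.75, "F1": 0.44, "AUPRC": 0.38},
    "B": {"accuracy": 0.82, "AUROC": 0.73, "F1": 0.42, "AUPRC": 0.36},
    "C": {"accuracy": 0.78, "AUROC": 0.70, "F1": 0.40, "AUPRC": 0.30},
}
avg_rank = average_ranking(scores)
best_model = min(avg_rank, key=avg_rank.get)
print(avg_rank, best_model)
```

A lower average rank is better, so the model with the smallest value is selected.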

Based on the results, we found that random forest outperformed other classifiers without oversampling, whereas gradient boosting performed best when cost-sensitive learning and SMOTE were applied, although the SMOTE result matters little, since SMOTE was the worst of the three resampling methods. Logistic regression had a moderate performance, as it ranked in the middle among the classifiers for all three resampling methods. Support vector machine, on the other hand, performed well with cost-sensitive learning but less well with no oversampling and with SMOTE. The artificial neural networks, despite showing decent performance in some runs, were not as stable as the other models, as shown by the fluctuations across multiple runs. In terms of resampling methods, all classifiers performed best with cost-sensitive learning, with logistic regression and random forest being the exceptions. On the other hand, SMOTE did not perform well, as it ranked at the bottom among the resampling methods for all machine learning classifiers. For ensemble-based resampling approaches, RUSBoost and balanced random forest both produced decent performances, as their average rankings were among the top five across the 17 candidate models.

We observed that the random forest–non-oversampling combination had the best performance across four evaluation metrics on out-sample data with an average ranking of 2.8. This agrees with the findings of Xu et al. [10], whose results indicate that random forest outperformed five machine learning models in Chinese corporate fraud detection. It is also similar to the results reported in Chen and Zhai [30], where they found that the bagging model outperformed several boosting models in a few evaluation metrics. Henceforth, the combination of random forest and non-oversampling was employed in subsequent stages unless stated otherwise.

4.2. Optimal Time Window for Training Data

Selection of the optimal time window for training data was carried out for the general model, for all segmented models for fraud types, and for the segmented model for the manufacturing industry. Figure 3 presents the out-sample performances for the models trained using data from different time windows, as reported in Table A6 in Appendix A. Figure 3a depicts the mean accuracy, AUROC, F1, and AUPRC, with the shaded region representing the standard deviation over 100 runs for each time window, whereas Figure 3b presents the average ranking across four metrics over 100 runs for each time window, on which the selection of the optimal time window was based.

Figure 3. (a) Mean accuracy (top left), AUROC (top right), F1 (bottom left), AUPRC (bottom right) with their respective standard deviation over 100 runs on out-sample data using the general model for each time window, and (b) their average rankings over 100 runs.

According to the results, 2012 to 2017 was selected as the optimal time window for the general model, as it achieved the best average ranking of 2.2 across four metrics. We employed all data within the time window to train the model in the subsequent stage. We observed that there was a gradual improvement in model performances when older data were excluded, until reaching the maximum point at 2012, after which further removal of data led to a decline in performance. This may be due to the lack of volume of data. While this indicates that dealing with the population drift problem results in an improvement in model performance, it also suggests that there is a tradeoff between most recent data and sufficient data in order to obtain the optimal performance. Therefore, searching for the optimal time point to discard data is indeed an important step when dealing with population drift.
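The tradeoff between recency and data volume can be made concrete with the yearly in-sample counts from Table A1. The sketch below is illustrative only: it assumes that every candidate window ends in 2017 and that the start year moves forward one year at a time from 2000 to 2016, which yields 17 windows; the exact window construction is an assumption, and the helper names are hypothetical.

```python
# (observations, violations) per in-sample year, copied from Table A1.
IN_SAMPLE = {
    2000: (605, 54), 2001: (739, 95), 2002: (785, 82), 2003: (824, 68),
    2004: (889, 78), 2005: (954, 64), 2006: (934, 71), 2007: (1002, 84),
    2008: (1105, 109), 2009: (1118, 130), 2010: (1198, 133), 2011: (1503, 206),
    2012: (1671, 231), 2013: (1750, 229), 2014: (1780, 211), 2015: (1856, 259),
    2016: (1951, 313), 2017: (2201, 391),
}

def candidate_windows(first_year, last_year):
    """Assumed scheme: windows share the end year and drop one old year at a time."""
    return [(start, last_year) for start in range(first_year, last_year)]

def window_size(window):
    """Total observations and violations available inside a (start, end) window."""
    start, end = window
    obs = sum(IN_SAMPLE[year][0] for year in range(start, end + 1))
    violations = sum(IN_SAMPLE[year][1] for year in range(start, end + 1))
    return obs, violations

windows = candidate_windows(2000, 2017)
for window in windows:
    obs, violations = window_size(window)
    print(window, obs, violations, f"{violations / obs:.2%}")
```

Under these assumptions, the 2012 to 2017 window retains 11,209 of the 22,865 in-sample observations, roughly half of the available data, while shorter windows shrink the training set further.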

The optimal time windows for each fraud segmented model and for the manufacturing industry model are tabulated in Table 4. Based on the results, we found that five segmented models for fraud type achieved the best performance when the optimal time window of training data was either two or three years, while the remaining four fraud types performed best when prior data up to six to eight years were included in training. An interesting point to observe is that only two of the optimal time windows for the fraud segmented models matched the optimal time window for the general model. Although this is slightly counterintuitive, it is sensible, since the general model is not a direct combination of segmented models for fraud, as multiple fraud occurrences within the same year are taken as a single occurrence in the former.

Table 4. Time window with best average ranking across four metrics over 100 runs for each model.

On the other hand, the optimal time window for the manufacturing industry model agrees with the general model. Since the manufacturing industry accounts for more than half of the observations in the general model, it is natural that the fraud patterns in the industry dominate the general model. These results indicate that the optimal time point to discard data may vary for different tasks and may even differ when predicting over a different time period. This suggests that corporate fraud models should be updated from time to time, and segmented models can be employed for the investigation of specific cases. Despite being different, none of the optimal time windows includes data prior to 2010. This underscores the need to handle the population drift problem and suggests that when the sample size is large enough, we should attempt to carry out drift analysis to ensure that irrelevant observations are excluded in model building for segmented models.

4.3. Forecast Prediction Results

In this section, we report the post-sample prediction results for general model and segmented models. We employed the random forest classifier to train the models based on all in-sample data in the optimal time window, as reported in Table 4. These models were trained without oversampling, with the exception of the fraud segmented models for fictitious asset (P2502), unauthorized change in capital usage (P2509), and illegal stock trading (P2512), in which cost-sensitive learning was applied with the ‘class_weight’ parameter set as ‘balanced’, since the class distribution is extremely unbalanced (less than 1% positive events). For industry segmented models other than the manufacturing industry, all available data in in-sample were utilized to build the models. In terms of training time, all models recorded a training time of less than a minute, with the general model recording a training time of 38 s. Apart from training size, we observed that the fraud rate also affected the training time of the model.
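To give a sense of what the ‘balanced’ setting implies for an extremely unbalanced segment, the sketch below reproduces the weighting heuristic documented for scikit-learn, n_samples / (n_classes × class count). That the study used scikit-learn is an assumption based on the parameter name. For illustration, the counts are the full in-sample figures for fictitious asset (P2502) from Table A2, whereas the actual model was trained only on the optimal time window; the helper name is hypothetical.

```python
def balanced_class_weights(n_negative, n_positive):
    """Weights following the heuristic n_samples / (n_classes * class_count)."""
    n_samples = n_negative + n_positive
    n_classes = 2
    return {
        0: n_samples / (n_classes * n_negative),
        1: n_samples / (n_classes * n_positive),
    }

# Full in-sample counts for fictitious asset (P2502): 77 frauds in 22,865 observations.
n_total, n_fraud = 22865, 77
weights = balanced_class_weights(n_total - n_fraud, n_fraud)
print(weights, weights[1] / weights[0])
```

Under these counts, each fraud observation is weighted roughly 296 times as heavily as a non-fraud observation, which is the inverse of the class ratio.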

The general model reported an accuracy of 0.8127, an AUROC of 0.7452, an F1-score of 0.4433, and an AUPRC of 0.3864 on the post-sample data, which are comparable to the performances reported in Xu et al. [10] and Lu et al. [11] on a different but overlapping test period. Figure 4 illustrates the post-sample prediction results for each fraud type of the general model and its respective segmented model, which were evaluated using four metrics. The corresponding plots for the industry segmented models are presented in Figure 5. The values for these plots are given in Table A7 in Appendix A. We have also included the prediction results of the general model on the entire post-sample data in the plots for industry segmented models. This can be seen as the average across the performance of the general model on the post-sample data for each industry, since the post-sample data by industry were obtained by splitting the entire post-sample data according to industry, and the fraud rate of the entire post-sample data was calculated as the average of the fraud rates of each post-sample data by industry. On the other hand, we could not directly compare the predictive performance of the general model on post-sample data by fraud type with its predictive performance on the entire post-sample data. This is because the post-sample data by fraud type are not the direct segmentation of the entire post-sample data, and the significant differences in fraud rates between the data have resulted in deviations in the F1-scores and AUPRC scores. Therefore, comparisons can only be made on a within-fraud-segment basis between the predictive performance of the general model (trained using entire in-sample data) and each segmented model (trained using in-sample data by fraud type) on the post-sample data of the particular fraud type.
The models with better performance among the four metrics between the general model and segmented models for each fraud type and each industry are tabulated in Table 5 and Table 6, respectively.

Figure 4. Post-sample prediction for each fraud type of the general model and its respective segmented model evaluated using accuracy (top left), AUROC (top right), F1 (bottom left), and AUPRC (bottom right).

Figure 5. Post-sample prediction for each industry of the general model and its respective segmented model evaluated using accuracy (top left), AUROC (top right), F1 (bottom left), and AUPRC (bottom right). The leftmost bar for the general model in each metric represents the prediction results of the general model on the entire post-sample data.

Table 5. Best model between general model (GM) and segmented model (SM) for each fraud type and its respective fraud rate on post-sample data. The model with better performance among the four evaluation metrics is considered as the best model.

Table 6. Best model between general model (GM) and segmented model (SM) for each industry and its respective fraud rate on post-sample data. The model with better performance among the four evaluation metrics is considered as the best model.

We found that five out of nine segmented models for fraud type achieved a better performance than their respective prediction results using the general model. They are fictitious profit, fictitious asset, unauthorized change in capital usage, occupancy of company’s asset, and illegal stock trading. We further observed that these fraud types all had low fraud rates, where four of them had the lowest post-sample fraud rate among the nine fraud types. The low F1-scores and AUPRC scores for these fraud types are likely due to the extremely unbalanced class distribution, since F1 and AUPRC both integrate precision and recall, and these metrics can be greatly affected by class imbalance. However, this is not an unusual observation, as Xu et al. [10] also reported a low F1-score in a segmented model for more serious frauds, which included two of the aforementioned fraud types (fictitious profit and fictitious asset). Nevertheless, the segmented models demonstrated an improvement compared to the general model. This is sensible, because when all frauds are included in a model (as in the case of the general model), it will lean toward the more prevalent fraud type. The results suggest that segmented models should be utilized to carry out predictions for frauds which are important but less common. On the other hand, there were four fraud types whose segmented models’ predictive performances did not outperform that of the general model. This indicates that their fraud patterns are better captured by the general model and suggests that the information from other frauds may be transferable to these frauds.
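The sensitivity of the F1-score to class imbalance noted above can be illustrated numerically. In the sketch below, the true positive rate and false positive rate are hypothetical and held fixed; only the fraud rate changes, using the post-sample rates of 17.40% for all fraud (Table A1) and 0.49% for fictitious asset (Table A2). The helper name is also hypothetical.

```python
def f1_at_prevalence(tpr, fpr, prevalence):
    """F1-score implied by fixed true/false positive rates at a given fraud rate."""
    true_pos = tpr * prevalence            # share of all cases that are true positives
    false_pos = fpr * (1.0 - prevalence)   # share of all cases that are false positives
    precision = true_pos / (true_pos + false_pos)
    recall = tpr
    return 2 * precision * recall / (precision + recall)

TPR, FPR = 0.7, 0.2  # hypothetical classifier operating point

f1_all_fraud = f1_at_prevalence(TPR, FPR, 0.1740)         # post-sample, all fraud
f1_fictitious_asset = f1_at_prevalence(TPR, FPR, 0.0049)  # post-sample, P2502
print(round(f1_all_fraud, 3), round(f1_fictitious_asset, 3))
```

With an identical classifier operating point, precision collapses at the lower fraud rate because false positives swamp the few true positives, and the F1-score falls with it. This is why F1 and AUPRC values are only comparable between models evaluated on the same segment.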

In terms of the segmented models for industry, only 2 out of 12 segmented models (agriculture and real estate) achieved a better performance than their respective prediction results using the general model. The results indicate that the performance outcomes of the segmented models are not as good as the general model for most industries. We speculate that this is because the training size is small in each segment and hence insufficient to build an informative segmented model. To verify our hypothesis, we plotted a learning curve of model performance (AUROC) against training set size for each of the ten industries whose segmented models were outperformed by the general model. The results show that when we control the training set size, the segmented models have a comparable if not better predictive performance, i.e., showing higher learning curves, than their respective performance outcomes of the general model for nine of the ten industries, with the wholesale and retail industry being the only exception. This supports our conjecture and suggests that the superior predictive performance of the general model compared to the segmented models for industry is due to the information gained as a result of transfer learning from the additional training samples in other industries. Nevertheless, when the sample size is large enough, we should attempt to build segmented models for industries, as they are more informative than the general model in most cases. As an illustration, we present two of the learning curve plots in Figure 6, namely, the manufacturing industry and power, gas, and water industry. The former has the largest sample size, whereas the latter demonstrates the largest difference in the AUROC between the general and segmented models when the training size is controlled.
The post-sample AUROC for the general model and segmented model for industry when the training size is fixed to be the sample size of each respective industry is presented in Table A8 in Appendix A.

Figure 6. Plot of post-sample AUROC against training size for manufacturing industry (left) and power, gas, and water industry (right).

Since the general model can be considered as an average across all industries, we further compared the performance of the general model on each industry. If the data from an industry have a better performance than most of the segmented models for other industries and the overall model, this indicates that fraud occurrences in the particular industry can be more easily predicted than other industries. Hence, we observed that fraud occurrences in the power, gas, and water; construction; leasing and commercial; and culture and sports industries were easier to predict, whereas fraud occurrences in the agriculture, information technology, and real estate industries were harder to predict compared to other industries. An interesting point to observe is that the industries which are harder to predict are exactly the industries for which the general model’s performance did not outperform their respective segmented models’ performances. The observation suggests that the information from other industries is not transferable to these industries, and hence, the use of segmented models is indispensable.

4.4. Top Features for Each Model

In this section, we report the top ten features for the general model and each segmented model obtained using SHAP 0.46.0. Table 7 presents the category distribution of these features based on categories described in Table 3. We further computed the sum of the absolute difference between the number of features in the general model and each segmented model over all categories. Although the features may be different despite being in the same category, we argue that features of the same category exhibit similar aspects of a company, and hence, the absolute difference can be seen as a straightforward dissimilarity measure for the top features. We observed that for fraud segmented models, the segmented model for fictitious asset (P2502) had the greatest dissimilarities when compared to the general model, in which its top features were dominated by corporate governance information. The top features for unauthorized change in capital usage (P2509), on the other hand, were mainly solvency features. The segmented models for fictitious profit (P2501), material omission (P2505), and illegal guarantee (P2514) had the same category distribution as the general model, whereas the top features for the remaining fraud types were distributed across all four categories. For industry segmented models, the segmented model for power, gas, and water industry (Industry D) had the greatest dissimilarities with the general model among the industry segmented models, whereas the top features for agriculture industry (Industry A) were mainly solvency features. An interesting point to observe is that none of the segmented models for industry had the same category distribution as the general model, with construction industry (Industry E) and culture and sports industry (Industry R) being the only pair of industries that shared the same category distribution.
This indicates that each industry has its unique top features and further manifests the need to build segmented models for a better comprehension of these industries.
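The dissimilarity measure described above, i.e., the sum over categories of the absolute difference in feature counts, can be sketched as follows. The general-model counts are read from the discussion in this section (three risk level, five solvency, two corporate governance, and no profitability features); the segmented-model counts are hypothetical, chosen only to be solvency-heavy, and the function name is an assumption.

```python
CATEGORIES = ["Risk level", "Solvency", "Profitability", "Corporate governance"]

def category_dissimilarity(general, segmented):
    """Sum of absolute differences in top-feature counts over all categories."""
    return sum(abs(general.get(c, 0) - segmented.get(c, 0)) for c in CATEGORIES)

# Category counts of the general model's top ten features, as discussed in this section.
general = {"Risk level": 3, "Solvency": 5, "Profitability": 0, "Corporate governance": 2}

# Hypothetical counts for a solvency-dominated segmented model.
segmented = {"Risk level": 1, "Solvency": 8, "Profitability": 0, "Corporate governance": 1}

dissimilarity = category_dissimilarity(general, segmented)
print(dissimilarity)
```

Because both models contribute ten features, the measure is always an even number between 0 (identical distributions) and 20 (no shared categories).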

Table 7. Category distribution of top ten features via SHAP for each model. The description for each feature category is presented in Table 3.

We took a step further to investigate the top features for the general model and segmented models for fraud types and industries which had better predictive performances than the general model. Figure 7 presents the SHAP summary plots of the top ten features for the general model and seven segmented models, as reported in Table A9 in Appendix A. To further enhance model interpretability, we plotted the partial dependence plots for the features in the general model which also appeared as top features in segmented models, as shown in Figure 8. We observed that risk level features played an important role in Chinese corporate fraud detection with three of the top four features in the general model from this category. In particular, the retained earnings to total assets (RE/TA) ratio emerged as the top feature, showing a negative association with the occurrence of fraud. We observed that its risk effect flattened at extreme values of the RE/TA ratio, and this indicates that outliers do not have a distorting effect in the model. This was followed by financial leverage, ranked second, and total leverage, ranked fourth, both showing a positive association with fraud, with risk effects flattening at extreme high values. This finding agrees with Xu et al. [10], where the leverage ratio had the highest feature importance after the exposure variables, which were not included in our model. A similar finding was also reported in Duan et al. [9], where the leverage ratio ranked fourth based on feature importance.

Figure 7. Top 10 features in SHAP for general model and selected segmented models. (a) General model. (b) Model for fraud P2501. (c) Model for fraud P2502. (d) Model for fraud P2509. (e) Model for fraud P2510. (f) Model for fraud P2512. (g) Model for Industry A (agriculture). (h) Model for Industry K (real estate).

Figure 8. Partial dependence plots for top features in general model which appear as top features in segmented models.

Solvency features also demonstrated significant impacts on corporate fraud prediction, with five of the top ten features coming from this category. These are the earnings before interest, taxes, depreciation, and amortization (EBITDA) to total liabilities ratio; the operating cash flow ratio (OCFR); net cash flow from operating activities (NCFA); the interest coverage ratio (ICR); and the tangible assets ratio, all of which yielded a negative association with the target variable fraud. In particular, both OCFR and NCFA produced a similar relationship with fraud as the RE/TA ratio. This agrees with findings in the existing literature, as several studies have reported net cash flow from operating activities as one of the top ten features with greatest importance [11,30,34]. The ICR, on the other hand, showed a drastic change in risk effect on low values, and the risk effect flattened out as its value increased. In terms of corporate governance information, we found that executive’s salary and management’s salary produced a negative association with the likelihood of fraud, where the former showed an exponentially decreasing relationship with fraud. Finally, none of the profitability features appeared as top features in the general model.

Next, we compared the top features between the general model and segmented models by fraud type. There were five segmented models whose predictive performances were better than their respective performance results of the general model. We started with the models whose category distribution differed most from the general model. The category distribution of the top features for fictitious asset (P2502) showed a huge discrepancy with the general model, where none of the top features overlapped with the general model. We observed that corporate governance information played a vital role in the detection of fictitious asset, and on the contrary, none of the top features belonged to the risk level category. The top features for unauthorized change in capital usage (P2509) also showed a great disparity with the general model, with solvency features dominating the top features. There were only two overlapping features when compared to the general model, namely, net cash flow from operating activities and operating cash flow ratio. These features were negatively associated with the target variable, as in the case of the general model. However, they had a larger effect on fraud compared to the general model, as suggested by the feature importance.

We observed that profitability features had a significant impact in the detection of illegal stock trading (P2512), and the role of solvency features was less important compared to the general model. Three of the top ten features overlapped with the top features in the general model, including the RE/TA ratio, which appeared as the top feature in both lists. They all showed the same direction of association with the target variable as the general model but had a much higher feature importance compared to the general model. The category distribution of the top features for occupancy of company’s asset (P2510) was similar to the general model, with five overlapping features between the models and RE/TA ratio as the top feature in both models. All the overlapping features had the same direction of association with the target variable as the general model. Similar to other segmented models, these features also had a larger impact on fraud than the general model. Finally, nine of the top ten features for fictitious profit (P2501) were different from the general model, despite having the same feature distribution as the general model. The only overlapping feature between the models was the RE/TA ratio, which emerged as the top feature, was negatively associated with the target variable, and had a larger effect compared to the general model.

As for the industry segmented models, only two segmented models outperformed the general model in forecast prediction. The top features for the agriculture industry (Industry A) showed some disparity with the general model, with eight of the top ten features coming from the solvency category. At the feature level, none of these features overlapped with the general model. This observation helps explain why the segmented model outperformed the general model despite having a small sample size. On the other hand, the top five features for the real estate industry (Industry K) appeared in the top features for the general model. Nevertheless, we observed non-linear relationships between financial leverage, RE/TA ratio, and total leverage and the target variable, which differed from the relationships portrayed in the general model. The discrepancies in the direction of association caused the general model to perform poorly on this industry and be outperformed by the segmented model.

The RE/TA ratio emerged as the top feature in the general model and multiple segmented models, indicating its prominent effect in Chinese corporate fraud detection. In fact, it is one of the four financial indicators used to derive the Altman’s $Z_{China}$ score, which identifies potential distress in Chinese firms [47]. A higher RE/TA ratio indicates that a company’s operations are mainly funded by its internal resources rather than external debt or capital. It reflects the financial strategy and growth ability of a firm and is positively correlated with a company’s financial health. In general, the existing literature reports a negative relationship between the RE/TA ratio and the occurrence of fraud [48,49], suggesting that a financially unhealthy firm is more likely to be involved in corporate fraud.

5. Conclusions

We conducted Chinese corporate fraud prediction by incorporating the selection of an optimal time window to address the population drift problem and segmented models for fraud types and industries. This was achieved via a three-stage experimental design that served different purposes at each stage. The first stage involved the selection of the best machine learning model from 17 candidate models. Random forest without oversampling emerged as the model with the best predictive performance across four evaluation metrics. We then carried out the selection of an optimal time window for the training data for the general model and segmented models for fraud types. We found that when testing on data in 2016 to 2017, the optimal time window for all models excluded training data prior to 2010. The results indicate that population drift exists in fraud detection, and addressing it leads to an improvement in predictive performance.

Using the best model and optimal time window found, we built a general model and segmented models for nine fraud types and 12 industries. Our findings indicate that five segmented models for fraud type achieved better performance than their respective predictions using the general model, four of which had low fraud rates. In terms of industry segmented models, even though only two segmented models outperformed the respective performances of the general model, our investigation reveals that this is due to insufficient training set size for most of the models. In particular, we found that 11 segmented models for industries had comparable or even better performance in terms of the AUROC when we controlled the training size. Also, we have identified dissimilarities between the top features of the general model and segmented models for fraud types and industries whose predictive performance results were better in segmented models. These findings suggest that segmented models should be employed to investigate fraud occurrence for important fraud types and industries given sufficient training samples.

In summary, the main findings of this study are the following:

Random forest classifier without resampling emerged as the best machine learning model for Chinese corporate fraud prediction.
Population drift exists in corporate fraud prediction, and addressing it using different time windows for training data selection led to an improved predictive performance.
The optimal time windows for the general model and segmented models for fraud type suggest the use of historical data within three to eight years for model building to diminish the effect of population drift.
Segmented models have a better predictive performance than the general model for fraud types with low fraud rates and are as good as the general model for most industries when controlling for training set size.
Risk level and solvency features emerged as the top features associated with corporate fraud for the general model and segmented models for most fraud types and industries.

Our study provides practical information for different parties and contributes to the existing literature on Chinese corporate fraud detection. Firstly, we shed new light on the impact of population drift in corporate fraud detection, which is unaddressed in the current literature despite being applied in different credit scoring scenarios. It enhances the understanding of academic researchers and practitioners regarding the importance of addressing the population drift problem. Secondly, we provide new insights on the use of segmented models in corporate fraud prediction by improving the understanding of differences between the various fraud types and industries in terms of the risk features associated with each segment and how relatively easier or more difficult it is to predict fraud within each segment. These results offer a new perspective on corporate fraud prediction for regulators and policymakers. Thirdly, we propose a comprehensive framework to predict corporate fraud in China. Apart from Chinese firms, the general framework and methodologies can also be implemented on firms in other regions like the US and EU, since class imbalance and population drift are common problems in corporate fraud prediction internationally. The only difference is that there may be other perspectives which are uniquely available in these markets for segmentation that would need to be taken into account in the analysis. In general, our study provides a practical framework for future research in the domain.

This study has several limitations. Firstly, due to a limited sample size, the segmented models for most industries are not as informative as the general model. Future research could focus on enhancing the predictive performance given a limited sample size by developing a better framework for information extraction or including additional information such as textual data in the model. Techniques in transfer learning can also be utilized to extract transferable information from the general model. Secondly, the unbalanced class distribution resulted in low F1-scores and AUPRC scores due to a high number of false positives. In order to boost the F1-scores and AUPRC scores, improving the precision and recall rate, especially in the case of extremely unbalanced class distribution, is also an important task for future research. This could possibly be achieved by considering alternative machine learning algorithms or resampling techniques. Furthermore, the main focus of this study is on Chinese corporate fraud prediction, and we did not consider data from other regions in our study. Since the general framework is applicable to firms in other regions, future work could consider applying this experimental design framework to historical fraud data in other jurisdictions.

Author Contributions

Conceptualization: C.C.G. and A.B.; Methodology: C.C.G., Y.Y. and A.B.; Formal analysis: C.C.G.; Data curation: C.C.G.; Writing—original draft: C.C.G.; Writing—review and editing: Y.Y., A.B. and X.H.; Supervision: A.B. and X.H.; Funding acquisition: A.B. and X.H. All authors have read and agreed to the published version of the manuscript.

Funding

We are grateful to Ningbo Municipal Government for funding of this project as part of grant number 2021B-008-C. We are also grateful to Ningbo Science and Technology Bureau for Key Plan Program for funding the project as a part of grant numbers 2022Z243 and 2022Z173.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data were obtained from the China Stock Market and Accounting Research Database (CSMAR) and are publicly available at https://data.csmar.com/, accessed on 7 March 2023.

Acknowledgments

We would like to thank the anonymous reviewers for providing suggestions on the methodologies used in the experiments.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Additional Tables

Table A1. Number of observations and violation rate by year in in-sample, out-sample, and post-sample.

Year	No. of Observations	Violation Count	Violation Rate
In-sample
2000	605	54	8.93%
2001	739	95	12.86%
2002	785	82	10.45%
2003	824	68	8.25%
2004	889	78	8.77%
2005	954	64	6.71%
2006	934	71	7.60%
2007	1002	84	8.38%
2008	1105	109	9.86%
2009	1118	130	11.63%
2010	1198	133	11.10%
2011	1503	206	13.71%
2012	1671	231	13.82%
2013	1750	229	13.09%
2014	1780	211	11.85%
2015	1856	259	13.95%
2016	1951	313	16.04%
2017	2201	391	17.76%
Total	22,865	2808	12.28%
Out-sample
2016	647	102	15.77%
2017	727	121	16.64%
Total	1374	223	16.23%
Post-sample
2018	3328	613	18.42%
2019	3372	553	16.40%
Total	6700	1166	17.40%
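
As a quick arithmetic check on the Total rows of Table A1, the violation rate is simply the violation count divided by the number of observations. The sketch below recomputes the three totals; the dictionary labels and variable names are illustrative only and do not come from the study.

```python
# Counts copied from the Total rows of Table A1: (observations, violations).
samples = {
    "in-sample (2000-2017)": (22865, 2808),
    "out-sample (2016-2017)": (1374, 223),
    "post-sample (2018-2019)": (6700, 1166),
}

# Violation rate in percent, rounded to two decimals as in the table.
rates = {
    name: round(100 * violations / obs, 2)
    for name, (obs, violations) in samples.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f}%")
```

The rates increase from the in-sample period to the post-sample period, which is in line with the population drift discussed in the paper.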

Table A2. Fraud count and fraud rate for each fraud type.

Fraud Type	In-Sample Count	In-Sample Rate	Out-Sample Count	Out-Sample Rate	Post-Sample Count	Post-Sample Rate
P2501 Fictitious profit	345	1.51%	40	2.91%	130	1.94%
P2502 Fictitious asset	77	0.34%	9	0.66%	33	0.49%
P2503 False record	1602	7.01%	129	9.39%	783	11.69%
P2505 Material omission	1812	7.92%	145	10.55%	698	10.42%
P2507 Fraudulent listing	13	0.06%	0	0.00%	0	0.00%
P2509 Unauthorized change in capital usage	149	0.65%	28	2.04%	77	1.15%
P2510 Occupancy of company’s asset	409	1.79%	48	3.49%	319	4.76%
P2511 Insider trading	3	0.01%	0	0.00%	0	0.00%
P2512 Illegal stock trading	82	0.36%	2	0.15%	46	0.69%
P2513 Stock price manipulation	2	0.01%	0	0.00%	0	0.00%
P2514 Illegal guarantee	328	1.43%	46	3.35%	233	3.48%
P2515 Mishandling of general account	666	2.91%	44	3.20%	216	3.22%

Table A3. Number of observations and fraud rate for each industry.

Industry	In-Sample Obs. (%)	In-Sample Fraud (%)	Post-Sample Obs. (%)	Post-Sample Fraud (%)
A—Agriculture	356 (1.56%)	88 (24.72%)	77 (1.15%)	13 (16.88%)
B—Mining	710 (3.11%)	91 (12.82%)	141 (2.10%)	26 (18.44%)
C—Manufacturing	12,960 (56.68%)	1563 (12.06%)	4295 (64.09%)	731 (17.02%)
D—Power, gas, and water	1054 (4.61%)	111 (10.53%)	217 (3.24%)	26 (11.98%)
E—Construction	607 (2.65%)	100 (16.47%)	187 (2.79%)	31 (16.58%)
F—Wholesale and retail	1451 (6.35%)	173 (11.92%)	319 (4.76%)	63 (19.75%)
G—Transport and storage	839 (3.67%)	62 (7.39%)	197 (2.94%)	19 (9.64%)
H—Accommodation and food	94 (0.41%)	6 (6.38%)	13 (0.19%)	1 (7.69%)
I—Information technology	1584 (6.93%)	247 (15.59%)	534 (7.97%)	118 (21.97%)
K—Real estate	1477 (6.46%)	147 (9.95%)	229 (3.42%)	35 (15.28%)
L—Leasing and commercial	473 (2.07%)	60 (12.68%)	117 (1.75%)	35 (29.91%)
M—Research and technology	125 (0.55%)	6 (4.80%)	83 (1.24%)	11 (13.25%)
N—Environment	335 (1.47%)	34 (10.15%)	110 (1.64%)	17 (15.45%)
P—Education	86 (0.38%)	12 (13.95%)	19 (0.28%)	6 (31.58%)
Q—Healthcare	105 (0.46%)	23 (21.90%)	24 (0.36%)	5 (20.83%)
R—Culture and sports	420 (1.84%)	56 (13.33%)	111 (1.66%)	23 (20.72%)
S—Others	189 (0.83%)	29 (15.34%)	27 (0.40%)	6 (22.22%)
Total	22,865 (100%)	2808 (12.28%)	6700 (100%)	1166 (17.40%)

Table A4. Best hyperparameter set and its respective candidate values for each machine learning model.

Model	Hyperparameter	Candidate Values	Best Value
LR-NO	C	0.0001, 0.001, 0.01, 0.1, 1	0.01
LR-CSL	C	0.0001, 0.001, 0.01, 0.1, 1	0.01
	class_weight	1:1, 1:2, 1:3, 1:4, balanced	balanced
LR-SMOTE	C	0.0001, 0.001, 0.01, 0.1, 1, 2, 3, 5, 8, 10	5
SVM-NO	kernel	linear, polynomial, rbf	rbf
	C	0.5, 1, 1.5, 2, 3, 5	2
	gamma	1/1000, 1/500, 1/200, 1/100, 1/50, 1/20, 1/10	1/50
SVM-CSL	kernel	linear, polynomial, rbf	rbf
	C	0.5, 1, 1.5, 2, 3, 4, 5	3
	gamma	1/5000, 1/4000, 1/3000, 1/2000, 1/1000, 1/500	1/3000
	class_weight	1:1, 1:2, 1:3, 1:4, balanced	balanced
SVM-SMOTE	kernel	linear, polynomial, rbf	rbf
	C	1, 2, 3, 4, 5, 6, 8, 10, 12	8
	gamma	1/1000, 1/500, 1/200, 1/100, 1/50, 1/20, 1/10	1/20
RF-NO	n_estimators	500, 1000, 2000	1000
	max_depth	10, 15, 20, 25, 30, 40	25
	max_features	10, 15, 20	15
RF-CSL	n_estimators	500, 1000, 2000	1000
	max_depth	10, 15, 20, 25, 30	20
	max_features	10, 15, 20	15
	class_weight	1:1, 1:2, 1:3, 1:4, balanced	balanced
RF-SMOTE	n_estimators	500, 1000, 2000, 3000	2000
	max_depth	10, 15, 20, 25, 30	20
	max_features	10, 20, 30, 40, 50, 60, 80, 100	80
GB-NO	gamma	0, 5, 10	5
	max_depth	3, 5, 8, 10, 15	5
	subsample	0.5, 0.6, 0.7, 0.8, 1	0.6
	learning rate	0.01, 0.05, 0.1	0.05
GB-CSL	gamma	0, 5, 10, 15	10
	max_depth	3, 5, 8, 10, 15	5
	subsample	0.5, 0.6, 0.7, 0.8, 1	0.5
	learning rate	0.01, 0.05, 0.1	0.05
	class_weight	1:1, 1:2, 1:3, 1:4, balanced	1:2
GB-SMOTE	gamma	0, 5, 10	0
	max_depth	5, 10, 15, 20, 25, 30, 40	30
	subsample	0.5, 0.6, 0.8, 1	0.6
	learning rate	0.01, 0.05, 0.1, 0.2	0.1
ANN-NO	batch size	64, 128, 256, 512	256
	no. of epochs	10, 20, 30, 50, 100	10
	no. of neurons	10, 20, 30, 50, 100	20
ANN-CSL	batch size	64, 128, 256, 512	256
	no. of epochs	10, 20, 30, 50, 100	10
	no. of neurons	10, 20, 30, 50, 100	10
	class_weight	1:1, 1:2, 1:3, 1:4, balanced	balanced
ANN-SMOTE	batch size	32, 64, 128, 256, 512	64
	no. of epochs	20, 50, 100, 150, 200	150
	no. of neurons	20, 50, 100, 200, 300, 400, 500	400
RUSBoost	n_estimators	20, 50, 100, 200, 500	50
	learning rate	0.5, 0.6, 0.7, 0.8, 1	0.5
BRF	n_estimators	500, 1000, 2000, 3000, 4000, 5000	3000
	max_depth	5, 10, 15, 20	10
	max_features	10, 15, 20, 25, 30	20
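
To give a sense of the size of the search space in Table A4, the following minimal sketch enumerates the candidate grid for RF-CSL using only the standard library. The dictionary keys mirror the hyperparameter names in the table; keeping the class weights as plain strings and the names `rf_csl_grid`, `configs`, and `best` are illustrative assumptions. The table does not state which search routine was used, so an exhaustive search over all combinations is assumed here.

```python
from itertools import product

# Candidate values for RF-CSL copied from Table A4.
# Class weights are kept as plain strings (an illustrative choice).
rf_csl_grid = {
    "n_estimators": [500, 1000, 2000],
    "max_depth": [10, 15, 20, 25, 30],
    "max_features": [10, 15, 20],
    "class_weight": ["1:1", "1:2", "1:3", "1:4", "balanced"],
}

# Expand the grid into one dict per candidate configuration.
names = list(rf_csl_grid)
configs = [dict(zip(names, values)) for values in product(*rf_csl_grid.values())]

# Best values reported in Table A4 for RF-CSL.
best = {
    "n_estimators": 1000,
    "max_depth": 20,
    "max_features": 15,
    "class_weight": "balanced",
}

n_configs = len(configs)        # 3 * 5 * 3 * 5 combinations
best_in_grid = best in configs  # the reported optimum is one grid point
print(n_configs, best_in_grid)
```

Under the exhaustive-search assumption, each of these configurations would be fitted and scored on validation data, which is why the grids for the slower models are kept small.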

Table A5. Out-sample accuracy, AUROC, adjusted F1, and AUPRC for each machine learning model. Numbers 1st, 2nd, and 3rd refer to the results of three separate runs, and R indicates the rank of the mean value across 17 models.

Model	Accuracy 1st	Accuracy 2nd	Accuracy 3rd	Accuracy Mean	Accuracy R	AUROC 1st	AUROC 2nd	AUROC 3rd	AUROC Mean	AUROC R	F1 1st	F1 2nd	F1 3rd	F1 Mean	F1 R	AUPRC 1st	AUPRC 2nd	AUPRC 3rd	AUPRC Mean	AUPRC R	Avg. Rank
Non-oversampling
LR-NO	0.8771	0.8771	0.8771	0.8771	4	0.6517	0.6517	0.6517	0.6517	9	0.2754	0.2754	0.2754	0.2754	9	0.2050	0.2050	0.2050	0.2050	8	7.5
SVC-NO	0.7816	0.7816	0.7816	0.7816	17	0.5803	0.5802	0.5802	0.5802	15	0.2322	0.2322	0.2322	0.2322	12	0.1666	0.1666	0.1666	0.1666	15	14.8
RF-NO	0.8772	0.8770	0.8771	0.8771	4	0.6644	0.6641	0.6641	0.6642	3	0.2843	0.2841	0.2835	0.2840	2	0.2143	0.2148	0.2144	0.2145	2	2.8
GB-NO	0.8545	0.8513	0.8551	0.8536	14	0.6598	0.6671	0.6604	0.6625	4	0.2795	0.2776	0.2826	0.2799	5	0.2089	0.2117	0.2130	0.2112	4	6.8
NN-NO	0.8721	0.8718	0.8743	0.8727	13	0.6322	0.6625	0.6540	0.6495	10	0.2639	0.2843	0.2756	0.2746	10	0.1901	0.2075	0.2014	0.1997	11	11.0
Cost-sensitive learning
LR-CSL	0.8754	0.8754	0.8754	0.8754	10	0.6538	0.6538	0.6538	0.6538	8	0.2782	0.2782	0.2782	0.2782	7	0.2044	0.2044	0.2044	0.2044	9	8.5
SVC-CSL	0.8775	0.8775	0.8775	0.8775	1	0.6538	0.6538	0.6538	0.6538	7	0.2782	0.2782	0.2782	0.2782	8	0.2058	0.2058	0.2058	0.2058	7	5.8
RF-CSL	0.8759	0.8747	0.8730	0.8745	11	0.6652	0.6658	0.6642	0.6651	1	0.2861	0.2885	0.2869	0.2872	1	0.2124	0.2121	0.2104	0.2116	3	4.0
GB-CSL	0.8767	0.8763	0.8766	0.8765	8	0.6601	0.6657	0.6671	0.6643	2	0.2763	0.2808	0.2890	0.2820	3	0.2127	0.2133	0.2189	0.2150	1	3.5
NN-CSL	0.8768	0.8770	0.8772	0.8770	6	0.6442	0.6541	0.6502	0.6495	11	0.2701	0.2739	0.2656	0.2699	11	0.1964	0.2085	0.1999	0.2016	10	9.5
SMOTE
LR-SMOTE	0.8750	0.8736	0.8714	0.8733	12	0.6315	0.6327	0.6318	0.6320	13	0.0445	0.0591	0.0636	0.0557	15	0.1902	0.1909	0.1900	0.1904	12	13.0
SVC-SMOTE	0.8772	0.8772	0.8772	0.8772	2	0.5773	0.5764	0.5763	0.5767	16	0.0214	0.0238	0.0256	0.0236	17	0.1564	0.1556	0.1553	0.1558	17	13.0
RF-SMOTE	0.8351	0.8418	0.8370	0.8380	16	0.6191	0.6216	0.6230	0.6212	14	0.1466	0.1433	0.1527	0.1475	14	0.1747	0.1767	0.1779	0.1765	14	14.5
GB-SMOTE	0.8762	0.8759	0.8754	0.8758	9	0.6439	0.6497	0.6381	0.6439	12	0.0368	0.0326	0.0367	0.0354	16	0.1932	0.1906	0.1871	0.1903	13	12.5
NN-SMOTE	0.8473	0.8428	0.8403	0.8435	15	0.5510	0.5762	0.5763	0.5678	17	0.1432	0.1596	0.1588	0.1539	13	0.1517	0.1625	0.1667	0.1603	16	15.3
Ensemble-based resampling approaches
RUSBoost	0.8772	0.8772	0.8772	0.8772	2	0.6487	0.6581	0.6586	0.6551	6	0.2758	0.2778	0.2825	0.2787	6	0.2018	0.2120	0.2060	0.2066	6	5.0
BRF	0.8764	0.8766	0.8767	0.8766	7	0.6578	0.6577	0.6585	0.6580	5	0.2828	0.2801	0.2779	0.2803	4	0.2092	0.2093	0.2097	0.2094	5	5.3

Bold font indicates the best performance and the highest ranking across 17 models for each metric. For non-oversampling and cost-sensitive learning models, the only difference between each run is the random state of each model; hence, both LR and SVM have the same results across all three runs.
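
The Avg. Rank column in Table A5 appears to be the arithmetic mean of the four per-metric ranks, shown to one decimal place. This is an inference from the tabulated values rather than a statement in the text. The sketch below recomputes it for a subset of models; the names `ranks`, `avg_rank`, and `best_model` are illustrative.

```python
from statistics import mean

# Per-metric ranks (accuracy, AUROC, F1, AUPRC) copied from Table A5
# for a subset of models; the other rows follow the same pattern.
ranks = {
    "RF-NO": (4, 3, 2, 2),
    "RF-CSL": (11, 1, 1, 3),
    "GB-CSL": (8, 2, 3, 1),
    "RUSBoost": (2, 6, 6, 6),
    "BRF": (7, 5, 4, 5),
}

# Assumed rule: average rank = mean of the four metric ranks.
avg_rank = {model: mean(r) for model, r in ranks.items()}

# The lowest average rank identifies the best overall model.
best_model = min(avg_rank, key=avg_rank.get)
print(avg_rank, best_model)
```

For RF-NO the unrounded mean is 2.75, which the table reports as 2.8; it is the lowest average rank among the models listed here.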

Table A6. Mean and standard deviation of accuracy, AUROC, F1, and AUPRC over 100 runs for the general model trained on each time window. R indicates the rank of the mean value across 17 time windows.

Time Window	Accuracy Mean (SD)	Accuracy R	AUROC Mean (SD)	AUROC R	F1 Mean (SD)	F1 R	AUPRC Mean (SD)	AUPRC R	Avg. Rank
2000–2017	0.8370 (0.0017)	10	0.6585 (0.0084)	16	0.3377 (0.0087)	14	0.2772 (0.0134)	17	14.3
2001–2017	0.8370 (0.0017)	9	0.6584 (0.0078)	17	0.3381 (0.0094)	13	0.2778 (0.0119)	16	13.8
2002–2017	0.8367 (0.0017)	12	0.6589 (0.0069)	15	0.3370 (0.0092)	17	0.2781 (0.0113)	15	14.8
2003–2017	0.8362 (0.0025)	16	0.6602 (0.0070)	14	0.3372 (0.0095)	16	0.2796 (0.0113)	14	15.0
2004–2017	0.8363 (0.0026)	15	0.6608 (0.0075)	13	0.3376 (0.0086)	15	0.2808 (0.0112)	13	14.0
2005–2017	0.8365 (0.0021)	13	0.6614 (0.0077)	11	0.3393 (0.0092)	12	0.2821 (0.0109)	11	11.8
2006–2017	0.8364 (0.0021)	14	0.6615 (0.0067)	10	0.3427 (0.0093)	11	0.2813 (0.0106)	12	11.8
2007–2017	0.8368 (0.0017)	11	0.6614 (0.0069)	12	0.3459 (0.0101)	10	0.2825 (0.0114)	10	10.8
2008–2017	0.8371 (0.0014)	8	0.6632 (0.0061)	9	0.3497 (0.0100)	9	0.2841 (0.0107)	9	8.8
2009–2017	0.8374 (0.0014)	7	0.6663 (0.0056)	8	0.3506 (0.0097)	7	0.2910 (0.0107)	8	7.5
2010–2017	0.8378 (0.0015)	5	0.6677 (0.0055)	7	0.3504 (0.0101)	8	0.2966 (0.0091)	6	6.5
2011–2017	0.8379 (0.0016)	4	0.6691 (0.0056)	5	0.3514 (0.0091)	6	0.2981 (0.0085)	5	5.0
2012–2017	0.8381 (0.0019)	2	0.6720 (0.0052)	1	0.3522 (0.0093)	5	0.3044 (0.0087)	1	2.2
2013–2017	0.8382 (0.0023)	1	0.6704 (0.0050)	3	0.3522 (0.0090)	4	0.3032 (0.0087)	2	2.5
2014–2017	0.8378 (0.0019)	6	0.6708 (0.0043)	2	0.3530 (0.0085)	3	0.3007 (0.0079)	3	3.5
2015–2017	0.8379 (0.0020)	3	0.6697 (0.0029)	4	0.3584 (0.0074)	1	0.2991 (0.0056)	4	3.0
2016–2017	0.8317 (0.0017)	17	0.6685 (0.0010)	6	0.3539 (0.0034)	2	0.2954 (0.0024)	7	8.0

Bold font indicates the best performance and the highest ranking across 17 time windows for each metric.
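
The selection of the optimal training window from Table A6 can be sketched as follows, assuming the window with the lowest mean of the four metric ranks is chosen (consistent with the caption of Table 4). The ranks are copied from the table and keyed by the first training year; `window_ranks`, `average_rank`, and `best_start` are illustrative names, not the authors' code.

```python
# Ranks (accuracy, AUROC, F1, AUPRC) from Table A6, keyed by the first
# training year. Every window ends in 2017.
END_YEAR = 2017
window_ranks = {
    2000: (10, 16, 14, 17),
    2001: (9, 17, 13, 16),
    2002: (12, 15, 17, 15),
    2003: (16, 14, 16, 14),
    2004: (15, 13, 15, 13),
    2005: (13, 11, 12, 11),
    2006: (14, 10, 11, 12),
    2007: (11, 12, 10, 10),
    2008: (8, 9, 9, 9),
    2009: (7, 8, 7, 8),
    2010: (5, 7, 8, 6),
    2011: (4, 5, 6, 5),
    2012: (2, 1, 5, 1),
    2013: (1, 3, 4, 2),
    2014: (6, 2, 3, 3),
    2015: (3, 4, 1, 4),
    2016: (17, 6, 2, 7),
}


def average_rank(metric_ranks):
    """Mean of the per-metric ranks for one training window."""
    return sum(metric_ranks) / len(metric_ranks)


# Pick the start year whose window has the lowest average rank.
best_start = min(window_ranks, key=lambda y: average_rank(window_ranks[y]))
best_window = f"{best_start} to {END_YEAR}"
print(best_window, average_rank(window_ranks[best_start]))
```

The average rank improves as older years are dropped, reaches its minimum for the window starting in 2012, and worsens again for the shortest window, where too little training data remains.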

Table A7. Forecast prediction results on post-sample data. GM and SM stand for general model and segmented model, respectively.

Model	Fraud Rate	Accuracy GM	Accuracy SM	AUROC GM	AUROC SM	F1 GM	F1 SM	AUPRC GM	AUPRC SM	Best
General (Entire post-sample data)	17.40%	0.8127	-	0.7452	-	0.4433	-	0.3864	-	-
Segmented models for fraud type
P2501 Fictitious profit	1.94%	0.9667	0.9511	0.7759	0.8110	0.0844	0.1687	0.0774	0.0791	SM
P2502 Fictitious asset	0.49%	0.9779	0.9948	0.7687	0.8156	0.0216	0.1026	0.0201	0.0920	SM
P2503 False record	11.69%	0.8824	0.8832	0.7558	0.7520	0.3584	0.3502	0.3048	0.2966	GM
P2505 Material omission	10.42%	0.8915	0.8959	0.7315	0.7320	0.3101	0.3083	0.2500	0.2446	Tie
P2509 Unauthorized change in capital usage	1.15%	0.9705	0.9776	0.6739	0.7436	0.0441	0.1000	0.0192	0.0660	SM
P2510 Occupancy of company’s asset	4.76%	0.9397	0.9291	0.7501	0.7559	0.1721	0.2032	0.1161	0.1394	SM
P2512 Illegal stock trading	0.69%	0.9763	0.9931	0.6668	0.6668	0.0224	0.0941	0.0536	0.0958	SM
P2514 Illegal guarantee	3.48%	0.9540	0.9509	0.7797	0.7617	0.1386	0.1814	0.1321	0.1330	Tie
P2515 Mishandling of general account	3.22%	0.9539	0.9388	0.7647	0.7336	0.1354	0.1419	0.0942	0.0857	GM
Segmented models for industry
A—Agriculture	16.88%	0.8312	0.7792	0.6743	0.7656	0.3175	0.4348	0.3851	0.4058	SM
B—Mining	18.44%	0.8298	0.8156	0.7528	0.7047	0.4730	0.3438	0.4730	0.3412	GM
C—Manufacturing	17.02%	0.8317	0.8326	0.7477	0.7403	0.4396	0.4247	0.3852	0.3748	GM
D—Power, gas, and water	11.98%	0.8802	0.8802	0.7773	0.7710	0.4416	0.3385	0.4748	0.2875	GM
E—Construction	16.58%	0.8289	0.8342	0.7746	0.6931	0.4103	0.3103	0.3882	0.3432	GM
F—Wholesale and retail	19.75%	0.7931	0.8025	0.7544	0.7032	0.4194	0.3988	0.3988	0.3851	GM
G—Transport and storage	9.64%	0.9036	0.9036	0.7451	0.7392	0.3500	0.3125	0.3034	0.1972	GM
I—Information technology	21.97%	0.7865	0.7959	0.7129	0.6919	0.4779	0.4341	0.4237	0.4523	Tie
K—Real estate	15.28%	0.8472	0.8472	0.6728	0.7470	0.3226	0.3125	0.2848	0.3166	SM
L—Leasing and commercial	29.91%	0.7179	0.6581	0.7885	0.6620	0.6237	0.5345	0.5779	0.4471	GM
N—Environment	15.45%	0.8636	0.8455	0.7584	0.7242	0.3784	0.2727	0.4437	0.3014	GM
R—Culture and sports	20.72%	0.7928	0.7928	0.7826	0.6902	0.5385	0.3889	0.4722	0.3663	GM

Bold font indicates the best performance between general model and segmented model for each metric.
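
The Best column in Table A7 is consistent with a simple rule: count the metrics on which each model scores strictly higher, and declare the model with more wins the best, with a tie when the counts are equal. This rule is inferred from the tabulated values and is an assumption, not a quotation of the authors' procedure. A minimal sketch, with the hypothetical helper `best_model` and a few rows copied from the table:

```python
def best_model(gm, sm):
    """Return 'GM', 'SM', or 'Tie' by counting strict per-metric wins."""
    gm_wins = sum(g > s for g, s in zip(gm, sm))
    sm_wins = sum(s > g for g, s in zip(gm, sm))
    if gm_wins > sm_wins:
        return "GM"
    if sm_wins > gm_wins:
        return "SM"
    return "Tie"


# (accuracy, AUROC, F1, AUPRC) for GM and SM, plus the reported Best value,
# copied from Table A7.
rows = {
    "P2501 Fictitious profit": (
        (0.9667, 0.7759, 0.0844, 0.0774),
        (0.9511, 0.8110, 0.1687, 0.0791),
        "SM",
    ),
    "P2505 Material omission": (
        (0.8915, 0.7315, 0.3101, 0.2500),
        (0.8959, 0.7320, 0.3083, 0.2446),
        "Tie",
    ),
    "P2515 Mishandling of general account": (
        (0.9539, 0.7647, 0.1354, 0.0942),
        (0.9388, 0.7336, 0.1419, 0.0857),
        "GM",
    ),
    "K Real estate": (
        (0.8472, 0.6728, 0.3226, 0.2848),
        (0.8472, 0.7470, 0.3125, 0.3166),
        "SM",
    ),
}

results = {name: best_model(gm, sm) for name, (gm, sm, _) in rows.items()}
matches = all(results[name] == reported for name, (_, _, reported) in rows.items())
print(results, matches)
```

Equal scores, such as the accuracy for real estate, count for neither model under this rule, so that row is decided by the remaining three metrics.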

Table A8. Post-sample AUROC for the general model and segmented models for industry when training size was fixed to be the sample size of each respective industry.

Industry	Post-Sample AUROC (GM)	Post-Sample AUROC (SM)
A—Agriculture	0.5951	0.7615
B—Mining	0.6751	0.7032
C—Manufacturing	0.7403	0.7412
D—Power, gas, and water	0.6241	0.7744
E—Construction	0.5900	0.6989
F—Wholesale and retail	0.7243	0.7022
G—Transport and storage	0.6424	0.7469
I—Information technology	0.6868	0.6916
K—Real estate	0.6201	0.7425
L—Leasing and commercial	0.6480	0.6541
N—Environment	0.6199	0.7324
R—Culture and sports	0.6442	0.6901

Bold font indicates the best performance between general model and segmented model.

Table A9. Top 10 features for general model and selected segmented models based on SHAP. RF R. indicates the ranking of each feature based on feature importance of random forest model.

Rank	Feature name	Abbr.	Cat.	Imp.	Effect	RF R.
(a) General model
1	Retained earnings to total assets ratio	RETA	R	0.0085	Negative	1
2	Financial leverage	FinLev	R	0.0040	Positive	3
3	EBITDA to total liabilities ratio	ETLR	S	0.0037	Negative	4
4	Total leverage	TotLev	R	0.0031	Positive	8
5	Operating cash flow ratio	OCFR	S	0.0030	Negative	12
6	Net cash flow from operating activities	NCFA	S	0.0028	Negative	16
7	Interest coverage ratio	ICR	S	0.0028	Negative	13
8	Tangible assets ratio	TAR	S	0.0027	Negative	7
9	Executives’ salary	ExeSal	C	0.0027	Negative	5
10	Management’s salary	MgmSal	C	0.0025	Negative	9
(b) P2501—Fictitious profit
1	Retained earnings to total assets ratio	RETA	R	0.0161	Negative	1
2	Shareholders’ equity to fixed assets ratio	SEFA	R	0.0095	Positive	9
3	Conformity rate for long-term assets	CRLA	S	0.0092	Positive	5
4	Fixed assets ratio	FAR	S	0.0091	Negative	4
5	Executives’ shares	ExeSha	C	0.0086	Positive	8
6	Cash ratio	CR	S	0.0085	Non-linear	15
7	Operating leverage	OpeLev	R	0.0084	Negative	10
8	No. of employees	Emp	C	0.0079	Negative	7
9	Non-current assets ratio	NCA	S	0.0078	Non-linear	2
10	Liabilities to equity market cap ratio	LEMC	S	0.0075	Non-linear	6
(c) P2502—Fictitious asset
1	No. of supervisors without compensation	SWC	C	0.0203	Negative	1
2	Managers’ shares	MgrSha	C	0.0177	Positive	2
3	Supervisor’s salary	SupSal	C	0.0135	Negative	3
4	Chairman’s shares	ChaSha	C	0.0126	Positive	4
5	Executives’ shares	ExeSha	C	0.0123	Positive	6
6	Management’s shares	MgmSha	C	0.0119	Positive	5
7	Liabilities to tangible assets ratio	LTAR	S	0.0113	Positive	9
8	Directors’ salary	DirSal	C	0.0110	Negative	7
9	Directors’ shares	DirSha	C	0.0109	Positive	10
10	Fixed assets ratio	FAR	S	0.0097	Negative	8
(d) P2509—Unauthorized change in capital usage
1	Financial liabilities ratio	FLR	S	0.0299	Positive	1
2	Operating liabilities ratio	OLR	S	0.0289	Negative	2
3	Conformity rate for long-term assets	CRLA	S	0.0171	Positive	3
4	Quick ratio	QR	S	0.0121	Positive	5
5	Net cash flow from operating activities	NCFA	S	0.0110	Negative	4
6	Shareholders’ equity to fixed assets ratio	SEFA	R	0.0095	Positive	7
7	Cash assets ratio	CAR	S	0.0091	Positive	14
8	Fixed assets ratio	FAR	S	0.0088	Negative	6
9	Operating cash flow ratio	OCFR	S	0.0078	Negative	8
10	Composite tax rate	CTR	P	0.0075	Negative	9
(e) P2510—Occupancy of company’s asset
1	Retained earnings to total assets ratio	RETA	R	0.0149	Negative	1
2	Tangible assets ratio	TAR	S	0.0107	Negative	4
3	Supervisor’s salary	SupSal	C	0.0100	Negative	2
4	Financial leverage	FinLev	R	0.0095	Positive	3
5	Equity attributable to parent company to invested capital ratio	EPIC	P	0.0084	Non-linear	7
6	Interest coverage ratio	ICR	S	0.0080	Negative	10
7	Executives’ salary	ExeSal	C	0.0078	Negative	11
8	Operating liabilities ratio	OLR	S	0.0074	Negative	8
9	Liabilities to tangible assets ratio	LTAR	S	0.0074	Positive	12
10	No. of employees	Emp	C	0.0065	Negative	6
(f) P2512—Illegal stock trading
1	Retained earnings to total assets ratio	RETA	R	0.0255	Negative	1
2	Turnover tax rate	TTR	P	0.0138	Negative	2
3	Composite tax rate	CTR	P	0.0136	Negative	3
4	No. of employees	Emp	C	0.0116	Non-linear	6
5	No. of supervisors without compensation	SWC	C	0.0099	Negative	10
6	Tangible assets ratio	TAR	S	0.0095	Negative	4
7	Net profit to comprehensive income ratio	NPCI	P	0.0080	Negative	14
8	EBITDA to total liabilities ratio	ETLR	S	0.0079	Negative	5
9	Total leverage	TotLev	R	0.0078	Positive	16
10	Operating liabilities ratio	OLR	S	0.0076	Non-linear	7
(g) Industry A—Agriculture
1	Fixed assets ratio	FAR	S	0.0159	Negative	1
2	Working capital	WC	S	0.0144	Positive	2
3	Long-term debt to total assets ratio	LDTA	S	0.0141	Positive	3
4	Long-term debt to equity ratio	LDER	S	0.0110	Positive	5
5	No. of shareholders	ShaHol	C	0.0102	Positive	6
6	Current liabilities ratio	CLR	S	0.0092	Non-linear	7
7	Debt to long-term capital ratio	DLCR	S	0.0090	Positive	8
8	Shareholders’ equity to fixed assets ratio	SEFA	R	0.0085	Positive	4
9	Receivable assets ratio	RAR	S	0.0081	Positive	9
10	Non-current liabilities ratio	NCLR	S	0.0075	Positive	10
(h) Industry K—Real estate
1	Interest coverage ratio	ICR	S	0.0047	Negative	1
2	Financial leverage	FinLev	R	0.0038	Non-linear	2
3	Retained earnings to total assets ratio	RETA	R	0.0029	Non-linear	3
4	Total leverage	TotLev	R	0.0028	Non-linear	5
5	Tangible assets ratio	TAR	S	0.0027	Negative	6
6	Cash assets ratio	CAR	S	0.0022	Positive	4
7	Supervisors’ shares	SupSha	C	0.0020	Non-linear	22
8	Shareholders’ equity to fixed assets ratio	SEFA	R	0.0020	Negative	14
9	Equity attributable to parent company to invested capital ratio	EPIC	P	0.0019	Negative	17
10	Fixed charge coverage ratio	FCCR	S	0.0019	Negative	10

Bold feature name in segmented models indicates the feature also appears as a top 10 feature in general model.
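
One way to quantify the dissimilarity between the top feature sets in Table A9 is to intersect the abbreviations of the general model with those of each segmented model. The sketch below does this for four of the segmented models. The abbreviations are copied from the table, while the overlap count itself is an illustrative measure and not one reported in the paper.

```python
# Top 10 feature abbreviations from Table A9, panel (a).
general = {
    "RETA", "FinLev", "ETLR", "TotLev", "OCFR",
    "NCFA", "ICR", "TAR", "ExeSal", "MgmSal",
}

# Top 10 feature abbreviations for selected segmented models.
segmented = {
    "P2501": {"RETA", "SEFA", "CRLA", "FAR", "ExeSha",
              "CR", "OpeLev", "Emp", "NCA", "LEMC"},
    "P2502": {"SWC", "MgrSha", "SupSal", "ChaSha", "ExeSha",
              "MgmSha", "LTAR", "DirSal", "DirSha", "FAR"},
    "P2510": {"RETA", "TAR", "SupSal", "FinLev", "EPIC",
              "ICR", "ExeSal", "OLR", "LTAR", "Emp"},
    "K": {"ICR", "FinLev", "RETA", "TotLev", "TAR",
          "CAR", "SupSha", "SEFA", "EPIC", "FCCR"},
}

# Features shared with the general model, sorted for stable output.
overlap = {name: sorted(general & feats) for name, feats in segmented.items()}

for name, shared in overlap.items():
    print(f"{name}: {len(shared)} shared -> {shared}")
```

The P2502 model shares none of its top 10 features with the general model and is dominated by corporate governance features, whereas the P2510 and real estate models each share five.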

References

1. Wells, J.T. Corporate Fraud Handbook: Prevention and Detection, 5th ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2017.
2. Warren, J. Occupational Fraud 2024: A Report to the Nations; Technical Report; Association of Certified Fraud Examiners (ACFE): Austin, TX, USA, 2024; Available online: https://legacy.acfe.com/report-to-the-nations/2024 (accessed on 5 March 2025).
3. Cressey, D.R. Other People’s Money; A Study in the Social Psychology of Embezzlement; Patterson Smith: Montclair, NJ, USA, 1953.
4. Zuo, Y.; Liu, X.; Jiang, F.; Shen, S. Sharpened “Real Teeth” of China’s securities regulatory agency: Evidence from CEO turnover. Int. Rev. Econ. Financ. 2024, 96, 103637.
5. Hand, D.J.; Henley, W.E. Statistical Classification Methods in Consumer Credit Scoring: A Review. J. R. Stat. Soc. Ser. A (Stat. Soc.) 1997, 160, 523–541.
6. Li, G.; Wang, S.; Feng, Y. Making differences work: Financial fraud detection based on multi-subject perceptions. Emerg. Mark. Rev. 2024, 60, 101134.
7. Tang, Y.; Liu, Z. A Distributed Knowledge Distillation Framework for Financial Fraud Detection Based on Transformer. IEEE Access 2024, 12, 62899–62911.
8. Achakzai, M.; Juan, P. Using Machine learning Meta-Classifiers to detect financial frauds. Financ. Res. Lett. 2022, 48, 102915.
9. Duan, W.; Hu, N.; Xue, F. The information content of financial statement fraud risk: An ensemble learning approach. Decis. Support Syst. 2024, 182, 114231.
10. Xu, X.; Xiong, F.; An, Z. Using Machine Learning to Predict Corporate Fraud: Evidence Based on the GONE Framework. J. Bus. Ethics 2022, 186, 137–158.
11. Lu, Q.; Fu, C.; Nan, K.; Fang, Y.; Xu, J.; Liu, J.; Bellotti, A.G.; Lee, B.G. Chinese corporate fraud risk assessment with machine learning. Intell. Syst. Appl. 2023, 20, 200294.
12. Whittaker, J.; Whitehead, C.; Somers, M. A dynamic scorecard for monitoring baseline performance with application to tracking a mortgage portfolio. J. Oper. Res. Soc. 2007, 58, 911–921.
13. Adams, N.M.; Tasoulis, D.K.; Anagnostopoulos, C.; Hand, D.J. Temporally-Adaptive Linear Classification for Handling Population Drift in Credit Scoring. In Proceedings of the International Conference on Computational Statistics, Paris, France, 22–27 August 2010; pp. 167–176.
14. Nikolaidis, D.; Doumpos, M.; Zopounidis, C. Exploring population drift on consumer credit behavioral scoring. In Proceedings of the Operational Research in Business and Economics: 4th International Symposium and 26th National Conference on Operational Research, Chania, Greece, 4–6 June 2015; Springer: Cham, Switzerland, 2017; pp. 145–165.
15. Cecchini, M.; Aytug, H.; Koehler, G.; Pathak, P. Detecting Management Fraud in Public Companies. Manag. Sci. 2010, 56, 1146–1160.
16. Perols, J. Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms. Audit. A J. Pract. Theory 2011, 30, 19–50.
17. Kim, Y.; Baik, B.; Cho, S. Detecting financial misstatements with fraud intention using multi-class cost-sensitive learning. Expert Syst. Appl. 2016, 62, 32–43.
18. Hájek, P.; Henriques, R. Mining Corporate Annual Reports for Intelligent Detection of Financial Statement Fraud—A Comparative Study of Machine Learning Methods. Knowl.-Based Syst. 2017, 128, 139–152.
19. Perols, J.L.; Bowen, R.M.; Zimmermann, C.; Samba, B. Finding Needles in a Haystack: Using Data Analytics to Improve Fraud Prediction. Account. Rev. 2017, 92, 221–245.
20. Brown, N.; Crowley, R.; Elliott, W. What Are You Saying? Using topic to Detect Financial Misreporting. J. Account. Res. 2019, 58, 237–291.
21. Bao, Y.; Ke, B.; Li, B.; Yu, Y.J.; Zhang, J. Detecting Accounting Fraud in Publicly Traded U.S. Firms Using a Machine Learning Approach. J. Account. Res. 2020, 58, 199–235.
22. Bertomeu, J.; Cheynel, E.; Floyd, E.; Pan, W. Using machine learning to detect misstatements. Rev. Account. Stud. 2021, 26, 468–519.
23. Khan, A.T.; Cao, X.; Li, S.; Katsikis, V.N.; Brajevic, I.; Stanimirovic, P.S. Fraud detection in publicly traded U.S. firms using Beetle Antennae Search: A machine learning approach. Expert Syst. Appl. 2022, 191, 116148.
24. Yi, Z.; Cao, X.; Pu, X.; Wu, Y.; Chen, Z.; Khan, A.T.; Francis, A.; Li, S. Fraud detection in capital markets: A novel machine learning approach. Expert Syst. Appl. 2023, 231, 120760.
25. Ravisankar, P.; Ravi, V.; Rao, G.; Bose, I. Detection of financial statement fraud and feature selection using data mining techniques. Decis. Support Syst. 2011, 50, 491–500.
26. Song, X.; Hu, Z.H.; Du, J.g.; Sheng, Z. Application of Machine Learning Methods to Risk Assessment of Financial Statement Fraud: Evidence from China. J. Forecast. 2014, 33, 611–626.
27. Liu, C.; Chan, Y.; Hasnain, S.; Alam Kazmi, S.H.; Fu, H. Financial Fraud Detection Model: Based on Random Forest. Int. J. Econ. Financ. 2015, 7, 178–188.
28. Yao, J.; Pan, Y.; Chen, Y.; Li, Y. Detecting Fraudulent Financial Statements for the Sustainable Development of the Socio-Economy in China: A Multi-Analytic Approach. Sustainability 2019, 11, 1579.
29. Wu, X.; Du, S. An Analysis on Financial Statement Fraud Detection for Chinese Listed Companies Using Deep Learning. IEEE Access 2022, 10, 22516–22532.
30. Chen, X.; Zhai, C. Bagging or boosting? Empirical evidence from financial statement fraud detection. Account. Financ. 2023, 63.
31. Rahman, M.J.; Zhu, H. Detecting accounting fraud in family firms: Evidence from machine learning approaches. Adv. Account. 2023, 64, 100722.
32. Cai, S.; Xie, Z. Explainable fraud detection of financial statement data driven by two-layer knowledge graph. Expert Syst. Appl. 2024, 246, 123126.
33. Sun, Y.; Zeng, X.; Xu, Y.; Yue, H.; Yu, X. An intelligent detecting model for financial frauds in Chinese A-share market. Econ. Politics 2024, 36, 1110–1136.
34. Zhou, Y.; Xiao, Z.; Gao, R.; Wang, C. Using data-driven methods to detect financial statement fraud in the real scenario. Int. J. Account. Inf. Syst. 2024, 54, 100693.
35. Pavlidis, N.; Tasoulis, D.; Adams, N.; Hand, D. Adaptive consumer credit classification. J. Oper. Res. Soc. 2012, 63, 1645–1654.
36. Lucas, Y.; Portier, P.E.; Laporte, L.; Calabretto, S.; He-Guelton, L.; Oble, F.; Granitzer, M. Dataset shift quantification for credit card fraud detection. In Proceedings of the 2019 IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), Sardinia, Italy, 3–5 June 2019; pp. 97–100.
37. China Stock Market and Accounting Research Database (CSMAR). Corporate Fraud Dataset. Data. Available online: https://data.csmar.com/ (accessed on 7 March 2023).
38. Granger, C.W.J.; Huang, L. Evaluation of Panel Data Models: Some Suggestions from Time Series. Econ. E J. 1997.
39. Gunnarsson, B.R.; vanden Broucke, S.; Baesens, B.; Óskarsdóttir, M.; Lemahieu, W. Deep learning for credit scoring: Do or don’t? Eur. J. Oper. Res. 2021, 295, 292–305.
40. Chikoore, R.; Kogeda, O.P.; Ojo, S.O. Recent Approaches to Drift Effects in Credit Rating Models. In Proceedings of the e-Infrastructure and e-Services for Developing Countries, Porto-Novo, Benin, 3–4 December 2019; Zitouni, R., Agueh, M., Houngue, P., Soude, H., Eds.; Springer: Cham, Switzerland, 2020; pp. 237–253.
41. Qian, H.; Wang, B.; Ma, P.; Peng, L.; Gao, S.; Song, Y. Managing Dataset Shift by Adversarial Validation for Credit Scoring. In Proceedings of the PRICAI 2022: Trends in Artificial Intelligence, Shanghai, China, 10–13 November 2022; Khanna, S., Cao, J., Bai, Q., Xu, G., Eds.; Springer: Cham, Switzerland, 2022; pp. 477–488.
42. Bijak, K.; Thomas, L. Does segmentation always improve model performance in credit scoring? Expert Syst. Appl. 2012, 39, 2433–2442.
43. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
44. Yang, Y.; Fang, T.; Hu, J.; Goh, C.C.; Zhang, H.; Cai, Y.; Bellotti, A.G.; Lee, B.G.; Ming, Z. A comprehensive study on the interplay between dataset characteristics and oversampling methods. J. Oper. Res. Soc. 2025, 1–22.
45. Lessmann, S.; Baesens, B.; Seow, H.V.; Thomas, L.C. Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. Eur. J. Oper. Res. 2015, 247, 124–136.
46. Youden, W.J. Index for rating diagnostic tests. Cancer 1950, 3, 32–35.
47. Zhang, L.; Altman, E.I.; Yen, J. Corporate financial distress diagnosis model and application in credit rating for listing firms in China. Front. Comput. Sci. China 2010, 4, 220–236.
48. Wei, Y.; Chen, J.; Wirth, C. Detecting fraud in Chinese listed company balance sheets. Pac. Account. Rev. 2017, 29, 356–379.
49. Lin, L.; Nguyen, N.H.; Young, M.; Zou, L. Military executives and corporate outcomes: Evidence from China. Emerg. Mark. Rev. 2021, 49, 100765.

Figure 1. Experimental design flow chart. *** In Stage 2, the selection of the optimal time window is not carried out for the segmented models for industry due to insufficient sample size. For every industry other than manufacturing (Industry C), the full sample from 2000 to 2017 for that industry is used to train the segmented model.

Figure 2. (a) Out-sample accuracy (top left), AUROC (top right), F1 (bottom left), AUPRC (bottom right), and (b) the average ranking for each machine learning model.

Figure 3. (a) Mean accuracy (top left), AUROC (top right), F1 (bottom left), AUPRC (bottom right) with their respective standard deviation over 100 runs on out-sample data using the general model for each time window, and (b) their average rankings over 100 runs.

Figure 4. Post-sample prediction for each fraud type of the general model and its respective segmented model evaluated using accuracy (top left), AUROC (top right), F1 (bottom left), and AUPRC (bottom right).

Figure 5. Post-sample prediction for each industry of the general model and its respective segmented model evaluated using accuracy (top left), AUROC (top right), F1 (bottom left), and AUPRC (bottom right). The leftmost bar for the general model in each metric represents the prediction results of the general model on the entire post-sample data.

Figure 6. Plot of post-sample AUROC against training size for manufacturing industry (left) and power, gas, and water industry (right).

Figure 7. Top 10 features in SHAP for general model and selected segmented models. (a) General model. (b) Model for fraud P2501. (c) Model for fraud P2502. (d) Model for fraud P2509. (e) Model for fraud P2510. (f) Model for fraud P2512. (g) Model for Industry A (agriculture). (h) Model for Industry K (real estate).

Figure 8. Partial dependence plots for top features in general model which appear as top features in segmented models.

Table 1. Studies related to machine learning for corporate fraud prediction in China.

Article	Article Year	Data Year	Fraud Type	No. of Obs.	Fraud Rate	Features	ML	EM	CI	PD	SM	FI
Ravisankar et al. [25]	2011	Unknown	FSF	220	50.00%	PRS	6	4	N	N	Y	N
Song et al. [26]	2014	2008–2012	FSF	550	20.00%	CPRSO	5	3	N	N	N	N
Liu et al. [27]	2015	1998–2014	FF	398	46.31%	PRS	5	3	N	N	N	Y
Yao et al. [28]	2019	2008–2017	FSF	536	25.00%	CPRS	6	5	N	N	Y	N
Achakzai and Juan [8]	2022	2007–2019	FSF	32,173	11.02%	PRS	8	6	N	N	Y	N
Wu and Du [29]	2022	2016–2020	FSF	5130	4.76%	CPRSO	9	6	Y	N	Y	N
Xu et al. [10]	2022	2009–2018	All	35,922	12.36%	CPRSO	6	7	Y	N	Y	Y
Chen and Zhai [30]	2023	2012–2022	FSF	37,388	1.82%	PRSO	5	3	Y	N	N	Y
Lu et al. [11]	2023	2016–2020	All	10,844	10.85%	CPRS	5	6	Y	N	Y	Y
Rahman and Zhu [31]	2023	2003–2017	AF	15,554	12.35%	PRS	5	3	Y	N	Y	N
Cai and Xie [32]	2024	2009–2022	FSF	2647	24.44%	PRS	18	6	Y	N	Y	N
Duan et al. [9]	2024	2007–2018	FSF	23,371	1.69%	CPRSO	6	4	Y	N	N	Y
Li et al. [6]	2024	2017–2022	FF	1272	50.00%	CPRSO	8	4	N	N	Y	Y
Sun et al. [33]	2024	2001–2016	FF	30,636	1.19%	PRS	4	2	N	N	Y	Y
Tang and Liu [7]	2024	Unknown	FF	18,060	1.00%	CPRSO	7	5	Y	N	Y	N
Zhou et al. [34]	2024	2007–2020	FSF	37,502	1.15%	PRS	4	5	Y	N	Y	Y

Fraud Type: AF—accounting fraud; FF—financial fraud; FSF—financial statement fraud. Features: C—corporate governance; P—profitability; R—risk level; S—solvency; O—others; ML—number of machine learning algorithms used; EM—number of evaluation metrics used; CI—dealing with class imbalance (yes or no); PD—dealing with population drift (yes or no); SM—use segmented models (yes or no); FI—analyze feature importance (yes or no).

Table 2. Number of observations and fraud rate for full sample.

Sample	Year	No. of Observations	Fraud Count	Fraud Rate
In-sample	2000–2017	22,865	2808	12.28%
Out-sample	2016–2017	1374	223	16.23%
Post-sample	2018–2019	6700	1166	17.40%
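
The fraud rates in Table 2 follow directly from the fraud counts and the numbers of observations. As a quick arithmetic check, a minimal Python sketch is given below; the variable and function names are illustrative and do not come from the paper.

```python
# Figures copied from Table 2: (number of observations, fraud count).
table2 = {
    "In-sample": (22865, 2808),
    "Out-sample": (1374, 223),
    "Post-sample": (6700, 1166),
}


def fraud_rate(observations, fraud_count):
    """Fraud rate as a percentage, rounded to two decimals."""
    return round(100 * fraud_count / observations, 2)


rates = {name: fraud_rate(n, k) for name, (n, k) in table2.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}%")
```

The computed percentages match the Fraud Rate column, and they show the fraud rate rising from the in-sample period to the post-sample period.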

Table 3. Number of features and description for each feature category.

Category	Abbr.	Description	No. of Features
Corporate governance	C	Non-financial information related to the company’s administration.	73
Profitability	P	Measures a company’s ability to generate income.	36
Risk Level/Leverage	R	Measures a company’s debt levels.	10
Solvency/Liquidity	S	Measures a company’s ability to meet short-term and long-term debts.	86
Others	O	Features which do not belong to aforementioned categories.	36

Table 4. Time window with best average ranking across four metrics over 100 runs for each model.

Model	Optimal Time Window
General model	2012 to 2017
Segmented models for fraud type
P2501 Fictitious profit	2015 to 2017
P2502 Fictitious asset	2012 to 2017
P2503 False record	2012 to 2017
P2505 Material omission	2016 to 2017
P2509 Unauthorized change in capital usage	2010 to 2017
P2510 Occupancy of company’s asset	2015 to 2017
P2512 Illegal stock trading	2011 to 2017
P2514 Illegal guarantee	2016 to 2017
P2515 Mishandling of general account	2016 to 2017
Segmented model for industry
Industry C—Manufacturing	2012 to 2017
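
Table 4 reports, for each model, the window with the best average ranking across four metrics over 100 runs. One way such a selection could work is sketched below in Python. The metric names (m1 to m4), the candidate windows, and all scores are hypothetical placeholders, and the ranking rule (rank the windows per metric on run-averaged scores, then average the ranks) is an assumption about the procedure rather than the authors' code.

```python
from statistics import mean


def best_window(scores):
    """Pick the window with the lowest average rank across metrics.

    scores[window][metric] is a list of per-run scores (higher is better).
    Ties in a metric are broken by input order, which is a simplification.
    """
    windows = list(scores)
    metrics = list(next(iter(scores.values())))
    avg_rank = {w: 0.0 for w in windows}
    for m in metrics:
        # Average each window's score over runs, then rank (1 = best).
        ordered = sorted(windows, key=lambda w: mean(scores[w][m]), reverse=True)
        for rank, w in enumerate(ordered, start=1):
            avg_rank[w] += rank / len(metrics)
    return min(windows, key=avg_rank.get), avg_rank


# Hypothetical scores for three candidate windows, two runs each.
scores = {
    "2008 to 2017": {"m1": [0.70, 0.72], "m2": [0.30, 0.31],
                     "m3": [0.60, 0.61], "m4": [0.40, 0.41]},
    "2012 to 2017": {"m1": [0.74, 0.75], "m2": [0.33, 0.34],
                     "m3": [0.59, 0.60], "m4": [0.45, 0.44]},
    "2016 to 2017": {"m1": [0.69, 0.70], "m2": [0.29, 0.30],
                     "m3": [0.62, 0.63], "m4": [0.39, 0.40]},
}
window, ranks = best_window(scores)
print(window, ranks)
```

Averaging ranks rather than raw scores keeps metrics on different scales from dominating the choice, which is one plausible reason to select windows by ranking.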

Table 5. Best model between the general model (GM) and the segmented model (SM) for each fraud type, with the fraud rate of that fraud type on post-sample data. The model that performs better across the four evaluation metrics is considered the best model.

Fraud Type	Fraud Rate	Best Model
P2501 Fictitious profit	1.94%	SM
P2502 Fictitious asset	0.49%	SM
P2503 False record	11.69%	GM
P2505 Material omission	10.42%	Tie
P2509 Unauthorized change in capital usage	1.15%	SM
P2510 Occupancy of company’s asset	4.76%	SM
P2512 Illegal stock trading	0.69%	SM
P2514 Illegal guarantee	3.48%	Tie
P2515 Mishandling of general account	3.22%	GM
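
Table 5 can be summarized by grouping the fraud rates by winning model, which makes the link between low fraud rates and segmented-model wins easier to see. The Python sketch below uses only values copied from Table 5; the grouping and the choice of the median as a summary are illustrative and are not part of the paper's analysis.

```python
from statistics import median

# (fraud type code, post-sample fraud rate in %, best model) from Table 5.
table5 = [
    ("P2501", 1.94, "SM"), ("P2502", 0.49, "SM"), ("P2503", 11.69, "GM"),
    ("P2505", 10.42, "Tie"), ("P2509", 1.15, "SM"), ("P2510", 4.76, "SM"),
    ("P2512", 0.69, "SM"), ("P2514", 3.48, "Tie"), ("P2515", 3.22, "GM"),
]

# Collect the fraud rates under each outcome (SM, GM, or Tie).
by_winner = {}
for code, rate, winner in table5:
    by_winner.setdefault(winner, []).append(rate)

# For each outcome: number of fraud types, highest rate, median rate.
summary = {w: (len(r), max(r), median(r)) for w, r in by_winner.items()}
for winner, (count, highest, mid) in summary.items():
    print(f"{winner}: n={count}, max rate={highest}%, median rate={mid}%")
```

All five fraud types won by the segmented model have a post-sample fraud rate below 5%. The threshold is not clean, however, since the general model wins for P2515 at 3.22% and P2514 is a tie at 3.48%.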

Table 6. Best model between the general model (GM) and the segmented model (SM) for each industry, with the fraud rate of that industry on post-sample data. The model that performs better across the four evaluation metrics is considered the best model.

Industry	Fraud Rate	Best Model
A—Agriculture	16.88%	SM
B—Mining	18.44%	GM
C—Manufacturing	17.02%	GM
D—Power, gas, and water	11.98%	GM
E—Construction	16.58%	GM
F—Wholesale and retail	19.75%	GM
G—Transport and storage	9.64%	GM
I—Information technology	21.97%	Tie
K—Real estate	15.28%	SM
L—Leasing and commercial	29.91%	GM
N—Environment	15.45%	GM
R—Culture and sports	20.72%	GM

Table 7. Category distribution of top ten features via SHAP for each model. The description for each feature category is presented in Table 3.

Model	C	P	R	S	O	Diff.
General model	2	0	3	5	0	-
Segmented models for fraud type
P2501 Fictitious profit	2	0	3	5	0	0
P2502 Fictitious asset	8	0	0	2	0	12
P2503 False record	4	2	2	2	0	8
P2505 Material omission	2	0	3	5	0	0
P2509 Unauthorized change in capital usage	0	1	1	8	0	8
P2510 Occupancy of company’s asset	3	1	2	4	0	4
P2512 Illegal stock trading	2	3	2	3	0	6
P2514 Illegal guarantee	2	0	3	5	0	0
P2515 Mishandling of general account	1	1	2	6	0	4
Segmented models for industry
A—Agriculture	1	0	1	8	0	6
B—Mining	3	0	2	5	0	2
C—Manufacturing	3	0	3	4	0	2
D—Power, gas, and water	4	3	1	2	0	10
E—Construction	3	3	1	3	0	8
F—Wholesale and retail	4	0	2	4	0	4
G—Transport and storage	3	2	0	5	0	6
I—Information technology	0	1	3	6	0	4
K—Real estate	1	1	4	4	0	4
L—Leasing and commercial	4	0	1	5	0	4
N—Environment	1	4	0	5	0	8
R—Culture and sports	3	3	1	3	0	8
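
The Diff. column is not defined in the caption, but the tabulated values are consistent with the sum of absolute differences between each segmented model's category counts and those of the general model. The Python sketch below checks this reading against several rows of Table 7; the interpretation is inferred from the numbers and should be treated as an assumption.

```python
# Category counts (C, P, R, S, O) of the top ten SHAP features, from Table 7.
general = (2, 0, 3, 5, 0)

# model -> (category counts, Diff. value reported in Table 7)
segmented = {
    "P2502 Fictitious asset": ((8, 0, 0, 2, 0), 12),
    "P2503 False record": ((4, 2, 2, 2, 0), 8),
    "P2514 Illegal guarantee": ((2, 0, 3, 5, 0), 0),
    "D Power, gas, and water": ((4, 3, 1, 2, 0), 10),
    "K Real estate": ((1, 1, 4, 4, 0), 4),
}


def category_diff(counts, baseline=general):
    """Sum of absolute per-category differences (an L1 distance)."""
    return sum(abs(a - b) for a, b in zip(counts, baseline))


for model, (counts, reported) in segmented.items():
    computed = category_diff(counts)
    print(f"{model}: computed={computed}, reported={reported}")
```

Because every row sums to ten, this quantity is always even, and half of it gives the number of top-ten slots whose category differs from the general model.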

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction