A Powerful Predicting Model for Financial Statement Fraud Based on Optimized XGBoost Ensemble Learning Technique

Abstract: This study aims to develop a better Financial Statement Fraud (FSF) detection model by utilizing data from publicly available financial statements of firms in the MENA region. We develop an FSF model using a powerful ensemble technique, the XGBoost (eXtreme Gradient Boosting) algorithm, that helps to identify fraud in a set of sample companies drawn from the Middle East and North Africa (MENA) region. The issue of class imbalance in the dataset is addressed by applying the Synthetic Minority Oversampling Technique (SMOTE) algorithm. We use different Machine Learning techniques in Python to predict FSF, and our empirical findings show that the XGBoost algorithm outperformed the other algorithms in this study, namely, Logistic Regression (LR), Decision Tree (DT), Support Vector Machine (SVM), AdaBoost, and Random Forest (RF). We then optimize the XGBoost algorithm to obtain the best result, with a final accuracy of 96.05% in the detection of FSF.


Introduction
Financial Statement Fraud (FSF) is a global concern. FSF is characterized as major omissions or false representations in financial statements caused by a deliberate failure to report financial data in conformity with generally accepted accounting standards. FSF can cause significant impacts on stakeholders of both fraudulent enterprises and nonfraudulent firms if it is not recognized and prevented in a timely manner [1]. Unfortunately, FSF is not easy to spot. Furthermore, even when discovered, significant damage has generally occurred already. As a result, regulators, auditors, and investors would benefit greatly from more efficient and effective techniques that can detect FSF. The Association of Certified Fraud Examiners (ACFE) states that financial statement fraud is the intentional misrepresentation of an enterprise's financial condition by deliberate distortion or omission of the amounts or disclosures in the financial statements to mislead the users of financial statements. According to the Center for Audit Quality (CAQ), individuals or companies are involved in financial statement manipulation for a variety of reasons, including monetary benefits, the need to fulfill short-term financial targets, or to cover up unfortunate news. External and internal consumers of financial statements are constantly questioning financial statements, and regulatory bodies cannot say with confidence that financial statements are credible and prepared in compliance with the regulatory and ethical mandates of the practices of accountants and auditors [2,3]. Consequently, the detection of fraud or deception is important in order to ensure the authenticity of financial statements. In this context, the present study is of practical importance to businesses and auditors, as the global market is witnessing an upsurge in financial accounting fraud that costs businesses billions of dollars a year. 
Financial turmoil has a significant impact on a country's businesses and creditors, and consequently on its economy [1,4,5]. As an outcome, the detection and prediction of financial accounting fraud are becoming an emerging topic for academic studies and industry experts.

Motivation and Contribution
Globally, the cost of FSF has continued to rise over the years. The most direct victims are the investors and financiers who supply funds to firms under false pretenses. There are greater costs to society at the macro-level, such as loss of trust in financial systems, even though the economic impact at the company level can be highly variable. Because FSF causes significant property harm to investors, stakeholders, and society, a great number of studies have been conducted. Most of the works in the literature on FSF apply traditional regression analysis; see, for instance, Andrew and Robin [6][7][8][9][10].
In recent years, a number of experts and researchers have attempted to use machine learning (ML) and data mining methods to carry out research in this field as a way of reducing detection errors. While many business strategies are based on the accuracy of financial statements, there are not enough resources to analyze all of them thoroughly. As auditors have been found culpable in multiple examples of FSF, credible financial fraud detection models should be offered in an easy-to-use manner to auditors, investors, regulators, and other stakeholders. Because of the inherent reliance on limited distributional assumptions, parametric techniques such as LR lack the general applicability that nonparametric approaches may provide [11]. Existing research on quantitative methods of financial fraud detection, on the other hand, is primarily focused on the banking and financial services industry, mostly on the detection of insurance and credit card fraud. Quantitative techniques to identify and discourage FSF should receive significant attention in respected academic journals. The current scientific and academic literature seeks further rules or classifications from previous data to achieve the purpose of prediction or detection. With an appropriate amount of data, ML can yield more accurate prediction and classification outcomes than the traditional approach.
Contribution: This study aims to establish a superior model for financial fraud prediction using a powerful ensemble technique, the XGBoost (eXtreme Gradient Boosting) algorithm, to help identify fraud on a set of sample companies drawn from the MENA region by spotting the early signs. This approach can help to mitigate losses to investors, auditors, and all stakeholders in the financial market. This research enables researchers and practitioners to gain a deeper understanding and make informed decisions based on a financial statement fraud detection model. We focus on utilizing data from publicly available financial statements of firms in the Middle East and North Africa (MENA) region. We selected a powerful ML ensemble model, namely, the XGBoost algorithm, to model our proposed method for a number of reasons. Ensemble algorithms have been successfully used in many fields of research [12], though they have been less utilized in financial fraud studies.
The characteristics of XGBoost fit very well with our small dataset, which is characterized by many missing values and high class imbalance. XGBoost facilitates the tuning of a range of hyperparameters in order to further improve the efficiency of the model. Because of the high class imbalance in this one-of-a-kind domain, the samples selected in past studies have a tendency to be processed less realistically. XGBoost is able to effectively learn from imbalanced data with the help of Synthetic Minority Oversampling Technique (SMOTE) sampling. We chose expert-defined financial ratios [13] along with raw financial data for our research. FSF detection models based on financial ratios may be more effective, as the ratios determined by domain experts are mostly based on assumptions that provide a strong prediction of situations in which corporate managers are encouraged to commit fraud [14]. Fernandez-Delgado, Cernadas, Barro, and Amorim [15] demonstrate that there may be no single right model that can be applied to all data environments; therefore, there is uncertainty about whether ensemble algorithms perform better than conventional financial fraud detection methods in our particular context. Hence, we selected five other ML techniques that are widely used in this area and modeled them for performance analysis and comparison with our model: Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), AdaBoost, and Random Forest (RF). We think that the results of this study can be useful to investors, regulators, stock exchanges, boards of directors, external auditors, and other key stakeholders as they seek to prevent, deter, and detect fraudulent financial reporting. Benford's Law is an effective method and analytical technique to help detect accounting fraud; see, for instance, [16][17][18]. Benford's Law is a mathematical tool used to determine whether investigated financial statements contain unintentional errors or fraud.
When using Benford's law, counterfeit numbers have a slightly different pattern than valid or random samples.
The rest of this research is organized as follows: Section 2 is dedicated to a brief review of the related research in the literature; the data and methodology used for the study are discussed in Section 3; Section 4 is reserved for empirical studies and results; finally, Section 5 includes our conclusions and future research directions.

Related Research
Financial Statement Fraud (FSF) detection has been under the limelight for the past few decades, with emphasis placed on accounting anomalies broadly and on financial statement fraud specifically. While initial research made use of statistical or traditional techniques that are both time-consuming and expensive, more recently the focus has drifted with the emergence of big data and ML [19]. The statistical approaches are centered on traditional mathematical methods, while methods involving ML are focused on learning and intelligence. While both categories have many similarities, the key difference between them is that statistical methods are more rigid, while ML methods are able to learn from and adapt to the problem domain [20].
In addition, previous studies have shown the superior efficiency of ML approaches over conventional statistical approaches [21]. According to the literature, there is no one-size-fits-all strategy for detecting financial statement fraud [22]. The findings of [23] show that models built using ML approaches can efficiently detect financial statement fraud, keep up with the continuous evolution of financial reporting fraudulent behavior, and respond with the most up-to-date technology. In [24], the authors demonstrate that detection models developed using ML approaches are more accurate than traditional methods. Therefore, our review of past research is limited to papers that have used only ML techniques for FSF detection. Most of the research in the previous literature has formulated the detection of FSF as a binary classification problem, sometimes as a multi-class problem and other times as a clustering problem. Researchers have conducted both quantitative and qualitative FSF analyses. Text mining has been used extensively for qualitative research. Here, we focus on papers that perform quantitative analysis using ML techniques. In the initial stages, research mainly included the Neural Network (NN) [25][26][27][28][29], LR, DT, SVM, Discriminant Analysis (DA), and Bayesian Belief Network approaches. Supervised learning techniques have been selected for analysis more than unsupervised ones, with studies from the USA, China, Taiwan, and Spain making up 65% of such papers [30]. A considerable number of studies have analyzed the performance of classifiers on FSF detection, showing that SVM [25,26,31,32,33,34], NN [35][36][37][38][39][40], and DT [41,42] perform well in FSF detection/prediction.
In recent years, ensemble ML techniques have begun to be used in studies, mostly outperforming single classifiers. Ensemble classifiers integrate the predictions of multiple base models. Numerous empirical and theoretical findings have shown that combining different models can improve predictive accuracy [43]. Moreover, ensemble models are well known for their capability to reduce bias and variance. Many researchers have shown interest in studying ensemble models incorporating boosting [14], bagging [44][45][46][47], and other hybrid methods [48] on both balanced and unbalanced data. It has been determined that the performance of such models depends on the selection of the base classifiers. An illustration of the reviewed papers is provided in Table 1. Although prior research shows that ensemble classifiers are best at detecting FSF, there is less research on them compared to single classifiers. Most previous studies have used imbalanced datasets for evaluation, as is the case with real-world data. Consequently, because of the common problem of class imbalance, traditional ensemble models must generally be coupled with sampling techniques such as oversampling or undersampling in order to balance the class distribution. Only a few studies have considered the imbalance issue during modeling. While most researchers have used financial ratios for prediction, others have argued that raw variables produce better results [49][50][51]. Various metrics can be used to assess classifier performance, with the prevalent ones being sensitivity or recall, precision, and accuracy [52]. In this paper, we evaluate the different classifiers that can be used in FSF detection while accounting for the class imbalance issue. We take into account both raw financial variables and financial ratios.
In this way, the problem can be solved by ensuring that the related data are distributed among a set of sites across different networks [53][54][55][56][57], which will form part of our future work.

Data
Our experimental dataset included 950 companies in the MENA region. All of the selected companies are from different sectors, including manufacturing, technology, energy, telecommunications, real estate, and insurance. Data were collected from the Osiris global company database (source: https://www.bvdinfo.com/en-gb/our-products/data/international/Osiris (accessed on 20 December 2022)). Based on the availability of the data, we selected two consecutive years from 2012 to 2019 for each company, for a final tally of 102 fraudulent years and 1798 non-fraudulent years. The financial indicators in the database are taken from the respective companies' financial statements and balance sheets. The details of the 26 financial attributes, including financial attributes from the Beneish model [13], are provided in Table 2. All attributes are quantitative, with the target value being discrete and the others being continuous. Professor Messod Beneish, in June 1999, published his study "The Detection of Earnings Manipulation" in which he argued that high sales growth, declining gross margins, soaring operating expenses, and increasing leverage encourage companies to manipulate profits. Such companies are most likely to alter profits by speeding up sales recognition, increasing accruals and cost deferrals, and minimizing depreciation. Therefore, we have taken these attributes into account in our study.
a3 Cost of Goods Sold (COGS)
a4 Current asset
a5 Fixed assets
a6 Total assets
a7 Depreciation
a8 General and administrative expenses
a9 Long term debt
a10 Total current liabilities
a11 Change in current assets
a12 Change in cash
a13 Change in current liabilities
a14 Change in income tax payable
a15 Current maturity of long term debt
a16 Current maturity of LTD
a17 Change in current maturity of long term debt
a18 Amortization
a19 Day's Sales in Receivables Index (DSRI)
a20 Gross Margin Index (GMI)
a21 Asset Quality Index (AQI)
a22 Sales Growth Index (SGI)
a23 Depreciation (DEPI)
a24 Sales, General and Administrative Expenses (SGAI)
a25 Leverage Index (LVGI)
a26 Total Accruals to Total Assets (TATA)

Methodology
A schematic representation of the stages involved in the rest of this study is shown in Figure 1. The first phase is the data preprocessing phase, followed by the classifier modeling phase, and finally the optimization phase.

Data Preprocessing
Our dataset is limited, as there are not many publicly available companies in the MENA region. Rather than list-wise deletion, it is more suitable to deal with missing values by replacing them [58][59][60][61]. In this work, the missing values in the dataset were replaced using the within-country mean values. Normalization was carried out using MinMaxScaler.
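The two preprocessing steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the record layout, the country codes, and the helper names impute_group_mean and min_max_scale are hypothetical, and the study itself used MinMaxScaler rather than a hand-written function.

```python
from collections import defaultdict

def impute_group_mean(rows, group_key, feature):
    """Replace missing (None) values of `feature` with the mean of the row's group.

    Assumes every group has at least one observed value.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        if row[feature] is not None:
            sums[row[group_key]] += row[feature]
            counts[row[group_key]] += 1
    filled = []
    for row in rows:
        value = row[feature]
        if value is None:
            value = sums[row[group_key]] / counts[row[group_key]]
        filled.append({**row, feature: value})
    return filled

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

# Hypothetical toy records; the country codes and figures are made up.
records = [
    {"country": "AE", "total_assets": 100.0},
    {"country": "AE", "total_assets": None},
    {"country": "AE", "total_assets": 300.0},
    {"country": "EG", "total_assets": 50.0},
    {"country": "EG", "total_assets": None},
]
filled = impute_group_mean(records, "country", "total_assets")
scaled = min_max_scale([r["total_assets"] for r in filled])
```

Imputing within each country keeps a missing value close to the scale of comparable firms, which a single global mean would not.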
Detailed descriptive statistics were obtained, and outliers in the dataset were detected and excluded. Feature importance was measured using the Gini importance of each feature. The features were then sorted in descending order of importance and the top k were selected. The top features are highly linked to the target variable. We found that keeping all attributes yielded better results than dropping the least important ones. As our dataset was imbalanced and small, we balanced it with synthetic minority data by oversampling [62], which has been extensively and successfully used in the literature for similar datasets [63][64][65].

Synthetic Minority Oversampling Technique (SMOTE)
In the case of fraud detection problems the minority class needs to receive special consideration, as it defines the phenomena which we aim to anticipate from a multitude of majority class structures that reflect correct processes. The performance of standard classifiers is biased towards the majority class, as they are programmed to minimize the overall inaccuracy of classification regardless of class distribution. This bias problem can be overcome by excluding examples of the majority class, known as undersampling, or by including new examples of the minority class, known as oversampling. Due to the small size of our dataset, we chose the latter.
An effective oversampling technique for producing new examples is SMOTE [62], which is one of the more advanced sampling methods. SMOTE can be implemented independently of the classifier being used. This algorithm addresses the challenge of overfitting caused by random oversampling. It operates in the feature space, creating new instances by interpolating between positive instances that lie close together. SMOTE starts by finding examples that are close in the feature space, drawing a line between them, and creating a new instance at a point along that line [66]. In particular, a random sample of the minority class is chosen first. Then, the k nearest minority-class neighbors of this sample are found (usually k = 5). One of these neighbors is chosen at random, and a synthetic example is constructed at a randomly selected point between the two examples in the feature space.
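The interpolation step described above can be sketched as follows. This is a minimal plain-Python illustration, not the imblearn implementation used in the study; the function name smote_sample, the toy points, and the fixed seed are assumptions made for the example.

```python
import math
import random

def smote_sample(minority, k=5, n_new=10, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the base point (the base itself excluded)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Hypothetical minority-class (fraudulent) samples with two features each
minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [0.9, 1.9], [1.4, 2.1], [1.1, 1.7]]
new_points = smote_sample(minority, k=5, n_new=8)
```

Because every synthetic point lies on a segment between two real minority samples, the new data stay inside the region already occupied by the minority class instead of duplicating existing rows.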
It is evident from prior studies that SVM, NN, DT, and LR have good performance for the task of fraud detection. For comparison with XGBoost, we selected SVM, DT, and LR. From the ensemble methods, we selected the RF and AdaBoost algorithms, as decision tree-based algorithms are considered best for small- and medium-sized data. Model training and testing of the comparison phase algorithms was performed with the help of the Scikit-learn package for Python.
After the comparison phase, it was clear that XGBoost, a tree-based algorithm, performed best on our dataset. In the next stage, we optimized the XGBoost algorithm to obtain an optimal hyperparameter combination with the help of RandomizedSearchCV from Scikit-learn, which performs a randomized search of hyperparameters, to further enhance the performance of the algorithm. The estimator parameters used to implement these methods were optimized by performing a cross-validated search over the parameter settings.

Base Classifiers: SVM, DT, and LR

Support Vector Machine (SVM)
An SVM [67,68] is a discriminative classifier, typically explained as a separating hyperplane. Put another way, this algorithm generates an optimal hyperplane that classifies new instances from labelled training data (supervised learning). The data points or vectors nearest to the hyperplane, which influence the direction of the hyperplane, are referred to as the support vectors, as these vectors support the hyperplane. SVM has high predictive accuracy and generalization capabilities, particularly for small, nonlinear, and high-dimensional samples [45]. Here, we used linear SVM.

Decision Tree (DT)
DTs are among the most successful ML algorithms thanks to their intelligibility and clarity [69]. DT approaches are used for estimation, clustering, and classification tasks [70]. The best attribute is placed at the root node. This attribute is selected based on a measure of information gain that is subsequently used at each stage of tree building. The training dataset is then split into two or more subsets based on the values of the chosen attribute. This process is repeated for each of the subsets, selecting the best attribute for each and creating child nodes. The process continues until a stopping criterion is met, such as reaching a maximum depth or all instances belonging to the same class. The leaf nodes in the decision tree reflect the class, while the decision nodes determine the rules. The test data class is predicted by the decision rules. Among the key benefits of Decision Trees are that they offer a model that meaningfully describes the acquired knowledge and enables the extraction of if-then classification rules [36].
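The information-gain criterion mentioned above can be made concrete with a short sketch. The ratio values, labels, and thresholds below are hypothetical and serve only to show how a candidate split is scored with entropy.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    weighted = sum(len(part) / len(labels) * entropy(part)
                   for part in (left, right) if part)
    return entropy(labels) - weighted

# Hypothetical values of one financial ratio and fraud labels (1 = fraudulent)
ratio = [0.2, 0.4, 0.5, 1.1, 1.3, 1.6]
fraud = [0, 0, 0, 1, 1, 1]
gain_good = information_gain(ratio, fraud, threshold=0.8)  # separates the classes fully
gain_poor = information_gain(ratio, fraud, threshold=0.3)  # isolates a single sample
```

A tree builder would evaluate such gains for every attribute and candidate threshold and place the best-scoring split at the current node.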

Logistic Regression (LR)
LR as a general statistical model was originally developed and popularized primarily by Joseph Berkson [71]. LR is conducted when the dependent variable is binary. It is a discriminative classifier that is linear in its parameters, and is used to describe the relationship between a single dependent binary variable and one or more independent variables. LR can handle both nominal and numerical data. It predicts the likelihood of a binary response based on one or more predictor attributes [72].
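For reference, the standard form of the binary logistic model can be written as follows. The symbols are notation introduced here for illustration (they do not appear in the original text): x is the vector of predictor attributes, beta_0 and beta are the fitted coefficients, and y = 1 denotes a fraudulent observation.

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta^{\top} x)}}, \qquad \log\frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = \beta_0 + \beta^{\top} x$$

The second expression shows why the model is described as linear in its parameters: the log-odds of the positive class are a linear function of the predictors.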

Ensembles: RF, AdaBoost, and XGBoost
Predictions from previously developed individual base estimators can be integrated using ensemble strategies to improve robustness and generalization ability; they can often deliver better results compared to a single estimator. Even when the individual models in the ensembles are fairly simple, the power of ensembling can lead to the creation of strong ensemble models.

Random Forest (RF)
Random Forest (RF) is a bagging ensemble technique proposed particularly for trees [73]. The base model of an RF is a DT. RF addresses the issue of high variance in DTs by combining the predictions of multiple DTs, with each DT trained on a different subset of the data and a different subset of the features. The resulting combination of trees leads to a more robust and less variable model. Random subsets of the data are generated by sampling with replacement, and a DT is trained on each subset. While growing the trees, RF adds more randomness to the structure [74]: the algorithm only considers a random subset of the features when splitting a node. As a result, there is a wide range of diversity among the trees, which generally contributes to a successful model. The randomness of the trees can be increased further by using random thresholds for every feature, instead of searching for the best possible thresholds as in a normal DT.
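The two sources of randomness described above, plus the final vote, can be sketched in a few lines. The helper names are hypothetical, and the choice of roughly sqrt(n_features) candidate features per split is a common convention assumed here, not a setting reported in the study.

```python
import random

def bootstrap_indices(n_rows, rng):
    """Sample n_rows row indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(n_features, rng):
    """Pick about sqrt(n_features) distinct features to consider at a split."""
    size = max(1, round(n_features ** 0.5))
    return sorted(rng.sample(range(n_features), size))

def majority_vote(predictions):
    """Combine per-tree class predictions into a single label."""
    return max(set(predictions), key=predictions.count)

rng = random.Random(42)
rows = bootstrap_indices(n_rows=10, rng=rng)             # rows used to train one tree
features = random_feature_subset(n_features=26, rng=rng)  # candidates at one node
label = majority_vote([1, 0, 1, 1, 0])                    # votes of five hypothetical trees
```

Each tree sees a different bootstrap sample and different candidate features, so the errors of individual trees are less correlated and tend to cancel in the vote.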

AdaBoost
AdaBoost, or Adaptive Boosting, was the first really successful boosting algorithm developed for binary classification [75]. It is an approach to minimize the error of a weak learning algorithm. Theoretically, any algorithm can be a weak learning algorithm if it can produce classifiers that are at least slightly more accurate than random guessing [76]. AdaBoost helps to combine multiple "weak classifiers" into a single "strong classifier" [77]. The most common algorithms used with AdaBoost are DTs. A weak classifier (decision stump) is prepared using weighted samples from the training data. Only binary (two-class) classification problems are supported; thus, each decision stump makes a decision on one input variable and outputs a value of +1 or −1 for the first or second class. Weak models are sequentially added and trained with weighted training data, with the weights of misclassified samples increased after each round. The method progresses until a predetermined number of weak learners has been generated or no more improvements can be achieved on the training dataset.
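One reweighting round of the classic discrete AdaBoost scheme can be sketched as below. The labels and stump predictions are hypothetical, and the sketch assumes the weighted error lies strictly between 0 and 1.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One reweighting step of discrete AdaBoost for labels in {-1, +1}."""
    # Weighted error of the weak learner (assumed to satisfy 0 < err < 1)
    err = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    alpha = 0.5 * math.log((1.0 - err) / err)  # voting weight of this weak learner
    # Misclassified samples (y * h == -1) are up-weighted, correct ones down-weighted
    updated = [w * math.exp(-alpha * y * h)
               for w, y, h in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

y_true = [+1, +1, -1, -1, +1]
stump_pred = [+1, -1, -1, -1, +1]   # hypothetical stump output with one mistake
weights = [0.2] * 5                 # uniform initial weights
alpha, new_weights = adaboost_round(weights, y_true, stump_pred)
```

After this round the single misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.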

XGBoost
XGBoost, or eXtreme Gradient Boosting, is a tree-based algorithm [78]. XGBoost has shown great success in terms of both performance and speed. Boosting is an ensemble strategy with the key goal of reducing bias and variance. The aim is to sequentially build weak trees in such a way that each new tree (or learner) works on the flaws (residual errors) of the preceding trees. In gradient boosting, each new tree is fitted to the gradient of the loss with respect to the current predictions, rather than to explicitly re-weighted data as in AdaBoost. Because of this correction after every new learner is introduced, the whole forms a strong model after convergence. The loss function of the model includes a regularization term that penalizes model complexity in order to decrease the possibility of overfitting. This technique performs well even when there are missing values or many zero values, demonstrating a good ability to deal with sparsity. XGBoost also uses a "weighted quantile sketch" algorithm to propose candidate split points efficiently when building trees on weighted data. In each iteration, the aim of each new learner is to correct the errors that remain after the previous learners.
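The regularized objective referred to above is usually written in the following form. The notation is introduced here for illustration: l is a differentiable loss, f_k is the k-th tree, T is the number of leaves in a tree, and w is its vector of leaf weights.

$$\mathcal{L} = \sum_{i} l\left(y_i, \hat{y}_i\right) + \sum_{k} \Omega\left(f_k\right), \qquad \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}$$

The penalty coefficients correspond to the gamma and reg_lambda hyperparameters of the XGBoost library: larger values favor smaller trees with smaller leaf weights.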

Implementation
We used Python 3.8 for implementation. Detailed descriptive statistics were obtained using pandas_profiling in Python. The outliers in the dataset were detected with the help of IsolationForest, while the most important features were enumerated using the ExtraTreesClassifier in sklearn. SMOTE was implemented using the imblearn package, with k_neighbors = 5. All models were implemented using the Scikit-learn library and evaluated using ten-fold cross-validation.
The Pearson's correlation and feature importance of the attributes are depicted in Figures 2 and 3, respectively. A high correlation means values between ±0.50 and ±1.00. We included all the attributes, as omitting the least important ones resulted in lower performance; extreme outliers in the dataset were detected, analyzed (e.g., to determine whether they resulted from errors in data entry), and removed. All classifiers were modelled using sklearn. LR was implemented using the LogisticRegression method. DT was implemented using the DecisionTreeClassifier method with max_depth = 4 and the criterion set to 'entropy'. SVM was implemented using LinearSVC. For the ensemble classifiers, RF was modelled using RandomForestClassifier with n_estimators = 100. AdaBoost was modelled using AdaBoostClassifier with sigmoid kernel SVC as the base estimator and default n_estimators and learning_rate. Finally, XGBoost was modelled using XGBClassifier() with default parameters, then the hyperparameters were fine-tuned in the subsequent phase.
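A rough sketch of how four of the classifiers configured above might be compared with ten-fold cross-validation is given below. It runs on synthetic data from make_classification, since the MENA dataset is not available here, and the max_iter and random_state values are assumptions added for convergence and reproducibility. AdaBoost and XGBoost are left out because the name of AdaBoost's base-estimator argument differs between scikit-learn versions and XGBoost requires a separate package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dataset: 26 features, binary target
X, y = make_classification(n_samples=300, n_features=26, n_informative=8,
                           random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(max_depth=4, criterion="entropy", random_state=0),
    "SVM": LinearSVC(max_iter=10000),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean ten-fold cross-validated accuracy for each model
scores = {name: cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
          for name, model in models.items()}
```

The scores obtained on synthetic data say nothing about the study's results; the sketch only shows the evaluation pattern.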

Performance Measures
The predictive performance of data mining classifiers is measured in terms of accuracy, precision, recall, and F1-score, the common evaluation metrics of ML. The testing accuracy and F1-score measures are used for performance evaluation. The k-fold cross-validation score, with k set to 10, was used for all the models (XGBoost has inbuilt CV). Accuracy is the ratio of the number of correct predictions to the total number of input samples; Precision is the ratio of correctly predicted positive observations to the total predicted positive observations; Recall, or sensitivity, is the ratio of correctly predicted positive observations to all the observations in the actual positive class; and F1-score is the harmonic mean of Precision and Recall.

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$
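The four measures can be computed directly from the confusion-matrix counts, as in the following sketch. The labels are hypothetical, and F1 is computed as the harmonic mean of precision and recall.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 1 = fraudulent, 0 = non-fraudulent
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

On imbalanced data, accuracy alone can look high even when few fraudulent cases are caught, which is why precision, recall, and F1 are reported alongside it.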

Analysis
The accuracy of any ML algorithm is highly dependent on the problem at hand, as well as on the integrity and complexity of the training dataset. The prediction performance of all six models on the dataset with SMOTE applied is listed in Table 3, with graphical representations displayed in Figures 4-6. The highest mean accuracy rate was achieved by the XGBoost algorithm, followed by SVM and then AdaBoost. LR had a low accuracy rate compared to the others, although it had a high F1-score of 0.8196, as seen in Figure 6 and Table 3. SVM displayed an accuracy rate of 0.8888, second to XGBoost, and from Figures 4-6 it can be seen that it had good performance on the basis of Precision, Recall, and F1-score. DT, RF, and AdaBoost had average performance on all four metrics.
It is evident from these results that XGBoost delivers consistent performance on all four metrics with the highest mean accuracy and F1-score on our SMOTE-applied MENA dataset. The dataset was split into training and testing sets with test_size = 0.3. SVM performed well, though not as well as the XGBoost algorithm. Accuracy and F1-score were obtained after k-fold cross-validation on the training dataset, with k set to 10.

XGBoost Optimization
Based on preliminary observations, XGBoost is the best model for the detection of fraud using the financial statements in our dataset. Next, we further optimized the performance of XGBoost via hyperparameter tuning using RandomizedSearchCV on the accuracy scores with n_iter = 1000 and three-fold cross-validation. A total of 1000 randomly sampled parameter combinations were run, and each model was trained until validation_0-error had not improved in ten rounds. The fitting was achieved by three-fold cross-validation for each of the 1000 candidates, totaling 3000 fits. The best iteration for each round was the one with the least validation error. The list of the best parameters is provided below:
learning_rate: 0.03, min_child_weight: 3, gamma: 1.5, subsample: 0.8, colsample_bytree: 1.0, max_depth: 9, reg_lambda: 1
The best accuracy score across all the parameter combinations for the XGBoost algorithm on our SMOTE-sampled MENA dataset was 0.9605, which is a significant improvement over the accuracy score of 0.9366 in the previous stage. The XGBoost optimization process is summarized in Table 4.
Table 4. XGBoost optimization process.

Steps Description
Hyperparameter tuning using RandomizedSearchCV on accuracy scores with n_iter = 1000 and three-fold cross-validation Obtain the best accuracy score 0.9605, a significant improvement on the accuracy score of 0.9366 in the previous stage.
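The randomized search can be sketched as follows. Note the loudly stated assumptions: scikit-learn's GradientBoostingClassifier is swapped in for XGBClassifier so that the example needs no extra package, the data are synthetic, the search space is hypothetical (the ranges sampled in the study are not reported), and n_iter is reduced from 1000 to 5 to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the SMOTE-balanced dataset
X, y = make_classification(n_samples=200, n_features=26, n_informative=8,
                           random_state=0)

# Hypothetical search space, for illustration only
param_distributions = {
    "learning_rate": [0.01, 0.03, 0.1, 0.3],
    "max_depth": [3, 5, 7, 9],
    "subsample": [0.6, 0.8, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=30, random_state=0),  # stand-in model
    param_distributions=param_distributions,
    n_iter=5,             # the study used n_iter = 1000
    cv=3,                 # three-fold cross-validation, as in the study
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)
best_params, best_score = search.best_params_, search.best_score_
```

Each sampled candidate is fitted once per fold, so the number of fits is n_iter times the number of folds, which matches the 3000 fits reported for 1000 candidates and three folds.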

Conclusions
This study proposed a better FSF detection model by utilizing data from publicly available financial statements of firms in the MENA region. An FSF prediction model was developed using the XGBoost ensemble technique while investigating the utilization of various ML techniques in detecting financial statement fraud using published financial disclosures. Upsampling was performed on the dataset using the SMOTE technique to address class imbalance issues. In our experiments, SMOTE proved to be a beneficial technique for sampling data with a large class imbalance. When learning on an unbalanced dataset, this study shows that XGBoost outperforms other techniques for the task of FSF detection. We conducted analysis and comparison of three individual ML classifiers and three ensemble techniques used widely in FSF detection using a dataset comprising companies from the MENA region. We used SVM, DT, and LR as individual classifiers and RF, AdaBoost, and XGBoost as ensemble techniques. While all the classifier models yielded an acceptable accuracy rate, the simulation results from Table 4 indicate that the XGBoost classifier is the most efficient model for financial statement fraud detection when using these settings. In the next phase of the study, the XGBoost classifier was further optimized by hyperparameter tuning with cross-validation to obtain the best model for the problem. The simulation results indicated that the proposed model has higher performance compared to both classic ML models and ensemble models. In this study, we have considered only financial attributes. Analysis of the decentralized model and of non-financial attributes may be incorporated in the future.

Conflicts of Interest:
The authors declare that they have no competing interests.
