Journal of Risk and Financial Management
  • Article
  • Open Access

2 January 2026

Fraud Risk and Audit Opinions Across Countries: Complementing Accounting-Based Fraud Risk with Machine Learning Methods

1 School of Accounting, Bangkok University, Klong Nueng, Klong Luang, Pathumthani 12120, Thailand
2 Chulalongkorn Business School, Chulalongkorn University, Pathumwan, Bangkok 10330, Thailand
* Author to whom correspondence should be addressed.
J. Risk Financial Manag. 2026, 19(1), 26; https://doi.org/10.3390/jrfm19010026
This article belongs to the Section Financial Technology and Innovation

Abstract

Financial statement fraud is a significant threat to corporate governance, investor confidence, and global capital markets. Traditional fraud detection models, including DF-Score and PF-Score, often rely on linear approaches that may fail to capture complex fraudulent behaviors. This study applies machine learning techniques using eXtreme Gradient Boosting across 34 countries (2016–2024). Results show that enhanced DF-Score and PF-Score effectively capture fraud risk, which is significantly associated with auditors’ opinions. The study integrates machine learning with traditional models to address data complexity and nonlinearity. Practically, the findings provide auditors, regulators, and financial analysts with a tool for improved fraud detection and risk-based auditing.

1. Introduction

Financial statement fraud remains one of the most persistent and damaging issues in corporate governance, undermining investor confidence, market efficiency, and overall economic stability. Traditional fraud detection models, such as the Dechow fraud score (DF-Score) and Piotroski’s fraud score (PF-Score), have been widely employed in accounting research to identify financial misstatements (Dechow et al., 2011a; Piotroski, 2000). These indices primarily rely on static, rule-based frameworks grounded in linear relationships between accounting ratios and fraudulent behavior. However, the rapidly changing nature of financial fraud, driven by increasingly sophisticated manipulation techniques and complex corporate structures, challenges the effectiveness of these traditional methods (Prabha et al., 2024).
The global financial landscape has witnessed several high-profile fraud scandals, such as Wirecard and Continental AG in Germany, Luckin Coffee and China Evergrande in China, Petrobras, Grupo de Educação e Tecnologia, and Americanas SA in Brazil, the Adani Group in India, and Lactalis in France, which have exposed the limitations of existing audit and accounting fraud detection mechanisms across different regulatory regimes (Jolley, 2025; Costa, 2018; Kersting et al., 2024; Teichmann et al., 2023). These cases underscore the urgent need for more adaptive and robust fraud detection models that can accommodate nonlinearities, interactions, and context-specific variations in financial data.
In recent years, machine learning approaches—particularly ensemble methods such as XGBoost—have demonstrated strong predictive capabilities and adaptability across a range of financial applications, including credit risk assessment, bankruptcy prediction, and fraud detection (T. Chen & Guestrin, 2016; Mallela et al., 2024). These methods are especially valuable for modeling complex and nonlinear relationships. They can automatically identify the most relevant features from high-dimensional datasets, thereby addressing many of the limitations inherent in traditional statistical models (Ngai et al., 2011). Al Natour et al. (2023) emphasize the usefulness of computer-assisted audit techniques and tools (CAATTs) in enhancing auditors’ risk assessment capabilities. Similarly, prior research highlights the potential of emerging technologies to support auditors in evaluating and managing risk more effectively (Dulgeridis et al., 2025; Qatawneh, 2024; Tümmler & Quick, 2025). Despite these advances, a notable research gap remains: the systematic application of advanced machine learning methods to dynamically improve and reassess established fraud detection indices using global accounting data.
This study addresses this gap by employing XGBoost to re-estimate the DF- and PF-Score fraud indices. We leverage rich datasets from firms across multiple countries. By integrating both holdout (HO) and cross-validation (CV) frameworks, this research evaluates model performance and generalizability. The originality of this study lies in its combination of advanced machine learning techniques with cross-country financial data to improve fraud detection accuracy, offering both methodological innovation and practical insights for accounting fraud research. This approach not only improves the detection accuracy of fraudulent financial reporting but also contributes to the methodological advancement in accounting fraud research by bridging machine learning and traditional accounting fraud detection.
This study utilizes an international dataset, including financial statement data from firms across 34 countries over the period 2016–2024. Employing two fraud detection models—DF-Score and PF-Score—the analysis highlights substantial differences in predictive performance. The DF-Score model demonstrates superior accuracy, balanced accuracy, and AUC-ROC across both HO and CV methods, compared to the PF-Score model, which shows markedly lower performance metrics. These findings underscore the effectiveness of DF-Score in identifying fraud risk across diverse regulatory environments. This offers robust implications for global accounting research and policy.
The instrumental variable two-stage least square (IV-2SLS) results reveal a significant relationship between fraud and audit outcomes. Using industry-fraud and firm growth as instruments, both DF-Score and PF-Score are associated with audit opinions. This suggests that fraud risk captured by the DF-Score and PF-Score is more likely to trigger stricter auditor responses. Strong first-stage and identification statistics confirm instrument validity, and model diagnostics support the robustness of the causal interpretation.
Using predicted DF-Score and PF-Score derived from XGBoost models, the IV-2SLS estimation results demonstrate a strong and consistent relationship between fraud risk and audit outcomes for both DF-Score and PF-Score. The strength of these results underscores the predictive power of XGBoost in capturing fraud risk that auditors respond to. Robustness tests confirm instrument strength and model validity. This reinforces the value of machine learning–based fraud detection tools in accounting fraud detection.
The contributions of this study are twofold. First, it advances the understanding of fraud detection by demonstrating how ensemble learning can capture the intricate patterns and evolving tactics of financial misstatements. From a theoretical perspective, this study makes a significant contribution by integrating advanced machine learning techniques with traditional accounting fraud detection models. By applying XGBoost to re-estimate established fraud indices including the DF-Score and PF-Score, the research advances the literature on financial fraud detection by capturing nonlinear and complex relationships that linear, rule-based models often fail to identify. This approach addresses the limitations of static coefficient models and provides a data-driven framework that adapts to evolving fraud patterns across different accounting standards and regulatory environments. Moreover, the study contributes to the growing interdisciplinary dialogue between accounting and data science. It demonstrates how ensemble learning methods can enhance theoretical models of financial misstatements and fraud risk assessment.
Second, it offers auditors, regulators, and financial analysts a more effective and adaptable tool for identifying fraud. This facilitates timely and informed decision-making in diverse regulatory environments. The findings offer valuable insights for auditors, regulators, and financial analysts engaged in fraud detection and prevention. The use of machine learning-driven fraud indices improves the accuracy and robustness of identifying potentially fraudulent financial reporting. As a result, it supports more effective risk-based auditing and regulatory oversight. By highlighting key financial ratios and accrual-based indicators as consistent predictors, the study informs the development of targeted audit procedures and early warning systems. These enhanced fraud detection tools can lead to more timely interventions, minimizing financial losses and reputational damage for firms across different countries. Additionally, the model’s adaptability to diverse datasets and regulatory contexts makes it applicable to multinational corporations and international regulatory bodies seeking scalable fraud detection solutions.
The paper is structured as follows. The next section presents the literature review and hypothesis development. The subsequent sections describe the data and research methodology, followed by the findings and results. The final section summarizes the research.

2. Literature Review and Hypothesis Development

Accounting theories provide a conceptual framework for understanding financial misstatements. One of the most widely recognized frameworks is the fraud triangle, first proposed by Cressey (1953). This theory identifies three necessary elements for fraud to occur: pressure (e.g., financial difficulties or performance targets), opportunity (e.g., weak internal controls or ineffective oversight), and rationalization (e.g., the perpetrator’s justification of their actions). Wolfe and Hermanson (2004) expand this model by adding the element of capability. They argue that the ability to commit fraud requires not only motive and opportunity but also the skills and position to exploit weaknesses.
Agency theory (Jensen & Meckling, 1976) offers another critical lens. It suggests that conflicts of interest between principals and agents create incentives for fraudulent reporting. Managers, driven by personal incentives such as bonuses or career concerns, may manipulate earnings to meet expectations. This behavior can induce financial misstatements, which in turn weaken investor confidence and market efficiency (Healy & Wahlen, 1999).
Signaling theory also plays an essential role in fraud-related research. Firms use audited financial statements and audit opinions to signal quality and credibility to the market (Spence, 1973). When fraud occurs, it distorts these signals, leading to increased uncertainty and potentially a higher cost of capital (Diamond, 1985). Auditors, in this context, act as gatekeepers who mitigate information asymmetry by providing reasonable assurance about the reliability of financial reports.
These theoretical frameworks collectively establish the groundwork for examining how fraud emerges, how it is rationalized, and how audit mechanisms function as deterrents or detectors of fraudulent financial reporting.

2.1. Fraud Across Countries

Fraud is a global phenomenon. Its prevalence, types, and regulatory responses vary considerably by country due to differences in legal systems, enforcement rigor, cultural norms, and corporate governance environments. In the United States, regulatory reforms such as the Sarbanes-Oxley Act (SOX) of 2002 significantly strengthened the regulatory environment after high-profile scandals like Enron and WorldCom (Coates, 2007). SOX introduced rigorous internal control requirements and enhanced auditor responsibilities, aiming to reduce fraud risk and improve financial reporting quality. Khurana and Raman (2004) provide evidence that cross-country differences in litigation risk, rather than audit firm brand name, are the primary drivers of perceived audit quality.
In emerging economies, the challenge of fraud is often compounded by weaker institutional frameworks and less effective enforcement. For example, China has experienced a wave of accounting fraud cases, particularly among firms listed in foreign markets. Studies point to factors such as governance weaknesses, audit quality variability, and regulatory arbitrage as contributors to these fraudulent activities (H. Chen et al., 2011). Similarly, India has faced major corporate fraud cases such as Satyam, which exposed gaps in audit oversight and corporate governance and led to reforms aimed at enhancing auditor accountability and board independence.
European countries, such as Germany and the UK, have also dealt with fraud scandals. The Wirecard scandal in Germany drew attention to systemic weaknesses in regulatory scrutiny and auditor independence, prompting ongoing debates about audit reform and tighter enforcement (Beerbaum, 2021). In the UK, the collapse of Carillion intensified calls for improved corporate governance and auditor reform to restore public trust (Alkaraan et al., 2024).
These cross-country variations underscore the importance of context-specific approaches to fraud detection and prevention, as well as the potential for leveraging advanced data-driven tools to complement traditional audit processes in different regulatory environments.

2.2. Fraud Index

The detection of financial misstatements has evolved from manual, rule-based approaches toward more sophisticated, data-driven methodologies. Classical fraud detection models, such as the Beneish M-score (Beneish, 1999), use financial ratios to identify earnings manipulation. The Dechow et al. (2011a) F-score and Piotroski’s fraud score similarly integrate accounting-based metrics to identify fraud risk. These models rely on expert-designed variables and thresholds to flag suspicious financial statements. While these indices provide interpretable frameworks, they often struggle to capture complex, nonlinear patterns and interactions in high-dimensional financial data.
Machine learning (ML) techniques have transformed fraud detection by enabling the analysis of vast amounts of financial data (Agostino et al., 2025; Nemati et al., 2025). Unlike traditional statistical models, ML algorithms learn patterns directly from data. This allows them to detect subtle anomalies and adaptive fraud schemes that evolve over time. These techniques include supervised learning methods such as decision trees, random forests, support vector machines, and neural networks, as well as unsupervised methods such as clustering for anomaly detection (West & Bhattacharya, 2016).
Among ML methods, XGBoost (Extreme Gradient Boosting) has gained prominence due to its combination of predictive accuracy, computational efficiency, and scalability (Tayebi & El Kafhali, 2025; Mallela et al., 2024; Nti & Somanathan, 2024; Liu, 2023; T. Chen & Guestrin, 2016). Nguyen Thanh and Phan Huy (2025) demonstrate that XGBoost outperforms other machine learning techniques in fraud detection model construction. XGBoost is an ensemble learning algorithm that builds additive decision trees iteratively. It optimizes a differentiable loss function via gradient descent. Its key advantages include regularization to prevent overfitting, handling of missing values, and parallel processing capabilities, which are particularly valuable for large, noisy financial datasets. Moreover, XGBoost’s flexibility in feature engineering and handling of imbalanced classes makes it well-suited to fraud detection, where fraudulent cases are typically rare compared to legitimate transactions.
Recent studies have demonstrated the superior performance of XGBoost in fraud detection contexts. For instance, Ali et al. (2023) use XGBoost to identify fraud in Middle East and North Africa countries. Nti and Somanathan (2024) apply XGBoost to detect earnings management and find it outperformed traditional logistic regression in classification accuracy and robustness. Furthermore, SHAP (SHapley Additive exPlanations) has been used alongside XGBoost to interpret model predictions. This enhances trust and facilitates audit decision-making (Lundberg & Lee, 2017).

2.3. Fraud and Audit Opinions

The relationship between fraud and auditing is central to the integrity of financial reporting, as auditors serve a critical gatekeeping role in detecting and mitigating financial misstatement activities. The issuance of an audit opinion—whether unqualified, qualified, adverse, or a disclaimer of opinion—reflects the auditor’s professional judgment about the fairness and reliability of a company’s financial statements (Messier et al., 2014). Fraud influences this judgment, often increasing the likelihood that auditors issue a modified opinion to signal potential misstatements or uncertainties to stakeholders.
Prior research shows that firms implicated in fraudulent activities are more likely to receive adverse or qualified audit opinions, as auditors express skepticism about the reliability of financial information (Hennes et al., 2008). The nature and severity of fraud impact the auditor’s risk assessment and opinion. For example, material fraud involving deliberate misstatement of revenues or expenses typically leads to adverse opinions or disclaimers of opinion. In contrast, less severe or immaterial fraud may result in qualified opinions.
The determinants of audit opinion extend beyond the presence of fraud and include a combination of firm-specific, auditor-specific, and environmental factors. Firm-specific determinants include financial distress, complexity, profitability, size, and governance structures. Financially distressed firms or those experiencing poor financial performance often face higher fraud risk, prompting auditors to issue more conservative opinions. Complex firms with numerous subsidiaries or international operations increase audit difficulty, raising audit risk and the likelihood of modified opinions (J. Francis & Krishnan, 1999).
Auditor-related factors are equally significant. Audit firm size and reputation correlate with audit quality and fraud detection capacity (DeAngelo, 1981). Larger firms have greater resources and stricter quality control mechanisms, enabling more thorough audit procedures and greater skepticism. Auditor tenure also affects opinion decisions. Longer tenure may reduce independence, potentially leading to less aggressive reporting, while short tenure may prompt caution and more modified opinions (Junaidi et al., 2012). However, Garcia-Blandon et al. (2020) provide evidence that audit quality is not associated with long audit firm tenures. Regulatory and institutional environments also influence audit opinion practices. Stronger legal enforcement and auditor oversight enhance audit quality, which should increase the detection of fraud and the issuance of modified opinions. For example, post-SOX regulations in the United States increased auditor accountability, resulting in more frequent issuance of adverse opinions in fraud cases (Coates, 2007). Moreover, the auditor’s assessment of internal control effectiveness is a vital determinant. The PCAOB and IAASB frameworks emphasize that auditors must evaluate whether internal controls adequately mitigate fraud risk. Weak or ineffective internal controls often lead auditors to increase substantive testing and, if controls are deemed insufficient, to issue qualified or adverse opinions (Arens et al., 2017, p. 407).
Advancements in data analytics and machine learning, such as the integration of fraud risk scores generated via algorithms like XGBoost, provide auditors with objective tools to enhance fraud risk assessment (Appelbaum et al., 2017). These tools augment auditor judgment by highlighting high-risk accounts or transactions, improving the likelihood of detecting material misstatements and appropriately modifying audit opinions (Mukhidinov et al., 2025; Suyono et al., 2025; Yuan et al., 2025). Qatawneh (2024) provides evidence that artificial intelligence in accounting information systems has a statistically significant impact on auditing and fraud detection. Nguyen Thanh and Phan Huy (2025) demonstrate that integrating ML techniques, particularly XGBoost into audit procedures can significantly improve fraud detection.
In summary, audit opinions reflect a multifaceted evaluation of fraud risk, firm characteristics, auditor attributes, and regulatory context. Understanding these determinants helps clarify how auditors respond to fraud risks and how audit opinions serve as a signaling mechanism for financial statement reliability.
Drawing from the literature grounded in agency theory and the fraud triangle, which links managerial misreporting and fraud risk to auditor judgment, we propose the first hypothesis:
H1. 
There is an association between higher fraud risk scores and the issuance of audit opinions.
Drawing on positive accounting theory (PAT; Watts & Zimmerman, 1986), managers’ accounting choices, including misreporting, respond to contracting, regulatory, and incentive pressures. Machine learning algorithms quantify patterns in financial statements that reflect potential managerial behaviors. Machine learning-derived fraud indices therefore make the latent risk predicted by PAT observable for empirical testing against audit outcomes. Based on this, we propose the second hypothesis:
H2. 
Machine learning-derived fraud indices constructed via XGBoost are significantly associated with audit opinions.

3. Materials and Methods

3.1. Data and Variable Description

The dataset used in this study is sourced from the Wharton Research Data Services (WRDS) platform, encompassing accounting and financial statement data from firms operating in 34 countries between 2016 and 2024. The countries included are Australia (AUS), Bulgaria (BGR), Bermuda (BMU), Brazil (BRA), Switzerland (CHE), Chile (CHL), China (CHN), Cayman Islands (CYM), Germany (DEU), Spain (ESP), Finland (FIN), France (FRA), United Kingdom (GBR), Hong Kong (HKG), Indonesia (IDN), Israel (ISR), Japan (JPN), South Korea (KOR), Mexico (MEX), Malaysia (MYS), Norway (NOR), Pakistan (PAK), Peru (PER), Philippines (PHL), Poland (POL), Romania (ROU), Russia (RUS), Saudi Arabia (SAU), Singapore (SGP), Sweden (SWE), Thailand (THA), Turkey (TUR), Taiwan (TWN), and Vietnam (VNM).
To ensure the relevance of the analysis to core operational activities and mitigate sector-specific complexity, firms from the finance and insurance sectors are excluded from the sample. The sample period from 2016 to 2024 allows for an examination of recent trends in financial reporting and fraud risk. It covers both stable and volatile economic environments across diverse regulatory regimes. Data preprocessing involved standardizing financial metrics and ensuring consistency across jurisdictions to support reliable model development and cross-country comparisons. Table 1 presents variable definitions.
Table 1. Variable definitions.

3.2. Models and Econometric Specifications

XGBoost (Extreme Gradient Boosting) has emerged as a superior machine learning technique for predictive tasks involving complex, high-dimensional accounting datasets, due to several key methodological advantages. First, it excels at capturing nonlinear relationships and interactions among financial variables that traditional econometric and linear-based models may fail to detect. Accounting data typically exhibit multicollinearity, outliers, and non-normal distributions, which violate core assumptions of classical statistical models. In contrast, XGBoost is tree-based and non-parametric, making it more robust to such violations. Second, XGBoost integrates regularization techniques into its objective function, controlling for model complexity and reducing overfitting, a potential issue in accounting datasets that are often sparse or imbalanced. This is particularly useful when working with limited instances of fraud cases within large datasets because XGBoost maintains generalizability across samples through built-in CV and early stopping. Moreover, the algorithm can handle missing data, scale to large datasets, and optimize performance through parallel processing. This makes it highly efficient and scalable for real-world accounting applications, including fraud risk modeling, bankruptcy prediction, and earnings management detection. For studies involving data from multiple countries, heterogeneity may become an issue in the analysis. Therefore, we apply one-hot encoding (Zhu et al., 2024) prior to model training to support the analysis in cases where heterogeneity is not explicitly controlled. These preprocessing steps are embedded within the model pipeline and performed before CV and hyperparameter tuning to avoid data leakage.
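As a rough illustration, the sketch below shows how categorical identifiers could be expanded into binary columns with pandas before model training; the column names ("country", "industry", "fyear") and values are placeholder assumptions, not the study's actual variable names.

```python
import pandas as pd

# Illustrative only: the identifiers and values below are placeholders.
df = pd.DataFrame({
    "country":  ["THA", "DEU", "THA", "JPN"],
    "industry": ["manu", "tech", "manu", "tech"],
    "fyear":    [2019, 2020, 2021, 2021],
    "ch_roa":   [0.02, -0.10, 0.05, 0.01],
})

# One-hot encode the categorical identifiers so the tree model can absorb
# country-, industry-, and year-level heterogeneity as binary (0/1) columns.
encoded = pd.get_dummies(df, columns=["country", "industry", "fyear"], dtype=int)
print(encoded.columns.tolist())
```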
Empirical evidence in financial research supports its efficacy. Existing studies comparing machine learning models for financial statement analysis frequently report that XGBoost outperforms support vector machines, neural networks, and logistic regression in terms of accuracy, Area Under the Receiver Operating Characteristic Curve (AUC-ROC), and robustness (T. Chen & Guestrin, 2016). These advantages make XGBoost an especially suitable choice for developing fraud indices from diverse and complex accounting data. Moreover, it combines predictive strength with interpretability through tools like SHAP for feature importance analysis.
When developing a fraud detection index using XGBoost with accounting data, model evaluation is critical to ensure reliable identification of fraudulent activities. The HO validation technique involves dividing the dataset into a single training subset and a test subset, where the model is trained on historical accounting records and evaluated on unseen data. While this approach is computationally efficient, it risks overfitting or underfitting due to the potential non-representative nature of the HO sample, especially in imbalanced fraud datasets. In contrast, K-fold CV systematically partitions the accounting dataset into k folds, iteratively training the model on k−1 folds and validating on the remaining fold. This process provides a comprehensive assessment of model performance across different subsets, which is crucial in fraud detection where subtle patterns in financial statements and transaction records must be consistently identified. CV reduces the variance in performance estimates and enhances the robustness of the fraud index by ensuring that the model’s predictive accuracy is not contingent on a single data split. For accounting professionals tasked with fraud risk assessment, employing CV with XGBoost facilitates more reliable model tuning and validation. It ultimately supports the development of a robust fraud index that generalizes well across diverse accounting scenarios and time periods (T. Chen & Guestrin, 2016).
In building a model with XGBoost, the first step is data preparation, starting with combining all the data into a single dataset. The data is then split, using either the HO or CV approach, into a training set for fitting the XGBoost model and a validation set (or folds, in the case of CV) to evaluate how well the model predicts unseen data. In some cases, a separate test set may also be prepared for final evaluation. After splitting the data (70/30), each categorical variable is transformed into separate binary columns (0/1), a process known as one-hot encoding, to handle possible heterogeneity. Because the DF-Score is treated as a binary variable, we apply an F1-optimizing method to ensure that the optimized threshold accounts for all binary feature contributions from one-hot encoding. This method maximizes the F1 score, balancing precision and recall. The next step is training: XGBoost is fitted on the training set. Once training is complete, the model makes predictions on the validation set or on each fold in CV. The prediction results are then used for evaluation by calculating metrics including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Accuracy, Precision, Recall, F1-score, and AUC-ROC (Chicco et al., 2021). If CV is used, the results from each fold are averaged to obtain a more reliable overall measure of the model’s performance; a minimal code sketch of this pipeline is provided at the end of Section 3.2. The underlying equations are operationalized as follows.

3.2.1. General XGBoost Regression Model

The prediction from XGBoost is a sum of decision trees:
$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$
The objective function with regularization is:
$L(\hat{y}, y) = \sum_{i} \ell(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k), \quad \Omega(f_k) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2$
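To make the correspondence concrete, the snippet below (a sketch, not the paper's code) maps the regularization terms of Ω(f_k) onto XGBoost's gamma and reg_lambda hyperparameters; the parameter values and simulated data are illustrative assumptions.

```python
import numpy as np
from xgboost import XGBRegressor

# gamma corresponds to the per-leaf complexity penalty (the gamma*T term) and
# reg_lambda to the L2 penalty on leaf weights (the (1/2)*lambda*||w||^2 term).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)

model = XGBRegressor(
    n_estimators=100,   # K additive trees f_k
    learning_rate=0.1,  # shrinkage (eta) applied to each tree's contribution
    gamma=1.0,          # complexity penalty per additional leaf (illustrative value)
    reg_lambda=1.0,     # L2 regularization on leaf weights (illustrative value)
)
model.fit(X, y)
print(model.predict(X[:3]))   # y_hat_i = sum_k f_k(x_i)
```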

3.2.2. Fraud Index Construction

DF-Score
The feature vector xi used in the DF-Score includes the following variables: CH_CS, representing change in common stock; CH_CM, representing change in current liabilities; SOFT_ASSETS, representing soft assets; CH_ROA, representing change in return on assets (ROA); RSST_ACC, representing RSST accruals; CH_FCF, representing change in free cash flow; CH_REC, representing change in receivables; and CH_INV, representing change in inventory. The prediction formula is:
$\hat{y}_i^{\,\mathrm{df\_score}} = \sum_{k=1}^{K} f_k(\mathrm{CH\_CS},\ \mathrm{CH\_CM},\ \mathrm{SOFT\_ASSETS},\ \mathrm{CH\_ROA},\ \mathrm{RSST\_ACC},\ \mathrm{CH\_FCF},\ \mathrm{CH\_REC},\ \mathrm{CH\_INV})$
PF-Score
The feature vector xi used in the PF-Score includes the following binary indicators: return on assets (F_ROA), change in return on assets (F_AROA), operating cash flow (F_CFO), change in current ratio (F_ALIQUID), accruals (F_ACCRUAL), change in gross margin (F_AMARGIN), change in asset turnover (F_ATURN), change in leverage (F_ALEVER), and equity offer indicator (EQ_OFFER). These binary indicators are derived from underlying financial variables and reflect various aspects of a firm’s financial performance and stability. The indicators are summed to produce the PF-Score, which serves as a key predictor in the model. The prediction formula is:
$\hat{y}_i^{\,\mathrm{pf\_score}} = \sum_{k=1}^{K} f_k(\mathrm{F\_ROA},\ \mathrm{F\_AROA},\ \mathrm{F\_CFO},\ \mathrm{F\_ACCRUAL},\ \mathrm{F\_AMARGIN},\ \mathrm{F\_ATURN},\ \mathrm{F\_ALEVER},\ \mathrm{F\_ALIQUID},\ \mathrm{EQ\_OFFER})$
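Bringing these pieces together, the following minimal Python sketch illustrates the pipeline described above (70/30 split, XGBoost training, and metric computation) for the DF-Score feature set; the simulated data, the fraud label, and all parameter values are assumptions rather than the study's actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from xgboost import XGBClassifier

# Feature names follow the paper; the values and the fraud label are placeholders
# standing in for the WRDS data actually used in the study.
DF_FEATURES = ["CH_CS", "CH_CM", "SOFT_ASSETS", "CH_ROA",
               "RSST_ACC", "CH_FCF", "CH_REC", "CH_INV"]

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(2000, len(DF_FEATURES))), columns=DF_FEATURES)
data["fraud"] = (rng.random(2000) < 0.05).astype(int)   # rare positive class

# 70/30 holdout split, stratified to preserve the class imbalance.
X_tr, X_te, y_tr, y_te = train_test_split(
    data[DF_FEATURES], data["fraud"], test_size=0.3,
    stratify=data["fraud"], random_state=42)

model = XGBClassifier(n_estimators=200, learning_rate=0.1, eval_metric="logloss")
model.fit(X_tr, y_tr)

prob = model.predict_proba(X_te)[:, 1]
pred = (prob >= 0.5).astype(int)   # default cutoff; the paper tunes this via F1 (Section 3.3)

for name, value in [
    ("accuracy",  accuracy_score(y_te, pred)),
    ("precision", precision_score(y_te, pred, zero_division=0)),
    ("recall",    recall_score(y_te, pred, zero_division=0)),
    ("f1",        f1_score(y_te, pred, zero_division=0)),
    ("auc-roc",   roc_auc_score(y_te, prob)),
]:
    print(f"{name}: {value:.3f}")
```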

3.3. Validation and Metrics

To evaluate the predictive accuracy of the DF-Score and PF-Score models, this research exclusively employs classification-based validation methods. When applying machine learning techniques such as XGBoost, the choice between regression and classification frameworks is critical, as it influences both the modeling approach and the theoretical interpretation of results. Regression-based models, used to predict continuous fraud risk indices, can provide nuanced estimates suitable for ranking firms or setting risk thresholds. Classification-based models are more appropriate for binary outcomes, such as identifying whether financial statement manipulation has occurred. The Dechow F-Score applies a cutoff value of 1, where any value below 1 indicates no presence of fraud, while a higher score suggests an increased likelihood of fraudulent financial reporting. The PF-Score ranges from 0 to 9, where a lower score suggests the absence of fraud. Given the binary DF-Score and the categorical PF-Score, classification-based validation is chosen for the XGBoost model to appropriately evaluate its performance in distinguishing between different levels of fraud risk.
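For illustration, the following sketch shows one way the continuous DF-Score and the 0–9 PF-Score could be converted into classification targets; the column names and values are hypothetical.

```python
import pandas as pd

# Minimal sketch, assuming a DataFrame with the continuous DF-Score and ordinal PF-Score.
data = pd.DataFrame({"df_score": [0.4, 1.8, 0.9, 2.3],
                     "pf_score": [7, 3, 6, 2]})

data["df_label"] = (data["df_score"] >= 1).astype(int)   # Dechow cutoff of 1 -> binary target
data["pf_class"] = data["pf_score"].astype("category")   # 0-9 ordinal / multi-class target
print(data)
```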

3.3.1. Holdout Validation

Holdout validation (HO) is a straightforward method where the available data is divided into two subsets: a training set for model estimation and a test set (holdout) for model evaluation. The theoretical distinction arises from the nature of the outcome variable.
In classification tasks (e.g., predicting whether a firm is fraudulent or not), the model’s output is a class label or probability. Evaluation metrics including accuracy, precision, recall, or AUC-ROC are more suitable. For binary classification, the model is assessed based on its ability to correctly classify the test instances:
$\mathrm{Accuracy}_{\mathrm{holdout}} = \dfrac{TP + TN}{TP + TN + FP + FN}$
where
  • TP = True Positives
  • TN = True Negatives
  • FP = False Positives
  • FN = False Negatives
This approach assumes a categorical target, common in audit risk flagging or financial distress prediction.
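A minimal sketch of this holdout evaluation, using scikit-learn's confusion matrix to recover TP, TN, FP, and FN and compute the accuracy formula above (the labels are toy values, not the study's data):

```python
from sklearn.metrics import confusion_matrix

# Toy holdout-set labels and predictions for illustration only.
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_pred = [0, 0, 1, 0, 0, 0, 1, 1]

# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)   # the holdout accuracy formula above
print(tn, fp, fn, tp, accuracy)
```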

3.3.2. K-Fold Cross-Validation

K-fold cross-validation (CV) provides a more robust evaluation by averaging performance across multiple data splits. The dataset is divided into k equal parts (folds), with the model trained on k−1 folds and tested on the remaining fold. This is repeated k times, and the performance metrics are averaged.
In classification, each fold is evaluated using metrics suited for class imbalance and decision boundaries (Accuracy, AUC-ROC, F1-score, Precision, and Recall). CV classification ensures that performance measures are not overly optimistic due to random data splits, which is crucial when classifying rare events such as fraudulent reporting. Let $M^{(j)}$ represent the metric (e.g., F1-score) computed on fold j; then the cross-validated metric is:
$\mathrm{Metric}_{cv} = \dfrac{1}{k} \sum_{j=1}^{k} M^{(j)}$
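The following sketch illustrates this averaging with a stratified 5-fold CV of an XGBoost classifier; the simulated data and the choice of macro F1 as the metric are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# Placeholder data with a rare positive class, as in fraud detection.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))
y = (rng.random(500) < 0.1).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_f1 = cross_val_score(XGBClassifier(eval_metric="logloss"), X, y,
                          scoring="f1_macro", cv=cv)
print(fold_f1)            # M^(j) for each fold
print(fold_f1.mean())     # Metric_cv: the fold-averaged performance
```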

3.3.3. Integration with Caret Tuning Parameter—η

The Classification and Regression Training (Caret) package provides a unified interface for training, tuning, and evaluating machine learning models across both regression and classification tasks. It supports key functionalities such as data partitioning for holdout validation, automated k-fold CV, hyperparameter tuning, and model comparison and visualization. One of its core strengths lies in its ability to select the optimal learning rate (η∗) through CV.
$\eta^{*} = \arg\min_{\eta \in H} \dfrac{1}{k} \sum_{j=1}^{k} \mathrm{CV\_Error}^{(j)}(\eta)$
where
  • η = model parameter
  • H = hypothesis space (possible parameter values)
  • k = number of folds
  • CV_Error^(j)(η) = cross-validation error on fold j with parameter η
This process is consistent across regression and classification models, allowing us to compare model performance with consistent tuning methodologies. In XGBoost, Caret tunes the learning rate using the following formulation:
$\eta^{*} = \arg\min_{\eta \in H} \dfrac{1}{K} \sum_{k=1}^{K} \mathrm{CV\_Error}_{k}(\eta)$
This ensures that the selected learning rate minimizes the average cross-validation error across folds, leading to improved model generalization and predictive performance.
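A rough Python analogue of this tuning step, selecting the learning rate that optimizes the cross-validated metric over a small grid; the grid values and data are illustrative assumptions, and scikit-learn's GridSearchCV stands in for Caret.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

# Placeholder data; in the study the encoded WRDS features would be used instead.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 8))
y = (rng.random(500) < 0.1).astype(int)

# Search over candidate learning rates (eta) and keep the value with the best
# average cross-validated score, analogous to eta* above.
search = GridSearchCV(
    XGBClassifier(n_estimators=200, eval_metric="logloss"),
    param_grid={"learning_rate": [0.05, 0.1, 0.3]},
    scoring="f1_macro",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X, y)
print(search.best_params_)   # the eta chosen by cross-validation
```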

3.3.4. SHAP Value

After we have finished training the model and made predictions on new data, whether it is a test set or a validation set, we use SHAP to help explain those prediction results. SHAP shows how each feature influences the model’s decision. For example, which features increase or decrease the prediction and how much impact each feature has on individual cases. Using SHAP helps us better understand and trust the model because it clearly reveals the role of each feature in making the predictions.
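As an illustration, the sketch below computes SHAP values for a fitted XGBoost classifier on held-out observations and averages their absolute values as a global importance measure; the data are simulated placeholders, not the study's sample.

```python
import numpy as np
import shap
from xgboost import XGBClassifier

# Placeholder training and holdout data for illustration only.
rng = np.random.default_rng(3)
X_train, X_test = rng.normal(size=(400, 6)), rng.normal(size=(100, 6))
y_train = (rng.random(400) < 0.1).astype(int)

model = XGBClassifier(eval_metric="logloss").fit(X_train, y_train)

explainer = shap.TreeExplainer(model)          # tree-specific SHAP estimator
shap_values = explainer.shap_values(X_test)    # per-observation feature contributions
print(np.abs(shap_values).mean(axis=0))        # mean |SHAP| = global feature importance
```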

3.4. Causality and 2SLS Regression

The study employs a two-stage least squares (2SLS) estimation technique to address the potential endogeneity of the financial fraud score (FRAUD) in the audit opinion (AUO) determination model. The concern arises from the possibility that unobserved factors influencing audit opinions (such as internal governance quality or regulatory scrutiny) may also be correlated with the fraud score, leading to biased estimates in ordinary least squares regression.
To resolve this issue, we instrument FRAUD using two theoretically and empirically justified variables: the industry-fraud (IndF), and revenue growth (GROWTH). We use the industry-level average fraud (excluding the firm) as an instrumental variable for firm-level fraud. This exploits the fact that firms tend to mimic the fraudulent behavior of their industry peers (relevance), while the average fraud of other firms in the industry does not directly affect the audit opinion of the focal firm except through its influence on the firm’s own fraudulent behavior (exogeneity). This approach mitigates potential endogeneity arising from simultaneity or omitted variable bias in the relationship between fraud and audit outcomes. Revenue growth is included as it may influence incentives to misstate earnings but is not expected to directly affect audit opinions conditional on other controls. In the first stage, these instruments significantly predict FRAUD, satisfying the relevance condition. The first-stage regression is specified as follows:
$\mathrm{FRAUD}_{it}^{(\text{DF-Score},\ \text{PF-Score})} = \pi_0 + \pi_1 \mathrm{IndF}_{it} + \pi_2 \mathrm{GROWTH}_{it} + u_{it}$
The predicted value of FRAUD from this first stage is then used in the second-stage structural equation, which models the audit opinion as a function of the instrumented fraud score and a set of control variables:
$\mathrm{AUO}_{it} = \beta_0 + \beta_1 \mathrm{FRAUD}_{it} + \beta_2 \mathrm{DEBT}_{it} + \beta_3 \mathrm{SIZE}_{it} + \beta_4 \mathrm{ASTD}_{it} + \gamma_t + \delta_g + \varepsilon_{it}$
The regression model estimates the effect of firm-level fraud on audit opinion (AUO), controlling for relevant financial variables including leverage (DEBT) and firm size (SIZE). Additionally, we include the variation of accounting standards (ASTD) to control for institutional and regulatory heterogeneity across accounting standard adoptions. We include year fixed effects ($\gamma_t$) to capture common shocks across time. Country fixed effects ($\delta_g$) serve to capture systemic factors such as legal tradition, enforcement intensity, and audit market structure that might influence audit outcomes. To ensure reliable statistical inference, the model uses two-way clustered standard errors, clustered at both the firm and industry levels. This approach accounts for the possibility that residuals (the unexplained part of the outcome) may be correlated within firms over time (e.g., due to persistent firm-specific factors) and within industries across firms (e.g., due to common industry trends or policies). By adjusting for these within-group correlations, two-way clustering improves the robustness of standard errors, reducing the risk of overstating statistical significance due to underestimated variability. This technique strengthens the validity of the inference, particularly in panel data settings where observations are not fully independent. This specification helps ensure that the estimated effect of FRAUD on audit opinion (AUO) is not biased by reverse causality or omitted variable bias. Table 2 summarizes the variables in the above equation.
Table 2. Summary of equation variables.
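As a rough illustration of this two-stage setup, the sketch below estimates the specification with the linearmodels package on simulated placeholder data; variable names follow Table 2, while the year and country fixed effects and the two-way (firm/industry) clustered standard errors used in the paper are omitted for brevity.

```python
import numpy as np
import pandas as pd
from linearmodels.iv import IV2SLS

# Simulated placeholder data; none of these values reflect the study's sample.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "IndF":   rng.normal(size=n),             # instrument 1: peer-industry fraud
    "GROWTH": rng.normal(size=n),             # instrument 2: revenue growth
    "DEBT":   rng.normal(size=n),
    "SIZE":   rng.normal(size=n),
    "ASTD":   rng.integers(0, 2, n).astype(float),
})
df["FRAUD"] = 0.5 * df["IndF"] + 0.3 * df["GROWTH"] + rng.normal(size=n)
df["AUO"] = (0.4 * df["FRAUD"] + rng.normal(size=n) > 0).astype(float)

# Second stage: AUO on instrumented FRAUD plus exogenous controls.
model = IV2SLS.from_formula(
    "AUO ~ 1 + DEBT + SIZE + ASTD + [FRAUD ~ IndF + GROWTH]", data=df)
results = model.fit(cov_type="robust")
print(results.summary)
```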

4. Results

4.1. Descriptive Statistics

Table 3 presents the two fraud index measures—DF-Score and PF-Score. They demonstrate considerable variation across countries, reflecting underlying differences in financial reporting behaviors and market structures. For example, DF-Score, which captures discretionary accrual-based risk, shows extreme variability, with a total sample mean of −0.88 and a standard deviation of 132.51. This is driven by extreme values in certain jurisdictions, such as Indonesia (mean = −23.96, SD = 735.81) and Switzerland (mean = 7.27, SD = 128.14), suggesting significant heterogeneity in earnings management practices (see Appendix A).
Table 3. Descriptive statistics—Actual continuous DF-Score and ordinal PF-Score. DF-Score = RSST_ACC + CH_REC + CH_INV + SOFT_ASSETS + CH_CS + CH_CM + CH_ROA + CH_FCF. PF-Score = F_ROA + F_AROA + F_CFO + F_ACCRUAL + F_AMARGIN + F_ATURN + F_ALEVER + F_ALIQUID + EQ_OFFER.
In contrast, PF-Score, a probabilistic fraud signal score, is relatively consistent across countries, with a median of 6 in nearly all cases. It exhibits lower standard deviations and more symmetrical distributions (overall mean = 5.51, SD = 1.90), indicating that it may be less sensitive to outliers than DF-Score. Countries like China, Indonesia, and Taiwan show wider dispersion and skewness in DF-Score, reflecting either unusual financial reporting patterns or weaknesses in regulatory oversight.
Notably, DF-Score often exhibits extreme skewness and kurtosis in many jurisdictions. These distributional irregularities highlight potential outliers, structural anomalies in earnings management, or fundamentally non-normal data patterns. For example, Romania, China, and Taiwan reveal particularly heavy-tailed and skewed DF-Score distributions, indicating possible concerns with financial transparency or aggressive accounting behaviors.
From a theoretical standpoint, these findings reinforce the argument that fraud risk models are highly context-sensitive. Fraud indicators behave differently depending on local accounting standards, enforcement mechanisms, and broader economic environments. The variation in DF-Score across countries underscores the need for fraud detection models to be adapted to country-specific conditions, rather than applying a uniform threshold. High skewness and kurtosis values suggest that traditional parametric techniques may be inadequate in certain environments, necessitating more robust or non-parametric statistical approaches.
Practically, this has significant implications for investors, auditors, and regulators. In high-variance jurisdictions such as Indonesia, China, or Taiwan, elevated fraud signals—particularly from DF-Score—should trigger heightened scrutiny during audits or investment assessments. Meanwhile, the relatively consistent performance of PF-Score may provide a useful baseline for cross-country fraud comparison. However, its reduced sensitivity could limit its ability to detect localized or subtle manipulation patterns.
Firms themselves can leverage these insights by benchmarking their fraud risk exposure, governance structures, and financial reporting quality against both regional and international standards. This cross-country analysis of fraud risk indicators—DF-Score and PF-Score—offers empirical support for the contextual variability of financial reporting practices. The observed distributional anomalies in DF-Score reflect the heterogeneous landscape of global accounting, shaped by institutional, regulatory, and cultural differences.
These statistical disparities raise questions about the comparability of global fraud detection benchmarks and underscore the necessity for localized calibration of predictive models. The relative stability of PF-Score suggests it may serve as a robust, though potentially less sensitive, tool for international fraud surveillance. Ultimately, these findings contribute to the literature on comparative corporate transparency and emphasize the importance of dynamic, data-driven frameworks in fraud risk evaluation—frameworks that align with national regulatory complexity and financial reporting environments. As the accounting profession continues to globalize, the integration of machine learning tools with localized intelligence will be essential for achieving more accurate, equitable fraud detection and enforcement.

4.2. Fraud Index Constructions

As shown in Table 4 Panel A, the DF-Score’s predicted classifications (binary outputs) reveal a very low mean (0.03) across both HO and CV, with a median of 0. This indicates that the model classifies only a small proportion of firms as fraudulent. Such output is consistent with high skewness (>5) and extreme kurtosis (>29), suggesting a heavily right-skewed distribution with a long tail—typical of rare-event modeling such as fraud detection.
Table 4. Predicting fraud score using XGBoost: Constructing binary DF-Score and ordinal PF-score. (A) Results without heterogeneity handling. (B) Results with heterogeneity handling.
In contrast, the PF-Score classification shows greater variation and a more balanced distribution. For instance, under the HO method, the PF-Score has a mean of 5.51 and a median of 6 (on a scale from 0 to 9), along with relatively low skewness (–0.25) and kurtosis (2.48), indicating a more symmetric and approximately normal distribution. The CV-based PF-Score reflects a slightly higher mean and median, while maintaining similar distributional characteristics. This suggests that the PF-Score, as a multi-level ordinal classification, captures a more nuanced gradation of fraud risk compared to binary classifications.
Given the cross-country heterogeneity in the data, the fraud prediction model controls for country, industry, and year effects using one-hot encoding. A comparison of the predicted results without heterogeneity handling in Panel A and the heterogeneity-handled results in Panel B shows that both produce qualitatively similar patterns, indicating that country-level heterogeneity has only a slight impact on the overall fraud analysis in this dataset.
When comparing these classification-based indices to the actual scores of each model, notable discrepancies are evident. These properties suggest a relatively well-behaved, bounded score that offers more intuitive interpretation and is potentially more useful for ranking or categorizing firms. However, it should be noted that transforming raw data into categorical groupings inevitably entails a loss of information, even though such transformations may enhance certain statistical properties of the variables.
Overall, the results suggest that while the DF-Score provides value for identifying financial misstatements, its extreme distributional properties in both predicted and actual forms may hinder its practical utility. Conversely, the PF-Score, with its balanced and stable characteristics, emerges as a more suitable index for applied audit or forensic analysis settings, especially where ordinal assessments of fraud risk are preferred. Additionally, the consistency of results between HO and CV methods across all models supports the reliability of the XGBoost classification framework, though careful consideration of threshold selection remains essential for ensuring interpretability and accuracy. Thus, to obtain an optimal threshold for imbalanced data, this study employs an optimization approach that maximizes the F1 score to improve the accuracy of binary outcome predictions.
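A minimal sketch of such an F1-maximizing threshold search over predicted probabilities (the helper function and toy values are illustrative, not the study's code):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def f1_optimal_threshold(y_true, y_prob):
    """Return the probability cutoff that maximizes the F1 score."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return thresholds[np.argmax(f1[:-1])]   # last P/R pair has no matching threshold

# Toy values; in the study, y_prob would be the XGBoost predicted fraud probabilities.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0])
y_prob = np.array([0.1, 0.2, 0.7, 0.3, 0.4, 0.9, 0.05, 0.6])
print(f1_optimal_threshold(y_true, y_prob))
```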
In evaluating the performance of classification models applied to accounting and financial datasets, a wide range of statistical metrics are employed to capture different dimensions of predictive accuracy and reliability, as presented in Table 5. Among these, accuracy remains one of the most commonly reported measures, indicating the proportion of total correct classifications over all observations. However, accuracy can be misleading in the presence of class imbalance—a frequent issue in accounting domains such as fraud detection or financial restatements—where predicting the majority class can produce deceptively high scores (Provost & Fawcett, 2013). For example, in Panel A, a model may achieve 99% accuracy by always predicting “non-fraud,” even while missing all actual fraud cases.
Table 5. Model performance metrics for predicted DF-Score and PF-Score. (A) Results without heterogeneity handling. (B) Results with heterogeneity handling.
To mitigate such limitations, balanced accuracy is used to average the recall across all classes, ensuring that minority class performance is equally weighted. This metric is especially valuable in contexts where the positive class—such as fraudulent firms—is rare but highly consequential (He & Garcia, 2009). Balanced accuracy helps ensure that model evaluation does not disproportionately reward correct predictions of the dominant class, thus better reflecting real-world performance. Complementing this, Cohen’s Kappa statistic quantifies the agreement between predicted and actual labels while adjusting for chance agreement. In classification problems involving audit outcomes or corporate governance red flags, Kappa provides a more conservative and statistically grounded assessment of model reliability (Landis & Koch, 1977).
Another widely used metric, logarithmic loss (log loss), evaluates the accuracy of probabilistic predictions, penalizing confident but incorrect classifications more severely. This is crucial in accounting applications involving predictive risk scoring, where the quality of probability estimates matters as much as the final classification. A lower log loss implies better-calibrated probabilities, which are essential when models are used to guide decisions such as audit allocations or enforcement investigations (Brier, 1950). In tandem with log loss, precision (macro-averaged) captures the proportion of true positives among all predicted positives, highlighting the model’s ability to avoid false alarms. This is particularly important in high-stakes accounting environments, where wrongly accusing a firm of fraud can have severe reputational and regulatory consequences.
Equally important is recall (macro-averaged), which measures the proportion of actual positives that are correctly identified. In practical terms, recall answers the question: “Of all the firms that manipulated earnings, how many did the model detect?” A high recall is critical when failing to detect a fraudulent firm could lead to investor losses or audit failures. The F1 score (macro-averaged) synthesizes both precision and recall into a single metric by computing their harmonic mean, offering a balanced view of model effectiveness. Lastly, AUC-ROC quantifies the model’s discriminative ability across all thresholds. An AUC-ROC close to 1.0 indicates that the model effectively ranks positive cases (e.g., fraudulent firms) above negatives, regardless of the specific decision boundary. This is vital for threshold-independent evaluation and is widely used in regulatory and financial surveillance contexts (Fawcett, 2006). Together, these metrics provide a comprehensive framework for evaluating classification models in accounting and financial research. Each metric highlights a different facet of model performance—ranging from overall accuracy to risk of misclassification—thereby enabling more robust, nuanced interpretations of predictive power.
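To illustrate why these metrics matter under class imbalance, the toy example below scores a model that always predicts the majority class: plain accuracy would be 90%, while the imbalance-aware metrics reveal the weakness (all values are fabricated for illustration only).

```python
from sklearn.metrics import (balanced_accuracy_score, cohen_kappa_score,
                             log_loss, f1_score, roc_auc_score)

# 18 non-fraud and 2 fraud cases; the "model" always predicts non-fraud.
y_true = [0] * 18 + [1] * 2
y_pred = [0] * 20
y_prob = [0.05] * 18 + [0.2, 0.3]    # predicted fraud probabilities

print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))  # 0.5, not 0.9
print("Cohen's kappa    :", cohen_kappa_score(y_true, y_pred))        # 0.0: no skill beyond chance
print("log loss         :", log_loss(y_true, y_prob))
print("macro F1         :", f1_score(y_true, y_pred, average="macro", zero_division=0))
print("AUC-ROC          :", roc_auc_score(y_true, y_prob))            # ranking quality, threshold-free
```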
The classification performance metrics for both HO and CV methods across scoring models—DF-Score and PF-Score—show strong results for certain tasks, with some notable weaknesses. In the HO approach, accuracy scores are exceptionally high for DF-Score, ranging between 0.995 and 0.996, indicating excellent predictive alignment with actual class labels. However, for PF-Score, the accuracy drops significantly to 0.552, suggesting the model struggles to predict that particular target. Balanced accuracy, which adjusts for class imbalance by averaging recall across classes, tells a more reliable story: DF-Score maintains a high value above 0.964, while PF-Score again lags at 0.751. Other performance metrics—such as precision, recall, F1-score, and AUC-ROC—reinforce these trends, with macro-averaged F1-scores above 0.96 for DF-Score, while PF-Score records a much lower 0.563.
When comparing the HO and CV results, the patterns remain generally consistent, although CV introduces greater model scrutiny. Accuracy and balanced accuracy for DF-Score stay robust under CV, even improving slightly in some metrics such as precision and Kappa, which reflect agreement between predicted and true labels beyond chance. However, PF-Score underperforms in CV, with accuracy dropping further to 0.223 and Kappa falling to 0.094, suggesting poor class prediction stability. This discrepancy shows how CV uncovers model vulnerabilities that the HO method may obscure (Kohavi, 1995). Different learning rates (ETA) are selected in the models: both methods tuned ETA values of 0.1 and 0.3 depending on the task, indicating that XGBoost finds different trade-offs between convergence speed and generalization. In classification tasks, a lower learning rate like 0.1 often improves generalization by allowing more precise updates but may require more boosting rounds. The use of multiple learning rates in CV reflects the adaptive nature of the tuning process rather than inconsistency—Caret selects the best parameter set for each fold and reports all combinations that achieve similar performance, rather than relying on a single minimum.
In addition, baseline results (Panel A) and heterogeneity-handled results (Panel B) show qualitatively similar patterns, suggesting minimal impact of country-level heterogeneity on the overall analysis.
When classifying firms into high- or low-fraud groups using the optimized classification thresholds of the DF-Score, the threshold values decrease from 0.31 (HO) and 0.39 (CV) to 0.19 (HO) and 0.21 (CV), respectively, after controlling for country, industry, and year heterogeneity. This indicates that incorporating heterogeneity controls makes the model more sensitive in distinguishing between high- and low-fraud observations. The lower thresholds suggest that the model requires less evidence (lower probability) to classify a firm as high fraud risk, likely because the encoded structure absorbs much of the variation that previously inflated prediction uncertainty.
Theoretically, these findings reinforce the importance of using appropriate evaluation metrics in classification, especially under class imbalance. While high accuracy is appealing, it can be misleading in skewed datasets where models simply favor the dominant class (Chicco & Jurman, 2020). Balanced accuracy, macro-averaged precision, recall, and F1-score offer a more reliable assessment of true performance, particularly in real-world financial or accounting datasets that often include unbalanced outcomes. The sharp drop in metrics for PF-Score across both validation methods underscores the limitations of relying solely on overall accuracy and points to potential deficiencies in feature representation or data quality for this target variable. From a model training perspective, the use of different ETA also reflects the necessity of flexible hyperparameter tuning depending on the complexity and signal-to-noise ratio of the target variable.
In practical terms, the high and consistent performance of models trained on DF-Score suggests this target is well-suited for classification with XGBoost under both HO and CV. Financial institutions or accounting firms using such models can expect stable generalization performance across unseen data, particularly if proper CV protocols are implemented. However, the poor results for PF-Score highlight the need for additional data preparation techniques—such as synthetic sampling, feature transformation, or even alternative modeling frameworks—to better capture the structure of that variable. Practitioners must be cautious of misleading holdout results that appear strong, as CV clearly revealed performance degradation in PF-Score that holdout could not. The multiple ETA values further imply that tuning hyperparameters on a task-by-task basis is essential to obtaining optimal results, rather than relying on a fixed learning rate across all scenarios.
In conclusion, while both HO and CV methods produced high-performing classification models for certain score types, CV provided a more rigorous and realistic assessment of model generalizability. This is consistent with Y. Wang et al. (2025), who show that deep learning-based accounting fraud prediction achieves remarkably high prediction accuracy. The consistent underperformance of PF-Score, despite decent holdout results, reinforces the theoretical consensus that CV is more trustworthy in performance validation (Kuhn & Johnson, 2013). The use of multiple learning rates is not a flaw but an advantage of adaptive model selection, where hyperparameter configurations are flexibly chosen based on validation feedback. Future research should investigate why certain score types such as PF-Score resist classification under the current feature sets and whether feature engineering or model stacking can help. Ultimately, this analysis affirms the critical role of robust validation and metric interpretation in machine learning applications within financial and accounting domains.
In the context of XGBoost with HO classification as presented in Table 6 Panel A1, models are trained on a subset of the data and evaluated on a separate, unseen holdout set. This setup closely simulates real-world deployment and provides an unbiased estimate of model performance. However, it also means that traditional feature importance metrics, typically derived during model training (e.g., gain, cover, or frequency of feature usage in tree splits), may not reflect how features behave on unseen data. In contrast, SHAP values computed on the holdout set provide a post hoc, individualized measure of each feature’s contribution to predictions, making them directly applicable to out-of-sample inference.
Table 6. Comparison of predictive feature contributions using traditional metrics and SHAP values. (A) Results without heterogeneity handling. (B) Results with heterogeneity handling.
For example, in the DF-Score model, features like CH_CS, CH_CM, and SOFT_ASSETS maintain high rankings in both importance and SHAP values, indicating stable, generalizable predictors of fraud risk across both training and test data. However, discrepancies—such as CH_INV and CH_REC showing near-zero importance yet non-negligible SHAP contributions—suggest that these features, while not dominant in training splits, influence predictions in certain contexts on the holdout set. This highlights a key limitation of relying solely on training-based feature importance: it can underrepresent features that are predictive only in specific sub-populations or in interaction with other variables—issues that SHAP is designed to uncover.
The PF-Score results show more variation: although features like F_AROA, F_CFO, and F_ALEVER are highly ranked by both metrics, the precise ordering and relative impact differ. Notably, F_CFO emerges as the top SHAP contributor on the holdout set (0.629), even though it is ranked third by traditional importance. This suggests that F_CFO plays a particularly critical role in the model’s generalization to unseen data, possibly due to its sensitivity to underlying cash flow anomalies that are not captured fully during training splits. Moreover, features like F_ALIQUID, which are assigned zero traditional importance, still receive meaningful SHAP values, again reinforcing the idea that SHAP captures context-specific predictive power that tree-splitting metrics may overlook.
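The comparison between training-based importance and holdout SHAP values described above can be sketched as follows. This is a minimal illustration that continues the earlier sketch: it assumes `model` is the trained xgboost Booster and `X_hold` is the unseen holdout DataFrame whose columns carry the DF-Score component names, so the two rankings can be joined by feature name.

```python
import numpy as np
import pandas as pd
import shap

# Traditional gain-based importance is derived from the training-time tree splits.
gain = pd.Series(model.get_score(importance_type="gain"), name="gain")

# SHAP values are computed post hoc on the holdout set, so they measure each feature's
# contribution to out-of-sample predictions rather than to training-time splits.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_hold)
mean_abs_shap = pd.Series(np.abs(shap_values).mean(axis=0),
                          index=X_hold.columns, name="mean_abs_shap")

comparison = (pd.concat([gain, mean_abs_shap], axis=1)
              .fillna(0.0)                  # features never used in a split have no gain
              .sort_values("mean_abs_shap", ascending=False))
print(comparison)  # low gain but non-trivial SHAP flags context-specific predictors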
In Panel A2, the results from the XGBoost CV classification models for DF-Score and PF-Score demonstrate both convergences and discrepancies between traditional feature importance measures and SHAP values, highlighting important considerations for accounting research and practice. CV enhances model robustness by repeatedly training and testing the model on multiple data folds, ensuring that performance metrics and feature relevance are not overly dependent on any single partition of the data. Unlike a single HO evaluation, CV better approximates model generalizability by averaging across diverse subsets, which is especially critical in accounting contexts where data distributions can vary significantly across firms and time periods. However, this setup also introduces subtle complexities in interpreting feature importance metrics versus SHAP values.
In the DF-Score model, key features such as CH_CS, CH_CM, and SOFT_ASSETS consistently rank highest in both importance and SHAP values, underscoring their robust predictive power in detecting financial manipulation. Nonetheless, the shift in feature rankings for variables like RSST_ACC and CH_FCF between importance and SHAP metrics suggests that while these features contribute strongly during tree construction, their influence on out-of-sample predictions is somewhat moderated, reflecting real-world complexities. Furthermore, features like CH_REC and CH_INV have low importance scores but maintain non-trivial SHAP values, indicating their predictive utility in specific cases or subsets of the data—an insight that traditional importance measures may miss. This divergence underscores the technical limitation of relying solely on gain-based importance for interpretability in models involving heterogeneous firm behaviors and interactions among accounting variables.
In the PF-Score model, a more pronounced discrepancy is evident. While F_AROA holds the highest feature importance, F_CFO emerges as the most influential predictor based on SHAP values. This suggests that F_CFO may be particularly sensitive to cash flow-related anomalies that affect model predictions on the holdout set but may be underweighted during training splits due to complex interactions with other features. The fact that features such as F_ALIQUID and F_AMARGIN receive meaningful SHAP values despite lower or zero importance rankings reinforces the interpretive power of SHAP for uncovering subtle but practically relevant predictive relationships that could otherwise be obscured.
The results in Panel B, which incorporate heterogeneity handling, follow the same overall pattern as those shown in Panel A. However, for the PF-Score, the SHAP values indicate slightly different feature importance for the top three features, reflecting minor shifts in their relative contributions compared with traditional importance rankings.
Theoretically, the combined insights from HO and CV classification reinforce the critical need to move beyond traditional feature importance metrics when interpreting complex machine learning models in accounting research. Both approaches demonstrate that SHAP values, by quantifying the marginal impact of each feature on individual predictions within out-of-sample contexts, provide a richer and more precise understanding of how models function in real-world scenarios. While HO validation directly simulates future forecasting on unseen data, offering a clear snapshot of model generalizability, CV extends this by averaging performance and feature effects across multiple data partitions, thereby enhancing the robustness and reliability of inference. This dual perspective emphasizes the value of post-hoc interpretability tools like SHAP in validating and interpreting machine learning outputs, especially in accounting environments characterized by heterogeneous data and complex, nonlinear interactions among financial variables.
Practically, the results indicate that sales growth, cost of goods sold, current assets, cash flow, and return on assets are key indicators of financial fraud, consistently highlighting potential risk. The role of revenue growth, in particular, is supported by prior research (Brazel et al., 2023). Auditors can leverage these insights to focus investigations on accounts with unusual movements, prioritize testing where anomalies are most pronounced, and track patterns over time. Recognizing that feature importance can be context-dependent encourages the development of more nuanced, adaptive risk assessment frameworks that better capture the complexities inherent in financial reporting and manipulation.
In sum, blending the theoretical rigor of HO and CV frameworks with SHAP-based interpretability advances the frontier of accounting analytics. It provides a comprehensive and defensible foundation for employing machine learning in high-stakes financial decision-making, fostering models that are not only predictive but also transparent, generalizable, and practically actionable.
As presented in Table 7, this study examines the relationship between fraud indicators and audit opinions (AUO) using IV-2SLS, addressing endogeneity concerns by employing leave-one-out industry fraud (IndF) and revenue growth (GROWTH) as instruments. IndF captures industry-level fraud tendencies while excluding the firm itself. The first-stage regression results for the DF-Score model (DF-Score > 1 = financial integrity) indicate that IndF is a significant predictor of DF-Score (coefficient = 0.085, t = 2.69), whereas revenue growth is statistically insignificant (coefficient = 0.000, t = 1.37). For the PF-Score model, both IndF (0.224, t = 8.73) and GROWTH (0.000, t = 2.19) are significant predictors. These results suggest that firms in industries with higher fraud prevalence tend to have higher fraud scores. Instrument strength is confirmed by the first-stage statistics: the F-test of excluded instruments (4.44 for DF-Score, 40.81 for PF-Score), the Sanderson-Windmeijer multivariate F-statistic (4.44, 40.81), and the Kleibergen-Paap rk LM statistic (5.45, 22.42) all indicate that the instruments are relevant and the model is not underidentified. Additionally, the Cragg-Donald Wald F-statistics (44.56, 162.47) exceed the Stock-Yogo critical values for various maximal IV sizes, further confirming instrument strength (Stock & Yogo, 2005).
Table 7. Main regression analysis using actual fraud scores: binary DF-Score and ordinal PF-Score. Instrumental variable (first-stage) regression: $FRAUD_{it} = \pi_0 + \pi_1 IndF_{it} + \pi_3 GROWTH_{it} + u_{it}$. Main analysis: $AUO_{it} = \beta_0 + \beta_1 FRAUD_{it} + \beta_2 DEBT_{it} + \beta_3 SIZE_{it} + \beta_4 ASTD_{it} + \gamma_t + \delta_g + \varepsilon_{it}$.
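The two-stage procedure can be sketched in Python with the linearmodels package, which implements 2SLS with first-stage diagnostics. The snippet below shows how a leave-one-out industry fraud instrument of the kind described above can be constructed and passed to IV2SLS; the prepared panel DataFrame `df`, its column names (AUO, FRAUD, DEBT, SIZE, ASTD, GROWTH, industry, year, firm_id), and the clustering choice are illustrative assumptions, not the paper's exact specification.

```python
import pandas as pd
from linearmodels.iv import IV2SLS

# Leave-one-out industry fraud: the industry-year mean of FRAUD excluding the firm itself.
grp = df.groupby(["industry", "year"])["FRAUD"]
df["IndF"] = (grp.transform("sum") - df["FRAUD"]) / (grp.transform("count") - 1)

# Year and industry dummies stand in for the gamma_t and delta_g fixed effects.
exog = pd.concat([df[["DEBT", "SIZE", "ASTD"]],
                  pd.get_dummies(df["year"], prefix="yr", drop_first=True, dtype=float),
                  pd.get_dummies(df["industry"], prefix="ind", drop_first=True, dtype=float)],
                 axis=1).assign(const=1.0)

# FRAUD is instrumented with IndF and GROWTH; linearmodels estimates the first stage
# internally and reports its diagnostics (F statistics, partial R-squared).
iv = IV2SLS(dependent=df["AUO"], exog=exog,
            endog=df["FRAUD"], instruments=df[["IndF", "GROWTH"]])
res = iv.fit(cov_type="clustered", clusters=df["firm_id"])
print(res.first_stage)
print(res.summary)
```

The first-stage output corresponds to the instrument-relevance diagnostics reported in Table 7 (excluded-instrument F statistics and related weak-instrument tests).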
In the second-stage estimation for the binary DF-Score, the positive sign of the coefficient is as expected. The fraud indicator is a marginally significant predictor of audit opinions (coefficient = 5.63). This implies that firms with a higher likelihood of financial misstatements, as captured by the DF-Score, are more prone to receive modified audit reports. The insignificant Hansen J statistic (0.602) confirms the validity of the overidentifying restrictions, affirming that the instruments used are exogenous. These findings align with prior literature asserting that auditors are more likely to issue modified audit opinions when indicators of earnings manipulation are present (Carcello & Neal, 2000).
For PF-Score, the fraud score is negatively associated with audit opinion (coefficient = −0.019), meaning firms with higher fraud risk (lower PF scores) are more likely to receive qualified or adverse audit opinions. The marginally significant Hansen J statistic (3.655, p < 0.10) in the PF model suggests a potential risk of overidentification, although overall model validity is supported by robust Wald and Anderson-Rubin statistics. This aligns with past findings that audit decisions are sensitive not only to the presence but also to the severity of fraud signals (Yousefi Nejad et al., 2024).
Comparatively, both DF and PF scores exhibit marginally statistically significant predictive power in explaining audit opinion outcomes, reinforcing the empirical link between fraud detection models and auditor judgments. However, the magnitude of the effect is stronger in the DF model, likely because binary fraud classifications present a clearer red flag to auditors than more continuous or probabilistic indicators like the PF score. The PF-Score’s granularity allows it to capture subtler differences in fraud probability, but its influence on audit outcomes appears diluted compared to the more definitive DF binary classification. The positive and significant coefficient on DEBT suggests that auditors may view higher leverage as a signal of potential fraud, independent of firm size, highlighting that financial structure can influence audit assessments even when other firm characteristics are controlled, as noted by Nasfi Snoussi et al. (2025).
From a theoretical standpoint, the results support foundational concepts in agency theory, where information asymmetry between managers and external stakeholders leads to opportunistic behavior such as earnings manipulation, which auditors are expected to mitigate (Jensen & Meckling, 1976). The significant role of fraud scores in shaping audit opinions is also consistent with signaling theory: an adverse audit opinion serves as a market signal of deteriorating financial reporting quality (Spence, 1973). Furthermore, these findings contribute to the literature on audit quality by demonstrating that auditors are more likely to issue adverse opinions to highly leveraged firms in the presence of fraud indicators.
Practically, these findings have critical implications for stakeholders such as regulators, auditors, and institutional investors. For auditors, fraud scores such as DF and PF offer actionable insights into client risk profiles and can be integrated into audit planning and sampling procedures. In industries with more fraud, auditors may work harder or firms may adopt stricter internal controls, resulting in higher DF-Scores. The strong association between DF-Score and adverse opinions implies that such binary models may serve as red flags during preliminary risk assessments. In contrast, the PF-Score may be better suited for continuous monitoring or for tiered audit attention, where firms are prioritized based on their risk bands. Regulators and enforcement agencies can also leverage these models to pre-emptively identify firms at risk of financial misreporting, allocating limited investigative resources more efficiently. Real-world applications are already evident: forensic tools based on similar scoring systems have been adopted by agencies like the U.S. Securities and Exchange Commission and the Public Company Accounting Oversight Board (PCAOB) to inform risk-based inspection programs. Investors, too, can use these scores in screening portfolios for potential governance risks.
In summary, the IV-2SLS analysis confirms that fraud risk, whether measured by a binary DF-Score or an ordinal PF-Score, significantly predicts the likelihood of receiving an adverse audit opinion, even after correcting for endogeneity. The instruments used, leave-one-out industry fraud and revenue growth, are statistically valid, and the results are consistent across multiple model specifications. The DF-Score shows a stronger marginal effect, while the PF-Score offers more nuanced fraud probability signals. The findings contribute to the theoretical understanding of audit decision-making and provide practical guidance for using machine-learning-derived fraud scores in real-world financial oversight, audit planning, and regulatory enforcement. Our results support Hypothesis H1, indicating that higher fraud risk scores are associated with the issuance of modified audit opinions.
In this analysis, fraud risk is not directly observed through labeled data but instead predicted using machine learning—specifically XGBoost classifiers trained via HO and CV methods. In Panel A of Table 8, both predicted DF-Score and PF-Score, without controlling for heterogeneity, show positive associations with adverse audit opinions, consistent with the baseline results, although the associations are only marginally significant (p < 0.10). The instrumented DF-Score is positively associated with the likelihood of receiving a negative audit opinion, with coefficients of 0.586 in the HO model and 0.560 in the CV model. The instrumented PF-Score is negatively associated with the likelihood of receiving a negative audit opinion, with coefficients of -0.019 in both HO and CV models. These results closely mirror those in Table 7. The consistency in magnitude and significance across the three models confirms that XGBoost predictions of fraud, even without labeled outcomes, are highly aligned with auditors’ judgments. Moreover, the instruments perform robustly across all versions: IndF remains a statistically strong predictor in the first stage.
Table 8. Main regression analysis using predicted fraud scores: binary DF-Score and ordinal PF-Score. Instrumental variable (first-stage) regression: $FRAUD_{it} = \pi_0 + \pi_1 IndF_{it} + \pi_3 GROWTH_{it} + u_{it}$. Main analysis: $AUO_{it} = \beta_0 + \beta_1 FRAUD_{it} + \beta_2 DEBT_{it} + \beta_3 SIZE_{it} + \beta_4 ASTD_{it} + \gamma_t + \delta_g + \varepsilon_{it}$.
When heterogeneity is accounted for in the prediction of DF-Score and PF-Score (Panel B), the results generally follow patterns similar to those observed in Panel A. However, for the DF-Score predicted with the holdout method, the relationship between fraud and audit opinions is not statistically significant, and the first-stage association between industry fraud and the fraud score is likewise insignificant. Apart from these exceptions, the results remain consistent with those obtained without controlling for heterogeneity.
Compared to the baseline regressions using labeled fraud data (Table 7), the machine-learning-based scores perform equivalently, if not slightly more robustly in some dimensions. In both DF and PF models, the use of XGBoost predictions generated through HO and CV yields coefficients and significance levels that are nearly identical to those based on observed fraud outcomes. This reinforces the notion that predictive models trained on accounting and financial features are capable of approximating real-world audit assessments with high fidelity. The minimal variation between HO and CV models further validates the stability and generalizability of the fraud scores across different model training approaches (Mullainathan & Spiess, 2017).
These findings have strong theoretical implications. From the lens of machine learning interpretability and audit economics, the results support the idea that fraud is a latent construct that can be probabilistically inferred using patterns in financial and governance data (Dechow et al., 2011b). The consistency of predictive scores with actual audit decisions lends support to theories of auditor rationality and efficiency in the presence of asymmetric information (Jensen & Meckling, 1976). Moreover, the fact that auditors appear to respond similarly to both actual and predicted fraud risks suggests that auditor judgments are aligned with quantifiable red flags derived from machine learning—validating a data-driven interpretation of auditor behavior (J. R. Francis, 2011).
Practically, the successful use of predicted fraud scores as endogenous variables suggests significant promise for real-world applications. Audit firms and regulators can adopt such models to prioritize engagements, flag potentially problematic firms, and allocate resources more effectively—even in the absence of confirmed fraud cases. For example, tools based on XGBoost fraud scores could be integrated into pre-audit risk assessments, reducing manual screening efforts and allowing auditors to concentrate on high-risk areas. Regulators such as the SEC and PCAOB could similarly leverage these scores to develop more predictive enforcement algorithms. Furthermore, CV ensures that these models remain effective across different contexts, improving their reliability for firms of varying size, sector, and geography. The minor differences in coefficients between HO and CV approaches also highlight the robustness of model training techniques and suggest that predictive fraud scores remain stable under different validation schemes.
In summary, the results of Table 8 demonstrate that XGBoost-predicted fraud scores, generated through both HO and CV strategies, produce instrumental variable estimates that are nearly indistinguishable from those obtained using actual labeled fraud data. Both DF and PF scores maintain strong statistical and economic significance in predicting adverse audit opinions. These findings substantiate the claim that fraud risk can be effectively measured through predictive modeling and used in econometric analysis, even without directly labeled outcomes. This opens new avenues for scalable, automated, and high-accuracy fraud detection systems that are theoretically grounded and practically implementable. Our results support Hypothesis H2, indicating that machine-learning-derived fraud indices constructed via XGBoost are significantly associated with audit opinions.

4.3. Robustness Tests

Following the baseline IV specification, we re-estimate the relationship between fraud indicators and audit opinions using a simultaneous equation framework in a generalized structural equation model (GSEM). The GSEM approach estimates the first- and second-stage equations jointly and relaxes some of the parametric and distributional constraints of the 2SLS estimator while preserving the structural relationship between fraud and audit outcomes. Because DF-Score is coded as a binary indicator, we employ GSEM with a binomial family and logit link. This specification models the probability of the fraud indicator being triggered while simultaneously accounting for endogeneity through the full system of equations. The GSEM estimates are consistent with the baseline findings from the main IV analysis, but the association between the fraud score and the audit opinion is estimated with stronger statistical significance. In addition, the results remain essentially unchanged whether heterogeneity controls are included or omitted, indicating that the estimated association is robust to alternative model specifications. The DF-Score estimated using the HO method is not statistically significant, consistent with the result reported in Table 8 Panel B.
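The GSEM robustness check is a jointly estimated system (e.g., Stata's gsem with a binomial family and logit link). A lightweight Python approximation is sketched below under clearly stated assumptions: it uses a control-function (two-stage residual inclusion) logit rather than the joint maximum-likelihood GSEM estimator, reuses the hypothetical panel `df` and column names from the earlier IV sketch, and omits year and industry dummies for brevity.

```python
import statsmodels.formula.api as smf

# Stage 1: binary fraud indicator on the instruments and controls (logit link).
first = smf.logit("FRAUD ~ IndF + GROWTH + DEBT + SIZE + ASTD", data=df).fit(disp=False)
df["cf_resid"] = df["FRAUD"] - first.predict(df)   # first-stage residual as control function

# Stage 2: audit opinion on fraud plus the control-function residual. A significant
# coefficient on cf_resid signals endogeneity; the FRAUD coefficient then gives the
# endogeneity-adjusted association with the audit opinion.
second = smf.logit("AUO ~ FRAUD + cf_resid + DEBT + SIZE + ASTD", data=df).fit(disp=False)
print(second.summary())
```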
In conclusion, Table 9 provides robust support for the main findings by demonstrating that both labeled and predicted fraud scores—across binary and ordinal formulations—consistently explain audit opinion modifications, even when addressing endogeneity through an instrumental variable framework. The negligible differences in coefficients and statistical strength across baseline and predicted models indicate high predictive validity of the machine learning-based fraud scores. These results not only reinforce the main conclusions derived from linear IV estimation but also highlight the added value of fraud risk analytics in auditing contexts.
Table 9. Robustness tests using generalized structural equation model (GSEM) estimation. Instrumental variable (first-stage) regression: $FRAUD_{it} = \pi_0 + \pi_1 IndF_{it} + \pi_3 GROWTH_{it} + u_{it}$. Main analysis: $AUO_{it} = \beta_0 + \beta_1 FRAUD_{it} + \beta_2 DEBT_{it} + \beta_3 SIZE_{it} + \beta_4 ASTD_{it} + \gamma_t + \delta_g + \varepsilon_{it}$. (A) DF-Score. (B) PF-Score.

5. Conclusions

The findings from the comprehensive empirical analyses, utilizing a variety of model specifications and validation strategies, provide robust evidence regarding the relationship between accounting fraud and auditor opinion issuance. The deployment of DF-Score and PF-Score as proxies for fraud risk—representing binary and ordinal classifications, respectively—facilitates a multifaceted understanding of how auditors respond to fraudulent financial reporting. Across all baseline models estimated using IV-2SLS and GSEM, the fraud variable remains statistically significant, regardless of whether it is derived from labeled data or predicted via machine learning algorithms. The magnitude and direction of the coefficients remain reasonably stable across estimation methods and validation samples, suggesting the robustness and external validity of these fraud risk measures. For example, the estimated effect of fraud on the likelihood of receiving a modified audit opinion remains significant across both HO and CV samples, and regardless of whether fraud is labeled or predicted.
These empirical findings are congruent with core principles of audit theory. According to the audit risk model (Arens et al., 2017, p. 309), auditors respond to increases in inherent or control risk by adjusting audit strategies, including the likelihood of issuing a qualified or adverse opinion. Moreover, signaling theory (Spence, 1973) posits that audit opinions serve as market signals of firm quality. Thus, when fraudulent activity is detected or even suspected, auditors have strong incentives to issue modified opinions to protect their reputational capital and maintain regulatory compliance. This interpretation aligns with DeFond and Zhang’s (2014) assertion that auditors function as economic agents whose professional skepticism is shaped by observable risk cues. Variables such as debt levels and industry-level fraud, which are statistically significant covariates in several model specifications, have long been associated with audit risk and misreporting behavior (Hennes et al., 2008).
The study contributes to several theoretical domains. From a methodological standpoint, it demonstrates that machine learning-generated fraud scores can be used effectively in econometric frameworks as either instrumental variables or explanatory constructs. This advances the literature on the integration of artificial intelligence into traditional accounting models, challenging critiques about the opacity or unreliability of algorithmic predictions in audit settings (Appelbaum et al., 2017). The predictive validity of the SHAP-based features supports this view. For instance, the models reveal that features such as sales growth, cost of goods sold, cash flow, and return on assets hold substantial predictive power, consistent with the fraud triangle. These findings affirm the conceptual grounding of machine learning models within fraud theory and auditing standards.
Practically, the implications of this study are substantial for audit firms, regulators, and investors. The use of machine learning to generate forward-looking fraud scores that predict auditor behavior could significantly enhance the risk assessment phase of audits. Incorporating predictive models could improve the timeliness and precision of audit interventions. Moreover, regulators could integrate algorithmic fraud scoring tools into enforcement algorithms, thereby enhancing their surveillance capabilities.
The study opens several avenues for future research. First, further exploration of model generalizability across different institutional environments could provide insights into how legal regimes, audit firm structures, or industry contexts moderate the fraud–opinion relationship. Second, extending the fraud prediction models to include unstructured data—such as textual disclosures, tone of earnings calls, or sustainability reports—could improve model granularity. Third, researchers could investigate whether auditors differentially respond to machine-predicted versus traditional red flags, shedding light on the evolving cognitive frameworks within audit judgment under increasing technological influence.
Despite the robustness of the findings, several limitations should be acknowledged. First, while the study employs advanced econometric techniques and validation strategies, the reliance on structured financial statement data may omit nuanced signals of fraud that are present in unstructured sources such as managerial communication, audit reports, or social media sentiment. Second, although the machine learning models demonstrate high predictive performance, they are trained and validated within a specific institutional and regulatory context. This potentially limits the generalizability of results across jurisdictions with different audit standards, enforcement intensity, or legal environments. Third, endogeneity concerns, while addressed through instrumental variable techniques, may not be fully eliminated, particularly given the complex and latent nature of fraud. Fourth, the study does not directly assess how auditors interpret or interact with machine-generated fraud scores. This leaves open questions about the practical integration of AI tools into audit decision-making processes. Future research should explore these areas to further refine the application and implications of AI-driven fraud detection in auditing practice. Fifth, this study derives fraud indicators solely from the XGBoost model. While XGBoost is a powerful algorithm, relying exclusively on a single modeling approach may limit the generalizability of the results, and different machine learning models might capture distinct patterns of fraud risk. Future research could explore multiple modeling approaches to assess the robustness and consistency of fraud predictions across alternative algorithms. Finally, data quality varies across countries, and internal governance variables are incomplete for some markets. This may reduce the precision of the estimated fraud risk measures, and the predictive performance and generalizability of the results could be affected in these contexts. Future research could incorporate additional governance indicators or alternative data sources to improve cross-country comparability.

Author Contributions

Conceptualization, P.B. and N.O.; methodology, P.B.; software, P.B.; validation, P.B. and N.O.; formal analysis, P.B.; investigation, P.B. and N.O.; resources, N.O.; data curation, P.B.; writing—original draft preparation, P.B.; writing—review and editing, P.B. and N.O.; visualization, P.B. and N.O.; supervision, P.B.; project administration, P.B.; funding acquisition, N.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Actual continuous DF-Score and ordinal PF-Score.
DF-Score = RSST_ACC + CH_REC + CH_INV + SOFT_ASSETS + CH_CS + CH_CM + CH_ROA + CH_FCF
PF-Score = F_ROA + F_AROA + F_CFO + F_ACCRUAL + F_AMARGIN + F_ATURN + F_ALEVER + F_ALIQUID + EQ_OFFER
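For reference, a minimal pandas sketch of these composites is shown below. It assumes a DataFrame `df` that already contains the individual components (with each F_* item coded 0/1 as in Piotroski, 2000) and simply sums them into the continuous DF-Score and the ordinal PF-Score.

```python
import pandas as pd

# Component lists follow the appendix definitions above.
df_components = ["RSST_ACC", "CH_REC", "CH_INV", "SOFT_ASSETS",
                 "CH_CS", "CH_CM", "CH_ROA", "CH_FCF"]
pf_components = ["F_ROA", "F_AROA", "F_CFO", "F_ACCRUAL", "F_AMARGIN",
                 "F_ATURN", "F_ALEVER", "F_ALIQUID", "EQ_OFFER"]

df["DF_SCORE"] = df[df_components].sum(axis=1)   # continuous composite
df["PF_SCORE"] = df[pf_components].sum(axis=1)   # ordinal score (0-9)
```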
Country | Mean (DF, PF) | Median (DF, PF) | SD (DF, PF) | Min (DF, PF) | Max (DF, PF) | Skewness (DF, PF) | Kurtosis (DF, PF) | N
AUS−0.784.91−0.48513.821.88−119.21123.869−7.180.1863.832.32121
BGR0.015.65−0.4862.51.79−3.25122.296.17−0.3251.072.82178
BMU−0.665.06−0.3753.111.87−42.2904.819−10.65−0.12128.872.59722
BRA−0.15.56−0.1661.381.85−4.04112.4495.13−0.3343.632.44449
CHE7.275.73−0.266128.141.79−33.1702142.85916.62−0.38277.533.11360
CHL−0.575.43−0.52611.86−4.1217.6193.19−0.3931.052.74250
CHN0.755.44−0.33652.171.96−301.5602813.27946.64−0.252352.562.475023
CYM−0.315.2−0.3152.231.94−36.62039.969−1.47−0.15211.812.361575
DEU−0.375.73−0.2860.771.66−4.4104.149−0.25−0.287.462.79936
ESP−0.355.71−0.1561.241.88−11.2115.519−4.46−0.349.932.64180
FIN−0.345.66−0.2360.781.87−7.4410.89−4.99−0.1844.872.38243
FRA−0.025.8−0.1160.821.66−3.6514.1391.9−0.2912.172.64620
GBR−0.025.7−0.0861.781.82−9.81126.2697.8−0.24109.582.41739
HKG−0.315.53−0.4160.81.64−3.0214.692.1−0.213.42.76338
IDN−23.965.56−0.456735.811.9−23,022.99034.769−31.24−0.25976.992.441261
ISR−0.245.68−0.2360.761.9−3.17010.5195.07−0.3767.122.65854
JPN−0.485.78−0.4960.721.77−3.1415.2892.17−0.320.532.72504
KOR−0.345.45−0.4761.31.94−5.66028.41914.03−0.3278.852.563337
MEX−0.365.91−0.3960.551.7−3.412.5490.08−0.3510.932.48252
MYS−0.425.62−0.561.291.86−5.4022.59910.07−0.15160.192.4936
NOR0.315.47−0.1964.431.77−5.5139.3597.43−0.2360.972.36183
PAK−0.15.46−0.565.291.89−2.35165.63912.19−0.22151.952.63208
PER−0.735.91−0.6661.51.87−14.8517.599−3.81−0.3746.842.44259
PHL0.895.5−0.32617.461.82−3.470294.01915.71−0.22259.152.55405
POL−0.415.56−0.2467.991.89−143.14065.269−13.13−0.28271.772.46511
ROU−5.025.37−0.42669.541.74−1017.3615.189−14.52−0.31211.92.6374
RUS−0.545.77−0.4660.751.92−4.8901.149−0.95−0.47.712.61266
SAU−0.695.26−0.6550.982.03−6.3107.791.880.0224.872.15414
SGP−0.45.35−0.3752.481.88−27.54042.4397.44−0.25202.362.74666
SWE0.155.570.0667.621.86−23.990185.45923.11−0.25564.82.5814
THA−0.685.54−0.7161.072−5.01112.0194.6−0.2251.082.4732
TUR0.115.53−0.1162.961.75−6.69051.69912.72−0.11202.532.49683
TWN−0.235.51−0.45610.121.91−26.430679.75960.98−0.284034.552.556525
VNM−0.235.66−0.6664.541.91−4.31164.97912.35−0.26173.732.57323
Australia (AUS), Bulgaria (BGR), Bermuda (BMU), Brazil (BRA), Switzerland (CHE), Chile (CHL), China (CHN), Cayman Islands (CYM), Germany (DEU), Spain (ESP), Finland (FIN), France (FRA), United Kingdom (GBR), Hong Kong (HKG), Indonesia (IDN), Israel (ISR), Japan (JPN), South Korea (KOR), Mexico (MEX), Malaysia (MYS), Norway (NOR), Pakistan (PAK), Peru (PER), Philippines (PHL), Poland (POL), Romania (ROU), Russia (RUS), Saudi Arabia (SAU), Singapore (SGP), Sweden (SWE), Thailand (THA), Turkey (TUR), Taiwan (TWN), and Vietnam (VNM).

References

  1. Agostino, D., Lourenço, R., Jorge, S., Bracci, E., & Cruz, I. (2025). Data science and public sector accounting: Reviewing impacts on reporting, auditing, and accountability practices. Public Money & Management. Available online: https://www.tandfonline.com/doi/full/10.1080/09540962.2025.2529266 (accessed on 12 October 2025).
  2. Ali, A. A., Khedr, A. M., El-Bannany, M., & Kanakkayil, S. (2023). A powerful predicting model for financial statement fraud based on optimized XGBoost ensemble learning technique. Applied Sciences, 13(4), 2272. [Google Scholar] [CrossRef]
  3. Alkaraan, F., Albahloul, M., Abdoush, T., Elmarzouk, M., & Gulko, N. (2024). Big Four ‘rhetorical’ strategies: Carillion’s collapse. Accounting and Management Information Systems, 23(2), 295–316. [Google Scholar] [CrossRef]
  4. Al Natour, A. R., Al-Mawali, H., Zaidan, H., & Said, Y. H. Z. (2023). The role of forensic accounting skills in fraud detection and the moderating effect of CAATTs application: Evidence from Egypt. Journal of Financial Reporting and Accounting, 23(1), 30–55. [Google Scholar] [CrossRef]
  5. Appelbaum, D., Kogan, A., & Vasarhelyi, M. A. (2017). Big data and analytics in the modern audit engagement: Research needs. Auditing: A Journal of Practice & Theory, 36(4), 1–27. [Google Scholar] [CrossRef]
  6. Arens, A. A., Elder, R. J., & Beasley, M. S. (2017). Auditing and assurance services: An integrated approach (16th ed.). Pearson. [Google Scholar]
  7. Beerbaum, D. (2021). The future of audit after the Wirecard accounting scandal–proposal for a change in the payment model. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3934773 (accessed on 12 October 2025).
  8. Beneish, M. D. (1999). The detection of earnings manipulation. Financial Analysts Journal, 55(5), 24–36. [Google Scholar] [CrossRef]
  9. Brazel, J. F., Jones, K. L., & Lian, Q. (2023). Auditor use of benchmarks to assess fraud risk: The case for industry data. Journal of Forensic Accounting Research, 9(1), 23–57. [Google Scholar] [CrossRef]
  10. Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1–3. [Google Scholar] [CrossRef]
  11. Carcello, J. V., & Neal, T. L. (2000). Audit committee composition and auditor reporting. The Accounting Review, 75(4), 453–467. [Google Scholar] [CrossRef]
  12. Chen, H., Chen, J. Z., Lobo, G. J., & Wang, Y. (2011). Effects of audit quality on earnings management and cost of equity capital: Evidence from China. Contemporary Accounting Research, 28(3), 892–925. [Google Scholar] [CrossRef]
  13. Chen, T., & Guestrin, C. (2016, August 13–17). XGBoost: A scalable tree boosting system. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA. [Google Scholar] [CrossRef]
  14. Chicco, D., & Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21, 6. [Google Scholar] [CrossRef] [PubMed]
  15. Chicco, D., Warrens, M. J., & Jurman, G. (2021). The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Computer Science, 7, e623. [Google Scholar] [CrossRef]
  16. Coates, J. C. (2007). The goals and promise of the Sarbanes-Oxley Act. Journal of Economic Perspectives, 21(1), 91–116. [Google Scholar] [CrossRef]
  17. Costa, L. M. (2018). Corruption and corporate social responsibility codes of conduct: The case of Petrobras and the oil and gas sector in Brazil. Rule of Law and Anti-Corruption Center Journal, 2018(1), 6. [Google Scholar] [CrossRef]
  18. Cressey, D. R. (1953). Other people’s money: A study in the social psychology of embezzlement. Free Press. [Google Scholar]
  19. DeAngelo, L. E. (1981). Auditor size and audit quality. Journal of Accounting and Economics, 3(3), 183–199. [Google Scholar] [CrossRef]
  20. Dechow, P. M., Ge, W., Larson, C. R., & Sloan, R. G. (2011a). Predicting material accounting misstatements. Contemporary Accounting Research, 28(1), 17–82. [Google Scholar] [CrossRef]
  21. Dechow, P. M., Ge, W., & Schrand, C. (2011b). Understanding earnings quality: A review of the proxies, their determinants and their consequences. Journal of Accounting and Economics, 50(2–3), 344–401. [Google Scholar] [CrossRef]
  22. DeFond, M. L., & Zhang, J. (2014). A review of archival auditing research. Journal of Accounting and Economics, 58(2–3), 275–326. [Google Scholar] [CrossRef]
  23. Diamond, D. W. (1985). Optimal release of information by firms. Journal of Finance, 40(4), 1071–1094. [Google Scholar] [CrossRef]
  24. Dulgeridis, M., Schubart, C., & Dulgeridis, S. (2025). Harnessing AI for accounting integrity: Innovations in fraud detection and prevention (No. 4 (July 2025)). IU Discussion Papers-Business & Management. Available online: https://repository.iu.org/items/f43bd461-a387-48f4-a397-16d8b64ef34b (accessed on 12 October 2025).
  25. Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874. [Google Scholar] [CrossRef]
  26. Francis, J., & Krishnan, J. (1999). Accounting accruals and auditor reporting conservatism. Contemporary Accounting Research, 16(1), 135–165. [Google Scholar] [CrossRef]
  27. Francis, J. R. (2011). A framework for understanding and researching audit quality. Auditing: A Journal of Practice & Theory, 30(2), 125–152. [Google Scholar] [CrossRef]
  28. Garcia-Blandon, J., Argilés, J., & Ravenda, D. (2020). Audit firm tenure and audit quality: A cross-European study. Journal of International Financial Management & Accounting, 31(1), 35–64. [Google Scholar] [CrossRef]
  29. He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284. Available online: https://ieeexplore.ieee.org/document/5128907. [CrossRef]
  30. Healy, P. M., & Wahlen, J. M. (1999). A review of the earnings management literature and its implications for standard setting. Accounting Horizons, 13(4), 365–383. [Google Scholar] [CrossRef]
  31. Hennes, K. M., Leone, A. J., & Miller, B. P. (2008). The importance of distinguishing errors from irregularities in restatement research: The case of restatements and CEO/CFO turnover. The Accounting Review, 83(6), 1487–1519. [Google Scholar] [CrossRef]
  32. Jensen, M. C., & Meckling, W. H. (1976). Theory of the firm: Managerial behavior, agency costs and ownership structure. Journal of Financial Economics, 3(4), 305–360. [Google Scholar] [CrossRef]
  33. Jolley, M. (2025). Biggest accounting scandals of 2024. Transparently. Available online: https://www.transparently.ai/blog/biggest-accounting-scandals-2024 (accessed on 30 September 2025).
  34. Junaidi, J., Miharjo, S., & Hartadi, B. (2012). Does auditor tenure reduce audit quality? Gadjah Mada International Journal of Business, 14(3), 303–315. [Google Scholar] [CrossRef]
  35. Kersting, L., Kim, J.-C., Mazumder, S., & Su, Q. (2024). Unveiling the brew: Probing the lingering impact of the Luckin coffee scandal on the liquidity of Chinese cross-listed stocks. Journal of Risk and Financial Management, 17(11), 514. [Google Scholar] [CrossRef]
  36. Khurana, I. K., & Raman, K. K. (2004). Litigation risk and the financial reporting credibility of Big 4 versus non-Big 4 audits: Evidence from Anglo-American countries. The Accounting Review, 79(2), 473–495. [Google Scholar] [CrossRef]
  37. Kohavi, R. (1995, August 20–25). A study of cross-validation and bootstrap for accuracy estimation and model selection. 14th International Joint Conference on Artificial Intelligence, Montreal, QC, Canada. [Google Scholar]
  38. Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. Springer. [Google Scholar] [CrossRef]
  39. Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. [Google Scholar] [CrossRef]
  40. Liu, Y. (2023). Design of XGBoost prediction model for financial operation fraud of listed companies. International Journal of System Assurance Engineering and Management, 14, 2354–2364. [Google Scholar] [CrossRef]
  41. Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems (pp. 4768–4777). Curran Associates Inc. [Google Scholar]
  42. Mallela, I. R., Kankanampati, P. T., Tangudu, A., Goel, O., Gopalakrishna, P., & Jain, A. (2024). Machine learning applications in fraud detection for financial institutions. Darpan International Research Analysis, 12(3), 711–743. [Google Scholar] [CrossRef]
  43. Messier, W. F., Glover, S. M., & Prawitt, D. F. (2014). Auditing & assurance services: A systematic approach (9th ed.). McGraw-Hill Education. [Google Scholar]
  44. Mukhidinov, A. N., Karimova, M. B., Kavitha, V. O., & Shermatov, A. O. U. (2025). Advanced AI algorithms in accounting: Redefining accuracy and speed in financial auditing. AIP Conference Proceedings, 3306(1), 050008. [Google Scholar] [CrossRef]
  45. Mullainathan, S., & Spiess, J. (2017). Machine learning: An applied econometric approach. Journal of Economic Perspectives, 31(2), 87–106. [Google Scholar] [CrossRef]
  46. Nasfi Snoussi, S., Nasfi Salem, F., & Boulila Taktak, N. (2025). Impact of pressures on the detection of financial statement fraud risk. Corporate Ownership & Control, 22(2), 34–40. [Google Scholar] [CrossRef]
  47. Nemati, Z., Mohammadi, A., Bayat, A., & Mirzaei, A. (2025). Fraud prediction in financial statements through comparative analysis of data mining methods. International Journal of Finance & Managerial Accounting, 10(38), 151–166. [Google Scholar]
  48. Ngai, E. W. T., Hu, Y., Wong, Y. H., Chen, Y., & Sun, X. (2011). The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems, 50(3), 559–569. [Google Scholar] [CrossRef]
  49. Nguyen Thanh, C., & Phan Huy, T. (2025). Predicting financial reports fraud by machine learning: The proxy of auditor opinions. Cogent Business & Management, 12(1), 2510556. [Google Scholar] [CrossRef]
  50. Nti, K., & Somanathan, A. R. (2024). A scalable RF-XGBoost framework for financial fraud mitigation. IEEE Transactions on Computational Social Systems, 11(2), 1556–1563. [Google Scholar] [CrossRef]
  51. Piotroski, J. D. (2000). Value investing: The use of historical financial statement information to separate winners from losers. Journal of Accounting Research, 38, 1–41. [Google Scholar] [CrossRef]
  52. Prabha, M., Sharmin, S., Khatoon, R., Imran, M. A. U., & Mohammad, N. (2024). Combating banking fraud with IT: Integrating machine learning and data analytics. The American Journal of Management and Economics Innovations, 6(07), 39–56. [Google Scholar] [CrossRef]
  53. Provost, F., & Fawcett, T. (2013). Data science for business: What you need to know about data mining and data-analytic thinking. O’Reilly Media. [Google Scholar]
  54. Qatawneh, A. M. (2024). The role of artificial intelligence in auditing and fraud detection in accounting information systems: Moderating role of natural language processing. International Journal of Organizational Analysis. Advance online publication. [Google Scholar] [CrossRef]
  55. Reynolds, J. K., & Francis, J. R. (2001). Does size matter? The influence of large clients on office-level auditor reporting decisions. Journal of Accounting and Economics, 30(3), 375–400. [Google Scholar] [CrossRef]
  56. Spence, M. (1973). Job market signaling. The Quarterly Journal of Economics, 87(3), 355–374. [Google Scholar] [CrossRef]
  57. Stock, J. H., & Yogo, M. (2005). Testing for weak instruments in linear IV regression. In D. W. K. Andrews, & J. H. Stock (Eds.), Identification and inference for econometric models: Essays in honor of Thomas Rothenberg (pp. 80–108). Cambridge University Press. [Google Scholar] [CrossRef]
  58. Suyono, W. P., Puspa, E. S., Anugrah, S., & Firnanda, R. (2025). Redefining fraud detection: The synergy between auditor competency and AI-powered audit analytics. RIGGS: Journal of Artificial Intelligence and Digital Business, 4(3), 953–960. [Google Scholar] [CrossRef]
  59. Tayebi, M., & El Kafhali, S. (2025). A novel approach based on XGBoost classifier and Bayesian optimization for credit card fraud detection. Cyber Security and Applications, 3, 100093. [Google Scholar] [CrossRef]
  60. Teichmann, F., Boticiu, S., & Sergi, B. (2023). Wirecard scandal: A commentary on the biggest accounting fraud in Germany’s post-war history. Journal of Financial Crime, 31(5), 1166–1173. [Google Scholar] [CrossRef]
  61. Tümmler, M., & Quick, R. (2025). How to detect fraud in an audit: A systematic review of experimental literature. Management Review Quarterly. Advance online publication. [Google Scholar] [CrossRef]
  62. Wang, Y., Chiu, T., & Vasarhelyi, M. A. (2025). Financial statement fraud prediction system: A deep learning-based approach. Journal of Forensic Accounting Research. Available online: https://publications.aaahq.org/jfar/article-abstract/doi/10.2308/JFAR-2024-003/13888/Financial-Statement-Fraud-Prediction-System-A-Deep (accessed on 12 October 2025).
  63. Watts, R. L., & Zimmerman, J. L. (1986). Positive accounting theory. Prentice-Hall. [Google Scholar]
  64. West, J., & Bhattacharya, M. (2016). Intelligent financial fraud detection: A comprehensive review. Computers & Security, 57, 47–66. [Google Scholar] [CrossRef]
  65. Wolfe, D. T., & Hermanson, D. R. (2004). The fraud diamond: Considering the four elements of fraud. CPA Journal, 74(12), 38–42. Available online: https://digitalcommons.kennesaw.edu/cgi/viewcontent.cgi?article=2546&context=facpubs (accessed on 12 October 2025).
  66. Yousefi Nejad, M., Sarwar Khan, A., & Othman, J. (2024). A panel data analysis of the effect of audit quality on financial statement fraud. Asian Journal of Accounting Research, 9(4), 422–445. [Google Scholar] [CrossRef]
  67. Yuan, T., Zhang, X., & Chen, X. (2025). Machine learning based enterprise financial audit framework and high risk identification. arXiv. Available online: https://arxiv.org/abs/2507.06266 (accessed on 1 October 2025).
  68. Zhu, K., Yang, X., Zhang, Y., Liang, M., & Wu, J. (2024). A heterogeneity-aware car-following model: Based on the XGBoost method. Algorithms, 17(2), 68. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
