4.2. Fraud Index Constructions
As shown in
Table 4 Panel A, the DF-Score’s predicted classifications (binary outputs) reveal a very low mean (0.03) across both HO and CV, with a median of 0. This indicates that the model classifies only a small proportion of firms as fraudulent. Such output is consistent with high skewness (>5) and extreme kurtosis (>29), suggesting a heavily right-skewed distribution with a long tail—typical of rare-event modeling such as fraud detection.
In contrast, the PF-Score classification shows greater variation and a more balanced distribution. For instance, under the HO method, the PF-Score has a mean of 5.51 and a median of 6 (on a scale from 0 to 9), along with relatively low skewness (–0.25) and kurtosis (2.48), indicating a more symmetric and approximately normal distribution. The CV-based PF-Score reflects a slightly higher mean and median, while maintaining similar distributional characteristics. This suggests that the PF-Score, as a multi-level ordinal classification, captures a more nuanced gradation of fraud risk compared to binary classifications.
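For concreteness, the distributional summaries of this kind can be reproduced with a few lines of R. The sketch below is illustrative only: the data frame pred and its columns df_class and pf_class are assumed names, not objects from the study's code, and e1071 reports excess kurtosis by default, so the convention should be matched to the one used in the table.

```r
# A minimal sketch of the Table 4-style summaries, assuming a data frame `pred`
# with hypothetical columns `df_class` (binary DF-Score classification) and
# `pf_class` (ordinal PF-Score classification); names are illustrative only.
library(e1071)  # skewness() and kurtosis()

describe_score <- function(x) {
  c(mean     = mean(x, na.rm = TRUE),
    median   = median(x, na.rm = TRUE),
    skewness = skewness(x, na.rm = TRUE),
    # e1071 returns excess kurtosis; add 3 if the Pearson convention is wanted
    kurtosis = kurtosis(x, na.rm = TRUE))
}

rbind(DF_Score = describe_score(pred$df_class),
      PF_Score = describe_score(pred$pf_class))
```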
Given the cross-country heterogeneity in the data, the fraud prediction model controls for country, industry, and year effects using one-hot encoding. A comparison of the predicted results without heterogeneity handling in Panel A and the heterogeneity-handled results in Panel B shows that both produce qualitatively similar patterns, indicating that country-level heterogeneity has only a slight impact on the overall fraud analysis in this dataset.
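As a hedged illustration of this heterogeneity handling, the R sketch below expands country, industry, and year into indicator columns with model.matrix() before training; the data frame firms and the vector feature_cols are assumed names rather than the study's actual objects.

```r
# One-hot (dummy) encoding of the heterogeneity controls, assuming a data
# frame `firms` with columns `country`, `industry`, and `year`, plus
# `feature_cols`, an assumed vector of financial predictor names.
dummies <- model.matrix(~ country + industry + factor(year) - 1, data = firms)

# Append the indicators to the financial features fed to the XGBoost model.
x_train <- cbind(as.matrix(firms[, feature_cols]), dummies)
```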
When comparing these classification-based indices to the actual (raw) scores of each model, notable discrepancies are evident. The distributional properties of the classification-based indices suggest relatively well-behaved, bounded scores that offer a more intuitive interpretation and are potentially more useful for ranking or categorizing firms. However, it should be noted that transforming raw data into categorical groupings inevitably entails a loss of information, even though such transformations may enhance certain statistical properties of the variables.
Overall, the results suggest that while the DF-Score provides value for identifying financial misstatements, its extreme distributional properties in both predicted and actual forms may hinder its practical utility. Conversely, the PF-Score, with its balanced and stable characteristics, emerges as a more suitable index for applied audit or forensic analysis settings, especially where ordinal assessments of fraud risk are preferred. Additionally, the consistency of results between HO and CV methods across all models supports the reliability of the XGBoost classification framework, though careful consideration of threshold selection remains essential for ensuring interpretability and accuracy. Thus, to obtain an optimal threshold for imbalanced data, this study employs an optimization approach that maximizes the F1 score to improve the accuracy of binary outcome predictions.
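A minimal sketch of this F1-maximizing threshold search is given below, assuming a vector prob of predicted fraud probabilities and a 0/1 vector actual of validation labels; both names are illustrative, and the grid resolution is arbitrary.

```r
# Grid search for the probability cutoff that maximizes the F1 score on
# imbalanced validation data; `prob` and `actual` are assumed objects.
f1_at <- function(threshold, prob, actual) {
  pred <- as.integer(prob >= threshold)
  tp <- sum(pred == 1 & actual == 1)
  fp <- sum(pred == 1 & actual == 0)
  fn <- sum(pred == 0 & actual == 1)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  if (is.nan(precision) || is.nan(recall) || (precision + recall) == 0) return(0)
  2 * precision * recall / (precision + recall)
}

grid <- seq(0.01, 0.99, by = 0.01)
optimal_threshold <- grid[which.max(sapply(grid, f1_at, prob = prob, actual = actual))]
```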
In evaluating the performance of classification models applied to accounting and financial datasets, a wide range of statistical metrics are employed to capture different dimensions of predictive accuracy and reliability, as presented in
Table 5. Among these, accuracy remains one of the most commonly reported measures, indicating the proportion of total correct classifications over all observations. However, accuracy can be misleading in the presence of class imbalance—a frequent issue in accounting domains such as fraud detection or financial restatements—where predicting the majority class can produce deceptively high scores (
Provost & Fawcett, 2013). For example, in Panel A, a model may achieve 99% accuracy by always predicting “non-fraud,” even while missing all actual fraud cases.
To mitigate such limitations, balanced accuracy is used to average the recall across all classes, ensuring that minority class performance is equally weighted. This metric is especially valuable in contexts where the positive class—such as fraudulent firms—is rare but highly consequential (
He & Garcia, 2009). Balanced accuracy helps ensure that model evaluation does not disproportionately reward correct predictions of the dominant class, thus better reflecting real-world performance. Complementing this, Cohen’s Kappa statistic quantifies the agreement between predicted and actual labels while adjusting for chance agreement. In classification problems involving audit outcomes or corporate governance red flags, Kappa provides a more conservative and statistically grounded assessment of model reliability (
Landis & Koch, 1977).
Another widely used metric, logarithmic loss (log loss), evaluates the accuracy of probabilistic predictions, penalizing confident but incorrect classifications more severely. This is crucial in accounting applications involving predictive risk scoring, where the quality of probability estimates matters as much as the final classification. A lower log loss implies better-calibrated probabilities, which are essential when models are used to guide decisions such as audit allocations or enforcement investigations (
Brier, 1950). In tandem with log loss, precision (macro-averaged) captures the proportion of true positives among all predicted positives, highlighting the model’s ability to avoid false alarms. This is particularly important in high-stakes accounting environments, where wrongly accusing a firm of fraud can have severe reputational and regulatory consequences.
Equally important is recall (macro-averaged), which measures the proportion of actual positives that are correctly identified. In practical terms, recall answers the question: “Of all the firms that manipulated earnings, how many did the model detect?” A high recall is critical when failing to detect a fraudulent firm could lead to investor losses or audit failures. The F1 score (macro-averaged) synthesizes both precision and recall into a single metric by computing their harmonic mean, offering a balanced view of model effectiveness. Lastly, AUC-ROC quantifies the model’s discriminative ability across all thresholds. An AUC-ROC close to 1.0 indicates that the model effectively ranks positive cases (e.g., fraudulent firms) above negatives, regardless of the specific decision boundary. This is vital for threshold-independent evaluation and is widely used in regulatory and financial surveillance contexts (
Fawcett, 2006). Together, these metrics provide a comprehensive framework for evaluating classification models in accounting and financial research. Each metric highlights a different facet of model performance—ranging from overall accuracy to risk of misclassification—thereby enabling more robust, nuanced interpretations of predictive power.
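To make these definitions concrete, the R sketch below computes the metrics for the binary DF-Score task, assuming actual (0/1 labels), prob (predicted probabilities), and the optimal_threshold obtained above; macro-averaged versions for the multi-class PF-Score would average the same per-class quantities. All object names are assumptions rather than the study's code.

```r
library(caret)  # confusionMatrix(): accuracy, balanced accuracy, Kappa, precision, recall, F1
library(pROC)   # roc()/auc(): threshold-independent discrimination

pred_class <- factor(as.integer(prob >= optimal_threshold), levels = c(0, 1))
cm <- confusionMatrix(pred_class, factor(actual, levels = c(0, 1)),
                      positive = "1", mode = "everything")

accuracy          <- cm$overall["Accuracy"]
kappa             <- cm$overall["Kappa"]
balanced_accuracy <- cm$byClass["Balanced Accuracy"]
precision         <- cm$byClass["Precision"]
recall            <- cm$byClass["Recall"]
f1                <- cm$byClass["F1"]

# Log loss penalizes confident but incorrect probability estimates.
p <- pmin(pmax(prob, 1e-15), 1 - 1e-15)
log_loss <- -mean(actual * log(p) + (1 - actual) * log(1 - p))

# AUC-ROC summarizes how well fraud cases are ranked above non-fraud cases.
auc_roc <- as.numeric(auc(roc(actual, prob, quiet = TRUE)))
```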
The classification performance metrics for both HO and CV methods across scoring models—DF-Score and PF-Score—show strong results for certain tasks, with some notable weaknesses. In the HO approach, accuracy scores are exceptionally high for DF-Score, ranging between 0.995 and 0.996, indicating excellent predictive alignment with actual class labels. However, for PF-Score, the accuracy drops significantly to 0.552, suggesting the model struggles to predict that particular target. Balanced accuracy, which adjusts for class imbalance by averaging recall across classes, tells a more reliable story: DF-Score maintains a high value above 0.964, while PF-Score again lags at 0.751. Other performance metrics—such as precision, recall, F1-score, and AUC-ROC—reinforce these trends, with macro-averaged F1-scores above 0.96 for DF-Score, while PF-Score records a much lower 0.563.
When comparing the HO and CV results, the patterns remain generally consistent, although CV introduces greater model scrutiny. Accuracy and balanced accuracy for DF-Score stay robust under CV, even improving slightly in some metrics such as precision and Kappa, which reflect agreement between predicted and true labels beyond chance. However, PF-Score underperforms in CV, with accuracy dropping further to 0.223 and Kappa falling to 0.094, suggesting poor class prediction stability. This discrepancy shows how CV uncovers model vulnerabilities that the HO method may obscure (
Kohavi, 1995). Different learning rates (ETA) are selected across the models: both methods settle on ETA values of 0.1 or 0.3 depending on the task, indicating that XGBoost finds different trade-offs between convergence speed and generalization. In classification tasks, a lower learning rate such as 0.1 often improves generalization by allowing more precise updates but may require more boosting rounds. The use of multiple learning rates across tasks reflects the adaptive nature of the tuning process rather than inconsistency: Caret evaluates every candidate parameter combination across the resampling folds and retains the full grid of results rather than a single minimum, selecting for each task the configuration with the best average performance.
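The tuning procedure described here can be sketched as follows; the grid values are illustrative rather than the study's exact settings, and x_train and y_train (a factor outcome) are assumed objects.

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 5)  # k-fold resampling

# Candidate hyperparameters; caret's xgbTree method requires all seven.
grid <- expand.grid(nrounds          = c(100, 300),
                    max_depth        = c(3, 6),
                    eta              = c(0.1, 0.3),  # the two learning rates discussed above
                    gamma            = 0,
                    colsample_bytree = 1,
                    min_child_weight = 1,
                    subsample        = 0.8)

fit <- train(x = x_train, y = y_train, method = "xgbTree",
             trControl = ctrl, tuneGrid = grid, metric = "Kappa")

fit$results   # resampled performance for every combination, not just one minimum
fit$bestTune  # the configuration with the best average cross-validated performance
```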
In addition, baseline results (Panel A) and heterogeneity-handled results (Panel B) show qualitatively similar patterns, suggesting minimal impact of country-level heterogeneity on the overall analysis.
When classifying firms into high- or low-fraud groups using the optimized classification thresholds of the DF-Score, the threshold values decrease from 0.31 (HO) and 0.39 (CV) to 0.19 (HO) and 0.21 (CV), respectively, after controlling for country, industry, and year heterogeneity. This indicates that incorporating heterogeneity controls makes the model more sensitive in distinguishing between high- and low-fraud observations. The lower thresholds suggest that the model requires less evidence (a lower predicted probability) to assign a firm to the high-fraud group, likely because the encoded structure absorbs much of the variation that previously inflated prediction uncertainty.
Theoretically, these findings reinforce the importance of using appropriate evaluation metrics in classification, especially under class imbalance. While high accuracy is appealing, it can be misleading in skewed datasets where models simply favor the dominant class (
Chicco & Jurman, 2020). Balanced accuracy, macro-averaged precision, recall, and F1-score offer a more reliable assessment of true performance, particularly in real-world financial or accounting datasets that often include unbalanced outcomes. The sharp drop in metrics for PF-Score across both validation methods underscores the limitations of relying solely on overall accuracy and points to potential deficiencies in feature representation or data quality for this target variable. From a model training perspective, the use of different ETA values also reflects the necessity of flexible hyperparameter tuning depending on the complexity and signal-to-noise ratio of the target variable.
In practical terms, the high and consistent performance of models trained on DF-Score suggests this target is well-suited for classification with XGBoost under both HO and CV. Financial institutions or accounting firms using such models can expect stable generalization performance across unseen data, particularly if proper CV protocols are implemented. However, the poor results for PF-Score highlight the need for additional data preparation techniques—such as synthetic sampling, feature transformation, or even alternative modeling frameworks—to better capture the structure of that variable. Practitioners must be cautious of misleading holdout results that appear strong, as CV clearly revealed performance degradation in PF-Score that holdout could not. The multiple ETA values further imply that tuning hyperparameters on a task-by-task basis is essential to obtaining optimal results, rather than relying on a fixed learning rate across all scenarios.
In conclusion, while both HO and CV methods produced high-performing classification models for certain score types, CV provided a more rigorous and realistic assessment of model generalizability. This is consistent with
Y. Wang et al. (
2025), who show that deep learning-based accounting fraud prediction can achieve remarkably high prediction accuracy. The consistent underperformance of PF-Score, despite decent holdout results, reinforces the theoretical consensus that CV is more trustworthy in performance validation (
Kuhn & Johnson, 2013). The use of multiple learning rates is not a flaw but an advantage of adaptive model selection, where hyperparameter configurations are flexibly chosen based on validation feedback. Future research should investigate why certain score types like PF-Score are resistant to classification under current feature sets and whether feature engineering or model stacking can help. Ultimately, this analysis affirms the critical role of robust validation and metric interpretation in machine learning applications within financial and accounting domains.
In the context of XGBoost with HO classification as presented in
Table 6 Panel A1, models are trained on a subset of the data and evaluated on a separate, unseen holdout set. This setup closely simulates real-world deployment and provides an unbiased estimate of model performance. However, it also means that traditional feature importance metrics, typically derived during model training (e.g., gain, cover, or frequency of feature usage in tree splits), may not reflect how features behave on unseen data. In contrast, SHAP values computed on the holdout set provide a post hoc, individualized measure of each feature’s contribution to predictions, making them directly applicable to out-of-sample inference.
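The contrast between the two views can be illustrated with a short R sketch for the binary DF-Score model, assuming a fitted xgboost booster bst and a holdout matrix x_holdout with named columns; all object names are assumptions rather than the study's code.

```r
library(xgboost)

# Training-based importance: gain, cover, and split frequency from the trees.
importance <- xgb.importance(model = bst)

# SHAP values on the holdout set: per-observation, per-feature contributions
# (predcontrib = TRUE appends a trailing BIAS column, which is dropped here).
shap <- predict(bst, newdata = x_holdout, predcontrib = TRUE)
shap <- shap[, -ncol(shap), drop = FALSE]
colnames(shap) <- colnames(x_holdout)

# Mean absolute SHAP per feature: the out-of-sample ranking discussed below.
mean_abs_shap <- sort(colMeans(abs(shap)), decreasing = TRUE)

head(importance)     # in-sample, split-based ranking
head(mean_abs_shap)  # holdout, SHAP-based ranking
```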
For example, in the DF-Score model, features like CH_CS, CH_CM, and SOFT_ASSETS maintain high rankings in both importance and SHAP values, indicating stable, generalizable predictors of fraud risk across both training and test data. However, discrepancies—such as CH_INV and CH_REC showing near-zero importance yet non-negligible SHAP contributions—suggest that these features, while not dominant in training splits, influence predictions in certain contexts on the holdout set. This highlights a key limitation of relying solely on training-based feature importance: it can underrepresent features that are predictive only in specific sub-populations or in interaction with other variables—issues that SHAP is designed to uncover.
The PF-Score results show more variation: although features like F_AROA, F_CFO, and F_ALEVER are highly ranked by both metrics, the precise ordering and relative impact differ. Notably, F_CFO emerges as the top SHAP contributor on the holdout set (0.629), even though it is ranked third by traditional importance. This suggests that F_CFO plays a particularly critical role in the model’s generalization to unseen data, possibly due to its sensitivity to underlying cash flow anomalies that are not captured fully during training splits. Moreover, features like F_ALIQUID, which are assigned zero traditional importance, still receive meaningful SHAP values, again reinforcing the idea that SHAP captures context-specific predictive power that tree-splitting metrics may overlook.
In Panel A2, the results from the XGBoost CV classification models for DF-Score and PF-Score demonstrate both convergences and discrepancies between traditional feature importance measures and SHAP values, highlighting important considerations for accounting research and practice. CV enhances model robustness by repeatedly training and testing the model on multiple data folds, ensuring that performance metrics and feature relevance are not overly dependent on any single partition of the data. Unlike a single HO evaluation, CV better approximates model generalizability by averaging across diverse subsets, which is especially critical in accounting contexts where data distributions can vary significantly across firms and time periods. However, this setup also introduces subtle complexities in interpreting feature importance metrics versus SHAP values.
In the DF-Score model, key features such as CH_CS, CH_CM, and SOFT_ASSETS consistently rank highest in both importance and SHAP values, underscoring their robust predictive power in detecting financial manipulation. Nonetheless, the shift in feature rankings for variables like RSST_ACC and CH_FCF between importance and SHAP metrics suggests that while these features contribute strongly during tree construction, their influence on out-of-sample predictions is somewhat moderated, reflecting real-world complexities. Furthermore, features like CH_REC and CH_INV have low importance scores but maintain non-trivial SHAP values, indicating their predictive utility in specific cases or subsets of the data—an insight that traditional importance measures may miss. This divergence underscores the technical limitation of relying solely on gain-based importance for interpretability in models involving heterogeneous firm behaviors and interactions among accounting variables.
In the PF-Score model, a more pronounced discrepancy is evident. While F_AROA holds the highest feature importance, F_CFO emerges as the most influential predictor based on SHAP values. This suggests that F_CFO may be particularly sensitive to cash flow-related anomalies that affect model predictions on the held-out folds but may be underweighted during training splits due to complex interactions with other features. The fact that features such as F_ALIQUID and F_AMARGIN receive meaningful SHAP values despite lower or zero importance rankings reinforces the interpretive power of SHAP for uncovering subtle but practically relevant predictive relationships that could otherwise be obscured.
The results in Panel B, which incorporate heterogeneity handling, follow the same overall pattern as those shown in Panel A. However, for the PF-Score, the SHAP values indicate a slightly different ordering among the top three features, reflecting minor shifts in their relative contributions compared with the traditional importance rankings.
Theoretically, the combined insights from HO and CV classification reinforce the critical need to move beyond traditional feature importance metrics when interpreting complex machine learning models in accounting research. Both approaches demonstrate that SHAP values, by quantifying the marginal impact of each feature on individual predictions within out-of-sample contexts, provide a richer and more precise understanding of how models function in real-world scenarios. While HO validation directly simulates future forecasting on unseen data, offering a clear snapshot of model generalizability, CV extends this by averaging performance and feature effects across multiple data partitions, thereby enhancing the robustness and reliability of inference. This dual perspective emphasizes the value of post-hoc interpretability tools like SHAP in validating and interpreting machine learning outputs, especially in accounting environments characterized by heterogeneous data and complex, nonlinear interactions among financial variables.
Practically, the results indicate that sales growth, cost of goods sold, current assets, cash flow, and return on assets are key indicators of financial fraud, consistently highlighting potential risk. The role of revenue growth, in particular, is supported by prior research (
Brazel et al., 2023). Auditors can leverage these insights to focus investigations on accounts with unusual movements, prioritize testing where anomalies are most pronounced, and track patterns over time. Recognizing that feature importance can be context-dependent encourages the development of more nuanced, adaptive risk assessment frameworks that better capture the complexities inherent in financial reporting and manipulation.
In sum, blending the theoretical rigor of HO and CV frameworks with SHAP-based interpretability advances the frontier of accounting analytics. It provides a comprehensive and defensible foundation for employing machine learning in high-stakes financial decision-making, fostering models that are not only predictive but also transparent, generalizable, and practically actionable.
As presented in
Table 7, this study examines the relationship between fraud indicators and audit opinions (AUO) using IV-2SLS, addressing endogeneity concerns by employing leave-one-out industry fraud (IndF) and revenue growth (GROWTH) as instruments. IndF captures industry-level fraud tendencies while excluding the firm itself. The first-stage regression results for the DF-Score model (DF-Score > 1 indicating elevated misstatement risk) indicate that IndF is a significant predictor of DF-Score (coefficient = 0.085, t = 2.69), whereas revenue growth is statistically insignificant (coefficient = 0.000, t = 1.37). For the PF-Score model, both IndF (0.224, t = 8.73) and GROWTH (0.000, t = 2.19) are significant predictors. These results suggest that firms in industries with a higher prevalence of fraud tend to have higher fraud scores. Instrument relevance is supported by the first-stage statistics: the F test of excluded instruments (4.44 for DF-Score, 40.81 for PF-Score), the Sanderson-Windmeijer multivariate F-statistic (4.44, 40.81), and the Kleibergen-Paap rk LM statistic (5.45, 22.42) all indicate that the instruments are relevant and the model is not underidentified. Additionally, the Cragg-Donald Wald F-statistics (44.56, 162.47) exceed the Stock-Yogo critical values for various maximal IV sizes, further confirming instrument strength (
Stock & Yogo, 2005).
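A hedged sketch of this identification strategy in R is shown below. The data frame panel and its columns (df_score, auo, industry, growth, and the controls size and debt) are illustrative stand-ins for the study's variables, and AER::ivreg() is used merely as one standard 2SLS implementation; the cluster-robust Kleibergen-Paap and Hansen J statistics reported in the text come from robust estimators not reproduced here.

```r
library(dplyr)
library(AER)  # ivreg() for two-stage least squares

# Leave-one-out industry fraud: the industry mean of the fraud score excluding
# the focal firm, so the instrument does not mechanically include the firm.
panel <- panel %>%
  group_by(industry) %>%
  mutate(ind_f = (sum(df_score) - df_score) / (n() - 1)) %>%
  ungroup()

# 2SLS: df_score instrumented by ind_f and growth; exogenous controls appear
# on both sides of the vertical bar.
iv_fit <- ivreg(auo ~ df_score + size + debt | ind_f + growth + size + debt,
                data = panel)
summary(iv_fit, diagnostics = TRUE)  # weak-instrument F, Wu-Hausman, Sargan tests
```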
In the second-stage estimation for the binary DF-Score, the positive sign of the coefficient is as expected. The results show that the fraud indicator is a marginally significant predictor of audit opinions (coefficient = 5.63). This implies that firms with a higher likelihood of financial misstatements, as captured by the DF-Score, are more prone to receive modified audit reports. The insignificant Hansen J statistic (0.602) confirms the validity of the overidentifying restrictions, affirming that the instruments used are exogenous. These findings align with prior literature asserting that auditors are more likely to issue modified audit opinions when indicators of earnings manipulation are present (
Carcello & Neal, 2000).
For PF-Score, the fraud score is negatively associated with audit opinion (coefficient = −0.019), meaning firms with higher fraud risk (lower PF scores) are more likely to receive qualified or adverse audit opinions. The marginally significant Hansen J statistic (3.655,
p < 0.10) in the PF model suggests a potential risk of overidentification, although overall model validity is supported by robust Wald and Anderson-Rubin statistics. This aligns with past findings that audit decisions are sensitive not only to the presence but also to the severity of fraud signals (
Yousefi Nejad et al., 2024).
Comparatively, both DF and PF scores exhibit marginally statistically significant predictive power in explaining audit opinion outcomes, reinforcing the empirical link between fraud detection models and auditor judgments. However, the magnitude of the effect is stronger in the DF model, likely because binary fraud classifications present a clearer red flag to auditors than more continuous or probabilistic indicators like the PF score. The PF-Score’s granularity allows it to capture subtler differences in fraud probability, but its influence on audit outcomes appears diluted compared to the more definitive DF binary classification. The positive and significant coefficient on DEBT suggests that auditors may view higher leverage as a signal of potential fraud, independent of firm size, highlighting that financial structure can influence audit assessments even when other firm characteristics are controlled, as noted by
Nasfi Snoussi et al. (
2025).
From a theoretical standpoint, the results support foundational concepts in agency theory, where information asymmetry between managers and external stakeholders leads to opportunistic behavior such as earnings manipulation, which auditors are expected to mitigate (
Jensen & Meckling, 1976). The significant role of fraud scores in shaping audit opinions is also consistent with signaling theory: an adverse audit opinion serves as a market signal of deteriorating financial reporting quality (
Spence, 1973). Furthermore, these findings contribute to the literature on audit quality by demonstrating that highly leveraged firms are more likely to receive adverse opinions in the presence of fraud indicators.
Practically, these findings have critical implications for stakeholders such as regulators, auditors, and institutional investors. For auditors, fraud scores such as DF and PF offer actionable insights into client risk profiles and can be integrated into audit planning and sampling procedures. In industries with more fraud, auditors may work harder or firms may adopt stricter internal controls, resulting in higher DF-Scores. The strong association between DF-Score and adverse opinions implies that such binary models may serve as red flags during preliminary risk assessments. In contrast, the PF-Score may be better suited for continuous monitoring or for tiered audit attention, where firms are prioritized based on their risk bands. Regulators and enforcement agencies can also leverage these models to pre-emptively identify firms at risk of financial misreporting, allocating limited investigative resources more efficiently. Real-world applications are already evident: forensic tools based on similar scoring systems have been adopted by agencies like the U.S. Securities and Exchange Commission and the Public Company Accounting Oversight Board (PCAOB) to inform risk-based inspection programs. Investors, too, can use these scores in screening portfolios for potential governance risks.
In summary, the IV-2SLS analysis confirms that fraud risk, whether measured by a binary DF-Score or an ordinal PF-Score, significantly predicts the likelihood of receiving an adverse audit opinion, even after correcting for endogeneity. The instruments used—industry fraud and revenue growth—are statistically valid, and the results are consistent across multiple model specifications. The DF-Score shows a stronger marginal effect, while the PF-Score offers more nuanced fraud probability signals. The findings contribute to the theoretical understanding of audit decision-making and provide practical guidance for using machine-learning-derived fraud scores in real-world financial oversight, audit planning, and regulatory enforcement. Our results support Hypothesis H1, indicating an association between higher fraud risk scores and the issuance of modified audit opinions.
In this analysis, fraud risk is not directly observed through labeled data but instead predicted using machine learning—specifically XGBoost classifiers trained via HO and CV methods. In Panel A of
Table 8, both the predicted DF-Score and the predicted PF-Score, without controlling for heterogeneity, are associated with adverse audit opinions in the expected directions, consistent with the baseline results, although the associations are only marginally significant (
p < 0.10). The instrumented DF-Score is positively associated with the likelihood of receiving a negative audit opinion, with coefficients of 0.586 in the HO model and 0.560 in the CV model. The instrumented PF-Score is negatively associated with the likelihood of receiving a negative audit opinion, with coefficients of −0.019 in both the HO and CV models. These results closely mirror those in
Table 7. The consistency in magnitude and significance across the three models confirms that XGBoost predictions of fraud, even without labeled outcomes, are highly aligned with auditors’ judgments. Moreover, the instruments perform robustly across all versions: IndF remains a statistically strong predictor in the first stage.
When heterogeneity is accounted for in the prediction of the DF-Score and PF-Score (Panel B), the results generally follow patterns similar to those observed in Panel A. However, in the DF-Score prediction using the holdout method, the relationship between fraud and audit opinions is not statistically significant. Likewise, the association between industry fraud and the fraud score is insignificant. Apart from these exceptions, the results remain consistent with those obtained without controlling for heterogeneity.
Compared to the baseline regressions using labeled fraud data (
Table 7), the machine-learning-based scores perform equivalently, if not slightly more robustly in some dimensions. In both DF and PF models, the use of XGBoost predictions generated through HO and CV offers coefficients and statistical significances that are nearly identical to the ones based on observed fraud outcomes. This reinforces the notion that predictive models trained on accounting and financial features are capable of approximating real-world audit assessments with high fidelity. The minimal variation between HO and CV models further validates the stability and generalizability of the fraud scores across different model training approaches (
Mullainathan & Spiess, 2017).
These findings have strong theoretical implications. From the lens of machine learning interpretability and audit economics, the results support the idea that fraud is a latent construct that can be probabilistically inferred using patterns in financial and governance data (
Dechow et al., 2011b). The consistency of predictive scores with actual audit decisions lends support to theories of auditor rationality and efficiency in the presence of asymmetric information (
Jensen & Meckling, 1976). Moreover, the fact that auditors appear to respond similarly to both actual and predicted fraud risks suggests that auditor judgments are aligned with quantifiable red flags derived from machine learning—validating a data-driven interpretation of auditor behavior (
J. R. Francis, 2011).
Practically, the successful use of predicted fraud scores as endogenous variables suggests significant promise for real-world applications. Audit firms and regulators can adopt such models to prioritize engagements, flag potentially problematic firms, and allocate resources more effectively—even in the absence of confirmed fraud cases. For example, tools based on XGBoost fraud scores could be integrated into pre-audit risk assessments, reducing manual screening efforts and allowing auditors to concentrate on high-risk areas. Regulators such as the SEC and PCAOB could similarly leverage these scores to develop more predictive enforcement algorithms. Furthermore, CV ensures that these models remain effective across different contexts, improving their reliability for firms of varying size, sector, and geography. The minor differences in coefficients between HO and CV approaches also highlight the robustness of model training techniques and suggest that predictive fraud scores remain stable under different validation schemes.
In summary, the results of
Table 8 demonstrate that XGBoost-predicted fraud scores, generated through both HO and CV strategies, produce instrumental variable estimates that are nearly indistinguishable from those obtained using actual labeled fraud data. Both DF and PF scores maintain strong statistical and economic significance in predicting adverse audit opinions. These findings substantiate the claim that fraud risk can be effectively measured through predictive modeling and used in econometric analysis, even without directly labeled outcomes. This opens new avenues for scalable, automated, and high-accuracy fraud detection systems that are theoretically grounded and practically implementable. Our results support Hypothesis H2, indicating that machine learning-derived fraud indices constructed via XGBoost are significantly associated with audit opinions.