Article

Using Machine Learning to Detect Financial Statement Fraud: A Cross-Country Analysis Applied to Wirecard AG

1 Accounting Department, Frankfurt School of Finance & Management, Adickesallee 32-34, 60322 Frankfurt am Main, Germany
2 DZ BANK AG, Platz der Republik, 60325 Frankfurt am Main, Germany
* Author to whom correspondence should be addressed.
J. Risk Financial Manag. 2025, 18(11), 605; https://doi.org/10.3390/jrfm18110605
Submission received: 18 September 2025 / Revised: 19 October 2025 / Accepted: 21 October 2025 / Published: 28 October 2025

Abstract

This study analyzes the ability of machine-learning algorithms to detect financial statement fraud using four financial ratios as inputs: the Altman Z-Score, Beneish M-Score, Montier C-Score, and Dechow F-Score. It also evaluates whether the Wirecard AG scandal of 2020 could have been detected by the model developed in this study. Financial statement data were obtained from the financial data vendor Bloomberg L.P. The dataset consists of 2,014,827 firm years between 1988 and 2019 from companies across the globe, of which 1145 firm years were identified as fraudulent. A balanced dataset of 1046 fraudulent firm years and 1046 randomly selected firm years was used to train and evaluate multiple machine-learning algorithms via an automated pipeline search. The selected model is an ensemble combining gradient boosting and k-nearest neighbors. On the held-out test set, it correctly classified 82.03% of the manipulated and 89.88% of the non-manipulated firm years, with an overall accuracy of 85.69%. Applied retrospectively to Wirecard AG, the model identified 7 of 17 firm years as fraudulent.

1. Introduction

Financial statement fraud (FSF) has emerged as a significant threat to corporate stakeholders. Once financial statement fraud becomes public, investors, employees, customers, auditors, and local authorities can experience substantial financial losses as well as reputational harm. Recent cases, such as Wirecard AG (Granville, 2020) and Luckin Coffee Inc. (Gray, 2020), have demonstrated the severe consequences. Data from the United States alone illustrates the financial damage associated with financial statement fraud.
Table 1 shows the 20 largest bankruptcies in U.S. history by assets at the time of bankruptcy filing. Of these cases, six were associated with financial statement fraud. Taken together, they account for USD 985.38 billion in total assets, underscoring the enormous financial consequences.
Figure 1 visualizes the annual number of Accounting and Auditing Enforcement Releases (AAERs) over time, published by the United States Securities and Exchange Commission (SEC) since 1982 (U.S. Securities and Exchange Commission, n.d.). The decreasing number of reported cases over time might suggest a reduced relevance of FSF in today’s corporate environment.
However, this impression is misleading. Numerous scandals have come to light in recent years as well. The spike in incidents between 2000 and 2009 is believed to be due to the increasing scale and complexity of the finance sector and the associated power imbalances between regulators and the regulated (Toms, 2019, p. 492). Many of the cases during that period occurred in the banking and finance industry, which was much less regulated compared to the time after the global financial crisis of 2009 (Toms, 2019, p. 492). Despite enhanced reporting standards and enforcement, FSF persists because incentives to inflate performance remain, transaction structures are increasingly complex, and the digitalization of business models simplifies both the creation and concealment of accounting irregularities, increasing the need for AI-based risk screening. The consequences extend beyond equity losses to credit markets, labor, and suppliers. Given the substantial financial and reputational damage, scalable screening tools for early detection are needed alongside traditional audits and investigations.
Many common fraud methods leave identifiable footprints in accounting ratios that ML models can learn to detect: overstated revenue and receivables (e.g., sales growth decoupled from cash, and rising days-sales-outstanding), earnings management via accruals (high accruals relative to assets or cash flows), abnormal inventory growth indicating inventory manipulation, expense capitalization (increase in capital expenditures with stable operations), or liquidity distortions (anomalies in debt ratios and coverage ratios that are not consistent with profitability). These patterns are the basis for indicators like the Beneish M-Score or the Dechow F-Score and motivate an ML approach based on financial ratios.
Several approaches have been developed in the previous literature attempting to detect FSF. The detection of errors in financial statements was first conducted in the accounting and auditing literature (Hylas & Ashton, 1982). The first approaches in this field mainly focused on the detection using financial ratios from financial statements (Beneish, 1999; Dechow et al., 1995; Lev & Thiagarajan, 1993; Persons, 1995). Other literature has explored detection through textual analysis in verbal and written form (Glancy & Yadav, 2011; Goel et al., 2010; Hoberg & Lewis, 2017; Purda & Skillicorn, 2015). Lately, a third area has increasingly been researched: detecting FSF by analyzing large datasets by applying machine learning (ML) (Bertomeu et al., 2019; Kirkos et al., 2007; Ngai et al., 2011; Sharma & Kumar Panigrahi, 2012; West & Bhattacharya, 2016). This approach can be combined with financial ratios or textual analysis.
The recent work in FSF detection research has seen a trend toward ML models, especially tree-based ensembles like random forests and gradient boosting, which often outperform traditional statistical approaches like logistic regression (Ali et al., 2023; Bertomeu et al., 2019; Y. Chen & Wu, 2023). A key challenge is the extreme class imbalance in fraud data, so studies increasingly employ techniques such as oversampling or cost-sensitive learning to rebalance and avoid bias toward the non-fraud class (Cheah et al., 2023; Cheng et al., 2021; L. Huang et al., 2022). There is also a growing emphasis on interpretability: modern fraud detectors often use explainable AI (permutation feature importance, SHapley Additive exPlanations (SHAP), and knowledge graph explanations) to identify the financial indicators or risk factors driving predictions (W. Kim & Kim, 2025; Z. Zhang et al., 2025; Zhu et al., 2025). Furthermore, recent approaches extend the use of numerical financial ratios with textual signals, like linguistic cues from MD&A sections, audit opinions, or news in multimodal models, to boost detection accuracy (W. Liu et al., 2025; Wang et al., 2023; Z. Zhang et al., 2022).
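As an illustration of the rebalancing techniques mentioned above, the following minimal Python sketch draws a class-balanced sample by random undersampling of the majority (non-fraud) class, mirroring the balanced 1046/1046 design used in this study. The function and variable names are illustrative and not taken from the study's code.

```python
import random

def balance_by_undersampling(fraud_rows, nonfraud_rows, seed=42):
    """Build a balanced sample by randomly undersampling both classes
    to the size of the smaller (minority) class."""
    rng = random.Random(seed)
    n = min(len(fraud_rows), len(nonfraud_rows))
    sample = [(row, 1) for row in rng.sample(fraud_rows, n)] + \
             [(row, 0) for row in rng.sample(nonfraud_rows, n)]
    rng.shuffle(sample)  # interleave classes before any train/test split
    return sample
```

Oversampling the minority class (e.g., SMOTE) or cost-sensitive loss weights are common alternatives when discarding majority-class observations would waste too much data.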
This paper examines whether an ML classifier based on cross-country financial ratios can provide interpretable risk signals for detecting FSF. The goal is not to replicate a single indicator, but to learn from components of multiple indicators on a large scale and correctly predict fraudulent financial statements outside the training data. This screening perspective is relevant for stakeholders such as banks, auditors, regulators, and investors, who can use the model to prioritize firms for more in-depth examination.
The study advances FSF detection by assembling a cross-country dataset of financial ratios beyond a single jurisdiction, by combining four scores from the previous literature (the Altman Z-Score, Beneish M-Score, Montier C-Score, and Dechow F-Score) in a novel approach, and by providing a case study that applies the scores individually, as well as the developed ML model, to Wirecard AG. The specific hypotheses derived from the literature, concerning the predictive capability of ML algorithms for FSF detection and the Wirecard case study, are stated in Section 2.4, Hypothesis Development.

2. Literature Review

  • Foundations in FSF Research
The research on financial statement fraud is grounded in criminological–accounting frameworks that explain why managers commit misreporting and which conditions enable it. The classic lineage starts with the “fraud triangle” (incentive, opportunity, and rationalization), which suggests that fraud is likely to occur when (1) someone has an incentive to commit fraud, (2) weak oversight provides an opportunity to commit fraud, and (3) the person can rationalize the fraudulent behavior (Cressey, 1953; Schuchter & Levi, 2016). This concept later evolved into the “fraud diamond” (adding capability), which emphasizes that the perpetrator’s personal traits, abilities, and position also matter for successful manipulation (Wolfe & Hermanson, 2004). A recent extension, the “fraud hexagon”, further incorporates collusion and ego/arrogance, recognizing coordinated schemes and executive overconfidence as amplifiers of FSF risk (Vousinas, 2019). Empirical work applying the hexagon to fraudulent reporting partly supports the significance of the six factors (Achmad et al., 2023; Nugroho & Diyanty, 2022; Sukmadilaga et al., 2022).
Conceptually, these frameworks can be mapped to observable FSF footprints: pressures and ego often manifest in aggressive growth narratives, opportunity relates to weak controls and complex transaction structures, and rationalization and capability support continuous manipulation. These mechanisms imply detectable patterns in the ratio dynamics (e.g., receivables, revenue, accrual intensity, inventory changes, and liquidity) that motivate the feature space used for FSF detection models.
  • Foundations in ML Research
From an information systems perspective, an ML-based FSF detector is an artifact whose value depends on how well its predictions fit the risk-screening task. The Task-Technology Fit (TTF) theory hypothesizes that performance gains arise when technology aligns with information processing needs (Goodhue & Thompson, 1995). In this study's setting, calibrated risk scores and data preparation steps, such as class imbalance and outlier handling, support the screening workflow for FSF. According to Design Science, an information systems artifact should be purposefully designed and rigorously evaluated for utility (Hevner et al., 2004, pp. 82–83). For FSF, that motivates clear data partitions (training, testing, and validation) at the firm-year level, an optimized pipeline search and testing, and the permutation importance of the predictor variables.
From a FinTech perspective, ML-based FSF screening reflects the ongoing digitization and analytics-driven transformation of the financial industry: data- and algorithm-driven systems now reshape risk management and supervision (Gomber et al., 2017). Studies in the field emphasize auditability and explainability when analytics inform oversight decisions. Models must generate traceable outputs that regulators can review (Appelbaum, 2016; Appelbaum et al., 2017; C. Zhang et al., 2022).
By bringing the strands together, fraud drivers (incentive, opportunity, rationalization, capability, collusion, and ego) can be translated into measurable financial ratio signals. An ML model, evaluated under information systems principles, can transform those signals into calibrated and explainable risk probabilities that can support screening for FSF risk.
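The permutation importance mentioned above can be computed without any ML library: shuffle one feature column at a time and record the resulting drop in accuracy. A minimal, dependency-free sketch follows; the function and parameter names are illustrative.

```python
import random

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when a single feature column is shuffled,
    estimated over n_repeats random permutations per feature."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(baseline - accuracy(X_perm))
        importances.append(sum(drops) / n_repeats)
    return importances
```

For fitted scikit-learn estimators, sklearn.inspection.permutation_importance provides an equivalent, more featureful implementation.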

2.1. Definition of Terminology

In accounting research, various terms are used interchangeably with financial statement fraud to describe related, though not always identical, types of misconduct (Amiram et al., 2018, p. 733), for instance, financial accounting fraud, financial reporting misconduct, financial misreporting, earnings management, earnings misstatement, anomalies in financial statements, or management fraud. To ensure consistent and unambiguous terminology, this paper adopts the following terms from Albizri et al. (2019) as their work is a recent meta-study of FSF literature:
  • Accrual Accounting Practices
“Accrual or GAAP accounting uses accrual, deferral, and allocation procedures, many of which are based on judgments and estimates made by management, to relate revenues, expenses, gains, and losses to periods that actually reflect a firm’s performance of these transactions.” (Albizri et al., 2019, p. 208). Thus, accrual accounting practices cover all accounting practices aimed to show the most accurate and unaltered information to the best of one’s knowledge and conscience.
  • Earnings Management
Healy and Wahlen define earnings management as “when managers use judgment in financial reporting and in structuring transactions to alter financial reports to either mislead some stakeholders about the underlying economic performance of the company or to influence contractual outcomes that depend on reported accounting numbers.” (Healy & Wahlen, 1999, p. 368). In other words, earnings management involves aggressive accounting practices whereby managers deliberately alter financial reporting, for example, through one-sided asset estimates, to achieve earnings targets. Notably, earnings management refers only to activities within the legal scope of the respective accounting standard. Nevertheless, earnings management is commonly used as an indicator of FSF; see Section 2.2. Prior research documents a positive correlation between the two (Perols & Lougee, 2011).
  • Financial Statement Fraud
The term financial statement fraud refers to aggressive accounting practices that exceed the legal scope of accounting standards and intentionally alter the message of financial statements and a company’s outlook. Rezaee states that “financial statement fraud is a deliberate attempt by corporations to deceive or mislead users of published financial statements, especially investors and creditors, by preparing and disseminating materially misstated financial statements.” (Rezaee, 2005, p. 279). This paper seeks to detect this specific form of misconduct.
Figure 2 illustrates the relationships among the three terms.

2.2. Detecting FSF Through Financial Ratios

To identify FSF, accounting literature was reviewed for metrics indicative of FSF. The results of this analysis are four scores: the Altman Z-Score (Altman, 1968), the Beneish M-Score (Beneish, 1999), the Montier C-Score (Montier, 2008), and the Dechow F-Score (Dechow et al., 2011). These four scores were selected because they are computed from financial statement data. This makes them suitable for machine-learning algorithms, as they rely solely on publicly available financial statement data. Moreover, composite scores may provide stronger predictive power for FSF than raw numbers from financial statements.

2.2.1. Altman Z-Score (1968)

The Altman Z-Score (Altman, 1968), developed by Edward I. Altman in 1968, is a multiple discriminant analysis (MDA) combining five financial ratios to predict financial distress. The Z-Score was originally designed to predict bankruptcy. However, since it is a robust indicator of financial health, it can also be used as a predictor of FSF. Kirkos et al. found that companies classified as financially distressed by the Z-Score are more likely to manipulate their financial statements (Kirkos et al., 2007, p. 1000). Due to this correlation, this study uses the Altman Z-Score. It is calculated as follows:
Z = 1.2·X1 + 1.4·X2 + 3.3·X3 + 0.6·X4 + 1.0·X5
where:
X1 = Working capital / Total assets
X2 = Retained earnings / Total assets
X3 = Earnings before interest and taxes (EBIT) / Total assets
X4 = Market value of equity / Book value of total debt
X5 = Sales / Total assets
  • X1—Liquidity
This ratio indicates a company’s ability to pay its short-term liabilities. Working capital is defined as the difference between a company’s current assets and current liabilities. A positive working capital indicates that current assets cover current liabilities; thus, there are likely no liquidity problems. A negative working capital indicates potential liquidity problems, as the firm may need to liquidate its long-term assets to pay back its short-term liabilities. X1 is normalized, i.e., working capital is divided by total assets, yielding values from −1 to +1 and making the ratio comparable across companies of different sizes. Normalization is also applied to most of the following ratios.
  • X2—Profitability
This ratio measures the cumulative profitability over time and earning power. Firms with high retained earnings, and, thus, a higher X2, signify a profitable history. This depends on firm age: young companies are more likely to have low retained earnings compared to older ones. This may disadvantage young businesses, which are more likely to be classified as bankrupt. However, Altman argues that this ratio is a particularly good indicator of financial distress since young businesses often go bankrupt within their first few years of existence (Altman, 1968, p. 595). This is still valid today. According to the U.S. Bureau of Labor Statistics (BLS) (U.S. Bureau of Labor Statistics, 2020), on average, from 1994 to 2019, 21.32% of U.S. private sector establishments failed within their first year, 40.17% within three years, and 51.18% within five years of their establishment. In the European Union, according to data from Eurostat (2017), 16.75% of all new businesses failed within their first year, 42.04% within three years, and 56.14% within five years of their establishment.
  • X3—Productivity
This ratio measures how efficiently a company can generate earnings from its core operations without considering the costs of the capital structure and tax expenses. Since the EBIT ratio excludes interest and taxes, this ratio is comparable across jurisdictions. A positive X3 value indicates that the company is making efficient use of its assets to generate an operating income. Altman identifies this ratio as particularly informative for detecting corporate failure, as a firm’s existence ultimately rests on its earning power (Altman, 1968, p. 595).
  • X4—Solvency
This ratio puts a company’s market capitalization in relation to its total debt. What Altman calls the market value of equity, he defines as “the combined market value of all shares of stock, preferred and common” (Altman, 1968, p. 595), i.e., the market capitalization. Any X4 value greater than 1 implies that assets exceed liabilities and is therefore a positive signal for solvency. X4 values below 1 indicate that liabilities may exceed assets; thus, insolvency is likely. Since firms can have assets besides the capital received through issued stock, X4 values below 1 do not necessarily imply bankruptcy. A ratio where this would be the case, i.e., net worth / total debt, could have been applied, but Altman found that using the market capitalization results in a more effective bankruptcy predictor (Altman, 1968, p. 595). This likely reflects the stock market serving as an estimator of a firm’s worth, where price changes can anticipate upcoming problems (Anjum, 2012, p. 214).
  • X5—Efficiency
This capital–turnover ratio measures a firm’s efficiency in generating sales with its assets. The higher the X5 value is, the more sales can be generated with one dollar of assets. A company that needs few assets to generate a lot of sales indicates a stable future and, thus, a low distress risk.
  • Interpretation
Each ratio is multiplied by a predetermined weight, and the results are summed to obtain the Z-Score. The resulting value can be used to assess financial stability, i.e., the probability of bankruptcy within two years (Maccarthy, 2017, p. 161). As illustrated in Table 2, Z-Score values greater than 2.67 indicate financial soundness, while Z-Score values below 1.81 indicate financial distress. Values between 1.81 and 2.67 are classified in the gray area or zone of ignorance.
  • Benefits and Limitations
The Altman Z-Score has shown a high accuracy, with correct bankruptcy predictions of about 95% (Altman et al., 1998, p. 394; Altman et al., 2013, p. 2). However, the original model was obtained from 66 publicly traded manufacturing firms and was not intended to be used for privately held or non-manufacturing firms. Private firms lack market capitalization data, which is required to compute X4. Since this paper’s data contains only public firms, market capitalization is observable, and X4 can be computed. Furthermore, non-manufacturing firms differ in capital intensity. X5 is likely to be higher, for example, for merchandising and service companies than for manufacturers, since these industries are typically less capital-intensive (Eidleman, 1995, p. 52). Consequently, non-manufacturing firms may exhibit a higher capital turnover and higher Z-Scores than manufacturers.
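As a worked example, the Z-Score computation described above can be sketched in a few lines of Python. The input names are illustrative, and the classification cutoffs follow Table 2.

```python
def altman_z_score(working_capital, retained_earnings, ebit,
                   market_value_equity, sales, total_assets, total_debt):
    """Altman (1968) Z-Score from the five ratios X1..X5 defined above."""
    x1 = working_capital / total_assets      # liquidity
    x2 = retained_earnings / total_assets    # profitability
    x3 = ebit / total_assets                 # productivity
    x4 = market_value_equity / total_debt    # solvency
    x5 = sales / total_assets                # efficiency
    return 1.2 * x1 + 1.4 * x2 + 3.3 * x3 + 0.6 * x4 + 1.0 * x5

def classify_z(z):
    """Zones per Table 2: sound above 2.67, distressed below 1.81."""
    if z > 2.67:
        return "financially sound"
    if z < 1.81:
        return "financial distress"
    return "gray area"
```

For example, a hypothetical firm with working capital of 100, retained earnings of 200, EBIT of 150, a market capitalization of 300, sales of 1000, total assets of 1000, and total debt of 500 obtains Z = 2.255 and falls into the gray area.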

2.2.2. Beneish M-Score (1999)

The Beneish M-Score (Beneish, 1999), developed by Messod Daniel Beneish in 1999, is a statistical model combining eight financial ratios to distinguish manipulators from non-manipulators. All indices, except TATA, are computed from financial data reported in two consecutive years. Therefore, each index and the M-Score itself reflect changes from year t−1 to year t. The indices are designed so that higher values indicate a greater probability of FSF, whereas lower values indicate a smaller probability. It is calculated as follows:
M = −4.84 + 0.92·DSRI + 0.528·GMI + 0.404·AQI + 0.892·SGI + 0.115·DEPI − 0.172·SGAI − 0.327·LVGI + 4.679·TATA
where:
DSRI = (Receivables_t / Sales_t) / (Receivables_t−1 / Sales_t−1)
GMI = [(Sales_t−1 − Cost of Goods Sold_t−1) / Sales_t−1] / [(Sales_t − Cost of Goods Sold_t) / Sales_t]
AQI = [1 − (Current Assets_t + PPE_t) / Total Assets_t] / [1 − (Current Assets_t−1 + PPE_t−1) / Total Assets_t−1]
SGI = Sales_t / Sales_t−1
DEPI = [Depreciation_t−1 / (Depreciation_t−1 + PPE_t−1)] / [Depreciation_t / (Depreciation_t + PPE_t)]
SGAI = (SGA Expenses_t / Sales_t) / (SGA Expenses_t−1 / Sales_t−1)
LVGI = [(LTD_t + Current Liabilities_t) / Total Assets_t] / [(LTD_t−1 + Current Liabilities_t−1) / Total Assets_t−1]
TATA = (Income Before Extraordinary Items_t − Operating Cash Flow_t) / Total Assets_t
  • Days Sales in Receivables Index (DSRI)
The DSRI is an indicator of revenue overstatement that compares receivables to revenue across two consecutive years. Under normal circumstances, the index is expected to have values close to 1.0, thus suggesting innocence. A large increase can result from a credit-policy change, i.e., granting customers longer timeframes to pay, or from revenue inflation, when receivables rise disproportionately to sales. When revenue increases but receivables increase even faster, this may be due to overstated revenue. Therefore, large increases in days sales in receivables are associated with a higher probability of overstated revenue.
  • Gross Margin Index (GMI)
The GMI is an indicator of earnings manipulation. The index measures whether the gross margin (sales minus cost of goods sold) has improved or deteriorated over two consecutive years. A GMI below 1.0 indicates an increasing gross margin, and a GMI greater than 1.0 indicates a decreasing gross margin. Beneish assumes that a decreasing gross margin tempts companies to manipulate their earnings, as it signals deteriorating prospects to external stakeholders. Other authors confirm this (Lev & Thiagarajan, 1993, p. 195; Maccarthy, 2017, p. 162; Mahama, 2015, p. 11). Under this assumption, a declining gross margin can be associated with a higher probability of earnings manipulation.
  • Asset Quality Index (AQI)
The AQI is an indicator of euphemistic profit. It measures the share of current assets plus property, plant, and equipment (PPE) relative to total assets. Thus, the “asset quality” is low when these short-term assets make up a large share of total assets, and high when long-term assets make up a large share. An AQI larger than 1.0 indicates that the company has increased its capitalized expenses or has deferred more expenses to the future compared to year t−1. Such changes may have legitimate causes but can also reflect attempts to make profits appear higher. Because capitalized expenses are depreciated over time, only a portion of the cost is recognized in year t, rather than the full amount. This incentivizes capitalizing as many expenses as possible, even those items prohibited from capitalization, constituting FSF.
  • Sales Growth Index (SGI)
The SGI denotes growing sales from year t−1 to year t. Growth in sales is not necessarily a measure of manipulation. However, according to Beneish, managers of growth companies are more likely to manipulate to maintain growth, especially when they face diminishing growth and pressure to achieve earnings targets. Thus, a higher SGI is associated with a higher probability of FSF.
  • Depreciation Index (DEPI)
The DEPI is an indicator of euphemistic profit. It measures the rate of depreciation in year t−1 against the rate of depreciation in year t. A DEPI above 1.0 indicates that depreciation expenses have been distributed over a longer timeframe; i.e., the rate at which assets are being depreciated has slowed down, or a new depreciation method has been used. Managers are incentivized to do so because lower current-year depreciation increases current-year profits. Thus, Beneish expects a positive relationship between DEPI and the probability of manipulation.
  • Sales General and Administrative Expenses Index (SGAI)
The SGAI is an indicator of inefficient sales activities. It measures the ratio of sales, general, and administrative expenses relative to sales in year t against year t−1. As the relationship between SG&A expenses and sales is expected to remain stable, a disproportionate increase signals weaker prospects (Lev & Thiagarajan, 1993, p. 196; Warshavsky, 2012, p. 17). This is intuitive: selling costs rising faster than sales themselves is adverse. Beneish expects a positive relationship between SGAI and the probability of manipulation.
  • Leverage Index (LVGI)
The LVGI denotes an increasing debt ratio. It measures the ratio of total debt, i.e., long-term debt (LTD) plus current liabilities, relative to total assets for the current year over the previous year. An LVGI greater than 1.0 indicates an increase in leverage, often associated with more and tighter debt covenants. Debt covenants may create incentives to manipulate earnings because they constrain managerial discretion.
  • Total Accruals to Total Assets (TATA)
TATA is the only M-Score variable benchmarked at 0 instead of 1, but higher values still imply a higher probability of FSF. Beneish originally used a different formula to compute TATA than the one used in this paper, but both yield similar results. This is because, before the current version of the cash flow statement became effective (pre-1987), few firms reported cash flow from operations, which the new formula uses (Beneish et al., 2012, pp. 31–32). The TATA index is an indicator of outstanding revenue. It measures the extent to which earnings are cash-based. Negative TATA values indicate earnings are largely cash-based, whereas positive TATA values indicate a greater share of accruals (cash not yet received). Hence, large positive accruals relative to total assets signal a lower earnings quality. This may reflect fabricated or prematurely reported revenue, i.e., FSF.
  • Interpretation
Each index is multiplied by a predetermined weight, and the weighted sum yields the M-Score. This final value can be used to interpret a company’s probability of FSF in year t. As illustrated in Table 3, M-Score values above −2.22 indicate likely manipulation, whereas M-Scores below −2.22 indicate that the company is unlikely to have manipulated its financial statements (Aghghaleh et al., 2016, p. 59; Warshavsky, 2012, p. 18).
  • Benefits and Limitations
The M-Score demonstrates a strong overall classification performance. In the original study, Beneish correctly classified up to 76% of manipulators and 92.4% of non-manipulators (Beneish, 1999, p. 17). Other papers have achieved similar results. Aghghaleh et al. correctly identified 69.51% of manipulators and 76.83% of non-manipulators in a sample of 164 Malaysian publicly listed firms (Aghghaleh et al., 2016, p. 62), and Golec reported 67% and 75% in a sample of 24 Polish firms (Golec, 2019, p. 182). However, as a probabilistic model, the M-Score cannot detect fraud with complete certainty (Tarjo & Herawati, 2015, p. 926). The eight variables reflect the profile of a “typical earnings manipulator,” which is, according to Beneish, a company that (1) grows extremely quickly (SGI), (2) experiences deteriorating fundamentals (GMI, AQI, SGAI, and LVGI), and (3) adopts aggressive accounting practices (DSRI, DEPI, and TATA) (Beneish et al., 2013, pp. 76–77). A firm meeting all these criteria may still not be a manipulator. A further limitation is that the model is estimated using data from publicly listed companies; its performance for private firms is uncertain (Beneish, 1999, p. 19). This limitation does not apply to this paper because its sample contains only publicly listed firms.
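To make the index definitions above concrete, the following Python sketch computes the M-Score from two consecutive firm years. The dictionary keys are illustrative stand-ins, not the study's actual Bloomberg field names.

```python
def beneish_m_score(cur, prev):
    """Beneish (1999) M-Score; cur and prev are dicts of raw statement
    items for year t and year t-1 (key names are illustrative)."""
    dsri = (cur["receivables"] / cur["sales"]) / (prev["receivables"] / prev["sales"])
    gmi = ((prev["sales"] - prev["cogs"]) / prev["sales"]) / \
          ((cur["sales"] - cur["cogs"]) / cur["sales"])
    aqi = (1 - (cur["current_assets"] + cur["ppe"]) / cur["total_assets"]) / \
          (1 - (prev["current_assets"] + prev["ppe"]) / prev["total_assets"])
    sgi = cur["sales"] / prev["sales"]
    depi = (prev["depreciation"] / (prev["depreciation"] + prev["ppe"])) / \
           (cur["depreciation"] / (cur["depreciation"] + cur["ppe"]))
    sgai = (cur["sga"] / cur["sales"]) / (prev["sga"] / prev["sales"])
    lvgi = ((cur["ltd"] + cur["current_liabilities"]) / cur["total_assets"]) / \
           ((prev["ltd"] + prev["current_liabilities"]) / prev["total_assets"])
    tata = (cur["income_before_xitems"] - cur["cfo"]) / cur["total_assets"]
    return (-4.84 + 0.92 * dsri + 0.528 * gmi + 0.404 * aqi + 0.892 * sgi
            + 0.115 * depi - 0.172 * sgai - 0.327 * lvgi + 4.679 * tata)
```

A useful sanity check: if both years are identical and earnings equal operating cash flow, every index is 1.0, TATA is 0, and the score is −2.48, just below Beneish's −2.22 manipulation threshold.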

2.2.3. Montier C-Score (2008)

The Montier C-Score (Montier, 2008) (C for “cooking the books” or “cheating”), developed by James Montier in 2008, captures six common elements of earnings manipulation. Each of the six inputs is binary: 0 indicates manipulation is unlikely, and 1 indicates it is likely. The six indicators are summed, resulting in a C-Score ranging from 0 to 6. Like the Beneish M-Score, the C-Score variables are computed from financial data reported in consecutive years. It is calculated as follows:
C = C1 + C2 + C3 + C4 + C5 + C6
where:
C1 = 1 if Net Income_t / Cash Flow from Operations_t > Net Income_t−1 / Cash Flow from Operations_t−1, else 0
C2 = 1 if Receivables_t / Receivables_t−1 > Sales_t / Sales_t−1, else 0
C3 = 1 if Inventory_t − Inventory_t−1 > 0, else 0
C4 = 1 if Total Assets_t / Sales_t − Total Assets_t−1 / Sales_t−1 > 0, else 0
C5 = 1 if Depreciation_t / Gross PPE_t − Depreciation_t−1 / Gross PPE_t−1 < 0, else 0
C6 = 1 if (Total Assets_t − Total Assets_t−1) / Total Assets_t−1 > 0.20, else 0
Source: Adapted from Mccain (2017).
  • C1—A growing difference in net income and cash flow from operations
This ratio is an indicator of the aggressive capitalization of costs. It measures how the net income changes relative to cash flow from operations. According to Montier, managers have less discretion over cash flows than over earnings, as earnings incorporate subjective estimates such as bad debts or pension returns. Under normal circumstances, he therefore expects the net income and cash flow from operations to move proportionally. A widening gap in which the net income rises faster than cash flow from operations (C1 = 1) suggests more aggressive cost capitalization to elevate reported earnings.
  • C2—Days sales outstanding (DSO) is increasing
This ratio is an indicator of overstated receivables. Like Beneish’s DSRI, it assesses whether receivables and revenue grow proportionally in two consecutive years. If receivables increase faster than revenue, this indicates issues with collecting customers’ payments or overstated receivables.
  • C3—Growing days sales of inventory (DSI)
This ratio indicates decreasing sales by measuring whether the inventory grew over two consecutive years. Inventory growth does not need to be a negative signal if the whole company grows with the inventory. However, substantial inventory growth without corresponding firm growth can indicate difficulty selling inventory, i.e., decreasing sales.
  • C4—Increasing other current assets to revenues
This indicator flags managers emphasizing total assets, potentially diverting attention from revenue-based metrics. Montier argues that total assets work as a “catch-all line item”, since investors often care about how much money a company brings in. This is supposed to “hide things they do not want investors to focus upon” (Montier, 2008, p. 3).
  • C5—Declines in depreciation relative to gross PPE
This ratio is an indicator of euphemistic profit. Like Beneish’s DEPI, it compares depreciation in year t−1 with that in year t. C5 = 1 indicates a declining depreciation rate, potentially due to managers lengthening useful lives to make current earnings appear higher.
  • C6—High total asset growth
Since asset values are often subject to estimates, this ratio attempts to flag overvalued assets. C6 = 1 captures managers’ tendency to accumulate assets, which increases their discretion in valuation. The flag triggers when total asset growth between two consecutive years exceeds 20%, i.e., rapid asset growth that may be used to influence reported earnings.
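As a sketch under the definitions above, the C-Score can be computed from two consecutive firm years. The dictionary keys are illustrative field names, not Bloomberg identifiers, and the example figures are hypothetical:

```python
def c_score(cur, prev):
    """Montier C-Score: sum of six binary red flags computed from two
    consecutive firm years (dict keys are illustrative field names)."""
    c1 = cur["net_income"] / cur["cfo"] > prev["net_income"] / prev["cfo"]
    c2 = cur["receivables"] / prev["receivables"] > cur["sales"] / prev["sales"]
    c3 = cur["inventory"] - prev["inventory"] > 0
    c4 = cur["total_assets"] / cur["sales"] - prev["total_assets"] / prev["sales"] > 0
    c5 = cur["depreciation"] / cur["gross_ppe"] - prev["depreciation"] / prev["gross_ppe"] < 0
    c6 = (cur["total_assets"] - prev["total_assets"]) / prev["total_assets"] > 0.20
    return int(c1) + int(c2) + int(c3) + int(c4) + int(c5) + int(c6)

# Hypothetical firm year that triggers all six red flags
prev = dict(net_income=10, cfo=10, receivables=100, sales=1000, inventory=50,
            total_assets=500, depreciation=10, gross_ppe=100)
cur = dict(net_income=20, cfo=10, receivables=150, sales=1100, inventory=60,
           total_assets=650, depreciation=8, gross_ppe=110)
```

Here `c_score(cur, prev)` returns 6, while a firm year identical to the prior year would score 0, since none of the strict inequalities triggers.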

2.2.4. Dechow F-Score (2011)

The Dechow F-Score (Dechow et al., 2011), developed by Patricia M. Dechow et al. in 2011, is a scaled probability measure of FSF risk. Like the M-Score and C-Score, the F-Score variables are calculated from financial data reported in consecutive years. Dechow et al. developed three F-Score models that differ in data requirements (Dechow et al., 2011, p. 21): model 1 (seven variables) relies only on financial-statement data; model 2 (nine variables) adds nonfinancial and off-balance-sheet items like the number of employees; and model 3 (eleven variables) adds market-based measures like the stock return. Due to data availability, this study applies model 1. It is calculated as follows:
F = p(FSF) / 0.0037
where:
p(FSF) = e^logit / (1 + e^logit)

logit = −7.893 + 0.790 × rsstacc + 2.518 × chrec + 1.191 × chinv + 1.979 × softassets + 0.171 × chcs − 0.932 × chroa + 1.029 × issue

rsstacc = [(Total Assets_t − Cash & Equivalents_t − Investments & Advances_t + Investments at Equity_t − Total Liabilities_t − Preferred Stock_t) − (Total Assets_t−1 − Cash & Equivalents_t−1 − Investments & Advances_t−1 + Investments at Equity_t−1 − Total Liabilities_t−1 − Preferred Stock_t−1)] / [0.5 × (Total Assets_t−1 + Total Assets_t)]

chrec = (Accounts Receivable_t − Accounts Receivable_t−1) / [0.5 × (Total Assets_t−1 + Total Assets_t)]

chinv = (Inventory_t − Inventory_t−1) / [0.5 × (Total Assets_t−1 + Total Assets_t)]

softassets = (Total Assets_t − Net PPE_t − Cash & Equivalents_t) / Total Assets_t

chcs = [Sales_t − (Accounts Receivable_t − Accounts Receivable_t−1)] / [Sales_t−1 − (Accounts Receivable_t−1 − Accounts Receivable_t−2)] − 1

chroa = Net Income_t / [0.5 × (Total Assets_t−1 + Total Assets_t)] − Net Income_t−1 / [0.5 × (Total Assets_t−2 + Total Assets_t−1)]

issue = 1 if the company issued long-term debt or common stock in year t; 0 otherwise
Source: Adapted from Dechow et al. (2011, Table 3).
  • rsstacc—RSST Accrual
The rsstacc variable captures accrual quality. It measures the change in net operating assets, or, equivalently, the change in non-cash working capital (∆WC) + the change in net non-current operating assets (∆NCO) + the change in net financial assets (∆FIN), scaled by the average total assets of year t−1 and year t. Dechow et al. adapted this variable from Richardson et al. (2005, pp. 446–451), modifying the treatment of long-term operating assets and long-term operating liabilities. Like Beneish’s TATA index, it reflects the changes in total accruals. Discretionary accruals rely on managerial estimates and are, therefore, sensitive to FSF (Dechow et al., 2011, p. 34; Healy, 1985). Large rsstacc values signal substantial changes in total accruals from year t−1 to t and, thus, higher FSF risk.
  • chrec—Change in Accounts Receivable
This ratio indicates overstated sales. It measures the change in receivables from year t−1 to year t, scaled by the average total assets of year t−1 and year t. Large increases in receivables may reflect fabricated or overstated sales. The higher chrec is, the likelier FSF becomes, according to the F-Score. It also captures the common exploitation of accrual accounting and, thus, informs accrual quality. This is based on the following: according to accounting standards (US GAAP—Topic 310), receivables must be stated at the net realizable value, i.e., the amount of cash expected to be collected (Financial Accounting Standards Board (FASB), 2010). Hence, for receivables that are overdue or not expected to be paid, provisions should be booked (Hung et al., 2017, p. 311). By altering the expected collectability estimates, managers can adjust the reported profits.
  • chinv—Change in Inventory
This ratio indicates declining sales, obsolescence, or inventory liquidation. Like C3 (DSI) from the Montier C-Score, it measures the change in inventory from year t−1 to year t, but divided by the average total assets of year t−1 and year t. Large inventory increases indicate the company is having problems selling its inventory, i.e., inventory surpluses. Large inventory decreases indicate write-downs due to obsolescence or inventory liquidation. The higher chinv is, the likelier FSF becomes, according to the F-Score. Like rsstacc and chrec, this ratio assesses the accrual quality. According to accounting standards, inventories must be measured at the lower of original cost (purchase or production cost) and net realizable value under IFRS (IAS 2), or at the lower of original cost and market value under US GAAP (Topic 330) (Financial Accounting Standards Board (FASB), 2015). If an item’s net realizable value or market value falls below its original cost, provisions should be made for the difference (Hung et al., 2017, p. 310). The amount of the accrual for this inventory write-down is subject to managers’ judgment, so it can be used to adjust the reported profit.
  • softassets—Percentage of Soft Assets
The softassets ratio indicates managerial flexibility in asset valuation. It measures the share of assets on the balance sheet that are neither cash and cash equivalents (CCE) nor property, plant, and equipment (PPE). A higher share of soft assets is associated with a higher likelihood of FSF. Its informativeness rests on the assumption of Barton and Simko (2002, p. 21) that a high percentage of soft assets, e.g., intangible assets such as the value of a brand, provides managers with flexibility in how to value assets, and thus more flexibility to adjust short-term earnings.
  • chcs—Change in Cash Sales
This ratio indicates financial performance. It measures the change in sales backed by actual cash flow, i.e., excluding receivables, from year t−1 to year t. Large sales growth net of receivables is associated with a higher FSF risk, consistent with Beneish’s SGI. In contrast to the previous F-Score variables, chcs does not assess the accrual quality but simply monitors changes in financial performance.
  • chroa—Change in Return on Assets
The chroa ratio also indicates financial performance. It measures the change in net income from year t−1 to year t, scaled by those two years’ average total assets. However, for this ratio, large decreases contribute to the likelihood of FSF, which is why this parameter, unlike the others, has a negative weight (−0.932) in the calculation of the logit parameter. This reflects Dechow et al.’s finding that manipulating firms consistently show strong performance before manipulation; as performance declines, the pressure to misstate increases to keep up with the previous performance growth.
  • issue—Actual Issuance
This ratio denotes the issuance of new securities. It receives a value of 1 if the company issued new debt or equity during year t , and a value of 0 if it did not issue any new securities. Since managers’ compensation is often tied to stock performance, Dechow et al. assume strong incentives to influence the share price. First, if the company needs new cash to finance its operations, high share prices reduce the cost of equity issuance (Dechow et al., 2011, p. 41). Second, raising new capital signals operating cash flow problems that need to be compensated by additional funding (Aghghaleh et al., 2016, p. 62). Third, equity issuance may coincide with managers exercising their stock options (Aghghaleh et al., 2016, pp. 62–63). This could signal that managers are trying to sell at a peak as they anticipate a diminishing performance of the company in the future.
  • Interpretation
Each of the seven variables (rsstacc, chrec, chinv, softassets, chcs, chroa, and issue) is multiplied by a predetermined weight to compute the logit parameter. The logit is then transformed into a probability, denoted p(FSF), the probability of financial statement fraud. The final F-Score is obtained by dividing p(FSF) by the unconditional probability of misstatement (0.0037). This value is derived from the original dataset Dechow et al. used to build the model and corresponds to the number of fraudulent firm years divided by the total number of firm years: 494 / (132,967 + 494) ≈ 0.0037. The resulting F-Score, a scaled probability, can be interpreted as described in Table 4.
For instance, Enron’s financial statement of the year 2000 has an F-Score of 2.76 (Dechow et al., 2011, p. 61). This means Enron’s misstatement risk for the year 2000 is 276% compared to a randomly selected company from the population.
  • Benefits and Limitations
Compared to the prior scores, the F-Score is based on a more comprehensive dataset and was developed more recently. The dataset comprises 676 unique firms identified by the SEC as misstating their financial statements, drawn from all 2190 AAERs issued between 1982 and 2005. Thus, unlike the Z-Score or M-Score, it is not limited to manufacturing or public firms. Additionally, the seven variables of model 1 were selected from 28 variables used in the prior literature and have received broad empirical support (Cecchini et al., 2010a; Dechow et al., 2011; Price et al., 2011). One study reports that the F-Score outperforms the Beneish M-Score (Aghghaleh et al., 2016, p. 57). Finally, the work of Dechow et al. has been widely cited relative to others in the field.
Despite these advantages, the error rates remain material. In the original sample, 31.4% of misstated firm years were classified with an F-Score < 1 (false negatives), and 36.3% of non-misstated firm years were classified with an F-Score > 1 (false positives). Moreover, the F-Score is based on fraudulent and random firm years, not fraudulent and verified non-fraudulent ones, so undetected FSF may be present in the random sample.

2.3. Detecting FSF Through Machine Learning

A straightforward approach would be to apply the four ratios introduced in Section 2.2 to the entire sample of 2,014,827 firm years and classify them using the authors’ original thresholds. Although methodologically valid, this strategy has an inherent weakness: the interpretation thresholds of the four scores are data-driven estimates derived from the authors’ samples and may not generalize to other datasets like the one examined in this study.
Machine-learning models mitigate this limitation because they do not require fixed decision thresholds. Instead, they learn classification boundaries directly from labeled data (fraudulent or non-fraudulent), determining which indicators are most predictive of FSF and the typical ranges associated with each class.
To identify the best-performing algorithms for the binary classification of firm years (fraudulent or non-fraudulent), this study reviews prior ML-based FSF-detection research. The review narrows the scope of this study to algorithms found most effective in prior work; most published studies report fraud-detection accuracies above 80%. Table 5 summarizes ML-based FSF studies with their datasets, algorithms, and reported accuracies, sorted in ascending order by year of publication.
The average detection accuracy of all 84 analyzed classification algorithms is 78.37%. As shown, many studies focus on U.S. firms: 16 of 31 analyze U.S. data. Notably, the studies that analyzed the largest datasets (Bertomeu et al., 2018; Dechow et al., 2011) report comparatively lower accuracies. To guide the algorithm selection, the average accuracy for each algorithm category is listed in Table 6.
Methods labeled AdaBoost, XGBoost, and Gradient Boosted Regression Trees (GBRT) were grouped under Boosting. Approaches using textual data, regardless of whether ratios were also included, were categorized as Text Mining (TM). TM methods were excluded from further analysis because this study focuses on algorithms using financial ratios.
To find the best algorithm, category-level average accuracies were calculated; the results are listed in Table 7.
The category means suggest that EC performs best, but this is based on only two studies; similarly, DA ( n = 1 ) and BBN ( n = 2 ) have limited evidential weight. ANN and LMM perform well on average and are widely used in FSF literature. DT, Ensemble methods, and other models show lower mean accuracies; yet, they remain common choices for FSF classification. These findings inform the final algorithm choices described in Section 3.2.2.

2.4. Hypothesis Development

Several prior studies researching ML approaches report high accuracies in correctly identifying fraudulent and non-fraudulent financial statements. The average detection accuracy of the analyzed literature in Section 2.3 is 78.37%. However, most studies rely on relatively small, single-country datasets. This geographical limitation constitutes a research gap. A recent meta-study by Albizri et al. (2019, Tables 8 and 9) analyzed 29 studies, all of which investigated data from single countries, and 21 of which analyzed data from U.S. companies. Other research confirms the need for cross-country analyses (Anh & Linh, 2016, p. 22). A cross-country analysis would also allow more generalizable predictions. Since the dataset analyzed in this study covers more than 1000 fraudulent firm years from companies worldwide, this study’s model is expected to outperform the previous research. Thus, the first hypothesis of this study is as follows:
Hypothesis 1 (H1).
The detection accuracy of this study’s machine-learning model exceeds the average detection accuracy of 78.37% of the previous literature’s 84 algorithms analyzed in Section 2.3, as it makes use of global data and four scores: the Altman Z-Score, Beneish M-Score, Montier C-Score, and Dechow F-Score.
Moreover, a recent incident has unleashed a noticeable wave of discussion on the issue of FSF. Wirecard AG was the first company in Germany’s DAX stock index to file for insolvency in the context of accounting manipulation (Alderman & Schuetze, 2020). Wirecard was chosen for this paper as an illustrative case because it is a relevant, non-U.S. fraud case in an IFRS environment. Its June 2020 collapse followed the company’s acknowledgment that approximately €1.9 billion, which it claimed to have on its balance sheets, probably never existed, and intensified scrutiny of audit and regulatory oversight (Alderman & Schuetze, 2020; Granville, 2020). As a cross-border payments platform with a complex, digital business model, Wirecard provides a stringent test of whether ratio-based signals learned from a largely U.S.-sourced dataset generalize to international settings. It also offers a long financial statement history (2000–2018). The Wirecard case motivated this study and led to the research objective of finding the best possible detection model for FSF. The model is, therefore, applied to Wirecard’s financial data, leading to the second hypothesis:
Hypothesis 2 (H2).
The machine-learning approach used in this study would have been able to detect the Wirecard AG accounting scandal of 2020 using Wirecard’s financial statement data between 2000–2018.

3. Materials and Methods

To identify the most effective machine-learning model for detecting FSF, this study proceeded as follows. First, based on the literature review in Section 2.2, four indicators were selected: the Altman Z-Score, Beneish M-Score, Montier C-Score, and Dechow F-Score. Second, two datasets were assembled: financial statement data to compute the four scores (described in Section 3.1.1: Financial Data) and data on known fraudulent firm years (described in Section 3.1.2: Fraud Data). Third, the datasets were manually linked (Merged Data) and preprocessed to create an ML-ready sample (Prepared Data) (described in Section 3.2.1: Data preparation). Finally, multiple ML algorithms were trained and evaluated to develop a predictive model (described in Section 3.2.2: Applying machine-learning algorithms).

3.1. Data

3.1.1. Financial Data

This sample comprises the four scores and the variables required to calculate them, for 205,911 distinct companies worldwide from 1988–2019. Each row represents a firm year (a firm’s reported financials for one fiscal year). In total, the sample contains 2,014,827 firm years. Table 8 summarizes the dataset structure.
The data was obtained from “Bloomberg Terminal”, a software system provided by the financial data vendor Bloomberg L.P. via multiple BQL (Bloomberg Query Language) queries executed between 25 September 2020 16:37 CET and 30 October 2020 16:56 CET. The full queries are attached as Supplementary Material, File S1 “bql_query.txt”. The objective of these queries was to obtain as much financial data from as many companies as possible for the subsequent machine-learning analysis. Bloomberg was also selected because its standardized financial statement format enables cross-country comparability, thereby addressing this research gap. Nevertheless, it has limitations: pre-1988 financials are unavailable in Bloomberg’s database and were excluded; and post-2019 observations were excluded because financials for the fiscal year 2020 were not yet available at the query dates.
The 2,014,827 firm years correspond to all Bloomberg observations (as of October 2020) with total assets exceeding USD 5 million. Firm years with total assets below USD 5 million were not retrieved due to cost and time constraints. These account for 268,779 additional firm years (11.72% of Bloomberg’s total), leaving coverage at 88.28%.
Moreover, the BQL query restricted the universe to companies where a total assets value was available, which is required to compute most components of the four scores. Querying firms without total assets would have returned about 2 million more firm years, but with poor data availability at much higher query cost; thus, they were excluded.
Furthermore, the Altman Z-Score variable X4 requires a firm’s market capitalization. Unlike the other variables, which are usually reported once a year, the market capitalization can change on any trading day. Thus, an assumption had to be made as to which market capitalization to use within a fiscal year. Bloomberg offers two variables: “periodic_market_cap” (market capitalization on the financial statement release date, so usually a few months after the fiscal year ended); and “cur_mkt_cap” (market capitalization on the query date). Both options are too inconsistent for cross-firm comparability. Instead, the following BQL expression was used to request the market capitalization on the last trading day before 31 December of each fiscal year:
#market_cap = applypreferences(cur_mkt_cap(fill=prev).value,aligndatesby = bs_tot_asset().period_end_date)
This corresponds to the end of a fiscal year for most companies and, therefore, provides acceptable consistency. Ideally, one might average the market capitalization of all trading days over the fiscal year; this was not implemented to avoid substantial query complexity, with limited expected impact on results.
Another limitation arises from calculating the Dechow F-Score. The F-Score’s variable “issue”, which is 1 if the firm issued new securities (equity or debt) and 0 otherwise, is not based on financial statement data. This required splitting the BQL query into two parts because a single BQL query can request data from only one universe. Part A queries the “equities universe”, which contains financial data that Bloomberg stores about public companies. Part B queries the “debt universe”, which contains data about corporate debt issuance. First, part B captures only new debt issues; comparable equity-issuance data was unavailable on Bloomberg as of October 2020. Second, due to slow response times for this query, the issue variable could be retrieved for less than 1% of the 2,014,827 firm years. Thus, the variable was set to 1 for all firm years to enable computation of the F-Score. Consequently, the F-Score in this study differs from its original definition.
The sample contains some missing values (N/A), where data was unavailable in Bloomberg’s database. Table 9 provides an overview of the number of missing values.
Some fields show a significant number of missing values. This is because each score requires all of its component variables; any missing component renders the score unavailable for that firm year. The handling of missing values is described in Section 3.2.1.
To contextualize data availability over time, Table 10 describes the distribution of firm years and N/A values across the 32 years queried. The relative N/A rate is r = n / (f × 25), where n is the number of N/A values across all firm years in year x and f is the number of firm years in year x. The constant 25 equals the number of variables used to compute the four scores (field IDs: 7–11, 13–20, 22–27, and 29–34 from Table 8), excluding issue. For instance, in 1988, 94,431 / (4066 × 25) = 92.90% of the score variables across the 4066 firm years are unavailable.
It becomes clear that earlier years contain fewer firm years and show lower data availability. Fortunately, data availability is comparatively high around the turn of the millennium, which coincides with a concentration of reported FSF cases (see Figure 1).
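As a check, the N/A rate formula can be applied to the 1988 figures quoted above:

```python
def na_rate(n_missing, firm_years, n_fields=25):
    """Relative N/A rate r = n / (f * 25), as defined in Section 3.1.1."""
    return n_missing / (firm_years * n_fields)

# 1988: 94,431 N/A values across 4066 firm years and 25 score variables
r_1988 = na_rate(94_431, 4_066)
```

This reproduces the 92.90% figure reported in the text.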
Finally, a key parameter in the BQL query (S1: bql_query.txt; Part A) requires explanation: FS = LR (fa_filing_status = last recent).
This parameter determines whether originally reported or restated values are returned. When a firm restates data, Bloomberg retains both the original and the restated values. If FS is not specified, BQL returns the most recently stated values, so for companies that restated a financial statement after FSF became public, the corrected data would be queried. Training on corrected (restated) data while assuming it contains the misstatements would likely have undermined the study’s objective. Therefore, the query requests the initially reported values, assuming these are the firm years most likely to contain misstatements.

3.1.2. Fraud Data

The second sample lists firms that misstated their financial statements in specific fiscal years. In total, 1038 distinct companies from 12 countries that misstated at least one statement during 1879–2014 were analyzed, yielding 1145 fraudulent firm years. Each row corresponds to one firm year (company name and the fiscal year in which the misstatement occurred). The sample is based on two sources:
First, the most comprehensive source is the SEC’s list of Accounting and Auditing Enforcement Releases (AAERs) (U.S. Securities and Exchange Commission, n.d.), published since 1982 and covering misstatements back to 1971. However, AAERs do not map one-to-one to fraudulent firm years: multiple releases may cover the same misstatement, and some describe individuals’ misconduct rather than firm-level violations. Deriving a list of fraudulent firm years therefore requires a manual analysis of all AAERs. Fortunately, previous research has already carried out this analysis and offers the data to others. It was acquired from the USC Marshall School of Business, which analyzed all 4012 AAERs spanning 17 May 1982–31 December 2018 (USC Marshall School of Business, 2020). A detailed description of the dataset is provided by Dechow et al. (2011, pp. 78–80). The data was then manually merged with the financial data sample from Bloomberg by adding a new data field, “fraud”, which equals 1 for fraudulent firm years and 0 otherwise. Only fraudulent firm years that could be unambiguously matched were merged. Some extracted firm years could not be merged because (1) the misstatement preceded 1988 (Bloomberg data goes back only to 1988), (2) total assets were below USD 5 million, or (3) the respective company or its fraudulent firm years could not be found in the database. For transparency, Table 11 describes the number of available and retained fraudulent firm years.
The AAER dataset only covers misstatements of U.S. companies. Relying solely on U.S. cases could limit the ML model’s generalizability to data from other countries. Therefore, an additional non-U.S. source was incorporated.
Second, a dataset from Michael Jones’s Book “Creative Accounting, Fraud and International Accounting Scandals” (Jones, 2011, pp. 509–517) was used. Jones compiled a list of major cases across 12 countries (Appendix 1 of the book). This list features 222 fraudulent firm years from 142 distinct companies between 1879–2009. For similar reasons as with the previous dataset, not all 222 firm years could be merged. Table 12 describes the retained firm years.
We found that 63 firm years from this source could be used. This results in an imbalance between firms from the U.S. and other countries, illustrated in Table 13. However, assuming the four scores generally show similar characteristics for fraudulent firm years from different countries, the imbalance should not materially affect model performance.
To clarify source contributions and exclusion steps, Table 14 details the sample selection process.
To conclude, 1145 fraudulent firm years from 12 countries between 1988–2014 were combined into the merged dataset. Publicly known FSF cases after 2014 are not included due to the absence of a sufficiently comprehensive, high-quality source. However, more recent cases like Wirecard can be used for model validation. This is carried out in detail in Section 4: Results.

3.2. Machine Learning

Detecting fraudulent financial statements is a binary classification task: statements are labeled fraudulent (1) or non-fraudulent (0). A standard three-step workflow was used. First, the available data was split into training and testing samples. Second, the model was trained on the training sample (supervised learning). Third, the trained model classified firm years in the testing sample. Since the fraud labels of the testing sample are known, prediction accuracy can be computed. After training, the model can make predictions about new sets of data with unknown fraud status.
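The three-step workflow can be sketched with scikit-learn; the synthetic data and the gradient-boosting classifier below are stand-ins for the study's prepared dataset and final pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the balanced firm-year sample
# (features play the role of the score variables, labels the fraud flag)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Step 1: split into training and testing samples (stratified on the label)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Step 2: supervised training on the training sample
clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Step 3: classify the held-out firm years and compute prediction accuracy
accuracy = accuracy_score(y_test, clf.predict(X_test))
```

After this evaluation step, `clf.predict` can be applied to new firm years with unknown fraud status.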

3.2.1. Data Preparation

The merged dataset required preprocessing due to (a) an imbalance between fraudulent and non-fraudulent firm years, (b) missing values, and (c) outliers.
  • Imbalance
Ideally, the merged data would contain equal numbers of fraudulent and non-fraudulent firm years. In practice, there are far fewer fraudulent than non-fraudulent firm years available. In the merged data, 1145 / 2,014,827 ≈ 0.06% of firm years are fraudulent. A naive classifier that predicts 0 for all cases would achieve an accuracy above 99.9%, which is uninformative, since the primary objective is to correctly identify fraudulent firm years. Prior FSF research faced the same issue. Two approaches address the imbalance: data-level and algorithmic-level methods (B. Li et al., 2016, pp. 179–180).
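To illustrate why accuracy alone is uninformative on the full merged dataset, consider the majority-class baseline with the counts given above:

```python
fraud_years, total_years = 1_145, 2_014_827

# Accuracy of a classifier that always predicts "non-fraudulent":
# every fraudulent firm year is missed, yet accuracy looks excellent
naive_accuracy = 1 - fraud_years / total_years
```

This baseline reaches roughly 99.94% accuracy while detecting zero fraud cases, which is why the classes are balanced before training.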
The data-level method balances classes (usually to 50/50) via under- or oversampling so standard classifiers can be applied. In this case, undersampling (removing majority-class cases) would mean deleting 2,012,537 of the 2,014,827 firm years, leaving 1145 fraudulent and 1145 random firm years. Oversampling (synthetically generating minority cases) would mean artificially generating fraudulent cases based on the existing fraudulent firm years. Many prior studies used undersampling, as indicated by the matching firm-year numbers in Table 5. Selection is typically not purely random: fraudulent firm years are matched to random ones by similar industry, country, or firm size.
The algorithmic-level method retains the existing data and modifies the learning algorithm to handle the imbalance. This could be achieved by applying a Biased-Penalty Linear Support Vector Machine (BP-SVM) (Bach et al., 2006, pp. 1718–1721). Its objective function contains two penalties: C+ for false negatives (fraud classified as non-fraud) and C− for false positives (non-fraud classified as fraud) (B. Li et al., 2016, p. 181). Tuning these penalties can steer the classifier to reduce misclassification, particularly false negatives.
Since algorithmic approaches are out of scope, this paper applied a data-level undersampling approach, resulting in 1145 fraudulent and 1145 random firm years. Non-fraudulent firm years were not selected at random but were matched to (a) have high data availability (no missing fields), (b) originate from the same country (e.g., every fraudulent Japanese firm year gets one random Japanese firm year), and (c) have about the same amount of total assets.
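A sketch of this matched undersampling, assuming illustrative column names (country, total_assets) rather than the study's actual Bloomberg fields:

```python
import pandas as pd

def match_controls(frauds, pool):
    """One-to-one matched undersampling: for each fraudulent firm year,
    pick the non-fraudulent firm year from the same country with the
    closest total assets, without replacement."""
    pool = pool.dropna()  # criterion (a): full data availability
    used, picks = set(), []
    for _, row in frauds.iterrows():
        candidates = pool[(pool["country"] == row["country"])  # criterion (b)
                          & (~pool.index.isin(used))]
        if candidates.empty:
            continue
        # criterion (c): closest total assets
        best = (candidates["total_assets"] - row["total_assets"]).abs().idxmin()
        used.add(best)
        picks.append(best)
    return pool.loc[picks]

# Tiny hypothetical example: two fraudulent firm years, three candidates
frauds = pd.DataFrame({"country": ["US", "JP"], "total_assets": [100.0, 50.0]})
pool = pd.DataFrame({"country": ["US", "US", "JP"],
                     "total_assets": [90.0, 500.0, 55.0]})
controls = match_controls(frauds, pool)
```

Each fraudulent firm year receives exactly one same-country control with the most similar total assets, mirroring the matching criteria described above.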
  • Missing Values and Outliers
Because ML algorithms cannot interpret missing values, these values must either be deleted or replaced with estimates close to the actual values. To address the missing values in the merged dataset, the number of N/A values per row was analyzed, as illustrated in Table 15. Only the 25 FSF-predictive fields described in Section 3.1.1 were considered.
Some rows exhibit such low data availability that they contribute little predictive value. Rows with more than 12 of 25 fields missing were deleted. This corresponds to 99 deleted rows, leaving 1046 fraudulent firm years.
The same procedure was repeated column-wise to identify fields with disproportionately large N/A numbers. The outcome of this analysis is shown in Table 16.
As expected, some fields have a high N/A rate. Fields with more than 50% missing values (≥523 of 1046 rows) were deleted: the Z-Score, SGAI, M-Score, rsst_acc, logit, prob_FSF, and the F-Score. The issue variable was also removed because it is constant (all its values were previously set to 1) and, thus, has no predictive value. The C-Score was also excluded despite being below the 50% threshold. Unlike the other three scores, it is an unweighted sum of six binary indicators and is, therefore, perfectly linearly dependent on its components. Including both would overweigh its predictive information. In total, 9 of 32 columns were removed, leaving 23 fields and 1046 rows of fraudulent firm years.
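The two filtering rules can be sketched in a few lines of plain Python (a toy 4-column matrix stands in for the 25 predictive fields, with the thresholds scaled down accordingly; the paper uses 12 of 25 fields row-wise and a 50% threshold column-wise):

```python
# Toy matrix: None marks an N/A value.
# Rule 1: drop rows with more than MAX_ROW_NA predictive fields missing.
# Rule 2: drop columns where at least half of the remaining rows are missing.
rows = [
    [1.0, None, 3.0, None],
    [None, None, None, 4.0],   # 3 of 4 fields missing -> dropped by rule 1
    [5.0, None, 7.0, 8.0],
]
MAX_ROW_NA = 2

kept_rows = [r for r in rows if sum(v is None for v in r) <= MAX_ROW_NA]

n = len(kept_rows)
kept_cols = [j for j in range(len(rows[0]))
             if sum(r[j] is None for r in kept_rows) < n / 2]

filtered = [[r[j] for j in kept_cols] for r in kept_rows]
print(filtered)  # [[1.0, 3.0], [5.0, 7.0]]
```

Applying the row rule first, as in the paper, matters: the column N/A rates in rule 2 are computed on the surviving rows only.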
Despite the described preprocessing, some datasets with missing values remained. Those values were replaced with the average of all values of the respective column. For transparency, the resulting averages are listed in Table 17.
All computed averages are in a reasonable range, except for the AQI average. The discrepancy is driven by three extreme outliers (83,518,801.58, 26,969,010.02, and 562,818.99). These three values were replaced with the mean of the remaining AQI observations, yielding a new mean of 3.9461 used for missing AQI values. No further outliers were identified. For the C-Score components, rounded averages were used as their values can either be zero or one.
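The imputation rule, including the outlier treatment applied to AQI, can be sketched as follows (column values are hypothetical apart from the quoted AQI outlier, and `outlier_cutoff` is an illustrative parameter, not the paper's criterion):

```python
def impute_with_outlier_trim(values, outlier_cutoff):
    """Replace extreme outliers with the mean of the remaining observations,
    then fill missing values (None) with that trimmed mean."""
    observed = [v for v in values if v is not None]
    regular = [v for v in observed if abs(v) < outlier_cutoff]
    trimmed_mean = sum(regular) / len(regular)
    return [trimmed_mean if (v is None or abs(v) >= outlier_cutoff) else v
            for v in values]

# Hypothetical AQI-like column: one extreme outlier and one missing value,
# both replaced by the mean of the regular observations.
column = [1.2, 0.9, 83518801.58, None, 4.1]
print(impute_with_outlier_trim(column, outlier_cutoff=1000.0))
```
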
The “merged dataset” is now prepared as input for ML classifier training and evaluation, hereafter, the “prepared dataset”. The non-fraudulent (random) sample was reduced correspondingly to 1046 firm years with the same 23 fields.

3.2.2. Applying Machine-Learning Algorithms

Selecting the most suitable algorithm with optimal hyperparameters for FSF detection can be a time-consuming and complex task. The analysis of previous research in Section 2.3 suggests using SVMs, since they are widely used in the FSF literature and perform well on average. However, this might not hold for the specific sample analyzed in this study. To resolve this uncertainty, this study uses TPOT (Tree-based Pipeline Optimization Tool) (EpistasisLab, 2021b) to automate model selection and hyperparameter optimization. Developed by the Computational Genetics Lab at the University of Pennsylvania (EpistasisLab, 2021a) and introduced by Olson et al. (2016a), with further evaluations in Le et al. (2020) and Olson et al. (2016b), TPOT uses genetic programming to optimize ML pipelines and returns an executable Python pipeline for training and prediction. TPOT iterates over hundreds of candidate pipelines, scores each by 5-fold cross-validation on the training set, and keeps the best-scoring pipeline. This avoids limiting the analysis to SVMs or manually trying different algorithms. The part of the ML workflow that TPOT automates is visualized in Figure 3.
The code for the TPOT workflow is provided as Supplementary Material, File S2 “ml_code.py”. The code is separated into three parts: 1. TPOT Auto-ML (Automatic Machine Learning) Model Generation; 2. TPOT-generated Python-code output; and 3. Apply new data to the trained model. The following excerpts describe steps relevant to reproducibility and interpretation. Initially, the imported data is split into four data frames: X_train, X_test, Y_train, and Y_test. X corresponds to the 23 predictors, and Y is the binary fraud indicator (0/1, labeled target). The split is 75% training and 25% testing of the 2092 available firm years (1046 fraudulent and 1046 non-fraudulent) from the prepared dataset. This balances the need for comprehensive training data with sufficient holdout data for accuracy assessment.
Lines 13–21 implement a group–shuffle–split rather than a random train–test split to minimize dependence between the train and test sets. If firm years from the same company appeared in both sets, the model could learn to identify the companies themselves instead of fraud patterns. The group–shuffle–split assigns all firm years of a company to either the train or the test set, never to both, using the Name column as the grouping key. This procedure reduced the accuracy for predicting fraudulent firm years by 3%, indicating that such a dependency was indeed present.
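The grouped split can be sketched with scikit-learn's `GroupShuffleSplit` (hypothetical company names; the study groups on the Name column):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical firm years: three rows each for four companies.
companies = np.repeat(["Alpha", "Beta", "Gamma", "Delta"], 3)
X = np.arange(len(companies)).reshape(-1, 1)   # stand-in for the 23 predictors
y = np.tile([0, 1, 0], 4)                      # stand-in fraud labels

gss = GroupShuffleSplit(n_splits=1, train_size=0.75, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=companies))

train_firms = set(companies[train_idx])
test_firms = set(companies[test_idx])
print("train:", sorted(train_firms), "test:", sorted(test_firms))
# No company contributes firm years to both sides of the split.
assert train_firms.isdisjoint(test_firms)
```

Because whole companies are allocated at once, the realized row-level split can deviate slightly from exactly 75%/25%, as noted for Figure 5.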
Lines 23–34 instantiate TPOT and configure the optimization: generations = 100 (TPOT runs 100 iterations of the pipeline optimization process), population_size = 100 (TPOT evaluates 100 candidate pipelines in each generation), and cv = gss_5 (TPOT applies 5-fold cross-validation to validate its pipelines). The Auto-ML run (line 29) was executed on Windows 10 Professional 64-bit (version 20H2; Python 3.8.5; TPOT 0.11.7) with an Intel Core i7-4930K @ 3.40 GHz and 16 GB RAM. With multithreading enabled, the run completed in 04:16:55 (hh:mm:ss). Figure 4 displays TPOT’s current best internal cross-validation score over time.
The score rises sharply during the first five generations, and then plateaus. After 25 iterations, it stabilizes near 91% with only marginal improvements thereafter.
TPOT then returned a Python file containing the optimized model (lines 36–74). The algorithm TPOT found to be optimal is an ensemble combining gradient boosting (scikit-learn, 2021a) and k-nearest neighbors (k-NN) (scikit-learn, 2021b). Gradient boosting captures non-linear, global interactions among the ratios, while k-NN emphasizes local neighborhoods in the feature space. The combination should therefore perform well on heterogeneous firm-year patterns. The remaining code displays the results and applies the model to the Wirecard firm years; the findings are presented in Section 4: Results. For transparency, Figure 5 summarizes the data flow of the machine-learning process. Due to company-level grouping, the realized split deviates slightly from the planned 75%/25% split.
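The returned pipeline itself is specific to the Bloomberg data, but the ensemble idea can be sketched with scikit-learn's stacking API on synthetic data (a hedged sketch; TPOT's generated code composes the two learners differently, via its own StackingEstimator):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared dataset (the real one is Bloomberg-licensed).
X, y = make_classification(n_samples=600, n_features=23, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Global, non-linear interactions from boosting; local neighborhood structure
# from k-NN; a final logistic layer combines both (sklearn's default).
ensemble = StackingClassifier(
    estimators=[("gb", GradientBoostingClassifier(random_state=0)),
                ("knn", KNeighborsClassifier(n_neighbors=5))],
)
ensemble.fit(X_tr, y_tr)
print(f"holdout accuracy: {ensemble.score(X_te, y_te):.3f}")
```
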

4. Results

4.1. Model Accuracy (H1)

To test the first hypothesis, performance and generalizability were assessed on the held-out test set by comparing the predicted labels to the true labels. Since FSF detection represents a binary classification problem, there are four possible outcomes: True Negative (a non-fraud firm correctly classified as a non-fraud firm), False Positive (a non-fraud firm incorrectly classified as a fraud firm), False Negative (a fraud firm incorrectly classified as a non-fraud firm), and True Positive (a fraud firm correctly classified as a fraud firm). The results are illustrated in Figure 6.
To compute the overall detection accuracy, all correct predictions were divided by all predictions: (231 + 242)/(231 + 26 + 53 + 242) = 473/552 ≈ 85.69%. The average detection accuracy reported in previous research, 78.37%, was therefore exceeded by 7.32 percentage points. Thus, the first hypothesis is confirmed.
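The confusion-matrix arithmetic can be verified directly from the four counts in Figure 6:

```python
# Test-set confusion counts reported in Figure 6.
tn, fp, fn, tp = 231, 26, 53, 242

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # overall detection accuracy
sensitivity = tp / (tp + fn)                   # share of fraud years caught
specificity = tn / (tn + fp)                   # share of clean years cleared

print(f"accuracy:    {accuracy:.2%}")    # 85.69%
print(f"sensitivity: {sensitivity:.2%}") # 82.03%
print(f"specificity: {specificity:.2%}") # 89.88%
```

The sensitivity and specificity values match the 82.03% and 89.88% rates quoted in the abstract.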
To interpret the final TPOT pipeline, feature importance was computed on the held-out test set. For each variable, values were randomly permuted while the trained model was kept fixed, and the resulting mean drop in accuracy was recorded, as displayed in Figure 7.
The ranking indicates the strongest reliance on X2 and DEPI, with secondary contributions from X4, X5, ch_rec, X3, and GMI. The remaining variables show small contributions once the top predictors are present. Overall, the predictive signal appears concentrated in a subset of variables, consistent with the correlation among financial ratios and prior evidence that depreciation dynamics and receivables-related measures are informative for FSF detection.
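The permutation procedure can be reproduced with scikit-learn's `permutation_importance` (a sketch on artificial data, built so that, as in the paper, the predictive signal is concentrated in a few features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: only the first three features carry signal.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Permute each feature on the held-out set while the trained model stays
# fixed, and record the mean drop in accuracy across repeats.
result = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("most important features:", ranking[:3])
```
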

4.2. Wirecard Case Study (H2)

To test H2, Wirecard’s financial statements were evaluated in two steps: (1) examine the four scores in isolation and (2) report the model’s predictions for each Wirecard firm year. The financial data obtained for this study contains 17 Wirecard firm years (2000–2018; excluding 2001 and 2002). The fiscal years 2015–2018 are of primary interest, given the reports of manipulation from 2015 onward (Der Spiegel, 2020; Tagesschau, 2020).
First, the Z-Score was analyzed. Wirecard’s Z-Score data is displayed in Table 18.
Since the Altman Z-Score measures financial stability, it is not directly diagnostic of FSF. It is still interesting to see Wirecard’s performance in this regard. All available firm years except 2004 exceed 2.67 (non-distress zone), indicating a consistently stable financial situation. In 2004, the Z-Score is at the upper end of the gray zone, which does not alter the overall picture. The five component variables also confirm this, since inputs greater than 0 positively impact the aggregate Z-Score. Given the subsequent evidence of overstated financials, the Z-Score’s low bankruptcy-risk assessment is not surprising. The Z-Score alone was, therefore, not able to detect Wirecard’s fraudulent activities.
Wirecard’s M-Score data is displayed in Table 19. The M-Score’s interpretation is limited by missing components. To still provide an interpretable reference, NULL values were replaced with the neutral value (1.00) for all ratios except TATA (neutral value 0.00). This yields a neutral-imputed M-Score without pushing the result toward either class. The resulting M-Scores in 2003 and 2005 exceed the −2.22 threshold, driven primarily by an elevated DSRI (6.12) and AQI (7.93) in 2003 and by SGI (7.17) in 2005. However, these are isolated signals and are not conclusive on their own. By contrast, 2015–2018 fall below −2.22 (manipulation unlikely). Thus, the M-Score alone also did not detect the Wirecard scandal, at least for the most recent years.
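The neutral imputation can be illustrated with the eight-variable Beneish (1999) model (a minimal sketch; the coefficients are from the original paper, and the DSRI/AQI inputs are the 2003 values quoted above):

```python
# Eight-variable Beneish (1999) M-Score with the neutral imputation used here:
# missing index ratios are set to 1.00 and a missing TATA to 0.00.
WEIGHTS = {"DSRI": 0.920, "GMI": 0.528, "AQI": 0.404, "SGI": 0.892,
           "DEPI": 0.115, "SGAI": -0.172, "TATA": 4.679, "LVGI": -0.327}
INTERCEPT = -4.84
NEUTRAL = {k: (0.0 if k == "TATA" else 1.0) for k in WEIGHTS}

def m_score(components):
    """M-Score with missing components replaced by their neutral values."""
    filled = {k: components.get(k, NEUTRAL[k]) for k in WEIGHTS}
    return INTERCEPT + sum(WEIGHTS[k] * filled[k] for k in WEIGHTS)

# Fully neutral inputs land at about -2.48, below the -2.22 threshold, so
# the imputation alone never pushes a firm year into the manipulator zone.
print(round(m_score({}), 2))  # -2.48

# The elevated 2003 DSRI and AQI alone push the score well above -2.22.
print(round(m_score({"DSRI": 6.12, "AQI": 7.93}), 2))
```
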
Wirecard’s C-Score data is displayed in Table 20. For the Montier C-Score, missing components were set to 0, assuming Wirecard’s innocence. As shown in Table 20, each firm year has at least one component indicating FSF. The 2011 fiscal year stands out: all six indicators were flagged, consistent with an elevated manipulation risk. In addition, five firm years exhibit four of the six indicators flagged. Wirecard’s average C-Score of all available firm years is 3.00, exceeding the overall mean of 2.66 computed across 2,014,827 firm years in the Bloomberg sample. This suggests a modestly elevated risk signal on the C-Score. However, 2015–2018 are not clearly flagged; thus, the C-Score alone did not detect the Wirecard case in the most recent years.
Wirecard’s F-Score data is displayed in Table 21. Due to the limited data availability before 2011, only F-Score values for 2011–2018 were calculated and interpreted. The F-Score, like the C-Score, flagged the fiscal year 2011 as the most elevated. Wirecard’s financial statement for the year 2011 has an F-Score of 1.34, meaning its estimated misstatement risk for that year is 1.34 times that of a randomly selected company from the population. A sustained elevation across subsequent years would have strengthened this signal. However, the F-Score declines through 2018, and the estimated FSF probability decreases over time. Given the reports of manipulation from 2015 onward (Der Spiegel, 2020; Tagesschau, 2020), the opposite pattern would be expected. Thus, the F-Score alone did not detect the Wirecard case in the most recent years.
The ensemble model from Section 3.2.2 was applied to Wirecard’s firm years using the values reported in Table 18, Table 19, Table 20 and Table 21, stored at nine-decimal precision. Predictors not used by the model were excluded; see Table 16. The remaining 23 predictors contained missing values and were imputed under the same rules as the prepared dataset (Section 3.2.1): Z-Score components (5 N/A values replaced by column averages), M-Score components (24 N/A values replaced by 1), C-Score components (12 N/A values replaced by 0), and F-Score components (18 N/A values replaced by column averages). Table 22 lists, for each firm year, the model’s predicted class and the number of imputed predictors.
Predictions with more imputed predictors should be interpreted with greater caution. Thus, the 2011–2018 results are most informative, as they rely almost entirely on Wirecard’s reported values. The only imputed predictor is the M-Score’s GMI, which was set to 1.00 from 2011 to 2018. Setting it to 0.00 still returned the same results. Changing it to 2.00 additionally flagged 2004 and 2013 as fraudulent.
The model flagged 2014–2016 as fraudulent, aligning with the misstatement reports for 2015 onwards (Der Spiegel, 2020; Tagesschau, 2020). 2017 and 2018 were not flagged (false negatives). The earlier flags in 2000, 2005, 2006, and 2007 should be interpreted cautiously due to the higher imputation counts. In total, 7 of the 17 firm years were flagged as fraudulent. In a monitoring context, such signals could help authorities prioritize cases for further review. Accordingly, H2 is supported: the model correctly identified multiple years in the manipulation window and detected the Wirecard AG accounting scandal of 2020 using Wirecard’s financial statement data.

5. Discussion

This study developed an ML model for detecting FSF from financial ratios. Financial data of 2,014,827 firm years between 1988–2019 was analyzed and matched to 1046 fraudulent firm years. TPOT was used to select an optimized pipeline, yielding an ensemble combining gradient boosting and k-NN. On the test set, the model correctly classified 82.03% of the fraudulent firm years and 89.88% of the non-fraudulent firm years. Applied to Wirecard AG, 7 of the 17 firm years were classified as fraudulent. The results support H1 and H2.
Interpreting H1: Relative to the 78.37% mean accuracy across 84 algorithms reported in the literature review (Section 2.3), the performance on this study’s held-out test set is higher and was obtained on a balanced sample derived from a larger global dataset (2,014,827 firm years, 12 countries) using company-level grouped splits. The improvement is plausibly attributable to the model treating each score’s components as continuous signals, which lets the classifier adjust decision boundaries when the original thresholds do not generalize perfectly across different companies. The gradient-boosting component can capture global interactions among ratio families (accrual quality and receivables growth), while k-NN contributes local structure, yielding robust patterns across companies in the test set.
Interpreting H2: Individually, none of the four scores clearly detects Wirecard’s fraud. A plausible reason is that their cutoffs, calibrated on earlier samples, may not generalize to digital business models. Each score targets specific patterns its authors deemed indicative of misstatement, which may not align with Wirecard’s misstatement approach. By contrast, the ML model integrates the scores as continuous features and learns decision boundaries directly from the labeled data, rather than relying on fixed thresholds that the authors have set. Accordingly, the learned boundaries may differ from the original cutoffs and yield better performance in this sample.
This study contributes to a cross-country FSF analysis by covering data from 12 countries, extending generalizability beyond U.S. samples. To the authors’ knowledge, it is among the most extensive ratio-based FSF analyses to date (2,014,827 firm years; 12 countries; see Section 2.3). The test-set design evaluates generalization across companies within a cross-country dataset using company-level grouped splits. These results do not claim invariance to settings markedly outside that universe (e.g., industries, jurisdictions, or reporting practices that are not well-represented), for which additional validation would be warranted.
Implications for the theory and further FSF detection research: The results reinforce that accounting fraud mechanisms leave footprints in financial ratios that are learnable at scale. Signals associated with accrual manipulation and performance pressure are informative across firms and jurisdictions in our sample. Treating these signals as continuous aligns with the idea that incentives and opportunities (e.g., to capitalize costs, smooth earnings, or meet covenants) guide managers’ decisions. The cross-firm generalization we observe supports the view that fraud-related distortions in receivables, inventory dynamics, and soft-asset intensity can be combined to form a stable risk model within the applied dataset. This adds a large sample, cross-country analysis to prior studies that document similar mechanisms in single-country settings.
The results have practical implications: the model can serve as a screening tool for FSF risk for multiple stakeholders. (1) Banks and other credit institutions may use the model to assess other companies’ creditworthiness before lending or use it as an ongoing early-warning indicator in portfolio surveillance. (2) Auditing firms such as KPMG, Deloitte, PwC, and EY could integrate the model into their audit and certification processes. Since their primary function is to validate financial statements, this might represent one of the most promising applications. (3) Regulators and public authorities, like the European Central Bank (ECB), BaFin, and the Federal Reserve (FED), may apply the model when supervising and controlling activities among financial market participants. (4) Investors such as venture capital and private equity funds, institutional investors, and individuals can use the model during due diligence to mitigate investment risk. Across these stakeholders, the model can enhance decision-making by identifying companies that may engage in fraudulent accounting before major financial or regulatory decisions are made.

5.1. Limitations

Several limitations merit consideration. The training data comprises fraudulent and random firm years rather than verified non-fraudulent cases. Although selection criteria were applied to ensure quality, some firm years labeled as non-fraudulent may include undetected fraud.
Labels for the fraudulent class may also be imperfect. Although the BQL query requested the earliest available statement from Bloomberg, some misstated statements might never have been published. Subsequent restated (correct) statements would then be captured instead. For example, as of January 2021, FY2019 financials for Wirecard were not yet published (Wirecard AG, 2019). Thus, the analysis assumes that the earliest Bloomberg record reflects the misstated values.
Similarly, some firm years labeled non-fraudulent may later have been legitimately restated. Such valid restatements would not be included in the requested data. Therefore, labels for both classes may contain errors. This risk of imperfect class labels is typical of FSF detection research and reflects a restriction on the observable ground truth.
Implementation of the four scores required adaptations. The F-Score issue variable could not be queried and was set to 1 for all firm years, and 3 of the 26 inputs were excluded from training due to low data availability. Thus, the scores could not be implemented exactly as in the literature.
Some scores target specific industries or company types. For example, the Z-Score was developed for capital-intensive manufacturing firms and is less suited to non-manufacturers. Nevertheless, it was applied to the non-manufacturing company Wirecard.
Cross-country heterogeneity is another limitation. While prior studies often analyze a single jurisdiction, this study implicitly treats financial statements across 12 countries as comparable, despite variations in accounting and reporting standards. Differences in reporting standards and disclosure requirements may limit the comparability of firms across jurisdictions.
Data quality and availability remain central constraints in FSF research. Even with access to U.S. FSF cases through 31 December 2018, only 51.50% of identified fraudulent firm years could be used (see Table 11) due to the limited availability of the financial data.
Class imbalance and undersampling: Because the rate of fraudulent firm years is very low in the raw data, a 1:1 balanced sample (1046 fraudulent and 1046 non-fraudulent) was created by undersampling the non-fraudulent class. This design prevents trivial majority-class predictions but loses a large part of the predictive value of the 2,014,827 available firm years, and results on a balanced test set may not directly translate to populations with very low fraud prevalence. In such settings, precision and threshold calibration warrant further validation.
Reporting granularity varies by jurisdiction. For many U.S. cases, only one or two quarters were misstated, but the analysis treats the full fiscal year as misstated to ensure comparability with countries reported at the annual level. This convention may overstate the misstatement for some U.S. firm years but preserves a consistent unit of analysis. A quarterly-level design, where data availability permits, is a valuable approach for future work. However, mixing quarterly and annual periods within a single model would introduce comparability trade-offs across firm years.
Finally, the Wirecard case analysis requires caution. The model flags 7 of the 17 available firm years, including years within the manipulation window from 2015 onwards, but not 2017–2018. These signals are indicative rather than evidentiary. In practice, such outputs would be suited for screening and prioritization, not for proving FSF.

5.2. Future Research

Future work should reduce the cross-country imbalance. Although this study includes 12 countries, the distribution of firm years is skewed toward the U.S., limiting generalizability for non-U.S. firms.
This study only examines FSF detection using ratio-based ML. Future research should also incorporate text-derived signals (e.g., MD&A sections, audit opinions, and earnings-call transcripts) and evaluate multimodal models that fuse textual and ratio features. While prior work has explored textual data, it is still a comparatively unexplored area of FSF research.
The extreme class imbalance required undersampling the 2,014,827 firm years to a 1:1 ratio (1046 fraudulent and 1046 non-fraudulent). To leverage the full dataset, future research should use imbalance-robust algorithms, such as biased-penalty SVMs, class-weighted gradient boosting, focal-loss boosting, and anomaly-detection baselines.
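As one example of such an imbalance-robust setup, per-observation weights can emulate class weighting in scikit-learn's gradient boosting without discarding any firm years (a sketch on synthetic data; all figures are invented):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
# Heavily imbalanced toy data: 980 non-fraud vs. 20 fraud firm years.
X = np.vstack([rng.normal(0.0, 1.0, size=(980, 4)),
               rng.normal(1.2, 1.0, size=(20, 4))])
y = np.array([0] * 980 + [1] * 20)

# Weight each observation inversely to its class frequency so that both
# classes contribute equally to the boosting loss, without undersampling.
weights = np.where(y == 1,
                   len(y) / (2 * y.sum()),
                   len(y) / (2 * (y == 0).sum()))
model = GradientBoostingClassifier(random_state=0)
model.fit(X, y, sample_weight=weights)
print("in-sample fraud recall:", (model.predict(X)[y == 1] == 1).mean())
```
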

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jrfm18110605/s1, File S1: bql_query.txt (BQL queries used for financial data acquisition); File S2: ml_code.py (Python script for model training and evaluation).

Author Contributions

Conceptualization, L.S.; methodology, L.S.; software, L.S.; validation, L.S.; formal analysis, L.S.; investigation, L.S.; resources, L.S.; data curation, L.S.; writing—original draft preparation, L.S.; writing—review and editing, L.S.; visualization, L.S.; supervision, E.L.; project administration, L.S.; funding acquisition, L.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by DZ BANK AG, Frankfurt am Main, Germany, which covered the data acquisition costs and the article processing charge. No grant number was assigned.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets presented in this article are not readily available because the underlying financial-statement data was obtained from Bloomberg L.P. under an institutional license and cannot be publicly shared. Researchers with access to a Bloomberg Terminal can reproduce the dataset using the BQL queries provided in the Supplementary Materials.

Acknowledgments

I thank Michael Kopmann for his guidance and for making this research possible, Christian Kahler for initially drawing my attention to the Beneish M-Score, and Simon Farshid for suggesting a solution to query the Bloomberg equities universe.

Conflicts of Interest

Author L.S. has been involved as an employee in the company DZ BANK AG. Otherwise, the authors declare no conflicts of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.

Abbreviations

AAER: Accounting and Auditing Enforcement Release
ANN: Artificial Neural Network
AUC: Area under the receiver operating characteristic curve
Auto-ML: Automatic Machine Learning
AQI: Asset Quality Index
BaFin: Bundesanstalt für Finanzdienstleistungsaufsicht
BBN: Bayesian Belief Network
BLR: Backward Logistic Regression
BLS: U.S. Bureau of Labor Statistics
BP-SVM: Biased-Penalty Linear Support Vector Machine
BQL: Bloomberg Query Language
CART: Classification and Regression Tree
CCE: Cash and cash equivalents
ch_cs: Change in Cash Sales
ch_inv: Change in Inventory
ch_rec: Change in Accounts Receivable
ch_roa: Change in Return on Assets
CV: Cross-validation
DA: Discriminant Analysis
DAX: Deutscher Aktienindex
DEPI: Depreciation Index
DSO: Days Sales Outstanding
DSI: Growing Days Sales of Inventory
DSRI: Days Sales in Receivables Index
DT: Decision Tree
DWD: Distance Weighted Discrimination
e.g.: “exempli gratia”/for example
EBIT: Earnings before interest and taxes
EC: Evolutionary Computation
ECB: European Central Bank
FED: Federal Reserve
FSF: Financial Statement Fraud
FLR: Forward Logistic Regression
GA: Genetic Algorithm
GBRT: Gradient Boosted Regression Trees
GLM: Generalized Linear Models
GMDH: Group Method Data Handling
GMI: Gross Margin Index
GP: Genetic Programming
IAS: International Accounting Standard
IASB: International Accounting Standards Board
ID3: Iterative Dichotomiser 3
IFRS: International Financial Reporting Standards
ISIN: International Securities Identification Number
ISYDNN: Incremental sum-of-years’-digit Weighted Average Neural Network
k-NN: k-Nearest Neighbors
LDA: Linear Discriminant Analysis
LMM: Large-Margin Methods
LR: Logistic Regression
LTD: Long-Term Debt
LVGI: Leverage Index
MDA: Multiple Discriminant Analysis
MLP: Multilayer Perceptron
N/A: Not available
PNN: Probabilistic Neural Network
PPE: Property, Plant and Equipment
PR: Probability Unit Regression
prob_FSF: Probability of Financial Statement Fraud
PSYDNN: Plain sum-of-years’-digit Weighted Average Neural Network
RBF: Radial Basis Function Neural Network
RF: Random Forest
rsst_acc: RSST Accrual
SEC: United States Securities and Exchange Commission
SG&A: Selling, General and Administrative Expenses
SGAI: Sales General and Administrative Expenses Index
SGI: Sales Growth Index
SHAP: SHapley Additive exPlanations
SPCNN: Simple Percentage Change Neural Network
SVM: Support Vector Machine
TATA: Total Accruals to Total Assets
TM: Text Mining
TPOT: Tree-based Pipeline Optimization Tool
TTF: Task-Technology Fit
U.S.: United States
USD: United States Dollar
US GAAP: United States Generally Accepted Accounting Principles
∆FIN: Change in net financial assets
∆NCO: Change in net non-current operating assets
∆WC: Change in non-cash working capital

References

  1. Abbasi, A., Albrecht, C., Vance, A., & Hansen, J. (2012). MetaFraud: A meta-learning framework for detecting financial fraud. MIS Quarterly, 36(4), 1293–1327. [Google Scholar] [CrossRef]
  2. Achmad, T., Ghozali, I., Helmina, M. R. A., Hapsari, D. I., & Pamungkas, I. D. (2023). Detecting fraudulent financial reporting using the fraud hexagon model: Evidence from the banking sector in Indonesia. Economies, 11(1), 5. [Google Scholar] [CrossRef]
  3. Aghghaleh, S. F., Mohamed, Z. M., & Rahmat, M. M. (2016). Detecting financial statement frauds in Malaysia: Comparing the abilities of Beneish and Dechow models. Asian Journal of Accounting and Governance, 7, 57–65. [Google Scholar] [CrossRef]
  4. Albizri, A., Appelbaum, D., & Rizzotto, N. (2019). Evaluation of financial statements fraud detection research: A multi-disciplinary analysis. International Journal of Disclosure and Governance, 16(4), 206–241. [Google Scholar] [CrossRef]
  5. Alderman, L., & Schuetze, C. F. (2020, June 26). In a German tech giant’s fall, charges of lies, spies and missing billions. The New York Times. Available online: https://web.archive.org/web/20210131232425/https://www.nytimes.com/2020/06/26/business/wirecard-collapse-markus-braun.html (accessed on 28 January 2021).
  6. Ali, A. A., Khedr, A. M., El-Bannany, M., & Kanakkayil, S. (2023). A powerful predicting model for financial statement fraud based on optimized XGBoost ensemble learning technique. Applied Sciences, 13(4), 2272. [Google Scholar] [CrossRef]
  7. Altman, E. I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. The Journal of Finance, 23(4), 589–609. [Google Scholar] [CrossRef]
  8. Altman, E. I., Danovi, A., & Falini, A. (2013). Z-score models’ application to Italian companies subject to extraordinary administration. Journal of Applied Finance, 23(1), 1–10. Available online: https://ssrn.com/abstract=2275390 (accessed on 25 October 2025).
  9. Altman, E. I., Hartzell, J., & Peck, M. (1998). Emerging market corporate bond—A scoring system. In R. M. Levich (Ed.), Emerging market capital flows (Vol. 2, pp. 391–400). Springer. [Google Scholar] [CrossRef]
  10. Amiram, D., Bozanic, Z., Cox, J. D., Dupont, Q., Karpoff, J. M., & Sloan, R. (2018). Financial reporting fraud and other forms of misconduct: A multidisciplinary review of the literature. Review of Accounting Studies, 23(2), 732–783. [Google Scholar] [CrossRef]
  11. Anh, N., & Linh, N. (2016). Using the M-score model in detecting earnings management: Evidence from non-financial Vietnamese listed companies. VNU Journal of Science: Economics and Business, 32(2), 14–23. Available online: https://js.vnu.edu.vn/EAB/article/view/1287 (accessed on 25 October 2025).
  12. Anjum, S. (2012). Business bankruptcy prediction models: A significant study of the Altman’s Z-score model. Asian Journal of Management Research, 3(1), 212–219. [Google Scholar] [CrossRef]
  13. Appelbaum, D. (2016). Securing big data provenance for auditors: The big data provenance black box as reliable evidence. Journal of Emerging Technologies in Accounting, 13(1), 17–36. [Google Scholar] [CrossRef]
  14. Appelbaum, D., Kogan, A., & Vasarhelyi, M. A. (2017). Big data and analytics in the modern audit engagement: Research needs. AUDITING: A Journal of Practice & Theory, 36(4), 1–27. [Google Scholar] [CrossRef]
  15. Bach, F. R., Heckerman, D., & Horvitz, E. (2006). Considering cost asymmetry in learning classifiers. Journal of Machine Learning Research, 7(63), 1713–1741. Available online: https://www.jmlr.org/papers/v7/bach06a.html (accessed on 25 October 2025).
  16. Bai, B., Yen, J., & Yang, X. (2008). False financial statements: Characteristics of China’s listed companies and CART detecting approach. International Journal of Information Technology & Decision Making, 7(2), 339–359. [Google Scholar] [CrossRef]
  17. Barton, J., & Simko, P. J. (2002). The balance sheet as an earnings management constraint. The Accounting Review, 77(s-1), 1–27. [Google Scholar] [CrossRef]
  18. Beneish, M. D. (1999). The detection of earnings manipulation. Financial Analysts Journal, 55(5), 24–36. [Google Scholar] [CrossRef]
  19. Beneish, M. D., Lee, C. M. C., & Nichols, D. C. (2012). Fraud detection and expected returns. SSRN Electronic Journal. [Google Scholar] [CrossRef]
  20. Beneish, M. D., Lee, C. M. C., & Nichols, D. C. (2013). Earnings manipulation and expected returns. Financial Analysts Journal, 69(2), 57–82. [Google Scholar] [CrossRef]
  21. Bertomeu, J., Cheynel, E., Floyd, E., & Pan, W. (2018). Ghost in the machine: Using machine learning to uncover hidden misstatements. Semantic Scholar, 1–32. Available online: https://api.semanticscholar.org/CorpusID:53965228 (accessed on 25 October 2025).
  22. Bertomeu, J., Cheynel, E., Floyd, E., & Pan, W. (2019). Using machine learning to detect misstatements. Review of Accounting Studies. Available online: https://ssrn.com/abstract=3496297 (accessed on 25 October 2025).
  23. Cecchini, M., Aytug, H., Koehler, G. J., & Pathak, P. (2010a). Detecting management fraud in public companies. Management Science, 56(7), 1146–1160. [Google Scholar] [CrossRef]
  24. Cecchini, M., Aytug, H., Koehler, G. J., & Pathak, P. (2010b). Making words work: Using financial text as a predictor of financial events. Decision Support Systems, 50(1), 164–175. [Google Scholar] [CrossRef]
  25. Cheah, P. C. Y., Yang, Y., & Lee, B. G. (2023). Enhancing financial fraud detection through addressing class imbalance using hybrid SMOTE-GAN techniques. International Journal of Financial Studies, 11(3), 110. [Google Scholar] [CrossRef]
  26. Chen, F. H., Chi, D. J., & Zhu, J. Y. (2014). Application of random forest, rough set theory, decision tree and neural network to detect financial statement fraud—Taking corporate governance into consideration. In D.-S. Huang, V. Bevilacqua, & P. Premaratne (Eds.), International conference on intelligent computing (Vol. 8588, pp. 221–234). Springer International Publishing. [Google Scholar] [CrossRef]
  27. Chen, S., Goo, Y.-J. J., & Shen, Z.-D. (2014). A hybrid approach of stepwise regression, logistic regression, support vector machine, and decision tree for forecasting fraudulent financial statements. The Scientific World Journal, 2014(1), 968712. [Google Scholar] [CrossRef] [PubMed]
  28. Chen, Y., & Wu, Z. (2023). Financial fraud detection of listed companies in China: A machine learning approach. Sustainability, 15(1), 105. [Google Scholar] [CrossRef]
  29. Cheng, C.-H., Kao, Y.-F., & Lin, H.-P. (2021). A financial statement fraud model based on synthesized attribute selection and a dataset with missing values and imbalanced classes. Applied Soft Computing, 108, 107487. [Google Scholar] [CrossRef]
  30. Cressey, R. D. (1953). Other people’s money; A study of the social psychology of embezzlement. Free Press. [Google Scholar]
  31. Dechow, P. M., Ge, W., Larson, C. R., & Sloan, R. G. (2011). Predicting material accounting misstatements. Contemporary Accounting Research, 28(1), 17–82. [Google Scholar] [CrossRef]
  32. Dechow, P. M., Sloan, R. G., & Sweeney, A. P. (1995). Detecting earnings management. The Accounting Review, 70(2), 193–225. Available online: https://www.jstor.org/stable/248303 (accessed on 25 October 2025).
  33. Der Spiegel. (2020, July 22). Ex-Wirecard-Chef Braun erneut verhaftet [Former Wirecard CEO Braun arrested again]. Available online: https://www.spiegel.de/wirtschaft/unternehmen/staatsanwaltschaft-laesst-ex-wirecard-chef-braun-erneut-verhaften-a-d361846e-b33f-4936-8945-03243a1aef2d (accessed on 24 January 2021).
  34. Desjardins, J. (2019, June 25). The 20 biggest bankruptcies in U.S. history. Visual Capitalist. Available online: https://www.visualcapitalist.com/the-20-biggest-bankruptcies-in-u-s-history/ (accessed on 22 December 2020).
  35. Dong, W., Liao, S., & Zhang, Z. (2018). Leveraging financial social media data for corporate fraud detection. Journal of Management Information Systems, 35(2), 461–487. [Google Scholar] [CrossRef]
  36. Dong, W., Liao, S. S., Fang, B., Cheng, X., Zhu, C., & Fan, W. (2014, July 13). The detection of fraudulent financial statements: An integrated language model. Pacific Asia Conference on Information Systems, Chengdu, China. [Google Scholar]
  37. Eidleman, G. J. (1995). Z scores—A guide to failure prediction. The CPA Journal, 65(2), 52–54. Available online: http://archives.cpajournal.com/old/16641866.htm (accessed on 25 October 2025).
  38. EpistasisLab. (2021a). Computational Genetics Laboratory (CGL). Available online: http://epistasis.org/ (accessed on 23 January 2021).
  39. EpistasisLab. (2021b). TPOT. Available online: http://epistasislab.github.io/tpot/ (accessed on 23 January 2021).
  40. Eurostat. (2017). Business survival rates in selected European countries in 2017, by length of survival. Available online: https://www.statista.com/statistics/1114070/eu-business-survival-rates-by-country-2017/ (accessed on 8 October 2020).
  41. Financial Accounting Standards Board (FASB). (2010). US GAAP—Receivables (Topic 310). Available online: https://storage.fasb.org/ASU%202010-XX%20Receivables%20(Topic%20310)%20Disclosures%20about%20the%20Credit%20Quality%20of%20Financing%20Receivables.pdf (accessed on 25 October 2025).
  42. Financial Accounting Standards Board (FASB). (2015). Inventory (Topic 330). Available online: https://storage.fasb.org/ASU%202015-11.pdf (accessed on 25 October 2025).
  43. Gaganis, C. (2009). Classification techniques for the identification of falsified financial statements: A comparative analysis. Intelligent Systems in Accounting, Finance & Management, 16(3), 207–229. [Google Scholar] [CrossRef]
  44. Glancy, F. H., & Yadav, S. B. (2011). A computational model for financial reporting fraud detection. Decision Support Systems, 50(3), 595–601. [Google Scholar] [CrossRef]
  45. Goel, S., Gangolly, J., Faerman, S. R., & Uzuner, O. (2010). Can linguistic predictors detect fraudulent financial filings? Journal of Emerging Technologies in Accounting, 7(1), 25–46. [Google Scholar] [CrossRef]
  46. Goel, S., & Uzuner, O. (2016). Do sentiments matter in fraud detection? Estimating semantic orientation of annual reports. Intelligent Systems in Accounting, Finance and Management, 23(3), 215–239. [Google Scholar] [CrossRef]
  47. Golec, A. (2019). Effectiveness of the Beneish model in detecting financial statement manipulations. Acta Universitatis Lodziensis. Folia Oeconomica, 2(341), 161–182. [Google Scholar] [CrossRef]
  48. Gomber, P., Kauffman, R. J., Parker, C., & Weber, B. (2017). On the fintech revolution: Interpreting the forces of innovation, disruption and transformation in financial services. Journal of Management Information Systems, 35(1), 220–265. Available online: https://ssrn.com/abstract=3190052 (accessed on 25 October 2025). [CrossRef]
  49. Goodhue, D. L., & Thompson, R. L. (1995). Task-technology fit and individual performance. Management Information Systems Quarterly, 19(2), 213–236. [Google Scholar] [CrossRef]
  50. Granville, K. (2020, June 19). Wirecard, a payments firm, is rocked by a report of a missing $2 billion. The New York Times. Available online: https://web.archive.org/web/20201219112713/https://www.nytimes.com/2020/06/19/business/wirecard-scandal.html/ (accessed on 20 December 2020).
  51. Gray, A. (2020, December 17). Luckin Coffee to pay $180m in accounting fraud settlement. Financial Times. Available online: https://www.ft.com/content/4db3b074-829f-4f1c-a256-11c7e28a31d1 (accessed on 22 December 2020).
  52. Green, B. P., & Choi, J. H. (1997). Assessing the risk of management fraud through neural network technology. Auditing: A Journal of Practice & Theory, 16(1), 14–28. Available online: https://www.researchgate.net/publication/245508224_Assessing_the_Risk_of_Management_Fraud_Through_Neural_Network_Technology (accessed on 25 October 2025).
  53. Hajek, P., & Henriques, R. (2017). Mining corporate annual reports for intelligent detection of financial statement fraud—A comparative study of machine learning methods. Knowledge-Based Systems, 128, 139–152. [Google Scholar] [CrossRef]
  54. Healy, P. M. (1985). The effect of bonus schemes on accounting decisions. Journal of Accounting and Economics, 7(1–3), 85–107. [Google Scholar] [CrossRef]
  55. Healy, P. M., & Wahlen, J. M. (1999). A review of the earnings management literature and its implications for standard setting. Accounting Horizons, 13(4), 365–383. [Google Scholar] [CrossRef]
  56. Hevner, A. R., March, S. T., Park, J., & Ram, S. (2004). Design science in information systems research. Management Information Systems Quarterly, 28(1), 75–106. [Google Scholar] [CrossRef]
  57. Hoberg, G., & Lewis, C. (2017). Do fraudulent firms produce abnormal disclosure? Journal of Corporate Finance, 43, 58–85. [Google Scholar] [CrossRef]
  58. Hoogs, B., Kiehl, T., Lacomb, C., & Senturk, D. (2007). A genetic algorithm approach to detecting temporal patterns indicative of financial statement fraud. Intelligent Systems in Accounting, Finance and Management, 15(1–2), 41–56. [Google Scholar] [CrossRef]
  59. Huang, L., Abrahams, A., & Ractham, P. (2022). Enhanced financial fraud detection using cost-sensitive cascade forest with missing value imputation. Intelligent Systems in Accounting, Finance and Management, 29(3), 133–155. [Google Scholar] [CrossRef]
  60. Huang, S., & Liang, X. (2013). Fraud detection model by using support vector machine techniques. International Journal of Digital Content Technology and Its Applications, 15(1), 32–37. [Google Scholar] [CrossRef]
  61. Humpherys, S. L., Moffitt, K. C., Burns, M. B., Burgoon, J. K., & Felix, W. F. (2011). Identification of fraudulent financial statements using linguistic credibility analysis. Decision Support Systems, 50(3), 585–594. [Google Scholar] [CrossRef]
  62. Hung, D. N., Ha, H. T. V., & Binh, D. T. (2017). Application of F-score in predicting fraud, errors: Experimental research in Vietnam. International Journal of Accounting and Financial Reporting, 7(2), 303–322. [Google Scholar] [CrossRef]
  63. Hylas, R. E., & Ashton, R. H. (1982). Audit detection of financial statement errors. The Accounting Review, 57(4), 751–765. Available online: https://www.jstor.org/stable/247410 (accessed on 25 October 2025).
  64. Jones, M. J. (2011). Creative accounting, fraud and international accounting scandals. John Wiley & Sons. [Google Scholar] [CrossRef]
  65. Kim, W., & Kim, S. (2025). Enhancing corporate transparency: AI-based detection of financial misstatements in Korean firms using NearMiss sampling and explainable models. Sustainability, 17(19), 8933. [Google Scholar] [CrossRef]
  66. Kim, Y. J., Baik, B., & Cho, S. (2016). Detecting financial misstatements with fraud intention using multi-class cost-sensitive learning. Expert Systems with Applications, 62, 32–43. [Google Scholar] [CrossRef]
  67. Kirkos, E., Spathis, C., & Manolopoulos, Y. (2007). Data mining techniques for the detection of fraudulent financial statements. Expert Systems with Applications, 32(4), 995–1003. [Google Scholar] [CrossRef]
  68. Kotsiantis, S., Koumanakos, E., Tzelepis, D., & Tampakas, V. (2006). Forecasting fraudulent financial statements using data mining. International Journal of Computational Intelligence, 3(2), 104–110. Available online: https://www.researchgate.net/publication/228084523_Forecasting_fraudulent_financial_statements_using_data_mining (accessed on 25 October 2025).
  69. Le, T. T., Fu, W., & Moore, J. H. (2020). Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics, 36(1), 250–256. [Google Scholar] [CrossRef] [PubMed]
  70. Lev, B., & Thiagarajan, S. R. (1993). Fundamental information analysis. Journal of Accounting Research, 31(2), 190–215. [Google Scholar] [CrossRef]
  71. Li, B., Yu, J., Zhang, J., & Ke, B. (2016). Detecting accounting frauds in publicly traded U.S. firms: A machine learning approach. Asian Conference on Machine Learning, 45, 173–188. [Google Scholar]
  72. Li, X., Xu, W., & Tian, X. (2014). How to protect investors? A GA-based DWD approach for financial statement fraud detection. IEEE International Conference on Systems, Man and Cybernetics, 3548–3554. [Google Scholar] [CrossRef]
  73. Lin, C. C., Chiu, A. A., Huang, S. Y., & Yen, D. C. (2015). Detecting the financial statement fraud: The analysis of the differences between data mining techniques and experts’ judgments. Knowledge-Based Systems, 89, 459–470. [Google Scholar] [CrossRef]
  74. Liu, C., Chan, Y., Alam Kazmi, S. H., & Fu, H. (2015). Financial fraud detection model: Based on random forest. International Journal of Economics and Finance, 7(7), 178–188. [Google Scholar] [CrossRef]
  75. Liu, W., Wang, Z., & Zhang, X. (2025). Research on financial fraud detection by integrating latent semantic features of annual report text with accounting indicators. Journal of Accounting & Organizational Change. [Google Scholar] [CrossRef]
  76. Maccarthy, J. (2017). Using Altman Z-score and Beneish M-score models to detect financial fraud and corporate failure: A case study of Enron corporation. International Journal of Finance and Accounting, 6(6), 159–166. Available online: http://article.sapub.org/10.5923.j.ijfa.20170606.01.html (accessed on 25 October 2025).
  77. Mahama, M. (2015). Detecting corporate fraud and financial distress using the Altman and Beneish models. International Journal of Economics, Commerce and Management, 3(1), 1–18. Available online: https://ijecm.co.uk/wp-content/uploads/2015/01/3159.pdf (accessed on 25 October 2025).
  78. Mccain, T. (2017). Montier C Score—Who is Cooking the Books? Available online: https://web.archive.org/web/20220516114102/https://www.equitieslab.com/montier-c-score/ (accessed on 3 September 2025).
  79. Montier, J. (2008). Cooking the books, or, more sailing under the black flag (pp. 1–8). Société Générale Mind Matters. [Google Scholar]
  80. Ngai, E. W. T., Hu, Y., Wong, Y. H., Chen, Y., & Sun, X. (2011). The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems, 50(3), 559–569. [Google Scholar] [CrossRef]
  81. Nugroho, D. S., & Diyanty, V. (2022). Hexagon fraud in fraudulent financial statements: The moderating role of audit committee. Jurnal Akuntansi Dan Keuangan Indonesia, 19(1), 46–67. [Google Scholar] [CrossRef]
  82. Olson, R. S., Bartley, N., Urbanowicz, R. J., & Moore, J. H. (2016a). Evaluation of a tree-based pipeline optimization tool for automating data science. In GECCO ‘16: Proceedings of the 2016 genetic and evolutionary computation conference (GECCO 2016) (pp. 485–492). Association for Computing Machinery. [Google Scholar] [CrossRef]
  83. Olson, R. S., Urbanowicz, R. J., Andrews, P. C., Lavender, N. A., Kidd, L. C., & Moore, J. H. (2016b). Automating biomedical data science through tree-based pipeline optimization. In Applications of evolutionary computation (Vol. 9597, pp. 123–137). Springer International Publishing. [Google Scholar] [CrossRef]
  84. Omar, N., Johari, Z. A., & Smith, M. (2017). Predicting fraudulent financial reporting using artificial neural network. Journal of Financial Crime, 24(2), 362–387. [Google Scholar] [CrossRef]
  85. Pai, P. F., Hsu, M. F., & Wang, M. C. (2011). A support vector machine-based model for detecting top management fraud. Knowledge-Based Systems, 24(2), 314–321. [Google Scholar] [CrossRef]
  86. Perols, J. L., & Lougee, B. A. (2011). The relation between earnings management and financial statement fraud. Advances in Accounting, 27(1), 39–53. [Google Scholar] [CrossRef]
  87. Persons, O. S. (1995). Using financial statement data to identify factors associated with fraudulent financial reporting. Journal of Applied Business Research, 11(3), 38–46. [Google Scholar] [CrossRef]
  88. Price, R. A., III, Sharp, N. Y., & Wood, D. A. (2011). Detecting and predicting accounting irregularities: A comparison of commercial and academic risk measures. Accounting Horizons, 25(4), 755–780. [Google Scholar] [CrossRef]
  89. Purda, L., & Skillicorn, D. (2015). Accounting variables, deception, and a bag of words: Assessing the tools of fraud detection. Contemporary Accounting Research, 32(3), 1193–1223. [Google Scholar] [CrossRef]
  90. Rahul, K., Seth, N., & Dinesh Kumar, U. (2018). Spotting earnings manipulation: Using machine learning for financial fraud detection. International Conference on Innovative Techniques and Applications of Artificial Intelligence, 11311, 343–356. [Google Scholar] [CrossRef]
  91. Ravisankar, P., Ravi, V., Raghava Rao, G., & Bose, I. (2011). Detection of financial statement fraud and feature selection using data mining techniques. Decision Support Systems, 50(2), 491–500. [Google Scholar] [CrossRef]
  92. Rezaee, Z. (2005). Causes, consequences, and deterence of financial statement fraud. Critical Perspectives on Accounting, 16(3), 277–298. [Google Scholar] [CrossRef]
  93. Richardson, S. A., Sloan, R. G., Soliman, M. T., & Tuna, I. (2005). Accrual reliability, earnings persistence and stock prices. Journal of Accounting and Economics, 39(3), 437–485. [Google Scholar] [CrossRef]
  94. Schuchter, A., & Levi, M. (2016). The fraud triangle revisited. Security Journal, 29(2), 107–121. [Google Scholar] [CrossRef]
  95. scikit-learn. (2021a). GradientBoostingClassifier. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html (accessed on 24 January 2021).
  96. scikit-learn. (2021b). KNeighborsClassifier. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html (accessed on 24 January 2021).
  97. Sharma, A., & Kumar Panigrahi, P. (2012). A review of financial accounting fraud detection based on data mining techniques. International Journal of Computer Applications, 39(1), 37–47. [Google Scholar] [CrossRef]
  98. Song, X. P., Hu, Z. H., Du, J. G., & Sheng, Z. H. (2014). Application of machine learning methods to risk assessment of financial statement fraud: Evidence from China. Journal of Forecasting, 33(8), 611–626. [Google Scholar] [CrossRef]
  99. Sukmadilaga, C., Winarningsih, S., Handayani, T., Herianti, E., & Ghani, E. K. (2022). Fraudulent financial reporting in ministerial and governmental institutions in Indonesia: An analysis using hexagon theory. Economies, 10(4), 86. [Google Scholar] [CrossRef]
  100. Tagesschau. (2020, July 22). Neue Haftbefehle gegen Wirecard-Manager [New arrest warrants for Wirecard managers]. Available online: https://web.archive.org/web/20210305135753/https://www.tagesschau.de/wirtschaft/wirecard-staatsanwaltschaft-103.html (accessed on 2 September 2025).
  101. Tarjo, & Herawati, N. (2015). Application of Beneish M-score models and data mining to detect financial fraud. Procedia—Social and Behavioral Sciences, 211, 924–930. [Google Scholar] [CrossRef]
  102. Throckmorton, C. S., Mayew, W. J., Venkatachalam, M., & Collins, L. M. (2015). Financial fraud detection using vocal, linguistic and financial cues. Decision Support Systems, 74, 78–87. [Google Scholar] [CrossRef]
  103. Toms, S. (2019). Financial scandals: A historical overview. Accounting and Business Research, 49(5), 477–499. [Google Scholar] [CrossRef]
  104. U.S. Bureau of Labor Statistics. (2020). Survival of private sector establishments by opening year. Available online: https://www.bls.gov/bdm/us_age_naics_00_table7.txt (accessed on 8 October 2020).
  105. USC Marshall School of Business. (2020). Accounting and auditing enforcement release (AAER) dataset. Available online: https://sites.google.com/usc.edu/aaerdataset/ (accessed on 3 September 2025).
  106. U.S. Securities and Exchange Commission. (n.d.). Accounting and auditing enforcement releases (AAER). Available online: https://www.sec.gov/divisions/enforce/friactions.shtml (accessed on 21 January 2021).
  107. Vousinas, G. L. (2019). Advancing theory of fraud: The S.C.O.R.E. model. Journal of Financial Crime, 26(1), 372–381. [Google Scholar] [CrossRef]
  108. Wang, G., Ma, J., & Chen, G. (2023). Attentive statement fraud detection: Distinguishing multimodal financial data with fine-grained attention. Decision Support Systems, 167, 113913. [Google Scholar] [CrossRef]
  109. Warshavsky, M. (2012). Analyzing earnings quality as a financial forensic tool. Financial Valuation and Litigation Expert Journal, 39, 16–20. Available online: https://www.scribd.com/document/199209631/Analyzing-Earnings-Quality-as-a-Financial-Forensics-Tool (accessed on 25 October 2025).
  110. West, J., & Bhattacharya, M. (2016). Intelligent financial fraud detection: A comprehensive review. Computers & Security, 57, 47–66. [Google Scholar] [CrossRef]
  111. Whiting, D. G., Hansen, J. V., McDonald, J. B., Albrecht, C., & Albrecht, W. S. (2012). Machine learning methods for detecting patterns of management fraud. Computational Intelligence, 28(4), 505–527. [Google Scholar] [CrossRef]
  112. Wirecard AG. (2019). News & publications, financial reports. Available online: https://web.archive.org/web/20201023020946/https://ir.wirecard.com/websites/wirecard/English/5000/news-_-publications.html#financialreports (accessed on 25 January 2021).
  113. Wolfe, D. T., & Hermanson, D. R. (2004). The fraud diamond: Considering the four elements of fraud. Available online: https://digitalcommons.kennesaw.edu/facpubs/1537/ (accessed on 25 October 2025).
  114. Zhang, C., Cho, S., & Vasarhelyi, M. (2022). Explainable Artificial Intelligence (XAI) in auditing. International Journal of Accounting Information Systems, 46, 100572. [Google Scholar] [CrossRef]
  115. Zhang, Z., Ma, Y., & Hua, Y. (2022). Financial fraud identification based on stacking ensemble learning algorithm: Introducing MD&A text information. Computational Intelligence and Neuroscience, 2022(1), 1780834. [Google Scholar] [CrossRef]
  116. Zhang, Z., Wang, Z., & Cai, L. (2025). Predicting financial fraud in Chinese listed companies: An enterprise portrait and machine learning approach. Pacific-Basin Finance Journal, 90, 102665. [Google Scholar] [CrossRef]
  117. Zhu, S., Ma, T., Wu, H., Ren, J., He, D., Li, Y., & Ge, R. (2025). Expanding and interpreting financial statement fraud detection using supply chain knowledge graphs. Journal of Theoretical and Applied Electronic Commerce Research, 20(1), 26. [Google Scholar] [CrossRef]
Figure 1. Number of AAERs by year (1982–2020). Source: 1982–2018 adapted from USC Marshall School of Business (2020); 2019–2020 adapted from U.S. Securities and Exchange Commission (n.d.).
Figure 2. The relationship of accrual accounting practices, earnings management, and FSF (Albizri et al., 2019, Figure 2). Colors indicate severity: green = lawful accrual accounting practices, yellow = earnings management (aggressive but within legal scope), red = financial statement fraud (unlawful). Overlap shows subset relationships.
Figure 3. ML-workflow parts automated by TPOT (Olson et al., 2016a, p. 486).
Figure 4. TPOT optimization process over 100 generations measured by the cross-validation score.
Figure 5. Flow of data in the machine-learning process. The value 1 denotes fraudulent and 0 denotes non-fraudulent (random) firm years.
Figure 6. Confusion matrix of the testing data.
Figure 7. Permutation importance of the 23 predictor variables.
Table 1. The 20 largest bankruptcies in U.S. history by assets, as of June 2019.
Company | Assets (USD bn) | Filed | Fraud Involved
1. Lehman Brothers, Inc. | 691.06 | 2008 | Yes
2. Washington Mutual, Inc. | 327.91 | 2008 | No
3. WorldCom, Inc. | 103.91 | 2002 | Yes
4. General Motors Corp. | 82.29 | 2009 | No
5. CIT Group, Inc. | 71.00 | 2009 | No
6. Pacific Gas and Electric | 71.00 | 2019 | No
7. Enron Corp. | 65.50 | 2001 | Yes
8. Conseco, Inc. | 61.39 | 2002 | Yes
9. MF Global, Inc. | 41.00 | 2011 | No
10. Chrysler LLC | 39.30 | 2009 | No
11. Thornburg Mortgage, Inc. | 36.52 | 2009 | No
12. Pacific Gas and Electric | 36.15 | 2001 | No
13. Texaco, Inc. | 34.94 | 1987 | No
14. Financial Corporation of America | 33.86 | 1988 | No
15. Refco, Inc. | 33.33 | 2005 | Yes
16. IndyMac Bancorp, Inc. | 32.73 | 2008 | No
17. Global Crossing Ltd. | 30.19 | 2002 | Yes
18. Bank of New England Corp. | 29.77 | 1991 | No
19. General Growth Properties, Inc. | 29.56 | 2009 | No
20. Lyondell Chemical Corp. | 27.39 | 2009 | No
Source: Bankruptcy data (Desjardins, 2019); fraud data (Abbasi et al., 2012, Table 4).
Table 2. Altman Z-Score interpretation.
Altman Z-Score | Interpretation
Z > 2.67 | Non-distress Zone
1.81 < Z < 2.67 | Gray Zone
Z < 1.81 | Distress Zone
Source: Adapted from Anjum (2012, p. 216).
Table 3. Beneish M-Score interpretation.
Beneish M-Score | Interpretation
M > −2.22 | Manipulation likely
M < −2.22 | Manipulation unlikely
Source: Adapted from Aghghaleh et al. (2016, p. 59) and Warshavsky (2012, p. 18).
Table 4. Dechow F-Score interpretation.
Dechow F-Score | Interpretation
F > 2.45 | High risk
F > 1.85 | Substantial risk
F > 1.00 | Above normal risk
F < 1.00 | Normal or low risk
Source: Adapted from Dechow et al. (2011, p. 63).
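The risk bands in Table 4 can be applied mechanically; the following minimal Python sketch (the function name is ours, not from Dechow et al.) maps an F-Score to its band, checking thresholds from highest to lowest:

```python
def dechow_risk_band(f_score: float) -> str:
    """Map a Dechow F-Score to the risk bands of Table 4 (Dechow et al., 2011)."""
    if f_score > 2.45:
        return "High risk"
    if f_score > 1.85:
        return "Substantial risk"
    if f_score > 1.00:
        return "Above normal risk"
    return "Normal or low risk"

# An F-Score of 1.00 corresponds to misstatement risk equal to the
# unconditional average, so values above 1.00 signal elevated risk.
print(dechow_risk_band(2.50))  # High risk
print(dechow_risk_band(0.80))  # Normal or low risk
```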
Table 5. Studies on the detection of FSF using machine learning.
Study | Fraudulent/Non-fraudulent Firm Years, Country | Supervised Classification Algorithms Used (Accuracy in Percent)
(Green & Choi, 1997) | 86/86 US | Balanced Accuracies (BA) 1: PSYDNN (70.0), SPCNN (71.2), ISYDNN (50.0)
(Kotsiantis et al., 2006) | 41/123 Greek | Stacking (95.1), C4.5 (91.2), SVM (78.7)
(Kirkos et al., 2007) | 38/38 Greek | BBN (90.3), MLP (80.0), ID3 (73.6)
(Hoogs et al., 2007) | 51/339 US | GA (90.8)
(Bai et al., 2008) | 24/124 Chinese | CART (92.5), LR (89.6)
(Gaganis, 2009) | 199/199 Greek | PNN (90.2), SVM (88.4), MLP (88.4), LDA (87.8)
(Cecchini et al., 2010a) | 132/3187 US | SVM (90.3) 2 (AUC 0.878; fraud recall 80.0%, non-fraud recall 90.6%)
(Cecchini et al., 2010b) | 61/61 US | TM + SVM (82.0), TM (75.4)
(Goel et al., 2010) | 126/622 US | TM (89.5)
(Dechow et al., 2011) | 293/79,358 US | LR (63.7)
(Ravisankar et al., 2011) | 101/101 Chinese | PNN (98.1), GP (94.1), GMDH (93.0), MLP (78.8), SVM (73.4)
(Humpherys et al., 2011) | 101/101 US | TM + C4.5 (67.3), TM + Naïve Bayes (67.3), TM + SVM (65.8)
(Glancy & Yadav, 2011) | 100 US | TM (83.9)
(Pai et al., 2011) | 25/50 Taiwanese | SVM (92.0), C4.5 (84.0), RBF (82.7), MLP (82.7)
(Whiting et al., 2012) | 114/114 US | RF (90.1), Rule Ensemble (88.2), Stochastic Gradient Boosting (86.3), LR (72.3), PR (68.6)
(S. Huang & Liang, 2013) | Taiwanese (1:4) | SVM (92.0), LR (76.0)
(Dong et al., 2014) | 26/26 Chinese | TM (78.1)
(Song et al., 2014) | 110/440 Chinese | Voting (88.9), SVM (85.5), MLP (85.1), C5.0 (78.6), LR (77.9)
(S. Chen et al., 2014) | 66/66 Taiwanese | C5.0 (85.7), LR (81.0), SVM (72.0)
(F. H. Chen et al., 2014) | 47/47 Taiwanese | C5.0 (79.0), Rough Set (78.9), MLP (67.1)
(X. Li et al., 2014) | 12/45 US | DWD (91.2), SVM (89.5), C4.5 (82.5), MLP (75.4)
(Lin et al., 2015) | 127/447 Taiwanese | MLP (92.8), CART (90.3), LR (88.5)
(C. Liu et al., 2015) | 138/160 Chinese | RF (88.0), SVM (80.2), CART (66.4), k-NN (60.1), LR (42.9)
(Purda & Skillicorn, 2015) | 1407/4708 US | TM (83.0)
(Throckmorton et al., 2015) | 41/1531 US | Audio + TM (81.0)
(Y. J. Kim et al., 2016) | 788/2156 US | LR (88.4), SVM (87.7), BBN (82.5)
(Goel & Uzuner, 2016) | 180/180 US | TM (81.8)
(Omar et al., 2017) | 75/475 Malaysian | MLP (94.9), LR (92.4)
(Bertomeu et al., 2018) | 1227/8471 US | Detection rates 3: GBRT (49.5), LR (46.2), BLR (46.8), FLR (46.8)
(Dong et al., 2018) | 64/64 US | TM + SVM (75.5), TM + ANN (63.2), TM + DT (63.1), TM + LR (54.5)
(Rahul et al., 2018) | 39/1200 Indian | AdaBoost (60.5), XGBoost (54.2), RF (60.4)
Source: Adapted from Hajek and Henriques (2017, Table 1); own work. Abbreviations listed in Table 6 and Table 7. 1 BA derived from the test-sample Type-I and Type-II error rates of Green and Choi (1997, Table 4): BA = 1 − 0.5 × (Type I + Type II). 2 Accuracy computed from reported recalls and test-set class sizes (25 fraud, 982 non-fraud): (0.80 × 25 + 0.906 × 982)/(25 + 982) = 90.3%. 3 Share of restatement firm years caught when auditing the top 1/3 highest-risk firms.
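The two footnote calculations behind Table 5 can be reproduced directly; a short Python sketch (helper names are ours):

```python
# Footnote 1: balanced accuracy from Type-I and Type-II error rates,
# BA = 1 - 0.5 * (Type I + Type II).
def balanced_accuracy(type_i: float, type_ii: float) -> float:
    return 1.0 - 0.5 * (type_i + type_ii)

# Footnote 2: overall accuracy as the class-size-weighted average of the
# per-class recalls (Cecchini et al., 2010a test set: 25 fraud, 982 non-fraud).
def accuracy_from_recalls(recall_pos: float, n_pos: int,
                          recall_neg: float, n_neg: int) -> float:
    return (recall_pos * n_pos + recall_neg * n_neg) / (n_pos + n_neg)

acc = accuracy_from_recalls(0.80, 25, 0.906, 982)
print(round(acc * 100, 1))  # 90.3
```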
Table 6. Average machine-learning accuracies in FSF literature.
Category | Algorithm | Ø | n
ANN | Probabilistic Neural Network (PNN) | 94.2 | 2
ANN | Group Method of Data Handling (GMDH) | 93.0 | 1
ANN | Multilayer Perceptron (MLP) | 82.8 | 9
ANN | Radial Basis Function Neural Network (RBF) | 82.7 | 1
BBN | Bayesian Belief Networks (BBN) | 86.4 | 2
DT | C4.5 | 85.9 | 3
DT | Classification and Regression Tree (CART) | 83.1 | 3
DT | C5.0 | 81.1 | 3
DT | Iterative Dichotomiser 3 (ID3) | 73.6 | 1
DA | Linear Discriminant Analysis (LDA) | 87.8 | 1
EC | Genetic Programming (GP) | 94.1 | 1
EC | Genetic Algorithm (GA) | 90.8 | 1
Ensemble | Stacking | 95.1 | 1
Ensemble | Voting | 88.9 | 1
Ensemble | Rule Ensemble | 88.2 | 1
Ensemble | Random Forest (RF) | 79.5 | 3
Ensemble | Boosting | 62.6 | 4
GLM | Logistic Regression (LR) | 74.4 | 11
GLM | Probit Regression (PR) | 68.6 | 1
GLM | Forward Logistic Regression (FLR) | 46.8 | 1
GLM | Backward Logistic Regression (BLR) | 46.8 | 1
k-NN | k-Nearest Neighbors (k-NN) | 60.1 | 1
LMM | Distance Weighted Discrimination (DWD) | 91.2 | 1
LMM | Support Vector Machine (SVM) | 84.5 | 11
Other | Plain sum-of-years’-digit weighted avg. (PSYDNN) | 70.0 | 1
Other | Simple Percentage Change (SPCNN) | 71.2 | 1
Other | Incremental sum-of-years’-digit weighted avg. (ISYDNN) | 50.0 | 1
Rough Set | Rough Set Classifier | 78.9 | 1
Text | Text Mining (TM) | 74.1 | 15
Abbreviations listed in Table 7.
Table 7. Average accuracies in FSF literature by category.
Category | Ø | n
Evolutionary Computation (EC) | 92.45 | 2
Discriminant Analysis (DA) | 87.80 | 1
Bayesian Belief Network (BBN) | 86.40 | 2
Artificial Neural Network (ANN) | 85.32 | 13
Large-Margin Methods (LMM) | 85.08 | 12
Decision Tree (DT) | 82.38 | 10
Rough Set | 78.90 | 1
Ensemble | 74.39 | 10
Text | 74.09 | 15
Generalized Linear Model (GLM) | 70.08 | 14
Other | 63.73 | 3
k-Nearest Neighbor (k-NN) | 60.10 | 1
Table 8. Description of fields in the financial data sample.
ID | Field | Description
1 | ID | Unique Identifier in the Bloomberg database
2 | ISIN | International Securities Identification Number
3 | Name | Name of the company
4 | Period | Fiscal year of the dataset
5 | QuerylistReference | A constructed field indicating total asset size
6 | Timestamp | Point in time the query was executed
7 | X1 | Variables used to calculate the Altman Z-Score
8 | X2 |
9 | X3 |
10 | X4 |
11 | X5 |
12 | Altman Z-Score |
13 | DSRI | Variables used to calculate the Beneish M-Score
14 | GMI |
15 | AQI |
16 | SGI |
17 | DEPI |
18 | SGAI |
19 | LVGI |
20 | TATA |
21 | Beneish M-Score |
22 | C1 | Variables used to calculate the Montier C-Score
23 | C2 |
24 | C3 |
25 | C4 |
26 | C5 |
27 | C6 |
28 | Montier C-Score |
29 | rsst_acc | Variables used to calculate the Dechow F-Score
30 | ch_rec |
31 | ch_inv |
32 | soft_assets |
33 | ch_cs |
34 | ch_roa |
35 | issue |
36 | logit |
37 | prob_FSF |
38 | Dechow F-Score |
Table 9. N/A values in the financial data sample per field.
Field | N/A Values Absolute (n) | N/A Values Relative (r) | Data Availability (1 − r) | Data Availability (2,014,827 − n)
ID | 0 | 0.00% | 100.00% | 2,014,827
ISIN | 1,036,634 | 51.45% | 48.55% | 978,193
Name | 2 | 0.00% | 100.00% | 2,014,825
Period | 0 | 0.00% | 100.00% | 2,014,827
Timestamp | 0 | 0.00% | 100.00% | 2,014,827
X1 | 776,186 | 38.52% | 61.48% | 1,238,641
X2 | 1,331,093 | 66.06% | 33.94% | 683,734
X3 | 757,967 | 37.62% | 62.38% | 1,256,860
X4 | 1,236,329 | 61.36% | 38.64% | 778,498
X5 | 692,321 | 34.36% | 65.64% | 1,322,506
Altman Z-Score | 1,573,337 | 78.09% | 21.91% | 441,490
DSRI | 465,181 | 23.09% | 76.91% | 1,549,646
GMI | 1,215,471 | 60.33% | 39.67% | 799,356
AQI | 515,638 | 25.59% | 74.41% | 1,499,189
SGI | 291,441 | 14.46% | 85.54% | 1,723,386
DEPI | 1,358,336 | 67.42% | 32.58% | 656,491
SGAI | 1,778,689 | 88.28% | 11.72% | 236,138
LVGI | 942,600 | 46.78% | 53.22% | 1,072,227
TATA | 1,068,151 | 53.01% | 46.99% | 946,676
Beneish M-Score | 1,857,670 | 92.20% | 7.80% | 157,157
C1 | 1,167,571 | 57.95% | 42.05% | 847,256
C2 | 419,513 | 20.82% | 79.18% | 1,595,314
C3 | 654,233 | 32.47% | 67.53% | 1,360,594
C4 | 273,489 | 13.57% | 86.43% | 1,741,338
C5 | 1,467,890 | 72.85% | 27.15% | 546,937
C6 | 221,522 | 10.99% | 89.01% | 1,793,305
Montier C-Score | 1,512,633 | 75.08% | 24.92% | 502,194
rsst_acc | 1,910,865 | 94.84% | 5.16% | 103,962
ch_rec | 394,346 | 19.57% | 80.43% | 1,620,481
ch_inv | 654,252 | 32.47% | 67.53% | 1,360,575
soft_assets | 109,570 | 5.44% | 94.56% | 1,905,257
ch_cs | 616,965 | 30.62% | 69.38% | 1,397,862
ch_roa | 448,078 | 22.24% | 77.76% | 1,566,749
issue | 0 | 0.00% | 100.00% | 2,014,827
logit | 1,918,102 | 95.20% | 4.80% | 96,725
prob_FSF | 1,918,109 | 95.20% | 4.80% | 96,718
Dechow F-Score | 1,918,109 | 95.20% | 4.80% | 96,718
Table 10. Firm years and N/A values in the financial data sample by fiscal year.
| Fiscal Year (x) | Firm Years (f) | N/A Values Absolute (n) | N/A Values Relative r = n/(f × 25) | Data Availability (1 − r) |
|---|---|---|---|---|
| 1988 | 4,066 | 94,431 | 92.90% | 7.10% |
| 1989 | 4,645 | 58,751 | 50.59% | 49.41% |
| 1990 | 8,787 | 58,751 | 26.74% | 73.26% |
| 1991 | 10,858 | 131,524 | 48.45% | 51.55% |
| 1992 | 13,750 | 162,387 | 47.24% | 52.76% |
| 1993 | 16,435 | 187,103 | 45.54% | 54.46% |
| 1994 | 20,049 | 234,398 | 46.77% | 53.23% |
| 1995 | 22,973 | 252,817 | 44.02% | 55.98% |
| 1996 | 25,320 | 257,925 | 40.75% | 59.25% |
| 1997 | 26,766 | 241,491 | 36.09% | 63.91% |
| 1998 | 27,855 | 228,203 | 32.77% | 67.23% |
| 1999 | 28,747 | 228,956 | 31.86% | 68.14% |
| 2000 | 29,099 | 219,831 | 30.22% | 69.78% |
| 2001 | 28,565 | 201,163 | 28.17% | 71.83% |
| 2002 | 28,741 | 196,641 | 27.37% | 72.63% |
| 2003 | 28,865 | 192,725 | 26.71% | 73.29% |
| 2004 | 32,368 | 249,205 | 30.80% | 69.20% |
| 2005 | 39,395 | 367,710 | 37.34% | 62.66% |
| 2006 | 87,733 | 1,396,176 | 63.66% | 36.34% |
| 2007 | 121,002 | 1,740,341 | 57.53% | 42.47% |
| 2008 | 126,996 | 1,491,038 | 46.96% | 53.04% |
| 2009 | 123,824 | 1,328,356 | 42.91% | 57.09% |
| 2010 | 130,765 | 1,378,487 | 42.17% | 57.83% |
| 2011 | 129,095 | 1,358,542 | 42.09% | 57.91% |
| 2012 | 124,551 | 1,288,704 | 41.39% | 58.61% |
| 2013 | 123,753 | 1,257,910 | 40.66% | 59.34% |
| 2014 | 122,778 | 1,212,891 | 39.51% | 60.49% |
| 2015 | 117,618 | 1,119,914 | 38.09% | 61.91% |
| 2016 | 119,302 | 1,149,979 | 38.56% | 61.44% |
| 2017 | 111,516 | 996,002 | 35.73% | 64.27% |
| 2018 | 107,087 | 900,434 | 33.63% | 66.37% |
| 2019 | 71,523 | 508,038 | 28.41% | 71.59% |
| Total | 2,014,827 | 20,690,824 | | |
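The relative N/A share in Table 10 is r = n/(f × 25), i.e., missing cells divided by firm years times the 25 score-related fields. A minimal sketch of that computation (the helper name is ours, not from the paper):

```python
# Sketch: reproduce the Table 10 availability figures from
# firm-year and N/A counts. r = n / (f * 25); availability = 1 - r.
def availability(firm_years: int, na_values: int, fields: int = 25) -> float:
    """Share of populated cells among firm_years * fields total cells."""
    return 1.0 - na_values / (firm_years * fields)

# Example: fiscal year 1988 (4,066 firm years, 94,431 N/A values)
print(f"{availability(4066, 94431):.2%}")  # → 7.10%
```

The same function reproduces every row of the table, e.g., 73.29% for 2003.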
Table 11. Fraudulent firm years from the AAER dataset.
| Fiscal Year | Firm Years Available | Firm Years Used | Relative Usage |
|---|---|---|---|
| 1971–1987 | 273 | 0 | 0.00% |
| 1988 | 37 | 3 | 8.11% |
| 1989 | 56 | 7 | 12.50% |
| 1990 | 47 | 13 | 27.66% |
| 1991 | 58 | 15 | 25.86% |
| 1992 | 62 | 20 | 32.26% |
| 1993 | 62 | 27 | 43.55% |
| 1994 | 47 | 25 | 53.19% |
| 1995 | 52 | 26 | 50.00% |
| 1996 | 54 | 30 | 55.56% |
| 1997 | 79 | 49 | 62.03% |
| 1998 | 97 | 65 | 67.01% |
| 1999 | 126 | 81 | 64.29% |
| 2000 | 148 | 109 | 73.65% |
| 2001 | 143 | 105 | 73.43% |
| 2002 | 131 | 97 | 74.05% |
| 2003 | 109 | 81 | 74.31% |
| 2004 | 82 | 64 | 78.05% |
| 2005 | 68 | 51 | 75.00% |
| 2006 | 45 | 33 | 73.33% |
| 2007 | 45 | 32 | 71.11% |
| 2008 | 38 | 27 | 71.05% |
| 2009 | 60 | 27 | 45.00% |
| 2010 | 49 | 25 | 51.02% |
| 2011 | 42 | 20 | 47.62% |
| 2012 | 43 | 24 | 55.81% |
| 2013 | 27 | 15 | 55.56% |
| 2014 | 21 | 11 | 52.38% |
| Total | 2,101 | 1,082 | 51.50% |
Table 12. Fraudulent firm years from the Creative Accounting dataset.
| Fiscal Year | Firm Years Available | Firm Years Used | Relative Usage |
|---|---|---|---|
| 1879–1987 | 58 | 0 | 0.00% |
| 1988 | 4 | 0 | 0.00% |
| 1989 | 1 | 0 | 0.00% |
| 1990 | 3 | 1 | 33.33% |
| 1991 | 6 | 1 | 16.67% |
| 1992 | 4 | 1 | 25.00% |
| 1993 | 9 | 3 | 33.33% |
| 1994 | 5 | 1 | 20.00% |
| 1995 | 8 | 2 | 25.00% |
| 1996 | 7 | 3 | 42.86% |
| 1997 | 7 | 3 | 42.86% |
| 1998 | 12 | 2 | 16.67% |
| 1999 | 14 | 4 | 28.57% |
| 2000 | 15 | 6 | 40.00% |
| 2001 | 16 | 8 | 50.00% |
| 2002 | 17 | 7 | 41.18% |
| 2003 | 6 | 5 | 83.33% |
| 2004 | 10 | 6 | 60.00% |
| 2005 | 6 | 4 | 66.67% |
| 2006 | 7 | 3 | 42.86% |
| 2007 | 2 | 1 | 50.00% |
| 2008 | 4 | 1 | 25.00% |
| 2009 | 1 | 1 | 100.00% |
| Total | 222 | 63 | 28.38% |
Table 13. Fraudulent firm years used from both datasets by country.
| Country | Firm Years Used | Percentage |
|---|---|---|
| Australia | 3 | 0.26% |
| China | 2 | 0.17% |
| Germany | 7 | 0.61% |
| Greece | 2 | 0.17% |
| India | 2 | 0.17% |
| Italy | 3 | 0.26% |
| Japan | 24 | 2.10% |
| Netherlands | 11 | 0.96% |
| Spain | 1 | 0.09% |
| Sweden | 1 | 0.09% |
| United Kingdom | 1 | 0.09% |
| USA | 1,088 | 95.02% |
| Total | 1,145 | 100.00% |
Table 14. Sample selection process of fraudulent firm years.
| Data | Number |
|---|---|
| AAER Dataset (USC Marshall School of Business, 2020) | |
| AAERs (1982–2018) | 4,012 |
| Misstatement events reported in those AAERs | 1,657 |
| − Enforcements unrelated to FSF (e.g., bribes and disclosure) or misstatements that cannot be linked to specific reporting periods | −609 |
| = Misstatement events affecting at least one quarterly or annual financial statement | 1,048 |
| Fraudulent firm years extracted from the 1,048 misstatement events | 2,101 |
| − Firm years before 1988 | −273 |
| − Firm years not existing in the financial data sample from Section 3.1.1 | −746 |
| = Fraudulent firm years merged with the financial data sample from Section 3.1.1 | 1,082 |
| Creative Accounting Dataset (Jones, 2011, pp. 509–517) | |
| Misstatement events reported | 142 |
| Fraudulent firm years extracted from the 142 misstatement events | 222 |
| − Firm years before 1988 | −58 |
| − Firm years that also appear in the AAER dataset (duplicates) | −10 |
| − Firm years not existing in the financial data sample from Section 3.1.1 | −91 |
| = Fraudulent firm years merged with the financial data sample from Section 3.1.1 | 63 |
| Total fraudulent firm years merged with the financial data sample from Section 3.1.1 | 1,145 |
Table 15. Missing values per row in the merged dataset after undersampling.
| Number of N/As in 25 Fields (n) | Number of Rows Containing n N/As | Relative Amount |
|---|---|---|
| 0 | 3 | 0.26% |
| 1 | 59 | 5.15% |
| 2 | 207 | 18.08% |
| 3 | 154 | 13.45% |
| 4 | 146 | 12.75% |
| 5 | 139 | 12.14% |
| 6 | 114 | 9.96% |
| 7 | 37 | 3.23% |
| 8 | 85 | 7.42% |
| 9 | 32 | 2.79% |
| 10 | 38 | 3.32% |
| 11 | 32 | 2.79% |
| Total firm years kept (Σ) | 1,046 | 91.35% |
| 12 | 5 | 0.44% |
| 13 | 14 | 1.22% |
| 14 | 16 | 1.40% |
| 15 | 17 | 1.48% |
| 16 | 7 | 0.61% |
| 17 | 4 | 0.35% |
| 18 | 11 | 0.96% |
| 19 | 9 | 0.79% |
| 20 | 7 | 0.61% |
| 21 | 5 | 0.44% |
| 22 | 4 | 0.35% |
| 23 | 0 | 0.00% |
| 24 | 0 | 0.00% |
| 25 | 0 | 0.00% |
| Total firm years deleted (Σ) | 99 | 8.65% |
| Total | 1,145 | 100.00% |
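Table 15 implies a row-level screen: firm years with more than 11 of the 25 score-related fields missing are dropped, keeping 1,046 of 1,145. A pandas sketch of such a filter (the function name and toy data are illustrative, not from the paper):

```python
import numpy as np
import pandas as pd

# Sketch of the row-level screen behind Table 15: keep rows with at most
# `max_na` missing values across the score-related fields.
def filter_rows(df: pd.DataFrame, score_fields: list, max_na: int = 11) -> pd.DataFrame:
    na_per_row = df[score_fields].isna().sum(axis=1)
    return df[na_per_row <= max_na]

# Toy example with 3 of the 25 fields for illustration
toy = pd.DataFrame({
    "X1": [0.1, np.nan, np.nan],
    "X2": [0.2, np.nan, 0.3],
    "X3": [0.3, np.nan, np.nan],
})
print(len(filter_rows(toy, ["X1", "X2", "X3"], max_na=2)))  # → 2 (rows 0 and 2 kept)
```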
Table 16. Missing values per column in the merged dataset after undersampling.
| Data Field | Number of Rows Containing N/As (n) | Relative Amount (n/1046) | Deleted? |
|---|---|---|---|
| X1 | 277 | 26.48% | No |
| X2 | 481 | 45.98% | No |
| X3 | 225 | 21.51% | No |
| X4 | 77 | 7.36% | No |
| X5 | 225 | 21.51% | No |
| Altman Z-Score | 549 | 52.49% | Yes |
| DSRI | 28 | 2.68% | No |
| GMI | 160 | 15.30% | No |
| AQI | 59 | 5.64% | No |
| SGI | 4 | 0.38% | No |
| DEPI | 482 | 46.08% | No |
| SGAI | 929 | 88.81% | Yes |
| LVGI | 63 | 6.02% | No |
| TATA | 6 | 0.57% | No |
| Beneish M-Score | 978 | 93.50% | Yes |
| C1 | 12 | 1.15% | No |
| C2 | 3 | 0.29% | No |
| C3 | 72 | 6.88% | No |
| C4 | 0 | 0.00% | No |
| C5 | 489 | 46.75% | No |
| C6 | 0 | 0.00% | No |
| Montier C-Score | 507 | 48.47% | Yes |
| rsst_acc | 1,042 | 99.62% | Yes |
| ch_rec | 3 | 0.29% | No |
| ch_inv | 72 | 6.88% | No |
| soft_assets | 18 | 1.72% | No |
| ch_cs | 68 | 6.50% | No |
| ch_roa | 62 | 5.93% | No |
| issue | 0 | 0.00% | Yes |
| logit | 1,042 | 99.62% | Yes |
| prob_FSF | 1,042 | 99.62% | Yes |
| Dechow F-Score | 1,042 | 99.62% | Yes |
Table 17. Averages replacing remaining missing values (mean imputation).
| Data Field | Actual Average (Value Used) | Number of N/As Replaced |
|---|---|---|
| X1 | 0.2642 | 277 |
| X2 | −0.2784 | 481 |
| X3 | 0.0449 | 225 |
| X4 | 7.7822 | 77 |
| X5 | 1.2802 | 225 |
| DSRI | 1.3707 | 28 |
| GMI | 0.9047 | 160 |
| AQI | 112,517.2376 (3.9461) | 59 (62) |
| SGI | 2.8458 | 4 |
| DEPI | 5.1758 | 482 |
| LVGI | 1.1594 | 63 |
| TATA | −0.0601 | 6 |
| C1 | 0.4971 (0) | 12 |
| C2 | 0.2579 (0) | 3 |
| C3 | 0.5441 (1) | 72 |
| C4 | 0.5143 (1) | 0 |
| C5 | 0.5673 (1) | 489 |
| C6 | 0.4073 (0) | 0 |
| ch_rec | 0.0366 | 3 |
| ch_inv | 0.0233 | 72 |
| soft_assets | 0.6213 | 18 |
| ch_cs | −1.5590 | 68 |
| ch_roa | −0.0003 | 62 |
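The remaining gaps are filled with column means, with the binary Montier flags C1–C6 rounded to 0 or 1, as the values in parentheses in Table 17 show. A pandas sketch under those assumptions (the helper name is ours, not from the paper):

```python
import pandas as pd

# Sketch of the mean imputation summarized in Table 17: continuous fields
# receive the column mean; the binary C1-C6 flags receive the rounded mean.
def impute_means(df: pd.DataFrame, binary_fields=("C1", "C2", "C3", "C4", "C5", "C6")) -> pd.DataFrame:
    out = df.copy()
    for col in out.columns:
        mean = out[col].mean()  # pandas skips NaNs by default
        fill = round(mean) if col in binary_fields else mean
        out[col] = out[col].fillna(fill)
    return out

# Toy example: C1 mean 2/3 rounds to 1; X1 mean is 2.0
toy = pd.DataFrame({"C1": [1.0, 0.0, 1.0, float("nan")],
                    "X1": [1.0, float("nan"), 3.0, 2.0]})
print(impute_means(toy))
```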
Table 18. Altman Z-Score of Wirecard AG.
| Firm Year | X1 | X2 | X3 | X4 | X5 | Z-Score |
|---|---|---|---|---|---|---|
| 2000 | NULL | NULL | NULL | 20.67 | NULL | NULL |
| 2003 | 0.14 | 0.01 | 0.01 | 5.84 | 0.48 | 4.21 |
| 2004 | 0.16 | −0.15 | 0.06 | 3.03 | 0.58 | 2.56 |
| 2005 | 0.36 | NULL | 0.14 | 5.76 | 0.73 | NULL |
| 2006 | 0.20 | 0.17 | 0.15 | 6.14 | 0.66 | 5.32 |
| 2007 | 0.06 | 0.21 | 0.14 | 3.99 | 0.54 | 3.77 |
| 2008 | 0.20 | 0.35 | 0.18 | 1.96 | 0.74 | 3.26 |
| 2009 | 0.18 | 0.35 | 0.15 | 3.32 | 0.61 | 3.80 |
| 2010 | 0.18 | 0.50 | 0.19 | 4.01 | 0.76 | 4.70 |
| 2011 | 0.34 | 0.52 | 0.17 | 3.45 | 0.74 | 4.51 |
| 2012 | 0.24 | 0.37 | 0.12 | 3.56 | 0.51 | 3.85 |
| 2013 | 0.27 | 0.37 | 0.10 | 3.92 | 0.49 | 4.00 |
| 2014 | 0.31 | 0.34 | 0.10 | 4.88 | 0.45 | 4.57 |
| 2015 | 0.25 | 0.30 | 0.09 | 3.47 | 0.40 | 3.51 |
| 2016 | 0.32 | 0.35 | 0.10 | 2.52 | 0.43 | 3.14 |
| 2017 | 0.25 | 0.34 | 0.10 | 3.98 | 0.47 | 3.96 |
| 2018 | 0.36 | 0.31 | 0.10 | 4.17 | 0.45 | 4.14 |

| Altman Z-Score | Interpretation |
|---|---|
| Z > 2.67 | Non-distress zone |
| 1.81 < Z < 2.67 | Gray zone |
| Z < 1.81 | Distress zone |
Source: Financial data (Bloomberg); interpretation (Anjum, 2012, p. 216). Background color shading follows traffic-light logic (green = non-distress; yellow = gray zone; red = distress) based on the range of the original Z-Score.
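For reference, the Z-Scores in Table 18 are consistent with Altman's original five-factor model: e.g., for 2003, 1.2(0.14) + 1.4(0.01) + 3.3(0.01) + 0.6(5.84) + 1.0(0.48) ≈ 4.21. A sketch with the zone thresholds from the table (function names are ours):

```python
# Original Altman (1968) Z-Score; the coefficients reproduce Table 18
# from the X1-X5 inputs (small deviations come from rounded inputs).
def altman_z(x1, x2, x3, x4, x5):
    return 1.2 * x1 + 1.4 * x2 + 3.3 * x3 + 0.6 * x4 + 1.0 * x5

def zone(z):
    if z > 2.67:
        return "non-distress"
    if z > 1.81:
        return "gray"
    return "distress"

# Wirecard 2003: Z ≈ 4.21 in the table
z = altman_z(0.14, 0.01, 0.01, 5.84, 0.48)
print(zone(z))  # → non-distress
```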
Table 19. Beneish M-Score of Wirecard AG.
| Firm Year | DSRI | GMI | AQI | SGI | DEPI | SGAI | LVGI | TATA | M-Score | M-Score ¹ |
|---|---|---|---|---|---|---|---|---|---|---|
| 2000 | 1.50 | 0.97 | 0.98 | 1.94 | NULL | NULL | 0.11 | −0.03 | NULL | −1.03 |
| 2003 | 6.12 | 0.78 | 7.93 | 1.54 | 0.80 | NULL | 0.26 | 0.24 | NULL | 6.72 |
| 2004 | NULL | NULL | 0.65 | 1.49 | 0.98 | NULL | 1.89 | −0.01 | NULL | −2.51 |
| 2005 | NULL | NULL | 1.22 | 7.17 | 1.50 | NULL | 0.63 | −0.04 | NULL | 3.10 |
| 2006 | NULL | NULL | 0.87 | 1.67 | 0.68 | NULL | 1.61 | −0.02 | NULL | −2.25 |
| 2007 | NULL | NULL | 0.92 | 1.64 | 1.75 | NULL | 1.23 | −0.16 | NULL | −2.70 |
| 2008 | NULL | NULL | 0.97 | 1.47 | 0.58 | NULL | 0.86 | 0.00 | NULL | −2.06 |
| 2009 | NULL | NULL | 0.88 | 1.16 | NULL | NULL | 1.07 | −0.04 | NULL | −2.59 |
| 2010 | 1.24 | NULL | 1.25 | 1.19 | NULL | 0.69 | 0.86 | 0.14 | NULL | −1.23 |
| 2011 | 1.28 | NULL | 0.98 | 1.20 | 1.53 | 0.84 | 1.06 | 0.02 | NULL | −1.88 |
| 2012 | 0.97 | NULL | 0.96 | 1.21 | 0.96 | 1.10 | 1.02 | −0.02 | NULL | −2.45 |
| 2013 | 1.06 | NULL | 1.02 | 1.22 | 0.74 | 1.12 | 1.11 | −0.03 | NULL | −2.44 |
| 2014 | 1.01 | NULL | 0.99 | 1.25 | 1.03 | 0.99 | 0.78 | −0.01 | NULL | −2.23 |
| 2015 | 0.98 | NULL | 1.07 | 1.28 | 1.18 | 1.84 | 1.21 | −0.07 | NULL | −2.75 |
| 2016 | 0.99 | NULL | 0.91 | 1.33 | 1.08 | 1.05 | 1.06 | −0.01 | NULL | −2.28 |
| 2017 | 0.83 | NULL | 0.99 | 1.45 | 0.92 | 0.89 | 1.10 | −0.06 | NULL | −2.56 |
| 2018 | 0.87 | NULL | 0.83 | 1.35 | 1.11 | 0.85 | 1.04 | −0.07 | NULL | −2.65 |
| Beneish M-Score | Interpretation |
|---|---|
| M > −2.22 | Manipulation likely |
| M < −2.22 | Manipulation unlikely |
Source: Financial data (Bloomberg); interpretation (Aghghaleh et al., 2016, p. 59; Warshavsky, 2012, p. 18). 1 Assuming all NULL values = 1. Background color shading follows traffic-light logic (green = manipulation unlikely; red = manipulation likely) based on the range of the original M-Score.
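The M-Score ¹ column can be reproduced with Beneish's published eight-variable formula, setting missing inputs to 1: e.g., the 2012 row yields approximately −2.45. A sketch (the function name is ours, not from the paper):

```python
# Eight-variable Beneish M-Score with the published coefficients.
# Replacing NULL inputs with 1 reproduces Table 19's M-Score¹ column.
def beneish_m(dsri, gmi, aqi, sgi, depi, sgai, lvgi, tata):
    return (-4.84 + 0.920 * dsri + 0.528 * gmi + 0.404 * aqi + 0.892 * sgi
            + 0.115 * depi - 0.172 * sgai + 4.679 * tata - 0.327 * lvgi)

# Wirecard 2012, with the missing GMI set to 1 as in the table
m = beneish_m(0.97, 1.0, 0.96, 1.21, 0.96, 1.10, 1.02, -0.02)
print(round(m, 2))  # → -2.46 (table reports −2.45 from unrounded inputs)
```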
Table 20. Montier C-Score of Wirecard AG.
| Firm Year | C1 | C2 | C3 | C4 | C5 | C6 | C-Score | C-Score ¹ |
|---|---|---|---|---|---|---|---|---|
| 2000 | 0 | 0 | NULL | 1 | NULL | 1 | NULL | 2 |
| 2003 | 0 | 1 | 0 | 1 | 1 | 1 | 4 | 4 |
| 2004 | 1 | NULL | NULL | 0 | 1 | 0 | NULL | 2 |
| 2005 | 1 | NULL | NULL | 1 | 1 | 1 | NULL | 4 |
| 2006 | 1 | NULL | 0 | 1 | 0 | 1 | NULL | 3 |
| 2007 | 0 | NULL | 1 | 1 | 1 | 1 | NULL | 4 |
| 2008 | 1 | NULL | 0 | 0 | 0 | 0 | NULL | 1 |
| 2009 | 0 | NULL | 1 | 1 | NULL | 1 | NULL | 3 |
| 2010 | 0 | 0 | 1 | 0 | NULL | 0 | NULL | 1 |
| 2011 | 1 | 1 | 1 | 1 | 1 | 1 | 6 | 6 |
| 2012 | 0 | 0 | 1 | 1 | 0 | 1 | 3 | 3 |
| 2013 | 0 | 0 | 1 | 1 | 0 | 1 | 3 | 3 |
| 2014 | 1 | 0 | 0 | 1 | 1 | 1 | 4 | 4 |
| 2015 | 0 | 0 | 1 | 1 | 1 | 1 | 4 | 4 |
| 2016 | 1 | 0 | 1 | 0 | 1 | 0 | 3 | 3 |
| 2017 | 0 | 0 | 1 | 0 | 0 | 1 | 2 | 2 |
| 2018 | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 2 |
Source: Financial data: Bloomberg. 1 Assuming all NULL values = 0. Background color shading follows traffic-light logic based on the range of the original C-Score.
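The C-Score is simply the count of the six binary red flags C1–C6 that fire; the C-Score ¹ column treats NULL as 0. A minimal sketch:

```python
# Montier C-Score: number of the six binary red flags (C1-C6) that fire.
# Treating NULL (None) as 0 reproduces Table 20's C-Score¹ column.
def montier_c(flags):
    return sum(0 if f is None else f for f in flags)

print(montier_c([0, 1, 0, 1, 1, 1]))      # 2003 → 4
print(montier_c([1, None, None, 0, 1, 0]))  # 2004 → 2
```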
Table 21. Dechow F-Score of Wirecard AG.
| Firm Year | rsst_acc | ch_rec | ch_inv | soft_assets | ch_cs | ch_roa | issue | prob_FSF | F-Score |
|---|---|---|---|---|---|---|---|---|---|
| 2000 | NULL | 0.0476 | NULL | 0.8686 | NULL | NULL | 1 | NULL | NULL |
| 2003 | NULL | 0.3198 | 0.0000 | 0.9393 | −0.3763 | 1.5374 | 1 | NULL | NULL |
| 2004 | NULL | NULL | NULL | 0.9320 | NULL | −0.2492 | 1 | NULL | NULL |
| 2005 | NULL | NULL | NULL | 0.6997 | NULL | 0.1124 | 1 | NULL | NULL |
| 2006 | NULL | NULL | −0.0070 | 0.7097 | NULL | −0.0220 | 1 | NULL | NULL |
| 2007 | NULL | NULL | 0.0047 | 0.5998 | NULL | 0.0069 | 1 | NULL | NULL |
| 2008 | NULL | NULL | −0.0035 | 0.5306 | NULL | 0.0027 | 1 | NULL | NULL |
| 2009 | NULL | NULL | 0.0006 | 0.4931 | NULL | −0.0087 | 1 | NULL | NULL |
| 2010 | 0.1920 | 0.0700 | 0.0000 | 0.6418 | NULL | 0.0043 | 1 | NULL | NULL |
| 2011 | −0.0792 | 0.1009 | 0.0007 | 0.6813 | 0.1197 | −0.0016 | 1 | 0.50% | 1.3406 |
| 2012 | −0.0386 | 0.0363 | 0.0009 | 0.5972 | 0.3820 | −0.0175 | 1 | 0.39% | 1.0588 |
| 2013 | −0.1228 | 0.0496 | 0.0024 | 0.6075 | 0.1578 | −0.0152 | 1 | 0.37% | 1.0060 |
| 2014 | 0.0140 | 0.0417 | −0.0008 | 0.5818 | 0.2662 | −0.0017 | 1 | 0.39% | 1.0463 |
| 2015 | −0.2130 | 0.0370 | 0.0001 | 0.5820 | 0.2843 | −0.0052 | 1 | 0.32% | 0.8714 |
| 2016 | −0.0599 | 0.0435 | 0.0003 | 0.5595 | 0.3066 | 0.0253 | 1 | 0.35% | 0.9331 |
| 2017 | −0.1771 | 0.0301 | 0.0022 | 0.5433 | 0.5411 | −0.0183 | 1 | 0.32% | 0.8653 |
| 2018 | −0.1109 | 0.0234 | −0.0005 | 0.4977 | 0.3835 | 0.0021 | 1 | 0.29% | 0.7802 |
| Dechow F-Score | Interpretation |
|---|---|
| F > 2.45 | High risk |
| F > 1.85 | Substantial risk |
| F ≥ 1.00 | Above-normal risk |
| F < 1.00 | Normal or low risk |
Source: Financial data: Bloomberg; interpretation (Dechow et al., 2011, p. 63). Background color shading follows traffic-light logic based on the range of the original F-Score.
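The prob_FSF and F-Score columns are consistent with Dechow et al.'s (2011) Model 1: a logit over the seven predictors, converted to a probability and scaled by the unconditional misstatement rate of 0.37%. For the 2012 row this gives approximately 1.06, matching the table. A sketch (the function name is ours, not from the paper):

```python
import math

# Dechow et al. (2011) Model 1: logit -> probability -> F-Score, where the
# F-Score is the probability divided by the unconditional rate of 0.37%.
def dechow_f(rsst_acc, ch_rec, ch_inv, soft_assets, ch_cs, ch_roa, issue):
    logit = (-7.893 + 0.790 * rsst_acc + 2.518 * ch_rec + 1.191 * ch_inv
             + 1.979 * soft_assets + 0.171 * ch_cs - 0.932 * ch_roa
             + 1.029 * issue)
    prob = math.exp(logit) / (1.0 + math.exp(logit))
    return prob / 0.0037

# Wirecard 2012 from Table 21
f = dechow_f(-0.0386, 0.0363, 0.0009, 0.5972, 0.3820, -0.0175, 1)
print(round(f, 2))  # → 1.06
```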
Table 22. Model predictions for Wirecard firm years.
| Firm Year | Predicted Class | Imputed Predictors (of 23) |
|---|---|---|
| 2000 | 1 (fraudulent) | 10/23 |
| 2003 | 0 (non-fraudulent) | 0/23 |
| 2004 | 0 (non-fraudulent) | 7/23 |
| 2005 | 1 (fraudulent) | 8/23 |
| 2006 | 1 (fraudulent) | 5/23 |
| 2007 | 1 (fraudulent) | 5/23 |
| 2008 | 0 (non-fraudulent) | 5/23 |
| 2009 | 0 (non-fraudulent) | 7/23 |
| 2010 | 0 (non-fraudulent) | 4/23 |
| 2011 | 0 (non-fraudulent) | 1/23 |
| 2012 | 0 (non-fraudulent) | 1/23 |
| 2013 | 0 (non-fraudulent) | 1/23 |
| 2014 | 1 (fraudulent) | 1/23 |
| 2015 | 1 (fraudulent) | 1/23 |
| 2016 | 1 (fraudulent) | 1/23 |
| 2017 | 0 (non-fraudulent) | 1/23 |
| 2018 | 0 (non-fraudulent) | 1/23 |

Steingen, L.; Löw, E. Using Machine Learning to Detect Financial Statement Fraud: A Cross-Country Analysis Applied to Wirecard AG. J. Risk Financial Manag. 2025, 18, 605. https://doi.org/10.3390/jrfm18110605
