Article

A Decoupling-Fusion System for Financial Fraud Detection: Operationalizing Causal–Temporal Asynchrony in Multimodal Data

1 School of Management, Shanghai University, Shanghai 200444, China
2 School of Management Science and Engineering, Shandong University of Finance and Economics, Jinan 250014, China
* Author to whom correspondence should be addressed.
Systems 2026, 14(1), 25; https://doi.org/10.3390/systems14010025 (registering DOI)
Submission received: 17 November 2025 / Revised: 18 December 2025 / Accepted: 24 December 2025 / Published: 25 December 2025
(This article belongs to the Section Systems Practice in Social Science)

Abstract

Financial statement fraud is a socio-technical risk that arises from coupled organizational, informational, and regulatory processes. To address the Identification Paradox in financial fraud detection, where existing models cannot simultaneously recognize both chronic manipulation and acute outbreaks in financial data, this study proposes the Causal–Temporal Asynchrony (CTA) theory as a process-oriented conceptual framework that guides feature construction and model design in a predictive setting. CTA defines fraud motive as a chronic, multi-period accumulation and fraud action as an acute, single-year event. To operationalize CTA within a predictive setting, we build a deployable Decoupling-Fusion System that encodes CTA as an Acute–Chronic Binary Feature Dimensions schema and performs detection via Decoupling-Fusion FraudNet. Within this system, parallel Long Short-Term Memory networks (LSTM) capture chronic motive signals from longitudinal sequences, while parallel Convolutional Neural Networks (CNN) and a Feed-forward Neural Network (FNN) identify acute action signals from multimodal snapshots; the resulting asynchronous probabilities are integrated via an adaptive decision-level fusion mechanism. Empirical tests on China’s A-share market (2001–2021) show the system (AUC = 0.967) outperforms baseline models. Furthermore, eXplainable AI analysis reveals patterns consistent with the classic fraud triangle (pressure, opportunity and rationalization). This study develops a theory-grounded decision-support system that unifies acute and chronic evidence streams and provides a deployable blueprint for continuous auditing and governance.

1. Introduction

Financial fraud, a persistent malignancy in capital markets, has repeatedly demonstrated its destructive impact globally [1]. From the collapse of Enron to the systemic fabrication at Luckin Coffee, major fraud cases not only devastate investor confidence but also pose severe challenges to financial market stability [2,3]. Static models, such as the Beneish M-Score, rely on classic indicators that are increasingly subject to strategic avoidance by fraudsters, significantly diminishing their detection capabilities [4]. To address this challenge, Regulatory Technology (RegTech) and data mining techniques, especially deep learning, have shown immense potential [5]. However, existing data-driven detection paradigms are trapped in a fundamental Identification Paradox: they struggle to efficiently capture two temporally asynchronous types of fraud signals within a single architecture. The first is the chronic manipulation pattern, manifesting as a gradual deterioration of financial indicators over multiple reporting periods. The second is the acute outbreak pattern, characterized by sudden anomalies in the data of a specific year.
This dual nature of signals has led to a paradigmatic split in the application of current deep learning methods. Dynamic models based on recurrent neural networks, such as Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU), excel at capturing chronic trends [6,7], but their information aggregation mechanisms tend to smooth over or lose transient acute anomaly signals. On the other hand, static models based on methods such as eXtreme Gradient Boosting (XGBoost) and Convolutional Neural Network (CNN) can efficiently identify synergistic anomaly patterns in specific cross-sections [8,9]. However, their time-blind characteristic prevents them from modeling the chronic evolutionary process. To justify this split more systematically, this study synthesizes prior work by temporal orientation, modality coverage, and fusion strategy. Many studies are strong at modeling either chronic accumulation or acute outbreak sensitivity, but few preserve both within a unified design. For example, sequential encoders are better suited for capturing multi-year deterioration, such as gradual erosion in cash-flow quality. Cross-sectional CNN or XGBoost models are more responsive to within-year co-activation patterns, such as an abnormal ROE surge coupled with asset-quality anomalies. Even when multimodal inputs are introduced to reduce information sparsity, fraud theory is often used to motivate feature selection rather than to constrain model structure and fusion logic. Therefore, the key gap is not the absence of theory in the literature. The gap lies in the lack of an explicit mapping from the motive accumulation and action outbreak process to model design. In other words, existing approaches seldom provide a process-consistent architecture that can simultaneously retain long-horizon motive accumulation cues and short-horizon outbreak cues, while also specifying how multimodal evidence should be decoupled and fused under a theory-guided logic. 
Consequently, academia urgently needs a new framework capable of handling both asynchronous signals simultaneously.
This study posits that the root of this paradox lies in a limited process-level linkage from theory to model design, rather than in a purely technical limitation or a blanket lack of theoretical research. Most existing models, whether based on cross-sectional or sequential features, are designed directly as predictive classifiers that learn statistical correlations between multimodal indicators and fraud labels, while leaving the motive-to-action process implicit in the architecture and fusion logic. Inspired by the classic Fraud Triangle theory [10], this study introduces the Causal–Temporal Asynchrony (CTA) theory of motive and action. This theory contends that the fraud motive, such as persistent operational difficulties or performance pressure, is a chronic process accumulating over a multi-period time window. In contrast, the fraud action, such as fabricating revenue or fictitious assets, is an acute event that may erupt in a specific year after the accumulated motive crosses a critical threshold. In this study, the term CTA refers to a conceptual process in which chronic fraud motives accumulate before and enable acute fraudulent actions. Accordingly, CTA functions as a theory-guided organizing principle that structures feature dimensions and architectural choices. In this view, the motive and action stages are causally ordered but remain distinct in their temporal dimensions and data manifestations.
Building on CTA, this study establishes the Decoupling-Fusion Identification Paradigm (DFIP) for asynchronous signals. The paradigm advocates that the heterogeneous acute and chronic signals must first be subject to decoupling. They should be analyzed independently using dedicated identification modules best suited to their respective data characteristics to avoid mutual interference between different signal modalities. Subsequently, a high-level fusion is performed at the decision level to form a final, comprehensive judgment. This Decoupling–Fusion logic operationalizes the above synthesis by explicitly aligning model components with the acute–chronic heterogeneity implied by the motive–action process.
Guided by DFIP, this study implements a Decoupling-Fusion System that encodes CTA as an Acute–Chronic Binary Feature Dimensions schema and performs detection via Decoupling-Fusion FraudNet (DF-FraudNet). Within the Decoupling-Fusion System, the chronic feature dimension consists of five-year sequences of financial ratios and Management Discussion and Analysis (MD&A) narratives, designed to quantify the cumulative process of fraud motive. The acute dimension integrates financial anomalies, non-financial governance context, and textual obfuscation signals from the fraud year, forming a multimodal snapshot to capture the moment of the fraud action’s outbreak. These two types of signals do not exist in isolation; they jointly provide mutually reinforcing evidence patterns through specific manifestation feature configurations, thereby offering a richer evidence base for characterizing fraudulent behavior. A typical strong fraud signal configuration might be a continuous decline in the operating cash flow to net profit ratio for several consecutive years (chronic motive accumulation), coupled with an abnormal surge in return on equity (ROE) in the fraud year (acute action execution) [11]. This example highlights that strong fraud evidence often emerges from the joint presence of chronic deterioration and acute anomalies. Motivated by this observation, DF-FraudNet adopts a Decoupling–Fusion design to model the two signal types with dedicated modules and then integrate them at the decision level.
To implement detection, DF-FraudNet adopts a modular multimodal deep learning architecture composed of two parallel identification modules: a chronic Motive Identification Module using parallel LSTMs that captures long-term evolutionary trends in longitudinal sequences, and an acute Action Identification Module using parallel CNN-FNNs that parses synergistic anomalies in static snapshots. Finally, the asynchronous probabilities output by the two modules are integrated through an adaptive decision-level fusion mechanism, allowing the system to autonomously learn the relative importance of the two signals.
The contribution of this study is twofold. First, it provides empirical evidence that the proposed framework achieves superior predictive performance (AUC = 0.967) compared with single-perspective or naive fusion models. Second, XAI analysis reveals attribution patterns that are consistent with the classic fraud triangle perspective on pressure, opportunity and rationalization, indicating that the learned representations can be interpreted as theory-guided proxy signals rather than opaque statistical artifacts. These contributions elucidate the origin of the identification paradox and demonstrate, in operational terms, how a CTA-guided Decoupling–Fusion design mitigates it. Overall, the proposed framework shifts fraud risk identification from purely ex post, correlation-based classification toward a more theory-informed and dynamically oriented predictive paradigm.
The remainder of this study is organized as follows: Section 2 reviews the related research. Section 3 details the CTA theory, the binary feature dimensions, and the architectural design of DF-FraudNet. Section 4 reports the experimental design, baseline comparisons, ablation studies, and explainability analysis results. Section 5 concludes this study and outlines directions for future research.

2. Literature Review

Fraud detection research broadly spans fraud mechanisms, structured analytics, and text mining of management narratives. This review organizes prior work by temporal orientation, modality coverage, and fusion strategy, thereby motivating our CTA-guided Decoupling–Fusion design.

2.1. Fraud Drivers and Fraud Typologies

Classic behavioral frameworks conceptualize fraud as a multi-dimensional process. The fraud triangle explains fraud through the joint presence of pressure, opportunity, and rationalization [12]. Extensions such as the fraud diamond add the dimension of capability to explain why only certain actors can execute sophisticated manipulation [13]. Empirical syntheses from archival and practitioner sources further document that fraud schemes are heterogeneous and carry high social costs in governance-sensitive settings [14,15].
These perspectives suggest that fraud is a process rather than a point event: chronic motives may accumulate over time and may culminate in acute actions when firms face binding constraints. Consistent with this view, prior studies distinguish between persistent financial distress, governance weaknesses, and incentive pressures as latent drivers, while treating observable misstatements as event-like manifestations. This separation between motive accumulation and action outbreaks underpins the asynchronous framing adopted in this study.

2.2. Structured Data and Static Fraud Analytics

Structured financial and governance indicators remain the cornerstone of fraud screening. Early work relies on linear or ratio-based diagnostics (e.g., the Beneish M-Score [16]) and statistical classifiers such as logistic regression [17]. More recent studies adopt tree-based methods (e.g., random forests and XGBoost) and deep learning to improve detection by capturing nonlinear interactions among ratios [18,19].
However, a significant limitation of this stream is its implicit assumption of temporal alignment: cross-sectional models are optimized for within-year anomaly synergy but do not explicitly model multi-year motive accumulation. As a result, naive attempts to combine static and sequential inputs often blur acute signals or dilute chronic trends, highlighting the need to balance snapshot sensitivity with longitudinal dynamics.

2.3. Temporal Modeling of Fraud Evolution

To capture multi-period dynamics, studies increasingly employ sequential models, including LSTM/GRU architectures and temporal attention mechanisms, to learn trajectories from multi-year accounting sequences [20,21]. These methods improve trend sensitivity but can smooth away transient spikes, especially when the anomaly is sparse relative to the full observation window.
This trade-off indicates that simply adding temporal modeling to a snapshot model is not sufficient: chronic motive signals and acute outbreak cues are both relevant but temporally misaligned. A dedicated mechanism is therefore required to preserve each signal type before integration.

2.4. Unstructured Data Mining and Multimodal Fusion Strategies

Beyond structured data, managerial narratives provide critical incremental information about firm risk. Early approaches rely on dictionary-based sentiment and topic modeling [22,23], whereas recent work applies transformer encoders to extract contextual semantics from MD&A, Risk Factors, and related disclosures, yielding richer sentence- and document-level representations that detect subtle linguistic cues [24].
To leverage these heterogeneous signals, recent studies employ feature-level, representation-level, or decision-level fusion across structured and unstructured modalities [25]. Multimodal learning can mitigate information sparsity [26], but naive fusion can induce negative transfer when modalities have heterogeneous noise structures or different informative time windows. Moreover, when tabular indicators are imaged into 2D intensity maps, the imposed spatial layout effectively encodes feature adjacency/grouping priors, which interact with cross-modal fusion design; related findings on graph- or topology-aware representations [27,28] reinforce the need for topology-aware fusion.
Despite these advances, a critical limitation remains: text is often treated as an auxiliary predictor rather than a theory-aligned evidence source. Many studies report attention or saliency maps, yet do not explicitly align extracted signals with constructs such as pressure, opportunity, and rationalization, nor do they clarify how multimodal cues map to motive versus action. This motivates an approach that first aligns modalities to a motive-to-action schema, and then uses this alignment to guide the fusion logic.

2.5. CTA Positioning and Research Gap

Since CTA describes an ordered motive-to-action process, it is adjacent to, but distinct from, work on causal-structure and temporal-dependence modeling. For example, Granger causality tests predictive precedence between variables [29]; Structural Equation Modeling (SEM) focuses on confirmatory structural testing under stated assumptions [30]; and Dynamic Bayesian Networks (DBNs) model probabilistic state transitions [31].
In contrast, CTA is not introduced here as an econometric causal identification framework; rather, it serves as a theory-guided organizing principle for feature construction and architectural decoupling. The aim is to operationally separate chronic motive accumulation from acute action outbreaks, and then fuse their asynchronous evidential probabilities in a way that is robust to modality-specific noise and time-window heterogeneity. This positioning motivates our CTA-guided Decoupling-Fusion Identification Paradigm.

3. Methodology and System Design

The goal of this section is to translate the Causal–Temporal Asynchrony (CTA) theory of motive and action from a theoretical construct into a computable and verifiable deep learning architecture. Every component of the Decoupling-Fusion System is designed to serve the core ideas of CTA theory.

3.1. Theoretical Cornerstone

CTA theory reveals the fundamental heterogeneity of fraud signals:
  • Motive signals are chronic, span multiple periods, and appear in longitudinal sequences that reflect an evolutionary process.
  • Action signals are acute, concentrated in a single year, and appear as cross-sectional snapshots that reflect abnormal configurations.
Forcing these two types of signals, which differ fundamentally in data structure, time span, and economic meaning, into a single homogeneous model (such as a solitary LSTM or CNN) will inevitably cause information loss and modal interference. This is the root of the identification paradox.
Therefore, this study’s design philosophy is the Decoupling-Fusion Identification Paradigm (DFIP) for asynchronous signals. Its first and core principle is heterogeneous decoupling: at the top level of model design, the temporal, chronic motive signals and the static, acute action signals must be separated and handled by dedicated identification modules with distinct topological structures. This approach ensures that each module preserves the unique information of its corresponding modality.
DFIP guides a Decoupling–Fusion System that maps CTA into an Acute–Chronic Binary Feature Dimensions schema, while Decoupling–Fusion FraudNet (DF-FraudNet) serves as the detection engine. DF-FraudNet organizes the method into a feature layer, a learner layer and a decision-level fusion layer, so that CTA is reflected in data representation, model design and final decision.
Notably, in this study CTA is understood as a causal–temporal process theory. It posits that chronic motives build up over time and precede acute fraudulent actions. Empirically, this causal–temporal ordering guides the construction of features and the coupling of the chronic and acute modules in the model architecture. The Decoupling–Fusion System is implemented as a theory-guided predictive model. It employs CTA to shape feature representation and the design of the chronic and acute modules, and its outputs are interpreted as predictive risk signals.
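To make the idea of an adaptive decision-level fusion layer concrete, the following is a minimal sketch and not the DF-FraudNet implementation. It assumes each module emits one probability per firm-year and that fusion is a convex combination whose weight is learned by gradient descent on binary cross-entropy. The gate parameter `theta`, the function names, and the toy data are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fuse(p_chronic, p_acute, theta):
    """Convex combination of the two module probabilities.

    alpha = sigmoid(theta) is the (assumed) learnable weight on the chronic module.
    """
    alpha = sigmoid(theta)
    return alpha * p_chronic + (1.0 - alpha) * p_acute


def fit_gate(pairs, labels, lr=0.5, epochs=200):
    """Fit theta by batch gradient descent on binary cross-entropy."""
    theta = 0.0
    for _ in range(epochs):
        grad = 0.0
        for (pc, pa), y in zip(pairs, labels):
            alpha = sigmoid(theta)
            p = min(max(alpha * pc + (1.0 - alpha) * pa, 1e-7), 1.0 - 1e-7)
            d_loss_d_p = (p - y) / (p * (1.0 - p))      # derivative of BCE w.r.t. p
            grad += d_loss_d_p * (pc - pa) * alpha * (1.0 - alpha)  # chain rule
        theta -= lr * grad / len(labels)
    return theta


# Toy data: the chronic module is uninformative (always 0.5), the acute one is not.
pairs = [(0.5, 0.9), (0.5, 0.1), (0.5, 0.8), (0.5, 0.2)]
labels = [1, 0, 1, 0]
theta = fit_gate(pairs, labels)
print(round(sigmoid(theta), 3))  # learned weight on the chronic module
```

In this toy setting the learned weight shifts toward the acute module, which is the behavior one would expect from a fusion layer that learns the relative importance of the two signals from data.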

3.2. Feature Dimension Construction

To execute this decoupling strategy, CTA theory is first operationalized as the Acute–Chronic Binary Feature Dimensions of fraud signals, constructed from theory-guided proxy features for motive-related and action-related signals. These dimensions serve as the inputs to the heterogeneous identification modules in DF-FraudNet.

3.2.1. Chronic Dimension: Quantifying Motive

This dimension is designed to approximate the long-term accumulation of fraud-related motives at the firm level. It constructs a multimodal sequence spanning a five-year window ($T-4$ to $T$), which traces how financial anomalies and MD&A narratives evolve over time. In line with CTA, this chronic dimension is interpreted as encoding firm-level proxy signals for fraud-related motives that evolve over time.
(1) Financial Dynamic Features ($x_{chronic\_fin}$)
The financial dynamic features form a $[5, 576]$ tensor, built from 576 core financial ratios per year across five consecutive years. This feature subset is the product of a multi-stage screening strategy.
  • First, 33 classic ratios spanning profitability, solvency, operating efficiency, cash-flow quality, growth, and cross-statement consistency are calculated as baselines and included as a benchmark subset. An expanded pool is generated from the three financial statements by removing line items with missingness $> p$, collapsing perfectly redundant series, and forming one ratio for each remaining unordered item pair (the item with the larger sample mean over available observations is used as the denominator; ties follow a fixed ordering rule). Observations with zero or missing denominators (or missing numerators) are treated as missing, and ratios with missingness $> q$ are discarded, yielding 2356 candidate ratios. Here, $p$ and $q$ are panel-level missingness cutoffs (including NaNs induced by zero denominators), and we set $p = q = 0.30$ to retain at least 70% coverage. Robust standardization and Min–Max scaling are applied for numerical stability and unified feature scales in downstream learning.
  • Second, Point-Biserial Correlation is used as an effect-size filter ($r_{pb} \geq 0.40$), a moderate-to-strong effect-size cutoff, to prioritize ratios with stronger, economically meaningful label relevance while avoiding overly aggressive pruning of potentially informative, non-traditional ratios. XGBoost then ranks the remaining candidates by Gain to capture nonlinear predictive utility, and RFE-CV is subsequently applied to the XGBoost-ranked candidate set to determine the final subset, fixing $K = 576$ as a representation–efficiency trade-off; since $576 = 24^2$, this also permits a lossless $24 \times 24$ reshaping for spatial configuration learning without padding or truncation.
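The first two screening stages can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' pipeline: missing values are represented as `None`, ties in sample means are broken alphabetically (the paper only states that a fixed ordering rule is used), the collapsing of redundant series is omitted, and the cutoff is applied to the signed coefficient. All function names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def missing_share(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def build_ratio_pool(items, p=0.30, q=0.30):
    """items maps a line-item name to its values over the firm-year panel."""
    kept = {k: v for k, v in items.items() if missing_share(v) <= p}
    pool = {}
    for a, b in combinations(sorted(kept), 2):  # each unordered pair once
        mean_a = mean(v for v in kept[a] if v is not None)
        mean_b = mean(v for v in kept[b] if v is not None)
        # Larger sample mean becomes the denominator; ties fall back to
        # alphabetical order (an assumed stand-in for the fixed ordering rule).
        num, den = (a, b) if mean_b >= mean_a else (b, a)
        ratio = [
            None if n is None or d is None or d == 0 else n / d
            for n, d in zip(kept[num], kept[den])
        ]
        if missing_share(ratio) <= q:
            pool[num + "/" + den] = ratio
    return pool


def point_biserial(x, y):
    """Point-biserial correlation between a ratio x and binary labels y."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if xi is not None]
    g1 = [xi for xi, yi in pairs if yi == 1]
    g0 = [xi for xi, yi in pairs if yi == 0]
    n = len(pairs)
    if not g1 or not g0:
        return 0.0
    mu = mean(xi for xi, _ in pairs)
    sd = (sum((xi - mu) ** 2 for xi, _ in pairs) / n) ** 0.5  # population SD
    if sd == 0:
        return 0.0
    return (mean(g1) - mean(g0)) / sd * (len(g1) * len(g0) / n ** 2) ** 0.5


def screen(pool, labels, cutoff=0.40):
    """Keep ratios whose coefficient meets the effect-size cutoff."""
    return {k: v for k, v in pool.items() if point_biserial(v, labels) >= cutoff}


items = {"a": [1, 2, 3, 4], "b": [10, 20, 0, 40], "c": [None, None, 5, 6]}
print(list(build_ratio_pool(items)))  # "c" is dropped for missingness
```

The later stages (XGBoost Gain ranking and RFE-CV) depend on external libraries and tuning choices that the text does not specify, so they are left out of this sketch.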
The 33 classic ratios retained as a benchmark subset include: net profit margin, gross margin, operating profit margin, return on assets (ROA), return on equity (ROE), cost ratio, current ratio, quick ratio, cash ratio, debt ratio, equity multiplier, interest coverage, operating cash flow to current liabilities, inventory turnover, accounts receivable turnover, accounts payable turnover, total asset turnover, fixed asset turnover, operating cash flow to net profit, operating cash flow to total liabilities, capital expenditure coverage, cash-flow return on assets, revenue growth, net profit growth, asset growth, equity growth, operating cash inflow to revenue, capex-to-depreciation, operating cash outflow to cost, net asset turnover, cash flow to revenue, operating cash flow to current assets, and free cash flow to net profit.
Table 1 shows examples of the top 20 financial ratios. These ratios not only include classic financial indicators such as ROE and the Current Ratio but also uncover non-traditional, highly concealed warning signals like Total Liabilities/Operating Cost. The full list of the 576 selected financial ratios is provided in Supplementary Table S1. The financial sequence is designed to capture gradual shifts in corporate health, such as sustained declines in profitability or rising leverage pressure. In this study, these financial sequences are interpreted as theory-guided proxy features for the gradual buildup of motive-related pressure at the firm level.
(2) Textual Dynamic Features ($x_{chronic\_text}$)
The textual dynamic features form a $[5, 49]$ tensor, built from 49 interpretable text features extracted from firm-year MD&A texts across five consecutive years. A mixed-method pipeline is employed.
  • First, a hybrid lexicon is constructed by merging the Loughran–McDonald finance categories as a theoretical anchor, localized Chinese finance lexicons from China Research Data Service Platform (CNRDS), and an audit red-flag list curated from auditing standards and enforcement cases. Candidate terms are consolidated via de-duplication and synonym/variant unification to obtain a fixed keyword/category inventory. FinBERT-Chinese is further employed for data-driven topic/keyword mining to supplement this inventory; only terms not covered by existing dictionaries and judged economically relevant are retained and added to the final topical keyword list.
  • Second, MD&A texts are denoised and standardized to ensure stable feature extraction, including removing non-content artifacts (e.g., HTML/XML markup, embedded tables, and duplicated boilerplate) and normalizing whitespace and full-/half-width characters. To preserve syntactic cues, punctuation is not removed but only standardized in form. Texts are then segmented into sentences based on Chinese sentence-ending punctuation, which supports the computation of style/readability proxies and narrative-focus markers.
Based on the finalized inventory and rules, features are computed deterministically. Sentiment and tone features are measured via dictionary-based counting under the Loughran-McDonald category framework, leveraging localized Chinese lexicons aligned to these categories (e.g., negative/positive, uncertainty, and litigation). Core topical keywords are measured as length-normalized term frequencies via phrase-level exact matching on the cleaned text (with the above synonym/variant unification when applicable). Linguistic style/readability and temporal/narrative features are derived from sentence segmentation and curated marker inventories (e.g., hedging/modals, temporal reference markers, and attribution markers).
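The deterministic text steps described above (normalization, sentence segmentation, and length-normalized phrase matching) can be illustrated with the short sketch below. It rests on several assumptions that the text does not confirm: full-/half-width unification is approximated with Unicode NFKC normalization, length is measured in characters, and the two-phrase lexicon is a made-up example rather than the inventory in Supplementary Table S2.

```python
import re
import unicodedata


def clean(text):
    """Unify full-/half-width forms (via NFKC, an assumption) and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Split on sentence-ending punctuation, keeping the punctuation mark."""
    # After NFKC the full-width marks ！ and ？ have become ASCII ! and ?.
    parts = re.findall(r"[^。!?]+[。!?]?", text)
    return [s.strip() for s in parts if s.strip()]


def phrase_rate(text, phrases):
    """Exact phrase matches per character of cleaned text."""
    if not text:
        return 0.0
    hits = sum(text.count(p) for p in phrases)  # non-overlapping exact matches
    return hits / len(text)


raw = "公司面临　不确定性！ 诉讼风险上升。业绩稳定"
cleaned = clean(raw)
print(split_sentences(cleaned))
print(phrase_rate(cleaned, ["不确定", "诉讼"]))  # hypothetical lexicon entries
```

Because every step is rule-based, the same text always yields the same feature values, which is what makes the 49 features reproducible across the chronic sequence and the acute snapshot.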
As shown in Table 2, the 49 features cover four dimensions: Sentiment and Tone, Core Topical Keywords, Linguistic Style and Readability, and Thematic Focus and Narrative. The complete lexicon and keyword matching rules are provided in Supplementary Table S2. The textual sequence is designed to capture the evolution of managerial narrative strategies, for example, a rising frequency of uncertainty terms or a year-over-year decline in text readability. In this study, these textual sequences are employed as theory-guided proxy features for how MD&A narratives evolve under economic stress at the firm level.

3.2.2. Acute Dimension: Characterization of Action

This dimension is designed to capture the execution context of the fraudulent action, that is, the multimodal abnormal configuration observed in the event year ($T = 0$). It forms a cross-sectional snapshot composed of three heterogeneous data types that summarize the contemporaneous incentive structures, opportunity environments, and rationalization-related disclosure choices under which fraud actions may be executed. In line with CTA, this acute dimension is interpreted as encoding firm-level proxy signals for the action side of the fraud process that are realized in the event year.
(1) Financial Static Features ($x_{acute\_fin}$)
The financial static features form a single-channel ratio intensity map with a shape of $[24, 24, 1]$. The image is reshaped from the 576 financial ratios in year $T = 0$, using a correlation-preserving pixel layout and grayscale encoding. The extraction process of financial ratios for this image is the same as for $x_{chronic\_fin}$.
Although this tensor can be visualized as a grayscale image, it is methodologically interpreted as a single-channel intensity map of a structured ratio matrix rather than a natural image. This spatial encoding improves detection by injecting a correlation-aware locality prior: highly co-moving ratios are placed in nearby pixels, allowing CNN kernels to efficiently learn co-occurring anomaly motifs via parameter sharing, which is harder to capture with a flattened vector under sparse fraud labels.
  • First, a fixed one-to-one ratio–pixel assignment is learned on the training firm-year panel by computing pairwise Pearson correlations ($w_{ab} = \rho_{ab}$) and minimizing a correlation–distance energy:
    $$E(\pi) = \sum_{a<b} w_{ab}\,\mathrm{dist}(\pi(a), \pi(b)) \qquad (1)$$
    where $\mathrm{dist}(\cdot,\cdot)$ is the Euclidean distance on the $24 \times 24$ grid. The minimization is implemented via a Monte-Carlo–style randomized swap search: a random initialization is iteratively refined by swapping two pixels and accepting the swap only if $E(\pi)$ decreases; the search stops when no accepted swap occurs for $3K$ consecutive proposals ($K = 576$). The resulting layout $\pi$ is then fixed and reused across all firms and years. Thus, each ratio is assigned to a unique and invariant pixel location, i.e., the same ratio always occupies the same pixel across all firms/years and in the test set. Correlations $\rho_{ab}$ are computed on the training panel using all available pairwise-complete observations for each ratio pair.
  • Second, each ratio value is mapped to pixel intensity by:
    $$I_i = \mathrm{clip}\left(\frac{v_{r(i)} - m_{r(i)}}{sd_{r(i)}} \times 100 + 128,\ 0,\ 255\right) \qquad (2)$$
    where $r(i)$ denotes the ratio assigned to pixel $i$ and $v_{r(i)}$ is its observed value; the mean $m_{r(i)}$ and standard deviation $sd_{r(i)}$ are computed from available observations in the training panel; missing ratios (including NaNs induced by zero denominators) are assigned a neutral gray value $I_i = 128$. To avoid information leakage, correlation statistics, layout learning, and normalization statistics are fitted on the training split (or within each training fold) and then applied to the held-out test set.
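The randomized swap search for the pixel layout can be sketched as below. This is a small illustrative version under stated assumptions: it recomputes the full energy after every proposal, which is simple but slow for $K = 576$ (an incremental update over the two swapped ratios would be the natural optimization), and it is demonstrated on a toy $3 \times 3$ grid with a made-up weight matrix. The names `energy` and `swap_search` are hypothetical.

```python
import math
import random


def energy(layout, w, side):
    """Sum over a < b of w[a][b] times the grid distance between their pixels.

    layout[a] is the flat pixel index assigned to ratio a.
    """
    total = 0.0
    k = len(layout)
    for a in range(k):
        ra, ca = divmod(layout[a], side)
        for b in range(a + 1, k):
            rb, cb = divmod(layout[b], side)
            total += w[a][b] * math.hypot(ra - rb, ca - cb)
    return total


def swap_search(w, side, seed=0):
    """Accept a two-pixel swap only if the energy strictly decreases."""
    k = side * side
    rng = random.Random(seed)
    layout = list(range(k))
    rng.shuffle(layout)                      # random initialization
    best = energy(layout, w, side)
    stale = 0                                # consecutive rejected proposals
    while stale < 3 * k:                     # stopping rule: 3K rejections in a row
        a, b = rng.sample(range(k), 2)
        layout[a], layout[b] = layout[b], layout[a]
        candidate = energy(layout, w, side)
        if candidate < best:
            best, stale = candidate, 0
        else:
            layout[a], layout[b] = layout[b], layout[a]  # revert the swap
            stale += 1
    return layout, best


# Toy weights: three groups of three mutually correlated ratios.
w = [[1.0 if a // 3 == b // 3 else 0.0 for b in range(9)] for a in range(9)]
layout, best = swap_search(w, side=3)
print(layout, round(best, 3))
```

Since only strictly improving swaps are accepted and the number of layouts is finite, the loop always terminates, though it may stop in a local minimum rather than the global one.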
Imaging the ratios is key to capturing their synergistic anomaly patterns. The local receptive field of a CNN can detect strong fraud signals that a flattened vector would miss, for example, a local region where ROE is abnormally high while accounts receivable turnover is abnormally low. While the underlying ratios are the same as in the chronic dimension, arranging them into a single-year image emphasizes the acute cross-sectional anomaly pattern that accompanies the execution of fraud in the event year. In this study, these financial images are interpreted as theory-guided proxy features for the short-term incentive and abnormal result patterns under which fraudulent actions are taken.
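As a worked illustration of the intensity rule in Equation (2), the sketch below maps standardized ratio values to gray levels and places them on a grid using a given layout. It assumes missing values are represented as `None`, treats a zero standard deviation as missing (a choice the text does not specify), and keeps intensities as floats rather than rounding to integers. The function names and the $2 \times 2$ example are hypothetical.

```python
def to_intensity(value, mean, sd):
    """Scale the z-score by 100, shift to 128, then clip to [0, 255]."""
    if value is None or sd == 0:
        return 128.0                      # neutral gray for missing values
    z = (value - mean) / sd
    return min(max(z * 100 + 128, 0.0), 255.0)


def render(values, means, sds, layout, side):
    """Place each ratio's intensity at its fixed pixel; layout[a] is a flat index."""
    image = [[128.0] * side for _ in range(side)]
    for a, pixel in enumerate(layout):
        row, col = divmod(pixel, side)
        image[row][col] = to_intensity(values[a], means[a], sds[a])
    return image


# Four ratios on a 2 x 2 grid; means and SDs would come from the training panel.
image = render(
    values=[1.5, None, -10, 10],
    means=[1, 0, 0, 0],
    sds=[1, 1, 1, 1],
    layout=[3, 2, 1, 0],
    side=2,
)
print(image)  # [[255.0, 0.0], [128.0, 178.0]]
```

With a scale factor of 100, values more than roughly 1.27 standard deviations from the training mean saturate at 0 or 255, so the map emphasizes whether a ratio is abnormal more than how extreme it is.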
(2) Non-Financial Static Features ($x_{acute\_nonfin}$)
The non-financial static features form a single-channel intensity map with a shape of $[16, 16, 1]$.
  • First, sixteen non-financial indicators from T = 0 (such as ownership structure and macro pressure) are imaged and upsampled. They are selected based on pressure theory [32] and information asymmetry theory [33], extracted from three perspectives: macroeconomic environment (external pressure), industry and capital market (competitive pressure), and corporate internal governance (internal opportunity). As shown in Table 3, they aim to capture the structural context in which fraud can occur.
  • Second, to obtain a fixed and reproducible mapping, the 16 indicators are arranged into a fixed 4 × 4 seed matrix following the Table 3 order, so that indicators from the same perspective occupy adjacent cells. Each indicator is mapped to intensity using training-panel statistics (the same normalization/missing-value rule as in Equation (2)), with missing values assigned a neutral gray level. The seed matrix is then upsampled to 16 × 16 by nearest-neighbor replication (each seed cell expanded to a 4 × 4 block), which provides sufficient spatial support for convolution without introducing new information.
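The nearest-neighbor replication step can be shown in a few lines. This sketch assumes the $4 \times 4$ seed matrix already holds intensity values; the helper name `upsample_nearest` and the placeholder seed values are hypothetical.

```python
def upsample_nearest(seed, factor=4):
    """Expand every seed cell into a factor x factor block of identical values."""
    rows, cols = len(seed), len(seed[0])
    return [
        [seed[r // factor][c // factor] for c in range(cols * factor)]
        for r in range(rows * factor)
    ]


# Placeholder seed: cell (r, c) holds r * 4 + c instead of a real intensity.
seed = [[r * 4 + c for c in range(4)] for r in range(4)]
up = upsample_nearest(seed)
print(len(up), len(up[0]))  # 16 16
```

Every output pixel is a copy of exactly one seed cell, so the $16 \times 16$ map carries the same 16 values as the seed and only the spatial footprint changes.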
As another CNN input channel, this structured encoding enables the model to learn local co-occurrence patterns consistent with opportunity-related and pressure-related environments associated with fraud. These non-financial images, as theory-guided proxy features, are employed to quantify firm-level external constraints and governance structures.
(3)
Textual Static Features ($x_{acute\_text}$)
The textual static features form a $[1, 49]$ vector, built from the 49 interpretable text features in year $T = 0$.
  • The extraction process of text features for this vector is the same as for $x_{chronic\_text}$.
These vectors provide linguistic proxy features of obfuscation and rationalization tendencies at the time of reporting, such as the occurrence of high-risk keywords like related-party transactions.

3.2.3. Operational Alignment Between CTA Constructs and Features

The CTA framework conceptually distinguishes chronic motives from acute actions in the fraud process. To embed this process view into DF-FraudNet, the theoretical constructs are linked to observable features in a structured, proxy-based manner, rather than through direct measurement of individual psychological states. Managerial psychological states cannot be measured directly in a reliable way. A proxy-based design therefore allows the model to leverage large-sample, firm-level archival data while still maintaining clear conceptual alignment with the underlying motive- and action-related constructs. At the firm-year level, financial, non-financial, and textual features serve as theory-guided proxies for motive- and action-related signals.
In the chronic (motive) dimension, fraud motive is viewed as the gradual accumulation of performance pressure and narrative strain over multiple reporting periods. This construct is operationalized as a five-year multimodal sequence that tracks the joint evolution of financial anomalies and MD&A narratives, corresponding to the Financial Dynamic Features $x_{chronic\_fin}$ and Textual Dynamic Features $x_{chronic\_text}$. The financial dynamic features capture persistent trends in profitability, leverage, liquidity and other fundamentals that reflect mounting performance pressure, while the textual dynamic features capture sustained shifts in tone, uncertainty and risk-related themes in narrative disclosure that indicate growing narrative strain. These chronic features are interpreted as firm-level, theory-guided proxy signals for fraud-related motives that evolve over time and form the background against which fraudulent actions may eventually be taken.
In the acute (action) dimension, CTA focuses on the short-term execution of fraudulent actions in a specific reporting year. This construct is operationalized as a cross-sectional snapshot in year $T$ that combines imaged financial-ratio features, imaged non-financial governance and external environment features, and a textual feature vector derived from the MD&A, corresponding to the Financial Static Features $x_{acute\_fin}$, the Non-Financial Static Features $x_{acute\_nonfin}$, and the Textual Static Features $x_{acute\_text}$. Financial static features describe short-term performance outcomes and abnormal result patterns in the focal year, while non-financial static features characterize the internal governance and external environment. Together, these two types of static features define the opportunity-related configuration in the acute dimension, combining the payoff structure of misreporting with the institutional conditions under which misstatements can be implemented and concealed. Textual static features, such as complexity, tone and high-risk keywords, characterize disclosure style and approximate rationalization in narrative reporting. These acute features form the action signals in the CTA framework and summarize the contemporaneous configuration of incentives, opportunities and disclosure choices under which fraud actions are likely to be taken.
Taken together, the chronic and acute dimensions provide complementary, theory-guided feature representations of the latent constructs in CTA, clarifying how chronic motive and acute action are operationalized through measurable features within a predictive modeling framework.

3.3. Overall Architecture Implementation

Based on the Acute–Chronic Binary Feature Dimensions and the Decoupling-Fusion Identification Paradigm, this study designs a modular multimodal deep learning architecture (DF-FraudNet). As shown in Figure 1, the architecture contains two parallel, topologically heterogeneous identification modules, which are fused at the decision level.

3.3.1. Motive Identification Module: Chronic Channel

This module is dedicated to processing the chronic dimension and aims to capture motive accumulation signals, namely the dynamic evolutionary trends within the $T-4$ to $T-0$ window. This module employs a modality decoupling design, using parallel LSTM networks for the two time-series data types:
  • Financial LSTM Channel ($LSTM\_Financial$): Input $x_{chronic\_fin}$ with shape $[5, 576]$. This channel is specialized in modeling long-range dependencies in financial indicators, such as sustained declines in profitability or gradual increases in the asset-liability ratio.
  • Textual LSTM Channel ($LSTM\_Text$): Input $x_{chronic\_text}$ with shape $[5, 49]$. This channel focuses on modeling the evolution of managerial narrative strategies. For example, the model can learn a hidden trend of declining readability indices coupled with rising uncertainty frequencies year after year.
Mathematically, the motive identification process is:
$$\begin{aligned} H_{fin} &= LSTM_{fin}\left(x_{chronic\_fin};\ \theta_{fin}\right) \\ H_{text} &= LSTM_{text}\left(x_{chronic\_text};\ \theta_{text}\right) \end{aligned}$$
where $H_{fin}$ and $H_{text}$ represent the hidden state sequences extracted by the two LSTM channels, and $\theta_{fin}$ and $\theta_{text}$ are their respective network parameters.
The hidden states from the last time step ($T = 0$), $H_{fin\_T}$ and $H_{text\_T}$, are concatenated and fed to a fully connected classifier to output the probability of motive accumulation, $P_{Motive}$:
$$\begin{aligned} H_{chronic} &= H_{fin\_T} \oplus H_{text\_T} \\ P_{Motive} &= \sigma\left(w_{chronic} \cdot H_{chronic} + b_{chronic}\right) \end{aligned}$$
where $\oplus$ denotes the vector concatenation operation, $\sigma$ is the sigmoid activation function, and $w_{chronic}$ and $b_{chronic}$ are the weights and bias of the decision layer.
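The concatenate-then-sigmoid decision layer of the chronic module can be illustrated with a short sketch. It covers only the final step (last-step hidden states to $P_{Motive}$), not the LSTM recurrences themselves. The hidden sizes 256 and 64 follow the configuration reported in Section 4.1.2; the random hidden states, the untrained placeholder weights, and the function name `motive_probability` are assumptions for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def motive_probability(h_fin_last, h_text_last, weights, bias):
    """Concatenate the two last-step hidden states and apply a sigmoid decision layer."""
    h_chronic = h_fin_last + h_text_last            # list concatenation
    assert len(weights) == len(h_chronic)
    logit = sum(w * h for w, h in zip(weights, h_chronic)) + bias
    return sigmoid(logit)


rng = random.Random(0)
h_fin_last = [rng.uniform(-1, 1) for _ in range(256)]    # stand-in for the financial LSTM state
h_text_last = [rng.uniform(-1, 1) for _ in range(64)]    # stand-in for the textual LSTM state
weights = [rng.gauss(0, 0.05) for _ in range(256 + 64)]  # untrained placeholder weights
p_motive = motive_probability(h_fin_last, h_text_last, weights, bias=0.0)
```

With all-zero hidden states and zero bias the output is exactly 0.5, so any departure from 0.5 reflects evidence carried by the two hidden states.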

3.3.2. Action Identification Module: Acute Channel

This module is dedicated to processing the acute dimension and aims to lock onto action outbreak signals, that is, the multimodal abnormal configurations present in the static cross-sectional data of the fraud year ($T = 0$). This module employs three parallel, heterogeneous network channels to handle the three different modalities of the static snapshot:
  • Financial CNN Channel ($CNN\_Financial$): Input $x_{acute\_fin}$ with shape $[24, 24, 1]$. Leveraging the local receptive capability of a CNN, this channel captures synergistic anomaly patterns among financial indicators that a flattened vector cannot express, such as a local region where ROE and accounts receivable turnover pixels show extreme brightness simultaneously.
  • Non-Financial CNN Channel ($CNN\_NonFinancial$): Input $x_{acute\_nonfin}$ with shape $[16, 16, 1]$. This channel independently analyzes anomalous combinations of structural background features, such as corporate governance and macroeconomic pressure, to provide evidence for the existence of fraud opportunities.
  • Textual FNN Channel ($FNN\_Text$): Input $x_{acute\_text}$ with shape $[1, 49]$. The FNN directly learns which specific keywords (such as related-party transaction frequency) or sentiments (such as negative sentiment score) constitute acute fraud warning signals.
Mathematically, the action identification process is:
$$\begin{aligned} V_{fin} &= Flatten\left(CNN_{fin}\left(x_{acute\_fin};\ \varphi_{fin}\right)\right) \\ V_{nonfin} &= Flatten\left(CNN_{nonfin}\left(x_{acute\_nonfin};\ \varphi_{nonfin}\right)\right) \\ V_{text} &= FNN_{text}\left(x_{acute\_text};\ \varphi_{text}\right) \end{aligned}$$
where $\varphi$ denotes the network parameters of each channel, and the $Flatten$ operation converts the CNN’s output feature maps into 1D vectors, $V$.
The feature vectors extracted from the three channels are concatenated and passed to a final classifier to output the probability of action execution, $P_{Action}$:
$$\begin{aligned} V_{acute} &= V_{fin} \oplus V_{nonfin} \oplus V_{text} \\ P_{Action} &= \sigma\left(w_{acute} \cdot V_{acute} + b_{acute}\right) \end{aligned}$$
where $\oplus$ denotes vector concatenation, and $w_{acute}$ and $b_{acute}$ are the weights and bias of this module’s decision layer.
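To make the role of the $Flatten$ operation concrete, the following sketch does the shape bookkeeping for the two CNN channels. The two convolutional blocks and the final channel widths (128 for the financial channel, 64 for the non-financial channel) come from the configuration described in Section 4.1.2. The 'same' padding and the one 2 × 2 max-pool per block are assumptions made here, since the pooling layout is not stated in the text, and `flattened_size` is a hypothetical helper.

```python
def flattened_size(side, channels_out, n_blocks=2, pool=2):
    """Length of the flattened CNN output, assuming 'same'-padded convolutions
    (spatial size unchanged) followed by one pool x pool max-pool per block."""
    for _ in range(n_blocks):
        side //= pool
    return side * side * channels_out


v_fin_len = flattened_size(24, 128)      # 24 -> 12 -> 6 per side, 128 channels
v_nonfin_len = flattened_size(16, 64)    # 16 -> 8 -> 4 per side, 64 channels


def acute_vector_length(text_dim):
    """Length of the concatenated acute vector for a given FNN output width."""
    return v_fin_len + v_nonfin_len + text_dim
```

Under these assumptions the financial channel dominates the concatenated vector by length, which is one reason the width of the final decision layer matters for keeping the acute head compact.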

3.3.3. Decision Integration Mechanism: Resolving the Identification Paradox

This mechanism is the final structural response to CTA theory and the ultimate resolution of the identification paradox. This study avoids simple feature-level fusion, as that would reintroduce interference among the heterogeneous signals.
After the two independent identification modules (Motive and Action) output their respective probabilistic judgments, this study employs an adaptive decision-level integration mechanism. This mechanism employs a learnable weighted fusion formula to balance the importance of the two asynchronous pieces of evidence in the final decision, yielding the final fraud probability P F r a u d :
$$P_{Fraud} = \omega \cdot P_{Motive} + (1 - \omega) \cdot P_{Action}$$
where the weight $\omega \in [0, 1]$ is a learnable parameter during the model training process. This endows the model with high flexibility. The model can autonomously learn from the data and judge whether the long-term motive accumulation ($\omega \to 1$) or short-term action outbreak ($\omega \to 0$) provides stronger discriminatory evidence for a given type of fraud.
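As a small worked illustration of the fusion formula, the sketch below evaluates the convex combination at the two limits and at the midpoint. The probability values are made up for the example, and `fuse` is a hypothetical helper name.

```python
def fuse(p_motive, p_action, omega):
    """Convex combination of the two module probabilities."""
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    return omega * p_motive + (1.0 - omega) * p_action


p_motive, p_action = 0.9, 0.2                         # hypothetical module outputs
only_action = fuse(p_motive, p_action, omega=0.0)     # equals p_action
only_motive = fuse(p_motive, p_action, omega=1.0)     # equals p_motive
balanced = fuse(p_motive, p_action, omega=0.5)        # midpoint of the two
```

Because the result is always between the two inputs, the fused probability can never be more extreme than the more confident module.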
In sum, the Decoupling-Fusion System, through theory-driven heterogeneous decoupling and adaptive decision-level fusion, structurally ensures that the model remains sensitive to both chronic trends and acute outbreaks, thereby addressing the identification paradox at its root.

4. Experiments and Results

This section employs a rigorous and reproducible empirical protocol to address three connected questions:
(1)
Empirical existence of the identification paradox: Do single static or single dynamic models exhibit clear performance asymmetry when identifying acute and chronic fraud patterns?
(2)
Superiority of the Decoupling-Fusion Identification Paradigm: Does the DF-FraudNet framework proposed in this study achieve statistically significant gains over traditional shallow models, single-paradigm deep models, and alternative naive fusion architectures?
(3)
Coherence between theory and decisions: Are the model’s gains grounded in economically interpretable features that align with established theory?

4.1. Experimental Setup

4.1.1. Dataset and Sample Processing

This study selects Chinese A-share listed companies from 2001 to 2021 as the research sample, with 2021 as the cutoff to mitigate label incompleteness caused by enforcement and reporting lags in regulatory violation outcomes and to ensure complete five-year histories for the chronic module without temporal leakage. Structured data (financial, non-financial, and governance indicators) are sourced from the CSMAR database. Unstructured data (MD&A texts) are sourced from the CNRDS. Following domain practice, firms in the financial industry and observations with missing key features are excluded.
  • Fraud Samples (Positive Class): Strictly defined as company-year observations marked in the CSMAR violation database for one of the seven core financial fraud behaviors (“fictitious profits”, “fictitious assets”, “false records”, “delayed disclosure”, “major omissions”, “untrue disclosures”, and “fraudulent listing”).
  • Non-Fraud Samples (Negative Class): To mitigate potential contamination by “False Negatives,” a two-stage cleansing strategy is adopted. First, we remove all ST, *ST, non-standard audit opinions, and cases with non-financial violations to secure the baseline quality of the negative class. Second, we apply a conservative cross-check combining Benford’s law (for numeric manipulation) and Isolation Forest (for high-dimensional structural anomalies) to flag “gray samples” (risk flags, not labels) only when both criteria are met; these flags are used only within each training fold for feature screening and are not removed from training/validation/test evaluation sets.
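The Benford's-law half of the cross-check can be sketched as a first-digit conformity test. This is a minimal illustration rather than the screening rule actually used: the mean-absolute-deviation statistic, the 0.015 threshold, and the names `benford_mad` and `gray_flag` are assumptions, and the Isolation Forest result is passed in as a precomputed boolean rather than computed here.

```python
import math
from collections import Counter

# Expected first-digit shares under Benford's law: P(d) = log10(1 + 1/d).
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """Leading non-zero digit of |x|, or None when x is zero."""
    return next((int(ch) for ch in repr(abs(x)) if ch in "123456789"), None)


def benford_mad(values):
    """Mean absolute deviation between observed first-digit shares and Benford's law."""
    digits = [d for d in map(first_digit, values) if d is not None]
    counts = Counter(digits)
    return sum(abs(counts.get(d, 0) / len(digits) - BENFORD[d]) for d in range(1, 10)) / 9


def gray_flag(values, iso_is_outlier, mad_threshold=0.015):
    """Flag a 'gray sample' only when BOTH criteria are met (threshold is hypothetical)."""
    return benford_mad(values) > mad_threshold and iso_is_outlier


benford_like = [2 ** k for k in range(200)]                        # powers of 2 track Benford closely
uniform_like = [d * 10 ** k for d in range(1, 10) for k in range(3)]  # equal share per leading digit
```

Requiring both criteria makes the flag conservative: a firm whose figures deviate from Benford's law but whose overall feature profile is unremarkable is not flagged.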
Data are operationalized into two asynchronous representations:
  • Chronic Feature Representation ($T-4$ to $T$): a five-year financial sequence $x_{chronic\_fin} \in \mathbb{R}^{5 \times 576}$ and a five-year textual feature sequence $x_{chronic\_text} \in \mathbb{R}^{5 \times 49}$.
  • Acute Feature Representation ($T = 0$): a grayscale image of financial ratios $x_{acute\_fin} \in \mathbb{R}^{24 \times 24 \times 1}$, a grayscale image of non-financial indicators $x_{acute\_nonfin} \in \mathbb{R}^{16 \times 16 \times 1}$, and a textual vector $x_{acute\_text} \in \mathbb{R}^{1 \times 49}$.
In the following, $T$ denotes the focal calendar year of a firm-year sample (i.e., $T = 0$ in the relative timeline). Each firm-year sample is indexed by a focal year $T$ and uses time-ordered inputs from a five-year chronic window $T-4, \ldots, T$ together with an acute snapshot at year $T$. To support early windows while keeping the focal-year study period (2001–2021), raw accounting and governance records are collected from 1997 to 2021; years prior to 2001 are used only to construct historical inputs (not as prediction targets). A firm-year observation at year $T$ is constructed only when all required inputs in $T-4, \ldots, T$ are observed and consecutive for that firm. Windows are constructed strictly from observed historical data, without synthetic backfilling of pre-listing or unreported years.
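The window-eligibility rule can be expressed as a short filter over one firm's observed years. The function name `eligible_focal_years` and the example firm are hypothetical; the 2001–2021 focal range and the five-year span follow the description above.

```python
def eligible_focal_years(observed_years, first_focal=2001, last_focal=2021, span=5):
    """Focal years T for which every year in T-4..T is observed for one firm."""
    observed = set(observed_years)
    return [T for T in range(first_focal, last_focal + 1)
            if all(year in observed for year in range(T - span + 1, T + 1))]


# Hypothetical firm: records start in 1998 and the 2003 report is missing.
firm_years = [1998, 1999, 2000, 2001, 2002, 2004, 2005, 2006, 2007, 2008, 2009]
focal_years = eligible_focal_years(firm_years)
```

For this firm only 2002, 2008, and 2009 qualify: 2001 lacks the 1997 record, and every window touching 2003 is dropped rather than backfilled.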
Ultimately, 4845 fraud samples and 38,562 non-fraud samples were obtained. Using time-aware cross-validation, firm-year observations are ordered by the focal year $T$ and split chronologically into a development period (2001–2017) and an out-of-time test period (2018–2021). Within the development period, five forward-chaining folds are used: the training block always precedes the validation block, enforcing $\max(Year_{train}) < \min(Year_{val})$ (the exact fold-year ranges are provided in Supplementary Table S3), which ensures strictly forward-in-time validation by design.
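One way to generate forward-chaining folds that satisfy the ordering constraint is an expanding-window split over consecutive year blocks. The block boundaries produced below are an assumption for illustration only; the fold-year ranges actually used are those in Supplementary Table S3, and `forward_chaining_folds` is a hypothetical helper.

```python
def forward_chaining_folds(years, n_folds=5):
    """Expanding-window folds: fold k trains on blocks 0..k-1 and validates on block k."""
    years = sorted(set(years))
    n_blocks = n_folds + 1
    base, extra = divmod(len(years), n_blocks)
    blocks, start = [], 0
    for b in range(n_blocks):
        size = base + (1 if b < extra else 0)   # spread the remainder over early blocks
        blocks.append(years[start:start + size])
        start += size
    folds = []
    for k in range(1, n_blocks):
        train = [y for block in blocks[:k] for y in block]
        folds.append((train, blocks[k]))
    return folds


folds = forward_chaining_folds(range(2001, 2018))   # development period 2001-2017
```

Every fold validates on years strictly later than its training years, so a quick loop over `folds` can assert the constraint before any model is trained.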
Severe class imbalance is addressed with SMOTE applied only to the training data within each training block, balancing the classes to a 1:1 ratio; validation/test splits are never oversampled. All preprocessing and feature screening steps are fitted on each fold’s training block and then applied to the corresponding validation block and the held-out test set. The test set retains its original, imbalanced real-world distribution to objectively assess generalization ability and practical value.
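The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbors, can be shown with a from-scratch sketch. The study itself uses the imbalanced-learn library; the simplified function below is a stand-in written for illustration, with toy two-dimensional data and hypothetical names.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating toward a nearby minority row."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbors of x by squared Euclidean distance (x itself excluded).
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()                              # position along the segment x -> nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Toy training block: 4 fraud rows versus 10 non-fraud rows (only the count matters here).
train_pos = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
train_neg_count = 10
new_pos = smote(train_pos, n_new=train_neg_count - len(train_pos), k=2)
```

The function is called on the training block only. Validation and test rows are never passed to it, which is what keeps the evaluation distribution at its original base rate.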
Notably, adjacent focal-year samples may share historical inputs due to overlapping windows; however, all labels and fitted transformations are learned only from earlier focal years, and no information from years > T is used for a sample indexed by year T . Accordingly, evaluation remains strictly chronological and leakage-free.

4.1.2. Experimental Environment and Model Configuration

Experiments were conducted on Windows 11 with an NVIDIA RTX 4060 Ti (16 GB). The experiments were implemented in PyTorch 2.2.1 (Python 3.10.13), together with scikit-learn 1.4.2, imbalanced-learn 0.12.2, XGBoost 2.0.3, and Hugging Face Transformers 4.57.3. Table 4 summarizes the key parameter settings for feature screening, while Table 5 and Table 6 report the DF-FraudNet parameters and the training configuration.
The architectural hyperparameters in Table 5 were chosen to balance representational capacity and overfitting risk under extreme class imbalance and a short temporal horizon. For the acute image branches (24 × 24 financial inputs and 16 × 16 non-financial inputs), we adopted 3 × 3 kernels and two convolutional blocks as a standard configuration that increases the receptive field while avoiding excessive downsampling on small inputs. Channel widths were set to 64/128 (financial) and 32/64 (non-financial) to provide sufficient capacity for heterogeneous signals while controlling parameter count and training stability. For the chronic branches, we used a single-layer LSTM because the chronic representation spans only 5 time steps, where deeper recurrent stacks typically provide limited gains but increase overfitting risk; hidden sizes were set to 256 (financial sequence) and 64 (textual sequence) to match the input dimensionality (576 and 49) and keep the fusion head compact. Fusion MLP width was set to 128 to integrate branch-level representations while keeping the decision head lightweight. Given that this study primarily focuses on the causal temporal asynchrony decoupling-and-fusion mechanism and its effectiveness, we adopt commonly used settings and further conduct a lightweight robustness check, showing that the reported performance remains stable under reasonable variations of key hyperparameters.
In implementation, decision-level fusion is performed at the logit level. Let $s_{motive}$ and $s_{action}$ denote the logits output by the chronic and acute branches, respectively. The fused logit is computed as $s = \omega \, s_{motive} + (1 - \omega) \, s_{action}$. To enforce $\omega \in [0, 1]$, the weight is parameterized as $\omega = \sigma(\alpha)$ with a learnable scalar $\alpha \in \mathbb{R}$, initialized as $\alpha_0 = 0$ (thus $\omega_0 = 0.5$). The parameter $\alpha$ is optimized end-to-end jointly with all network parameters using Adam and BCEWithLogitsLoss computed on the fused logit $s$, ensuring stable gradient flow through the fusion mechanism. Early stopping is applied on validation AUPRC. The ablation with fixed $\omega = 0.5$ further supports the benefit of learning $\omega$.
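The sigmoid parameterization of the fusion weight and its gradient can be traced by hand. The sketch below trains only the scalar $\alpha$ on a toy batch of fixed branch logits, using plain gradient descent instead of Adam to stay dependency-free; the batch values, learning rate, and function names are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fused_logit(alpha, s_motive, s_action):
    omega = sigmoid(alpha)                       # keeps omega strictly inside (0, 1)
    return omega * s_motive + (1.0 - omega) * s_action


def bce_with_logits(s, y):
    """Numerically stable binary cross-entropy evaluated on a logit."""
    return max(s, 0.0) - s * y + math.log1p(math.exp(-abs(s)))


def grad_alpha(alpha, s_motive, s_action, y):
    """Chain rule: dL/ds * ds/d(omega) * d(omega)/d(alpha)."""
    omega = sigmoid(alpha)
    s = fused_logit(alpha, s_motive, s_action)
    return (sigmoid(s) - y) * (s_motive - s_action) * omega * (1.0 - omega)


# Toy batch of (s_motive, s_action, label); here the motive logit always has the right sign.
batch = [(2.0, -1.0, 1.0), (1.5, -0.5, 1.0), (-2.0, 0.5, 0.0)]
alpha = 0.0                                      # omega starts at 0.5
for _ in range(50):
    step = sum(grad_alpha(alpha, sm, sa, y) for sm, sa, y in batch) / len(batch)
    alpha -= 0.5 * step                          # plain gradient descent (the paper uses Adam)
omega = sigmoid(alpha)
```

Since the motive branch is the informative one in this toy batch, $\omega$ moves above 0.5 during training. The factor $\omega(1-\omega)$ in the gradient also shows why updates slow down as the weight approaches either limit.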

4.1.3. Evaluation Metrics

This study employs a set of standard binary classification evaluation metrics, including Accuracy, Precision, Recall, F1-Score, and Area Under the ROC Curve (AUC). Particularly, given the extreme class imbalance of the test set, this study regards the Area Under the Precision-Recall Curve (AUPRC) as one of the core evaluation metrics. Compared to AUC, AUPRC is more sensitive to the predictive performance on the minority class (fraud samples) and can more authentically reflect the model’s practical value in identifying real fraud cases.
To explicitly evaluate whether SMOTE increases false positives, we further report False Positive Rate (FPR) on the out-of-time test set.
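For reference, the two imbalance-sensitive quantities can be computed directly from labels and scores. The sketch uses average precision as a step-wise estimate of AUPRC, which is one common estimator and an assumption here (the text does not state which one is used). Tied scores are not given special treatment, and the toy labels and scores are made up.

```python
def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN), computed over the true negatives only."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn)


def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each true positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    true_positives, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            total += true_positives / rank
    return total / sum(y_true)


y_true = [1, 0, 1, 0, 0, 0, 0, 0]                      # 2 fraud rows among 8
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.15, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]        # hypothetical 0.5 threshold
fpr = false_positive_rate(y_true, y_pred)
ap = average_precision(y_true, scores)
```

In this toy case a single highly ranked negative lowers average precision to 5/6 while moving the FPR by only 1/6, which illustrates why the precision-based metric reacts more strongly to errors near the top of the ranking.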

4.1.4. Baseline Model Setup

To validate the superiority of DF-FraudNet, four groups of baselines are designed.
Group 1. Traditional ML Baselines
  • Purpose: Compare deep representation learning with feature-engineering paradigms.
  • Models: Logistic Regression (LR), Random Forest (RF), XGBoost.
Group 2. Single-Paradigm Deep Learning Baselines
  • Purpose: Provide the core empirical test of the identification paradox by contrasting static recognition (CNN) with dynamic recognition (LSTM) and by assessing the standalone predictive value of each modality.
  • Models: CNN-Fin ($T = 0$, acute financial images), LSTM-Fin ($T-4$ to $T$, chronic financial sequences), FNN-Text ($T = 0$, acute textual vectors), LSTM-Text ($T-4$ to $T$, chronic textual sequences).
Group 3. Alternative Multimodal Fusion Architectures
  • Purpose: Test whether the decision-level fusion in this study is structurally superior to naive early fusion and feature-level fusion variants.
  • Models: MM-MLP (early fusion, static), MM-LSTM (early fusion, temporal), DF-FraudNet (F) (feature-level fusion variant).
Group 4. External SOTA and Explainability Trade-off Baseline
  • Purpose: Quantify the performance of our “gray-box” interpretable text features against a “black-box” pretrained language model.
  • Model: FinBERT (text only).

4.1.5. Ablation Study Setup

To quantify the marginal contribution of each component in the DF-FraudNet framework, we design five ablation groups.
Group A. Module-Level Ablation
  • Variants: w/o Chronic Module (removes the chronic LSTM module); w/o Acute Module (removes the acute CNN-FNN module).
Group B. Modality-Level Ablation
  • Variants: w/o Financial Ratios (remove financial ratios); w/o Non-Financial Ratios (remove non-financial ratios); w/o Textual Features (remove textual features).
Group C. Architectural Design Ablation
  • Variants: CNN→FNN (replaces CNN with FNN for the acute branch, using flattened vectors), DualCNN→SingleCNN (combines financial and non-financial features into one image), DF-FraudNet ($\omega = 0.5$) (holds $\omega$ constant at 0.5 during both training and inference).
Group D. Sensitivity and Robustness Check
  • Variants: $Kernels = 5 \times 5$ (larger acute convolutional kernels, replacing 3 × 3 with 5 × 5 in the acute CNNs, with stride 1 and same padding); $Fusion\ head = 256$ (wider fusion head); $Dropout = 0.3$ (higher dropout in fusion heads); $Hidden\ sizes = 128/32$ (smaller chronic hidden sizes). All other settings are unchanged.
Group E. Imbalance-Handling Ablation
  • Variant: w/o SMOTE (trained on the original imbalanced training data, with all other settings unchanged).

4.2. Results and Analysis

4.2.1. Empirical Test of the Identification Paradox: Asymmetry Across Paradigms

The first logical step is to confirm the objective existence of the identification paradox. Accordingly, Table 7 reports fold-aggregated performance (μ ± σ) with 95% CIs for AUC and AUPRC.
To operationalize the paradox, we contrast a static recognition paradigm (CNN-Fin) with a dynamic recognition paradigm (LSTM-Fin), both in Group 2. As shown in Table 7, the dynamic recognition paradigm (LSTM-Fin, AUC = 0.938, AUPRC = 0.740) outperforms the static recognition paradigm (CNN-Fin, AUC = 0.914, AUPRC = 0.662) on the overall fold-aggregated evaluation. This is consistent with supervisory practice that continuous manipulation often unfolds across multiple periods, and therefore benefits from explicit temporal modeling.
To reveal performance asymmetry, the test set is further partitioned into acute fraud and chronic fraud. Acute fraud is operationalized as an isolated fraud episode with episode length = 1 year (i.e., a single-year fraud label without adjacent-year continuation), whereas chronic fraud is defined as a persistent episode with episode length ≥ 2 consecutive years. This split is deterministically derived from year-to-year fraud labels and evaluated using bootstrap 95% confidence intervals (CI) on the out-of-time test set (Table 8). Importantly, the five-year window described in Section 4.1.1 refers to the historical input span for feature construction, not the duration of a fraud episode.
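The acute/chronic partition can be derived mechanically from a firm's fraud-labelled years. In the sketch below, how episodes that straddle the start of the test period are counted is not specified in the text, so the example simply uses whatever list of labelled years is passed in; `episode_lengths` and `subset_label` are hypothetical helper names.

```python
def episode_lengths(fraud_years):
    """Map each fraud year to the length of the consecutive-year run it belongs to."""
    lengths, run = {}, []
    for year in sorted(set(fraud_years)):
        if run and year != run[-1] + 1:          # gap found: close the current run
            lengths.update({y: len(run) for y in run})
            run = []
        run.append(year)
    lengths.update({y: len(run) for y in run})   # close the final run
    return lengths


def subset_label(length):
    """Episode length 1 -> acute; length >= 2 consecutive years -> chronic."""
    return "acute" if length == 1 else "chronic"


# Hypothetical firm flagged in 2018 and again in 2020-2021.
labels = {year: subset_label(n) for year, n in episode_lengths([2018, 2020, 2021]).items()}
```

The isolated 2018 label lands in the acute subset, while 2020 and 2021 form one two-year episode and both land in the chronic subset.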
The static paradigm (CNN-Fin) performs better on the acute subset, showing higher sensitivity to abnormal configurations at time T = 0 . The dynamic paradigm (LSTM-Fin) achieves clear gains on the chronic subset, confirming its advantage in modeling cross-period evolution. Paired bootstrap on the out-of-time test set further supports this asymmetry (ΔAUC_acute = +0.026; ΔAUC_chronic = +0.047, where ΔAUC is computed between the better-performing paradigm and its counterpart within each subset). These findings strongly indicate that the identification paradox is an empirical bottleneck rather than a conjecture. Any single paradigm, static or dynamic, struggles to handle both temporally asynchronous fraud patterns at the same time. This provides direct evidence for the necessity and rationale of the decoupling-fusion architecture.
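A paired bootstrap for the AUC difference resamples test rows once per replicate and scores both models on the same resample, so the comparison is made on identical cases. The sketch below is a generic percentile version with made-up scores; the number of replicates, the interval method, and the function names are assumptions rather than the paper's exact procedure.

```python
import random


def auc(y_true, scores):
    """AUC as the share of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_delta_auc(y, scores_a, scores_b, n_boot=1000, seed=0):
    """Percentile 95% interval for AUC(a) - AUC(b) using shared resamples."""
    rng = random.Random(seed)
    n, deltas = len(y), []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        if 0 < sum(yb) < n:                      # a resample needs both classes
            deltas.append(auc(yb, [scores_a[i] for i in idx])
                          - auc(yb, [scores_b[i] for i in idx]))
    deltas.sort()
    return deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot) - 1]


y = [1] * 10 + [0] * 30
scores_a = [1.0] * 10 + [0.0] * 30                    # model A separates the classes
scores_b = [1.0] * 6 + [0.0] * 4 + [0.0] * 30         # model B misses 4 positives
lo, hi = paired_bootstrap_delta_auc(y, scores_a, scores_b, n_boot=200)
```

If the resulting interval excludes zero, the difference between the two paradigms on that subset is distinguishable from resampling noise at the chosen level.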

4.2.2. Overall Performance Comparison: Superiority of DF-FraudNet

On the full test set that includes all fraud types, DF-FraudNet is compared with all baselines. The experimental results are shown in Table 7.
DF-FraudNet attains the best mean performance on the core metrics (AUC = 0.967, F1 = 0.931), indicating a consistent overall advantage over the baselines.
Group 1: Compared with traditional ML baselines, DF-FraudNet delivers clear gains in both AUC and AUPRC, indicating that the proposed deep Decoupling–Fusion design captures cross-modal fraud signals more effectively than flattened feature engineering.
Group 3: This is the most critical comparison for validating our architectural design. DF-FraudNet (AUC = 0.967) outperforms its strongest architectural competitor, the feature-level fusion variant DF-FraudNet (F) (AUC = 0.962). This improvement is small in absolute magnitude but is evaluated under paired fold-wise testing (paired t-test on AUC across folds; p = [0.04], paired dz = [1.4]), suggesting it is not purely driven by stochastic variance. This supports our hypothesis that feature-level concatenation can reintroduce interference for causally and temporally asynchronous signals. In contrast, the decoupling-decision-level fusion strategy adopted in this study, which allows the two major modules to complete their inference independently ( P A c t i o n and P M o t i v e ) before high-level integration, is structurally superior for resolving the paradox.
Group 4: FinBERT (text only) performs strongly (AUC = 0.915), yet remains far behind our multimodal framework (AUC = 0.967). This indicates two points: firstly, fraud identification is inherently a multimodal problem, and text alone cannot capture the full picture; secondly, with structured data and a “gray-box” text channel, our approach exceeds the “black-box” model while preserving interpretability, achieving both high performance and explainability.

4.2.3. Ablation Study: Contribution of Internal Components

To quantify the marginal contribution of each component in DF-FraudNet, Table 9 reports fold-aggregated performance with statistical validity. Specifically, ΔAUC is computed against the full model, p-values are obtained from paired tests on fold-wise AUC differences and corrected within the table (Holm), and effect sizes are reported as Cohen’s dz on paired fold-wise differences.
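The effect size and the multiple-comparison correction named above are simple to compute from fold-wise values. The sketch covers Cohen's dz, the corresponding paired t statistic, and the Holm step-down adjustment. Converting the t statistic into a p-value requires a Student-t distribution function (for example from SciPy) and is left out; all input numbers are illustrative.

```python
import math
import statistics


def cohens_dz(full, ablated):
    """Cohen's dz: mean of the paired differences divided by their standard deviation."""
    diffs = [a - b for a, b in zip(full, ablated)]
    return statistics.mean(diffs) / statistics.stdev(diffs)


def paired_t_statistic(full, ablated):
    """Paired t statistic, which equals dz * sqrt(n) for n folds."""
    return cohens_dz(full, ablated) * math.sqrt(len(full))


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted, running_max = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * p_values[i])   # enforce monotonicity
        adjusted[i] = min(1.0, running_max)
    return adjusted


dz = cohens_dz([3.0, 4.0, 5.0], [1.0, 1.0, 1.0])      # differences 2, 3, 4
adjusted = holm_adjust([0.01, 0.04, 0.03])            # made-up raw p-values
```

With only five folds the t test has four degrees of freedom, so a large dz is needed before an adjusted p-value falls below conventional thresholds.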
The ablation results in Table 9 indicate that DF-FraudNet’s performance is attributable to the CTA-guided decoupling and decision-level fusion rather than a single tuned hyperparameter.
Group A: Removing either the Chronic Motive Module or the Acute Action Module leads to clear drops in performance (AUC decreases of 0.018 and 0.012, respectively). These reverse tests confirm that integrating both asynchronous signal sources enables DF-FraudNet to exceed all single-paradigm baselines and to resolve the paradox structurally.
Group B: Removing financial ratios produces the largest performance loss (AUC down by 0.055), showing that financial data are the cornerstone of fraud detection. Text and non-financial features provide indispensable complementary and contextual information.
Group C: Replacing the acute module’s CNN with an FNN degrades performance, which supports the imaging of structured ratios and the use of CNNs to capture local abnormal configurations. Fixing the learnable weight $\omega$ at 0.5 also hurts performance, which validates the adaptive decision-level fusion mechanism. The model must be allowed to learn whether motive accumulation or action signals carry more weight.
Group D: Perturbing key architectural hyperparameters within reasonable ranges yields only minor changes in performance. This pattern indicates that DF-FraudNet’s gains are not driven by a single finely tuned setting but are structurally attributable to the CTA-guided decoupling and the decision-level fusion design, which remain intact across these variations.
Group E: As shown in Table 10, this group evaluates the effect of synthetic oversampling on false positives by keeping DF-FraudNet unchanged and toggling SMOTE only during training, while computing all metrics on the same out-of-time test set under the original base rate. SMOTE increases FPR from 0.0057 to 0.0092 ($\Delta FPR = +0.0035$), but substantially improves minority detection (Recall 0.860 → 0.935; AUPRC 0.800 → 0.847). Paired bootstrap on the same test set suggests that the increase in false positives is distinguishable from sampling variability at the same operating threshold. This directly quantifies the false-alarm cost introduced by oversampling and verifies that gains are not driven by leakage, since validation/test data are never oversampled.

4.2.4. XAI Analysis: Decision Logic and Theoretical Coherence

The XAI analysis aims not only to assess the predictive reliability of DF-FraudNet but also to examine whether its decision logic is economically interpretable. We therefore apply XAI tools, including Class Activation Mapping (CAM), SHAP values and hidden-state visualisation, to open the black box of DF-FraudNet. The resulting visual and attribution patterns are used as a qualitative, correlational check of whether they are consistent with the directional predictions of the classic fraud triangle, thereby supporting an interpretation of the learned representations as theory-guided proxy signals related to pressure, opportunity and rationalization. Figure 2, Figure 3 and Figure 4 are presented as illustrative reading aids to show how the proposed XAI tools reveal DF-FraudNet’s decision cues at the sample level and improve transparency for theory-guided interpretation.
Acute Module (Action/Opportunity): As shown in Figure 2, CAM is applied to the financial CNN channel of the acute module. Pixel intensity can be interpreted as the relative contribution (salience) of the corresponding ratio–pixel to the model’s fraud score under the fixed ratio–pixel layout (brighter regions indicate stronger contribution within the same image). The resulting heatmaps indicate that the model attends to synergistic anomaly patterns rather than to isolated indicators. For example, regions associated with profitability (such as ROE) and asset quality (such as accounts receivable turnover) tend to be activated jointly. Such joint activation is consistent with cross-checking logic in auditing practice, where inconsistencies between reported profitability and asset quality are treated as red flags for elevated fraud risk, and, in the fraud triangle perspective, it can be viewed as reflecting conditions associated with the opportunity to misstate and conceal performance.
Acute Module (Action/Rationalization): As shown in Figure 3, SHAP values are applied to the gray-box FNN channel for textual features. The resulting explanations highlight high-risk keywords (such as Related Party Transactions and Performance Commitments) and readability-related obfuscation measures as influential contributors in the model’s risk assessment. This SHAP summary is used as an interpretability device to clarify which gray-box textual cues the model relies on at T = 0 , thereby supporting a theory-guided reading of rationalization-related signals. This pattern indicates that the model places greater weight on textual cues associated with managerial obfuscation and rationalization at time T = 0 , which is consistent with rationalization-related signals in the fraud triangle framework.
Chronic Module (Motive/Pressure): As shown in Figure 4, the hidden-state trajectory of the LSTM is visualised for an illustrative fraud case. In this example, the trajectory exhibits higher volatility or a persistent drift from $T-3$ to $T-1$ prior to the outbreak. The different coloured lines represent the time-series trajectories of different dimensions (hidden units) in the LSTM hidden-state vector, illustrating how each dimension evolves and drifts across time steps ($T-4$ to $T$). This trajectory plot is primarily used to illustrate how the chronic module encodes temporal dynamics and to facilitate qualitative diagnosis of learned representations. These dynamics can be interpreted, in a proxy sense, as being compatible with a motive-accumulation process under performance pressure, consistent with pressure-related signals in the fraud triangle framework.
Overall, the XAI analysis reveals attribution patterns that are consistent with the directional predictions of the fraud triangle for pressure, opportunity and rationalization. These findings suggest that DF-FraudNet’s learned representations can be interpreted, in a proxy sense, as pressure-, opportunity- and rationalization-related signals, and they provide qualitative support for the theoretical coherence of the CTA-guided design by offering transparent, sample-level reading aids for model decision cues.

4.2.5. Experimental Conclusion and Discussion

Overall, the results support the CTA-guided view that fraud evidence is temporally asynchronous and is better handled by a Decoupling–Fusion design than by a single modelling paradigm. DF-FraudNet achieves the strongest overall performance on the full task (AUC = 0.967), and ablations show that removing either the chronic motive module or the acute action module degrades performance, confirming that both evidence streams contribute. Financial ratios remain the backbone signal, while non-financial governance and MD&A cues provide complementary information that improves minority-class discrimination; the benefit of decision-level fusion is further supported by the fact that simplified or fixed fusion reduces performance.
In comparison with related studies, cross-sectional screening and static learners are typically effective at detecting within-year abnormal configurations but are less suited to representing multi-year accumulation, whereas sequential-only encoders better capture long-horizon deterioration but may attenuate outbreak-like spikes concentrated in a single year. The acute–chronic asymmetry observed in our experiments is consistent with this trade-off and motivates modelling the two signal types separately before integration. Relative to common multimodal pipelines that append text to structured predictors via early or representation-level fusion, our results suggest that decision-level fusion can be more robust when modalities are noisy and temporally misaligned, because each modality and time-scale is first processed by a specialized identifier. These findings should be interpreted as predictive and correlational evidence consistent with the CTA-guided design, and the XAI visualizations are intended as transparent, sample-level reading aids.
Regarding external validity, the Decoupling–Fusion architecture is largely market-agnostic, but cross-market transfer requires recalibration to accounting standards, disclosure regimes, enforcement intensity, and language conventions. In practice, applying the framework to another market would involve re-estimating preprocessing and imbalance handling, re-learning the ratio–pixel layout on the target data, and adapting the textual feature extractor to local reporting language and style, with out-of-market testing recommended prior to deployment.

5. Conclusions and Implications

5.1. Research Conclusions

The long-standing identification paradox in financial fraud detection arises because models fail to accommodate the asynchronous nature of fraud signals. This study proposes a framework that combines theoretical innovation with empirical effectiveness to address this challenge.
First, at the theoretical level, this study introduces the Causal–Temporal Asynchrony (CTA) theory for motive and action. Building on the classic fraud triangle and addressing the theoretical gap in data-driven models, CTA theory clarifies the binary heterogeneity of fraud signals by distinguishing fraud motive as a chronic latent process that accumulates across periods and fraud action as an acute event that may be triggered in a specific year as motive accumulates. This theory provides a conceptual foundation for addressing the identification paradox at its root. In this study, CTA serves as a theory-informed organizing principle for feature construction and model architecture, and it underpins the implementation of DF-FraudNet as a diagnostic model for corporate fraud.
Second, to operationalize CTA, this study first establishes the Decoupling–Fusion Identification Paradigm (DFIP) for asynchronous signals. Guided by DFIP, it then implements a Decoupling–Fusion System that encodes CTA as an Acute–Chronic Binary Feature Dimensions schema and conducts detection through Decoupling–Fusion FraudNet (DF-FraudNet). Within this system, structural decoupling is achieved by parallel LSTM networks that encode chronic motive-related sequences and parallel CNN–FNN networks that model acute action-related patterns, while an adaptive decision-level fusion provides high-level integration. In empirical tests, the framework achieves superior predictive performance (AUC = 0.967) relative to single-paradigm and naive-fusion baselines.
Finally, XAI is employed to assess the economic interpretability of the framework and to provide a qualitative check on its theoretical coherence at the level of proxy signals. In the chronic module, LSTM hidden-state fluctuations exhibit temporal patterns that are consistent with pressure-related signals; in the acute module, CAM applied to the financial CNN channel highlights synergistic anomaly configurations that are consistent with opportunity-related patterns; and SHAP values in the acute textual branch highlight obfuscation-related textual features that are consistent with rationalization in narrative reporting. These patterns indicate that DF-FraudNet’s predictive behavior is compatible with the fraud triangle perspective at the level of theory-guided proxy signals.

5.2. Implications

5.2.1. Theoretical Implications

The findings of this research provide several implications for the intersection of finance, accounting, and information science. First, they contribute to a shift in the financial fraud research paradigm from static correlation to dynamic, theory-guided process analysis. The CTA theory, together with the Decoupling-Fusion System, shifts the research lens from identifying what fraud signals are to analyzing how the fraud process unfolds, adding a dynamic process dimension that is consistent with the theorized motive–action process. Second, they suggest a general methodology for complex economic event prediction that combines dynamic evolution with static abnormality. Mirroring medical practice that joins long-term monitoring (chronic) with instantaneous diagnosis (acute), the framework could be extended to bankruptcy, credit default, and supply-chain disruption, where a similar latent-to-outbreak duality is present.

5.2.2. Managerial Implications

From a managerial and governance perspective, the framework offers a pragmatic early-warning and triage mechanism for stakeholders with heterogeneous capabilities. For regulators and audit teams, DF-FraudNet can serve as a prioritization tool that ranks firm-year observations by risk and allocates investigative resources to the most suspicious cases, particularly when the cost of missing major frauds is high. For boards, internal control functions, and compliance units, the chronic–acute decomposition helps separate persistent structural pressures from short-horizon action signals, enabling earlier intervention on controllable drivers before an outbreak year. For investors and analysts, the model’s multi-source evidence supports more structured due diligence by indicating which dimensions contribute most to an elevated risk flag.

5.2.3. Ethical Risks and Regulatory Implications

Because fraud detection is a socially consequential screening task, false positives may impose reputational and compliance costs, while false negatives may allow harmful misconduct to persist. Accordingly, the model should be deployed as decision-support for human review and prioritization rather than as an automated sanctioning instrument. To make the false-alarm cost explicit, the study reports FPR on the untouched out-of-time test set and notes the operational trade-off introduced by imbalance handling (e.g., SMOTE improving detection while moderately increasing FPR). Potential subgroup disparities across industries or firm types should be monitored through periodic checks of performance and probability reliability, with updates considered under distribution shift to maintain responsible regulatory use.

5.3. Future Research Directions

Several directions merit further work. At the theoretical level, this study does not develop a formal psychometric model of pressure, opportunity and rationalization; these constructs enter only as firm-level, theory-guided proxy features, and the linkage from chronic motives to acute actions has not been subjected to explicit causal assessment. Future research should address these limitations by employing more rigorous construct-validation and causal-inference methods to examine whether the proxy mappings and the motive-to-action process offer explanatory value beyond predictive performance.
At the data level, the empirical analysis is restricted to firm-level financial, governance and MD&A data from a single market. Future work could incorporate alternative data, such as supply-chain records, high-frequency trading data or satellite imagery, to build a more responsive and panoramic risk-monitoring system.
At the methodological level, the present Decoupling–Fusion architecture is implemented as a single predictive model for corporate fraud. Future studies may generalize this design to broader economic forecasting tasks and move towards a more unified latent–outbreak diagnostic framework for complex events. In addition, systematic perturbation of key architectural hyperparameters within reasonable ranges (e.g., larger acute kernels, wider channels, and smaller chronic hidden sizes), together with more extensive hyperparameter optimization and architecture search, may further improve performance.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/systems14010025/s1, Table S1: Financial Ratio Dictionary (K = 576); Table S2: Operational definitions and computation rules for the 49 textual dynamic features (MD&A); Table S3: Forward-chaining fold design in the development period (2001–2017).

Author Contributions

Conceptualization, S.L. and W.L.; methodology, W.L.; software, W.L.; validation, W.L., X.L. and S.L.; formal analysis, S.L.; investigation, X.L.; resources, S.L.; data curation, Z.L., Z.Q. and J.D.; writing—original draft preparation, W.L.; writing—review and editing, W.L.; visualization, W.L.; supervision, S.L.; project administration, S.L.; funding acquisition, S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 72271155) and the Eastern Talent Plan of Shanghai.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wen, F.; Lin, D.; Hu, L.; He, S.; Cao, Z. The Spillover Effect of Corporate Frauds and Stock Price Crash Risk. Financ. Res. Lett. 2023, 57, 104185. [Google Scholar] [CrossRef]
  2. Taylor-Neu, K.; Rahaman, A.S.; Saxton, G.D.; Neu, D. Tone at the Top, Corporate Irresponsibility and the Enron Emails. Accounting Audit. Account. J. 2024, 37, 336–364. [Google Scholar] [CrossRef]
  3. Peng, Z.; Yang, Y.; Wu, R. The Luckin Coffee Scandal and Short Selling Attacks. J. Behav. Exp. Financ. 2022, 34, 100629. [Google Scholar] [CrossRef]
  4. Achakzai, M.A.K.; Juan, P. Detecting financial statement fraud using dynamic ensemble machine learning. Int. Rev. Financ. Anal. 2023, 89, 1057–5219. [Google Scholar] [CrossRef]
  5. Al-Daoud, K.I.; Abu-AlSondos, I.A. Robust AI for Financial Fraud Detection in the GCC: A Hybrid Framework for Imbalance, Drift, and Adversarial Threats. J. Theor. Appl. Electron. Commer. Res. 2025, 20, 121. [Google Scholar] [CrossRef]
  6. Forough, J.; Momtazi, S. Ensemble of deep sequential models for credit card fraud detection. Appl. Soft Comput. 2021, 99, 106883. [Google Scholar] [CrossRef]
  7. Xie, Y.; Liu, G.; Yan, C.; Jiang, C.; Zhou, M. Time-aware attention-based gated network for credit card fraud detection by extracting transactional behaviors. IEEE Trans. Comput. Soc. Syst. 2023, 10, 1004–1016. [Google Scholar] [CrossRef]
  8. Hajek, P.; Abedin, M.Z.; Sivarajah, U. Fraud Detection in Mobile Payment Systems using an XGBoost-based Framework. Inf. Syst. Front. 2023, 25, 1985–2003. [Google Scholar] [CrossRef]
  9. Hosaka, T. Bankruptcy prediction using imaged financial ratios and convolutional neural networks. Expert Syst. Appl. 2019, 117, 287–299. [Google Scholar] [CrossRef]
  10. Cressey, D.R. Other People’s Money: A Study in the Social Psychology of Embezzlement; Free Press: Glencoe, IL, USA, 1953. [Google Scholar]
  11. Chen, K.C.W.; Yuan, H. Earnings Management and Capital Resource Allocation: Evidence from China’s Accounting-Based Regulation of Rights Issues. Contemp. Account. Res. 2004, 21, 557–603. [Google Scholar] [CrossRef]
  12. Bierstaker, J.; Brink, W.D.; Khatoon, S.; Thorne, L. Academic Fraud and Remote Evaluation of Accounting Students: An Application of the Fraud Triangle. J. Bus. Ethics 2024, 195, 425–447. [Google Scholar] [CrossRef]
  13. Boyle, D.M.; DeZoort, F.T.; Hermanson, D.R. The effect of alternative fraud model use on auditors’ fraud risk judgments. J. Account. Public Policy 2015, 34, 578–596. [Google Scholar] [CrossRef]
  14. Hogan, C.E.; Rezaee, Z.; Riley, R.A.; Velury, U.K. Financial statement fraud: Insights from the academic literature. Audit. A J. Pract. Theory 2008, 27, 231–252. [Google Scholar] [CrossRef]
  15. Karpoff, J.M.; Lee, D.S.; Martin, G.S. The Cost to Firms of Cooking the Books. J. Financ. Quant. Anal. 2008, 43, 581–611. [Google Scholar] [CrossRef]
  16. Beneish, M.D. The Detection of Earnings Manipulation. Financ. Anal. J. 1999, 55, 24–36. [Google Scholar] [CrossRef]
  17. Rtayli, N.; Enneya, N. Enhanced credit card fraud detection based on SVM-recursive feature elimination and hyper-parameters optimization. J. Inf. Secur. Appl. 2020, 55, 102596. [Google Scholar] [CrossRef]
  18. Yi, Z.; Cao, X.; Pu, X.; Wu, Y.; Chen, Z.; Khan, A.T.; Francis, A.; Li, S. Fraud detection in capital markets: A novel machine learning approach. Expert Syst. Appl. 2023, 231, 120760. [Google Scholar] [CrossRef]
  19. Xu, B.; Wang, Y.; Liao, X.; Wang, K. Efficient fraud detection using deep boosting decision trees. Decis. Support Syst. 2023, 175, 114037. [Google Scholar] [CrossRef]
  20. Zhang, X.; Guo, F.; Chen, T.; Pan, L.; Beliakov, G.; Wu, J. A Brief Survey of Machine Learning and Deep Learning Techniques for E-Commerce Research. J. Theor. Appl. Electron. Commer. Res. 2023, 18, 2188–2216. [Google Scholar] [CrossRef]
  21. Hilal, W.; Gadsden, S.A.; Yawney, J. Financial Fraud: A Review of Anomaly Detection Techniques and Recent Advances. Expert Syst. Appl. 2022, 193, 116429. [Google Scholar] [CrossRef]
  22. Li, F. Annual Report Readability, Current Earnings, and Earnings Persistence. J. Account. Econ. 2008, 45, 221–247. [Google Scholar] [CrossRef]
  23. Loughran, T.; McDonald, B. When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. J. Financ. 2011, 66, 35–65. [Google Scholar] [CrossRef]
  24. Bhattacharya, I.; Mickovic, A. Accounting Fraud Detection Using Contextual Language Learning. Int. J. Account. Inf. Syst. 2024, 53, 100682. [Google Scholar] [CrossRef]
  25. Baltrušaitis, T.; Ahuja, C.; Morency, L.-P. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–443. [Google Scholar] [CrossRef] [PubMed]
  26. Craja, P.; Kim, A.; Lessmann, S. Deep learning for detecting financial statement fraud. Decis. Support Syst. 2020, 139, 113421. [Google Scholar] [CrossRef]
  27. Lin, Y.-C.; Padliansyah, R.; Lu, Y.-H.; Liu, W.-R. Bankruptcy prediction: Integration of convolutional neural networks and explainable artificial intelligence techniques. Int. J. Account. Inf. Syst. 2025, 56, 100744. [Google Scholar] [CrossRef]
  28. Ravisankar, P.; Ravi, V.; Rao, G.R.; Bose, I. Detection of financial statement fraud and feature selection using data mining techniques. Decis. Support Syst. 2011, 50, 491–500. [Google Scholar] [CrossRef]
  29. Granger, C.W.J. Investigating causal relations by econometric models and cross-spectral methods. Econometrica 1969, 37, 424–438. [Google Scholar] [CrossRef]
  30. Anderson, J.C.; Gerbing, D.W. Structural equation modeling in practice: A review and recommended two-step approach. Psychol. Bull. 1988, 103, 411–423. [Google Scholar] [CrossRef]
  31. Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques; MIT Press: Cambridge, MA, USA, 2009. [Google Scholar]
  32. Agnew, R. Foundation for a General Strain Theory of Crime and Delinquency. Criminology 1992, 30, 47–88. [Google Scholar] [CrossRef]
  33. Akerlof, G.A. The Market for ‘Lemons’: Quality Uncertainty and the Market Mechanism. Q. J. Econ. 1970, 84, 488–500. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of the DF-FraudNet model.
Figure 2. CAM case study of financial ratio features for an illustrative fraud case.
Figure 3. SHAP importance analysis for acute textual features (Top 10).
Figure 4. Hidden state trajectory for an illustrative fraud case.
Table 1. Top 20 dynamic financial features (partial list).

| Indicator | Indicator |
|---|---|
| Net profit/Shareholders’ equity | Inventories/Cost of goods sold |
| Current assets/Current liabilities | Sales revenue/Total assets |
| Total liabilities/Operating costs | Net profit/Total assets |
| Operating cash flow/Net profit | Operating profit/Non-operating expenses |
| Total assets/Shareholders’ equity | Total liabilities/Total assets |
| Operating cash flow/Current liabilities | Net fixed assets/Construction in progress (net) |
| Net profit/Sales revenue | Long-term liabilities/Shareholders’ equity |
| Operating profit/Non-operating income | Sales revenue/Average accounts receivable |
| Other receivables/Total assets | Net cash flow from operating activities/Operating revenue |
| Operating revenue/Inventories | Operating profit/Interest expense |
Table 2. Textual features.

| Category | Feature names |
|---|---|
| Sentiment & Tone (5) | Negative sentiment frequency; Positive sentiment frequency; Uncertainty term frequency; Litigation term frequency; Sentiment balance (Pos-Neg) |
| Core topical keywords (29) | Term frequency of each of: “risk”; “uncertainty”; “liabilities”; “restructuring”; “risk exposure”; “decline”; “recession”; “accounting”; “transparency”; “integrity”; “compliance”; “disclosure”; “audit”; “governance”; “board of directors”; “management turnover”; “related-party transactions”; “related parties”; “guarantees”; “pledge”; “performance commitment”; “non-standard audit opinion”; “qualified audit opinion”; “asset impairment”; “goodwill impairment”; “provision for bad debts”; “inventory write-down”; “revenue recognition”; “financial restructuring” |
| Linguistic style & readability (5) | Text readability index; Text length; Average sentence length; Numeral ratio; Hedging/Modal term frequency |
| Temporal focus & narrative (10) | Past tense frequency; Future tense: planning frequency; Future tense: forecasting frequency; Future tense: outlook frequency; Temporal focus balance; Internal attribution frequency; External: macro frequency; External: policy/industry frequency; External: market frequency; Attribution balance |
Table 3. Non-financial static features.

| Category | Indicator |
|---|---|
| Macroeconomic environment | Unemployment rate = number of unemployed/labor force |
| | GDP growth rate = (current GDP − previous GDP)/previous GDP |
| | Tax revenue growth rate = (current tax revenue − previous tax revenue)/previous tax revenue |
| | Interest rate level = interest paid/principal |
| | Growth rate of fraud-penalty amount = (current fraud-penalty amount − previous fraud-penalty amount)/previous fraud-penalty amount |
| Corporate governance structure | Independent-director ratio = number of independent directors/total number of board members |
| | Largest-shareholder ownership ratio = shares held by the largest shareholder/total shares outstanding |
| | Management ownership ratio = shares held by management/total shares outstanding |
| | Board-size ratio = number of board members/total employees |
| | Executive compensation to net profit ratio = total executive compensation/net profit |
| | Shareholding concentration = total shares held by top 10 shareholders/total shares outstanding |
| | Employee turnover rate = number of departing employees/total employees |
| Industry & capital market | Industry concentration ratio = market share of top firms/total industry market share |
| | Deviation between actual performance and market expectations = (actual performance − market expectations)/market expectations |
| | Stock-price volatility = standard deviation of daily closing-price changes/average stock price |
| | Deviation from industry-average financial indicator = (firm’s financial ratio − industry average financial ratio)/industry average financial ratio |
Table 4. Feature screening parameter configuration.

| Module | Hyperparameters |
|---|---|
| Missingness filtering | p = 0.30 (line-item), q = 0.30 (ratio) |
| Preprocessing | RobustScaler (median/IQR) + MinMaxScaler [0, 1] |
| Point-biserial | Point-biserial effect-size filter: $r_{pb}$ threshold 0.40 |
| XGBoost | gbtree; n_estimators = 500, learning_rate = 0.05, max_depth = 6, subsample = 0.8, colsample_bytree = 0.8, min_child_weight = 1, reg_lambda = 1, reg_alpha = 0; scale_pos_weight = 7.9867411; eval_metric = aucpr |
| RFE-CV | base estimator = Logistic Regression (L2, class_weight = balanced); 5-fold CV; scoring = AUPRC; step = 0.05 |
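The point-biserial filter in Table 4 can be illustrated with a short Python sketch. The point-biserial coefficient is numerically the Pearson correlation between a continuous feature and a 0/1 label, which is what the function below computes. The screening rule itself (keep a feature when the absolute coefficient reaches 0.40), the function names, and the toy data are assumptions made for illustration; the source gives only the threshold value.

```python
from math import sqrt


def point_biserial(x, y):
    """Point-biserial correlation between a continuous feature x and a 0/1 label y.

    Computed as the Pearson correlation with a binary variable.
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant feature or single-class labels: no association
    return sxy / sqrt(sxx * syy)


def screen(features, y, threshold=0.40):
    """Keep feature names whose |r_pb| reaches the threshold (assumed rule)."""
    return [name for name, x in features.items()
            if abs(point_biserial(x, y)) >= threshold]


# Toy data (hypothetical): four firm-years, two candidate ratios.
labels = [0, 0, 1, 1]
toy_features = {
    "ratio_a": [1.0, 2.0, 3.0, 4.0],   # rises with the fraud label
    "ratio_b": [1.0, 2.0, 2.0, 1.0],   # unrelated to the fraud label
}
r_a = point_biserial(toy_features["ratio_a"], labels)
kept = screen(toy_features, labels)
```

Under this assumed rule only ratio_a survives the filter, because ratio_b has zero correlation with the label.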
Table 5. DF-FraudNet parameter configuration.

| Module | Component | Core Architecture/Hyperparameters |
|---|---|---|
| Acute Module | CNN_Financial | [Conv (64, 3 × 3) → MaxPool (2 × 2)] → [Conv (128, 3 × 3) → MaxPool (2 × 2)] → Flatten → FC (256) |
| | CNN_NonFinancial | [Conv (32, 3 × 3) → MaxPool (2 × 2)] → [Conv (64, 3 × 3) → GAP] → FC (64) |
| | FNN_Text | FC (64) → ReLU → FC (32) |
| | Acute Fusion FNN | (Input: 256 + 64 + 32) → FC (128) → ReLU → Dropout (0.2) → FC (1) (logit) |
| Chronic Module | LSTM_Financial | LSTM (input_size = 576, hidden_size = 256, num_layers = 1, dropout = 0.0; inactive for num_layers = 1 in PyTorch) |
| | LSTM_Text | LSTM (input_size = 49, hidden_size = 64, num_layers = 1, dropout = 0.0; inactive for num_layers = 1 in PyTorch) |
| | Chronic Fusion FNN | (Input: 256 + 64) → FC (128) → ReLU → Dropout (0.2) → FC (1) (logit) |
| Decision Fusion | Decision Layer | $P(\text{fraud}) = w \cdot P_{\text{motive}} + (1 - w) \cdot P_{\text{action}}$; $w$ is learnable |
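The decision layer in Table 5 is a convex combination of the two module probabilities. The sketch below shows that arithmetic in plain Python. It is a minimal illustration, not the authors’ code: the constraint that w lies in [0, 1], the parametrisation w = sigmoid(alpha), and all function names are assumptions, since the source states only that w is learnable.

```python
from math import exp


def sigmoid(z):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def fuse(p_motive, p_action, w):
    """Decision-level fusion: P(fraud) = w * P_motive + (1 - w) * P_action."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1] (assumed constraint)")
    return w * p_motive + (1.0 - w) * p_action


def fuse_from_logits(logit_motive, logit_action, alpha):
    """Assumed parametrisation: w = sigmoid(alpha), so w stays inside (0, 1)."""
    return fuse(sigmoid(logit_motive), sigmoid(logit_action), sigmoid(alpha))


# Hypothetical module outputs for one firm-year.
p_equal = fuse(0.80, 0.20, w=0.5)      # fixed equal weighting
p_chronic = fuse(0.80, 0.20, w=0.75)   # weight shifted towards the chronic module
p_neutral = fuse_from_logits(0.0, 0.0, 0.0)
```

With a fixed w = 0.5 the fused probability is the plain average of the two modules, which corresponds to the fixed-fusion variant in the ablation table; a learned w lets the data decide how much weight the chronic stream receives.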
Table 6. Training and optimization configuration.

| Category | Configuration |
|---|---|
| Objective & imbalance | Binary classification; BCEWithLogitsLoss. SMOTE (train only). |
| Optimization & schedule | AdamW, lr $1 \times 10^{-3}$, weight decay $1 \times 10^{-4}$; ReduceLROnPlateau (monitor: val AUPRC; factor 0.5; patience 3; min lr $1 \times 10^{-6}$). The fusion weight $w$ is optimized end-to-end jointly with all trainable parameters. |
| Training protocol & regularization | Batch size 256; max epochs 80; early stopping on val AUPRC (patience 10), restoring best checkpoint. Dropout (Table 5); L2 regularization via weight decay ($1 \times 10^{-4}$); gradient clipping (max norm 1.0). |
| Computational cost & resources | Single-GPU training on RTX 4060 Ti (16 GB) (Windows 11; PyTorch 2.2.1); no distributed training. Typical wall-clock runtime under early stopping is 5–10 min/fold (30–50 min for 5-fold CV), and inference latency is 0.5–3 ms/sample (batch dependent). |
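How the plateau schedule and early stopping in Table 6 interact can be seen by replaying a validation history through a simplified version of the two rules. The sketch below is an assumption-laden illustration: the function replay_schedule and the toy AUPRC history are hypothetical, the early-stopping convention (stop after 10 consecutive non-improving epochs) is one common reading of "patience 10", and PyTorch’s ReduceLROnPlateau additionally applies an improvement threshold and an optional cooldown that are omitted here.

```python
def replay_schedule(val_auprc, lr=1e-3, factor=0.5, lr_patience=3,
                    min_lr=1e-6, stop_patience=10, max_epochs=80):
    """Replay a validation-AUPRC history through plateau-based learning-rate
    decay and early stopping. Returns (best_epoch, stop_epoch, final_lr)."""
    history = val_auprc[:max_epochs]
    best_score, best_epoch = float("-inf"), -1
    bad_for_lr = 0      # non-improving epochs since last improvement or LR cut
    bad_for_stop = 0    # non-improving epochs since the best checkpoint
    stop_epoch = len(history) - 1
    for epoch, score in enumerate(history):
        if score > best_score:
            best_score, best_epoch = score, epoch
            bad_for_lr = bad_for_stop = 0
        else:
            bad_for_lr += 1
            bad_for_stop += 1
        if bad_for_lr > lr_patience:        # plateau detected: cut the learning rate
            lr = max(lr * factor, min_lr)
            bad_for_lr = 0
        if bad_for_stop >= stop_patience:   # stop; the best checkpoint is restored
            stop_epoch = epoch
            break
    return best_epoch, stop_epoch, lr


# Made-up history: three improving epochs, then a long plateau.
toy_history = [0.70, 0.75, 0.80] + [0.79] * 20
best_epoch, stop_epoch, final_lr = replay_schedule(toy_history)
```

On this made-up history the best checkpoint is epoch 2 (zero-based), the learning rate is halved twice before training stops at epoch 12, and the model from epoch 2 would be the one evaluated.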
Table 7. Comparative experiment results for financial fraud detection.

| Category | Model | Accuracy (μ ± σ) | Precision (μ ± σ) | Recall (μ ± σ) | F1 Score (μ ± σ) | AUC (μ ± σ; 95% CI) | AUPRC (μ ± σ; 95% CI) |
|---|---|---|---|---|---|---|---|
| Group 1 | LR | 0.870 ± 0.0052 | 0.800 ± 0.0143 | 0.690 ± 0.0161 | 0.741 ± 0.0124 | 0.880 ± 0.0082 [0.870, 0.890] | 0.564 ± 0.0314 [0.525, 0.603] |
| | RF | 0.889 ± 0.0041 | 0.835 ± 0.0127 | 0.725 ± 0.0142 | 0.776 ± 0.0109 | 0.905 ± 0.0064 [0.897, 0.913] | 0.635 ± 0.0261 [0.603, 0.667] |
| | XGBoost | 0.904 ± 0.0034 | 0.855 ± 0.0115 | 0.765 ± 0.0131 | 0.807 ± 0.0102 | 0.921 ± 0.0051 [0.915, 0.927] | 0.684 ± 0.0227 [0.656, 0.712] |
| Group 2 | CNN-Fin | 0.898 ± 0.0040 | 0.845 ± 0.0123 | 0.748 ± 0.0134 | 0.794 ± 0.0101 | 0.914 ± 0.0062 [0.906, 0.922] | 0.662 ± 0.0274 [0.628, 0.696] |
| | LSTM-Fin | 0.921 ± 0.0031 | 0.880 ± 0.0110 | 0.820 ± 0.0120 | 0.849 ± 0.0092 | 0.938 ± 0.0049 [0.932, 0.944] | 0.740 ± 0.0206 [0.714, 0.766] |
| | FNN-Text | 0.854 ± 0.0062 | 0.775 ± 0.0174 | 0.645 ± 0.0193 | 0.704 ± 0.0151 | 0.862 ± 0.0103 [0.849, 0.875] | 0.518 ± 0.0422 [0.466, 0.570] |
| | LSTM-Text | 0.865 ± 0.0055 | 0.792 ± 0.0162 | 0.670 ± 0.0184 | 0.726 ± 0.0140 | 0.878 ± 0.0091 [0.867, 0.889] | 0.559 ± 0.0387 [0.511, 0.607] |
| Group 3 | MM-MLP | 0.918 ± 0.0030 | 0.876 ± 0.0104 | 0.810 ± 0.0114 | 0.842 ± 0.0090 | 0.935 ± 0.0058 [0.928, 0.942] | 0.730 ± 0.0212 [0.704, 0.756] |
| | MM-LSTM | 0.932 ± 0.0027 | 0.895 ± 0.0101 | 0.840 ± 0.0110 | 0.867 ± 0.0085 | 0.948 ± 0.0048 [0.942, 0.954] | 0.775 ± 0.0190 [0.751, 0.799] |
| | DF-FraudNet (F) | 0.948 ± 0.0023 | 0.914 ± 0.0083 | 0.900 ± 0.0092 | 0.907 ± 0.0072 | 0.962 ± 0.0038 [0.957, 0.967] | 0.827 ± 0.0156 [0.808, 0.846] |
| Group 4 | FinBERT | 0.901 ± 0.0044 | 0.852 ± 0.0136 | 0.760 ± 0.0151 | 0.803 ± 0.0120 | 0.915 ± 0.0081 [0.905, 0.925] | 0.665 ± 0.0293 [0.629, 0.701] |
| Proposed model | DF-FraudNet | 0.956 ± 0.0024 | 0.927 ± 0.0076 | 0.935 ± 0.0083 | 0.931 ± 0.0064 | 0.967 ± 0.0037 [0.962, 0.972] | 0.847 ± 0.0148 [0.829, 0.865] |

Notes: All metrics are reported as μ ± σ over five forward-chaining folds. For AUC and AUPRC, 95% CIs are provided for the cross-fold mean (t distribution, degrees of freedom = 4).
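The interval construction described in the notes to Table 7 can be checked with a few lines of Python. The sketch assumes that σ is the sample standard deviation across the five folds and that the interval is the usual mean ± t · σ/√n with four degrees of freedom; the function name and the hard-coded critical value are introduced here for illustration.

```python
from math import sqrt

# Two-sided 95% critical value of Student's t with 4 degrees of freedom.
T_CRIT_975_DF4 = 2.776445


def fold_mean_ci(mean, sd, n_folds=5, t_crit=T_CRIT_975_DF4):
    """95% CI for a cross-fold mean: mean +/- t * sd / sqrt(n_folds)."""
    half_width = t_crit * sd / sqrt(n_folds)
    return mean - half_width, mean + half_width


# DF-FraudNet AUC row of Table 7: 0.967 +/- 0.0037.
auc_low, auc_high = fold_mean_ci(0.967, 0.0037)
# DF-FraudNet AUPRC row of Table 7: 0.847 +/- 0.0148.
auprc_low, auprc_high = fold_mean_ci(0.847, 0.0148)
```

Rounded to three decimals, these calls give [0.962, 0.972] for AUC and [0.829, 0.865] for AUPRC, in agreement with the intervals reported for DF-FraudNet.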
Table 8. Empirical test results for the Identification Paradox (AUC [95% CI]).

| Model Paradigm | Acute Fraud Identification | Chronic Fraud Identification |
|---|---|---|
| Action Identification Module (CNN-Fin) | 0.928 [0.915, 0.940] | 0.905 [0.889, 0.919] |
| Motive Identification Module (LSTM-Fin) | 0.902 [0.885, 0.916] | 0.952 [0.944, 0.959] |

Notes. 95% CIs are obtained via percentile bootstrap on the out-of-time test set (B = 2000).
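A percentile bootstrap of the kind named in the notes to Table 8 can be sketched as follows. This is a generic illustration under stated assumptions, not the authors’ evaluation script: resampling is done at the level of individual test cases, resamples that contain only one class are skipped, and the function names, seed, and toy scores are hypothetical.

```python
import random


def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample cases with replacement, recompute AUC."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that miss one of the classes
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lower = stats[int(round((alpha / 2) * n_boot))]
    upper = stats[int(round((1 - alpha / 2) * n_boot)) - 1]
    return lower, upper


# Toy test set (made-up scores): five fraud and five non-fraud firm-years.
toy_scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.65, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
point = auc(toy_scores, toy_labels)
ci_low, ci_high = bootstrap_auc_ci(toy_scores, toy_labels)
```

The pairwise AUC used here is quadratic in the sample size, which is acceptable for a sketch but would normally be replaced by a rank-based computation on a full test set.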
Table 9. DF-FraudNet ablation study results (Group A–D).

| Category | Variant | Accuracy (μ ± σ) | F1 Score (μ ± σ) | AUPRC (μ ± σ; 95% CI) | AUC (μ ± σ; 95% CI) | ΔAUC | p_Holm | $d_z$ |
|---|---|---|---|---|---|---|---|---|
| Group A | w/o Chronic module | 0.940 ± 0.0041 | 0.905 ± 0.0110 | 0.778 ± 0.0330 [0.737, 0.819] | 0.949 ± 0.0080 [0.939, 0.959] | −0.018 | 0.0209 | −3.25 |
| | w/o Acute module | 0.949 ± 0.0036 | 0.920 ± 0.0100 | 0.800 ± 0.0300 [0.763, 0.837] | 0.955 ± 0.0070 [0.946, 0.964] | −0.012 | 0.0468 | −2.55 |
| Group B | w/o Financial ratios | 0.913 ± 0.0068 | 0.857 ± 0.0170 | 0.656 ± 0.0450 [0.600, 0.712] | 0.912 ± 0.0120 [0.897, 0.927] | −0.055 | 0.0021 | −6.00 |
| | w/o Non-financial ratios | 0.952 ± 0.0037 | 0.926 ± 0.0100 | 0.804 ± 0.0280 [0.769, 0.839] | 0.956 ± 0.0070 [0.947, 0.965] | −0.011 | 0.0688 | −2.15 |
| | w/o Text features | 0.948 ± 0.0042 | 0.919 ± 0.0110 | 0.812 ± 0.0320 [0.772, 0.852] | 0.958 ± 0.0080 [0.948, 0.968] | −0.009 | 0.0844 | −1.95 |
| Group C | CNN→FNN | 0.946 ± 0.0040 | 0.914 ± 0.0110 | 0.800 ± 0.0310 [0.762, 0.839] | 0.955 ± 0.0070 [0.946, 0.964] | −0.012 | 0.0565 | −2.35 |
| | DualCNN→SingleCNN | 0.949 ± 0.0038 | 0.918 ± 0.0100 | 0.815 ± 0.0290 [0.779, 0.851] | 0.959 ± 0.0070 [0.950, 0.968] | −0.008 | 0.1041 | −1.75 |
| | DF-FraudNet ($w = 0.5$) | 0.953 ± 0.0030 | 0.927 ± 0.0090 | 0.831 ± 0.0250 [0.800, 0.862] | 0.963 ± 0.0060 [0.956, 0.970] | −0.004 | 0.3323 | −1.12 |
| Group D | Kernels = 5 × 5 | 0.955 ± 0.0028 | 0.930 ± 0.0080 | 0.842 ± 0.0210 [0.816, 0.868] | 0.965 ± 0.0050 [0.959, 0.971] | −0.002 | 0.4034 | −0.85 |
| | Fusion head = 256 | 0.956 ± 0.0024 | 0.931 ± 0.0080 | 0.848 ± 0.0190 [0.824, 0.872] | 0.967 ± 0.0050 [0.961, 0.973] | +0.000 | 1.0000 | +0.00 |
| | Dropout = 0.3 | 0.955 ± 0.0030 | 0.930 ± 0.0090 | 0.841 ± 0.0220 [0.814, 0.868] | 0.965 ± 0.0050 [0.959, 0.971] | −0.002 | 0.4034 | −0.75 |
| | Hidden sizes = 128/32 | 0.954 ± 0.0032 | 0.928 ± 0.0090 | 0.838 ± 0.0240 [0.808, 0.868] | 0.964 ± 0.0060 [0.957, 0.971] | −0.003 | 0.4034 | −0.95 |
| Proposed model | DF-FraudNet | 0.956 ± 0.0024 | 0.931 ± 0.0064 | 0.847 ± 0.0148 [0.829, 0.865] | 0.967 ± 0.0037 [0.962, 0.972] | - | - | - |

Notes. Metrics are reported as μ ± σ over five forward-chaining folds, with 95% CIs for AUC and AUPRC (t distribution, degrees of freedom = 4). p_Holm denotes Holm-adjusted p-values from paired tests on fold-wise AUC differences versus DF-FraudNet; $d_z$ denotes paired Cohen’s $d_z$.
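The two statistics named in the notes to Table 9 are simple to compute once fold-wise differences and raw p-values are available. The Python sketch below shows the standard definitions: paired Cohen’s d_z as the mean difference divided by the sample standard deviation of the differences, and the Holm step-down adjustment. The fold differences and raw p-values are made-up inputs, and the function names are illustrative; the paper’s raw p-values are not reported, so nothing here reproduces a specific table entry.

```python
from statistics import mean, stdev


def cohens_dz(diffs):
    """Paired effect size: mean of fold-wise differences over their sample SD."""
    return mean(diffs) / stdev(diffs)


def holm_adjust(p_values):
    """Holm step-down adjustment; returns adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


# Made-up fold-wise AUC differences (variant minus full model) for five folds.
toy_diffs = [-0.01, -0.02, -0.03, -0.02, -0.02]
dz = cohens_dz(toy_diffs)

# Made-up raw p-values for three comparisons.
adj = holm_adjust([0.01, 0.04, 0.03])
```

The monotonicity step explains why several variants can share the same adjusted p-value: a larger raw p-value can never receive a smaller adjusted value than one ranked before it. For a paired design with n folds, the paired t statistic equals d_z multiplied by √n.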
Table 10. DF-FraudNet ablation study results (Group E).

| Category | Variant | Accuracy [95% CI] | Precision [95% CI] | Recall [95% CI] | F1 Score [95% CI] | AUC [95% CI] | AUPRC [95% CI] | FPR [95% CI] |
|---|---|---|---|---|---|---|---|---|
| Group E | w/o SMOTE | 0.958 [0.956, 0.961] | 0.950 [0.943, 0.957] | 0.860 [0.849, 0.869] | 0.903 [0.894, 0.910] | 0.963 [0.958, 0.966] | 0.800 [0.784, 0.812] | 0.0057 [0.0048, 0.0066] |
| Proposed model | DF-FraudNet | 0.956 [0.953, 0.958] | 0.927 [0.920, 0.934] | 0.935 [0.927, 0.943] | 0.931 [0.925, 0.937] | 0.967 [0.962, 0.971] | 0.847 [0.833, 0.862] | 0.0092 [0.0080, 0.0106] |

Notes. 95% CIs are estimated by percentile bootstrap on the untouched out-of-time test set (B = 2000) under the fixed decision threshold specified in the evaluation protocol.
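The precision–recall trade-off in Table 10 can be read through the F1 identity and a simple alert-count translation. The sketch below recomputes F1 from the reported precision and recall and converts the FPR gap into extra false alarms for a hypothetical screening population of 10,000 non-fraud firm-years; that population size and the function name are assumptions made only for illustration.

```python
def f1_from(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Point estimates taken from Table 10.
f1_full = f1_from(0.927, 0.935)       # DF-FraudNet (with SMOTE)
f1_no_smote = f1_from(0.950, 0.860)   # w/o SMOTE variant

# Hypothetical screening volume, used only to express FPR as alert counts.
n_clean = 10_000
extra_alerts = round((0.0092 - 0.0057) * n_clean)
```

The recomputed F1 values round to 0.931 and 0.903, matching the table, and under the assumed volume the SMOTE-trained model would raise about 35 additional false alarms per 10,000 non-fraud observations in exchange for the higher recall.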
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, W.; Liu, X.; Li, Z.; Qin, Z.; Dong, J.; Li, S. A Decoupling-Fusion System for Financial Fraud Detection: Operationalizing Causal–Temporal Asynchrony in Multimodal Data. Systems 2026, 14, 25. https://doi.org/10.3390/systems14010025
