1. Introduction
In globalized capital markets, the sustainability of corporate governance has become the cornerstone of market resilience and stakeholder trust [1]. However, this cornerstone is being systematically eroded by increasingly sophisticated and covert financial fraud [2]. From Enron’s elaborately constructed off-balance-sheet entities to Luckin Coffee’s revenue inflation via complex transactional webs, modern fraud is no longer the manipulation of isolated indicators [3]. It has evolved into a toxic configuration woven from multiple pressures, governance deficiencies, and strategic narratives. This configurational risk pattern, deeply coupled with the extreme sparsity of fraud samples in the data space, poses a fundamental challenge to existing corporate governance diagnostic and risk-warning paradigms.
The first limitation of the current paradigm stems from its inherent atomistic, variable-based analytical approach. Whether exemplified by classical econometric tools such as M-Score [4] or by machine learning methods applied after feature concatenation such as eXtreme Gradient Boosting (XGBoost) [5], the essential operation remains the estimation of each risk variable’s marginal contribution. Such linear additivity cannot capture synergistic effects and non-linear emergence across modalities. For instance, these methods struggle to decode a typical governance-failure configuration in which sustained deterioration of objective financial conditions, structural defects in internal governance, and affective dissonance in managerial narratives co-resonate to amplify risk. This lack of holistic, systemic risk perception constitutes the core theoretical obstacle behind the performance bottleneck of conventional detectors.
The second limitation lies in the fragility of the data foundation. Fraud samples form a highly sparse and non-linear data manifold in the feature space [6]. This not only causes models to be systematically biased toward the majority class, creating a catastrophic risk of false negatives, but also imposes stringent domain-adaptability requirements on data augmentation techniques. The traditional Synthetic Minority Over-sampling Technique (SMOTE), due to its linear interpolation mechanism, cannot generate high-fidelity non-linear samples. Meanwhile, general-purpose, domain-agnostic Generative Adversarial Networks (GANs), lacking mathematical constraints on intrinsic economic logic, cannot guarantee joint distribution consistency across multimodal features [7]. They often produce samples that are statistically plausible but logically absurd, such as simultaneously generating high profitability and persistent negative operating cash flow, thereby introducing disguised noise rather than effective signals of governance failure into the model.
Accordingly, current intelligent diagnostics for financial fraud face two linked limitations: the restricted perspective of the analytic lens and the weak reliability of the data foundation. Against this background, this study aims to construct an intelligent multimodal diagnostic framework, SFG-2DCNN, that improves financial fraud detection and supports sustainable corporate governance by addressing these two limitations in an integrated way. Building on this objective, the study concentrates on two interrelated scientific challenges. The first is the Signal Logicalization challenge in generation, which concerns how to synthesize high-quality governance failure samples on a sparse manifold that are realistic, diverse, and economically coherent. The second is the Configuration Spatialization challenge in detection, which concerns how to move from shallow vector-based analysis to deep and precise recognition of multimodal risk configurations.
To systematically address the above challenges, the SFG-2DCNN framework is grounded in strain theory [8], information asymmetry theory [9], game theory [10], and configurational theory [11]. Building on these foundations, it operationalizes a multi-stage intelligent diagnostic workflow that progresses from joint distribution alignment to spatial configuration learning and is driven by two core mechanisms.
First, SFG-2DCNN addresses the Signal Logicalization challenge in the generation stage through joint distribution alignment. This mechanism is implemented by SMOTE-FraudGAN, a hybrid generative network. SMOTE-FraudGAN employs a multi-generator, divide-and-conquer strategy and uses the Joint Maximum Mean Discrepancy (JMMD) loss as a cross-modal regularizer. This joint distribution alignment enforces the statistical dependencies between heterogeneous features (finance, governance, and text) in their joint domain, ensuring the generated synthetic samples are simultaneously authentic, diverse, and logically self-consistent. For example, when the framework generates an objective financial signal such as high financial leverage, it simultaneously produces a corresponding subjective textual signal such as overly optimistic sentiment used for concealment. In this way, typical fraud patterns are reproduced and the quality of augmented samples is substantially improved.
Second, SFG-2DCNN addresses the Configuration Spatialization challenge in the detection stage through spatial configuration learning. This mechanism is implemented by a feature-topology mapping strategy together with a two-dimensional Convolutional Neural Network (2D-CNN). Guided by principles of economic theory and feature interaction efficiency, the feature-topology mapping strategy transforms high-information-density multimodal features, refined through a dual-screening mechanism, into an ordered two-dimensional feature matrix. This spatial configuration learning encodes abstract toxic configurations into local spatial patterns that can be learned by the 2D-CNN, thereby enabling precise capture of higher-order interactions through deep spatial fusion. For example, by placing the proportion of independent directors next to local text sentiment tones in the matrix, the convolutional kernels of the 2D-CNN can efficiently learn the spatial texture of governance deficiencies that resonate with emotional concealment.
The study’s contributions and implications are threefold. Theoretically, it explicitly defines and jointly addresses the coupled Signal Logicalization and Configuration Spatialization challenges in financial fraud detection and advances the paradigm from variable-based analysis to multimodal configurational analysis within a concrete multi-stage framework. Managerially, the SFG-2DCNN framework provides boards, auditors, and regulators with an intelligent diagnostic tool that prioritizes high-risk firms and reveals covert risk configurations to support more targeted oversight. Socially, the findings support sustainable corporate governance by reducing undetected fraud, strengthening investor protection, and reinforcing trust in capital markets.
The remainder of this paper is organized as follows. Section 2 reviews the related literature on financial fraud detection and multimodal learning. Section 3 presents the theoretical foundations and the proposed SFG-2DCNN diagnostic framework. Section 4 describes the experimental design and reports the main empirical results. Finally, Section 5 concludes with a discussion of implications and directions for future research.
3. Sustainable-Governance Diagnostic Framework: From Theoretical Mapping to Algorithmic Implementation
To systematically respond to the deeply coupled challenges of Signal Logicalization and Configuration Spatialization mentioned earlier, this study constructs an end-to-end intelligent diagnostic framework. The framework follows the core principle of theory-guided algorithm design, aiming to progressively map abstract theoretical insights about governance failures into executable and verifiable algorithmic modules, thereby achieving a rigorous transition from theory to practice.
3.1. Theoretical Background
As shown in Figure 1, this diagnostic framework is grounded in an integrated perspective that combines strain theory, information asymmetry theory, game theory, and configurational theory. This integrated perspective provides a solid theoretical basis for all subsequent modeling choices and ensures the internal logical consistency of the entire framework.
Strain theory reveals the motivational roots of why firms deviate from a sustainable development path. The structural pressures organizations face in achieving legitimate goals are a key driver for them to adopt short-sighted or even fraudulent behaviors to maintain an appearance of prosperity. This perspective guides the framework in operationalizing the “pressure” construct into a multi-dimensional feature set covering both internal financial distress and external macroeconomic fluctuations.
Information asymmetry theory clarifies the window of opportunity for governance failures to occur. Management can exploit its informational advantage for impression management, undermining the transparency and accountability that are the cornerstones of sustainable governance. This perspective directly guides the framework to analyze textual narratives such as MD&A disclosures in depth, to capture contradictory signals including “affective dissonance,” and to consider corporate governance structures as a key internal mechanism for regulating information asymmetry.
Game theory frames fraud as a strategic game that erodes market trust, providing crucial inspiration for algorithm design. It not only provides theoretical support for using GANs, an algorithmic simulation of a game process, but also guides feature engineering to systematically mine non-traditional, highly covert financial indicators. This design counters fraudsters’ strategic evasion of traditional monitoring targets, thereby enhancing the adversarial robustness of the diagnostic framework.
Configurational theory, as the overarching paradigm, encourages this study to transition from variable-based thinking to configurational thinking. It posits that fraud is not determined by the net effect of a single indicator but by a causal recipe composed of multiple conditions such as pressure, opportunity, and specific signals. This perspective operationalizes fraud as a toxic configuration, thus providing a direct methodological justification for using a 2D-CNN, an architecture naturally suited for identifying local spatial patterns.
3.2. Multi-Stage Diagnostic Process
Based on the theoretical framework described above, this study designs a three-stage systematic diagnostic process: Decoding Governance Failure, Joint Distribution Alignment, and Spatial Configuration Learning. The aim is to operationalize abstract theories into algorithms that solve the core challenges of financial fraud detection.
3.2.1. Stage 1: Decoding Governance Failure—Constructing a Multimodal Signal Blueprint
The objective of this stage is to refine a multimodal signal configuration that can comprehensively describe the risk of governance failure. To counter fraudsters’ strategic evasion of traditional indicators, this study follows the theory of combined feature selection [43] and designs a dual-screening process that balances predictive performance with economic interpretability.
- (1)
Structured Features (Structural-Fin; Structural-NonFin)
The structured dataset consists of financial and non-financial ratios. The financial part is based on the three major financial statements. After numerical processing, candidate combinations were generated at scale by dividing pairs of line items, creating a pool covering profitability, solvency, and operational efficiency. From this pool, initial candidate ratios with clear business meanings and adequate sample coverage were selected. The non-financial part was constructed from corporate governance, macroeconomic, and market-environment data and was standardized into initial candidate ratios covering factors such as shareholding structure, market ecosystem, internal controls, and regional risks.
To ensure the quality of the final feature set, this study employs a statistics–machine learning collaborative screening mechanism for structured features:
Statistical pre-screening (point-biserial correlation): The point-biserial correlation coefficient ($r_{pb}$) is used to quickly eliminate features with a weak statistical correlation to the fraud label. We adopted a standard empirical threshold on $|r_{pb}|$, representing a moderate-to-strong linear association, to effectively filter out low-relevance noise features and ensure that the retained variables possess a fundamental connection to fraud mechanics before non-linear refinement.
Machine learning refinement (XGBoost algorithm): All structured features that pass the pre-screening are fed into an XGBoost model for secondary refinement. By ranking features based on their information gain in XGBoost, a subset of features with the strongest non-linear discriminatory power and lowest redundancy is selected.
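As an illustration of this two-pass screening, the following Python sketch applies point-biserial pre-screening followed by XGBoost gain ranking. The DataFrame `df` of pre-cleaned candidate ratios, the threshold `r_threshold`, and the cut-off `top_k` are illustrative assumptions, not the paper's exact settings.

```python
import pandas as pd
from scipy.stats import pointbiserialr
from xgboost import XGBClassifier

def dual_screen(df: pd.DataFrame, y, r_threshold: float = 0.3, top_k: int = 150) -> list[str]:
    # Pass 1: point-biserial pre-screening against the binary fraud label.
    kept = []
    for col in df.columns:
        r, _ = pointbiserialr(y, df[col])
        if abs(r) >= r_threshold:
            kept.append(col)
    # Pass 2: XGBoost refinement -- rank the survivors by information gain.
    model = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")
    model.fit(df[kept], y)
    gain = model.get_booster().get_score(importance_type="gain")
    return sorted(gain, key=gain.get, reverse=True)[:top_k]
```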
Through this process, 134 structured financial indicators and 16 non-financial indicators were ultimately selected. Together, they form an objective portrait and structured context for the occurrence of fraud. These features include not only classic financial indicators such as ROE and the current ratio but also non-traditional, highly covert warning signals such as Total Liabilities/Operating Costs, as well as features related to the company’s internal and external environment, such as macroeconomic pressures and corporate governance deficiencies.
- (2)
Unstructured Features (Semantic/Tone)
The original MD&A texts were subjected to standardized processing, including denoising, normalization, and sentence-level segmentation.
Higher-order semantic feature extraction (Semantic): To decode the latent risks within the text, a FinBERT model pre-trained on financial-domain data was used to encode the MD&A text into 768-dimensional contextual embedding vectors. This method can accurately capture risk signals in complex financial contexts (e.g., the true meaning of a debt default), providing a higher-order, dense representation of the text’s core content.
Quantifying affective dissonance (Tone): To operationalize the impression-management strategy, this study concurrently employs the Loughran–McDonald financial dictionary to quantify the sentiment and tone of the text, generating a 7-dimensional feature (negative, positive, uncertainty, litigious, net sentiment score, and strong and weak modality). This feature aims to capture the contradiction between objective financial performance and management’s subjective narrative (the phenomenon of affective dissonance), providing crucial leading information for identifying deceptive intent.
The 768-dimensional semantic representation provides high-dimensional, context-sensitive semantic signals, while the 7-dimensional tone vector offers a low-dimensional, highly interpretable supervisory signal that captures key features for judging deceptive behavior.
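A minimal sketch of these two text channels is given below; it assumes the publicly available `yiyanghkust/finbert-tone` checkpoint as a stand-in for the paper's FinBERT model and uses tiny illustrative word sets in place of the full Loughran–McDonald lists (only five of the seven tone dimensions are shown).

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("yiyanghkust/finbert-tone")   # assumed checkpoint
bert = AutoModel.from_pretrained("yiyanghkust/finbert-tone")

def semantic_vector(mda: str) -> torch.Tensor:
    # 768-dimensional contextual embedding: mean-pool the last hidden states.
    inputs = tok(mda, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state        # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)                 # (768,)

# Tiny illustrative stand-ins for the Loughran-McDonald category word lists.
LM = {"negative": {"loss", "decline"}, "positive": {"growth", "strong"},
      "uncertainty": {"may", "approximately"}, "litigious": {"lawsuit"}}

def tone_vector(mda: str) -> list[float]:
    words = mda.lower().split()
    n = max(len(words), 1)
    freq = {k: sum(w in v for w in words) / n for k, v in LM.items()}
    net = freq["positive"] - freq["negative"]            # net sentiment score
    return [*freq.values(), net]
```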
- (3)
Final Multimodal Feature Set
In summary, the multimodal feature vector in this study is formed by concatenating four feature blocks from three modalities:

$$x = \big[\, x_{\text{fin}};\; x_{\text{nonfin}};\; x_{\text{semantic}};\; x_{\text{tone}} \,\big] \in \mathbb{R}^{925}$$
3.2.2. Stage 2: Joint Distribution Alignment—Reconstructing Logically Consistent Governance Failure Scenarios
This stage aims to address the Signal Logicalization challenge caused by data scarcity through the SMOTE-FraudGAN hybrid generative model. The core idea is to ensure that the generated governance failure scenarios (synthetic samples) are not only statistically realistic but also logically self-consistent from an economic perspective.
SMOTE-FraudGAN combines the distribution-smoothing capability of SMOTE with the non-linear generative power of Joint-FraudGAN:
Step 1: SMOTE Pre-smoothing and Manifold Regularization
Before directly applying the GAN, this stage first uses the SMOTE technique to perform an initial oversampling of the original scarce fraud samples $x_i$. By randomly selecting one of its $k$-nearest neighbors $\hat{x}_i$ for linear interpolation, a small number of synthetic samples $x_{\text{new}}$ are generated as:

$$x_{\text{new}} = x_i + \delta \cdot (\hat{x}_i - x_i), \quad \delta \sim U(0, 1)$$
This step provides an optimized initial distribution for the subsequent Joint-FraudGAN, effectively accelerating convergence and significantly reducing the risk of mode collapse during the initial training phase.
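A minimal sketch of this pre-smoothing step with imbalanced-learn follows; `sampling_strategy` and `k_neighbors` are illustrative settings, not the paper's calibrated values.

```python
from imblearn.over_sampling import SMOTE

def presmooth(X_train, y_train, ratio: float = 0.3, k: int = 5):
    # Interpolates x_new = x_i + delta * (x_hat_i - x_i) along k-NN segments
    # of the minority fraud class, giving the GAN a denser starting manifold.
    smote = SMOTE(sampling_strategy=ratio, k_neighbors=k, random_state=42)
    return smote.fit_resample(X_train, y_train)
```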
Step 2: Joint-FraudGAN Deep Refinement and Joint Distribution Alignment
This is the core stage for achieving joint distribution consistency across feature domains, where the samples initially augmented by SMOTE are deeply optimized and refined. The overall process is shown in Figure 2.
- (1)
Multi-Generator Module and Multi-Discriminator Module
As shown in Table 1, since a single generator struggles to learn the complex, heterogeneous feature distribution of financial fraud samples, this study introduces a divide-and-conquer strategy. It decomposes the multimodal feature generation task, based on economic priors, among seven parallel, domain-specialized generators. This structure significantly reduces the learning difficulty for each generator, allowing them to capture the highly cohesive distributions within their respective domains more stably and finely. All modules also receive random noise and the fraud label as conditional input (i.e., a CGAN architecture) to enhance the diversity and class adherence of the generated samples.
As shown in Table 2, the Joint-FraudGAN comprises five discriminator modules to enforce comprehensive constraints. Beyond the standard Base Discriminator for overall authenticity, this study introduces three Feature Domain Discriminators to ensure intra-domain consistency across feature subsets. Crucially, a Multi-task Discriminator goes beyond standard outputs by simultaneously verifying class correctness (i.e., that the generated label matches the conditional fraud label) and cycle consistency. This dual-check mechanism prevents the generation of statistically plausible but logically absurd samples (i.e., disguised noise), ensuring that the framework captures the specific toxic configurations of governance failure.
- (2)
Joint Constraint Loss Functions and Diversity Enhancement Loss Functions
To fundamentally resolve the logical conflict between cross-modal features, this study introduces the JMMD loss $\mathcal{L}_{\text{JMMD}}$ to constrain the generators. JMMD measures the difference between the joint probability distributions of generated and real samples in a Reproducing Kernel Hilbert Space, forcing the generator to learn and reproduce the true fraud patterns that arise from the combined effect of all feature modalities. For instance, it ensures that a generated sample with high debt (a structured financial signal) is accompanied by a low frequency of uncertainty words (a textual tone signal used to conceal risk), reflecting their logical association. Its core formula is as follows:

$$\mathcal{L}_{\text{JMMD}} = \left\| \mathbb{E}_{x \sim P_{r}}\big[\phi(x)\big] - \mathbb{E}_{\tilde{x} \sim P_{g}}\big[\phi(\tilde{x})\big] \right\|_{\mathcal{H}}^{2}$$

where $\phi(\cdot)$ represents the feature mapping into the RKHS $\mathcal{H}$, $x \sim P_{r}$ are real samples, and $\tilde{x} \sim P_{g}$ are generated samples. Minimizing $\mathcal{L}_{\text{JMMD}}$ forces the generated distribution $P_{g}$ to converge to the true distribution $P_{r}$ across the entire joint domain, thereby ensuring that the generated fraud samples are logically self-consistent within the domain.
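For concreteness, the sketch below computes an empirical MMD-style loss over concatenated multimodal feature vectors with a Gaussian kernel; the paper's exact kernel family and bandwidths are not specified here, so both are assumptions.

```python
import torch

def gaussian_kernel(a: torch.Tensor, b: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # k(a, b) = exp(-||a - b||^2 / (2 * sigma^2)) for all pairs of rows.
    d2 = torch.cdist(a, b).pow(2)
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd_loss(real: torch.Tensor, fake: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # ||E[phi(x)] - E[phi(x_tilde)]||^2 in the RKHS, via the kernel trick.
    k_rr = gaussian_kernel(real, real, sigma).mean()
    k_ff = gaussian_kernel(fake, fake, sigma).mean()
    k_rf = gaussian_kernel(real, fake, sigma).mean()
    return k_rr + k_ff - 2 * k_rf
```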
To address the common issue of mode collapse in GANs, this study also introduces a diversity loss $\mathcal{L}_{\text{div}}$ as an additional constraint. $\mathcal{L}_{\text{div}}$ forces the generator $G$ to map changes in the input noise $z$ to significant changes in the generated sample $G(z)$ by maximizing the mutual information $I(z; G(z))$. Its core formula is as follows:

$$\mathcal{L}_{\text{div}} = -\, I\big(z;\, G(z)\big)$$

Minimizing $\mathcal{L}_{\text{div}}$ ensures that small variations in the input noise lead to significant differences in the output samples, thereby effectively broadening the distribution of generated samples in the feature space and greatly enhancing the diversity and representativeness of the synthetic fraud samples.
- (3)
Multi-task Discriminator Loss Function
The task of the discriminator is to maximize its ability to distinguish between real and generated samples while ensuring that the generated samples have the correct fraud label. This study introduces a multi-task discriminator loss $\mathcal{L}_{\text{MT}}$ to drive the Multi-task Discriminator to perform two key tasks: first, class discrimination (classification), ensuring that the generated sample’s label $\tilde{y}$ is correct (i.e., $\tilde{y} = y$); and second, cycle-consistency verification, assessing whether the reconstructed representation $\hat{x}$ remains consistent with the original input $x$. The core formula is as follows:

$$\mathcal{L}_{\text{MT}} = \mathcal{L}_{\text{BCE}}(\tilde{y},\, y) + \gamma \left\| \hat{x} - x \right\|_{2}^{2}$$

where $\mathcal{L}_{\text{BCE}}$ is the binary cross-entropy loss and $\gamma$ controls the weight of the cycle-consistency constraint.
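A hedged PyTorch sketch of this loss follows; the head outputs, the reconstruction pathway, and the default $\gamma$ are placeholders for the paper's actual module interfaces.

```python
import torch
import torch.nn.functional as F

def multitask_d_loss(class_logits: torch.Tensor, labels: torch.Tensor,
                     x_reconstructed: torch.Tensor, x_original: torch.Tensor,
                     gamma: float = 0.1) -> torch.Tensor:
    cls = F.binary_cross_entropy_with_logits(class_logits, labels)  # label correctness
    cyc = F.mse_loss(x_reconstructed, x_original)                   # cycle consistency
    return cls + gamma * cyc
```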
- (4)
Basic Generative Adversarial Loss Function and Total Loss Function
For each generator $G_i$ and discriminator $D_j$, the generator tries to create samples similar to the real data, while the discriminator tries to distinguish between generated and real samples. The basic GAN loss is formulated as:

$$\mathcal{L}_{\text{GAN}}(G_i, D_j) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D_j(x)\big] + \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D_j(G_i(z))\big)\big]$$

where $x$ is a sample from the real data distribution $p_{\text{data}}$ and $z$ is the noise input. The generator $G_i$ aims to maximize the probability of the discriminator making a mistake, which is equivalent to minimizing $\log\big(1 - D_j(G_i(z))\big)$.

During the training process, the total loss is formulated as:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{GAN}} + \lambda_{1}\, \mathcal{L}_{\text{JMMD}} + \lambda_{2}\, \mathcal{L}_{\text{div}}$$

where $\lambda_{1}$ and $\lambda_{2}$ control the weights of the JMMD and diversity losses during training. By dynamically adjusting these weights, a balance between salient features and sample diversity can be achieved during the generation process, ensuring that the generated fraud samples are both representative and diverse.
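The following sketch shows one generator update combining the three terms, reusing the `mmd_loss` sketch above. The non-saturating adversarial form, the mode-seeking surrogate used in place of the mutual-information diversity term, and the default weights are all assumptions made for illustration.

```python
import torch

def generator_step(G, D, real, labels, z, opt_g,
                   lambda_jmmd: float = 1.0, lambda_div: float = 0.5) -> float:
    # Assumes D outputs probabilities in (0, 1) and the batch size is even.
    fake = G(z, labels)                                   # conditional generation
    adv = -torch.log(D(fake) + 1e-8).mean()               # non-saturating GAN term
    jmmd = mmd_loss(real, fake)                           # joint distribution alignment
    z1, z2 = z.chunk(2)                                   # paired noise halves
    f1, f2 = fake.chunk(2)
    # Mode-seeking surrogate for maximizing I(z; G(z)): reward output spread.
    div = -(f1 - f2).abs().mean() / ((z1 - z2).abs().mean() + 1e-8)
    loss = adv + lambda_jmmd * jmmd + lambda_div * div
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
    return loss.item()
```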
3.2.3. Stage 3: Spatial Configuration Learning—Achieving Deep Diagnosis of Toxic Configurations
This stage aims to capture higher-order spatial configuration patterns among multimodal features through a feature-topology mapping strategy and a 2D-CNN detector, completing the transition from vector-based relationship analysis to spatial relationship diagnosis.
- (1)
Feature Ordering Principles and Ordered Matrix Transformation (Feature Image Mapping)
This step is not a simple reshaping of dimensions but a theory-driven process that explicitly encodes abstract signal configurations into spatial neighborhood patterns. Before reshaping the trimodal feature vector (total dimension 925) into an $n \times n$ two-dimensional feature matrix $M$, we innovatively follow the principle of feature interaction efficiency for logical group ordering:
Intra-modal cohesion: Within high-dimensional, homogeneous feature blocks (e.g., the 134-dimensional financial-ratio block), we perform proximity sorting based on the correlation between features. This ensures that strongly related financial indicators are physically adjacent in the matrix, maximizing the feature extraction efficiency of the 2D-CNN’s local receptive field.
Inter-modal juxtaposition: Features from different modalities that have a strong theoretical interaction logic (e.g., financial ratios representing objective outcomes and sentiment tones reflecting subjective intent) are placed closely together at the block boundaries of the matrix.
Spatial aggregation: For low-dimensional but high-value features (e.g., the 7-dimensional tone features), we place them centrally in the matrix to artificially enhance their signal density within the local receptive field, preventing them from being diluted by high-dimensional features.
The final ordering strategy is [Structural-Fin (134)] → [Tone (7)] → [Structural-NonFin (16)] → [Semantic (768)]. All feature values are normalized and then converted to a pixel intensity scale to serve as a single-channel input for the 2D-CNN. The brightness for missing values or pixels with a zero denominator is set to a fixed default intensity.
This ordered topological mapping successfully transforms an abstract, higher-order fraud signal configuration (e.g., the synergy between high financial leverage and overly optimistic management rhetoric) into a concrete spatial neighborhood texture that a 2D-CNN can directly recognize.
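The sketch below illustrates this mapping under stated assumptions: the ordered 925-dimensional vector produced by the block layout above, a 31 × 31 matrix (the smallest square holding 925 values), and zero intensity for padded or missing entries.

```python
import numpy as np

def to_feature_image(v: np.ndarray, side: int = 31) -> np.ndarray:
    # v: ordered multimodal vector, already min-max normalized to [0, 1].
    img = np.zeros(side * side, dtype=np.float32)   # padding/missing stay at 0
    img[: v.size] = np.clip(v, 0.0, 1.0)            # pixel intensities
    return img.reshape(side, side)                  # single-channel 2D input
```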
- (2)
2D-CNN Detector and Deep Spatial Fusion
This stage is the core detection phase for deeply deciphering higher-order interaction patterns. The feature matrix $M$ generated in the previous step is used as a single-channel input for the 2D-CNN. The convolutional kernels of the 2D-CNN slide over $M$ with a two-dimensional receptive field, simultaneously aggregating cross-modal feature information from adjacent horizontal and vertical positions. This convolution operation can be formally expressed as:

$$a_{i,j}^{(l)} = f\left( \sum_{u=0}^{U-1} \sum_{v=0}^{V-1} w_{u,v}^{(l)}\, a_{i+u,\, j+v}^{(l-1)} + b^{(l)} \right)$$

where $a_{i,j}^{(l)}$ is the output feature map element at position $(i, j)$ in layer $l$, $w^{(l)}$ is the convolutional kernel, $a^{(l-1)}$ is the input feature map from the previous layer (which is $M$ when $l = 1$), $U$ and $V$ are the dimensions of the kernel, $f$ is the activation function, and $b^{(l)}$ is the bias term.
Since the features have been ordered according to interaction efficiency, the 2D-CNN can efficiently capture the toxic configurations encoded as spatial neighborhood patterns (such as the spatial adjacency of optimistic distortion in MD&A text, low governance levels, and high financial leverage). This achieves a deep spatial fusion of heterogeneous multimodal features, ultimately leading to a high-confidence fraud classification.
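A minimal PyTorch sketch of such a detector appears below; the channel widths, depth, and 3 × 3 kernels are illustrative choices, not the configuration reported in Table 4.

```python
import torch
import torch.nn as nn

class FraudCNN(nn.Module):
    def __init__(self, side: int = 31):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        flat = 64 * (side // 4) ** 2                # spatial size after two 2x pools
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(flat, 128),
                                  nn.ReLU(), nn.Linear(128, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, side, side) feature image; returns fraud logits.
        return self.head(self.features(x))
```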
In summary, SMOTE-FraudGAN (SFG) and 2D-CNN together form an end-to-end system framework (SFG-2DCNN) that addresses the two core pain points of financial fraud identification. It aims to capture the synergistic non-linear relationship between deteriorating financial conditions (Structural-Fin), internal and external environmental pressures (Structural-NonFin), and overly optimistic management/risk concealment (Semantic/Tone).
4. Experimental Design and Results Analysis
This section aims to conduct a comprehensive evaluation of the proposed multimodal intelligent diagnostic framework (SFG-2DCNN) through a rigorous and reproducible empirical design. The core objective of the experimental design is not only to demonstrate the framework’s performance advantage but also to achieve a holistic validation of the contribution of each innovative component and, ultimately, to verify its diagnostic robustness in the complex and dynamic capital market environment.
4.1. Experimental Setup
4.1.1. Data Sources, Sample Definition, and Preprocessing
This study draws on data for Chinese A-share listed companies from 2001 to 2021. The China Securities Regulatory Commission (CSRC) operates a stringent administrative enforcement regime that generates explicit, regulator-verified fraud labels for misconduct ranging from fictitious reporting to material omissions. Enforcement records produced under this regime form a large-scale, well-annotated fraud dataset that is well suited for training data-intensive deep learning models.
Structured data (financial, governance, economic environment, etc.) were sourced from the China Stock Market & Accounting Research (CSMAR) database, while unstructured text data (MD&A) were obtained from the China Research Data Service Platform (CNRDS). Following domain conventions, this study excluded companies in the financial industry and samples with missing key features.
Governance failure (fraud) sample definition: Strictly defined as company-year observations marked in the CSMAR violations table for core financial fraud behaviors (including “fictitious profits,” “inflated assets,” “false records,” “delayed disclosure,” “material omissions,” “misleading disclosures,” and “fraudulent IPOs”).
Control (non-fraud) sample definition: Samples with non-financial violations, abnormal audit opinions, or risk warnings (ST/*ST) were excluded to ensure a clean negative class.
Innovative handling of gray samples: To address the potential contamination of false negatives in the training data, this study introduces a dual cross-validation mechanism using Benford’s Law and Isolation Forest to screen for high-risk observations in the negative class, defining them as Gray Samples. Unlike traditional deletion strategies, this study adopts an innovative approach of marking and retaining the identified gray samples. This allows the subsequent generative model to perceive these ambiguous areas on the boundary between safety and violation, thereby enhancing the model’s ability to recognize the full spectrum of governance risks and improving its overall robustness.
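As a sketch of this dual screen, the following code flags negative-class observations as gray only when both a Benford first-digit deviation test and an Isolation Forest agree; the deviation statistic and its cut-off are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

BENFORD = np.log10(1 + 1 / np.arange(1, 10))        # expected first-digit frequencies

def benford_deviation(values: np.ndarray) -> float:
    # L1 distance between observed and Benford first-digit distributions.
    digits = [int(f"{abs(v):e}"[0]) for v in values if v != 0]
    observed = np.bincount(digits, minlength=10)[1:] / max(len(digits), 1)
    return float(np.abs(observed - BENFORD).sum())

def flag_gray(line_items: list[np.ndarray], X_neg: np.ndarray,
              benford_cut: float = 0.3) -> np.ndarray:
    iso = IsolationForest(random_state=42).fit(X_neg)
    iso_flag = iso.predict(X_neg) == -1             # -1 marks anomalies
    ben_flag = np.array([benford_deviation(v) > benford_cut for v in line_items])
    return iso_flag & ben_flag                      # gray = flagged by both screens
```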
After screening, a total of 5732 fraud samples and 45,780 non-fraud samples were obtained. A 5-fold cross-validation method was used to divide the dataset into training and testing sets. All sample balancing techniques were applied only to the training set, while the test set maintained its original class distribution for a fair and unbiased final evaluation of the downstream detection models.
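The partitioning protocol can be sketched as follows, where `balance` stands in for the SMOTE-FraudGAN augmentation and `train_and_eval` for detector training; both callables are placeholders.

```python
from sklearn.model_selection import StratifiedKFold

def run_cv(X, y, balance, train_and_eval, n_splits: int = 5) -> list[float]:
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for tr, te in skf.split(X, y):
        X_bal, y_bal = balance(X[tr], y[tr])   # augmentation on the training fold only
        scores.append(train_and_eval(X_bal, y_bal, X[te], y[te]))  # untouched test fold
    return scores
```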
4.1.2. Experimental Environment and Model Configuration
All experiments were conducted in an Ubuntu 22.04 LTS environment, using Python 3.10.13 and PyTorch 2.2.1 (CUDA 12.1, cuDNN 9) as the core deep learning framework, accelerated on a single NVIDIA GeForce RTX 4060 Ti 16 GB GPU. Auxiliary tools such as scikit-learn 1.4.2, imbalanced-learn 0.12.2, Transformers 4.41.2, and XGBoost 2.0.3 were used for the baseline model construction, feature engineering, and evaluation.
SMOTE-FraudGAN aims to provide a stable training environment for Joint-FraudGAN (JFGAN) through SMOTE pre-smoothing, ensuring that the generated samples are domain-logically self-consistent and diverse. Key parameter settings are shown in Table 3.
The 2D-CNN detector is designed to decipher higher-order fraud configurations by learning from the spatial topological structure. Key parameter settings are shown in Table 4.
4.1.3. Evaluation Metrics
Considering that in governance failure diagnostics, the cost of false negatives is extremely high and poses a serious threat to market sustainability, we adopt a multi-dimensional evaluation system. The evaluation focuses on two primary metrics, F1-score and the Area Under the ROC Curve (AUC). The F1-score is the harmonic mean of precision and recall and provides a comprehensive assessment of performance on the minority fraud class. AUC is a robust measure of a classifier’s overall discrimination ability across all possible thresholds and is insensitive to class imbalance.
In addition, three secondary metrics are reported to provide a more complete picture of classification performance. Precision measures the proportion of true positives among all observations predicted as fraud, and higher precision indicates a lower false-alarm rate. Recall measures the proportion of actual fraud cases that are correctly identified. Accuracy measures the overall correctness of classification across both fraud and non-fraud samples.
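These five metrics can be computed with scikit-learn as in the sketch below; the 0.5 decision threshold is an illustrative default.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def evaluate(y_true, y_prob, threshold: float = 0.5) -> dict:
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "AUC": roc_auc_score(y_true, y_prob),       # threshold-free discrimination
        "F1": f1_score(y_true, y_pred),             # harmonic mean on the fraud class
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "Accuracy": accuracy_score(y_true, y_pred),
    }
```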
4.1.4. Baseline Models and Experimental Groups
To systematically validate the performance of our framework and precisely quantify the independent contributions of each innovative component, this study designs an experimental matrix consisting of four major groups.
Proposed full model: The complete SFG-2DCNN framework, which uses SMOTE-FraudGAN for sample balancing and a 2D-CNN based on ordered feature-topology mapping for deep spatial fusion detection.
Group 1. Traditional and single-modality baselines: This group aims to reproduce the performance limits of traditional audit and single-modality models, demonstrating the necessity of multimodal and complex models. A series of baseline models were trained using only structured features, including Z-Score (Altman), an authoritative domain baseline; Logistic Regression (LR-Structural), a linear statistical model baseline; Support Vector Machine (SVM-Structural), a strong non-linear baseline in traditional machine learning; XGBoost (XGB-Structural), a strong baseline for ensemble learning on structured data; and Multilayer Perceptron (MLP-Structural), a basic deep learning baseline.
Group 2. Imbalanced handling baselines: This group aims to validate the superiority of SMOTE-FraudGAN. While uniformly using a 2D-CNN detector, the effects of different sample balancing strategies were compared, including no balancing (Unbalanced-2DCNN), traditional SMOTE (SMOTE-2DCNN), and a basic GAN (BasicGAN-2DCNN).
Group 3. Detecting architecture baselines: This group aims to validate the effectiveness of the 2D-CNN’s deep spatial fusion. Using the high-quality data generated by SMOTE-FraudGAN, the performance of different detectors was compared, including SVM (SFG-SVM), XGBoost (SFG-XGB), MLP (SFG-MLP), and 1D-CNN (SFG-1DCNN).
Group 4. Ablation study variants: This group aims to precisely quantify the independent contributions of the key innovative components in the full model by systematically removing them. The specific settings are shown in Table 5.
4.2. Experimental Results and Analysis
4.2.1. Qualitative Analysis of Generated Sample Quality
This section aims to visually validate the effectiveness of SMOTE-FraudGAN in addressing the Signal Logicalization challenge, specifically in ensuring the cross-modal joint distribution consistency of the generated samples.
First, the model’s training process demonstrated high stability. The loss function curves (Figure 3) converge smoothly, indicating that the generator and discriminator successfully reached a Nash equilibrium in their dynamic game, effectively avoiding common pitfalls in GAN training such as mode collapse. This provides a necessary foundation for generating high-quality samples that are both authentic and diverse.
More importantly, the superiority of the proposed framework is highlighted in the comparison of distribution fidelity (Figure 4). Unlike the significant distributional mismatch produced by a domain-agnostic general-purpose GAN, SMOTE-FraudGAN successfully achieves a precise fit to the true fraud data manifold by using the JMMD loss as a key cross-modal regularization constraint. This mechanism forces the generator to learn and reproduce the true statistical dependencies of heterogeneous features (such as high financial leverage and abnormally optimistic text) in the high-dimensional space. This precise joint distribution alignment ensures that the synthetic samples are highly self-consistent in terms of economic logic, fundamentally overcoming the Signal Logicalization challenge and providing a solid and reliable data foundation for the downstream detector.
4.2.2. Overall Performance Comparison: Validating the Model’s Superiority
This section comprehensively validates the performance superiority of the proposed framework by systematically comparing the Full-Model (SFG-2DCNN) with three major groups of baseline models (Groups 1, 2, and 3).
As shown in Table 6, the proposed SFG-2DCNN achieves the best performance across all key metrics, with an AUC of 0.942 and an F1-score of 0.917. This result not only significantly surpasses all baseline models, breaking through the performance bottleneck of existing detection paradigms, but also provides initial confirmation of the framework’s success in systematically addressing the two deeply coupled challenges of Signal Logicalization and Configuration Spatialization.
Analysis 1: The Necessity of Multimodal Configurational Perspective (Comparison with Group 1)
This experiment aims to validate the limitations of traditional detection paradigms that rely solely on structured data. The results show that even the best-performing baseline in Group 1, XGB-Structural, exhibits a large performance gap relative to the full model. This significant difference strongly supports the multimodal configurational thinking advocated in this study, proving that relying only on objective, easily manipulated structured data cannot fully capture the entire picture of fraudulent behavior. The integration of the company’s internal and external environment (non-financial ratios) and subjective textual information (MD&A sentiment, higher-order semantics) is crucial for identifying covert toxic configurations.
Analysis 2: The Value of Joint Distribution Alignment (Comparison with Group 2)
This experiment aims to validate the superiority of the SMOTE-FraudGAN augmentation strategy while keeping the detector consistent. The traditional linear interpolation method, SMOTE-2DCNN, is limited in performance because it cannot capture the highly non-linear characteristics of fraudulent behavior. The general-purpose, domain-agnostic BasicGAN-2DCNN, although better than SMOTE, fails to achieve optimal performance due to a lack of cross-modal consistency constraints. The superiority of SFG-2DCNN is directly attributable to SMOTE-FraudGAN’s success in resolving the dilemma between diversity and domain-logical consistency in generated samples. Its core JMMD joint constraint forces the generated samples to reproduce the true cross-domain feature joint distribution, thereby providing high-quality, logically self-consistent training data for the downstream detector, which is key to overcoming the Signal Logicalization challenge.
Analysis 3: The Effectiveness of 2D-CNN Spatial Fusion (Comparison with Group 3)
This experiment aims to validate the superiority of the 2D-CNN architecture in capturing spatial configurations, while keeping the input data quality consistent. The performance of SFG-2DCNN is significantly better than that of the state-of-the-art model for tabular data, SFG-XGB, and the sequential model, SFG-1DCNN. This comparison directly validates the effectiveness of the proposed feature-topology mapping strategy. By recasting vectorwise feature relations into an ordered spatial layout, the 2D-CNN leverages its two-dimensional receptive field to efficiently capture toxic configurations expressed as local neighborhood patterns. This establishes the 2D-CNN as the preferred detector for deep spatial fusion of heterogeneous multimodal features and constitutes the key algorithmic realization of the Configuration Spatialization breakthrough.
4.2.3. Ablation Study: Validating the Model’s Innovations
This section aims to precisely quantify the independent contribution of each core innovative component of the SFG-2DCNN to its final performance by systematically removing them, thereby achieving a comprehensive, component-level validation of the proposed methodology.
Group A Ablation: Validation of the Domain-Adaptive Generative Network (SMOTE-FraudGAN)
This group of experiments validates the effectiveness of the key constraints and strategies in the SMOTE-FraudGAN architecture (such as the JMMD loss, the diversity loss, and the multi-generator divide-and-conquer strategy) in addressing the Signal Logicalization challenge.
The results of the Group A ablation study (Table 7) provide causal validation for the internal mechanisms of SMOTE-FraudGAN. Removing the JMMD loss, the core constraint (V-w/o JMMD), led to the most severe performance degradation, confirming that cross-modal feature joint distribution alignment is the fundamental driver for generating high-quality samples. Furthermore, using a single generator (V-Single-G) and removing the SMOTE pre-smoothing stage (V-w/o SMOTE) both resulted in significant performance drops, validating the key contributions of the divide-and-conquer strategy for stably learning heterogeneous feature distributions and of the hybrid mechanism for providing an optimized initial distribution for stable GAN training on a sparse manifold.
Group B Ablation: Validation of Spatial Configuration Learning and Feature Ordering Principles
This group of experiments aims to validate the contribution of the feature ordering principle, based on economic theory, in the feature matrix transformation strategy for tackling the Configuration Spatialization challenge.
As shown in Table 8, randomly shuffling the feature order (V-Shuffled) led to a significant drop in model performance, providing direct causal evidence for the criticality of the feature spatial layout. This result proves that the superiority of the 2D-CNN is not coincidental but strictly dependent on the feature-topology mapping strategy. When this strategy is disrupted, adjacent pixels in the matrix lose their semantic correlation, and the spatial neighborhood texture is destroyed, thus confirming that this mapping is the core mechanism for encoding abstract toxic configurations into patterns recognizable by the 2D-CNN.
Group C Ablation: Analysis of Multimodal Feature Gains
This group of experiments quantifies the performance gain of multimodal fusion relative to single modalities and analyzes the independent contributions of different feature domains.
The results of the Group C ablation study (Table 9) confirm the indispensability of multimodal fusion. Relying on a single modality (V-Structural Only or V-Textual Only) led to substantial performance degradation, proving that the synergistic predictive power of fusing objective results with subjective intent is crucial. Within the textual domain, higher-order semantic features (V-w/o Semantic) and sentiment tone features (V-w/o Tone) also demonstrated unique complementary value: the former provided higher-information-density risk signals, while the latter served as an interpretable probe for affective dissonance, offering indispensable supplementary information.
4.2.4. Sensitivity and Robustness Analysis: Exploring the Credibility and Boundaries of Core Conclusions
- (1)
Sensitivity Analysis: Exploring the Optimal Boundaries of Core Hyperparameters
This section aims to validate the reasonableness and optimality of the model’s key hyperparameter choices.
Analysis of the number of structured features (K value): This experiment systematically tested the performance changes in the downstream 2D-CNN when different numbers of features K were retained after XGBoost gain ranking in the structured feature selection process. As shown in Figure 5, the model’s AUC shows a trend of rapid increase followed by a plateau as K increases, reaching an optimal balance at an intermediate value of K. When K increases further, performance slightly decreases, indicating that redundant or noisy features have been introduced. This result demonstrates that the dual screening through point-biserial correlation and XGBoost gain is effective.
Figure 5. Sensitivity Analysis of the Number of Structured Features.
Balance analysis of SMOTE-FraudGAN loss weights: This experiment used a grid search strategy to systematically adjust the combination of the joint consistency loss ($\mathcal{L}_{\text{JMMD}}$, weight $\lambda_1$) and the diversity loss ($\mathcal{L}_{\text{div}}$, weight $\lambda_2$) and evaluated the final performance of the downstream 2D-CNN. As shown in Figure 6, too low a $\lambda_1$ leads to domain-logically inconsistent generated samples, while too high a $\lambda_2$ causes the generated samples to deviate from the true data manifold, reducing their realism. The selected weight combination was confirmed to accurately calibrate the core dilemma of “realism vs. diversity,” ensuring the high quality of the generated samples.
Figure 6. Sensitivity Analysis of SMOTE-FraudGAN Loss Weights.
Analysis of convolutional kernel size: As shown in Table 10, this experiment examined the impact of different convolutional kernel sizes on the ability to capture spatial configurations, finding that the model performed best with an intermediate kernel size. Kernels that are too small cannot cover sufficient cross-modal boundary information, leading to inadequate fusion, while oversized kernels are prone to capturing irrelevant long-distance features, introducing noise.
Table 10. Sensitivity Analysis of Convolutional Kernel Size.
| Kernel Size | AUC | F1 |
|---|---|---|
|  | 0.880 | 0.875 |
|  | 0.942 | 0.917 |
|  | 0.930 | 0.905 |
|  | 0.915 | 0.895 |
- (2)
Robustness Analysis: Evaluating Key Design Choices and Generalization Ability
This section evaluates the robustness of the core findings by varying key assumptions and data-partitioning schemes.
Gray sample handling strategy: This experiment added two different sample handling methods, Robust-Naive (not identifying gray samples) and Robust-DeleteGray (completely removing gray samples), to validate the effectiveness of the proposed strategy of marking gray samples. The experimental results in Table 11 indicate that Robust-Naive performed the worst, suggesting that false-negative samples severely contaminate the negative-class data. Although Robust-DeleteGray performed better than the former, it was significantly inferior to the Full Model. This result confirms that the innovative “mark and retain” gray-sample strategy allows SMOTE-FraudGAN to perceive high-risk samples at the edge of abnormality, thereby improving the model’s ability to recognize complex decision boundaries.
Table 11. Robustness Analysis of Gray Sample Handling Strategies.
| Handling Strategy | AUC | F1 |
|---|---|---|
| Robust-Naive | 0.865 | 0.850 |
| Robust-DeleteGray | 0.895 | 0.875 |
| Full Model | 0.942 | 0.917 |
Robustness test of temporal stability: This experiment used a rolling time window approach to verify the temporal stability of our model. As shown in Table 12, compared to XGBoost, the average performance of SFG-2DCNN consistently remained at a high level with an extremely low standard deviation. This strongly demonstrates that the SFG-2DCNN model has learned the deep, essential configurational patterns of fraudulent behavior, rather than superficial correlations of specific historical periods, and possesses high temporal stability and generalization ability.
Table 12. Robustness Analysis of Temporal Stability.
| Training Window | Test Window | SFG-2DCNN (AUC) | XGB-Structural (AUC) |
|---|---|---|---|
| 2001–2011 | 2012–2013 | 0.885 | 0.745 |
| 2001–2013 | 2014–2015 | 0.879 | 0.690 |
| 2001–2015 | 2016–2017 | 0.881 | 0.730 |
| 2001–2017 | 2018–2019 | 0.868 | 0.720 |
| 2001–2019 | 2020–2021 | 0.862 | 0.685 |
| Mean |  | 0.875 | 0.714 |
| Std. Dev. |  | ±0.010 | ±0.026 |
Robustness test of data partitioning method (K-fold cross-validation): This experiment validated the stability of our model through unified experiments under different data-partitioning strategies, including 5-fold, 10-fold, and 20-fold cross-validation and a traditional single random split. As shown in Table 13, the proposed framework’s average performance was highly consistent across all cross-validation settings, with an extremely low standard deviation (≤0.005), strongly demonstrating the model’s high robustness to data partitioning and validating the stability and credibility of our experimental conclusions.
4.2.5. Visual Interpretation of Toxic Configurations
To visually validate toxic configurations, we applied Gradient-weighted Class Activation Mapping (Grad-CAM) to a randomly selected fraud sample. As shown in Figure 7, the heatmap reveals a hierarchical and distributed attention structure: the primary hotspot (red zone) concentrates at the interface of financial pressure and textual sentiment, confirming the detection of spatial synergies between financial distress and affective dissonance. Simultaneously, a secondary cluster (orange) and scattered points capture MD&A anomalies and isolated red flags. This indicates that SFG-2DCNN integrates holistic configurations with granular signals for robust diagnosis.
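A minimal Grad-CAM sketch for a detector like the `FraudCNN` example in Section 3 is shown below; the hooked layer and the normalization are assumptions, not the paper's exact visualization pipeline.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image: torch.Tensor) -> torch.Tensor:
    feats, grads = [], []
    layer = model.features[-3]                      # last Conv2d in the sketch model
    h1 = layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    logit = model(image.unsqueeze(0))               # (1, side, side) -> (1, 1)
    model.zero_grad()
    logit.sum().backward()
    h1.remove(); h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)  # channel-wise gradient pooling
    cam = F.relu((weights * feats[0]).sum(dim=1))      # weighted activation map
    return cam / (cam.max() + 1e-8)                    # normalized heatmap
```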
4.3. Experimental Conclusion and Discussion
Overall, the empirical results in Section 4 provide convergent evidence that the proposed SFG-2DCNN framework effectively addresses the coupled challenges of Signal Logicalization and Configuration Spatialization.
First, the comparative experiments establish the external performance advantage of SFG-2DCNN. Its best results (AUC = 0.942, F1 = 0.917) clearly exceed all baseline models, indicating that relative to traditional sample-balancing methods (e.g., SMOTE, GAN) and vector-based detectors (e.g., XGBoost, 1D-CNN), the integrated design of sample generation and multimodal spatial fusion is essential for achieving a substantial improvement in detection accuracy. This finding is consistent with a configurational view of fraud and suggests that capturing higher-order toxic configurations through spatial fusion is more effective than assessing marginal variable contributions.
Second, the ablation studies provide strong support for the internal mechanisms of the framework. Removing the JMMD-based joint constraint produces the largest performance decline, highlighting the central role of domain-adaptive sample generation in addressing the Signal Logicalization challenge. Disrupting the spatial feature layout also leads to a marked degradation in performance, which confirms the importance of spatial configuration learning for tackling the Configuration Spatialization challenge. In contrast to general-purpose generative models, these results underline that enforcing intrinsic economic logic through joint distribution alignment is critical for synthesizing valid financial data.
Finally, the sensitivity and robustness analyses show that the framework is relatively insensitive to key hyperparameters and remains stable under temporal and cross-validation tests. This stability indicates that spatial configuration learning captures robust structural patterns rather than transient correlations and offers an advantage over shallow fusion strategies that are prone to overfitting.
In summary, this framework, through the two pillars of joint distribution alignment and spatial configuration learning, successfully overcomes the core challenges of data scarcity and signal covertness, and its conclusions are highly credible.