This section outlines the experimental framework, including the setup and analysis of results. The experiments are designed to investigate the following research questions:
5.2. Baseline Models
To ensure fair and reproducible comparisons, all baseline models were implemented with comparable network capacities and a consistent input dimensionality of 6 features per transaction after preprocessing, so that the quantum and classical models are compared on the same input basis. The selected hyperparameters balance sufficient model capacity to capture complex patterns against the risk of overfitting, consistent with prior studies on fraud detection and quantum–classical hybrid architectures.
Convolutional Neural Network (CNN): A single one-dimensional convolutional layer with 32 filters (kernel size = 3), followed by a pooling layer and a fully connected output layer. This architecture effectively captures local temporal patterns in transaction sequences, making it suitable for detecting short-term anomalies, but it lacks the ability to model long-range sequential dependencies.
Recurrent Neural Network (RNN): A single recurrent layer with 64 hidden units and tanh activation, followed by a dense output layer. RNNs can model sequential dependencies and temporal evolution in transaction data, offering stronger context modeling than CNNs, though they are susceptible to vanishing gradients and have limited long-range memory.
Quantum Neural Network (QNN): A two-layer variational quantum circuit (VQC), where each layer consists of parameterized rotation gates (Ry and Rz) and entanglement operations implemented as a ring topology of nearest-neighbor CNOTs to ensure scalable inter-qubit correlations. The number of qubits N matches the input feature dimension (N = 6). A classical dense layer follows after quantum expectation values are measured. While effective for nonlinear transformation in low-dimensional feature spaces, QNNs lack temporal modeling capacity.
Quantum Recurrent Neural Network (QRNN): A two-layer VQC (6 qubits, ring-topology nearest-neighbor CNOT entanglement) for quantum encoding, followed by a classical RNN layer with 64 hidden units for sequential modeling. This hybrid architecture captures both high-dimensional quantum features and dynamic behavioral patterns over time, offering stronger temporal modeling than QNN alone.
Hybrid Quantum Recurrent Neural Network (HQRNN): A baseline hybrid model connecting a fixed-topology VQC (6 qubits) to a shallow single-layer RNN with 64 hidden units. Due to structural decoupling—where the RNN does not adapt to the quantum encoding—and limited depth, it has reduced capability in detecting subtle or long-range fraud patterns.
HQRNN-FD (Proposed Model): An enhanced hybrid quantum–classical architecture that addresses HQRNN’s limitations by aligning VQC depth with the recurrent architecture. The VQC uses 6 qubits to match the input feature dimension, while the RNN component comprises multiple recurrent blocks adaptively configured to the quantum topology. A self-attention module is applied to focus on salient transaction segments, improving discrimination of minority-class fraudulent behavior.
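To make the hybrid pipeline concrete, a minimal sketch of the proposed architecture is given below. It assumes PennyLane and PyTorch as the implementation stack (not specified in the text), realizes the multiple recurrent blocks as a two-layer RNN, and stands in a simple additive attention over time steps for the self-attention module; the class and variable names are illustrative only.

```python
import torch
import torch.nn as nn
import pennylane as qml

N_QUBITS, N_VQC_LAYERS, HIDDEN = 6, 2, 64
dev = qml.device("default.qubit", wires=N_QUBITS)

@qml.qnode(dev, interface="torch")
def quantum_encoder(inputs, weights):
    # Angle-encode the 6 preprocessed transaction features onto 6 qubits.
    qml.AngleEmbedding(inputs, wires=range(N_QUBITS), rotation="Y")
    # Two variational layers: Ry/Rz rotations followed by ring-topology CNOT entanglement.
    for layer in range(N_VQC_LAYERS):
        for q in range(N_QUBITS):
            qml.RY(weights[layer, q, 0], wires=q)
            qml.RZ(weights[layer, q, 1], wires=q)
        for q in range(N_QUBITS):
            qml.CNOT(wires=[q, (q + 1) % N_QUBITS])  # nearest-neighbor ring
    # Per-qubit expectation values feed the classical recurrent stage.
    return [qml.expval(qml.PauliZ(q)) for q in range(N_QUBITS)]

class HQRNNFD(nn.Module):
    """Illustrative sketch of the hybrid pipeline: VQC -> RNN -> attention -> dense head."""
    def __init__(self):
        super().__init__()
        weight_shapes = {"weights": (N_VQC_LAYERS, N_QUBITS, 2)}
        self.vqc = qml.qnn.TorchLayer(quantum_encoder, weight_shapes)
        # "Multiple recurrent blocks" realized here as a two-layer RNN (assumed depth).
        self.rnn = nn.RNN(N_QUBITS, HIDDEN, num_layers=2,
                          nonlinearity="tanh", batch_first=True)
        self.attn = nn.Linear(HIDDEN, 1)  # simplified additive attention scores
        self.head = nn.Linear(HIDDEN, 1)

    def forward(self, x):                 # x: (batch, seq_len, 6)
        b, t, f = x.shape
        q = self.vqc(x.reshape(b * t, f)).to(x.dtype).reshape(b, t, -1)  # per-step quantum features
        h, _ = self.rnn(q)                                               # temporal modeling
        w = torch.softmax(self.attn(h), dim=1)                           # attend over time steps
        ctx = (w * h).sum(dim=1)                                         # salient-segment summary
        return torch.sigmoid(self.head(ctx))                             # fraud probability
```

In this reading, the VQC is evaluated once per transaction in a sequence, so its expectation values act as a quantum-derived feature map that the recurrent and attention stages then model over time.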
For all RNN-, QRNN-, HQRNN-, CNN-, and QNN-based models, Binary Cross-Entropy loss was used with a learning rate of 0.001 and 64 training epochs. For RNN-family models (RNN, QRNN, HQRNN), the hidden unit size was fixed at 64, with a single recurrent layer,
tanh activation, and a linear prediction output layer. Both QRNN and HQRNN employ a two-layer VQC with 6 qubits and ring-topology entanglement (nearest-neighbor CNOTs) before the recurrent layer, consistent with their architectural descriptions in
Section 5.2. For non-recurrent baselines, two widely adopted neural architectures were used: CNN and QNN. The CNN consists of a single one-dimensional convolutional layer with 32 filters (kernel size = 3), followed by pooling and a fully connected output layer. The QNN follows the structure in [34], using 6 input qubits to match the data dimensionality. These unified hyperparameter and training settings ensure strict fairness in the comparison across all baseline models.
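The shared training protocol can be expressed as a short sketch that reuses the HQRNNFD class from the sketch above. The Binary Cross-Entropy loss, learning rate of 0.001, and 64 epochs are taken from the text, while the Adam optimizer, batch size, sequence length, and the synthetic stand-in data are assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Settings stated in the text; optimizer, batch size, and sequence length are assumptions.
EPOCHS, LEARNING_RATE, BATCH_SIZE = 64, 1e-3, 128

# Synthetic stand-in data: sequences of 10 transactions with 6 features each, ~5% fraud.
X = torch.rand(1024, 10, 6)
y = (torch.rand(1024) < 0.05).float()
train_loader = DataLoader(TensorDataset(X, y), batch_size=BATCH_SIZE, shuffle=True)

model = HQRNNFD()                      # hybrid sketch from Section 5.2; any baseline fits
criterion = nn.BCELoss()               # Binary Cross-Entropy, as stated
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

for epoch in range(EPOCHS):
    for x_batch, y_batch in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x_batch).squeeze(-1), y_batch)
        loss.backward()
        optimizer.step()
```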
5.3. Experimental Results
Comprehensive Performance Comparison: Financial fraud detection involves a pronounced class imbalance, with legitimate transactions making up the vast majority and fraudulent transactions comprising less than 5% of the data. The key challenge is to improve the model’s capacity to accurately detect the minority class of high-risk fraudulent instances while minimizing false positives (FPs) among legitimate transactions, thereby ensuring dependable and efficient detection performance.
5.3.1. Confusion Matrix Analysis
Variation in classification error distributions between normal and fraudulent transactions is clearly observed across different models (Table 5). For instance, the CNN model recorded 19,711 FPs, markedly higher than those produced by the RNN (13,597) and HQRNN-FD (8110). Elevated FP rates can result in excessive rejection of legitimate transactions, leading to increased user dissatisfaction and higher manual verification costs.
In contrast, HQRNN-FD achieves a lower FP count while maintaining a manageable number of false negatives (FN) for fraudulent cases at 1891. Although the original HQRNN demonstrates strong recall (FN = 1777), its higher FP count indicates reduced precision in detecting normal transactions.
While QNN and QRNN show moderate effectiveness in minimizing FN (1800 and 2446, respectively), they exhibit relatively high FP values (15,450 for QNN and 11,399 for QRNN). In particular, QRNN prioritizes recall at the expense of precision.
Overall, HQRNN-FD achieves the most favorable balance of FP and FN counts, effectively minimizing false alerts while preserving high detection sensitivity (Figure 10).
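The FP and FN counts discussed above correspond to the off-diagonal entries of each model's binary confusion matrix. A minimal sketch using scikit-learn (an assumed tool, with toy labels in place of the real test set) shows how these entries are extracted.

```python
from sklearn.metrics import confusion_matrix

# Toy labels and predictions standing in for a model's test-set output.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]

# For binary labels, ravel() yields the four cells in (tn, fp, fn, tp) order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"FP = {fp} (legitimate flagged as fraud), FN = {fn} (fraud missed)")
```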
5.3.2. Model Metrics Analysis
In terms of classification metrics, HQRNN-FD attained the highest F1-score of 0.9009 for the fraud class (Class 1), surpassing all baseline models. Compared with the original HQRNN, it showed a notable improvement in precision, from 0.7943 to 0.8486, and an approximate 3% increase in F1-score, highlighting the effectiveness of integrating quantum encoding with attention-based mechanisms.
Moreover, HQRNN-FD achieved superior performance in both macro and weighted average F1-scores, recording values of 0.9421 and 0.9722, respectively, which reflects strong performance across all classes. In comparison, conventional models, such as CNN and DNN, showed reasonable results on normal transactions but performed poorly on fraudulent cases, with F1-scores of 0.8037 and 0.8294, respectively, indicating reduced accuracy and a higher misclassification risk.
While QNN and QRNN leverage quantum-based components, their effectiveness is limited by shallow circuit architectures, resulting in overall F1-scores of 0.8408 and 0.8665, both of which remain lower than that of HQRNN-FD.
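The macro and weighted averages reported above aggregate per-class F1-scores differently: the macro average weights both classes equally, whereas the weighted average scales each class by its support, so a model can post a high weighted F1 while still struggling on the rare fraud class. A short scikit-learn sketch with toy, imbalanced labels illustrates the distinction.

```python
from sklearn.metrics import f1_score

# Toy, heavily imbalanced labels: 95 legitimate vs. 5 fraudulent transactions.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 93 + [1] * 2 + [1] * 4 + [0]   # 2 false positives, 1 false negative

print("fraud-class F1:", f1_score(y_true, y_pred, pos_label=1))
print("macro-average F1:", f1_score(y_true, y_pred, average="macro"))        # classes weighted equally
print("weighted-average F1:", f1_score(y_true, y_pred, average="weighted"))  # weighted by class support
```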
5.3.3. Noise Resistance Analysis
Quantum circuits are inherently susceptible to various noise types, such as depolarizing, bit-flip, and phase-flip noise, particularly when deployed on current NISQ devices. This section investigates the robustness of the HQRNN-FD model relative to other quantum baselines (HQRNN, QNN, and QRNN) under three representative noise models. Each model was evaluated at noise intensities of 0 (ideal), 0.01, 0.05, and 0.10, with the results summarized in Table 6.
Depolarizing Noise Impact: Depolarizing noise introduces random qubit decoherence over the Bloch sphere. Its Kraus operators are expressed as

$$K_0 = \sqrt{1-p}\,I, \quad K_1 = \sqrt{\tfrac{p}{3}}\,X, \quad K_2 = \sqrt{\tfrac{p}{3}}\,Y, \quad K_3 = \sqrt{\tfrac{p}{3}}\,Z,$$

where p denotes the depolarization probability and X, Y, and Z represent the Pauli operators. This type of noise applies one of the Pauli operators to a qubit at random with equal probability. Under low noise levels (0.01, 0.05), HQRNN-FD shows minimal performance degradation, with its F1-score decreasing only slightly from 0.9009 to 0.8914 and 0.8906, respectively, while maintaining an accuracy above 96.8%. Even at higher noise intensity (0.10), the model preserves a competitive F1-score of 0.8455 and an accuracy of 95.28%, outperforming most baseline methods and confirming robustness to quantum decoherence.
By contrast, the QNN model exhibits relatively stable performance but maintains lower precision, with a modest decline in F1-score from 0.8408 to 0.8356. However, the QRNN model demonstrates higher sensitivity to depolarizing noise, as evidenced by a substantial drop in F1-score from 0.9768 to 0.7635, suggesting a significant reduction in its generalization capability under noisy conditions.
Impact of Bit-Flip Noise: Bit-flip noise results in the inversion of quantum states between |0⟩ and |1⟩, which can significantly impair classification performance. It is represented by the following Kraus operators:

$$K_0 = \sqrt{1-p}\,I, \qquad K_1 = \sqrt{p}\,X,$$

where p represents the bit-flip probability and X corresponds to the Pauli-X operator. Under high noise conditions (0.10), the F1-score of HQRNN-FD drops to 0.8214, yet it still surpasses both HQRNN (0.8213) and QNN (0.5757). At a moderate noise level (0.05), HQRNN-FD maintains a consistent F1-score of 0.8590, demonstrating enhanced robustness to noise.
By contrast, QRNN exhibits pronounced sensitivity to bit-flip noise, with its F1-score falling sharply to 0.5754 and accuracy decreasing to 83.97% at a noise level of 0.10, indicating limited robustness against such perturbations.
Impact of Phase-Flip Noise: Phase-flip noise alters the phase of qubits and has a particularly adverse effect on models utilizing angle-based encoding. It is characterized by the following Kraus operators:

$$K_0 = \sqrt{1-p}\,I, \qquad K_1 = \sqrt{p}\,Z,$$

where p denotes the phase-flip probability and Z is the Pauli-Z operator. HQRNN-FD exhibits high resilience to phase perturbations, maintaining F1-scores between 0.8913 and 0.8943 across noise levels ranging from 0.01 to 0.10. Its classification accuracy remains consistently above 96.8%, surpassing the performance of HQRNN, QNN, and QRNN.
Notably, both QNN and QRNN experience considerable performance degradation in the presence of phase-flip noise. Specifically, QNN’s F1-score declines to 0.5919 under high noise intensity (0.10), indicating limited robustness to phase disturbances.
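For reference, all three channels described above can be injected into a simulated variational circuit. The sketch below uses PennyLane's built-in noise channels on a mixed-state simulator; applying the chosen channel to every qubit after each variational layer is an assumption made for illustration, as the text does not specify the injection points.

```python
import numpy as np
import pennylane as qml

N_QUBITS, N_VQC_LAYERS = 6, 2
dev = qml.device("default.mixed", wires=N_QUBITS)    # mixed-state simulator supports channels

@qml.qnode(dev)
def noisy_vqc(features, weights, p, channel="depolarizing"):
    for q in range(N_QUBITS):
        qml.RY(features[q], wires=q)                 # angle encoding of the 6 features
    for layer in range(N_VQC_LAYERS):
        for q in range(N_QUBITS):
            qml.RY(weights[layer, q, 0], wires=q)
            qml.RZ(weights[layer, q, 1], wires=q)
        for q in range(N_QUBITS):
            qml.CNOT(wires=[q, (q + 1) % N_QUBITS])  # ring-topology entanglement
        for q in range(N_QUBITS):                    # inject noise after each layer (assumed placement)
            if channel == "depolarizing":
                qml.DepolarizingChannel(p, wires=q)
            elif channel == "bitflip":
                qml.BitFlip(p, wires=q)
            else:
                qml.PhaseFlip(p, wires=q)
    return [qml.expval(qml.PauliZ(q)) for q in range(N_QUBITS)]

weights = np.random.uniform(0, 2 * np.pi, size=(N_VQC_LAYERS, N_QUBITS, 2))
features = np.random.uniform(0, np.pi, size=N_QUBITS)
for channel in ("depolarizing", "bitflip", "phaseflip"):
    for p in (0.0, 0.01, 0.05, 0.10):                # intensities evaluated in Table 6
        noisy_vqc(features, weights, p, channel=channel)
```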
5.3.4. Scalability of the HQRNN-FD Model
In hybrid QNN architectures, the number of qubits plays a critical role in determining the representational power and encoding depth of quantum components. However, increasing qubit count also leads to higher computational complexity and poses additional challenges for parameter optimization.
To assess scalability, the HQRNN-FD model is used as a representative case, with classification performance evaluated under 2-, 4-, and 6-qubit configurations. Accuracy is measured over six monthly intervals from January to June 2023. The corresponding results are presented in Table 7 and illustrated in Figure 11.
The overall results suggest a consistent improvement in classification accuracy as the number of qubits increases. Notably, the 6-qubit configuration enabled HQRNN-FD to achieve the highest accuracy in 5 out of 6 monthly evaluations, with an average accuracy of 0.9721, surpassing the 4-qubit and 2-qubit configurations, which attained 0.9702 and 0.9631, respectively. This performance advantage is particularly evident in February and May, months marked by greater data imbalance, demonstrating the enhanced stability and adaptability of the 6-qubit configuration under challenging conditions.
As shown in Figure 11, the monthly accuracy comparison across different qubit configurations highlights notable performance differences. The 2-qubit model consistently yields lower accuracy, particularly in February (0.9560) and May (0.9519), suggesting a representational limitation in capturing complex transaction dynamics. Conversely, the 4-qubit configuration strikes a practical balance between accuracy and quantum resource usage; its performance in April and June closely approximates that of the 6-qubit model, indicating its suitability for scenarios where computational resources are limited.
Regarding average accuracy, increasing the number of qubits positively influences model performance, though not in a strictly linear fashion. The 4-qubit configuration demonstrates reliable modeling capability, whereas the 6-qubit setup yields further improvements in accuracy, making it more suitable for high-precision fraud detection applications within financial risk management contexts.
5.3.5. Ablation Study Analysis
To quantitatively evaluate the contributions of individual components within the HQRNN-FD framework and examine the efficacy of preprocessing, a systematic ablation study was conducted. This involved selectively disabling key modules—namely, variational quantum circuits (VQCs), self-attention mechanisms, and recurrent neural networks (RNNs)—to create controlled model variants. Each configuration was assessed using an identical test dataset, with results detailed in Table 8.
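One common way to realize such controlled variants is to gate each module behind a configuration flag. The sketch below is purely illustrative: the class and flag names are assumptions, the quantum encoder is replaced by a classical stand-in to keep the example self-contained, and mean pooling is used when attention is disabled.

```python
import torch
import torch.nn as nn

class AblationVariant(nn.Module):
    """Hypothetical wrapper: each flag disables one HQRNN-FD module for the ablation study."""
    def __init__(self, use_vqc=True, use_rnn=True, use_attention=True,
                 n_features=6, hidden=64):
        super().__init__()
        self.use_vqc, self.use_rnn, self.use_attention = use_vqc, use_rnn, use_attention
        # Classical stand-in for the quantum encoder (a real variant would reuse the VQC layer).
        self.encoder = nn.Linear(n_features, n_features)
        self.rnn = nn.RNN(n_features, hidden, batch_first=True) if use_rnn else None
        dim = hidden if use_rnn else n_features
        self.attn = nn.Linear(dim, 1)
        self.head = nn.Linear(dim, 1)

    def forward(self, x):                       # x: (batch, seq_len, n_features)
        h = self.encoder(x) if self.use_vqc else x
        if self.use_rnn:
            h, _ = self.rnn(h)
        if self.use_attention:
            w = torch.softmax(self.attn(h), dim=1)
            h = (w * h).sum(dim=1)              # attention-weighted summary over time
        else:
            h = h.mean(dim=1)                   # plain temporal pooling when attention is removed
        return torch.sigmoid(self.head(h))

# Variant configurations corresponding to the rows of Table 8 (flag choices are illustrative).
variants = {
    "HQRNN-FD":       AblationVariant(),
    "w/o attention":  AblationVariant(use_attention=False),
    "w/o RNNs":       AblationVariant(use_rnn=False),
    "w/o VQCs":       AblationVariant(use_vqc=False),
}
```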
A direct comparison between HQRNN and HQRNN-FD demonstrates that incorporating the attention mechanism enhances the Class 1 F1-score from 0.8703 to 0.9009, indicating improved identification of salient behavioral features within transactional sequences and strengthening anomaly detection performance. Precision also improved from 0.7943 to 0.8486, reflecting a reduction in false positives. These outcomes emphasize the synergy between quantum feature extraction and attention mechanisms in sequential fraud pattern recognition.
By contrast, the HQRNN-FD variant without RNNs retained quantum feature encoding but showed limited temporal modeling capabilities due to the absence of recurrent processing. This resulted in a decrease in F1-score from 0.9009 to 0.8665, indicating that quantum encoding alone is insufficient to capture complex temporal dependencies and fraud-related patterns. The RNN component is essential for processing sequences of transaction features across time, capturing long-range correlations and temporal evolution of user behavior. In particular, the RNN with attention enables the model to focus on informative segments of transaction histories (e.g., irregular intervals of high-risk activity), thus providing context-awareness that purely quantum or static models, such as QNNs, lack.
Meanwhile, the HQRNN-FD configuration without VQCs achieved consistently high recall values (all exceeding 0.96) but suffered from reduced precision and F1-scores, reflecting a tendency to misclassify normal transactions as fraudulent (i.e., over-detection). This further highlights the complementary role of the quantum encoder; by enhancing the representational richness of each transaction’s feature vector, it provides more informative input to the RNN, improving discrimination between subtle classes.
Taken together, these findings validate that the integration of quantum encoding with temporal modeling substantially enhances the robustness and effectiveness of the framework, particularly in scenarios characterized by class imbalance. The superior performance of HQRNN-FD arises from this hybrid synergy—quantum circuits extract expressive nonlinear features at each time step, while RNNs learn the dynamics of transaction evolution. Ablation analysis confirms that removing either component undermines performance, and that the RNN is indispensable for effective sequence modeling.
Preprocessing Impact Assessment: The removal of the preprocessing pipeline—particularly SMOTE oversampling—consistently led to performance degradation across all model configurations, most notably in precision and F1-score. For example, HQRNN-FD’s precision dropped from 0.8486 to 0.7797, and its F1-score declined from 0.9009 to 0.8694, accompanied by a modest accuracy decrease of 0.52% (from 97.15% to 96.66%). These results emphasize the essential role of preprocessing in strengthening model robustness and enhancing the detection of minority-class fraudulent instances.
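A minimal sketch of the SMOTE oversampling step is shown below, assuming the imbalanced-learn implementation and synthetic data in place of the real transaction features; in practice, the oversampler should be fit on the training split only so that test-set statistics remain untouched.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced data standing in for the 6-feature transaction vectors (~5% fraud).
X, y = make_classification(n_samples=5000, n_features=6, n_informative=4,
                           n_redundant=0, weights=[0.95, 0.05], random_state=0)

# SMOTE synthesizes new minority-class samples by interpolating between fraud-class neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("before:", Counter(y), "after:", Counter(y_res))
```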
The adverse impact was more severe in classical variants. HQRNN-FD without attention and VQCs, as well as the version without VQCs alone, experienced significant F1-score reductions to 0.8020 and 0.8320, respectively—underscoring the vulnerability of deep learning models to class imbalance. Notably, the RNN-removed variant (HQRNN-FD w/o RNNs) showed an increase in recall (0.9656) in the absence of preprocessing; however, this was offset by a substantial drop in F1-score, suggesting that higher recall came at the cost of more false positives, ultimately compromising the model’s reliability and interpretability. This performance drop was consistent across precision, recall, and accuracy metrics, reinforcing the robustness of the RNN’s contribution across evaluation criteria.