Adaptive Threat Mitigation in PoW Blockchains (Part II): A Deep Reinforcement Learning Approach to Countering Evasive Adversaries
Abstract
1. Introduction
Contributions
- We identify limitations of static defenses against adaptive adversaries and formulate the adaptive defense problem as a Constrained Markov Decision Process (CMDP) with explicit safety constraints on liveness and fairness, amenable to safe reinforcement learning.
- We design a DRL agent with a proxy-based reward function that balances attack deterrence with network stability, enabling training without ground-truth labels. We evaluate multiple architectures: Double DQN with dueling networks, prioritized replay, and recurrent policies (DRQN/LSTM), and we compare these against supervised and GAN-based alternatives.
- We establish formal theoretical guarantees: (1) probabilistic safety bounds ensuring FPR ≤ 8% and bounded latency with probability ≥ 0.973 (Theorem 1); (2) Q-function convergence under Robbins–Monro conditions (Theorem 2); and (3) empirical sublinear regret scaling that outperforms Thompson Sampling (Lemma 2).
- Through comprehensive evaluation on a 128-node distributed testbed over 30 independent runs, we demonstrate: (a) sustained attack suppression (95%, reducing adversary profit relative to the static and baseline defenses), (b) zero-day adaptation within 24 h, (c) a superior F1-score of 0.95 vs. 0.78 (supervised) and 0.86 (GAN), and (d) generalization across DAA regimes with only 4% performance degradation.
- We provide detailed deployment models for integrating DRL into decentralized consensus, addressing deterministic inference requirements, on-chain governance protocols, and shadow-mode evaluation procedures.
2. Related Work
2.1. Wave Attacks and Difficulty Manipulation
2.2. Machine Learning in Network Security
2.3. Deep Reinforcement Learning Foundations
2.4. Constrained and Safe Reinforcement Learning
2.5. Reinforcement Learning in Cybersecurity
2.6. AI and Machine Learning in Blockchain
2.7. Positioning of Our Work
- vs. Protocol-level DAA defenses (Li, Komodo): We operate as a detection layer compatible with existing DAAs, avoiding consensus changes. Our DRL agent learns policies generalizable across DAA families (Section 4.4).
- vs. Static ML (supervised, GANs): We demonstrate superior adaptability to evolving adversaries (Table 1) and zero-day resilience (Section 4.6). DRL co-evolves with threats; static models degrade over time.
- vs. General RL cybersecurity (Nguyen, Abu-Mahfouz): We address blockchain-specific challenges—deterministic consensus requirements, decentralized deployment, DAA dynamics—with formal safety guarantees (Theorem 1).
- vs. Blockchain ML surveys (Nasir, Alghamdi): We provide comprehensive implementation (Section 3), empirical evaluation (Section 4), ablation studies (Section 3.2.3 and Section 3.3), and production deployment models (Section 3.7), advancing beyond conceptual frameworks.
3. Methodology
3.1. Limitations of Static Defenses
| Time Period | Baseline | Static | DRL-Enhanced |
|---|---|---|---|
| Days 0–5 | |||
| Days 6–10 | |||
| Days 11–15 | |||
| Days 16–20 | |||
| Days 21–25 | |||
| Days 26–30 | |||
| 30-Day Weighted Avg |
3.2. DRL Agent for Adaptive Detection
- Mean and variance of inter-block intervals over the last W blocks;
- Number of flagged operators in the current window;
- Estimated adversary profit proxy (rate of anomalous blocks by flagged operators);
- Current parameter settings (detection threshold, FDR level, cooldown window);
- Block interval variance (normalized).
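To make the observation concrete, here is a minimal sketch of assembling these features into the agent's state vector; the function name `build_observation` and its argument layout are illustrative rather than the paper's API, and the full 12-dimensional specification follows in Section 3.2.1.

```python
import numpy as np

def build_observation(intervals, flags, profit_proxy, params, var_norm):
    """Assemble the core observation features from recent chain telemetry.

    A minimal sketch; names are illustrative.
    intervals    : inter-block intervals (seconds) over the last W blocks
    flags        : number of operators currently flagged
    profit_proxy : rate of anomalous blocks by flagged operators
    params       : current detector parameters (threshold, FDR level, cooldown)
    var_norm     : normalized block-interval variance
    """
    obs = [
        np.mean(intervals),   # mean inter-block interval
        np.var(intervals),    # interval variance over the window
        float(flags),         # flagged operators in current window
        profit_proxy,         # adversary profit proxy
        *params,              # current parameter settings
        var_norm,             # normalized interval variance
    ]
    return np.asarray(obs, dtype=np.float32)
```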
3.2.1. Complete State Space Specification
- Increase/decrease the detection threshold by 5%;
- Adjust the FDR parameter within the permitted range [0.01, 0.10];
- Lengthen/shorten the cooldown window (in 2 h increments);
- No change (maintain current parameters).
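As an illustration of this discrete action space, the sketch below enumerates a plausible subset of the nine actions and applies them with clamping; the enum names and the FDR/cooldown step sizes are assumptions (12 blocks ≈ 2 h at a 600 s block target).

```python
from enum import IntEnum

class Action(IntEnum):
    """Illustrative subset of the discrete action space (the deployed
    agent exposes 9 such actions); names are hypothetical."""
    NOOP = 0
    THRESHOLD_UP = 1      # raise detection threshold by 5%
    THRESHOLD_DOWN = 2    # lower detection threshold by 5%
    FDR_UP = 3            # raise FDR level within [0.01, 0.10]
    FDR_DOWN = 4
    COOLDOWN_LONGER = 5   # lengthen cooldown by one 2 h increment
    COOLDOWN_SHORTER = 6

def apply_action(params, action, fdr_step=0.01, cooldown_step_blocks=12):
    """Return updated (threshold, fdr, cooldown) after a discrete action.
    Step sizes are assumptions; 12 blocks ≈ 2 h at a 600 s target."""
    threshold, fdr, cooldown = params
    if action == Action.THRESHOLD_UP:
        threshold *= 1.05
    elif action == Action.THRESHOLD_DOWN:
        threshold *= 0.95
    elif action == Action.FDR_UP:
        fdr = min(fdr + fdr_step, 0.10)   # permitted range [0.01, 0.10]
    elif action == Action.FDR_DOWN:
        fdr = max(fdr - fdr_step, 0.01)
    elif action == Action.COOLDOWN_LONGER:
        cooldown += cooldown_step_blocks
    elif action == Action.COOLDOWN_SHORTER:
        cooldown = max(cooldown - cooldown_step_blocks, 0)
    return threshold, fdr, cooldown
```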
3.2.2. Action Space Design and Granularity Selection
- Fine-grained (1%): Excessive parameter thrashing (0.32 changes/day) without performance gain. Adversaries can exploit oscillations. High training instability from dense action space.
- Coarse-grained (10–15%): Large jumps cause FPR instability (std up to 0.042) and overshoot optimal thresholds, reducing F1-score by 4–7%.
- Optimal (5%): Achieves best F1-score (0.95), minimal thrashing (0.09 changes/day), stable FPR, and fastest convergence (197 K steps). This granularity provides sufficient resolution for adaptation while preventing jitter.
- Temporal separation: Ensures cooldown periods span multiple block production cycles, preventing rapid re-flagging of honest miners experiencing transient variance.
- Responsiveness: Allows adjustment within reasonable timeframes (on the order of hours) to counter evolving attacks.
- Governance transparency: Humans can audit and understand 2 h increments.
1. Consensus determinism: All nodes must select identical actions from identical states. Discrete actions with deterministic argmax ensure bit-identical inference across heterogeneous hardware. DDPG’s continuous outputs experienced rounding artifacts causing 0.3% consensus mismatches (unacceptable in production).
2. Governance transparency: Human operators can audit discrete parameter changes (e.g., “threshold increased by 5%”). Continuous micro-adjustments (e.g., “threshold changed by 3.7281%”) obscure intent.
3. Training stability: Discrete Q-learning converged 11% faster (197 K vs. 221 K steps) with lower variance. DDPG’s actor–critic requires careful hyperparameter tuning.
4. Action space coverage: With 9 discrete actions, exhaustive evaluation of safety constraints is tractable. Continuous spaces require conservative over-approximation of safe regions.
- Sample efficiency degraded (347 K steps to convergence vs. 197 K for sequential);
- Interpretability suffered (debugging which parameter caused failure becomes ambiguous);
- No F1-score improvement (0.95 for both; joint: 95% CI [0.93, 0.96], sequential: [0.94, 0.96]).
- No-op (maintain): 68.2%;
- Adjust threshold: 18.5% (increase: 9.7%, decrease: 8.8%);
- Adjust FDR parameter: 7.8%;
- Adjust cooldown window: 5.5%.
- Threshold bounds: prevent overly permissive or restrictive thresholds;
- FDR bounds: maintain the FDR within acceptable limits;
- Cooldown bounds: ensure the cooldown provides temporal separation;
- Maximum parameter drift per day: per-parameter caps on cumulative change.
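A minimal sketch of how these constraints translate into an action mask, assuming the hypothetical `apply_action` helper sketched in Section 3.2.1 and placeholder bound/drift structures; the deployed masking follows Algorithm 1.

```python
import numpy as np

N_ACTIONS = 9

def safe_action_mask(params, drift_today, bounds, max_drift):
    """Boolean mask over the discrete actions: an action is legal only if
    the resulting parameters stay inside their safety bounds and the
    cumulative per-day drift budget is respected."""
    mask = np.zeros(N_ACTIONS, dtype=bool)
    for a in range(N_ACTIONS):
        candidate = apply_action(params, a)  # hypothetical helper above
        in_bounds = all(lo <= p <= hi
                        for p, (lo, hi) in zip(candidate, bounds))
        drift_ok = all(abs(c - p) + d <= m
                       for c, p, d, m in zip(candidate, params,
                                             drift_today, max_drift))
        mask[a] = in_bounds and drift_ok
    return mask

# During action selection, masked actions are excluded from the argmax:
# q[~safe_action_mask(...)] = -np.inf; action = int(np.argmax(q))
```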
- Adversary profit proxy (flagged anomalous blocks);
- Block interval variance (liveness penalty);
- Parameter movement cost (discourages thrashing);
- False positive penalty (protects honest miners).
3.2.3. Reward Function Design, Tuning, and Sensitivity
- Weight on the adversary profit penalty;
- Weight on the liveness penalty;
- Weight on the parameter change cost;
- Weight on the false positive penalty.
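A minimal sketch of the proxy reward as a weighted sum of the four penalties above; the published weight values are elided in the text, so `w` is a placeholder.

```python
def reward(profit_proxy, interval_var, param_delta, fpr,
           w=(1.0, 1.0, 1.0, 1.0)):
    """Proxy reward as a weighted sum of penalties; the agent maximizes
    their negative. Weight values are placeholders, not the paper's."""
    w1, w2, w3, w4 = w
    return -(w1 * profit_proxy        # adversary profit proxy
             + w2 * interval_var      # liveness penalty (interval variance)
             + w3 * abs(param_delta)  # movement cost, discourages thrashing
             + w4 * fpr)              # false-positive penalty
```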
3.3. Architecture Evaluation and Selection
1. Baseline DQN [9]: Single Q-network with uniform replay sampling. Achieved 89% attack suppression but exhibited high variance and occasional instability in non-stationary environments.
2. Double DQN (DDQN): Decouples action selection from evaluation using a target network, reducing overestimation bias. Improved stability and raised average suppression to 91%.
3. Dueling DQN: Separates value and advantage streams, recombined as $Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')$. This further improved suppression to 93% by better generalizing across actions with similar values.
4. Prioritized Experience Replay (PER): Samples transitions proportional to TD-error with priority $p_i = (|\delta_i| + \epsilon)^{\alpha}$. Critical for learning from rare but important attack patterns (a sampling sketch follows this list). Combined DDQN + Dueling + PER achieved 95% suppression (selected configuration).
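The PER sampling step, sketched below in the standard formulation; α = 0.6 matches the hyperparameter table, while β and ε are assumed defaults.

```python
import numpy as np

def per_sample(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    """Prioritized replay sampling: transitions are drawn with probability
    proportional to |TD-error|^alpha, and importance-sampling weights
    correct the induced bias. beta is typically annealed toward 1."""
    p = (np.abs(td_errors) + eps) ** alpha
    probs = p / p.sum()
    idx = np.random.choice(len(p), size=batch_size, p=probs)
    weights = (len(p) * probs[idx]) ** (-beta)
    weights /= weights.max()  # normalize for stable gradient updates
    return idx, weights
```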
- DRQN: Replaces the fully-connected layers with an LSTM to maintain hidden state across steps. Handles partial observability better during stealthy attack phases.
- Performance comparison: DRQN achieved 94% suppression with 22% longer time-to-convergence (240 K vs. 197 K steps) but provided 12% better zero-day adaptation speed. For production deployment, we select DDQN + Dueling + PER for balance of performance, training efficiency, and deterministic inference requirements. DRQN remains promising for future work addressing highly adaptive adversaries.
3.4. Training Procedure
| Algorithm 1 Safe DRL Training with Action Masking (Double DQN + PER) |
| Require: Environment, safety thresholds. Require: Hyperparameters: learning rate, discount factor, exploration schedule, batch size. |
- Input: State vector (12 dimensions);
- Shared trunk: FC layers [128, 128, 64] neurons with ReLU activation;
- Dueling heads: the network splits into:
  - Value head $V(s)$: FC layer (64 → 1);
  - Advantage head $A(s,a)$: FC layer (64 → 9);
  - Q-values recombined as $Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')$;
- Total parameters: ∼26,800 (including biases and dueling heads).
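A PyTorch sketch matching the stated layer sizes (the paper does not name its framework); the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Sketch of the network above: 12-dim state, shared trunk
    [128, 128, 64] with ReLU, and dueling heads over 9 actions."""
    def __init__(self, state_dim=12, n_actions=9):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        self.value = nn.Linear(64, 1)               # V(s)
        self.advantage = nn.Linear(64, n_actions)   # A(s, a)

    def forward(self, s):
        h = self.trunk(s)
        v, a = self.value(h), self.advantage(h)
        # Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a')
        return v + a - a.mean(dim=-1, keepdim=True)
```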
- Learning rate: (Adam optimizer);
- Batch size: 64;
- Replay buffer: 50,000 transitions;
- Target network update: every 1000 steps;
- ε-greedy exploration: ε annealed over 100,000 steps;
- Discount factor: γ = 0.99.
3.5. Training Environment Fidelity and Attack Distribution
3.5.1. Simulator Architecture
- PoW consensus: Full block validation matching Bitcoin Core v23.0 logic;
- Network layer: Geometric delay distribution (mean 2.3 s, std 1.8 s);
- Mining pools: Log-normal hashrate distribution.
3.5.2. Training Attack Distribution
1. Standard wave attacks (40%): binary on/off waves;
2. Variable-amplitude waves (30%): amplitude drawn from (0.5, 1.0);
3. Irregular timing waves (20%): randomized wave periods;
4. Stealth attacks (10%): low-amplitude sustained manipulation.
3.5.3. Overfitting Prevention
3.6. Theoretical Properties and Safety Guarantees
1. Boundedness: $|R(s,a)| \le R_{\max}$ for all state–action pairs;
2. Lipschitz Continuity: $|R(s,a) - R(s',a)| \le L_R \lVert s - s' \rVert$ for all $s, s', a$.
- Adversary profit proxy: maximum observed adversary profit percentage during extreme difficulty-suppression events (occurring in <0.1% of states, representing adversaries exploiting the difficulty adjustment);
- Block interval variance in s², normalized by the target interval T = 600 s; the maximum occurs during coordinated network attacks;
- Maximum single-step parameter change under action-masking constraints;
- False positive rate as a probability; the upper bound represents worst-case, overly aggressive detection.
- DRL agent: empirical regret-scaling exponent with 95% CI [0.61, 0.69];
- Thompson Sampling baseline: 95% CI [0.67, 0.79].
3.7. Decentralized Implementation Models
- Policy signing: Trained policy weights cryptographically signed by core developers. Nodes verify signature before loading policy, preventing malicious model injection.
- Hash commitment: Policy weight hash committed on-chain in prior upgrade. Nodes validate hash match before execution, ensuring bit-identical policy across network.
- Deterministic inference: a critical requirement for consensus (a fixed-point sketch follows this list). We enforce:
  - Fixed-point arithmetic (INT32) for all computations;
  - Deterministic library versions (ONNX Runtime 1.15.1, CPU-only);
  - No fused operations or platform-specific optimizations;
  - A comprehensive inference test suite with 10,000 edge cases.
Validation: 128 heterogeneous nodes (x86, ARM, different OS) achieve bit-identical outputs across inference calls.
- Shadow-mode evaluation: New policies run in shadow mode for 1008 blocks (7 days at the Bitcoin target), logging recommendations without affecting consensus. The community reviews shadow-mode performance metrics (suppression rate, false positives, parameter stability) before the activation vote.
- Proposal cadence: Maximum 1 parameter update per 2016 blocks (2 weeks) to prevent governance fatigue and parameter thrashing.
- Grace period: After the vote passes, a 144-block (1-day) grace period precedes activation, allowing nodes to upgrade and validators to prepare.
- Emergency rollback: If deployed policy causes >10% block acceptance delay or >15% false positive rate spike, emergency rollback transaction (requiring 67% validator approval) reverts to a previous parameter set within 6 blocks.
- Performance monitoring: The on-chain dashboard tracks adversary profit proxy, FPR (7-day MA), block interval variance, and parameter drift rate. Governance can trigger audits if metrics degrade.
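To illustrate the deterministic-inference requirement, here is a fixed-point forward pass with an integer argmax; the Q16.16 scale and 64-bit accumulation (mirroring how INT32 pipelines accumulate in wider registers) are assumptions, not the paper's exact arithmetic.

```python
import numpy as np

SCALE = 2 ** 16  # Q16.16 fixed point; the scale is an assumption

def to_fixed(x):
    """Quantize float inputs/weights to fixed-point integers."""
    return np.round(np.asarray(x, dtype=np.float64) * SCALE).astype(np.int64)

def fixed_linear(x_fp, w_fp, b_fp):
    """Integer-only affine layer with explicit rescale: all operations
    are integer, so results are bit-identical across x86/ARM regardless
    of FPU behavior."""
    return (x_fp @ w_fp) // SCALE + b_fp

def select_action(state, layers):
    """Deterministic argmax over integer Q-values; np.argmax returns the
    first maximal index, so ties break identically on every node."""
    h = to_fixed(state)
    for i, (w_fp, b_fp) in enumerate(layers):
        h = fixed_linear(h, w_fp, b_fp)
        if i < len(layers) - 1:
            h = np.maximum(h, 0)  # integer ReLU
    return int(np.argmax(h))
```

And a one-function sketch of the emergency-rollback trigger defined above; the thresholds come directly from the governance rules, while the argument names are illustrative.

```python
def should_emergency_rollback(delay_increase, fpr_spike, validator_approval):
    """Trigger rollback on >10% block-acceptance delay or >15% FPR spike,
    gated on 67% validator approval."""
    degraded = delay_increase > 0.10 or fpr_spike > 0.15
    return degraded and validator_approval >= 0.67
```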
4. Evaluation
4.1. Experimental Setup
- When its waves are being detected: reduce amplitude by 10%;
- When it is evading detection: increase amplitude by 5%;
- Otherwise maintain current amplitude.
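A sketch of this adaptive adversary; the triggering conditions are elided in the text, so we assume, for illustration, that it backs off when recently detected and probes upward when evading.

```python
def adapt_amplitude(amplitude, detection_rate, evading,
                    detect_thresh=0.5):
    """Adaptive adversary from Section 4.1. Conditions are assumptions:
    retreat when the recent detection rate is high, probe upward when
    evading detection, otherwise hold steady."""
    if detection_rate > detect_thresh:
        return amplitude * 0.90   # reduce amplitude by 10%
    elif evading:
        return amplitude * 1.05   # increase amplitude by 5%
    return amplitude              # maintain current amplitude
```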
- Baseline detector: Simple variance-based detector;
- Static framework: Complete system from Part I with fixed optimal parameters;
- DRL-enhanced framework: Static framework augmented with DRL agent.
Reproducibility and Configuration
| Listing 1. Agent configuration (train_config.py). |
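Since the original listing is rendered as an image, the following is a reconstruction of a plausible `train_config.py` from the hyperparameters reported in Table 9; the learning rate, ε endpoints, and PER β are elided in the source and therefore omitted, and the field names are illustrative.

```python
# train_config.py -- reconstruction of Listing 1 from the reported
# hyperparameters; field names are illustrative, not the paper's exact keys.
CONFIG = {
    # DRL hyperparameters
    "gamma": 0.99,                  # discount factor
    "replay_buffer_size": 50_000,
    "batch_size": 64,
    "target_update_freq": 1_000,    # steps between target-network syncs
    "epsilon_decay_steps": 100_000, # start/end values elided in the source
    "optimizer": "adam",            # learning rate elided in the source
    "grad_clip": 10.0,
    "per_alpha": 0.6,               # prioritized-replay exponent
    "n_actions": 9,
    "state_dim": 12,
    # Environment configuration
    "node_count": 128,
    "block_time_target_s": 600,
    "median_network_delay_s": 2.0,
    "daa_window_blocks": 144,
    "adversary_hashrate": 0.30,
    "penalty_factor": 0.50,
    "simulation_days": 30,
    "independent_runs": 30,
}
```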
4.2. Performance Against Adaptive Adversary
4.3. Comprehensive Baseline Comparison
- Thompson Sampling: Treats each parameter configuration as multi-armed bandit arm and samples according to posterior belief. Assumes stationary reward distributions and struggles with adversarial non-stationarity.
- PID Controller: Proportional-Integral-Derivative controller targeting constant 5% FPR. Tunes based on FPR error signal. Cannot anticipate adversary strategy shifts.
- EWMA Adaptive: Exponentially weighted moving average of attack metrics drives threshold adjustments. Reactive but lacks strategic foresight.
- Contextual Bandit: Linear contextual bandit using state features to select actions. Better than non-contextual but limited by linear assumptions.
- FPR: False positive rate, i.e., honest miners incorrectly flagged (lower is better; the target is ≤5%);
- Latency: Mean block acceptance delay as a multiple of the target interval T (lower is better);
- Param Thrash: Mean absolute parameter change per day (lower is better, as this indicates stability).
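Minimal helpers computing these three metrics as defined above; the function names are illustrative.

```python
import numpy as np

def fpr(false_flags, honest_total):
    """False positive rate: honest miners incorrectly flagged."""
    return false_flags / max(honest_total, 1)

def latency_multiple(mean_accept_delay_s, target_interval_s=600):
    """Mean block-acceptance delay as a multiple of the target interval T."""
    return mean_accept_delay_s / target_interval_s

def param_thrash(daily_params):
    """Mean absolute parameter change per day; `daily_params` is one
    parameter value sampled once per day."""
    return float(np.mean(np.abs(np.diff(daily_params))))
```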
4.4. Generalization Across DAA Regimes
4.5. Non-Stationarity Stress Tests
- Stealthy low-amplitude waves (10-day cycles);
- Rare high-amplitude bursts (6 h duration every 5 days).
4.6. Resilience to Zero-Day Attacks
4.7. Comparative Analysis of AI Methodologies
- Architecture: 4-layer MLP matching DQN architecture;
- Labels: Retrospective ground-truth attack labels (available offline);
- Training: 80/20 train/validation split, early stopping on validation loss;
- Test: Deployment on unseen 30-day evaluation period.
- Generator: Three-layer MLP [32 → 64 → 128 → 12] mapping the latent vector to the state space;
- Discriminator: Three-layer MLP [12 → 64 → 32 → 1] distinguishing real vs. generated states;
- Training: On honest-only states (120,000 samples), WGAN-GP loss with gradient penalty, 50,000 iterations, and the Adam optimizer;
- Anomaly score: AnoGAN-style score combining the reconstruction residual (via an encoder into the generator's latent space) with a discriminator feature-matching term [8];
- Threshold: Set at the 95th percentile of anomaly scores on an honest-only validation set to achieve target FPR ≈ 5%;
- Latent dimension: selected via grid search.
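A sketch of the AnoGAN-style anomaly score the detector is described as using [8]; the weighting `lam` is an assumption, and `G`, `E`, and `D_features` stand for the generator, encoder, and discriminator feature extractor.

```python
import torch

def anomaly_score(x, G, E, D_features, lam=0.1):
    """AnoGAN-style score: L1 reconstruction residual through the encoder,
    plus a discriminator feature-matching term. `lam` is a placeholder."""
    with torch.no_grad():
        recon = G(E(x))                                  # reconstruct state
        residual = torch.norm(x - recon, p=1, dim=-1)    # pixel-space error
        feat_dist = torch.norm(D_features(x) - D_features(recon),
                               p=1, dim=-1)              # feature-space error
    return (1 - lam) * residual + lam * feat_dist

# States are flagged when their score exceeds the 95th percentile of
# honest-only validation scores (target FPR ~ 5%).
```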
- Training: Online interaction with simulated environment (200,000 steps);
- No ground-truth labels; learns from proxy reward signal;
- Test: Same 30-day evaluation period.
5. Discussion and Limitations
6. Future Work
7. Conclusions
Funding
Data Availability Statement
Conflicts of Interest
References
- Skowroński, R. Liveness over Fairness (Part I): A Statistically Grounded Framework for Detecting and Mitigating PoW Wave Attacks. Information 2025, 16, 1060. [Google Scholar] [CrossRef]
- Li, J.; Xie, L.; Huang, H.; Zhou, B.; Song, B.; Zeng, W.; Deng, X.; Zhang, X. Survey on Strategic Mining in Blockchain: A Reinforcement Learning Approach. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2025); IJCAI Organization: Marina del Rey, CA, USA, 2025. [Google Scholar] [CrossRef]
- Nikhalat-Jahromi, A.; Saghiri, A.M.; Meybodi, M.R. Nik Defense: An Artificial Intelligence Based Defense Mechanism against Selfish Mining in Bitcoin. arXiv 2023, arXiv:2301.11463. [Google Scholar] [CrossRef]
- Grunspan, C.; Pérez-Marco, R. On Profitability of Selfish Mining. arXiv 2018, arXiv:1805.08281. [Google Scholar] [CrossRef]
- Komodo Platform. Adaptive Proof of Work (APoW): Komodo’s New Solution to Difficulty Adjustment Attacks. Komodo Platform Blog. April 2022. Available online: https://komodoplatform.com/en/blog/adaptive-proof-of-work/ (accessed on 15 October 2025).
- Zhang, J.; Xiang, Y.; Wang, Y.; Zhou, W.; Xiang, Y.; Guan, Y. Network Traffic Classification Using Correlation Information. IEEE Trans. Parallel Distrib. Syst. 2013, 24, 104–117. [Google Scholar] [CrossRef]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems 27 (NIPS 2014); Curran Associates, Inc.: Red Hook, NY, USA, 2014; pp. 2672–2680. Available online: https://papers.nips.cc/paper/5423-generative-adversarial-nets (accessed on 31 January 2026).
- Schlegl, T.; Seeböck, P.; Waldstein, S.M.; Schmidt-Erfurth, U.; Langs, G. Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery. In Proceedings of the International Conference on Information Processing in Medical Imaging; Springer: Cham, Switzerland, 2017; pp. 146–157. [Google Scholar] [CrossRef]
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
- Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the 4th International Conference on Learning Representations (ICLR 2016), San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
- Altman, E. Constrained Markov Decision Processes; CRC Press: Boca Raton, FL, USA, 1999. [Google Scholar] [CrossRef]
- Achiam, J.; Held, D.; Tamar, A.; Abbeel, P. Constrained Policy Optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017); PMLR: New York, NY, USA, 2017; Volume 70, pp. 22–31. Available online: https://proceedings.mlr.press/v70/achiam17a.html (accessed on 31 January 2026).
- Ray, A.; Achiam, J.; Amodei, D. Benchmarking Safe Exploration in Deep Reinforcement Learning. OpenAI Technical Report. 2019. Available online: https://cdn.openai.com/safexp-short.pdf (accessed on 15 October 2025).
- Nguyen, T.T.; Reddi, V.J. Deep Reinforcement Learning for Cyber Security. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 3779–3795. [Google Scholar] [CrossRef] [PubMed]
- Ferrag, M.A.; Maglaras, L.; Moschoyiannis, S.; Janicke, H. Deep Learning for Cyber Security Intrusion Detection: Approaches, Datasets, and Comparative Study. J. Inf. Secur. Appl. 2020, 50, 102419. [Google Scholar] [CrossRef]
- Chang, Z.; Cai, Y.; Liu, X.F.; Xie, Z.; Liu, Y.; Zhan, Q. Anomalous Node Detection in Blockchain Networks Based on Graph Neural Networks. Sensors 2025, 25, 1. [Google Scholar] [CrossRef] [PubMed]
- Mounnan, O.; Manad, O.; Boubchir, L.; El Mouatasim, A.; Daachi, B. A Review on Deep Anomaly Detection in Blockchain. Blockchain Res. Appl. 2024, 5, 100227. [Google Scholar] [CrossRef]
- Zhang, Z.; Yu, G.; Sun, C.; Wang, X.; Wang, Y.; Zhang, M.; Ni, W.; Liu, R.P.; Reeves, A.; Georgalas, N. TbDd: A New Trust-Based, DRL-Driven Framework for Blockchain Sharding in IoT. Comput. Netw. 2024, 244, 110343. [Google Scholar] [CrossRef]
- Islam, T.; Bappy, F.H.; Zaman, T.S.; Sajid, M.S.I.; Pritom, M.M.A. MRL-PoS: A Multi-Agent Reinforcement Learning Based Proof of Stake Consensus Algorithm for Blockchain. In 2024 IEEE 14th Annual Computing and Communication Workshop and Conference (CCWC); IEEE: Piscataway, NJ, USA, 2024; pp. 393–399. [Google Scholar] [CrossRef]
- Li, J.; Gu, C.; Wei, F.; Chen, X. A Survey on Blockchain Anomaly Detection Using Data Mining Techniques. In Blockchain and Trustworthy Systems; Springer: Singapore, 2019; pp. 491–504. [Google Scholar] [CrossRef]
- Sarker, I.H. Multi-Aspects AI-Based Modeling and Adversarial Learning for Cybersecurity Intelligence and Robustness: A Comprehensive Overview. Secur. Priv. 2023, 6, e295. [Google Scholar] [CrossRef]
- Villegas-Ch, W.; Govea, J.; Gutierrez, R. Optimizing Consensus in Blockchain with Deep and Reinforcement Learning. Emerg. Sci. J. 2025, 9, 1886–1908. [Google Scholar] [CrossRef]
- Li, P.; Song, M.; Xing, M.; Xiao, Z.; Ding, Q.; Guan, S.; Long, J. SPRING: Improving the Throughput of Sharding Blockchain via Deep Reinforcement Learning Based State Placement. In Proceedings of the ACM Web Conference 2024 (WWW ’24); ACM: New York, NY, USA, 2024; pp. 2836–2846. [Google Scholar] [CrossRef]
- Gutierrez, R.; Villegas-Ch, W.; Govea, J. Adaptive Consensus Optimization in Blockchain Using Reinforcement Learning and Validation in Adversarial Environments. Front. Artif. Intell. 2025, 8, 1672273. [Google Scholar] [CrossRef] [PubMed]
- Eyal, I.; Sirer, E.G. Majority is not enough: Bitcoin mining is vulnerable. Commun. ACM 2018, 61, 95–102. [Google Scholar] [CrossRef]
- Skowroński, R. The open blockchain-aided multi-agent symbiotic cyber–physical systems. Future Gener. Comput. Syst. 2019, 94, 430–443. [Google Scholar] [CrossRef]
- Skowroński, R.; Brzeziński, J. UI dApps meet decentralized operating systems. Electronics 2022, 11, 3004. [Google Scholar] [CrossRef]
- Skowroński, R.; Brzeziński, J. SPIDE: Sybil-proof, incentivized data exchange. Clust. Comput. 2021, 25, 2241–2270. [Google Scholar] [CrossRef]




| Feature | Definition & Computation | Range |
|---|---|---|
| Mean block interval | Mean inter-block interval over the last W blocks | |
| Interval std. dev. | Standard deviation of intervals over the last W blocks | |
| Flagged operators | Count of mining identities with anomaly score above threshold in current DAA window | |
| Adversary profit proxy | Rate of anomalous blocks by flagged operators relative to their proportional share | |
| Current threshold | Anomaly detection threshold (from Part I framework) | |
| Current FDR level | FDR control parameter (Benjamini–Hochberg) | |
| Current cooldown | Cooldown window length (blocks) | |
| Interval variance (MAD-scaled) | Variance scaled by the MAD of a buffer of recent variance estimates | |
| FPR estimate | Estimated over the last 100 blocks (via ground-truth labels in training) | |
| Detection events | Count of penalty actions triggered in the last W blocks | |
| Parameter thrash rate | Mean absolute parameter change per unit time | |
| Cooldown violations | Count of detection events during active cooldown windows in the last W blocks | |
| Granularity | Thrash Rate (chg/Day) | F1-Score (Mean ± Std) | FPR Stab. (Std FPR) | Training (Steps) |
|---|---|---|---|---|
| 1% | 0.32 | 0.94 ± 0.03 | 0.021 | 285 K |
| 2.5% | 0.18 | 0.95 ± 0.02 | 0.018 | 215 K |
| 5% | 0.09 | 0.95 ± 0.02 | 0.016 | 197 K |
| 10% | 0.05 | 0.91 ± 0.05 | 0.034 | 220 K |
| 15% | 0.04 | 0.88 ± 0.06 | 0.042 | 245 K |
| Method | F1-Score (Mean ± Std) | Consensus Determinism | Training Stability | Governance Auditability |
|---|---|---|---|---|
| Discrete DQN | 0.95 ± 0.02 | 100% | Stable | High |
| Continuous DDPG | 0.94 ± 0.03 | 99.7% | Moderate | Low |
| Reward Type | Steps to Convergence | Final F1 (Mean ± Std) | Gradient Variance |
|---|---|---|---|
| Clipped | 267 K | 0.92 ± 0.04 | High (0.38) |
| Shaped (Equation (8)) | 197 K | 0.95 ± 0.02 | Low (0.15) |
| Improvement | −26% | +3% | −61% |
| Architecture | Suppression (%) | Variance (±%) | Convergence (K Steps) | F1-Score |
|---|---|---|---|---|
| Baseline DQN | 89.2 | ±12.1 | 210 | 0.88 |
| Double DQN | 91.4 | ±7.3 | 203 | 0.91 |
| Dueling DQN | 93.1 | ±6.8 | 198 | 0.93 |
| DDQN + Duel + PER | 95.3 | ±4.2 | 197 | 0.95 |
| DRQN (LSTM) | 94.1 | ±5.9 | 240 | 0.94 |
| No Replay | 82.5 | ±15.7 | 285 | 0.81 |
| Clipped Reward | 90.8 | ±8.4 | 267 | 0.89 |
| Mean/Std Scaling | 91.2 | ±9.1 | 201 | 0.90 |
| Metric | Training Set (Last 10 K Steps) | Validation Set (Held-Out) | Test Set (30-Day Eval) |
|---|---|---|---|
| F1-Score | 0.96 ± 0.01 | 0.95 ± 0.02 | 0.95 ± 0.02 |
| Adversary Profit | ± 8% | ± 11% | ± 13% |
| FPR | 0.038 ± 0.006 | 0.041 ± 0.012 | 0.043 ± 0.015 |
| Train–Test Gap | – | 1.0% (F1) | 1.0% (F1) |
| DRL Hyperparameter | Value | Environment Config | Value |
|---|---|---|---|
| Learning Rate | | Node Count (N) | 128 |
| Discount Factor (γ) | 0.99 | Block Time Target (T) | 600 s |
| Replay Buffer Size | 50,000 | Network Delay (median) | 2.0 s |
| Batch Size | 64 | DAA Window Size (W) | 144 blocks |
| Target Update Freq | 1000 steps | Adversary Hashrate | 30% |
| ε Start/End | | Vesting Period (V) | blocks |
| ε Decay Steps | 100,000 | Penalty Factor | 50% |
| Optimizer | Adam | Simulation Duration | 30 days |
| Gradient Clipping | 10.0 | Independent Runs | 30 |
| PER α (priority) | 0.6 | Action Space | 9 |
| PER β (IS correction) | | State Space | 12 dims |
| Metric | Baseline | Static | DRL-Enhanced |
|---|---|---|---|
| Initial Profit (Days 0–5) | |||
| Avg. Adversary Profit (30 days) | |||
| Final Profit (Days 26–30) | |||
| Time to Recovery (days) | ≈3 | ≈18 | N/A |
| Method | Adv. Profit | FPR (%) | Latency (×T) | Param Thrash | F1 Score |
|---|---|---|---|---|---|
| Baseline (Variance) | | 4.8 | | – | 0.75 |
| Static Framework | | 4.1 | | 0.00 | 0.93 |
| Thompson Sampling | | 6.2 | | 0.18 | 0.84 |
| PID Controller | | 5.5 | | 0.22 | 0.87 |
| EWMA Adaptive | | 5.1 | | 0.15 | 0.89 |
| Contextual Bandit | | 4.9 | | 0.21 | 0.91 |
| DRL (Ours) | | 3.8 | | 0.09 | 0.95 |
| Method | Same DAA | Cross DAA | Degradation |
|---|---|---|---|
| Static Framework | 93% F1 | 87% F1 | −6% |
| Thompson Sampling | 84% F1 | 76% F1 | −8% |
| DRL Agent | 95% F1 | 91% F1 | −4% |
| Method | Rotating Pattern (Adv. Profit) | Stat Poisoning (Adv. Profit) |
|---|---|---|
| Static Framework | ||
| PID Controller | ||
| Mean/Std DRL | ||
| MAD-Scaled DRL |
| AI Model | Precision | Recall | F1-Score |
|---|---|---|---|
| Supervised Classifier | 99 ± 1% | 65 ± 4% | 0.78 ± 0.03 |
| GAN Anomaly Detector | 85 ± 3% | 88 ± 3% | 0.86 ± 0.02 |
| DRL Agent (Ours) | 94 ± 2% | 96 ± 2% | 0.95 ± 0.02 |