Confidence-Aware Reward Shaping for Crypto Trading: A Comparative Study of Lightweight Uncertainty Estimation Methods
Abstract
1. Introduction
- Formalization of five confidence estimation methods for RL trading agents, each targeting a distinct uncertainty dimension: value estimation, behavioral direction, distributional familiarity, position sizing, and environmental predictability.
- Controlled experimental comparison in which the confidence method is the sole variable across 18 experimental conditions (five methods plus a confidence-free baseline, evaluated on Bitcoin, Litecoin, and Ethereum), isolating the impact of confidence from all other design choices.
- Empirical demonstration that confidence-aware reward shaping improves trading performance: four of the five methods produce statistically significant improvements over the baseline, and state novelty delivers the largest gains, with mean ROI rising from 5.7% to 24.9%, Sharpe ratio (SR) rising from 0.34 to 1.57, and maximum drawdown falling from 28.0% to 15.0%, averaged across BTC, ETH, and LTC.
- A practical taxonomy that maps each confidence method to its uncertainty dimension, computational cost, and implementation requirements, enabling practitioners to select the appropriate method for their context.
2. Related Work
2.1. Reinforcement Learning in Financial Trading
2.2. Reward Function Design in RL Trading
2.3. Uncertainty Estimation in Deep RL
3. Methodology
3.1. Problem Formulation
3.1.1. State Space
3.1.2. Action Space
- Bitcoin (BTC): action expressed in Satoshis (1 BTC = $10^{8}$ Satoshis [50]), the standard smallest unit on most Bitcoin exchanges and the denomination commonly used in retail trading.
- Ethereum (ETH): action expressed in Kwei ($10^{15}$ Kwei = 1 ETH [51]), a fractional denomination that reflects the intermediate price level of ETH and avoids both excessively large and excessively small action values.
- Litecoin (LTC): action expressed in whole LTC units, appropriate given the comparatively low per-unit price of LTC during the study period.
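
The denominations above can be illustrated with a small conversion helper. This is a minimal sketch, not the authors' environment code: the names `UNITS_PER_COIN`, `action_to_coins`, and `action_notional_usd` are hypothetical, and only the conversion factors (Satoshis per BTC, Kwei per ETH, whole LTC) come from the text.

```python
from fractions import Fraction

# Units of the action denomination per whole coin.
# 1 BTC = 10^8 Satoshis; 1 ETH = 10^18 wei and 1 Kwei = 10^3 wei, so 1 ETH = 10^15 Kwei.
UNITS_PER_COIN = {
    "BTC": 10**8,   # Satoshis per BTC
    "ETH": 10**15,  # Kwei per ETH
    "LTC": 1,       # actions are whole LTC units
}


def action_to_coins(symbol: str, action_units: int) -> Fraction:
    """Convert an action expressed in the asset's denomination to whole coins (exact)."""
    return Fraction(action_units, UNITS_PER_COIN[symbol])


def action_notional_usd(symbol: str, action_units: int, price_usd: float) -> float:
    """Approximate USD value of an action at a given per-coin price (hypothetical helper)."""
    return float(action_to_coins(symbol, action_units)) * price_usd
```

For example, `action_to_coins("BTC", 50_000_000)` evaluates to `Fraction(1, 2)`, that is, half a Bitcoin. Exact fractions are used here only to avoid rounding when dividing by large unit counts.
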
3.1.3. Transaction Fees
3.2. TD3 Algorithm Overview
3.3. Reward Function Design
3.3.1. Baseline Reward (No Confidence)
3.3.2. Confidence-Enhanced Reward
3.4. Confidence Estimation Methods
3.4.1. Method 1: Critic Agreement (CA)
3.4.2. Method 2: Temporal Direction Consistency (TDC)
3.4.3. Method 3: State Novelty (SN)
3.4.4. Method 4: Action Magnitude Stability (AMS)
3.4.5. Method 5: State-Transition Surprise (STS)
3.4.6. Design Rationale
3.5. Theoretical Analysis of Confidence-Shaped Reward
3.5.1. Policy Ordering Under Multiplicative Shaping
3.5.2. Contraction of the Shaped Bellman Operator
3.5.3. Signal Saturation: A Sufficient Condition for Ineffectiveness
4. Experimental Setup
4.1. Dataset
- Training set (4 years, from 1 June 2019 to 1 June 2023): Used to train the TD3 agent and, where applicable, the auxiliary network for Method 5 (STS).
- Validation set (6 months, from 1 June 2023 to 1 December 2023): Used exclusively for tuning the hyperparameters of each confidence estimation method (see Section 4.3). The TD3 architecture and its training hyperparameters are not tuned on this set.
- Test set (6 months, from 1 December 2023 to 1 June 2024): Used for final performance evaluation. No model parameters or confidence hyperparameters are adjusted during this period.
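
The chronological split above can be sketched as a date-based assignment. This is an illustration under stated assumptions: the boundaries are treated as half-open intervals (start inclusive, end exclusive), which the text does not specify, and the names `assign_split` and `hourly_grid` are hypothetical.

```python
from datetime import datetime, timedelta

# Split boundaries taken from the dataset description.
TRAIN_START = datetime(2019, 6, 1)
VAL_START = datetime(2023, 6, 1)
TEST_START = datetime(2023, 12, 1)
TEST_END = datetime(2024, 6, 1)


def assign_split(ts: datetime):
    """Return the split a timestamp belongs to, or None if it is out of range.

    Assumption: each interval includes its start and excludes its end.
    """
    if TRAIN_START <= ts < VAL_START:
        return "train"
    if VAL_START <= ts < TEST_START:
        return "validation"
    if TEST_START <= ts < TEST_END:
        return "test"
    return None


def hourly_grid(start: datetime, end: datetime):
    """Yield a gap-free hourly grid over [start, end)."""
    ts = start
    while ts < end:
        yield ts
        ts += timedelta(hours=1)


# Number of hourly points in the validation window on a gap-free grid.
validation_hours = sum(1 for _ in hourly_grid(VAL_START, TEST_START))
```

Because the split is purely chronological, no record from the validation or test windows can leak into training, which is the property the protocol relies on.
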
4.2. Model Configuration and Controlled Variables
4.2.1. Fixed Model Configuration
4.2.2. Initial Capital and Trading Environment
4.2.3. Random Seed and Controlled Variables
4.3. Hyperparameter Selection
4.4. Evaluation Metrics
4.4.1. Return on Investment (ROI)
4.4.2. Sharpe Ratio (SR)
4.4.3. Maximum Drawdown (MDD)
4.4.4. Statistical Significance Testing
5. Results and Analysis
5.1. Overall Performance Comparison
5.2. Per-Cryptocurrency Analysis
5.2.1. Bitcoin
5.2.2. Ethereum
5.2.3. Litecoin
5.2.4. Cross-Asset Takeaway
5.3. Confidence Behavior Visualization
5.3.1. SN Tracks Price Novelty on All Three Assets
5.3.2. CA Fluctuates with Market Uncertainty, Without a Single Dominant Event
5.3.3. TDC Is Near-Binary
5.3.4. AMS Drifts over Long Horizons
5.3.5. STS Is Saturated Low
5.3.6. Summary of Behavioral Differences
5.4. Trading Behavior Analysis
5.4.1. Win Rate Drives the ROI Ranking
5.4.2. Position Size Stability Reflects AMS’s Direct Signal
5.4.3. Direction Flip Rate Confirms TDC’s Mechanism
5.4.4. SN Combines Desirable Behaviors Without Directly Targeting Any of Them
5.5. Computational Cost
5.6. Statistical Significance
6. Discussion
6.1. Generalizability Beyond the Study Setup
6.2. Limitations
6.3. Practical and Managerial Implications
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Fang, F.; Ventre, C.; Basios, M.; Kanthan, L.; Martinez-Rego, D.; Wu, F.; Li, L. Cryptocurrency trading: A comprehensive survey. In Blockchain, Crypto Assets, and Financial Innovation: A Decade of Insights and Advances; Springer Nature Singapore: Singapore, 2025; pp. 55–127.
- Fischer, T.G. Reinforcement Learning in Financial Markets—A Survey; FAU Discussion Papers in Economics; FAU: Boca Raton, FL, USA, 2018.
- Tay, X.H.; Lim, S.M. Deep reinforcement learning in cryptocurrency trading: A profitable approach. J. Telecommun. Digit. Econ. 2024, 12, 126–147.
- Schnaubelt, M. Deep reinforcement learning for the optimal placement of cryptocurrency limit orders. Eur. J. Oper. Res. 2022, 296, 993–1006.
- Yang, H.; Liu, X.Y.; Zhong, S.; Walid, A. Deep reinforcement learning for automated stock trading: An ensemble strategy. In Proceedings of the First ACM International Conference on AI in Finance 2020, Virtually, 15–16 October 2020; pp. 1–8.
- Ghadiri, H.; Hajizadeh, E. Designing a cryptocurrency trading system with deep reinforcement learning utilizing LSTM neural networks and XGBoost feature selection. Appl. Soft Comput. 2025, 175, 113029.
- Lucarelli, G.; Borrotti, M. A deep reinforcement learning approach for automated cryptocurrency trading. In Proceedings of the IFIP International Conference on Artificial Intelligence Applications and Innovations, Crete, Greece, 24–26 May 2019; Springer International Publishing: Cham, Switzerland, 2019; pp. 247–258.
- Moody, J.; Wu, L.; Liao, Y.; Saffell, M. Performance functions and reinforcement learning for trading systems and portfolios. J. Forecast. 1998, 17, 441–470.
- Rodinos, G.; Nousi, P.; Passalis, N.; Tefas, A. A sharpe ratio based reward scheme in deep reinforcement learning for financial trading. In Proceedings of the IFIP International Conference on Artificial Intelligence Applications and Innovations 2023, León, Spain, 14–17 June 2023; Springer Nature Switzerland: Cham, Switzerland, 2023; pp. 15–23.
- Huang, Y.; Zhou, C.; Zhang, L.; Lu, X. A self-rewarding mechanism in deep reinforcement learning for trading strategy optimization. Mathematics 2024, 12, 4020.
- Orra, A.; Choudhary, H.; Sharma, A.; Thakur, M. Enhancing deep reinforcement learning for stock trading: A reward shaping approach via expert feedback. Knowl. Inf. Syst. 2025, 67, 11075–11094.
- Srivastava, U.; Aryan, S.; Singh, S. A Risk-Aware Reinforcement Learning Reward for Financial Trading. arXiv 2025, arXiv:2506.04358.
- Clements, W.R.; Van Delft, B.; Robaglia, B.M.; Slaoui, R.B.; Toth, S. Estimating risk and uncertainty in deep reinforcement learning. arXiv 2019, arXiv:1905.09638.
- Charpentier, B.; Senanayake, R.; Kochenderfer, M.; Günnemann, S. Disentangling epistemic and aleatoric uncertainty in reinforcement learning. arXiv 2022, arXiv:2206.01558.
- Liu, Q.; Li, Y.; Chen, S.; Lin, K.; Shi, X.; Lou, Y. Distributional reinforcement learning with epistemic and aleatoric uncertainty estimation. Inf. Sci. 2023, 644, 119217.
- Otabek, S.; Choi, J. Optimizing Cryptocurrency Trades with Twin Delayed DDPG: Adaptive Multi-factor Reward Function with Diverse Data Sources. Expert Syst. Appl. 2026, 7, 131527.
- Khujamatov, E.H.; Ismanov, K.; Mallaev, O.U.; Sattarov, O. Optimizing Crypto-Trading Performance: A Comparative Analysis of Innovative Reward Functions in Reinforcement Learning Models. Mathematics 2026, 14, 794.
- Fujimoto, S.; van Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning 2018, Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596.
- Sun, Q.; Si, Y.W. Supervised actor-critic reinforcement learning with action feedback for algorithmic trading. Appl. Intell. 2023, 53, 16875–16892.
- Kong, M.; So, J. Empirical analysis of automated stock trading using deep reinforcement learning. Appl. Sci. 2023, 13, 633.
- Lu, C.I. Evaluation of deep reinforcement learning algorithms for portfolio optimisation. arXiv 2023, arXiv:2307.07694.
- Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.M.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D.P. Continuous Control with Deep Reinforcement Learning. US Patent 10,776,692, 15 September 2020.
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
- Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning 2018, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870.
- Zhang, J.; Cai, K.; Wen, J. A survey of deep learning applications in cryptocurrency. iScience 2024, 27, 108509.
- Kochliaridis, V.; Kouloumpris, E.; Vlahavas, I. Combining deep reinforcement learning with technical analysis and trend monitoring on cryptocurrency markets. Neural Comput. Appl. 2023, 35, 21445–21462.
- Kumlungmak, K.; Vateekul, P. Multi-agent deep reinforcement learning with progressive negative reward for cryptocurrency trading. IEEE Access 2023, 11, 66440–66455.
- Otabek, S.; Choi, J. Multi-level deep Q-networks for Bitcoin trading strategies. Sci. Rep. 2024, 14, 771.
- Tran, M.; Pham-Hi, D.; Bui, M. Optimizing automated trading systems with deep reinforcement learning. Algorithms 2023, 16, 23.
- Eschmann, J. Reward function design in reinforcement learning. In Reinforcement Learning Algorithms: Analysis and Applications; Springer: Cham, Switzerland, 2021; pp. 25–33.
- Ibrahim, S.; Mostafa, M.; Jnadi, A.; Salloum, H.; Osinenko, P. Comprehensive overview of reward engineering and shaping in advancing reinforcement learning applications. IEEE Access 2024, 12, 175473–175500.
- Allen, F.; Karjalainen, R. Using genetic algorithms to find technical trading rules. J. Financ. Econ. 1999, 51, 245–271.
- Liu, X.Y.; Yang, H.; Gao, J.; Wang, C.D. FinRL: Deep reinforcement learning framework to automate trading in quantitative finance. In Proceedings of the Second ACM International Conference on AI in Finance 2021, Virtual, 3–5 November 2021; pp. 1–9.
- Wu, M.E.; Syu, J.H.; Lin, J.C.; Ho, J.M. Portfolio management system in equity market neutral using reinforcement learning. Appl. Intell. 2021, 51, 8119–8131.
- Choudhary, H.; Orra, A.; Sahoo, K.; Thakur, M. Risk-adjusted deep reinforcement learning for portfolio optimization: A multi-reward approach. Int. J. Comput. Intell. Syst. 2025, 18, 126.
- Su, R.; Chi, C.; Tu, S.; Xu, L. A Deep Reinforcement Learning Approach for Portfolio Management in Non-Short-Selling Market. IET Signal Process. 2024, 2024, 5399392.
- Sadighian, J. Extending deep reinforcement learning frameworks in cryptocurrency market making. arXiv 2020, arXiv:2004.06985.
- Zhou, C.; Huang, Y.; Cui, K.; Lu, X. R-DDQN: Optimizing algorithmic trading strategies using a reward network in a double DQN. Mathematics 2024, 12, 1621.
- Cornalba, F.; Disselkamp, C.; Scassola, D.; Helf, C. Multi-objective reward generalization: Improving performance of Deep Reinforcement Learning for applications in single-asset trading. Neural Comput. Appl. 2024, 36, 619–637.
- Valdenegro-Toro, M.; Mori, D.S. A deeper look into aleatoric and epistemic uncertainty disentanglement. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1508–1516.
- Wang, T.; Wang, Y.; Zhou, J.; Peng, B.; Song, X.; Zhang, C.; Sun, X.; Niu, Q.; Liu, J.; Chen, S.; et al. From aleatoric to epistemic: Exploring uncertainty quantification techniques in artificial intelligence. arXiv 2025, arXiv:2501.03282.
- An, G.; Moon, S.; Kim, J.H.; Song, H.O. Uncertainty-based offline reinforcement learning with diversified q-ensemble. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; Volume 34, pp. 7436–7447.
- Wu, Y.; Zhai, S.; Srivastava, N.; Susskind, J.; Zhang, J.; Salakhutdinov, R.; Goh, H. Uncertainty weighted actor-critic for offline reinforcement learning. arXiv 2021, arXiv:2105.08140.
- Bai, C.; Wang, L.; Yang, Z.; Deng, Z.; Garg, A.; Liu, P.; Wang, Z. Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning. arXiv 2022, arXiv:2202.11566.
- Hoel, C.J.; Wolff, K.; Laine, L. Ensemble quantile networks: Uncertainty-aware reinforcement learning with applications in autonomous driving. IEEE Trans. Intell. Transp. Syst. 2023, 24, 6030–6041.
- Pathak, D.; Agrawal, P.; Efros, A.A.; Darrell, T. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the International Conference on Machine Learning 2017, Sydney, Australia, 6–11 August 2017; pp. 2778–2787.
- Burda, Y.; Edwards, H.; Storkey, A.; Klimov, O. Exploration by random network distillation. arXiv 2018, arXiv:1810.12894.
- Zhou, R.; Zhu, W.; Han, S.; Kang, M.; Lü, S. VCSAP: Online reinforcement learning exploration method based on visitation count of state-action pairs. Neural Netw. 2025, 184, 107052.
- Li, J.; Shi, X.; Li, J.; Zhang, X.; Wang, J. Random curiosity-driven exploration in deep reinforcement learning. Neurocomputing 2020, 418, 139–147.
- Bitcoin Wiki. Satoshi (Unit). 2026. Available online: https://en.bitcoin.it/wiki/Satoshi_(unit) (accessed on 1 April 2026).
- Cryptopedia. Satoshi Value, Gwei to Ether to Wei Converter. 2026. Available online: https://www.gemini.com/cryptopedia/satoshi-value-gwei-to-ether-to-wei-converter-eth-gwei (accessed on 1 April 2026).
- Coinbase. Pricing and Fees Disclosures. 2026. Available online: https://help.coinbase.com/en/coinbase/trading-and-funding/pricing-and-fees/fees (accessed on 1 April 2026).
- Makarov, I.; Schoar, A. Trading and arbitrage in cryptocurrency markets. J. Financ. Econ. 2020, 135, 293–319.
- Lin, L.J. Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching. Mach. Learn. 1992, 8, 293–321.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Bentley, J.L. Multidimensional binary search trees used for associative searching. Commun. ACM 1975, 18, 509–517.
- Everitt, B.S.; Skrondal, A. The Cambridge Dictionary of Statistics; Cambridge University Press: Cambridge, UK, 2002.
- Ng, A.Y.; Harada, D.; Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999), Bled, Slovenia, 27–30 June 1999; Volume 99, pp. 278–287.
- Crypto Data Download. Available online: https://www.cryptodatadownload.com/ (accessed on 5 April 2026).





| Study | Domain | Algorithm | Key Contribution | Limitation/Gap |
|---|---|---|---|---|
| Sun et al. [19] | Stock | TD3, DDPG | Action feedback mechanism corrects replay buffer with dealt positions | Reward is standard PnL; no uncertainty or risk-awareness in the reward |
| Kong et al. [20] | Stock | 7-algo ensemble | Broadest ensemble (TD3, SAC, A2C, TRPO, etc.) across three markets | Ensemble diversity addresses architecture, not reward design |
| Kochliaridis et al. [26] | Crypto | DRL + rules | Rule-based safety mechanism filters uncertain actions during exploitation | Uncertainty handled post-hoc via action filtering, not within the reward |
| Kumlungmak et al. [27] | Crypto | MAPPO | Progressive loss penalty prevents consecutive drawdowns in bearish markets | Risk penalty is loss-based; agent has no self-assessed confidence measure |
| Moody et al. [8] | Stock | RRL | Differential SR as a direct optimization objective | Risk-adjusted objective, but static; no adaptation to agent certainty |
| Srivastava et al. [12] | Stock | RL (general) | Modular 4-term composite reward (return, risk, Treynor) with tunable weights | Reward components are market metrics; none reflects the agent’s internal state |
| Choudhary et al. [35] | Stock | 3 DRL agents | Multi-reward fusion via CNN combining log return, Sharpe, and MDD agents | Fusion occurs at the action level, not the reward level; no confidence scaling |
| Huang et al. [10] | Stock | DDQN | Self-rewarding network learns to predict rewards from expert labels | Reward is learned from external supervision; agent does not estimate its own certainty |
| Zhou et al. [38] | Stock | DDQN + RLHF | Reward network trained on expert demonstrations via RLHF | Expert-dependent reward generation; no intrinsic uncertainty signal |
| Sattarov & Choi [16] | Crypto | TD3 | AMRF with 6 factors including critic-agreement confidence | Confidence was one of six coupled factors; individual effect not isolated |
| An et al. [42] | Control | SAC (ensemble) | Q-ensemble disagreement penalizes OOD actions in offline RL | Uncertainty penalizes Q-values; not applied to scale task rewards |
| Wu et al. [43] | Control | SAC + dropout | Dropout-based uncertainty down-weights OOD training samples | Uncertainty modulates gradient contribution, not the reward signal |
| Hoel et al. [45] | Driving | DQN (ensemble) | Separates aleatoric (quantile) and epistemic (ensemble) uncertainty | Uncertainty flags unsafe decisions; not fed back into reward computation |
| Pathak et al. [46] | Games | A3C | Prediction error as intrinsic curiosity bonus for exploration | Bonus is additive and encourages novelty-seeking, not reward-scaling |
| This paper | Crypto | TD3 | Five confidence methods that multiplicatively scale PnL reward by agent certainty | Single algorithm (TD3); single data type (OHLCV); single asset class (crypto) |

| Component | Specification |
|---|---|
| Input layer | 5 neurons (OHLCV) |
| Hidden layers | 128, 64, 32 (FC, ReLU) |
| Output layer | 1 neuron |
| Networks | 1 Actor + 2 Critics |
| Loss function | MSE |
| Optimizer | Adam |
| Learning rate | 0.001 |
| Discount factor ($\gamma$) | 0.99 |
| Exploration noise | , decay 0.995/episode |
| Target network update | Soft, , every 500 steps |
| Policy update delay | Every 2 critic updates |
| Batch size | 128 |
| Replay buffer | Experience replay |
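
The configuration above implies a few quantities that are easy to check by hand. The sketch below is illustrative only: it counts the parameters of a fully connected network with the listed layer widths, and shows the shape of the noise decay, delayed policy update, and soft target update rules. The initial noise scale and the soft-update coefficient are not recoverable from the table, so `initial_sigma` and `tau` are hypothetical arguments, and all function names are assumptions.

```python
# Layer widths from the table: 5 inputs (OHLCV), hidden 128/64/32, 1 output.
# Assumption: this describes the actor; a TD3 critic usually also receives the action.
LAYER_SIZES = [5, 128, 64, 32, 1]


def mlp_parameter_count(sizes):
    """Weights plus biases of a fully connected network with the given widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


def noise_scale(initial_sigma, episode, decay=0.995):
    """Exploration noise after `episode` episodes of multiplicative decay."""
    return initial_sigma * decay ** episode


def is_actor_update_step(critic_update_index, policy_delay=2):
    """TD3 delays the actor: it is updated once every `policy_delay` critic updates."""
    return critic_update_index % policy_delay == 0


def soft_update(target_weights, online_weights, tau):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target_weights, online_weights)]


actor_parameters = mlp_parameter_count(LAYER_SIZES)  # 11137 for the widths above
```

With a decay of 0.995 per episode, the noise scale falls to roughly 61% of its initial value after 100 episodes, whatever the starting value is.
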

| Method | Abbr. | Uncertainty Dimension | Architecture Change | Computational Overhead | Hyperparameters |
|---|---|---|---|---|---|
| Critic Agreement | CA | Value estimation | None (uses TD3 critics) | Negligible | |
| Temporal Direction Consistency | TDC | Behavioral direction | None | Negligible | W |
| State Novelty | SN | Distributional familiarity | None | Moderate (NN search) | , k |
| Action Magnitude Stability | AMS | Position sizing | None | Negligible | , W |
| State-Transition Surprise | STS | Environmental predictability | Auxiliary MLP | Low | |
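
To make the taxonomy concrete, the sketch below shows one way a confidence score could multiplicatively scale a PnL reward, using a nearest-neighbor novelty signal in the spirit of State Novelty. It is not the paper's formulation: the exponential mapping from distance to confidence, the `scale` parameter, the brute-force search, and every function name are assumptions made for illustration. Only the use of a k-nearest-neighbor search and the multiplicative scaling are taken from the tables.

```python
import math


def knn_mean_distance(state, memory, k):
    """Mean Euclidean distance from `state` to its k nearest stored states (brute force)."""
    distances = sorted(math.dist(state, stored) for stored in memory)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


def novelty_confidence(state, memory, k=10, scale=1.0):
    """Hypothetical mapping: familiar states (small distance) give confidence near 1.

    confidence = exp(-mean_knn_distance / scale), which lies in (0, 1].
    An empty memory is treated as fully unfamiliar (confidence 0).
    """
    if not memory:
        return 0.0
    return math.exp(-knn_mean_distance(state, memory, k) / scale)


def shaped_reward(pnl, confidence):
    """Multiplicative shaping: scale the PnL reward by a confidence clipped to [0, 1]."""
    clipped = min(1.0, max(0.0, confidence))
    return clipped * pnl
```

A state that already sits in memory receives a confidence of 1 and leaves the reward unchanged, while a distant state shrinks both gains and losses toward zero. For large memories, the brute-force search could be replaced by a k-d tree [56] to reduce query cost.
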

| Statistic | BTC | ETH | LTC |
|---|---|---|---|
| Total hourly records | 43,869 | 43,869 | 43,869 |
| Training (Jun ’19–Jun ’23) | 35,061 | 35,061 | 35,061 |
| Validation (Jun ’23–Dec ’23) | 4392 | 4392 | 4392 |
| Test (Dec ’23–Jun ’24) | 4416 | 4416 | 4416 |
| Price min (USD) | 4160 | 97.63 | 25 |
| Price max (USD) | 73,613 | 4848.52 | 410.67 |
| Price mean (USD) | 29,031 | 1650.08 | 96.1 |
| Volume mean (hourly, native) | 62.13 BTC | 643.96 ETH | 718.09 LTC |

| Component | Status |
|---|---|
| State representation (OHLCV) | Fixed |
| Action space and denomination | Fixed (per cryptocurrency) |
| Transaction fee (1.5%) | Fixed |
| Initial capital ($1,000,000) | Fixed |
| TD3 network architecture | Fixed |
| TD3 training hyperparameters | Fixed |
| STS auxiliary network architecture | Fixed |
| Random seed | Fixed (single seed) |
| Training, validation, and test splits | Fixed |
| Baseline reward structure | Fixed |
| Confidence estimation method | Varied (6 levels) |
| Confidence-specific hyperparameters | Tuned per method on validation set |

| Method | Parameter | BTC | ETH | LTC |
|---|---|---|---|---|
| CA | | 5 | 5 | 7 |
| TDC | W | 12 | 24 | 12 |
| SN | | 0.5 | 0.5 | 1.0 |
| SN | k | 10 | 20 | 10 |
| AMS | | 1.0 | 2.0 | 1.0 |
| AMS | W | 12 | 12 | 24 |
| STS | | 0.1 | 0.1 | 0.5 |
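
Per-method, per-asset values such as those above are the kind of outcome a validation-set search produces. The following is a generic grid-search sketch, not the authors' procedure: the selection criterion is not stated in the table, so `validation_score` is a hypothetical callable (for example, a validation Sharpe ratio), and the parameter names `k` and `scale` are placeholders.

```python
from itertools import product


def select_hyperparameters(grid, validation_score):
    """Return the parameter combination with the highest validation score.

    grid: dict mapping parameter name -> list of candidate values.
    validation_score: hypothetical callable taking a params dict, returning a float.
    """
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = validation_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy stand-in for "train on the training set, evaluate on the validation set".
def toy_score(params):
    return -((params["k"] - 20) ** 2) - (params["scale"] - 0.5) ** 2


chosen, score = select_hyperparameters({"k": [10, 20], "scale": [0.5, 1.0]}, toy_score)
```

Running the search once per asset would yield asset-specific settings, while the TD3 architecture and training hyperparameters stay fixed as in the controlled-variables table.
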

| Method | ROI BTC (%) | ROI ETH (%) | ROI LTC (%) | SR BTC | SR ETH | SR LTC | MDD BTC (%) | MDD ETH (%) | MDD LTC (%) | Mean ROI (%) | Mean SR | Mean MDD (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 7.2 | 5.8 | 4.1 | 0.41 | 0.33 | 0.28 | 24.6 | 28.1 | 31.4 | 5.7 | 0.34 | 28.0 |
| CA | 22.8 | 19.4 | 16.7 | 1.47 | 1.22 | 1.05 | 15.3 | 17.8 | 19.6 | 19.6 | 1.25 | 17.6 |
| TDC | 13.1 | 10.6 | 8.9 | 0.82 | 0.68 | 0.54 | 19.8 | 22.4 | 25.1 | 10.9 | 0.68 | 22.4 |
| SN | 28.4 | 24.9 | 21.3 | 1.83 | 1.56 | 1.31 | 12.8 | 14.9 | 17.2 | 24.9 | 1.57 | 15.0 |
| AMS | 17.5 | 14.8 | 12.2 | 1.09 | 0.91 | 0.77 | 17.1 | 19.6 | 22.3 | 14.8 | 0.92 | 19.7 |
| STS | 9.8 | 7.4 | 5.6 | 0.58 | 0.44 | 0.36 | 22.1 | 25.3 | 28.7 | 7.6 | 0.46 | 25.4 |
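
The mean columns in the table above are simple arithmetic averages over the three assets, which the short script below reproduces from the per-asset values. The data are copied from the table; the names `RESULTS`, `cross_asset_mean`, and `gain_over_baseline` are illustrative.

```python
# Per-asset results in the order (BTC, ETH, LTC), copied from the table.
RESULTS = {
    "Baseline": {"roi": (7.2, 5.8, 4.1), "sr": (0.41, 0.33, 0.28), "mdd": (24.6, 28.1, 31.4)},
    "CA": {"roi": (22.8, 19.4, 16.7), "sr": (1.47, 1.22, 1.05), "mdd": (15.3, 17.8, 19.6)},
    "TDC": {"roi": (13.1, 10.6, 8.9), "sr": (0.82, 0.68, 0.54), "mdd": (19.8, 22.4, 25.1)},
    "SN": {"roi": (28.4, 24.9, 21.3), "sr": (1.83, 1.56, 1.31), "mdd": (12.8, 14.9, 17.2)},
    "AMS": {"roi": (17.5, 14.8, 12.2), "sr": (1.09, 0.91, 0.77), "mdd": (17.1, 19.6, 22.3)},
    "STS": {"roi": (9.8, 7.4, 5.6), "sr": (0.58, 0.44, 0.36), "mdd": (22.1, 25.3, 28.7)},
}


def cross_asset_mean(method, metric):
    """Arithmetic mean of one metric over BTC, ETH, and LTC."""
    values = RESULTS[method][metric]
    return sum(values) / len(values)


def gain_over_baseline(method, metric):
    """Difference between a method's cross-asset mean and the baseline's."""
    return cross_asset_mean(method, metric) - cross_asset_mean("Baseline", metric)


for name in RESULTS:
    print(name,
          round(cross_asset_mean(name, "roi"), 1),
          round(cross_asset_mean(name, "sr"), 2),
          round(cross_asset_mean(name, "mdd"), 1))
```

For State Novelty, the mean ROI gain over the baseline is about 19.2 percentage points and the mean drawdown falls by about 13 points, matching the headline figures in the contributions list.
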

| Method | Trades (Count) | Mean pos. Size ($) | Pos. Size CV | Win Rate (%) | Direction Flip Rate (%) |
|---|---|---|---|---|---|
| Baseline | 428 | 18,200 | 0.74 | 48.1 | 42.6 |
| CA | 287 | 21,800 | 0.52 | 61.4 | 31.8 |
| TDC | 264 | 19,400 | 0.68 | 54.2 | 22.5 |
| SN | 241 | 23,600 | 0.47 | 66.7 | 28.3 |
| AMS | 302 | 20,900 | 0.38 | 57.9 | 33.4 |
| STS | 401 | 18,700 | 0.71 | 50.3 | 40.1 |
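
The behavioral statistics in the table above can be computed from a trade log along the following lines. The definitions are assumptions, since the exact formulas are not given here: the coefficient of variation is taken as population standard deviation over mean, a direction flip is a sign change between consecutive trades, and a win is a trade with strictly positive PnL.

```python
import statistics


def coefficient_of_variation(position_sizes):
    """Population standard deviation divided by the mean (assumed definition of CV)."""
    return statistics.pstdev(position_sizes) / statistics.fmean(position_sizes)


def direction_flip_rate(directions):
    """Percentage of consecutive trade pairs whose direction differs.

    directions: sequence of +1 (buy) and -1 (sell).
    """
    pairs = list(zip(directions, directions[1:]))
    if not pairs:
        return 0.0
    flips = sum(1 for previous, current in pairs if previous != current)
    return 100.0 * flips / len(pairs)


def win_rate(trade_pnls):
    """Percentage of trades with strictly positive PnL."""
    if not trade_pnls:
        return 0.0
    return 100.0 * sum(1 for pnl in trade_pnls if pnl > 0) / len(trade_pnls)
```

Under these definitions, the sequence `[1, 1, -1, -1, 1]` has two flips over four consecutive pairs, giving a flip rate of 50%.
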

| Method | Training Time (s/10k Steps) | Inference Overhead (% Over Baseline) |
|---|---|---|
| Baseline | 142.3 | 0 |
| CA | 142.6 | 0.2 |
| TDC | 142.5 | 0.1 |
| SN | 168.9 | 18.7 |
| AMS | 142.7 | 0.3 |
| STS | 159.4 | 12.0 |
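
The overhead percentages in the table above coincide with the relative increase in training time over the baseline, as the short check below shows. Whether the authors derived the column this way is an assumption; the timings are copied from the table and the function name is illustrative.

```python
# Training time in seconds per 10k steps, copied from the table.
TRAIN_SECONDS_PER_10K = {
    "Baseline": 142.3,
    "CA": 142.6,
    "TDC": 142.5,
    "SN": 168.9,
    "AMS": 142.7,
    "STS": 159.4,
}


def relative_overhead_pct(method, baseline="Baseline"):
    """Extra time of a method relative to the baseline, in percent."""
    base = TRAIN_SECONDS_PER_10K[baseline]
    return 100.0 * (TRAIN_SECONDS_PER_10K[method] - base) / base


for name in TRAIN_SECONDS_PER_10K:
    print(f"{name}: {relative_overhead_pct(name):.1f}%")
```

The three methods that reuse quantities already available during TD3 training (CA, TDC, AMS) stay below 0.5%, while the nearest-neighbor search of SN and the auxiliary network of STS account for the two larger figures.
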

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Akhmedov, F.; Cho, Y.I.; Otabek, S.; Sodikovich, Y.S.; Mallaev, O.U.; Khujamatov, E.H.; Craciunescu, R. Confidence-Aware Reward Shaping for Crypto Trading: A Comparative Study of Lightweight Uncertainty Estimation Methods. Mathematics 2026, 14, 2075. https://doi.org/10.3390/math14122075

