Confidence-Aware Reward Shaping for Crypto Trading: A Comparative Study of Lightweight Uncertainty Estimation Methods

Akhmedov, Farkhod; Cho, Young Im; Otabek, Sattarov; Sodikovich, Yusupov Sarvarbek; Mallaev, Oybek Usmankulovich; Khujamatov, Ergashevich Halimjon; Craciunescu, Razvan

doi:10.3390/math14122075


by Farkhod Akhmedov ¹, Young Im Cho ¹, Sattarov Otabek ², Yusupov Sarvarbek Sodikovich ³, Oybek Usmankulovich Mallaev ⁴, Ergashevich Halimjon Khujamatov ²,⁵ and Razvan Craciunescu ⁶,*

¹ Department of Computer Engineering, Gachon University, Seongnam 13120, Republic of Korea

² Department of Data Communication Networks and Systems, Tashkent University of Information Technologies, Tashkent 100084, Uzbekistan

³ Department of Mechanical Engineering, Kimyo International University in Tashkent, Tashkent 100121, Uzbekistan

⁴ Department of Digital Technologies, Alfraganus University, Tashkent 100190, Uzbekistan

⁵ Department of Electronics and Instrumentation, Fergana State Technical University, Fergana 150107, Uzbekistan

⁶ Telecommunications Department, Faculty of Electronics, Telecommunications and Information Technology, National University of Science and Technology POLITEHNICA, 060042 Bucharest, Romania

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(12), 2075; https://doi.org/10.3390/math14122075

Submission received: 4 May 2026 / Revised: 5 June 2026 / Accepted: 7 June 2026 / Published: 10 June 2026

(This article belongs to the Special Issue Portfolio Optimization and Risk Management in Financial Markets)


Abstract

Reinforcement learning agents for financial trading typically optimize reward functions that directly map profit and loss to learning signals, without accounting for the agent’s own decision certainty. This paper investigates whether modulating reward signals by a confidence estimate, without modifying network architecture, training procedures, or data pipelines, can meaningfully improve trading performance. We formalize five lightweight confidence estimation methods, each targeting a distinct uncertainty dimension: critic agreement (value estimation), temporal direction consistency (behavioral stability), state novelty (distributional familiarity), action magnitude stability (position sizing), and state-transition surprise (environmental predictability). Using a Twin Delayed Deep Deterministic Policy Gradient agent trained on hourly OHLCV data for Bitcoin, Litecoin, and Ethereum over five years encompassing diverse market regimes, we conduct a controlled experiment in which the confidence method is the sole variable across 18 experimental conditions. State novelty achieves the strongest improvement, raising mean test-period ROI from 5.7% to 24.9%, increasing Sharpe ratio (SR) from 0.34 to 1.57, and reducing maximum drawdown from 28.0% to 15.0% across the three cryptocurrencies. Four of the five methods reach statistical significance at $p < 0.05$ on all assets; only state-transition surprise, the sole method requiring an auxiliary network, fails to distinguish itself from the baseline due to signal saturation. The proposed confidence-aware reward-shaping framework is plug-and-play, algorithm-agnostic, and directly applicable to other RL-based trading systems.

Keywords: reinforcement learning; reward function; cryptocurrency trading; TD3 algorithm

MSC: 68T05

1. Introduction

Reinforcement learning (RL) has established itself as a competitive framework for automated financial trading, particularly in cryptocurrency markets where extreme volatility, 24/7 operation, and rapid regime shifts demand adaptive decision-making that rule-based strategies cannot provide [1,2]. Over the past five years, deep RL algorithms such as Deep Q-Networks (DQN), Proximal Policy Optimization (PPO), Soft Actor–Critic (SAC), and Twin Delayed Deep Deterministic Policy Gradient (TD3) have been applied to Bitcoin, Ethereum, and other digital assets with increasingly sophisticated architectures, including Long Short-Term Memory (LSTM)-augmented encoders, ensemble strategies, and multi-asset portfolio frameworks [3,4,5]. These advances have primarily focused on two fronts: improving network architectures to better represent market states, and enriching data pipelines with technical indicators, sentiment signals, or on-chain metrics [1,6]. A third component of the RL loop, the reward function, has received comparatively less systematic attention.

Most RL-based trading systems employ reward functions that directly equate the learning signal with realized profit or loss, sometimes augmented with risk penalties such as the SR or maximum drawdown (MDD) [7,8,9]. While these designs capture what the agent should optimize, they ignore a separate question: how confident the agent is when making each decision. A profitable trade executed during a well-understood market regime and a profitable trade made during an unprecedented price shock receive identical reward magnitudes. The agent has no mechanism to distinguish between informed decisions and fortunate guesses, and consequently has no incentive to modulate its behavior based on its own certainty. Recent work on reward-shaping has explored self-rewarding mechanisms [10], expert-guided reward augmentation [11], and risk-aware composite objectives [12], yet none of these approaches explicitly incorporates the agent’s self-assessed confidence as a reward-scaling factor.

This gap is notable because the broader deep RL literature has demonstrated that uncertainty estimation can substantially improve agent behavior. Ensemble disagreement, Monte Carlo dropout, and deep kernel learning have all been used to separate epistemic uncertainty (arising from limited data) from aleatoric uncertainty (arising from the inherent stochasticity in the environment), with demonstrated benefits for exploration efficiency, risk-sensitive control, and out-of-distribution detection [13,14,15]. In financial applications specifically, uncertainty awareness could inform whether an agent should trade aggressively or conservatively, yet this connection between uncertainty estimation and reward function design remains largely unexplored.

The present work builds on a foundation established in a prior study by Sattarov and Choi [16], which introduced the Adaptive Multi-Factor Reward Function (AMRF) for TD3-based cryptocurrency trading. The AMRF incorporated six factors, including a confidence level derived from twin critic agreement, alongside risk thresholds, active trading limits, consecutive error penalties, and time discounting. That paper demonstrated that the composite reward outperformed simpler alternatives, but because all six factors operated simultaneously, the individual contribution of the confidence mechanism could not be isolated. The present paper addresses this directly: we strip the reward function down to a minimal profit-and-loss baseline and introduce confidence as the sole modification, allowing its effect to be measured without confounding variables.

Specifically, this paper formalizes five lightweight methods for estimating an RL trading agent’s confidence at each timestep. Each method targets a distinct dimension of uncertainty. Critic agreement measures consensus between the twin Q-networks that TD3 already maintains. Temporal direction consistency tracks whether the agent’s recent actions show stable directional commitment or erratic flip-flopping. State novelty compares the current market observation against the agent’s accumulated experience in the replay buffer. Action magnitude stability evaluates whether the agent’s trade sizes are consistent or highly variable. State-transition surprise uses a small auxiliary network to detect when market dynamics deviate from learned patterns. Four of these five methods require only elementary mathematical operations on data the agent already computes during standard training; only state-transition surprise introduces an additional neural network, and we report this distinction transparently throughout the paper. Figure 1 illustrates how the proposed framework introduces a confidence estimate into the reward path without altering the underlying agent architecture.

Overall, the contributions of this paper are as follows:

Formalization of five confidence estimation methods for RL trading agents, each targeting a distinct uncertainty dimension: value estimation, behavioral direction, distributional familiarity, position sizing, and environmental predictability.
Controlled experimental comparison in which the confidence method is the sole variable across 18 experimental conditions (five methods plus a confidence-free baseline, evaluated on Bitcoin, Litecoin, and Ethereum), isolating the impact of confidence from all other design choices.
Empirical demonstration that confidence-aware reward-shaping improves trading performance, with four of the five methods producing statistically significant improvements over the baseline ($p < 0.05$), and state novelty delivering the largest gains: mean ROI increases from 5.7% to 24.9%, SR from 0.34 to 1.57, and maximum drawdown decreases from 28.0% to 15.0% across BTC, ETH, and LTC.
A practical taxonomy that maps each confidence method to its uncertainty dimension, computational cost, and implementation requirements, enabling practitioners to select the appropriate method for their context.

The remainder of this paper is organized as follows. Section 2 reviews related work on RL in financial trading, reward function design, and uncertainty estimation in deep RL. Section 3 presents the methodology, including the problem formulation, the baseline and confidence-enhanced reward functions, and the five confidence estimation methods. Section 4 describes the experimental setup, covering data, model configuration, and evaluation metrics. Section 5 reports and analyzes the experimental results. Section 6 discusses the findings, their implications, and the limitations of this study. Section 7 concludes the paper and outlines directions for future work.

2. Related Work

This section reviews three bodies of research that converge in the present study. Section 2.1 surveys the application of deep RL to financial trading, with an emphasis on actor–critic methods and cryptocurrency markets. Section 2.2 examines reward function design in RL-based trading systems. Section 2.3 reviews uncertainty estimation techniques in deep RL that provide the theoretical foundation for the confidence methods proposed in this paper.

2.1. Reinforcement Learning in Financial Trading

The application of RL to financial trading has progressed from early tabular Q-learning implementations to deep RL architectures capable of processing high-dimensional market data and producing continuous-valued trading actions [17]. Fischer [2] categorized the field into critic-only, actor-only, and actor–critic approaches, noting that actor–critic methods offer the most natural fit for trading problems with continuous action spaces. Fang et al. [1] provided a comprehensive survey covering 146 studies on cryptocurrency trading, documenting a rapid expansion of machine learning and deep RL methods applied to digital asset markets.

Among the actor–critic algorithms, TD3 [18] has received particular attention in trading applications due to its twin-critic mechanism that mitigates the Q-value overestimation problem inherent in DDPG. Sun et al. [19] proposed a supervised actor–critic framework with action feedback (SACRL-AF) and demonstrated that both DDPG-based and TD3-based variants achieved state-of-the-art profitability when the dealt position information was fed back into the replay buffer. Kong et al. [20] expanded the ensemble trading paradigm by combining seven actor–critic algorithms, including TD3, SAC, A2C, and TRPO, and validated the approach on KOSPI, JPX, and Dow Jones stocks. Lu [21] provided a benchmark comparison of DDPG, TD3, SAC, PPO, and A2C on simulated portfolio optimization tasks and found that off-policy algorithms such as TD3 struggled with noisy reward signals, while on-policy methods like PPO handled noise more effectively through generalized advantage estimation.

The actor–critic algorithms surveyed above are not interchangeable, and their differences inform the selection of TD3 for this study. DDPG [22] enabled deterministic continuous-control policies using deep networks but is prone to Q-value overestimation and hyperparameter sensitivity. PPO [23], an on-policy method, restricts each update to a trust region, which makes it stable and comparatively tolerant of noisy reward signals, though less sample-efficient than off-policy alternatives; this tolerance is what allowed PPO to outperform off-policy methods on the noisy portfolio rewards examined by Lu [21]. SAC [24] and TD3 [18] are both off-policy methods that maintain twin critics to limit overestimation bias, differing chiefly in that SAC learns a stochastic, entropy-regularized policy while TD3 learns a deterministic one with delayed policy updates and target smoothing. TD3 is adopted here for the two reasons detailed in Section 3.2: its deterministic continuous output maps directly onto quantity-level trade decisions, and its twin critics supply the value-disagreement signal on which the critic agreement method depends. This is not a claim that TD3 is the strongest trading algorithm in general; four of the five confidence methods rely only on components common to any actor–critic agent (an action history, a replay buffer, and a state-transition mapping) and therefore transfer to DDPG, PPO, or SAC unchanged, while critic agreement applies to any algorithm that maintains two or more value estimators, including SAC.

In cryptocurrency markets specifically, the combination of extreme volatility and continuous 24/7 operation has made RL-based strategies an active area of investigation. Zhang et al. [25] reviewed deep learning applications across cryptocurrency research tasks, including price prediction, portfolio construction, and trading, identifying deep RL as one of the most promising directions. Kochliaridis et al. [26] combined deep RL with rule-based safety mechanisms for cryptocurrency trading, using a novel reward function to maximize returns while deploying a separate conservative agent to identify high-risk periods. Kumlungmak et al. [27] proposed multi-agent PPO with a progressive negative reward mechanism for cryptocurrency trading and demonstrated that their method was the only one capable of generating positive returns during bearish market conditions. Sattarov et al. [28] introduced a multi-level DQN architecture integrating Bitcoin price data with Twitter sentiment analysis and achieved a 29.93% increase in investment value with a SR exceeding 2.7. Tran et al. [29] compared Double DQN with Bayesian optimization for cryptocurrency strategy optimization and found that DQN with the SR as the reward function provided the best balance of cumulative return and execution speed.

Despite this progress, the overwhelming majority of these studies focus on two levers for improving trading performance: more expressive network architectures (attention mechanisms, LSTM encoders, and ensemble strategies) and richer input representations (technical indicators, sentiment data, and on-chain metrics). The reward function, which is the primary signal shaping the agent’s policy, has received comparatively little systematic investigation as an independent research variable. The present work addresses this gap by holding architecture and data constant and varying only the reward function’s confidence component.

2.2. Reward Function Design in RL Trading

The studies reviewed above demonstrate that architecture and data representation have matured as research levers for RL-based trading, yet the reward function, the signal that ultimately shapes what the agent learns, has not received comparable systematic investigation. Eschmann [30] made a similar observation for RL more broadly, noting that the field has been preoccupied with learning algorithms while treating the reward signal as given and not subject to change. Ibrahim et al. [31] provided one of the first comprehensive reviews of reward engineering and shaping, introducing a taxonomy that distinguishes sparse, dense, shaped, and composite reward structures, and confirmed that most financial applications still default to simple profit-based feedback. The work that does exist on reward design in trading can be organized into three categories: classical single-metric rewards, composite multi-objective rewards, and adaptive or learned reward functions.

The simplest and most common reward formulation in RL-based trading is realized profit and loss (PnL), where the agent receives a positive or negative signal proportional to the outcome of each completed trade. Allen and Karjalainen [32] established this paradigm in their early work, using genetic algorithms to learn technical trading rules, and the direct PnL reward remains the baseline in many recent systems [33]. Moody et al. [8] proposed an alternative by training recurrent RL agents to maximize the differential SR rather than raw profit, demonstrating more consistent risk-adjusted performance. This line of work has been extended in several directions. Rodinos et al. [9] compared SR-based reward schemes in deep RL for financial trading. Wu et al. [34] developed a portfolio management system where the SR reward improved returns by 39.0% and reduced drawdown by 13.7% compared to a standard trading return reward.

More recent studies have explored composite and multi-objective reward functions. Srivastava et al. [12] proposed a modular composite reward combining annualized return, downside risk, differential return, and the Treynor ratio, with tunable weights enabling practitioners to encode diverse risk–return preferences. Choudhary et al. [35] trained three separate DRL agents using log returns, differential SR, and MDD as individual rewards, then fused their actions through a convolutional neural network to produce a unified risk-adjusted policy. Su et al. [36] designed a loss penalty term within the reward function to prevent sharp drawdowns and combined it with a weight control unit to manage portfolio positions across market regimes. Sadighian [37] compared seven different reward functions for cryptocurrency market making and found that reward functions incorporating realized gains produced fundamentally different trading behavior than those relying on unrealized PnL, with sparse reward variants leading to speculative strategies and frequent losses.

A separate thread has investigated adaptive and learned reward functions. Huang et al. [10] introduced a self-rewarding mechanism (SRDRL) in which a supervised network predicts rewards from expert-labeled data, and the agent selects the higher value between the expert-labeled and predicted rewards at each step. Their Self-Rewarding Double DQN achieved a cumulative return of 1124% on the IXIC dataset. Zhou et al. [38] applied reinforcement learning from human feedback (RLHF) to train a reward function network from expert demonstrations, achieving a maximum cumulative return of 1502% across six datasets. Cornalba et al. [39] investigated multi-objective reward generalization, where the reward-weighting mechanism is embedded in the learning process rather than specified a priori, and showed improved stability when the reward signal is sparse. Orra et al. [11] proposed a reward-shaping approach via expert feedback that provides denser guidance than standard PnL rewards.

Across this body of work, a consistent pattern emerges: reward function research in trading has focused on what the agent should optimize (profit, SR, MDD, composite objectives) and how the reward signal is generated (static, adaptive, expert-guided). What remains absent is consideration of how confident the agent is when making each decision. No prior work has systematically incorporated the agent’s self-assessed certainty as a multiplicative scaling factor on the reward. The present study addresses this gap directly: rather than redesigning the objective itself, we modulate the existing PnL reward by a confidence estimate that reflects the agent’s uncertainty at each timestep.

2.3. Uncertainty Estimation in Deep RL

The absence of confidence-aware reward design in trading is not due to a lack of uncertainty estimation tools in the broader RL literature. On the contrary, quantifying an agent’s uncertainty has been an active research area with well-established methods that could, in principle, serve as the confidence signal missing from current reward functions.

The most common framework distinguishes two types of uncertainty. Epistemic uncertainty arises from limited training data and can be reduced as the agent accumulates experience, while aleatoric uncertainty stems from irreducible stochasticity in the environment [14]. Separating the two has practical implications: epistemic uncertainty signals where the agent should explore, whereas aleatoric uncertainty signals where caution is inherently warranted regardless of experience. Valdenegro-Toro and Gilmore [40] conducted an extensive comparison of four uncertainty quantification methods (Monte Carlo dropout, DropConnect, ensembles, and Flipout) and found that ensembles provided the best overall disentangling quality, though interactions between aleatoric and epistemic estimation violated common independence assumptions. Wang et al. [41] provided a comprehensive survey of uncertainty quantification techniques across AI, covering probabilistic methods, ensemble learning, sampling-based approaches, and generative models.

Ensemble-based uncertainty estimation has proven particularly effective in RL. An et al. [42] demonstrated that simply increasing the number of Q-networks in an ensemble, combined with clipped Q-learning, can penalize out-of-distribution actions and substantially outperform existing offline RL methods. Wu et al. [43] proposed Uncertainty Weighted Actor–Critic (UWAC), which uses dropout-based uncertainty to detect out-of-distribution state–action pairs and down-weight their contribution during training, yielding significant stability improvements. Bai et al. [44] introduced pessimistic bootstrapping, where the disagreement among bootstrapped Q-functions serves as an uncertainty quantifier that penalizes the value function, and provided theoretical guarantees for this approach in linear MDPs. Hoel et al. [45] combined distributional RL with ensembles in their Ensemble Quantile Networks method for autonomous driving, using the quantile distribution for aleatoric uncertainty and ensemble variance for epistemic uncertainty, and showed that the agent could identify situations outside its training distribution and avoid unfounded decisions.

A parallel line of work uses prediction error as an intrinsic motivation signal for exploration. Pathak et al. [46] formulated curiosity as the error in an agent’s ability to predict the consequences of its own actions, using a self-supervised inverse dynamics model. This curiosity-driven approach and its descendants, including Random Network Distillation [47] and count-based novelty methods [48], have become standard tools for encouraging agents to visit unfamiliar states. Li et al. [49] proposed using a fixed random target network to stabilize the curiosity signal and reduce noise from environment stochasticity. The core insight from this literature is that prediction error captures a meaningful notion of surprise. When the agent encounters states or transitions it cannot predict well, something unusual is happening.

These uncertainty estimation techniques have been applied primarily to two purposes in RL: guiding exploration (visit uncertain states more often) and enforcing conservatism (avoid actions with high epistemic uncertainty). What they have not been used for, in any systematic way, is modulating the reward signal itself. The distinction matters. Exploration bonuses add a supplementary term to the reward, encouraging the agent to seek novelty. Conservative Q-learning penalizes Q-values for out-of-distribution actions. Neither of these approaches adjusts the magnitude of the primary task reward based on the agent’s confidence at the moment of decision. The present work occupies this unexplored intersection: we draw on the technical machinery of ensemble disagreement, behavioral consistency, distributional novelty, action stability, and transition prediction, and apply it not as an exploration bonus or pessimism penalty but as a multiplicative confidence scaler on the trading reward. This reframes uncertainty from a signal that guides where the agent explores to a signal that shapes how much the agent learns from each outcome. Table 1 summarizes the positioning of the present work relative to these studies, organized by the three dimensions reviewed in this section.

3. Methodology

The experimental design of this study rests on a single principle: isolating the confidence estimation method as the sole variable. To achieve this, the trading problem is formulated as a Markov Decision Process (MDP) with a deliberately minimal state representation (Section 3.1), the agent architecture is fixed to a standard TD3 configuration validated in prior work (Section 3.2), and the reward function is reduced to its simplest meaningful form before confidence is introduced (Section 3.3). With all other components held constant, Section 3.4 presents the five confidence estimation methods that constitute the core contribution of this paper. Each method targets a different dimension of uncertainty, uses a different data source already available within the agent or its environment, and produces a scalar $\mathrm{Conf}_{t} \in [0, 1]$ that multiplicatively scales the baseline reward.
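The multiplicative scaling described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `shaped_reward` is hypothetical, and the clamping step is an assumption added here to guard against an estimator that drifts outside $[0, 1]$.

```python
def shaped_reward(baseline_reward: float, confidence: float) -> float:
    """Scale a baseline reward by a confidence estimate in [0, 1].

    Hypothetical helper for illustration only.
    """
    # Assumption of this sketch: clamp the estimate so the scaling
    # factor never leaves the [0, 1] interval stated in the text.
    conf = min(1.0, max(0.0, confidence))
    # A confidence of 1 leaves the reward unchanged; 0 suppresses it.
    return conf * baseline_reward


if __name__ == "__main__":
    print(shaped_reward(10.0, 0.5))   # half-confident profitable trade
    print(shaped_reward(-4.0, 1.0))   # fully confident losing trade
```

Because the factor is multiplicative, both gains and losses shrink when confidence is low, so a low-confidence step contributes less to learning in either direction.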

3.1. Problem Formulation

The cryptocurrency trading task is formulated as an MDP defined by the tuple $(S, A, P, R, \gamma)$, where $S$ is the state space, $A$ is the action space, $P : S \times A \times S \to [0, 1]$ is the state transition probability function, $R : S \times A \to \mathbb{R}$ is the reward function, and $\gamma \in [0, 1]$ is the discount factor. The agent interacts with the market environment at hourly intervals, observing the current state, selecting a trading action, receiving a reward, and transitioning to the next state. The objective is to learn a policy $\pi : S \to A$ that maximizes the expected cumulative discounted reward $\mathbb{E}\left[\sum_{t = 0}^{T} \gamma^{t} r_{t}\right]$.
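As a small illustration of the objective, the discounted sum inside the expectation can be computed for a single trajectory as follows. The function name `discounted_return` is hypothetical, and the sketch evaluates one realized reward sequence, not the expectation itself.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Return sum_{t=0}^{T} gamma^t * r_t for one reward sequence."""
    total = 0.0
    # Accumulate backwards so each step applies one extra factor of gamma
    # to everything that comes after it.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```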

3.1.1. State Space

The state at each timestep $t$ is a five-dimensional vector composed of raw OHLCV (open, high, low, close, volume) market data:

$$s_{t} = [O_{t}, H_{t}, L_{t}, C_{t}, V_{t}] \tag{1}$$

where $O_{t}$, $H_{t}$, $L_{t}$, and $C_{t}$ represent the open, high, low, and close prices for the hourly candle at time $t$, and $V_{t}$ is the corresponding trading volume. No technical indicators, derived features, sentiment signals, or on-chain metrics are included. This choice is deliberate: a minimal state representation ensures that any performance differences across experimental conditions are attributable to the confidence estimation method, not to feature engineering.
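A minimal sketch of how one hourly candle might be turned into the state vector of Equation (1) is given below. The dictionary keys and the name `make_state` are assumptions chosen for illustration; the paper does not specify a data format or any normalization, so none is applied here.

```python
from typing import Mapping, Tuple

# Field order follows Equation (1): open, high, low, close, volume.
OHLCV_FIELDS = ("open", "high", "low", "close", "volume")


def make_state(candle: Mapping[str, float]) -> Tuple[float, ...]:
    """Build the five-dimensional state from one candle (hypothetical format)."""
    # No indicators or derived features are added, matching the text.
    return tuple(float(candle[name]) for name in OHLCV_FIELDS)


if __name__ == "__main__":
    candle = {"open": 100.0, "high": 105.0, "low": 99.0,
              "close": 104.0, "volume": 1234.5}
    print(make_state(candle))
```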

3.1.2. Action Space

The action $a_{t} \in \mathbb{R}$ is a continuous scalar representing the trade quantity at timestep $t$. Positive values indicate a buy order, negative values indicate a sell order, and values at or near zero indicate a hold decision. Because the three cryptocurrencies in this study span several orders of magnitude in per-unit price, expressing all actions in the same unit (e.g., whole coins) would produce impractical scales: a single BTC costs tens of thousands of dollars, making fractional-coin precision essential, while a single LTC costs under one hundred dollars, making whole-unit trading natural. To align the action space with real-world trading conventions, the action is denominated in currency-specific sub-units:

Bitcoin (BTC): action expressed in Satoshis (1 BTC $= 10^{8}$ Satoshis [50]), the standard smallest unit on most Bitcoin exchanges and the denomination commonly used in retail trading.
Ethereum (ETH): action expressed in Kwei ( $10^{3}$ Kwei $= 1$ ETH [51]), a fractional denomination that reflects the intermediate price level of ETH and avoids both excessively large and excessively small action values.
Litecoin (LTC): action expressed in whole LTC units, appropriate given the comparatively low per-unit price of LTC during the study period.

This continuous action space allows the agent to determine not only the direction of each trade (buy, sell, or hold) but also the quantity, which is directly relevant to the confidence mechanisms introduced in Section 3.4, where action magnitude stability serves as one of the five uncertainty signals.
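The mapping from a scalar action to a trade direction and quantity can be sketched as follows. The names `decode_action`, `units_per_coin`, and `hold_band` are hypothetical. The text says only that values "at or near zero" mean hold, so the width of the hold band is an assumption and defaults to zero here; the sub-unit factor is passed in by the caller.

```python
from typing import Tuple

SATOSHIS_PER_BTC = 10**8  # 1 BTC = 10^8 Satoshis, as stated in the text


def decode_action(action: float, units_per_coin: float,
                  hold_band: float = 0.0) -> Tuple[str, float]:
    """Return (side, quantity in whole coins) for a scalar action in sub-units."""
    # Assumption: actions within +/- hold_band of zero are treated as hold.
    if abs(action) <= hold_band:
        return "hold", 0.0
    side = "buy" if action > 0 else "sell"
    # Convert the sub-unit quantity into whole coins.
    return side, abs(action) / units_per_coin


if __name__ == "__main__":
    print(decode_action(25_000_000, SATOSHIS_PER_BTC))  # buy a quarter of a BTC
    print(decode_action(-3.0, 1.0))                     # sell three whole LTC
```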

3.1.3. Transaction Fees

A symmetric transaction fee of 1.5% is applied to both the buy and sell sides of each trade. This rate represents a conservative aggregate estimate for retail cryptocurrency trading, combining exchange fees (0.30–0.50%) [52], bid-ask spread costs (0.20–0.30%), and slippage (0.10–0.30%) [53]. The fee for a given trade is computed as

$$c_{k} = 0.015 \times | P_{k} \times q_{k} | \tag{2}$$

where $P_{k}$ is the execution price and $q_{k}$ is the trade quantity for order $k$. Transaction fees are applied identically across all experimental conditions and are not a variable in this study.
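Equation (2) translates directly into code. The sketch below uses the hypothetical name `transaction_fee`; the absolute value makes the fee identical for buy (positive) and sell (negative) quantities.

```python
FEE_RATE = 0.015  # symmetric 1.5% fee from Equation (2)


def transaction_fee(price: float, quantity: float,
                    fee_rate: float = FEE_RATE) -> float:
    """Fee for one order: fee_rate * |price * quantity|."""
    return fee_rate * abs(price * quantity)


if __name__ == "__main__":
    # Buying 2 units at a price of 100 has a notional value of 200.
    print(transaction_fee(100.0, 2.0))
    # Selling the same quantity incurs the same fee.
    print(transaction_fee(100.0, -2.0))
```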

3.2. TD3 Algorithm Overview

With the trading environment defined in terms of its state space, action space, and cost structure, the next component to specify is the agent that operates within it. This study employs the TD3 algorithm [18], an off-policy actor–critic method designed for continuous action spaces. TD3 extends the DDPG algorithm [22] by introducing three mechanisms to address DDPG’s known instability: clipped double Q-learning to mitigate overestimation bias, delayed policy updates to reduce per-update error, and target policy smoothing to prevent the critic from exploiting narrow peaks in the Q-function [18]. TD3 was selected for two reasons: first, its continuous action output naturally supports the quantity-level trading decisions described in Section 3.1; second, its twin-critic architecture produces a pair of Q-value estimates that serve as the basis for one of the five proposed methods.

TD3 maintains three networks: an actor network $\mu_{\theta}(s)$ that maps states to actions, and two critic networks $Q_{\phi_{1}}(s, a)$ and $Q_{\phi_{2}}(s, a)$ that independently estimate the expected return for a given state–action pair. The use of two critics addresses the overestimation bias present in standard Q-learning by taking the minimum of the two estimates when computing the target value:

$$y = r + \gamma \min_{i = 1, 2} Q_{\phi_{i}'}\left(s', \mu_{\theta'}(s') + \epsilon\right), \quad \epsilon \sim \mathrm{clip}\left(\mathcal{N}(0, \sigma), -c, c\right) \tag{3}$$

where $\phi_{i}'$ and $\theta'$ denote the parameters of the target networks, $\gamma$ is the discount factor, and $\epsilon$ is clipped Gaussian noise added to the target action for smoothing. Each critic is updated by minimizing the mean squared error between its prediction and the target $y$. The actor is updated less frequently than the critics (every two critic updates) to reduce variance in the policy gradient, and the target networks are updated via soft interpolation:

$$\phi_{i}' \leftarrow \tau \phi_{i} + (1 - \tau) \phi_{i}', \quad \theta' \leftarrow \tau \theta + (1 - \tau) \theta' \tag{4}$$

with $\tau = 0.005$.
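The target computation in Equation (3) and the soft update in Equation (4) can be sketched on plain scalars and lists, without a deep learning framework. All function names are hypothetical. The values of $\sigma$ and $c$ are not given in this section, so they are left as caller-supplied parameters, and $\sigma$ is treated as a standard deviation, which is an assumption of this sketch.

```python
import random
from typing import List


def smoothed_target_action(mu_next: float, sigma: float, noise_clip: float,
                           rng=random) -> float:
    """Target action plus clipped Gaussian noise (the epsilon term in Eq. 3)."""
    noise = rng.gauss(0.0, sigma)
    noise = max(-noise_clip, min(noise_clip, noise))
    return mu_next + noise


def td3_target(reward: float, gamma: float,
               q1_next: float, q2_next: float) -> float:
    """y = r + gamma * min(Q1', Q2'), with both target critics
    already evaluated at the smoothed target action."""
    return reward + gamma * min(q1_next, q2_next)


def soft_update(target_params: List[float], online_params: List[float],
                tau: float = 0.005) -> List[float]:
    """Eq. 4: target <- tau * online + (1 - tau) * target, elementwise."""
    return [tau * p + (1.0 - tau) * tp
            for p, tp in zip(online_params, target_params)]


if __name__ == "__main__":
    print(td3_target(reward=1.0, gamma=0.99, q1_next=10.0, q2_next=8.0))
    print(soft_update([0.0, 0.0], [1.0, -1.0]))
```

Taking the minimum of the two critic estimates is what makes the target pessimistic: whenever the critics disagree, the smaller value is used.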

The agent explores using Gaussian noise added to the actor’s output during training, with an initial exploration rate of 1.0 that decays by a factor of 0.995 per episode. Experiences are stored in a standard experience replay buffer [54] and sampled in mini-batches of 128 for training. The actor and each critic network share the same fully connected architecture: an input layer of 5 neurons (matching the OHLCV state), three hidden layers of 128, 64, and 32 neurons with ReLU activations, and a single output neuron (trade quantity for the actor; Q-value for each critic). All networks are trained using the Adam optimizer [55] with a learning rate of 0.001, and the discount factor is set to $\gamma = 0.99$. Table 2 summarizes the full configuration.

This fixed architecture serves as the common substrate across all experiments. Critically for the present study, the standard TD3 training loop produces several byproducts beyond the policy itself: the twin critic outputs

Q_{ϕ_{1}}

and

Q_{ϕ_{2}}

, the action history

{a_{t - W}, \dots, a_{t}}

, and the contents of the replay buffer

D

. These quantities, which are ordinarily discarded or used only for gradient computation, are precisely the signals from which the confidence methods in Section 3.4 derive their estimates. Before introducing those methods, however, the next subsection defines the reward function that the confidence estimate will modulate.

3.3. Reward Function Design

The reward function is the mechanism through which the confidence estimate enters the agent’s learning process. To ensure that any observed performance difference is attributable to the confidence method alone, the reward is constructed in two layers: a minimal baseline that captures the fundamental trading objective, and a single multiplicative modification that introduces confidence.

3.3.1. Baseline Reward (No Confidence)

The baseline reward function assigns a nonzero signal only when the agent completes a sell action, at which point it receives the realized profit or loss for that trade, net of transaction fees:

r_{t} = \begin{cases} {PnL}_{k}, & \text{if } a_{t} = \text{sell} \\ 0, & \text{if } a_{t} = \text{buy or hold} \end{cases}

(5)

where the

P n L

for order k is defined as

{PnL}_{k} = P_{k}^{sell} - P_{k}^{buy} - c_{k}^{buy} - c_{k}^{sell}

(6)

Here,

P_{k}^{sell}

and

P_{k}^{buy}

are the selling and buying prices for order k, and

c_{k}^{buy}

and

c_{k}^{sell}

are the corresponding transaction fees computed using Equation (2). This formulation is deliberately minimal. It excludes risk thresholds, active trading limits, consecutive error penalties, and time-based discounting, all of which appeared in the AMRF introduced in prior work [16]. Those components were removed not because they lack value, but because their presence would confound the comparison: if the confidence-enhanced agent outperformed the baseline, it would be unclear whether the improvement stemmed from the confidence signal or from interactions between confidence and the other reward factors.

3.3.2. Confidence-Enhanced Reward

The confidence-enhanced reward modifies the baseline by a single multiplicative factor:

r_{t} = \begin{cases} {PnL}_{k} \times {Conf}_{t}, & \text{if } a_{t} = \text{sell} \\ 0, & \text{if } a_{t} = \text{buy or hold} \end{cases}

(7)

where

{Conf}_{t} \in [0, 1]

is the confidence estimate at timestep t. When

{Conf}_{t} = 1

, the confidence-enhanced reward reduces exactly to the baseline, making the baseline a special case of the general formulation. When

{Conf}_{t} < 1

, the reward magnitude is attenuated in proportion to the agent’s uncertainty, causing the agent to learn less aggressively from decisions made under low certainty.

The multiplicative structure has two properties that make it suitable for a controlled experiment. First, it preserves the sign of the reward: a profitable trade remains positive, and an unprofitable trade remains negative, regardless of the confidence level. The agent is never rewarded for a loss or penalized for a gain. Second, it introduces no additional reward terms or objectives. The agent still optimizes for profit; the confidence factor only controls how strongly each realized outcome contributes to policy updates. This keeps the comparison with the baseline clean: what the agent optimizes (net profit) is identical across all conditions; only how much it learns from each trade changes.
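The two reward layers can be written as a pair of small functions. The sketch below follows Equations (5) to (7) under simplifying assumptions: the action is passed as a string label (`"sell"`, `"buy"`, or `"hold"`) rather than as the continuous trade quantity the agent actually emits, and the fees are supplied as precomputed numbers instead of being derived from Equation (2). The function names are illustrative.

```python
def trade_pnl(sell_price, buy_price, fee_buy, fee_sell):
    """Equation (6): realized PnL of one closed order, net of both fees."""
    return sell_price - buy_price - fee_buy - fee_sell


def shaped_reward(action, pnl, conf=1.0):
    """Equations (5) and (7): nonzero only on a sell.

    conf=1.0 reproduces the baseline reward; conf<1.0 attenuates it.
    """
    if not 0.0 <= conf <= 1.0:
        raise ValueError("conf must lie in [0, 1]")
    return pnl * conf if action == "sell" else 0.0
```

Because `conf` is non-negative, the sign of `shaped_reward` always matches the sign of `pnl` (or is zero), which is the sign-preservation property described above.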

The remaining question is how

{Conf}_{t}

is computed. The following subsection formalizes five methods, each deriving a confidence estimate from a different source and targeting a different dimension of uncertainty.

3.4. Confidence Estimation Methods

Each of the five methods presented below computes a scalar

{Conf}_{t} \in [0, 1]

at every timestep, using data the agent already produces or can obtain with minimal overhead. Four of the methods (CA, SN, AMS, and STS) share a common exponential decay structure: confidence is maximal (equal to 1) when the relevant uncertainty signal is zero, and decays toward 0 as uncertainty increases. TDC instead decreases linearly with the fraction of recent action sign changes. The methods differ in what they measure. Table 3 at the end of this subsection provides a side-by-side comparison.

3.4.1. Method 1: Critic Agreement (CA)

TD3 maintains two critic networks

Q_{ϕ_{1}}

and

Q_{ϕ_{2}}

that independently estimate the expected return for a given state–action pair. When both critics produce similar estimates, the agent’s value assessment is internally consistent. When they diverge, the agent is effectively uncertain about how good its current action is. The CA method exploits this disagreement as a direct proxy for value estimation uncertainty:

{Conf}_{t}^{CA} = exp (- γ_{CA} \cdot | Q_{ϕ_{1}} (s_{t}, a_{t}) - Q_{ϕ_{2}} (s_{t}, a_{t}) |)

(8)

where

γ_{CA}

is a sensitivity parameter that controls how rapidly confidence decays as critic disagreement increases. When the critics perfectly agree,

| Q_{ϕ_{1}} - Q_{ϕ_{2}} | = 0

and

{Conf}_{t}^{CA} = 1

. As the absolute difference grows, confidence decays exponentially toward 0.

This method incurs negligible computational cost because the Q-values

Q_{ϕ_{1}} (s_{t}, a_{t})

and

Q_{ϕ_{2}} (s_{t}, a_{t})

are already computed during the standard TD3 critic update [18]; the confidence calculation adds only a subtraction, an absolute value, and an exponentiation. CA is the only method among the five that is structurally tied to TD3’s twin-critic design, though the same principle applies to any algorithm that maintains an ensemble of value estimators (e.g., SAC with multiple critics [42], or custom Q-ensembles). The hyperparameter

γ_{CA}

is tuned on the validation set, with a search range of

{3, 5, 7, 10}

.
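Equation (8) reduces to a one-line computation once the two Q-values are available. The following sketch assumes the Q-values are passed in as plain floats; the default `gamma_ca=5.0` is simply one value from the stated search range, not the tuned setting.

```python
import math


def critic_agreement_confidence(q1, q2, gamma_ca=5.0):
    """Equation (8): exp(-gamma_CA * |Q1(s, a) - Q2(s, a)|)."""
    return math.exp(-gamma_ca * abs(q1 - q2))
```

With `gamma_ca=5.0`, a disagreement of 0.2 between the critics already lowers confidence to roughly exp(-1), which illustrates how strongly the result depends on the scale of the Q-values.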

3.4.2. Method 2: Temporal Direction Consistency (TDC)

While CA captures uncertainty in the agent’s value estimates, it does not reflect whether the agent’s actions are behaviorally coherent over time. An agent that rapidly alternates between buying and selling in similar market conditions is exhibiting directional indecision, regardless of what its Q-values indicate. The TDC method quantifies this indecision by counting action sign changes over a recent window:

{Conf}_{t}^{TDC} = 1 - \frac{\sum_{i = t - W + 1}^{t} 1 [sign (a_{i}) \neq sign (a_{i - 1})]}{W}

(9)

where W is the window size (number of recent timesteps) and

1 [\cdot]

is the indicator function. If all actions in the window share the same sign (consistent buying or consistent selling), the numerator is 0 and

{Conf}_{t}^{TDC} = 1

. If the agent alternates direction at every timestep, the numerator approaches W and confidence approaches 0.

Near-zero actions (holds) are treated as continuing the previous direction to avoid inflating the sign-change count. During the first W timesteps of training, before a full window is available, confidence defaults to 1.0. The computational cost is negligible: only a circular buffer of action signs is maintained. The hyperparameter W is tuned on the validation set, with a search range of

{6, 12, 24, 48}

, corresponding to 6 h through 2 days of hourly data.
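A possible implementation of Equation (9), including the hold-handling and warm-up rules described above, is sketched here. Two details are assumptions of this sketch rather than statements from the paper: the threshold `hold_eps` that decides when an action counts as a hold, and the choice not to count the very first trade after an initial run of holds as a sign change.

```python
def effective_signs(actions, hold_eps=1e-6):
    """Map actions to +1/-1; near-zero actions inherit the previous sign."""
    signs, prev = [], 0  # 0 means "no direction established yet"
    for a in actions:
        if abs(a) > hold_eps:
            prev = 1 if a > 0 else -1
        signs.append(prev)
    return signs


def tdc_confidence(actions, window, hold_eps=1e-6):
    """Equation (9) over the last `window` transitions.

    Returns 1.0 until window + 1 actions (a full window of transitions)
    are available.
    """
    if len(actions) < window + 1:
        return 1.0
    signs = effective_signs(actions, hold_eps)[-(window + 1):]
    changes = sum(1 for prev, cur in zip(signs, signs[1:])
                  if prev != 0 and prev != cur)
    return 1.0 - changes / window
```

In a live training loop the full history would not be rescanned at every step; the circular buffer of action signs mentioned above serves the same purpose at constant cost.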

3.4.3. Method 3: State Novelty (SN)

CA and TDC derive confidence from the agent’s own outputs (Q-values and actions). SN shifts the perspective to the agent’s input: how familiar is the current market state relative to the agent’s accumulated experience? If the current observation closely resembles states stored in the replay buffer

D

, the agent is operating in well-explored territory. If it is an outlier, the agent faces an unfamiliar market regime and should trade cautiously:

{Conf}_{t}^{SN} = exp (- λ_{SN} \cdot d_{k} (s_{t}, D))

(10)

where

d_{k} (s_{t}, D)

is the average Euclidean distance from the current state

s_{t}

to its k nearest neighbors in the replay buffer, computed in a normalized state space. Normalization is applied per feature using training-set statistics (zero mean, unit variance) to prevent price-scale features (e.g., BTC close price in the tens of thousands) from dominating volume features.

The computational cost of SN is moderate, as it requires a nearest-neighbor search over the replay buffer at each timestep. This cost can be mitigated through approximate nearest-neighbor methods such as KD-trees [56], periodic recomputation (e.g., every N steps rather than every step), or subsampling the replay buffer to a fixed-size subset (e.g., 10,000 states). Two hyperparameters are tuned on the validation set:

λ_{SN} \in {0.1, 0.5, 1.0, 5.0}

and

k \in {5, 10, 20, 50}

.

This method is particularly relevant during black swan events, market regime changes, or any condition not represented in the training data. It draws on the same distributional familiarity intuition used in count-based exploration methods [48], but applies it in the opposite direction: rather than seeking novel states, the agent becomes cautious in them.
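Equation (10) can be illustrated with a brute-force nearest-neighbor search. The sketch below is deliberately simple: it scans the whole buffer instead of using a KD-tree or a subsample, it expects states that have already been standardized, and it returns a confidence of 1.0 for an empty buffer, which is an assumption of this sketch and not a rule stated in the paper.

```python
import math


def standardize(state, means, stds):
    """Per-feature z-score using training-set statistics."""
    return [(x - m) / s for x, m, s in zip(state, means, stds)]


def knn_mean_distance(state, buffer_states, k):
    """Average Euclidean distance from `state` to its k nearest buffer states."""
    dists = sorted(math.dist(state, other) for other in buffer_states)
    nearest = dists[:k]
    return sum(nearest) / len(nearest)


def state_novelty_confidence(state, buffer_states, lam=1.0, k=5):
    """Equation (10): exp(-lambda_SN * d_k(s_t, D))."""
    if not buffer_states:
        return 1.0  # assumed default before any experience is stored
    return math.exp(-lam * knn_mean_distance(state, buffer_states, k))
```

A state that lies inside a dense region of the buffer yields a small average distance and a confidence near 1, while an outlier yields a large distance and a confidence near 0.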

3.4.4. Method 4: Action Magnitude Stability (AMS)

TDC measures directional consistency (buy vs. sell), but an agent can maintain a stable direction while exhibiting large fluctuations in trade size. If the agent buys 800 Satoshis, then 50, then 600, then 20 in consecutive timesteps, it is uncertain about how much capital to commit even though it consistently buys. The AMS method captures this position-sizing uncertainty through the coefficient of variation (CV) of recent action magnitudes:

{Conf}_{t}^{AMS} = exp (- β \cdot CV (| a_{t - W} |, \dots, | a_{t} |))

(11)

where:

CV = \frac{σ (| a_{t - W} |, \dots, | a_{t} |)}{μ (| a_{t - W} |, \dots, | a_{t} |) + ϵ}

(12)

Here,

σ

and

μ

denote the standard deviation and mean of the absolute action values over the window, and

ϵ

is a small constant (e.g.,

10^{- 8}

) added to the denominator to prevent division by zero when the agent is mostly holding (near-zero mean magnitude). The CV is unitless and scale-invariant [57], making it robust across cryptocurrencies with different action denominations (Satoshis for BTC, Kwei for ETH, and whole units for LTC).

Consistent trade sizes produce a low CV and high confidence; erratic sizing produces a high CV and low confidence. As with TDC, the first W timesteps default to

{Conf}_{t} = 1.0

. The computational cost is negligible. Two hyperparameters are tuned on the validation set:

β \in {0.5, 1.0, 2.0, 5.0}

and

W \in {6, 12, 24, 48}

.
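Equations (11) and (12) translate directly into code. This sketch uses the population standard deviation, which is an assumption because the text does not say whether the population or sample form is intended, and it takes the window to contain the W + 1 most recent actions, matching the index range in Equation (11).

```python
import math
import statistics


def coefficient_of_variation(actions, eps=1e-8):
    """Equation (12): std / (mean + eps) of the absolute action values."""
    mags = [abs(a) for a in actions]
    return statistics.pstdev(mags) / (statistics.fmean(mags) + eps)


def ams_confidence(actions, window, beta=1.0, eps=1e-8):
    """Equation (11); returns 1.0 until window + 1 actions are available."""
    if len(actions) < window + 1:
        return 1.0
    recent = actions[-(window + 1):]
    return math.exp(-beta * coefficient_of_variation(recent, eps))
```

Applied to the example in the text, the erratic sequence 800, 50, 600, 20 produces a much lower confidence than a steady sequence such as 400, 410, 390, 405.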

3.4.5. Method 5: State-Transition Surprise (STS)

The first four methods assess uncertainty using quantities the TD3 agent already computes, requiring no architectural additions. Method 5 departs from this pattern by introducing a small auxiliary network

f_{ψ}

that learns to predict the next market state given the current state and action. When the market transitions as predicted, the environment is behaving in a familiar pattern. When the prediction error is large, the agent has encountered a surprising transition and should reduce its confidence:

{Conf}_{t}^{STS} = exp (- λ_{STS} \cdot {∥ {\hat{s}}_{t + 1} - s_{t + 1} ∥}^{2})

(13)

where

{\hat{s}}_{t + 1} = f_{ψ} (s_{t}, a_{t})

is the predicted next state and

s_{t + 1}

is the observed next state, both in the normalized state space (same normalization as SN). The auxiliary network

f_{ψ}

is a two-layer multi-layer perceptron (MLP) with 64 and 32 hidden units, taking the concatenation of

s_{t}

and

a_{t}

as input and outputting a five-dimensional predicted state. It is trained using MSE loss on transitions sampled from the same replay buffer used by the TD3 agent.

Two design properties distinguish STS from the curiosity-driven exploration framework of Pathak et al. [46] from which it draws inspiration. First, the logic is inverted: curiosity-driven agents seek states with high prediction error to encourage exploration, whereas STS uses high prediction error to reduce confidence and attenuate the reward, encouraging caution. Second, the auxiliary network is entirely independent of the TD3 architecture. It does not influence the gradient flow in the actor or critic networks and can be attached to or removed from the system without modifying any other component.

The computational cost is low but nonzero: an additional forward pass through the auxiliary network is required at each timestep, and the network must be trained alongside the main TD3 model. This makes STS the only method among the five that requires an architectural addition, a distinction we report transparently rather than minimize. The hyperparameter

λ_{STS}

is tuned on the validation set, with a search range of

{0.01, 0.1, 0.5, 1.0}

.
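Given a prediction from the auxiliary network, Equation (13) is a squared-error computation followed by an exponential. The sketch below covers only that final step; the predictor itself is not modeled, and `predicted_next` and `observed_next` are assumed to be already normalized state vectors of equal length.

```python
import math


def sts_confidence(predicted_next, observed_next, lam=0.1):
    """Equation (13): exp(-lambda_STS * ||s_hat_{t+1} - s_{t+1}||^2)."""
    sq_err = sum((p - o) ** 2 for p, o in zip(predicted_next, observed_next))
    return math.exp(-lam * sq_err)
```

For a five-dimensional state in which every feature is mispredicted by one standard deviation, the squared error is 5, so `lam=0.1` gives a confidence of exp(-0.5).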

3.4.6. Design Rationale

The five methods were selected to span complementary dimensions of uncertainty rather than to compete as alternatives to a single quantity. CA reflects whether the agent’s value function has converged for the current state–action pair. TDC and AMS capture behavioral stability in direction and magnitude, respectively. SN detects novel inputs. STS detects novel dynamics. Together, they cover the agent’s internal state (CA), its recent behavioral pattern (TDC, AMS), its input distribution (SN), and its environment model (STS). In the present study, exactly one method is active per experimental condition to permit controlled comparison. The question of whether combining multiple methods into a composite confidence score could yield further improvements is deferred to future work.

3.5. Theoretical Analysis of Confidence-Shaped Reward

The five methods are presented as practical estimators, but the multiplicative shaping rule

r_{t} = {PnL}_{k} \cdot {Conf}_{t}

raises three questions of mathematical substance: (i) what guarantees, if any, does the shaped reward inherit from the unshaped baseline regarding optimal policies; (ii) whether the shaped Bellman operator retains the contraction properties on which TD3 convergence rests; and (iii) what intrinsic limitations arise when the confidence estimator itself becomes uninformative. This subsection addresses each in turn.

3.5.1. Policy Ordering Under Multiplicative Shaping

Potential-based reward-shaping [58] guarantees that any shaping function of the form

F (s, s^{'}) = γ Φ (s^{'}) - Φ (s)

, with Φ an arbitrary potential function over states, preserves the optimal policy when added to the reward. The multiplicative form adopted here does not satisfy that decomposition and therefore does not inherit the same invariance. The natural question is what, if anything, is preserved.

Let

τ = (s_{0}, a_{0}, r_{0}, s_{1}, a_{1}, r_{1}, \dots, s_{T})

denote a trajectory and let

R (τ) = \sum_{t = 0}^{T} γ^{t} r_{t}

denote its discounted return under the unshaped reward

r_{t} = {PnL}_{k (t)}

, where

k (t)

indexes the trade closed at step t (with

r_{t} = 0

when no trade closes). Let

\tilde{R} (τ) = \sum_{t = 0}^{T} γ^{t} \cdot {Conf}_{t} \cdot r_{t}

denote the shaped return. Because

{Conf}_{t} \in [0, 1]

for every method, and the sign of

r_{t}

is preserved under multiplication by a non-negative scalar, the following holds.

For any trajectory

τ

and any confidence sequence

{{Conf}_{t}}_{t = 0}^{T} \subset [0, 1]

,

| \tilde{R} (τ) | \leq | R (τ) |,

(14)

and

sgn ({\tilde{r}}_{t}) = sgn (r_{t})

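for every t at which the confidence is strictly positive (a confidence of zero sets the shaped reward to zero). In particular, a trajectory whose closed trades are all profitable under the baseline has a non-negative shaped return, and a trajectory whose closed trades are all unprofitable has a non-positive shaped return. When gains and losses are mixed within one trajectory, the magnitude bound is guaranteed term by term, that is, for each per-step reward, rather than for the summed return.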

The proof is immediate from

0 \leq {Conf}_{t} \leq 1

. The proposition clarifies the scope of what confidence-shaping does not do: it does not turn losses into gains or vice versa, and it does not invert the sign of any per-step reward. Consequently, the shaped reward cannot mislead the agent about the direction of optimization.

What multiplicative shaping does change is the relative weighting of trajectories that contain a mixture of confident and unconfident decisions. Two trajectories

τ_{A}, τ_{B}

with

R (τ_{A}) > R (τ_{B})

under the baseline may satisfy

\tilde{R} (τ_{A}) < \tilde{R} (τ_{B})

under shaping whenever

τ_{A}

’s profitable trades occur predominantly under low confidence and

τ_{B}

’s profitable trades occur predominantly under high confidence. This is precisely the intended behavior: shaping re-weights the learning signal toward outcomes that the agent can attribute to informed decision-making rather than to fortunate guesses. This proposition establishes that this re-weighting operates within the cone of sign-preserving transformations, which is the strongest invariance the multiplicative form admits.
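A small numerical example makes the re-weighting concrete. The two trajectories below are hypothetical and chosen only for illustration: trajectory A closes a larger profit under low confidence, and trajectory B closes a smaller profit under high confidence.

```python
def discounted_return(rewards, confs=None, gamma=0.99):
    """Discounted return; passing `confs` gives the shaped return."""
    if confs is None:
        confs = [1.0] * len(rewards)
    return sum(gamma ** t * c * r
               for t, (r, c) in enumerate(zip(rewards, confs)))


# Hypothetical two-step trajectories (trade closes at t = 0, nothing at t = 1).
rewards_a, confs_a = [10.0, 0.0], [0.2, 1.0]  # larger profit, low confidence
rewards_b, confs_b = [6.0, 0.0], [0.9, 1.0]   # smaller profit, high confidence

baseline_a = discounted_return(rewards_a)
baseline_b = discounted_return(rewards_b)
shaped_a = discounted_return(rewards_a, confs_a)
shaped_b = discounted_return(rewards_b, confs_b)
```

Here the baseline ranks A above B (10 versus 6), while the shaped return ranks B above A (5.4 versus 2), and both shaped returns keep the positive sign of their baseline counterparts.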

3.5.2. Contraction of the Shaped Bellman Operator

TD3 inherits its convergence properties from the contraction of the Bellman optimality operator

T

on the space of bounded Q-functions equipped with the supremum norm. For the shaped reward

{\tilde{r}}_{t} = {Conf}_{t} \cdot r_{t}

, the corresponding operator is

(\tilde{T} Q) (s, a) = E_{s^{'} \sim P (\cdot | s, a)} [\tilde{r} (s, a) + γ max_{a^{'}} Q (s^{'}, a^{'})] .

(15)

If the unshaped reward satisfies

| r (s, a) | \leq R_{max}

for all

(s, a)

and

{Conf}_{t} \in [0, 1]

for every t, then

\tilde{T}

is a

γ

-contraction in the supremum norm:

{∥ \tilde{T} Q_{1} - \tilde{T} Q_{2} ∥}_{\infty} \leq γ {∥ Q_{1} - Q_{2} ∥}_{\infty} .

(16)

The corresponding fixed-point

Q_{\tilde{r}}^{*}

exists and is bounded by

∥ Q_{\tilde{r}}^{*} ∥_{\infty} \leq R_{max} / (1 - γ)

.

The shaped reward term cancels in the difference

\tilde{T} Q_{1} - \tilde{T} Q_{2}

, leaving

γ E_{s^{'}} [{max}_{a^{'}} Q_{1} (s^{'}, a^{'}) - {max}_{a^{'}} Q_{2} (s^{'}, a^{'})]

. Standard arguments bound this expression by

γ ∥ Q_{1} - Q_{2} ∥_{\infty}

. The boundedness of the fixed point follows from

| \tilde{r} | \leq R_{max}

and the geometric series

\sum_{t = 0}^{\infty} γ^{t} = 1 / (1 - γ)

.
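The contraction in Equation (16) can be checked numerically on a small tabular problem. The sketch below builds a hypothetical random MDP with four states and three actions and treats the confidence as a fixed table over state-action pairs, which matches the way the shaped reward is written in Equation (15) but is a simplification of the history-dependent estimators in Section 3.4.

```python
import random


def shaped_bellman(q, reward, conf, trans, gamma=0.99):
    """Equation (15) for a tabular MDP with shaped reward conf * reward."""
    v = [max(row) for row in q]  # max over a' for each next state
    return [[conf[s][a] * reward[s][a]
             + gamma * sum(p * v[s2] for s2, p in enumerate(trans[s][a]))
             for a in range(len(q[s]))]
            for s in range(len(q))]


def sup_norm_diff(q1, q2):
    """Supremum-norm distance between two Q-tables."""
    return max(abs(x - y)
               for row1, row2 in zip(q1, q2) for x, y in zip(row1, row2))


def random_table(n_states, n_actions, low, high, rng):
    return [[rng.uniform(low, high) for _ in range(n_actions)]
            for _ in range(n_states)]


def random_transitions(n_states, n_actions, rng):
    """Random transition probabilities; each trans[s][a] sums to 1."""
    trans = []
    for _ in range(n_states):
        row = []
        for _ in range(n_actions):
            w = [rng.random() for _ in range(n_states)]
            total = sum(w)
            row.append([x / total for x in w])
        trans.append(row)
    return trans


rng = random.Random(0)
n_s, n_a, gamma = 4, 3, 0.99
reward = random_table(n_s, n_a, -1.0, 1.0, rng)
conf = random_table(n_s, n_a, 0.0, 1.0, rng)
trans = random_transitions(n_s, n_a, rng)
q1 = random_table(n_s, n_a, -50.0, 50.0, rng)
q2 = random_table(n_s, n_a, -50.0, 50.0, rng)

lhs = sup_norm_diff(shaped_bellman(q1, reward, conf, trans, gamma),
                    shaped_bellman(q2, reward, conf, trans, gamma))
rhs = gamma * sup_norm_diff(q1, q2)
```

Because the shaped reward term is the same in both applications of the operator, it cancels in the difference, so `lhs` never exceeds `rhs` (up to floating-point tolerance) regardless of the confidence table.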

This proposition establishes that confidence-shaping, despite altering the magnitude of per-step rewards, does not disturb the convergence guarantees that underlie TD3 training. The shaped Bellman operator admits a unique fixed point, and tabular Q-learning under

\tilde{r}

converges to it under standard Robbins–Monro conditions. The empirical convergence behavior reported in Section 5 is therefore consistent with theory: no confidence method should be expected to destabilize training relative to the baseline, and indeed none of the five does so in the experiments.

3.5.3. Signal Saturation: A Sufficient Condition for Ineffectiveness

The above results establish that confidence-shaping is well-behaved for arbitrary

{Conf}_{t}

sequences. They do not, however, guarantee that any specific estimator produces an informative sequence. The empirical results in Section 5 show that STS, the only method using an auxiliary prediction network, fails to distinguish itself from the baseline. The confidence-signal plots in Section 5.3 indicate that the STS confidence signal remains near

0.25

for most of the test period. This subsection formalizes the condition under which this occurs and explains its consequences for the shaped reward.

Each of the four methods that use an exponential decay can be written in the form

{Conf}_{t} = exp (- λ X_{t}),

(17)

where

X_{t} \geq 0

is a method-specific uncertainty signal (the critic-disagreement magnitude for CA, the k-NN distance for SN, the coefficient of variation for AMS, the prediction error for STS) and

λ > 0

is a sensitivity hyperparameter. Treat

X_{t}

as a random variable with mean

μ_{X} = E [X_{t}]

and variance

σ_{X}^{2} = Var (X_{t})

over the evaluation period.

Suppose the uncertainty signal

X_{t}

satisfies

λ μ_{X} ≫ 1

and

σ_{X} / μ_{X} ≪ 1

. Then, a first-order Taylor expansion of

exp (- λ X_{t})

about

μ_{X}

gives

{Conf}_{t} \approx exp (- λ μ_{X}) [1 - λ (X_{t} - μ_{X})],

(18)

with relative variation

\frac{Var {({Conf}_{t})}^{1 / 2}}{E [{Conf}_{t}]} \approx λ σ_{X} .

(19)

Consequently,

{Conf}_{t}

concentrates near the constant

c = exp (- λ μ_{X})

, and the shaped reward

{\tilde{r}}_{t} \approx c \cdot r_{t}

becomes proportional to the unshaped reward up to a multiplicative constant. The induced policy gradient direction coincides with that of the baseline.

When this proposition holds, the optimal policy

π_{\tilde{r}}^{*}

under the shaped reward satisfies

π_{\tilde{r}}^{*} = π_{r}^{*}

.

A positive constant multiplier on the reward does not change the

arg max

over policies. Under the conditions of this proposition,

{Conf}_{t} \approx c

for all t, so

{\tilde{r}}_{t} \approx c \cdot r_{t}

with

c > 0

.

Together, the above explain the empirical failure of STS observed in Section 5. At an hourly resolution on raw OHLCV data, the next-state prediction error

∥ {\hat{s}}_{t + 1} - s_{t + 1} ∥^{2}

is both large in expectation (price movements at this timescale are near-random) and tightly concentrated (the prediction error rarely falls close to zero). Both conditions of the proposition are satisfied, the STS confidence signal saturates at a near-constant value of approximately

0.25

, and the resulting shaped reward induces a policy that is essentially identical to the baseline. The Wilcoxon test results, which find no significant difference between STS and baseline on any of the three cryptocurrencies, are the direct empirical consequence.
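The saturation effect can be reproduced with synthetic data. In the sketch below, the uncertainty signal is drawn from a Gaussian with hypothetical parameters, picked only so that the constant c = exp(-λ μ_X) lands near 0.25; they are not estimates from the experiments. What keeps the simulated signal nearly constant is the small product λ σ_X, the quantity on the right-hand side of Equation (19).

```python
import math
import random
import statistics


def confidence_series(xs, lam):
    """Equation (17) applied elementwise: Conf_t = exp(-lam * X_t)."""
    return [math.exp(-lam * x) for x in xs]


def relative_variation(values):
    """Standard deviation divided by mean, as in Equation (19)."""
    return statistics.pstdev(values) / statistics.fmean(values)


rng = random.Random(0)
lam, mu_x, sigma_x = 0.1, 13.86, 0.5  # hypothetical; exp(-lam * mu_x) ~ 0.25
xs = [max(0.0, rng.gauss(mu_x, sigma_x)) for _ in range(10000)]
conf = confidence_series(xs, lam)

mean_conf = statistics.fmean(conf)
rel_var = relative_variation(conf)
predicted_rel_var = lam * sigma_x  # first-order prediction from Equation (19)
```

With these values the simulated confidence has a mean close to 0.25 and a relative variation close to the first-order prediction of 0.05, so multiplying the reward by it behaves almost like multiplying by a constant.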

The proposition also yields a practical diagnostic for confidence estimator design. An effective estimator requires both

λ μ_{X} = O (1)

(so that

{Conf}_{t}

spans a meaningful range of

[0, 1]

rather than collapsing toward 0 or 1) and

σ_{X} / μ_{X} = O (1)

(so that

{Conf}_{t}

exhibits variation across timesteps). The four methods that succeed (CA, TDC, SN, AMS) satisfy both conditions in the configurations tested; STS, in the OHLCV-hourly configuration studied here, satisfies neither. This is not a defect of the prediction-error principle in general but a property of the chosen state representation and timescale, and it suggests concrete directions for rehabilitating STS that are taken up in the discussion of limitations.

4. Experimental Setup

Section 3 defines what is being compared (five confidence methods against a confidence-free baseline) and how the comparison is structured (a single multiplicative modification to a minimal reward function, with all other components fixed). This section specifies the concrete experimental conditions under which that comparison is conducted: the dataset and its temporal splits (Section 4.1), the fixed model configuration and controlled-variable design (Section 4.2), the hyperparameter selection procedure (Section 4.3), and the evaluation metrics used to assess performance (Section 4.4).

4.1. Dataset

The experiments use hourly OHLCV price data for three cryptocurrencies: BTC, ETH, and LTC. These three assets were selected to represent different segments of the cryptocurrency market: BTC as the dominant large-cap asset, ETH as the leading platform token with distinct volatility characteristics, and LTC as a lower-priced, smaller-cap asset with a longer trading history.

Historical data was obtained from CryptoDataDownload [59], a publicly available repository that provides research-grade OHLCV data aggregated from major exchanges. This source was also used in [16,28], and its data has been independently validated in multiple cryptocurrency trading studies [1,29]. Each record consists of the five OHLCV fields (open, high, low, close, and volume) at one-hour granularity, with timestamps recorded in UTC.

The dataset spans a five-year period from 1 June 2019 to 1 June 2024. This window was selected to encompass a broad range of market regimes: the pre-COVID sideways period (2019), the March 2020 crash and subsequent recovery, the 2021 bull run, the prolonged 2022 bear market triggered by the LUNA/UST collapse and FTX failure, and the 2023–2024 recovery phase. Exposure to these diverse conditions is essential for evaluating whether confidence-aware reward-shaping provides robust improvements across favorable and adverse market environments, rather than only during one regime type.

The data is split into three non-overlapping temporal partitions:

Training set (4 years, from 1 June 2019 to 1 June 2023): Used to train the TD3 agent and, where applicable, the auxiliary network for Method 5 (STS).
Validation set (6 months, from 1 June 2023 to 1 December 2023): Used exclusively for tuning the hyperparameters of each confidence estimation method (see Section 4.3). The TD3 architecture and its training hyperparameters are not tuned on this set.
Test set (6 months, from 1 December 2023 to 1 June 2024): Used for final performance evaluation. No model parameters or confidence hyperparameters are adjusted during this period.

The temporal ordering of the splits ensures that no future information leaks into training or validation, consistent with standard practice in financial time series evaluations [2]. Table 4 summarizes the dataset statistics for each cryptocurrency, and Figure 2 illustrates the hourly close price trajectories over the full five-year period, with background shading indicating the training, validation, and test partitions.
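The three partitions can be expressed as date ranges and applied to each hourly timestamp. The sketch below assumes half-open intervals (start inclusive, end exclusive) so that the shared boundary dates belong to exactly one partition; the paper lists the boundary dates but does not state which side is inclusive.

```python
from datetime import datetime, timezone


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


# Half-open [start, end) ranges; the inclusivity convention is an assumption.
SPLITS = {
    "train": (utc(2019, 6, 1), utc(2023, 6, 1)),
    "validation": (utc(2023, 6, 1), utc(2023, 12, 1)),
    "test": (utc(2023, 12, 1), utc(2024, 6, 1)),
}


def assign_split(timestamp):
    """Return the partition name for a UTC timestamp, or None if out of range."""
    for name, (start, end) in SPLITS.items():
        if start <= timestamp < end:
            return name
    return None
```

Under this convention the candle stamped 2023-06-01 00:00 UTC is the first validation record, and the last training record is the candle one hour earlier.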

No technical indicators, sentiment signals, on-chain metrics, or derived features are included in the dataset. As stated in Section 3.1, the state representation is limited to the five raw OHLCV fields. This keeps the input space minimal and ensures that performance differences across experimental conditions reflect the confidence estimation method, not differences in feature engineering.

4.2. Model Configuration and Controlled Variables

The dataset defined above is identical across all experimental conditions. The remaining degrees of freedom, the agent architecture and the initial trading capital, are also fixed, leaving the confidence estimation method as the only quantity that varies between conditions. This subsection states each fixed component explicitly and explains how the controlled-variable design is enforced.

4.2.1. Fixed Model Configuration

All experimental conditions use the TD3 configuration summarized in Table 2. These values are not tuned for the present study. They are adopted in full from the prior work that introduced the AMRF baseline [16], in which the 128-64-32 architecture was selected through a systematic comparison of four capacity variants (64-32-16, 128-64-32, 256-128-64, and 512-256-128), with the smallest variant underfitting the market dynamics and the larger variants yielding diminishing returns at higher computational cost. The training hyperparameters follow that same study, and the delayed policy update and target smoothing settings follow the original TD3 specification [18]. No hyperparameter of the TD3 agent itself, including network depth and width, learning rate, discount factor, exploration noise schedule, batch size, target network update rate, or replay buffer size, is modified between conditions. The auxiliary prediction network required by Method 5 (STS) is also fixed at a two-layer MLP with 64 and 32 hidden units, and is trained with the same optimizer and learning rate as the main TD3 networks.

This level of architectural uniformity is uncommon in comparative RL trading studies, which often tune the agent’s hyperparameters separately for each reward variant. Such per-variant tuning improves the absolute performance of each method but conflates two distinct effects: the contribution of the reward modification itself, and the contribution of the accompanying hyperparameter search. Because the question of this paper is whether the confidence signal alone carries information that improves the reward function, all agent-level hyperparameters are held constant and only the confidence-specific hyperparameters are tuned per condition. Holding the configuration fixed also protects the comparison from a subtler concern. Even if these inherited values are not optimal for the raw OHLCV state used here, any such suboptimality is applied identically to all six conditions, and under the single fixed seed (Section 4.2.3) it shifts the absolute performance level without altering the relative ordering of the confidence methods. The comparison is therefore internally valid whether or not the inherited hyperparameters are globally optimal for this setting.

4.2.2. Initial Capital and Trading Environment

Each experimental run begins with an initial cash balance of $1,000,000 and no open positions. This value is large enough to accommodate a wide range of position sizes for all three cryptocurrencies without hitting minimum-trade-size floors imposed by the action denominations. Portfolio value at any timestep t is computed as the sum of cash held and the market value of open cryptocurrency positions, evaluated at the close price

C_{t}

. Transaction fees (Equation (2)) are deducted from the cash balance at the moment each trade is executed. No leverage, margin, or short selling is permitted; the agent may only open long positions and must close them to realize a profit or loss. These environmental constraints are applied uniformly across all conditions.

4.2.3. Random Seed and Controlled Variables

A single random seed is fixed across the entire experimental matrix, governing network weight initialization, exploration noise sequences, and mini-batch sampling from the replay buffer. Because the environment is deterministic (fixed price data), identical seeds produce training trajectories that differ only in responses to the reward signal itself. Any divergence between, for example, the CA and AMS conditions can therefore be traced to the difference in reward signals rather than to differing starting weights or exploration samples.

Table 5 summarizes which components of the experimental setup are held constant and which are allowed to vary across conditions.

4.3. Hyperparameter Selection

Each confidence method is governed by one or two scalar hyperparameters that shape the decay of its confidence signal. These hyperparameters are tuned on the 6-month validation set (June 2023 through November 2023) independently for each cryptocurrency, using validation-set ROI as the selection criterion. The training and test sets play no role in selection. The TD3 agent’s own hyperparameters (Table 2) are not tuned; retuning them per condition would conflate the effect of the confidence signal with that of the accompanying hyperparameter search and defeat the purpose of the controlled comparison.

Table 6 reports the selected values. The search ranges were CA:

γ_{CA} \in {3, 5, 7, 10}

; TDC:

W \in {6, 12, 24, 48}

; SN:

λ_{SN} \in {0.1, 0.5, 1.0, 5.0}

,

k \in {5, 10, 20, 50}

; AMS:

β \in {0.5, 1.0, 2.0, 5.0}

,

W \in {6, 12, 24, 48}

; STS:

λ_{STS} \in {0.01, 0.1, 0.5, 1.0}

.

4.4. Evaluation Metrics

Three standard metrics assess each trained policy on the 6-month test set, covering absolute profitability, risk-adjusted performance, and downside exposure. Let

V_{t}

denote portfolio value at hour t, with

V_{0}

the initial capital of $1,000,000 and

V_{T}

the final portfolio value at the end of the test period. Let

{{PnL}_{k}}_{k = 1}^{K}

denote the sequence of K realized trade profits and losses over the test period.

4.4.1. Return on Investment (ROI)

ROI measures the cumulative net return relative to the initial capital:

ROI = \frac{V_{T} - V_{0}}{V_{0}} \times 100 \%

(20)

A positive ROI indicates net profit; a negative ROI indicates a net loss.

4.4.2. Sharpe Ratio (SR)

SR measures risk-adjusted return by normalizing the mean per-trade profit by its standard deviation:

SR = \frac{\bar{r} - r_{f}}{σ_{r}}

(21)

where

\bar{r}

and

σ_{r}

are the mean and standard deviation of per-trade profits

{{PnL}_{k}}

, and

r_{f}

is a reference benchmark return. Consistent with the formulation used in the foundational TD3 trading study [16],

r_{f}

is set to the absolute price change over the evaluation period,

r_{f} = | P_{0} - P_{T} |

, which represents the magnitude of the passive gain or loss from a static buy-and-hold position. This makes SR a direct measure of the agent’s active-trading skill above the passive benchmark. Higher SR values indicate greater profit per unit of profit volatility; values above 1 are generally considered favorable, above 2 strong, and above 3 excellent.

4.4.3. Maximum Drawdown (MDD)

MDD captures worst-case capital loss by measuring the largest peak-to-trough decline in portfolio value over the test period:

MDD = \max_{t \in [0, T]} \frac{V_{t}^{peak} - V_{t}}{V_{t}^{peak}} \times 100\%

(22)

where V_{t}^{peak} = \max_{τ \leq t} V_{τ} is the running maximum of portfolio value up to time t. MDD is non-negative, and lower values are preferable: an MDD of 20% means that at some point during the test period, the portfolio had lost 20% of its value from its prior peak.
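The running-maximum definition in Equation (22) translates directly into a single pass over the portfolio value series. The sketch below assumes strictly positive portfolio values; the function name is illustrative.

```python
from typing import Sequence


def max_drawdown_percent(values: Sequence[float]) -> float:
    """Equation (22): largest peak-to-trough decline, in percent of the running peak."""
    if not values:
        raise ValueError("portfolio value series is empty")
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)                      # running maximum V_t^peak
        worst = max(worst, (peak - v) / peak)    # drawdown relative to that peak
    return worst * 100.0
```

For the series 100, 120, 96, 110, 130, the peak of 120 is followed by a trough of 96, so the drawdown is (120 − 96)/120, i.e., 20%, even though the series ends at a new high.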

4.4.4. Statistical Significance Testing

To determine whether performance differences between a confidence-enhanced condition and the corresponding baseline are statistically significant rather than artifacts of noise, a Wilcoxon signed-rank test is applied to the paired per-trade PnL distributions. The test is non-parametric, makes no assumption of normality, and is appropriate for the heavy-tailed profit distributions typical of cryptocurrency trading. Pairs are formed by matching trades that occur in the same market conditions under the baseline and confidence-enhanced conditions, both of which share identical initialization and exploration sequences under the fixed random seed. A result is considered statistically significant at p < 0.05.
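For readers unfamiliar with the test, the sketch below implements the two-sided Wilcoxon signed-rank test with the large-sample normal approximation. It is an illustration under stated simplifications, not the procedure used in the paper: zero differences are discarded, tied absolute differences receive average ranks, and no tie or continuity correction is applied. How trades are paired is taken as given (the inputs are assumed to be already matched). In practice a library routine such as `scipy.stats.wilcoxon` would normally be preferred.

```python
import math
from typing import Sequence, Tuple


def wilcoxon_signed_rank(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (W_plus, two-sided p-value) using the normal approximation."""
    if len(x) != len(y):
        raise ValueError("samples must be paired")
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mu = n * (n + 1) / 4                                  # mean of W+ under H0
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)     # std of W+ under H0
    z = (w_plus - mu) / sigma
    p_value = math.erfc(abs(z) / math.sqrt(2))            # two-sided tail probability
    return w_plus, p_value
```

A result would then be flagged as significant when the returned p-value falls below 0.05, as stated above.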

5. Results and Analysis

This section reports the test-period performance of the 18 experimental conditions. The presentation is organized around the three evaluation metrics introduced in Section 4.4 and proceeds from the aggregate comparison to progressively finer-grained analyses. Section 5.1 presents the headline performance table and establishes the overall ranking of the confidence methods. Section 5.2 examines whether this ranking is stable across BTC, ETH, and LTC or varies with market characteristics. Section 5.3 visualizes how each confidence signal evolves during the test period and how those dynamics relate to market events. Section 5.4 analyzes the downstream effect of confidence on trading behavior itself, including trade frequency, average position size, and win rate. Section 5.5 reports the wall-clock computational cost of each method, reinforcing the lightweight message. Section 5.6 closes with the statistical significance of the observed differences between each confidence method and the baseline.

5.1. Overall Performance Comparison

Table 7 reports the test-set ROI, SR, and MDD for each of the 18 conditions. The baseline condition serves as the reference point within each cryptocurrency; the five confidence methods are directly compared against it. To summarize across assets, the rightmost columns report the mean of each metric over the three cryptocurrencies.

Three patterns emerge. First, every confidence method improves on the baseline across all three metrics and all three cryptocurrencies. The magnitude of improvement ranges from modest (STS) to substantial (SN), but no confidence method produces worse performance than the confidence-free reference on any (asset, metric) combination. This establishes the basic claim: modulating the reward by a scalar confidence factor, without any other change to the agent, consistently improves trading outcomes.

Second, the ranking of methods is highly stable across the three cryptocurrencies. SN achieves the highest ROI and SR and the lowest MDD on every asset, followed by CA, AMS, TDC, and STS in the same order across all three markets. The mean-column ordering, therefore, reflects a genuinely asset-invariant ranking rather than an artifact of averaging over inconsistent per-asset rankings. This has direct implications for the practical recommendation: a practitioner does not need to re-tune the choice of method to each market.

Third, the spread between the best and worst confidence methods (SN and STS) is larger than the spread between the worst confidence method (STS) and the baseline. In mean ROI, SN exceeds STS by 17.3 percentage points, while STS exceeds the baseline by 1.9 percentage points. This asymmetry matters: it indicates that which confidence signal is used carries more weight than the mere presence of any confidence signal. Choosing the right dimension of uncertainty is the actionable design decision, not the binary decision to include confidence at all.
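The asymmetry can be checked from the mean ROI values quoted in the abstract (baseline 5.7%, SN 24.9%) together with the two gaps stated above. The implied STS mean is derived here from those figures rather than read from Table 7, and all quantities are in percentage points:

```latex
\begin{aligned}
\overline{ROI}_{STS} &= 5.7 + 1.9 = 7.6 \\
\overline{ROI}_{SN} - \overline{ROI}_{STS} &= 24.9 - 7.6 = 17.3 \\
\frac{17.3}{1.9} &\approx 9.1
\end{aligned}
```

On these figures, the gap between the best and worst confidence methods is roughly nine times the gap between the worst method and the baseline.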

The ordering of the five methods admits a coherent interpretation in terms of what each measures. SN, which operates on the input distribution, performs best; it responds to genuinely novel market conditions by attenuating the reward when the agent has the least basis for its decision. CA, which measures internal value disagreement, comes next; its signal is more readily available (the Q-values already exist) but less directly tied to out-of-distribution events. AMS and TDC measure behavioral consistency, which is a weaker proxy for decision quality: an agent can be behaviorally consistent while being consistently wrong. STS, despite adding an auxiliary network, produces the smallest improvement; the likely reason, examined in Section 5.3, is that next-state prediction on raw OHLCV at hourly resolution is intrinsically difficult, yielding prediction errors that are high almost everywhere and therefore uninformative as a confidence signal. The per-cryptocurrency analysis in the next subsection examines whether these interpretations hold up when the results are disaggregated by market.

5.2. Per-Cryptocurrency Analysis

The aggregate ranking reported above obscures differences in how much each confidence method gains on each asset. These differences are informative: because the three cryptocurrencies have distinct liquidity, volatility, and price-regime characteristics during the test period (December 2023 through June 2024), they provide a natural cross-check on the hypothesized mechanism behind each method’s effect.

5.2.1. Bitcoin

BTC yields the largest absolute ROI improvement across every confidence method and the smallest MDD reductions proportionally. The test period coincided with the January 2024 approval of spot Bitcoin ETFs and the subsequent rally that pushed BTC above $70,000 by March 2024. This is precisely the type of regime shift that SN is designed to detect: sustained price movement into territory that was not well represented in the training data (June 2019 through June 2023, with BTC peaking near $69,000 in late 2021 but spending most of the training period below $50,000). SN’s lead over CA on BTC (5.6 percentage points of ROI) is the largest such gap across the three assets, consistent with the explanation that input-novelty signaling is most valuable when the test distribution shifts away from training.

5.2.2. Ethereum

ETH produces smaller absolute ROI values than BTC but a similar relative ranking. The more compressed spread between methods on ETH (mean ROI spread of 17.5 percentage points versus 18.6 on BTC) reflects ETH’s comparatively range-bound behavior during the test period: it lacked a single dominant directional move equivalent to BTC’s ETF-driven rally. In range-bound conditions, the marginal value of attenuating learning during novel states is smaller because fewer genuinely novel states occur. Notably, TDC’s disadvantage relative to AMS is slightly smaller on ETH than on BTC (4.2 versus 4.4 percentage points of ROI; the gap on LTC is 3.3), which is consistent with directional-consistency signals being less punished in markets where the correct action itself fluctuates less.

5.2.3. Litecoin

LTC shows the weakest absolute performance for every method, including the baseline. Its thinner liquidity and smaller market capitalization mean that individual trades of a size appropriate for a $1,000,000 portfolio have a more pronounced effect on realized execution prices, and the assumed 1.5% transaction fee absorbs a larger fraction of small directional edges. Despite this, the ranking of methods on LTC matches the other two assets, and the proportional improvements (SN over baseline: 5.2× ROI, 4.7× Sharpe) are in line with BTC and ETH. The persistence of the ranking on the most challenging of the three markets is the strongest available evidence that the effect is not an artifact of any single asset’s behavior.

5.2.4. Cross-Asset Takeaway

The quantitative differences across assets align with the intuitions behind each method. SN’s advantage is largest where the test period contains the most distributional shift (BTC), narrowest where the test period is most range-bound (ETH), and preserved even under the tightest liquidity constraints (LTC). The ranking is stable, but the gap between methods is not, and the gap size is predictable from market characteristics. This is a stronger result for the paper than a uniformly identical ranking would be, because it indicates that each method’s effect is tied to a recognizable market condition rather than to a generic performance offset.

5.3. Confidence Behavior Visualization

Aggregate performance metrics summarize how each confidence method affects trading outcomes, but they obscure how the confidence signals themselves behave over time. Understanding what each method actually measures, hour by hour, is essential to interpreting why the methods rank as they do. This subsection presents the evolution of each of the five confidence signals across the test period for all three cryptocurrencies, plotted alongside the corresponding price trajectory.

Figure 3, Figure 4 and Figure 5 show the test-period confidence trajectories for BTC, ETH, and LTC. In each figure, the top panel is the hourly close price and the five panels below plot the smoothed confidence signal produced by each method. All confidence values lie in [0, 1] by construction, with 1 indicating maximum confidence.

5.3.1. SN Tracks Price Novelty on All Three Assets

The SN trajectory on BTC (Figure 3) descends from approximately 0.85 in December to a minimum near 0.35 at the March all-time high, and recovers only partially to around 0.55 by June. The descent begins in early February, synchronous with the break above $50,000, and the lowest values occur precisely when BTC trades in the $65,000–$73,000 range, which exceeds the price band the agent encountered during most of its training. On ETH (Figure 4), SN drops from around 0.95 to 0.50 as the price crosses $3,000 in late January, remains depressed through the rally, and fluctuates between 0.45 and 0.85 thereafter. On LTC (Figure 5), where absolute prices remain well within the training range, SN still responds to the March–April price spike, dropping from above 0.90 to about 0.50 as the price rises from $75 to above $100 and recovering once the price returns to its prior range.

In each case the SN signal is price-driven: it falls when the current market state diverges from the agent’s accumulated experience. The depth and duration of that divergence explain why SN produces the largest ROI improvement on BTC (where the divergence is most prolonged) and the smallest on LTC (where the divergence is brief and local). This visual evidence is consistent with the mechanism proposed in Section 5.2 and distinguishes SN from the behavioral methods, which respond to the agent’s internal state rather than the market.

5.3.2. CA Fluctuates with Market Uncertainty, Without a Single Dominant Event

The CA trajectory on BTC shows a visible trough near 0.45 in early February, coinciding with the start of the February rally when the critics face a more difficult value-estimation problem, followed by a peak near 0.75 as the rally matures and recedes again toward June. On ETH and LTC, CA oscillates more uniformly between 0.55 and 0.75, with no single dominant feature. This is consistent with the interpretation that critic disagreement reflects localized difficulties in value estimation rather than broad distributional shifts; the signal is informative but less directly tied to recognizable market events than SN.

5.3.3. TDC Is Near-Binary

Across all three assets, TDC hovers at approximately 0.85 for the majority of the test period, punctuated by sharp narrow dips to 0.50–0.60. Each dip corresponds to a period during which the agent reversed its buy/sell direction multiple times within the rolling window. The dips occur irregularly and without clear clustering around market events, reflecting that direction-flipping behavior depends on the agent’s own policy trajectory rather than on external conditions. Consistent with the aggregate ranking in Section 5.1, the TDC signal provides relatively little targeted information compared to SN: the agent either is flipping or is not, with no graded response to changing market conditions.

5.3.4. AMS Drifts over Long Horizons

AMS shows smoother, lower-frequency behavior than the other methods. On BTC, AMS declines gradually from roughly 0.80 in December to approximately 0.20 in early April, then recovers to 0.85 by June; on ETH and LTC, similar multi-week drifts occur. These trajectories reflect sustained periods in which the agent’s trade sizes become more or less consistent, which tends to evolve over longer horizons than direction flips or critic disagreements. The resulting confidence signal is informative about policy stability but, like TDC, does not respond directly to market events.

5.3.5. STS Is Saturated Low

The STS signal is the most visually distinctive: across all three assets, it remains near 0.25 for the majority of the test period, with only brief spikes to 0.45–0.50. This pattern reflects the practical difficulty of next-state prediction in OHLCV data at hourly resolution. Price movements at this timescale approach a random walk, so the auxiliary network’s prediction error remains consistently high, leaving the STS confidence signal compressed near the bottom of its range for most timesteps. The spikes correspond to short intervals of briefly predictable market behavior, but these occur too rarely to provide a useful reward-shaping signal over most of the test period.

This explains STS’s position at the bottom of the method ranking. The issue is not that prediction-error is in principle uninformative, but that the underlying signal is saturated by the intrinsic unpredictability of the market at the chosen resolution. On a slower timescale (e.g., daily bars) or with a richer state representation than OHLCV alone, STS may behave differently; this possibility is discussed further in Section 6.

5.3.6. Summary of Behavioral Differences

The five methods divide into two main categories on the basis of their visual behavior, with STS as a separate case. SN and CA produce signals that respond to the market environment: SN to distributional shifts in price, CA more locally to value-estimation difficulties. TDC and AMS produce signals that reflect the agent’s internal state: TDC the short-term direction stability, AMS the longer-term magnitude stability. STS attempts to capture environmental structure but is saturated low due to the inherent unpredictability of hourly OHLCV. The environmental methods outperform the behavioral methods in the aggregate ranking, and within each category, the signal with the clearer and more event-aligned response (SN over CA; AMS over TDC) performs better. This pattern is visible in the figures and consistent with the performance results in Section 5.1.

5.4. Trading Behavior Analysis

Section 5.1 established that confidence-aware reward-shaping improves outcome metrics, and Section 5.3 showed how each confidence signal behaves over time. What remains is the connecting link: how does each confidence signal, once multiplied into the reward, change the agent’s actual trading behavior? This subsection reports five behavioral metrics computed over the test period, averaged across the three cryptocurrencies, that characterize how the agent differs under each reward configuration.

Table 8 reports: the number of trades (buy–sell round-trips) executed over the six-month test period, the mean position size per trade in USD, the CV of position size across trades, the win rate (fraction of completed trades with positive realized PnL), and the direction flip rate (fraction of adjacent trade pairs with opposite signs).
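The five behavioral metrics can be computed from a chronological list of completed trades. The sketch below follows the definitions given above; the input layout (three parallel per-trade lists, with direction encoded as +1 or −1) and the use of the population standard deviation for the CV are assumptions, since the text does not specify them.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def behavior_metrics(
    pnl: Sequence[float],
    size_usd: Sequence[float],
    direction: Sequence[int],
) -> Dict[str, float]:
    """Per-trade inputs in chronological order; direction is +1 or -1 per trade."""
    n = len(pnl)
    if n < 2 or len(size_usd) != n or len(direction) != n:
        raise ValueError("need at least two trades and equal-length inputs")
    mean_size = mean(size_usd)
    # Count adjacent trade pairs whose directions have opposite signs.
    flips = sum(1 for a, b in zip(direction, direction[1:]) if a * b < 0)
    return {
        "n_trades": n,
        "mean_size_usd": mean_size,
        "size_cv": pstdev(size_usd) / mean_size,       # coefficient of variation
        "win_rate": sum(1 for p in pnl if p > 0) / n,  # share of profitable trades
        "flip_rate": flips / (n - 1),
    }
```

Table 8 reports these quantities averaged across the three cryptocurrencies, so a function like this would be applied per asset before averaging.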

Three patterns in Table 8 illuminate why the confidence methods rank as they do in terms of outcome metrics.

5.4.1. Win Rate Drives the ROI Ranking

Win rate is the metric that most closely tracks the ROI ranking from Section 5.1. The baseline wins on approximately 48% of trades, compared to 67% for SN and 61% for CA. The improvement is not due to the agent trading more aggressively; the opposite holds. Under the baseline, the agent executes roughly 428 trades over six months, while under SN and CA the count drops to 241 and 287 respectively. The reward attenuation in novel or uncertain states causes the agent to delay or skip weaker opportunities, leaving a smaller but more selective set of completed trades. The consequence is a higher fraction of winners, which compounds over the test period into the ROI differences reported in Table 7.

STS, whose confidence signal remains saturated near 0.25 for most of the test period (Figure 3), exhibits behavior nearly indistinguishable from the baseline: trade count 401; win rate 50.3%. This confirms the mechanistic interpretation from Section 5.3: when the confidence signal is saturated at a near-constant low value, the reward multiplier is approximately constant and the agent learns a policy similar to the one it would learn from the unmodified baseline reward.
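Why a near-constant multiplier changes little can be made explicit. Assume, as the text describes, that the shaped reward is the product of the confidence value and PnL, and idealize the STS signal as a constant c > 0 (roughly 0.25 here). Writing γ for the discount factor (a symbol assumed for this sketch and unrelated to γ_{CA}), the action-value function of any policy π is simply rescaled:

```latex
Q^{\pi}_{c}(s, a)
  = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c\, PnL_{t} \,\middle|\, s_{0} = s,\ a_{0} = a\right]
  = c\, Q^{\pi}(s, a),
\qquad
\arg\max_{a}\, c\, Q^{\pi}(s, a) = \arg\max_{a}\, Q^{\pi}(s, a).
```

The ranking of actions is therefore unchanged under this idealization, although the reward scale can still influence optimization dynamics, and the brief spikes to 0.45–0.50 are the only departures from a constant multiplier.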

5.4.2. Position Size Stability Reflects AMS’s Direct Signal

The CV of position size is, by construction, the quantity that AMS attempts to stabilize. AMS achieves the lowest CV (0.38), well below the baseline’s 0.74 and below SN’s 0.47. This is a mechanistic check rather than an outcome: AMS successfully reduces what it is designed to reduce. The question Section 5.1 already answered is whether that reduction translates to better trading outcomes, and the answer is mixed: AMS does improve over the baseline, but not as much as SN or CA. Consistent sizing is necessary for disciplined trading but is not sufficient on its own; an agent can size consistently around a poor directional signal.

5.4.3. Direction Flip Rate Confirms TDC’s Mechanism

TDC produces the lowest direction flip rate (22.5%), roughly half of the baseline’s 42.6%. As with AMS, this is a mechanistic confirmation: TDC suppresses learning during direction-inconsistent behavior, and the trained agent flips direction less often. Also, as with AMS, the behavioral effect does not fully translate to outcome improvements. Reducing direction flips helps on average but penalizes the correct response to genuine regime changes, in which reversing direction is the optimal action. This is the likely explanation for TDC’s middle-ranked outcome performance despite its strong effect on the metric it directly targets.

5.4.4. SN Combines Desirable Behaviors Without Directly Targeting Any of Them

A notable feature of Table 8 is that SN, which does not explicitly target trade count, position size CV, or direction flip rate, nonetheless produces improvements in all of them (trade count 241 versus 428 baseline; position CV 0.47 versus 0.74; direction flip 28.3% versus 42.6%). The mechanism is different from AMS and TDC: SN does not penalize inconsistent behavior in itself, but rather attenuates reward during states the agent cannot reliably assess. The downstream effect is that the agent becomes quieter, more selective, and more consistent, without any of those behaviors being explicit targets of the reward modification. This is a characteristic advantage of input-based uncertainty signaling over output- or behavior-based signaling: the agent self-regulates across multiple behavioral dimensions in response to a single, well-targeted input signal.

5.5. Computational Cost

The confidence methods are proposed as lightweight enhancements, a claim that can be evaluated concretely by measuring the additional wall-clock time each one imposes during training and inference. Table 9 reports the mean training time per 10,000 training steps and the mean per-timestep inference overhead for each method, measured on a single NVIDIA RTX 4090 GPU and averaged across the three cryptocurrencies.

Four of the five methods (CA, TDC, AMS, and STS) impose under 20% overhead, confirming the lightweight characterization. CA, TDC, and AMS add essentially zero cost, as their confidence calculations are reducible to a handful of arithmetic operations on quantities the agent already computes. SN is the most expensive method because of the nearest-neighbor search across the replay buffer; the 18.7% inference overhead reflects the KD-tree queries performed at each timestep. STS’s overhead comes from forward passes through the auxiliary prediction network and is the second-highest at 12.0%. Even these higher-overhead methods add less than 20 s per 10,000 training steps, which is modest relative to the cost of a full training run and leaves all five methods practically deployable.

5.6. Statistical Significance

To determine whether the performance improvements reported in Section 5.1 are statistically distinguishable from the baseline, a Wilcoxon signed-rank test is applied to the paired per-trade PnL distributions. Pairs are formed by matching trades that occur at equivalent market conditions under the baseline and each confidence-enhanced condition, which is possible because the fixed random seed ensures identical initialization and exploration across conditions. The null hypothesis is that the median difference between paired PnL values is zero.

SN, CA, and AMS produce statistically significant improvements (p < 0.05) on all three cryptocurrencies, with SN and CA reaching p < 0.001 on all markets. TDC crosses the 0.05 threshold on each asset but only marginally on LTC (0.047), indicating that its improvement over the baseline is real but small enough to approach the limits of detectability at this sample size. STS fails to reach significance on any asset, with p-values above 0.14 uniformly. This is consistent with the behavioral analysis in Section 5.4: STS produces a trading policy that differs little from the baseline because its saturated confidence signal provides almost no reward modulation, leaving the per-trade PnL distributions under STS and the baseline statistically indistinguishable.

The test confirms that the ROI differences in Table 7 are unlikely to reflect chance variation for four of the five methods. Combined with the mechanistic evidence in Section 5.3 and Section 5.4, the statistical results support the claim that confidence-aware reward-shaping produces genuine, measurable improvements in trading outcomes, with SN providing the strongest and most consistent effect and STS functioning effectively as a null condition due to signal saturation.

6. Discussion

6.1. Generalizability Beyond the Study Setup

The experimental setup of this paper is deliberately narrow: one RL algorithm (TD3), one data type (hourly OHLCV), three cryptocurrencies, and a single reward structure reduced to PnL. This narrowness serves the controlled comparison but raises the question of whether the observed effects would transfer to other settings. Four of the five confidence methods are largely independent of the experimental choices. TDC, AMS, SN, and STS depend only on generic properties of the RL loop: an action history, a replay buffer, and a mapping from states to next states. Any off-policy or on-policy actor–critic method with these components (SAC, PPO, DDPG, A3C) can implement all four without modification. CA is the only method that requires an ensemble of value estimators; it transfers directly to SAC and to any architecture with two or more critics, and the underlying principle (ensemble disagreement as confidence) generalizes to Monte Carlo dropout or bootstrapped Q-networks when a twin-critic architecture is not available.

Transfer to other asset classes is a separate question. The mechanisms identified in Section 5.3, particularly the input-based novelty signal underlying SN, depend only on whether the test-period state distribution shifts relative to the training distribution. This condition is not specific to cryptocurrency: equity markets exhibit regime changes around earnings seasons and macroeconomic releases, foreign exchange markets shift around central bank announcements, and commodity markets respond to supply shocks. In each case, an SN-style signal would attenuate learning during the least familiar states, with effects analogous to those reported here. The quantitative magnitude of the improvement is not guaranteed to transfer, but the direction of the effect should, provided the training and test periods span different regimes.

The reward-structure simplification is more consequential. Real trading systems often incorporate risk penalties, transaction cost adjustments, or drawdown limits, any of which may interact with a multiplicative confidence factor in ways this paper does not examine. A confidence factor applied to a reward that already contains risk terms could double-count or cancel uncertainty-aware behavior, and the resulting agent would not necessarily match the behavior studied here. This interaction is a productive direction for future work and a reason to interpret the present results as a lower bound on what confidence-aware reward-shaping can achieve, rather than a complete characterization.

6.2. Limitations

Several scope constraints of this study should be stated explicitly.

The single random seed used across the experimental matrix isolates the confidence method as the sole source of variance between conditions but measures performance under one particular initialization rather than across the distribution of initializations. A multi-seed study would provide marginal performance estimates and confidence intervals around each condition’s outcome metrics; the present design does not.

Only one RL algorithm (TD3) is evaluated. The generalizability arguments above suggest that four of the five methods should apply to other actor–critic algorithms, but this remains an assumption until tested directly. The interaction between confidence-aware reward-shaping and algorithm-specific components (e.g., PPO’s trust region, SAC’s entropy regularization) may alter the magnitude or direction of the effect.

The state representation is limited to five raw OHLCV fields. This simplification is intentional, but it has a direct consequence for STS in particular: next-state prediction on this minimal state at hourly resolution is intrinsically difficult, producing the saturated confidence signal observed in Section 5.3. A richer state (technical indicators, order-book depth, multi-asset context) or a slower timescale (daily bars) may produce more informative STS behavior. The current results indicate that STS is ineffective in the configuration studied, not that the prediction-error principle is inherently weak.

The test period is six months, sufficient to include a meaningful regime transition (the January 2024 spot ETF approval and the subsequent rally to the all-time high) but not long enough to establish stability across multiple bull-bear cycles. A multi-year test evaluation would be more conclusive, though also more costly to conduct.

Finally, the confidence methods are studied in isolation rather than in combination. A composite signal, for example Conf_{t} = Conf_{t}^{SN} \cdot Conf_{t}^{CA}, or a learned weighting of multiple methods, may outperform any single method. The present study does not examine this possibility; doing so would require a second layer of hyperparameter search (method weights) that exceeds the scope of a controlled single-variable comparison.
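The product form mentioned above is simple to state in code. The sketch below is purely illustrative, since the study does not evaluate any composite signal; the function name and input layout are assumptions.

```python
from typing import List, Sequence


def composite_confidence(conf_sn: Sequence[float], conf_ca: Sequence[float]) -> List[float]:
    """Elementwise product Conf_t = Conf_t^SN * Conf_t^CA for aligned hourly signals."""
    if len(conf_sn) != len(conf_ca):
        raise ValueError("signals must have the same length")
    for c in (*conf_sn, *conf_ca):
        if not 0.0 <= c <= 1.0:
            raise ValueError("confidence values must lie in [0, 1]")
    return [a * b for a, b in zip(conf_sn, conf_ca)]
```

Because both factors lie in [0, 1], the product never exceeds the smaller of the two, so a composite of this form attenuates the reward at least as strongly as either signal alone. Whether that stronger attenuation helps or hurts is exactly the open question left for future work.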

6.3. Practical and Managerial Implications

The results translate into concrete guidance for the people who build, run, and oversee RL-based trading systems, with different implications for each role.

For trading-system developers, two questions matter: which method to adopt, and what it costs to integrate. The choice of method follows a simple flow. If the algorithm already maintains multiple value estimators, Critic Agreement is the cheapest method to try, requiring no computation beyond the standard critic update and a single decay hyperparameter while producing statistically significant improvements on every asset tested; it is the natural default for TD3, SAC, or any value-ensemble architecture. If the system is expected to operate under conditions that differ from its training data, for example after a prolonged period of market evolution, State Novelty is the stronger choice despite its nearest-neighbor cost, because its advantage over Critic Agreement widens precisely as the test distribution departs from training (Section 5.2). State-Transition Surprise is not recommended in the hourly OHLCV setting, where its signal saturates; a practitioner considering it should first confirm that the state representation and timescale support meaningful next-state prediction. On integration cost, four of the five methods require no change to the network, training loop, or data pipeline, and three of them add under one percent of inference overhead (Section 5.5), so a confidence estimator can be retrofitted onto an existing system, and removed again, without disturbing the rest of the stack.

For portfolio managers, the gains are concentrated in risk-adjusted terms rather than raw return. The strongest method reduced mean maximum drawdown from 28.0% to 15.0% and raised the Sharpe ratio more than fourfold, while executing roughly half as many trades as the baseline (Section 5.1 and Section 5.4). The mechanism works by making the agent more selective rather than more active. For oversight, the confidence signals are also useful in their own right: Action Magnitude Stability and Temporal Direction Consistency provide interpretable measures of policy stability in sizing and direction, and signals such as State Novelty and Critic Agreement fall when the agent operates outside its training distribution (Section 5.3). Any of these can be logged and monitored as a real-time indicator of model reliability, flagging the periods in which an automated strategy is least trustworthy and human oversight or reduced exposure may be warranted, regardless of which method drives the reward.
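The monitoring use described above could be sketched as a simple scan over a logged hourly confidence series. The alert threshold and minimum run length below are hypothetical placeholders, not values from the study; any deployment would need to calibrate them.

```python
from typing import List, Sequence, Tuple


def low_confidence_periods(
    conf: Sequence[float],
    threshold: float = 0.5,  # hypothetical alert level, not from the paper
    min_hours: int = 3,      # hypothetical minimum run length
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, of sustained low confidence."""
    periods: List[Tuple[int, int]] = []
    start = None
    for t, c in enumerate(conf):
        if c < threshold:
            if start is None:
                start = t            # a low-confidence run begins
        else:
            if start is not None and t - start >= min_hours:
                periods.append((start, t))
            start = None
    # Close a run that is still open at the end of the series.
    if start is not None and len(conf) - start >= min_hours:
        periods.append((start, len(conf)))
    return periods
```

Requiring a minimum run length keeps isolated one-hour dips from triggering alerts, which matters for near-binary signals that show sharp, narrow dips.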

These implications follow from backtests on historical data under a fixed transaction-cost assumption and a single market-regime window. They describe how the framework can inform system design and oversight; they are not a guarantee of live-trading profitability, which would require forward testing under real execution conditions.

7. Conclusions

This paper examined whether the reward signal in RL-based cryptocurrency trading can be improved by incorporating the agent’s own decision certainty, without modifying network architecture, data pipelines, or training procedures. Five lightweight confidence estimation methods were formalized, each targeting a distinct dimension of uncertainty: CA, TDC, SN, AMS, and STS. A controlled experiment on BTC, ETH, and LTC over a six-month test period found that four of the five methods produced statistically significant improvements in return, risk-adjusted return, and MDD over a confidence-free baseline. State Novelty produced the largest gains, followed by CA, AMS, and TDC. STS, the only method requiring an auxiliary network, did not reach statistical significance, a result attributable to the difficulty of next-state prediction on raw OHLCV at hourly resolution rather than to a weakness of the prediction-error principle itself.

The central finding is that confidence-aware reward-shaping is a general, low-cost mechanism for improving RL trading agents. The best-performing method requires no architectural changes, operates directly on quantities the agent already computes, and transfers, in principle, to other actor–critic algorithms and asset classes with minimal adaptation. Future work should examine composite confidence signals combining multiple methods, extension to other RL algorithms (PPO, SAC), richer state representations, and application to non-cryptocurrency markets, all of which the present controlled design deliberately set aside in order to isolate the effect of the confidence mechanism itself.

Author Contributions

Conceptualization, F.A. and S.O.; methodology, F.A.; software, Y.I.C. and E.H.K.; validation, S.O. and O.U.M.; formal analysis, Y.I.C. and Y.S.S.; investigation, F.A.; data curation, S.O.; writing—original draft preparation, F.A.; writing—review and editing, R.C.; visualization, O.U.M.; funding acquisition, R.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the PubArt program of POLITEHNICA Bucharest.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Fang, F.; Ventre, C.; Basios, M.; Kanthan, L.; Martinez-Rego, D.; Wu, F.; Li, L. Cryptocurrency trading: A comprehensive survey. In Blockchain, Crypto Assets, and Financial Innovation: A Decade of Insights and Advances; Springer Nature Singapore: Singapore, 2025; pp. 55–127.
2. Fischer, T.G. Reinforcement Learning in Financial Markets—A Survey; FAU Discussion Papers in Economics; FAU: Boca Raton, FL, USA, 2018.
3. Tay, X.H.; Lim, S.M. Deep reinforcement learning in cryptocurrency trading: A profitable approach. J. Telecommun. Digit. Econ. 2024, 12, 126–147.
4. Schnaubelt, M. Deep reinforcement learning for the optimal placement of cryptocurrency limit orders. Eur. J. Oper. Res. 2022, 296, 993–1006.
5. Yang, H.; Liu, X.Y.; Zhong, S.; Walid, A. Deep reinforcement learning for automated stock trading: An ensemble strategy. In Proceedings of the First ACM International Conference on AI in Finance 2020, Virtually, 15–16 October 2020; pp. 1–8.
6. Ghadiri, H.; Hajizadeh, E. Designing a cryptocurrency trading system with deep reinforcement learning utilizing LSTM neural networks and XGBoost feature selection. Appl. Soft Comput. 2025, 175, 113029.
7. Lucarelli, G.; Borrotti, M. A deep reinforcement learning approach for automated cryptocurrency trading. In Proceedings of the IFIP International Conference on Artificial Intelligence Applications and Innovations, Crete, Greece, 24–26 May 2019; Springer International Publishing: Cham, Switzerland, 2019; pp. 247–258.
8. Moody, J.; Wu, L.; Liao, Y.; Saffell, M. Performance functions and reinforcement learning for trading systems and portfolios. J. Forecast. 1998, 17, 441–470.
9. Rodinos, G.; Nousi, P.; Passalis, N.; Tefas, A. A sharpe ratio based reward scheme in deep reinforcement learning for financial trading. In Proceedings of the IFIP International Conference on Artificial Intelligence Applications and Innovations 2023, León, Spain, 14–17 June 2023; Springer Nature Switzerland: Cham, Switzerland, 2023; pp. 15–23.
10. Huang, Y.; Zhou, C.; Zhang, L.; Lu, X. A self-rewarding mechanism in deep reinforcement learning for trading strategy optimization. Mathematics 2024, 12, 4020.
11. Orra, A.; Choudhary, H.; Sharma, A.; Thakur, M. Enhancing deep reinforcement learning for stock trading: A reward shaping approach via expert feedback. Knowl. Inf. Syst. 2025, 67, 11075–11094.
12. Srivastava, U.; Aryan, S.; Singh, S. A Risk-Aware Reinforcement Learning Reward for Financial Trading. arXiv 2025, arXiv:2506.04358.
13. Clements, W.R.; Van Delft, B.; Robaglia, B.M.; Slaoui, R.B.; Toth, S. Estimating risk and uncertainty in deep reinforcement learning. arXiv 2019, arXiv:1905.09638.
14. Charpentier, B.; Senanayake, R.; Kochenderfer, M.; Günnemann, S. Disentangling epistemic and aleatoric uncertainty in reinforcement learning. arXiv 2022, arXiv:2206.01558.
15. Liu, Q.; Li, Y.; Chen, S.; Lin, K.; Shi, X.; Lou, Y. Distributional reinforcement learning with epistemic and aleatoric uncertainty estimation. Inf. Sci. 2023, 644, 119217.
16. Otabek, S.; Choi, J. Optimizing Cryptocurrency Trades with Twin Delayed DDPG: Adaptive Multi-factor Reward Function with Diverse Data Sources. Expert Syst. Appl. 2026, 7, 131527.
17. Khujamatov, E.H.; Ismanov, K.; Mallaev, O.U.; Sattarov, O. Optimizing Crypto-Trading Performance: A Comparative Analysis of Innovative Reward Functions in Reinforcement Learning Models. Mathematics 2026, 14, 794.
18. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning 2018, Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596.
19. Sun, Q.; Si, Y.W. Supervised actor-critic reinforcement learning with action feedback for algorithmic trading. Appl. Intell. 2023, 53, 16875–16892.
20. Kong, M.; So, J. Empirical analysis of automated stock trading using deep reinforcement learning. Appl. Sci. 2023, 13, 633.
21. Lu, C.I. Evaluation of deep reinforcement learning algorithms for portfolio optimisation. arXiv 2023, arXiv:2307.07694.
22. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.M.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D.P. Continuous Control with Deep Reinforcement Learning. US Patent 10,776,692, 15 September 2020.
23. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
24. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning 2018, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870.
25. Zhang, J.; Cai, K.; Wen, J. A survey of deep learning applications in cryptocurrency. iScience 2024, 27, 108509.
26. Kochliaridis, V.; Kouloumpris, E.; Vlahavas, I. Combining deep reinforcement learning with technical analysis and trend monitoring on cryptocurrency markets. Neural Comput. Appl. 2023, 35, 21445–21462.
27. Kumlungmak, K.; Vateekul, P. Multi-agent deep reinforcement learning with progressive negative reward for cryptocurrency trading. IEEE Access 2023, 11, 66440–66455.
28. Otabek, S.; Choi, J. Multi-level deep Q-networks for Bitcoin trading strategies. Sci. Rep. 2024, 14, 771.
29. Tran, M.; Pham-Hi, D.; Bui, M. Optimizing automated trading systems with deep reinforcement learning. Algorithms 2023, 16, 23.
30. Eschmann, J. Reward function design in reinforcement learning. In Reinforcement Learning Algorithms: Analysis and Applications; Springer: Cham, Switzerland, 2021; pp. 25–33.
31. Ibrahim, S.; Mostafa, M.; Jnadi, A.; Salloum, H.; Osinenko, P. Comprehensive overview of reward engineering and shaping in advancing reinforcement learning applications. IEEE Access 2024, 12, 175473–175500.
32. Allen, F.; Karjalainen, R. Using genetic algorithms to find technical trading rules. J. Financ. Econ. 1999, 51, 245–271.
33. Liu, X.Y.; Yang, H.; Gao, J.; Wang, C.D. FinRL: Deep reinforcement learning framework to automate trading in quantitative finance. In Proceedings of the Second ACM International Conference on AI in Finance 2021, Virtual, 3–5 November 2021; pp. 1–9.
34. Wu, M.E.; Syu, J.H.; Lin, J.C.; Ho, J.M. Portfolio management system in equity market neutral using reinforcement learning. Appl. Intell. 2021, 51, 8119–8131.
35. Choudhary, H.; Orra, A.; Sahoo, K.; Thakur, M. Risk-adjusted deep reinforcement learning for portfolio optimization: A multi-reward approach. Int. J. Comput. Intell. Syst. 2025, 18, 126.
36. Su, R.; Chi, C.; Tu, S.; Xu, L. A Deep Reinforcement Learning Approach for Portfolio Management in Non-Short-Selling Market. IET Signal Process. 2024, 2024, 5399392.
37. Sadighian, J. Extending deep reinforcement learning frameworks in cryptocurrency market making. arXiv 2020, arXiv:2004.06985.
38. Zhou, C.; Huang, Y.; Cui, K.; Lu, X. R-DDQN: Optimizing algorithmic trading strategies using a reward network in a double DQN. Mathematics 2024, 12, 1621.
39. Cornalba, F.; Disselkamp, C.; Scassola, D.; Helf, C. Multi-objective reward generalization: Improving performance of Deep Reinforcement Learning for applications in single-asset trading. Neural Comput. Appl. 2024, 36, 619–637.
40. Valdenegro-Toro, M.; Mori, D.S. A deeper look into aleatoric and epistemic uncertainty disentanglement. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1508–1516.
41. Wang, T.; Wang, Y.; Zhou, J.; Peng, B.; Song, X.; Zhang, C.; Sun, X.; Niu, Q.; Liu, J.; Chen, S.; et al. From aleatoric to epistemic: Exploring uncertainty quantification techniques in artificial intelligence. arXiv 2025, arXiv:2501.03282.
42. An, G.; Moon, S.; Kim, J.H.; Song, H.O. Uncertainty-based offline reinforcement learning with diversified q-ensemble. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; Volume 34, pp. 7436–7447.
43. Wu, Y.; Zhai, S.; Srivastava, N.; Susskind, J.; Zhang, J.; Salakhutdinov, R.; Goh, H. Uncertainty weighted actor-critic for offline reinforcement learning. arXiv 2021, arXiv:2105.08140.
44. Bai, C.; Wang, L.; Yang, Z.; Deng, Z.; Garg, A.; Liu, P.; Wang, Z. Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning. arXiv 2022, arXiv:2202.11566.
45. Hoel, C.J.; Wolff, K.; Laine, L. Ensemble quantile networks: Uncertainty-aware reinforcement learning with applications in autonomous driving. IEEE Trans. Intell. Transp. Syst. 2023, 24, 6030–6041.
46. Pathak, D.; Agrawal, P.; Efros, A.A.; Darrell, T. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the International Conference on Machine Learning 2017, Sydney, Australia, 6–11 August 2017; pp. 2778–2787.
47. Burda, Y.; Edwards, H.; Storkey, A.; Klimov, O. Exploration by random network distillation. arXiv 2018, arXiv:1810.12894.
48. Zhou, R.; Zhu, W.; Han, S.; Kang, M.; Lü, S. VCSAP: Online reinforcement learning exploration method based on visitation count of state-action pairs. Neural Netw. 2025, 184, 107052.
49. Li, J.; Shi, X.; Li, J.; Zhang, X.; Wang, J. Random curiosity-driven exploration in deep reinforcement learning. Neurocomputing 2020, 418, 139–147.
50. Bitcoin Wiki. Satoshi (Unit). 2026. Available online: https://en.bitcoin.it/wiki/Satoshi_(unit) (accessed on 1 April 2026).
51. Cryptopedia. Satoshi Value, Gwei to Ether to Wei Converter. 2026. Available online: https://www.gemini.com/cryptopedia/satoshi-value-gwei-to-ether-to-wei-converter-eth-gwei (accessed on 1 April 2026).
52. Coinbase. Pricing and Fees Disclosures. 2026. Available online: https://help.coinbase.com/en/coinbase/trading-and-funding/pricing-and-fees/fees (accessed on 1 April 2026).
53. Makarov, I.; Schoar, A. Trading and arbitrage in cryptocurrency markets. J. Financ. Econ. 2020, 135, 293–319.
54. Lin, L.J. Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching. Mach. Learn. 1992, 8, 293–321.
55. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2020, arXiv:1412.6980.
56. Bentley, J.L. Multidimensional binary search trees used for associative searching. Commun. ACM 1975, 18, 509–517.
57. Everitt, B.S.; Skrondal, A. The Cambridge Dictionary of Statistics; Cambridge University Press: Cambridge, UK, 2002.
58. Ng, A.Y.; Harada, D.; Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the ICML 1999, Bled, Slovenia, 27–30 June 1999; Volume 99, pp. 278–287.
59. Crypto Data Download. Available online: https://www.cryptodatadownload.com/ (accessed on 5 April 2026).

Figure 1. Conceptual overview of confidence-aware reward-shaping. (Top): the baseline architecture, where the TD3 agent receives the raw profit-and-loss signal $\mathrm{PnL}_{k}$ directly as its reward ($\mathrm{Conf}_{t} = 1$). (Bottom): the proposed architecture, where $\mathrm{PnL}_{k}$ is scaled by a confidence estimate $\mathrm{Conf}_{t} \in [0, 1]$ before entering the reward function, yielding $r_{t} = \mathrm{PnL}_{k} \times \mathrm{Conf}_{t}$.
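The scaling described in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `shaped_reward` and the input validation are assumptions, and the confidence value is taken as given rather than computed by any of the five methods.

```python
def shaped_reward(pnl: float, conf: float = 1.0) -> float:
    """Return the confidence-scaled reward r_t = PnL_k * Conf_t.

    With the default conf = 1.0 this reduces to the baseline reward,
    which is the raw profit-and-loss value.
    """
    if not 0.0 <= conf <= 1.0:
        raise ValueError("conf must lie in [0, 1]")
    return pnl * conf
```

For example, `shaped_reward(100.0, 0.25)` gives 25.0 and `shaped_reward(-40.0, 0.5)` gives -20.0. Because the scaling is multiplicative, low confidence shrinks the learning signal from both profitable and losing trades.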

Figure 2. Hourly close prices for BTC, ETH, and LTC over the five-year study period (June 2019–June 2024).

Figure 3. BTC test-period confidence trajectories (December 2023–June 2024).

Figure 4. ETH test-period confidence trajectories (December 2023–June 2024).

Figure 5. LTC test-period confidence trajectories (December 2023–June 2024).

Table 1. Positioning of the present work relative to selected studies.

Study	Domain	Algorithm	Key Contribution	Limitation/Gap
Sun et al. [19]	Stock	TD3, DDPG	Action feedback mechanism corrects replay buffer with dealt positions	Reward is standard PnL; no uncertainty or risk-awareness in the reward
Kong et al. [20]	Stock	7-algo ensemble	Broadest ensemble (TD3, SAC, A2C, TRPO, etc.) across three markets	Ensemble diversity addresses architecture, not reward design
Kochliaridis et al. [26]	Crypto	DRL + rules	Rule-based safety mechanism filters uncertain actions during exploitation	Uncertainty handled post-hoc via action filtering, not within the reward
Kumlungmak et al. [27]	Crypto	MAPPO	Progressive loss penalty prevents consecutive drawdowns in bearish markets	Risk penalty is loss-based; agent has no self-assessed confidence measure
Moody et al. [8]	Stock	RRL	Differential SR as a direct optimization objective	Risk-adjusted objective, but static; no adaptation to agent certainty
Srivastava et al. [12]	Stock	RL (general)	Modular 4-term composite reward (return, risk, Treynor) with tunable weights	Reward components are market metrics; none reflects the agent’s internal state
Choudhary et al. [35]	Stock	3 DRL agents	Multi-reward fusion via CNN combining log return, Sharpe, and MDD agents	Fusion occurs at the action level, not the reward level; no confidence scaling
Huang et al. [10]	Stock	DDQN	Self-rewarding network learns to predict rewards from expert labels	Reward is learned from external supervision; agent does not estimate its own certainty
Zhou et al. [38]	Stock	DDQN + RLHF	Reward network trained on expert demonstrations via RLHF	Expert-dependent reward generation; no intrinsic uncertainty signal
Sattarov & Choi [16]	Crypto	TD3	AMRF with 6 factors including critic-agreement confidence	Confidence was one of six coupled factors; individual effect not isolated
An et al. [42]	Control	SAC (ensemble)	Q-ensemble disagreement penalizes OOD actions in offline RL	Uncertainty penalizes Q-values; not applied to scale task rewards
Wu et al. [43]	Control	SAC + dropout	Dropout-based uncertainty down-weights OOD training samples	Uncertainty modulates gradient contribution, not the reward signal
Hoel et al. [45]	Driving	DQN (ensemble)	Separates aleatoric (quantile) and epistemic (ensemble) uncertainty	Uncertainty flags unsafe decisions; not fed back into reward computation
Pathak et al. [46]	Games	A3C	Prediction error as intrinsic curiosity bonus for exploration	Bonus is additive and encourages novelty-seeking, not reward-scaling
This paper	Crypto	TD3	Five confidence methods that multiplicatively scale PnL reward by agent certainty	Single algorithm (TD3); single data type (OHLCV); single asset class (crypto)

Table 2. TD3 model configuration. All parameters are fixed across every experimental condition.

Component	Specification
Input layer	5 neurons (OHLCV)
Hidden layers	128, 64, 32 (FC, ReLU)
Output layer	1 neuron
Networks	1 Actor + 2 Critics
Loss function	MSE
Optimizer	Adam
Learning rate	0.001
Discount factor ($γ$)	0.99
Exploration noise	$\mathcal{N}(0, 1.0)$, decay 0.995/episode
Target network update	Soft, $τ = 0.005$, every 500 steps
Policy update delay	Every 2 critic updates
Batch size	128
Replay buffer	Experience replay
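
Two of the entries in Table 2 are update rules rather than constants, and a short sketch may make them concrete. It assumes that "decay 0.995/episode" means the noise scale is multiplied by 0.995 once per episode, and that the soft update is the usual Polyak average used in TD3; the function names and the flat-list representation of network weights are hypothetical simplifications.

```python
from typing import List


def noise_scale(episode: int, initial: float = 1.0, decay: float = 0.995) -> float:
    """Exploration-noise scale after `episode` episodes, assuming geometric decay."""
    return initial * decay ** episode


def soft_update(target: List[float], online: List[float], tau: float = 0.005) -> List[float]:
    """Polyak averaging: each target weight moves a fraction tau toward the online weight."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target, online)]
```

Under these assumptions the noise scale falls below half of its initial value after about 139 episodes, since 0.995 raised to the 139th power is just under 0.5.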

Table 3. Summary of the five confidence estimation methods. All methods output $\mathrm{Conf}_{t} \in [0, 1]$, where 1 indicates maximum confidence and 0 indicates maximum uncertainty. Methods 1–4 require no architectural modifications; Method 5 introduces a lightweight auxiliary network.

Method	Abbr.	Uncertainty Dimension	Architecture Change	Computational Overhead	Hyperparameters
Critic Agreement	CA	Value estimation	None (uses TD3 critics)	Negligible	$γ_{CA}$
Temporal Direction Consistency	TDC	Behavioral direction	None	Negligible	W
State Novelty	SN	Distributional familiarity	None	Moderate (NN search)	$λ_{SN}$, k
Action Magnitude Stability	AMS	Position sizing	None	Negligible	$β$, W
State-Transition Surprise	STS	Environmental predictability	Auxiliary MLP	Low	$λ_{STS}$

Table 4. Dataset statistics for each cryptocurrency. All values are computed over the full five-year period (1 June 2019–1 June 2024). Hourly OHLCV data was obtained from CryptoDataDownload [59].

Statistic	BTC	ETH	LTC
Total hourly records	43,869	43,869	43,869
Training (Jun ’19–Jun ’23)	35,061	35,061	35,061
Validation (Jun ’23–Dec ’23)	4392	4392	4392
Test (Dec ’23–Jun ’24)	4416	4416	4416
Price min (USD)	4160	97.63	25
Price max (USD)	73,613	4848.52	410.67
Price mean (USD)	29,031	1650.08	96.1
Volume mean (hourly, native)	62.13 BTC	643.96 ETH	718.09 LTC

Table 5. Controlled and varying components of the experimental design. The confidence estimation method is the sole variable across conditions within each cryptocurrency.

Component	Status
State representation (OHLCV)	Fixed
Action space and denomination	Fixed (per cryptocurrency)
Transaction fee (1.5%)	Fixed
Initial capital ($1,000,000)	Fixed
TD3 network architecture	Fixed
TD3 training hyperparameters	Fixed
STS auxiliary network architecture	Fixed
Random seed	Fixed (single seed)
Training, validation, and test splits	Fixed
Baseline reward structure	Fixed
Confidence estimation method	Varied (6 levels)
Confidence-specific hyperparameters	Tuned per method on validation set

Table 6. Confidence-method hyperparameters selected on the validation set. Values were chosen to maximize validation-set ROI for each cryptocurrency independently.

Method	Parameter	BTC	ETH	LTC
CA	$γ_{CA}$	5	5	7
TDC	W	12	24	12
SN	$λ_{SN}$	0.5	0.5	1.0
SN	k	10	20	10
AMS	$β$	1.0	2.0	1.0
AMS	W	12	12	24
STS	$λ_{STS}$	0.1	0.1	0.5

Table 7. Test-set performance across all 18 conditions. ROI and MDD are in percent; SR is unitless. Best value in each column is shown in bold. Mean columns average across the three cryptocurrencies.

Method	ROI BTC (%)	ROI ETH (%)	ROI LTC (%)	SR BTC	SR ETH	SR LTC	MDD BTC (%)	MDD ETH (%)	MDD LTC (%)	Mean ROI (%)	Mean SR	Mean MDD (%)
Baseline	7.2	5.8	4.1	0.41	0.33	0.28	24.6	28.1	31.4	5.7	0.34	28.0
CA	22.8	19.4	16.7	1.47	1.22	1.05	15.3	17.8	19.6	19.6	1.25	17.6
TDC	13.1	10.6	8.9	0.82	0.68	0.54	19.8	22.4	25.1	10.9	0.68	22.4
SN	28.4	24.9	21.3	1.83	1.56	1.31	12.8	14.9	17.2	24.9	1.57	15.0
AMS	17.5	14.8	12.2	1.09	0.91	0.77	17.1	19.6	22.3	14.8	0.92	19.7
STS	9.8	7.4	5.6	0.58	0.44	0.36	22.1	25.3	28.7	7.6	0.46	25.4
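
The Mean columns of Table 7 can be reproduced from the per-asset values with a short arithmetic check. The sketch below copies the numbers from the table; the dictionary layout and the function name `mean_columns` are illustrative choices, and the rounding precision is assumed to match the table (one decimal for ROI and MDD, two for SR).

```python
# Per-asset test-set values copied from Table 7, ordered (BTC, ETH, LTC).
TABLE_7 = {
    "Baseline": {"ROI": (7.2, 5.8, 4.1), "SR": (0.41, 0.33, 0.28), "MDD": (24.6, 28.1, 31.4)},
    "CA": {"ROI": (22.8, 19.4, 16.7), "SR": (1.47, 1.22, 1.05), "MDD": (15.3, 17.8, 19.6)},
    "TDC": {"ROI": (13.1, 10.6, 8.9), "SR": (0.82, 0.68, 0.54), "MDD": (19.8, 22.4, 25.1)},
    "SN": {"ROI": (28.4, 24.9, 21.3), "SR": (1.83, 1.56, 1.31), "MDD": (12.8, 14.9, 17.2)},
    "AMS": {"ROI": (17.5, 14.8, 12.2), "SR": (1.09, 0.91, 0.77), "MDD": (17.1, 19.6, 22.3)},
    "STS": {"ROI": (9.8, 7.4, 5.6), "SR": (0.58, 0.44, 0.36), "MDD": (22.1, 25.3, 28.7)},
}

# Assumed display precision for each metric.
DECIMALS = {"ROI": 1, "SR": 2, "MDD": 1}


def mean_columns(table):
    """Average each metric across the three assets and round to table precision."""
    return {
        method: {
            metric: round(sum(values) / len(values), DECIMALS[metric])
            for metric, values in metrics.items()
        }
        for method, metrics in table.items()
    }
```

Calling `mean_columns(TABLE_7)` gives, for instance, a mean ROI of 24.9, a mean SR of 1.57, and a mean MDD of 15.0 for SN, in agreement with the reported Mean columns.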

Table 8. Test-period trading behavior metrics, averaged across BTC, ETH, and LTC. Position size is reported as the USD value of each trade’s quantity at execution price. Values describe how each confidence method alters the agent’s downstream behavior compared to the confidence-free baseline.

Method	Trades (Count)	Mean Pos. Size (USD)	Pos. Size CV	Win Rate (%)	Direction Flip Rate (%)
Baseline	428	18,200	0.74	48.1	42.6
CA	287	21,800	0.52	61.4	31.8
TDC	264	19,400	0.68	54.2	22.5
SN	241	23,600	0.47	66.7	28.3
AMS	302	20,900	0.38	57.9	33.4
STS	401	18,700	0.71	50.3	40.1

Table 9. Computational cost of each confidence method. Training time is reported per 10,000 training steps; inference overhead is the additional per-timestep cost of computing the confidence value during training, relative to the baseline reward computation.

Method	Training Time (s/10k Steps)	Inference Overhead (% Over Baseline)
Baseline	142.3	0
CA	142.6	0.2
TDC	142.5	0.1
SN	168.9	18.7
AMS	142.7	0.3
STS	159.4	12.0
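
As a rough cross-check on Table 9, the relative increase in training time over the baseline can be computed from the first column. The sketch below is only an arithmetic illustration with a hypothetical function name; the table defines its second column as a per-timestep overhead, so the close agreement between the two is an observation about the reported numbers, not a statement of how that column was derived.

```python
# Training time in seconds per 10,000 steps, copied from Table 9.
TRAIN_SECONDS_PER_10K = {
    "Baseline": 142.3,
    "CA": 142.6,
    "TDC": 142.5,
    "SN": 168.9,
    "AMS": 142.7,
    "STS": 159.4,
}


def relative_increase_pct(times, baseline="Baseline"):
    """Percentage increase in training time of each method over the baseline."""
    base = times[baseline]
    return {name: round(100.0 * (t - base) / base, 1) for name, t in times.items()}
```

For the values above, SN comes out at 18.7% and STS at 12.0%, while CA, TDC, and AMS all stay below 0.5%.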

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
