Diffusion-Q Synergy (DQS): A Generative Approach to Policy Optimization via Denoised Action Spaces

Li, Ao; Zhu, Xinghui; Que, Haoyi

doi:10.3390/app151810141

Open AccessArticle

Diffusion-Q Synergy (DQS): A Generative Approach to Policy Optimization via Denoised Action Spaces

by

Ao Li

¹

,

Xinghui Zhu

^1,* and

Haoyi Que

^2,*

¹

School of Information Science and Technology, Hunan Agricultural University, No. 1 Nongda Road, Furong District, Changsha 410128, China

²

School of Artificial Intelligence, Shenzhen Polytechnic University, 7098 Liuxian Boulevard, Nanshan District, Shenzhen 518055, China

^*

Authors to whom correspondence should be addressed.

Appl. Sci. 2025, 15(18), 10141; https://doi.org/10.3390/app151810141

Submission received: 8 July 2025 / Revised: 10 September 2025 / Accepted: 12 September 2025 / Published: 17 September 2025

Download

Browse Figures

Versions Notes

Abstract

In this paper, we propose a novel algorithm that integrates diffusion models with reinforcement learning, called Diffusion-Q Synergy (DQS). The methodology formalizes an equivalence relationship between the iterative denoising process in diffusion models and the policy improvement mechanism in Markov Decision Processes. Central to this framework is a dual-learning mechanism: (1) a parametric Q-function is trained to evaluate noise prediction trajectories through temporal difference learning, effectively serving as a differentiable critic for action quality assessment; and (2) this learned Q-scoring function is then structurally integrated into the training objective of a conditional diffusion model, formulating a constrained optimization problem that simultaneously maximizes expected returns while minimizing policy deviation from behavioral priors. The algorithmic superiority of DQS stems from its hybrid architecture combining the (i) diffusion policy cloning for stable behavior regularization and (ii) adaptive noise rectification through Q-value-guided key denoising step correction, which is particularly effective for refining suboptimal action sequences, thereby guiding the entire diffusion trajectory toward policy optimality. Rigorous ablation studies across benchmark environments demonstrate statistically significant performance improvements (

p < 0.01

) over baseline methods in both computational efficiency and asymptotic policy quality. The implementation has been open-sourced at AOLIGOOD/Diffusion_Q_Synergy, to facilitate reproducibility.

Keywords:

reinforcement learning; diffusion models; diffusion policy; policy optimization

1. Introduction

Contemporary industrial systems have experienced unprecedented expansion and development, with control objects evolving toward higher-dimensional, increasingly nonlinear, and more complex operational objectives. This paradigm shift has imposed substantially elevated demands on control system design. Reinforcement learning (RL) has emerged as a particularly promising solution, as it obviates the need for complete a priori knowledge of environmental dynamics. This capability enables robotic systems, automated equipment, and other intelligent agents to autonomously adapt and optimize their behavior in dynamic, uncertain environments, thereby enhancing operational efficiency, minimizing resource expenditure, and facilitating more sophisticated task execution. Notably, in domains such as Large Language Models (LLMs), embodied intelligence, and autonomous driving, RL has proven instrumental in developing more flexible and responsive solutions that drive technological innovation.

The conceptual foundations of RL can be traced back to the mid-20th century, when Minsky [1] first formalized the notions of “reinforcement” and “reinforcement learning”. His seminal work emphasized the critical role of reward–punishment mechanisms in learning processes and identified trial-and-error as the fundamental paradigm of RL.

From a methodological perspective, dynamic programming provided essential mathematical tools for RL development. Bellman’s 1960 formulation of dynamic programming [2,3] not only established the theoretical framework for solving Markov Decision Processes (MDPs) but also introduced the Bellman equation, which remains the cornerstone of modern RL algorithms. Subsequent work on policy iteration methods further refined the MDP solution framework. However, these early developments remained largely theoretical until the breakthrough introduction of Q-learning [4]. This algorithm enabled practical applications by discovering optimal action policies without requiring explicit reward functions or state transition models. As a model-free approach that provably converges to optimal policies in deterministic MDP, Q-learning represented a watershed moment in RL research.

The 21st century witnessed transformative advances with the integration of deep learning. In 2013, DeepMind’s pioneering work [5] combined deep neural networks with Q-learning to create Deep Q-Networks (DQNs), effectively addressing the critical challenge of high-dimensional state space representation. This innovation spawned numerous advanced algorithms, including DDQN [6], TRPO [7], PPO [8], DDPG [9], A3C [10], SAC [11], TD3 [12], and [13,14,15], collectively propelling RL into mainstream artificial intelligence research and applications.

While deep neural networks have significantly enhanced the capability of reinforcement learning agents to handle complex decision-making tasks, conventional RL methodologies necessitate continuous environmental interaction for experience collection during agent training. This exploration process often incurs substantial costs in real-world applications. Specifically, agents typically require millions of interactions to develop effective policies, which proves particularly problematic in domains such as robotic control and autonomous driving—scenarios where such extensive interaction is not only cost-prohibitive but also poses significant safety risks. Furthermore, the inherent randomness in environmental interactions coupled with frequent policy updates during exploration may lead to training gradient instability.

Offline reinforcement learning (Figure 1), also known as batch reinforcement learning, has emerged as a promising research direction to address these challenges. This paradigm enables policy learning exclusively from pre-collected datasets, eliminating the need for online environmental interaction. The principal advantages of this approach are threefold: First, the use of fixed datasets substantially improves training stability; second, the elimination of real-time interaction allows for the optimal utilization of historical data, particularly valuable in high-cost or safety-critical applications (e.g., healthcare, autonomous vehicles, and robotics); finally, offline RL facilitates policy optimization using large-scale offline datasets, thereby mitigating traditional RL’s sample inefficiency while enhancing data utilization.

However, offline RL introduces unique challenges. Since it relies solely on pre-collected data, standard RL methods encounter distributional shift during policy optimization, generating out-of-distribution actions. Q-learning exacerbates this issue through extrapolation error, overestimating the value of OOD (out-of-distribution actions). This overestimation triggers a destructive feedback loop: the policy increasingly selects overestimated OOD actions, and the Q-function amplifies these errors, ultimately resulting in significant policy performance degradation.

Prior research in offline RL has typically addressed this problem through one of four approaches: 1. Policy Constraints [12,16,17,18,19,20]: Restricting the policy from selecting actions that deviate substantially from the offline data distribution. While effective in preventing OOD action selection and subsequent Q-value overestimation, this method renders policy learning critically dependent on dataset quality. 2. Uncertainty Estimation [21,22]: Adjusting policies based on estimated uncertainty (e.g., in policy or value functions) to avoid decisions in high-uncertainty regions. Practical implementations often underestimate uncertainty, potentially increasing OOD action selection. 3. Regularization [19,23,24]: Incorporating regularization terms without explicit policy constraints. Generally less conservative than policy constraints, this approach typically achieves superior performance in practice. 4. Trajectory Optimization [25,26]: Framing RL as a sequence generation task with multiple state–action anchor points throughout trajectories. While theoretically eliminating distributional shift, this method demands substantial computational resources and exhibits slow prediction times.

In the majority of prior work, the policy is typically parameterized as a Gaussian distribution, with its mean and (often diagonal) covariance determined by the output of a neural network. However, in the context of offline reinforcement learning, datasets are frequently collected from a mixture of diverse behavior policies. Consequently, the true underlying behavior policy may exhibit complex characteristics such as strong multimodality, significant skewness, or intricate dependencies between different action dimensions. A standard diagonal Gaussian policy inherently struggles to adequately model these complex properties [27]. This inherent limitation in policy representation capacity can severely constrain the expressiveness of the learned policy. As a result, offline reinforcement learning methods that rely on this restrictive parameterization, particularly those incorporating policy regularization, often tend to converge to suboptimal policies and may exhibit performance that is slightly inferior compared to alternative approaches.

Owing to the superior performance of diffusion models, recent studies have increasingly adopted them as policy networks, as exemplified by Diffuser [28] and Diffusion Policy [29]. However, these methods rely solely on behavioral cloning for policy training without directly optimizing the policy, leading to performance that critically depends on the neural network’s scale and often suffers from suboptimal real-time efficiency. In contrast, Diffusion-QL aligns with our approach by leveraging the Q-function for policy improvement to learn the optimal policy. Nevertheless, its training process necessitates frequent sampling, which substantially increases computational overhead. Notably, if we model the action distribution of the dataset as a perturbed optimal policy distribution, based on prior studies [30], the later stages of the denoising process have the most significant impact on sample quality, where the noise prediction near

t = 0

can be viewed as estimating the perturbed noise of the optimal policy distribution. This insight enables highly efficient sampling during training with only a minimal number of diffusion steps.

Building upon these observations, we propose DQS, a novel reinforcement learning algorithm that formalizes the dataset’s action distribution as a noise-injected optimal policy distribution and employs a diffusion (or score-based) model for policy regularization. Specifically, DQS utilizes a multilayer perceptron (MLP)-based Denoising Diffusion Probabilistic Model (DDPM) as its policy. The training objective comprises two key components: 1. a behavioral cloning term, ensuring alignment with the empirical action distribution of the training data, and 2. a policy optimization term, where the Q-function guides the diffusion model in noise prediction during the later stages of denoising. Crucially, this framework enables highly efficient training by requiring only minimal denoising steps during sampling.

In summary, this paper contributes DQS, an improved offline RL algorithm that leverages diffusion models’ exceptional data distribution learning capability for precise policy regularization while successfully employing Q-learning to guide noise prediction in suboptimal actions toward optimal action discovery. Comprehensive evaluation on the D4RL offline RL benchmark demonstrates that DQS achieves performance comparable to or superior to recent offline RL baselines.

2. Preliminaries

This section systematically delineates the formal definition of Markov Decision Processes, the application of weighted regression to policy improvement, and the mathematical underpinnings of diffusion models. This exposition establishes the theoretical foundation for the subsequent development of methodologies.

2.1. Offline Reinforcement Learning

A Markov Decision Process (MDP) is formally defined as a tuple

M = (S, A, P, R, γ)

.

S

represents the state space, consisting of the set of all possible states.

A

represents the action space, consisting of the set of all possible actions.

P (s^{'} | s, a)

is the state transition probability function, which specifies the probability of transitioning to state

s^{'}

from state s when taking action a. This function satisfies the property

\sum_{s^{'}} P (s^{'} | s, a) = 1

for all

s \in S, a \in A

.

R (s, a)

is the reward function, which defines the immediate reward received upon taking action a in state s.

γ \in [0, 1)

is the discount factor, which weights the importance of immediate versus future rewards.

Markov Decision Processes constitute the theoretical foundation for reinforcement learning, providing a mathematical framework to model the interaction dynamics between an intelligent agent and its environment. The primary objective in reinforcement learning is to learn an optimal policy

π_{θ}

by maximizing the expected cumulative discounted return, denoted by

J (π_{θ})

:

\begin{matrix} J (π_{θ}) = E_{τ \sim π_{θ}} [\sum_{t = 0}^{\infty} γ^{t} R (s_{t}, a_{t})] \end{matrix}

(1)

Here,

τ = (s_{0}, a_{0}, s_{1}, a_{1}, \dots)

denotes a trajectory (or sequence of state–action pairs) generated by following policy

π_{θ}

.

Specifically, the action–value function

Q^{π_{θ}} (s, a)

is defined as the expected cumulative discounted return obtained by taking action a in state s and thereafter following policy

π_{θ}

:

\begin{matrix} Q^{π_{θ}} (s_{t}, a_{t}) = E_{a \sim π_{θ}} [G_{t} | s_{t} = s, a_{t} = a] \end{matrix}

(2)

where

G_{t} = \sum_{k = t}^{\infty} γ^{k - t} R (s_{k}, a_{k})

is the cumulative discounted return from timestep t.

The action–value function

Q^{π_{θ}}

satisfies the Bellman equation, which offers a recursive relationship between the value of a state–action pair and the expected value of the subsequent state–action pair:

\begin{matrix} Q^{π_{θ}} (s_{t}, a_{t}) = R (s_{t}, a_{t}) + γ E_{s_{t + 1} \sim P, a_{t + 1} \sim π_{θ}} [Q^{π_{θ}} (s_{t + 1}, a_{t + 1}) | s_{t}, a_{t}] \end{matrix}

(3)

Maximizing the reinforcement learning objective function is equivalent to maximizing the expected values of the action–value function.

In contrast to online reinforcement learning, offline reinforcement learning solely utilizes a static dataset

D = {(s_{i}, a_{i}, r_{i}, s_{i}^{'})}

of previously collected transitions for policy optimization, without explicit interaction with the environment. This dependency introduces significant challenges, including distribution shift (or covariate shift) and extrapolation errors, as the learned policy might select actions outside the distribution of the dataset’s behavior policy

μ

. To address these challenges, a common approach in offline RL is to constrain the learned policy

π

to adhere closely to the behavior policy

μ

that generated the dataset

D

, while concurrently optimizing for high Q-values.

This objective is formally expressed as maximizing the expected Q-value under the behavior policy’s state distribution

ρ_{μ} (s)

, penalized by the KL divergence between the learned policy

π (\cdot | s)

and the behavior policy

μ (\cdot | s)

:

\begin{matrix} arg max_{π} & \int_{S} ρ_{μ} (s) \int_{A} π (a | s) Q_{ϕ} (s, a) d a d s \\ - \frac{1}{α} \int_{S} ρ_{μ} (s) D_{KL} (π (\cdot | s) ∥ μ (\cdot | s)) d s . \end{matrix}

(4)

Here,

Q_{ϕ} (s, a)

denotes the learned action–value function parameterized by

ϕ

,

ρ_{μ} (s)

is the stationary state distribution induced by the behavior policy

μ

, and

D_{KL}

represents the Kullback-Leibler divergence. The term

- \frac{1}{α} D_{KL} (π (\cdot | s) ∥ μ (\cdot | s))

acts as a regularization term, scaled by

\frac{1}{α}

(where

α

is a hyperparameter) to penalize deviations of the learned policy

π

from the behavior policy

μ

.

2.2. Policy Improvement via Weighted Regression

The optimal policy

π^{*}

for the objective in Equation (4) is derived using Lagrange multipliers, yielding [17]

π^{*} (a | s) = \frac{1}{Z (s)} μ (a | s) exp (α Q_{ϕ} (s, a)),

(5)

where

Z (s)

is the partition function. This expression for

π^{*}

represents a policy improvement step.

Directly sampling from

π^{*}

requires explicit modeling of the behavior policy

μ

. This poses a considerable challenge, particularly in continuous action spaces, as

μ

can be highly complex or multimodal. Previous methods [17] address this challenge through projecting the optimal policy

π^{*}

onto a parameterized policy class

π_{θ}

. This projection is typically accomplished by minimizing the KL divergence between

π^{*}

and

π_{θ}

:

\begin{matrix} arg min_{θ} E_{s \sim D^{μ}} [D_{KL} (π^{*} (\cdot | s) ∥ π_{θ} (\cdot | s))] \\ = & arg max_{θ} E_{(s, a) \sim D^{μ}} [\frac{1}{Z (s)} log π_{θ} (a | s) exp (α Q_{ϕ} (s, a))] . \end{matrix}

(6)

This approach is typically referred to as weighted regression or advantage-weighted regression (specifically when Q is replaced by the advantage

A = Q - V

), where

exp (α Q_{ϕ} (s, a))

serves as the regression weight.

2.3. Diffusion Models

Diffusion models [31,32] represent a prominent class of generative models in deep learning, which employ thermodynamic diffusion principles to iteratively reconstruct data from noise. These models have exhibited notable performance in various generation tasks, including image, audio, and video synthesis. The framework encompasses two fundamental phases: the forward process and the reverse process.

Forward Process: The forward process, also known as the diffusion phase, systematically transforms the original data sample

x_{0}

into pure Gaussian noise through T discrete timesteps. This transformation is mathematically represented as

\begin{matrix} x_{t} = α_{t} x_{t - 1} + β_{t} ϵ, ϵ \sim N (0, I) \end{matrix}

(7)

Here,

β_{t}

denotes the noise intensity at step t,

ϵ \sim N (0, I)

represents standard normal distribution noise, and

α_{t}

corresponds to the scaling factor at timestep t.

Reverse Process: The reverse process constitutes the central innovation of diffusion models, which aims to learn the gradual reconstruction of data

x_{0}

from noisy observations

x_{t}

. This parameterized Markov chain is realized through neural network learning, with the transition given by

\begin{matrix} x_{t - 1} = \frac{x_{t} - β_{t} ϵ_{θ} (s_{t}, x_{t}, t)}{α_{t}} + σ_{t} z, z \sim N (0, I) \end{matrix}

(8)

In this formulation,

ϵ_{θ} (s_{t}, x_{t}, t)

denotes the parameterized noise prediction network, while

σ_{t}

represents the noise variance.

The neural network training objective is expressed as

\begin{matrix} L_{D D P M} (θ) = E [| | ϵ (x_{0}, t) - ϵ_{θ} (x_{t}, c, t) {| |}^{2}] \end{matrix}

(9)

3. Diffusion-Q Synergy

As the control requirements of intelligent systems exhibit increasing complexity (e.g., dynamic gait regulation in quadrupedal robots, multimodal decision-making in autonomous vehicles), the limitations inherent in conventional control methodologies when applied to high-dimensional, nonlinear, and time-varying systems have become more pronounced. To effectively address these challenges, a novel offline reinforcement learning algorithm, predicated upon the diffusion model and designated as DQS, is proposed.

This methodology capitalizes on the progressive distribution-fitting capabilities characteristic of diffusion models to execute behavioral cloning based on the policy present within the dataset. Furthermore, through rigorous theoretical analysis, the mathematical equivalence between the reverse denoising process of the diffusion model and the policy learning process is demonstrably established. This equivalence ensures that the learned policy, denoted as

π_{θ}

, maintains alignment with the dataset’s behavioral policy while simultaneously enabling the Q-learning-guided noise prediction network,

ϵ_{θ}

, to generate optimized denoised actions.

3.1. Diffusion Policy Cloning

For the purpose of clearly distinguishing between two distinct temporal processes, the following notation is adopted: Specifically, the subscript

t \in {1 \dots T}

is utilized to denote discrete timesteps associated with the diffusion process. The superscript

n \in {1 \dots N}

is employed to represent timesteps within the reinforcement learning (RL) trajectory.

In the field of reinforcement learning (RL), a policy

π

is formally defined as a conditional probability function that characterizes the probability distribution over actions given a specific state. Concurrently, diffusion models, as a class of generative models, can also be conceptualized as a conditional probability function. This function describes the probabilistic generation of a final output

x_{0}

(e.g., an action a) from an initial random noise

x_{T}

, contingent upon a given state s, through an iterative reverse denoising process. Given the demonstrated efficacy of diffusion models in accurately modeling and generating complex data distributions, it is posited that these models can be directly employed for the parameterization of policies within RL frameworks.

\begin{matrix} π_{θ} (a | s) = P_{θ} (x_{0} | s, x_{T}) \end{matrix}

(10)

In this context,

π_{θ} (a | s)

represents the learned policy, and

P_{θ} (x_{0} | s, x_{T})

represents the reverse process of the diffusion model, indicating the likelihood of the diffusion model generating a clean sample

x_{0}

given the state s and the final noisy sample

x_{t}

.

A rigorous mathematical equivalence is established between the conditional generation process of the diffusion model and the process of RL policy optimization. Consequently, the standard objective function utilized in reinforcement learning may be reformulated as follows:

\begin{matrix} J (θ) & = E_{s \sim D, a \sim π_{θ} (\cdot | s)} [\sum_{t = 0}^{\infty} γ^{t} R (s, a)] \\ = E [D_{K L} (π^{⋆} (a | s) ‖ P_{θ} (x_{0} | s, x_{T}, T))] \\ = E [D_{K L} (π^{⋆} (a | s) ‖ π_{θ} (a | s))] \end{matrix}

(11)

Here,

θ

represents the set of learnable parameters, and

π^{⋆} (a | s)

denotes the optimal policy.

Remark 1.

π^{⋆}

is defined as the policy that maximizes

J (θ)

:

\begin{matrix} π^{⋆} = arg max J (π) \end{matrix}

(12)

If

π^{⋆}

matches

π_{θ}

, i.e.,

π^{⋆} = π_{θ}

, then

J (θ)

must reach its maximum value. Therefore, minimizing the discrepancy between

π_{θ}

and

π^{⋆}

can be regarded as a surrogate objective for maximizing

J (θ)

, as expressed in Equation (12).

Remark 2.

Within the framework presented in Equation (10), the policy improvement process characteristic of reinforcement learning is modeled as the reverse denoising trajectory of the diffusion model, wherein a clean sample

x_{0}

is generated from noise ϵ. This specific formulation necessitates two conditions: 1. The diffusion model must possess the requisite capacity to approximate arbitrary policy distributions, a capability previously demonstrated in ref. [33]. 2. Knowledge or reliable estimability of the optimal policy

π^{⋆}

is required. It is pertinent to note that in the context of offline RL,

π^{⋆}

is typically unknown and must therefore be inferred from a dataset that has been generated by a suboptimal behavioral policy, denoted by μ.

To apply

D_{K L}

definition to the objective function,

\begin{matrix} J (θ) = E_{s \sim ρ_{μ} (s)} [E_{a \sim π^{*} (\cdot | s)} [log π^{*} (a | s) - log P_{θ} (x_{0} | s, x_{T})]] \end{matrix}

(13)

Following the methodology established in [31], the diffusion model’s training objective (Equation (9)) relates to the noise prediction error:

\begin{matrix} - log P_{θ} (x_{0} | s, x_{T}) \propto \sum_{t = 1}^{T} E_{ϵ \sim N (0, I)} [| | ϵ - ϵ_{θ} (s, x_{t}, t) {| |}^{2}] . \end{matrix}

(14)

Thus, minimizing

J (θ)

is equivalent to minimizing the diffusion model’s noise prediction error.

3.2. Q-Learning-Guided Noise Prediction

Remark 3.

Q-Learning-guided can be reformulated through the application of KL divergence to quantify the dissimilarity between the optimal policy

π^{⋆}

and the denoised action distribution

P_{θ} (x_{0} | a_{D}, s)

for a given state, thereby facilitating the guidance of noise prediction for suboptimal actions. However, analogous to the challenge encountered in Equation (12), the optimal policy remains unknown, necessitating the determination of its relationship with other components. As delineated in the preliminaries (Equation (5)), connections have been established between the behavioral policy μ and the optimal policy

π^{⋆}

, as well as between the parameterized policy and

π^{⋆}

. Correspondingly, a relationship between the denoising model

ϵ_{θ}

and the optimal policy can be derived.

Building upon the weighted regression formulation presented in Equation (6), it is further derived that

\begin{matrix} π^{⋆} (\cdot | s) \propto P_{θ} (x_{0} | a_{D}, s) \cdot exp (α Q (s, x_{0})) \end{matrix}

(15)

In the context of offline reinforcement learning, weighted regression imposes a constraint on the policy to maintain proximity to

μ

while employing the Q-value as regression weights to direct policy optimization. Similarly, Q-Learning-guided term within the objective function adjusts the direction of noise prediction while preserving the overall noise prediction trajectory (a characteristic ensured by the behavior cloning term), thereby optimizing the policy to better approximate the optimal policy. Consequently,

J (θ)

is simplified by adapting the weighted regression methodology:

\begin{matrix} J (θ) = E_{s \sim ρ_{μ} (s)} [E_{a \sim π^{*} (\cdot | s)} [log (μ (\cdot | s) + α E_{a \sim μ (\cdot | s)} [Q (s, a)] - log P_{θ} (x_{0} | s, x_{T})]] . \end{matrix}

(16)

where

α

functions as a normalization factor for the Q-function.

Substituting Equation (14) into the

J (θ)

objective gives

\begin{matrix} J (θ) = E_{s \sim ρ_{μ} (s)} [\sum_{t = 1}^{T} E_{ϵ \sim N (0, I)} [| | ϵ - ϵ_{θ} (s, x_{t}, t) {| |}^{2}] + α E_{a \sim μ (\cdot | s)} [Q (s, a)]] . \end{matrix}

(17)

As indicated in Equation (17), the objective

J (θ)

is ultimately reduced to the evaluation of generated actions

x_{0}

through the Q-network

Q_{ϕ}

. This process facilitates the adjustment of the diffusion model’s parameters

θ

to enable

ϵ_{θ}

to yield higher expected Q-values. Given that the objective

J (θ)

is defined by leveraging the Q-function, the accuracy of Q-value estimation is rendered critical. To mitigate potential overestimation issues, double Q-learning is employed for Q-function estimation. Specifically, two Q-networks,

Q_{ϕ_{1}}

and

Q_{ϕ_{2}}

, are utilized to independently estimate action values, and their parameters are updated by minimizing the discrepancy between their estimates. This approach serves to reduce Q-value variance and enhance the stability of policy optimization. Furthermore, target networks are implemented to stabilize the training procedure. These networks, updated through Exponential Moving Average (EMA), contribute to relative stability during training by serving as delayed-update replicas for the computation of Q-value targets. The objective function for Q-learning is formulated as

\begin{matrix} L_{Q} (ϕ) = E_{(s, a) \sim D} [{(r + γ min_{i = 1, 2} Q_{ϕ_{i}^{'}} (s^{'}, a^{'}) - Q_{ϕ_{i}} (s, a))}^{2}] \end{matrix}

(18)

Reducing the function to its simplest form, the complete optimization objective is expressed as

\begin{matrix} J (θ) = {(ϵ - ϵ_{θ} (s, x_{t}, t))}^{2} + α E_{(s, a_{D}) \sim D, x_{0} \sim P_{θ} (\cdot | s, a_{D})} [Q_{ϕ} (s, x_{0})] \end{matrix}

(19)

This section presents the derivation of the policy optimization term as a formulation that maximizes the expected Q-value of denoised actions. This derivation is predicated upon weighted regression under the premise of an unknown optimal policy. Through the application of double Q-learning and target networks, the stability and accuracy of Q-value estimation are enhanced. The final optimization objective integrates both the noise prediction error and Q-function-guided policy optimization, thereby ensuring behavior cloning while simultaneously adjusting the direction of noise prediction to approximate the optimal policy.

3.3. Algorithm Details

To help readers better understand our algorithm process, we will introduce the main steps of our algorithm below [34]:

1.

Initialization Phase: Initially, the noise prediction network

ϵ_{θ}

and dual Q-learning networks

Q_{ϕ_{1}}, Q_{ϕ_{2}}

are constructed and initialized, along with their corresponding target networks

Q_{ϕ_{1}^{'}}, Q_{ϕ_{2}^{'}}

and

ϵ_{θ^{'}}

. Subsequently, the offline dataset

D_{μ}

is loaded to acquire state-action-next state-reward tuples (

s, a, s^{'}, r

) for use as training samples.

2.

Critic Training Phase: During this phase, potential actions

a^{'}

for subsequent states

s^{'}

are generated. The target Q-values are computed using the Bellman equation and the target networks. The loss function (Equation (18)) is computed by minimizing the mean squared error between the current state–action Q-values predicted by

Q_{ϕ_{1}}

and

Q_{ϕ_{2}}

and their target values. The backpropagation process is then employed to update the parameters (

ϕ_{1}

and

ϕ_{2}

) of the dual Q-network, thereby enhancing value estimation accuracy.

3.

Policy Training Phase: The policy training procedure comprises two principal components:

(a): Behavior Cloning Term: Starting from actions within the dataset $a_{D}$ , intermediate noisy states ( $x_{t}$ ) are generated through random timestep noising ( $ϵ \sim N (0, I)$ ). These states, along with the corresponding timestep (t), are input into the noise prediction network $ϵ_{θ}$ to yield predicted noise ( $ϵ_{θ} (s, x_{t}, t)$ ). The loss function is computed by minimizing the mean squared error between the actual noise ( $ϵ$ ) and the predicted noise ( $ϵ_{θ} (s, x_{t}, t)$ ) for the current state (s) and timestep (t).
(b): Policy Improvement Term: Optimized actions ( $x_{0}$ ) are generated through the denoising process, starting from noisy samples derived from dataset actions $a_{D}$ . Policy gradient signals are constructed by evaluating the Q-function values ( $Q_{ϕ}$ ) for state s and the optimized action $x_{0}$ .

The compound loss function (Equation (19)) integrates these two components, with gradient descent utilized to update the parameters (

θ

) of the noise prediction network.

4.

Target Network Update: An Exponential Moving Average (EMA) mechanism is employed to execute soft updates on the target network parameters (

ϕ_{1}^{'}

,

ϕ_{2}^{'}

, and

θ^{'}

). This gradual update strategy effectively stabilizes the training process by mitigating instability induced by high target value variance, thereby promoting steady improvement in model performance.

During both policy training and critic training, policy gradient clipping is applied to ensure training stability. The pseudocode and specific algorithm are shown in the following Algorithm 1.

Algorithm 1 DQS

Require: Initialize the score-based model

ϵ_{θ}

, training parameters

ϕ_{1}

and

ϕ_{2}

of the action
evaluation model

Q_{ϕ}

, training parameters

ϕ_{1}^{'}

and

ϕ_{2}^{'}

of the target action evaluation
model

Q_{ϕ^{'}}

,Loading the Dataset

D_{μ}

.

1:: for each gradient step do
2:: $(s, a_{D}, r, s^{'}) \sim D_{μ}$ , $i \sim Uniform (1, N)$ , $ϵ \sim N (0, I)$
3:: // Training Critic
4:: Sample $a^{*}$ by $a^{*} = \frac{x_{t} - β ϵ_{θ} (s, x_{t}, t)}{α}$
5:: Update $ϕ_{1}$ and $ϕ_{2}$ by (18)
6:: // Training Policy
7:: Adding Noise to $a_{D}$ to get $a_{n}$
8:: Recovering $a^{*}$ by $a^{*} = \frac{a_{n} - β ϵ_{θ} (s, x_{t}, t)}{α} for t > n$
9:: Update $θ$ by (19)
10:: // Update Target Network
11:: Update $ϕ_{1}^{'}$ and $ϕ_{2}^{'}$ by EMA
12:: Update $θ^{'}$ by EMA
13:: end for

4. Experimental Evaluation

The efficacy of the proposed methodology was systematically evaluated using the D4RL benchmark dataset [35], a widely recognized standard in offline reinforcement learning. To comprehensively validate the algorithm’s performance within the domain of intelligent control systems, the standard Gym test environments were selected as the evaluation platform. This test suite comprises three representative locomotion tasks: Hopper-v2, Walker2d-v2, and HalfCheetah-v2. Additionally, the Kitchen-v2 environment was included to assess the algorithm’s capability in complex manipulation tasks. Each task presents distinct challenges in biomechanical control, simulating various robotic locomotion and manipulation paradigms.

The Hopper-v2 environment models the control dynamics of a monopedal robotic system, analogous to the locomotion observed in kangaroos. This task requires the precise regulation of torque at the agent’s articulated joints to achieve stable forward hopping while concurrently maintaining dynamic equilibrium. The primary control objective is the joint optimization of forward velocity and postural stability to prevent instability or falls.

In the Walker2d-v2 environment, the control of a bipedal robotic system is simulated, emulating the mechanical principles of human ambulation. Effective control in this environment necessitates the coordinated application of torque across multiple leg joints to generate and sustain stable gait patterns. The central control challenge lies in the simultaneous maximization of forward velocity and preservation of upright balance.

The HalfCheetah-v2 environment simulates the control requirements of a quadrupedal robotic platform, inspired by the locomotion of cheetahs. Successful performance in this task demands sophisticated inter-limb coordination to achieve high-speed running gaits. The paramount control objective centers on maximizing horizontal velocity through the optimal distribution of torque across all actuated joints.

The Kitchen environment introduces a new dimension of evaluation by simulating a robotic manipulation scenario with multiple articulated objects. This environment tests the agent’s ability to perform sequential manipulation tasks, such as opening microwave doors, moving kettles, and turning knobs, requiring precise motor control and long-term task planning. The key challenge lies in coordinating high-dimensional continuous control while maintaining object interactions and avoiding collisions in a constrained workspace.

4.1. Comparison with Other Methods

Baselines

To establish a comprehensive performance benchmark, the proposed approach was systematically evaluated against several representative baselines from distinct methodological categories.

Within the domain of regularization-based offline reinforcement learning, baselines include Conservative Q-Learning (CQL) [23], TD3+BC [12], DQL [36], AWR [17], SAC [11], BEAR [37], BCQ [16], and Implicit Q-Learning (IQL) [19]. Within the model-based offline RL paradigm, Model-Based Offline Planning (MBOP) [38] was considered. Additionally, a comparative analysis was performed against trajectory prediction approaches, specifically Decision Transformer (DT) [25], Trajectory Transformer (TT) [26], and Diffuser [28].

The reported performance metrics for baseline methods are derived from either (1) the optimal results published in their respective original papers, or (2) the standardized evaluations conducted by ref. [35]. Our experimental results (see Table 1) were evaluated across five random seeds and 150 episodes, with performance variance quantified using standard deviation to ensure statistical reliability. This ensures a fair and consistent comparison framework. We employed permutation testing to assess statistical significance, with the detailed implementation and hyperparameters of our method provided in Appendix A, Table A1. Furthermore, the training details of our method are comprehensively described in Figure A1, Figure A2, Figure A3 and Figure A4.

Our method’s suboptimal performance in Gym tasks compared to DQL [36] might stem from the Q-function primarily guiding the denoising process only during its later stages. This late-stage guidance could be detrimental to effective motor control. Conversely, our method’s superior performance in Kitchen tasks could be attributed to the Q-function’s guidance during the earlier stages of the denoising process, which appears to be either less critical or even counterproductive for this specific task.

Table 2 presents a comparison of the computational efficiency of an algorithm named DQS against several baseline algorithms on an NVIDIA 4090 GPU. The table shows two key metrics: training time and inference time. Training time is defined as the duration required to complete one million training steps. Inference time is measured after 150 inference episodes.

4.2. Impact of Hyperparameters

The number of denoising steps, denoted as t, within diffusion models is a critical hyperparameter that significantly influences both the quality of generated samples and the efficiency of the inference process. Generally, increasing the number of denoising steps facilitates the progressive refinement of the output through more meticulous noise removal, thereby yielding samples characterized by enhanced fidelity and realism. Each successive step applies subtle adjustments to the generated data, with the cumulative effect leading to a sharper final result. However, each additional denoising step necessitates a distinct forward pass through the model, consequently increasing the total generation time. Thus, selecting the number of denoising steps involves a fundamental trade-off between generation quality and computational efficiency. In applications such as control systems where real-time performance is paramount, minimizing the value of t becomes essential. As depicted in Figure 2, empirical analysis conducted in this study indicates that setting

t = 5

achieves an optimal balance between sample quality and real-time performance requirements.

The alternative action parameter, denoted as n, governs the generation of N candidate actions for a given state s. These candidates are subsequently evaluated utilizing the learned critic

Q_{ϕ}

prior to resampling the optimal action. This parameter similarly impacts both generation quality and computational efficiency. In principle, exploring multiple action candidates enables the model to select superior subsequent actions, potentially mitigating suboptimal local convergence and enhancing overall output quality. This process is analogous to conducting a broader search within the action space. While generating N candidate actions inherently demands greater computational resources compared to single-action generation, the parallelizable nature of this process renders the impact of N on generation speed less pronounced than that of t. Consequently, the parameter selection space for n typically affords greater flexibility. As demonstrated in Figure 3, experimental results indicate that setting

n = 50

optimally reconciles generation quality with real-time constraints.

4.3. Ablation Study

This section presents a comprehensive analysis validating the superior quantitative performance of Diffusion-Q Synergy (DQS) compared to other policy-constrained methods on D4RL benchmark tasks. Ablation studies were conducted to examine the two principal components of DQS: (1) the utilization of diffusion models as expressive policies and (2) Q-learning-guided denoising of suboptimal actions. Regarding the policy component, behavioral cloning comparisons were performed between the diffusion model and prevalent Transformer architectures. With respect to the Q-learning component, policy improvement was evaluated against BCQ methods.

As evidenced in Table 3, diffusion-based behavioral cloning (Diffusion-BC) demonstrates performance superior to that of baseline approaches. This outcome validates the diffusion policy’s greater expressiveness and enhanced capability in fitting data distributions within control domains. The performance advantage of DQS over Diffusion-BC substantiates that the Q-learning-guided denoising of suboptimal actions provides additional policy enhancement. Table 3 further reveals that the Q-learning component of DQS yields more significant policy improvement than baseline methods, such as BCQ. Collectively, these results demonstrate the synergistic interaction between DQS’s dual components, which culminates in superior overall performance.

We conducted systematic evaluations of the actions generated at each temporal step under a fixed sampling horizon (T = 5) within the experimental environment. As depicted in Figure 4, these actions are referred to as “Sampling samples”. Our analysis reveals that the action outputs of the diffusion policy exhibit predominant dependence on the terminal sampling step. Importantly, the implementation of guidance mechanisms at this final sampling stage yields substantial improvements in the quality of generated actions.

5. Conclusions

The critical challenges of distributional shift and complex policy distributions in offline RL are investigated in this paper. Through rigorous theoretical analysis, an equivalence is established between policy optimization and the reverse process of diffusion models. This enables the interpretation of dataset actions as actions from the optimal policy that have been contaminated by noise. Building upon this insight, DQS is proposed; this novel algorithmic framework addresses these fundamental limitations. Experimental evaluations conducted on the D4RL benchmark demonstrate that the proposed method achieves superior performance compared to state-of-the-art approaches across most tasks, while maintaining computational efficiency (diffusion steps = 5, candidate actions = 50). However, the empirical results reveal certain limitations, particularly high variance in D4RL scores, which indicates potential instability in policy optimization. To mitigate this issue, future research may consider incorporating consistency models and event-triggered sampling to balance sampling efficiency and performance. In summary, this work contributes a new theoretical perspective and represents a methodological advancement within the field of offline reinforcement learning.

Author Contributions

Conceptualization, A.L.; Methodology, A.L.; Formal analysis, A.L.; Investigation, A.L.; Writing—original draft, A.L.; Writing—review & editing, X.Z. and H.Q.; Supervision, X.Z. and H.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding authors.

Acknowledgments

I would like to express my deepest gratitude to all those who contributed to the completion of this paper. I also extend my sincere thanks to Man Liu and Jile Li for their technical assistance and constructive discussions. Lastly, I owe my heartfelt thanks to my family and friends for their unwavering support and understanding during this challenging yet rewarding journey.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

RL	Reinforcement Learning
DQS	Diffusion-Q Synergy
MDP	Markov Decision Process
OOD	Out-of-Distribution
MLP	Multilayer Perceptron

Appendix A. Hyperparameter

The training hyperparameters of DQS in different tasks, including the learning rate

η

, noise scheduling strategy, weight

α

of the Q-guided loss function, and GradNorm are provided in Table A1.

Table A1. The hyperparameter settings for all evaluation tasks.

Task	$η$	$α$	Grad Norm	Noise Scheduling
Medium-Expert-HalfCheetah	$3 \times 10^{- 4}$	1.0	7.0	Cosine
Medium-Expert-Hopper	$3 \times 10^{- 4}$	1.0	5.0	Cosine
Medium-Expert-Walker2d	$3 \times 10^{- 4}$	1.0	5.0	Cosine
Medium-HalfCheetah	$3 \times 10^{- 4}$	1.0	9.0	Cosine
Medium-Hopper	$3 \times 10^{- 4}$	1.0	9.0	Cosine
Medium-Walker2d	$3 \times 10^{- 4}$	1.0	1.0	Cosine
Medium-Replay-HalfCheetah	$3 \times 10^{- 4}$	1.0	2.0	Cosine
Medium-Replay-Hopper	$3 \times 10^{- 4}$	1.0	4.0	Cosine
Medium-Replay-Walker2d	$3 \times 10^{- 4}$	1.0	4.0	Cosine
Kitchen-Complete	$3 \times 10^{- 4}$	0.005	9.0	Cosine
Kitchen-Partial	$3 \times 10^{- 4}$	0.005	10.0	Cosine
Kitchen-Mixed	$3 \times 10^{- 4}$	0.005	10.0	Cosine

The hyperparameters for DQS in this table were adopted directly from the DQL baseline without additional tuning.

Appendix B. Training Detail

The algorithm was trained on a single 4090 GPU. Figure A1, Figure A2 and Figure A3 provide detailed insights into the training process, including Q loss, policy loss, mean of Q-values, GPU memory utilization, GPU utilization, and GPU memory allocation (bytes). It is important to note that the curves presented in these figures are not smoothed. Instead, the data points for these curves were sampled directly during training, with a sampling frequency of every 1000 training steps. This provides an unfiltered view of the training dynamics. Figure A4 presents the average performance and standard deviation band of DQS across various tasks, averaged over five different random seeds after training.

Figure A1. Detailed training and resource utilization metrics for the Kitchen-partial-v0 environment. These plots provide a comprehensive overview of model convergence and hardware efficiency during training.

Figure A2. Detailed training and resource utilization metrics for the Kitchen-mixed-v0 environment. These plots provide a comprehensive overview of model convergence and hardware efficiency during training.

Figure A3. Detailed training and resource utilization metrics for the Kitchen-complete-v0 environment. These plots provide a comprehensive overview of model convergence and hardware efficiency during training.

Figure A4. Performance of our method (DQS) in various evaluation tasks. The graph shows the mean normalized score obtained by DQS across four different task sets. The error bars represent the standard deviation, visualizing the robustness of our method.

References

Minsky, M. Steps toward Artificial Intelligence. Proc. IRE 1961, 49, 8–30. [Google Scholar] [CrossRef]
Weiss, G. Dynamic Programming and Markov Processes. Ronald A. Howard. Technology Press and Wiley, New York, 1960. Viii + 136 Pp. Illus. $5.75. Science 1960, 132, 667. [Google Scholar] [CrossRef]
Bellman, R. Dynamic Programming. Science 1966, 153, 34–37. [Google Scholar] [CrossRef] [PubMed]
Watkins, C.J.C.H.; Dayan, P. Q-Learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602. [Google Scholar] [CrossRef]
Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 1889–1897. [Google Scholar]
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2019, arXiv:1509.02971. [Google Scholar]
Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 20–22 June 2016; pp. 1928–1937. [Google Scholar]
Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 1861–1870. [Google Scholar]
Fujimoto, S.; Gu, S.S. A minimalist approach to offline reinforcement learning. Adv. Neural Inf. Process. Syst. 2021, 34, 20132–20145. [Google Scholar]
Garcia, G.; Eskandarian, A.; Fabregas, E.; Vargas, H.; Farias, G. Cooperative Formation Control of a Multi-Agent Khepera IV Mobile Robots System Using Deep Reinforcement Learning. Appl. Sci. 2025, 15, 17777. [Google Scholar] [CrossRef]
Bogyrbayeva, A.; Dauletbayev, B.; Meraliyev, M. Reinforcement Learning for Efficient Drone-Assisted Vehicle Routing. Appl. Sci. 2025, 15, 2007. [Google Scholar] [CrossRef]
Cho, H.; Kim, H. Entropy-Guided Distributional Reinforcement Learning with Controlling Uncertainty in Robotic Tasks. Appl. Sci. 2025, 15, 2773. [Google Scholar] [CrossRef]
Fujimoto, S.; Meger, D.; Precup, D. Off-policy deep reinforcement learning without exploration. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 2052–2062. [Google Scholar]
Peng, X.B.; Kumar, A.; Zhang, G.; Levine, S. Advantage Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning. arXiv 2020, arXiv:1910.00177. [Google Scholar]
Nair, A.; Dalal, M.; Gupta, A.; Levine, S. {AWAC}: Accelerating Online Reinforcement Learning with Offline Datasets. arXiv 2021, arXiv:2006.09359. [Google Scholar] [CrossRef]
Kostrikov, I.; Nair, A.; Levine, S. Offline Reinforcement Learning with Implicit Q-Learning. In Proceedings of the International Conference on Learning Representations, Virtually, 25–29 April 2022. [Google Scholar]
Kumar, A.; Fu, J.; Soh, M.; Tucker, G.; Levine, S. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. In Proceedings of the Advances in Neural Information Processing Systems; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
Yu, T.; Thomas, G.; Yu, L.; Ermon, S.; Zou, J.Y.; Levine, S.; Finn, C.; Ma, T. Mopo: Model-based offline policy optimization. Adv. Neural Inf. Process. Syst. 2020, 33, 14129–14142. [Google Scholar]
Kidambi, R.; Rajeswaran, A.; Netrapalli, P.; Joachims, T. Morel: Model-based offline reinforcement learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21810–21823. [Google Scholar]
Kumar, A.; Zhou, A.; Tucker, G.; Levine, S. Conservative q-learning for offline reinforcement learning. Adv. Neural Inf. Process. Syst. 2020, 33, 1179–1191. [Google Scholar]
Yu, T.; Kumar, A.; Rafailov, R.; Rajeswaran, A.; Levine, S.; Finn, C. Combo: Conservative offline model-based policy optimization. Adv. Neural Inf. Process. Syst. 2021, 34, 28954–28967. [Google Scholar]
Chen, L.; Lu, K.; Rajeswaran, A.; Lee, K.; Grover, A.; Laskin, M.; Abbeel, P.; Srinivas, A.; Mordatch, I. Decision Transformer: Reinforcement Learning via Sequence Modeling. In Proceedings of the Advances in Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 15084–15097. [Google Scholar]
Janner, M.; Li, Q.; Levine, S. Offline Reinforcement Learning as One Big Sequence Modeling Problem. In Proceedings of the Advances in Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 1273–1286. [Google Scholar]
Shafiullah, N.M.; Cui, Z.; Altanzaya, A.A.; Pinto, L. Behavior transformers: Cloning k modes with one stone. Adv. Neural Inf. Process. Syst. 2022, 35, 22955–22968. [Google Scholar]
Janner, M.; Du, Y.; Tenenbaum, J.B.; Levine, S. Planning with Diffusion for Flexible Behavior Synthesis. arXiv 2022, arXiv:2205.09991. [Google Scholar] [CrossRef]
Chi, C.; Xu, Z.; Feng, S.; Cousineau, E.; Du, Y.; Burchfiel, B.; Tedrake, R.; Song, S. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. Int. J. Robot. Res. 2024, 02783649241273668. [Google Scholar] [CrossRef]
Fernández, D.G.; Matişan, R.A.; Muñoz, A.M.; Vasilcoiu, A.M.; Partyka, J.; Veljković, T.H.; Jazbec, M. DuoDiff: Accelerating Diffusion Models with a Dual-Backbone Approach. arXiv 2024, arXiv:2410.09633. [Google Scholar]
Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 2256–2265. [Google Scholar]
Chen, H.; Lu, C.; Ying, C.; Su, H.; Zhu, J. Offline Reinforcement Learning via High-Fidelity Generative Behavior Modeling. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
Dong, Z.; Yuan, Y.; HAO, J.; Ni, F.; Ma, Y.; Li, P.; ZHENG, Y. CleanDiffuser: An Easy-to-use Modularized Library for Diffusion Models in Decision Making. In Proceedings of the Thirty-Eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
Fu, J.; Kumar, A.; Nachum, O.; Tucker, G.; Levine, S. D4RL: Datasets for Deep Data-Driven Reinforcement Learning. arXiv 2021, arXiv:2004.07219. [Google Scholar] [CrossRef]
Wang, Z.; Hunt, J.J.; Zhou, M. Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
Zhang, C.; Shi, Y.; Chen, Y. BEAR: Physics-Principled Building Environment for Control and Reinforcement Learning. In Proceedings of the 14th ACM International Conference on Future Energy Systems (e-Energy ’23), Orlando FL USA, 20–23 June 2023; pp. 66–71. [Google Scholar] [CrossRef]
Argenson, A.; Dulac-Arnold, G. Model-Based Offline Planning. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]

Figure 1. Offline reinforcement learning.

Figure 2. Analysis of the impact of the denoising steps on performance. The graph shows the relationship between normalized reward and training steps for different numbers of denoising steps (T = 3, T = 5, T = 15, and T = 25). Each subplot represents a different environment.

Figure 3. Analysis of the impact of the number of candidate actions on performance. The graph shows the relationship between normalized reward and training steps for different numbers of candidate actions (N = 1, N = 20, N = 50, and N = 100). Each subplot represents a different environment.

Figure 4. Analysis of reward changes and dynamics during the diffusion reverse process. The left subplot shows the reward change (

Δ

R) at different sampling steps, while the right subplot shows the reward dynamics throughout the entire sampling process. Both figures indicate that the significant reward increase primarily occurs in the final step of the reverse process (step 5). This highlights the crucial role of the final denoising step in generating high-reward samples.

Figure 4. Analysis of reward changes and dynamics during the diffusion reverse process. The left subplot shows the reward change (

Δ

R) at different sampling steps, while the right subplot shows the reward dynamics throughout the entire sampling process. Both figures indicate that the significant reward increase primarily occurs in the final step of the reverse process (step 5). This highlights the crucial role of the final denoising step in generating high-reward samples.

Table 1. Comparing performance of DQS with other baseline algorithms.

Gym Tasks	BC	CQL	IQL	DT	TT	TD3+BC	Diffuser	MBOP	DQL	DQS (Ours)
Medium-Expert-HalfCheetah	55.2	91.6	86.7	86.8	95.0	90.7	88.9	105.9	96.8	94.0 ± 0.94
Medium-Expert-Hopper	52.5	105.4	91.5	107.6	110.0	98.0	103.3	55.1	111.1	97.0 ± 1.01
Medium-Expert-Walker2d	107.5	108.8	109.6	108.1	101.9	110.1	106.9	70.2	110.1	109.3 ± 0.29
Medium-HalfCheetah	42.6	44.0	47.4	42.6	46.9	48.3	42.8	44.6	51.1	47.0 ± 0.70
Medium-Hopper	52.9	58.5	66.3	67.6	61.1	59.3	74.3	48.8	90.5	84.6 ± 1.30
Medium-Walker2d	75.3	72.5	78.3	74.0	79.0	83.7	79.6	41.0	87.0	85.9 ± 1.70
Medium-Replay-HalfCheetah	36.6	45.5	44.2	36.6	41.9	44.6	37.7	42.3	47.8	44.4 ± 0.61
Medium-Replay-Hopper	18.1	95.0	94.7	82.7	91.5	60.9	93.6	12.4	101.3	96.1 ± 0.01
Medium-Replay-Walker2d	26.0	77.2	73.9	66.6	82.6	81.8	70.6	9.7	95.5	94.8 ± 1.32
Average	51.9	77.6	77.0	74.7	78.9	75.3	77.5	47.8	88.0	83.7
Kitchen Tasks	BC	CQL	IQL	BEAR	AWR	SAC	BCQ	BEAR-p	DQL	DQS (Ours)
Kitchen-Complete	33.8	43.8	62.5	0.0	0.0	15.0	8.1	0.0	84.0	72.1 ± 13.2
Kitchen-Partial	33.8	49.8	46.3	13.1	15.4	0.0	18.9	0.0	60.5	73.5 ± 6.6
Kitchen-Mixed	47.5	51.0	51.0	47.2	10.6	2.5	8.1	0.0	62.6	65.1 ± 13.5
Average	38.4	48.2	53.3	20.1	8.7	5.8	11.7	0.0	69.0	69.7

Only the absolute highest value in each row is highlighted in boldface. We report scores with 5 diffusion steps.

Table 2. Comparing computational efficiency of DQS with other baseline algorithms.

Time	CQL	BC	DT	TD3 + BC	DQL	DQS
Training Time	140 m	11 m	400 m	38 m	160 m	120 m
Inference Time	-	-	-	-	10.6 m	11.1 m

Training time is defined as the duration needed to accomplish one million training steps. Inference time is after 150 episodes of inference.

Table 3. Comparison of our diffusion model with CVAE models, and our Q-learning method with BCQ method.

Dataset	Diffusion vs. CVAE		Q-Synergy vs. BCQ
Dataset	CVAE-BC	Diffusion-BC	Diffusion-BCQ	DQS (Ours)
Medium-Replay-HalfCheetah	34.4	39.7	46.1	45.4
Medium-Replay-Hopper	84.7	92.8	94.7	96.1
Medium-Replay-Walker2d	67.3	71.4	90.9	95.8
Mixed-Kitchen	49	51.2	58.5	68.0
Average	58.9	63.8	72.6	76.3

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, A.; Zhu, X.; Que, H. Diffusion-Q Synergy (DQS): A Generative Approach to Policy Optimization via Denoised Action Spaces. Appl. Sci. 2025, 15, 10141. https://doi.org/10.3390/app151810141

AMA Style

Li A, Zhu X, Que H. Diffusion-Q Synergy (DQS): A Generative Approach to Policy Optimization via Denoised Action Spaces. Applied Sciences. 2025; 15(18):10141. https://doi.org/10.3390/app151810141

Chicago/Turabian Style

Li, Ao, Xinghui Zhu, and Haoyi Que. 2025. "Diffusion-Q Synergy (DQS): A Generative Approach to Policy Optimization via Denoised Action Spaces" Applied Sciences 15, no. 18: 10141. https://doi.org/10.3390/app151810141

APA Style

Li, A., Zhu, X., & Que, H. (2025). Diffusion-Q Synergy (DQS): A Generative Approach to Policy Optimization via Denoised Action Spaces. Applied Sciences, 15(18), 10141. https://doi.org/10.3390/app151810141

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Diffusion-Q Synergy (DQS): A Generative Approach to Policy Optimization via Denoised Action Spaces

Abstract

1. Introduction

2. Preliminaries

2.1. Offline Reinforcement Learning

2.2. Policy Improvement via Weighted Regression

2.3. Diffusion Models

3. Diffusion-Q Synergy

3.1. Diffusion Policy Cloning

3.2. Q-Learning-Guided Noise Prediction

3.3. Algorithm Details

4. Experimental Evaluation

4.1. Comparison with Other Methods

4.2. Impact of Hyperparameters

4.3. Ablation Study

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

Appendix A. Hyperparameter

Appendix B. Training Detail

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI