Abstract
In many episodic reinforcement learning (RL) environments, SARSA-based methods are used to improve policies that maximize returns over long horizons. Traditional SARSA algorithms, however, struggle to balance bias and variance because they rely on a single, fixed discount factor (γ). This study extends the temporal difference decomposition method, TD(Δ), to the SARSA algorithm; we refer to the resulting method as SARSA(Δ). SARSA is a widely used on-policy RL method that learns action-value functions through temporal difference updates. In SARSA(Δ), the action-value function is decomposed into components defined by the differences between action-value functions associated with successive discount factors. Each component, called a delta estimator (D), is tied to a specific discount factor and learned independently, which enables learning across multiple time scales. This decomposition improves learning efficiency and ensures consistency, particularly in settings that require long-horizon optimization. Our results show that the proposed technique reduces bias in SARSA's updates and accelerates convergence in both deterministic and stochastic settings, including dense-reward Atari environments. Experiments across a variety of benchmarks demonstrate that SARSA(Δ) outperforms existing TD learning methods in both tabular and deep RL environments.
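To make the decomposition concrete, the following is a minimal tabular sketch of one plausible SARSA(Δ) update, following the TD(Δ) delta-estimator recursion: D_0 is trained like ordinary SARSA with the smallest discount factor, and each higher component bootstraps off the reconstructed action-value function of the previous scale. The discount-factor ladder, step size, exploration rate, and function names are illustrative assumptions, not values or identifiers taken from the paper.

```python
import numpy as np
from collections import defaultdict

# Assumed discount-factor ladder gamma_0 < gamma_1 < ... < gamma_Z;
# the largest entry plays the role of the target horizon.
GAMMAS = [0.0, 0.5, 0.9, 0.99]
ALPHA = 0.1       # step size (assumed)
EPSILON = 0.1     # epsilon-greedy exploration rate (assumed)
N_ACTIONS = 4     # environment-dependent

# One delta estimator D_z(s, a) per discount factor; Q_{gamma_Z} = sum_z D_z.
D = [defaultdict(lambda: np.zeros(N_ACTIONS)) for _ in GAMMAS]

def q_values(state, upto=None):
    """Reconstruct Q_{gamma_upto}(state, .) as the sum of delta components 0..upto."""
    upto = len(GAMMAS) - 1 if upto is None else upto
    return sum(D[z][state] for z in range(upto + 1))

def select_action(state, rng=np.random.default_rng()):
    """Epsilon-greedy policy over the reconstructed full-horizon Q."""
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(q_values(state)))

def sarsa_delta_update(s, a, r, s_next, a_next, done):
    """On-policy SARSA-style update applied independently to each delta estimator."""
    for z, gamma in enumerate(GAMMAS):
        if z == 0:
            # Base component: ordinary SARSA target with the smallest discount factor.
            target = r + (0.0 if done else gamma * D[0][s_next][a_next])
        else:
            # Higher components capture the difference between adjacent horizons;
            # the immediate reward cancels, so only bootstrap terms remain.
            prev_gamma = GAMMAS[z - 1]
            if done:
                target = 0.0
            else:
                q_prev = q_values(s_next, upto=z - 1)[a_next]
                target = (gamma - prev_gamma) * q_prev + gamma * D[z][s_next][a_next]
        D[z][s][a] += ALPHA * (target - D[z][s][a])
```

In this sketch the behavior policy acts greedily (with exploration) on the summed estimator, while each D_z is updated on its own time scale, which is the property the abstract attributes to SARSA(Δ).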