We now present some of the main ideas of distributional reinforcement learning in a tabular setting. We first look at the evaluation problem, where we try to find the state–action value of a fixed policy $\pi $. Second, we consider the control problem, where we try to find the optimal state–action value. Third, we consider the distributional approximation procedure CDRL used by the agents in this paper.
2.2.1. Evaluation
We consider a distributional variant of (
2), the distributional Bellman operator given by
${T}^{\pi}:\mathcal{D}\to \mathcal{D}$,
Here, ${T}^{\pi}$ is, for all $n\ge 1$, a $\gamma$-contraction in ${\mathcal{D}}_{n}$ with a unique fixed point when ${\mathcal{D}}_{n}$ is endowed with the supremum ${n}^{\mathrm{th}}$-Wasserstein metric ([5], Lemma 3) (see [15] for more details on Wasserstein distances). Moreover, by Proposition 2 of [9],
${T}^{\pi}$ is expectation preserving when we have an initial coupling with the
${\mathcal{T}}^{\pi}$iteration given in (
2); that is, given an initial
${\eta}_{0}\in \mathcal{D}$ and a function
g, such that
$g={Q}_{{\eta}_{0}}$. Then,
${\left({\mathcal{T}}^{\pi}\right)}^{n}g={Q}_{{\left({T}^{\pi}\right)}^{n}{\eta}_{0}}$ holds for all
$n\ge 0$.
Thus, if we let
${\eta}_{\pi}\in \mathcal{D}$ be the function of distributions of
${Z}_{\pi}$ in (
1), then
${\eta}_{\pi}$ is the unique fixed point satisfying the distributional Bellman equation:
It follows that iterating ${T}^{\pi}$ from any starting collection ${\eta}_{0}$ with bounded moments converges to ${\eta}_{\pi}$, solving the evaluation problem for $\pi $ to arbitrary accuracy.
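To make the expectation-preserving property concrete, here is a minimal numerical sketch (the two-state, one-action MDP and all its values are assumptions for illustration): since ${T}^{\pi}$ preserves expectations, the means of the distributional iterates follow the ordinary Bellman recursion, which converges to ${Q}_{\pi}$.

```python
import numpy as np

# Assumed two-state, one-action MDP: P[x, x'] transition matrix, r[x] rewards.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])
gamma = 0.9

# Because T^pi preserves expectations, the means of the distributional
# iterates follow the ordinary Bellman iteration Q <- r + gamma * P Q,
# a gamma-contraction converging to the unique fixed point Q_pi.
Q = np.zeros(2)
for _ in range(500):
    Q = r + gamma * (P @ Q)

# The fixed point solves the linear system (I - gamma * P) Q = r exactly.
Q_exact = np.linalg.solve(np.eye(2) - gamma * P, r)
```

After 500 iterations the residual is on the order of $\gamma^{500}$, so the iterate and the exact solution agree to machine precision.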
2.2.3. Categorical Evaluation and Control
In most real applications, the updates of (
4) and (
5) are either computationally infeasible or impossible to compute exactly because
$p$ is unknown. Approximations are therefore key to defining practical distributional algorithms. These may involve parametrization over some selected set of distributions along with projections onto these distributional subspaces, or stochastic approximation with sampled transitions and gradient updates under function approximation.
A structure for algorithms making use of such approximations is Categorical Distributional Reinforcement Learning (CDRL). What follows is a short summary of the CDRL procedure fundamental to the single-agent implementations in this paper.
Let
$\mathbf{z}=\left\{{z}_{1},{z}_{2},\dots ,{z}_{K}\right\}$ be a fixed, ordered set of equally spaced real numbers such that
${z}_{1}<{z}_{2}<\dots <{z}_{K}$ with
$\Delta z:={z}_{i+1}-{z}_{i}$. Let:
be the subset of categorical distributions in
$\mathcal{P}\left(\mathbb{R}\right)$ supported on
$\mathbf{z}$. We consider parameterized distributions by using
$\widehat{\mathcal{D}}={\mathcal{P}}^{\mathcal{A}\times \mathcal{X}}$ as the collection of possible inputs and outputs of an algorithm. Moreover, for each
$\eta \in \widehat{\mathcal{D}}$, we have:
as its Q-value function.
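As a small sketch of this Q-value map (the shapes and values here are illustrative assumptions), the Q-value of a categorical distribution supported on $\mathbf{z}$ is its atom-weighted mean $\sum_{i} p_{i}{z}_{i}$:

```python
import numpy as np

# Assumed sizes: 3 states, 2 actions, K = 5 support atoms.
K = 5
z = np.linspace(-2.0, 2.0, K)                  # fixed equally spaced support
# p[x, a, i]: probability mass that the return from (x, a) equals z_i.
p = np.full((3, 2, K), 1.0 / K)                # start from uniform pmfs
p[0, 0] = np.array([0.0, 0.0, 0.0, 1.0, 0.0])  # point mass at z_4 = 1 for (x=0, a=0)

# Q_eta(x, a) is the expectation of the categorical distribution: sum_i p_i z_i.
Q = p @ z                                      # shape (3, 2)
```

For the point mass the Q-value is exactly the atom's location, and for the uniform pmf on a symmetric support it is zero.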
In preparation for the subsequent treatment of our extension of CDRL, we first reproduce the steps of the general procedure in Algorithm 1 (see [
10], Algorithm 1).
Algorithm 1: Categorical Distributional Reinforcement Learning (CDRL) 
At each iteration step $t$ and input ${\eta}_{t}\in \widehat{\mathcal{D}}$, sample a transition $({x}_{t},{a}_{t},{r}_{t},{x}_{t}^{\prime})$. Select ${a}^{\ast}$ to be either sampled from $\pi \left({x}_{t}\right)$ in the evaluation setting or taken as ${a}^{\ast}={\mathrm{arg\,max}}_{a}{Q}_{{\eta}_{t}}({x}_{t}^{\prime},a)$ in the control setting. Recall the Cramér projection ${\mathsf{\Pi}}_{\mathbf{z}}$ given in Definition 2, and put:
Take the next iterated function as some update ${\eta}_{t+1}$ such that:
where:
denotes the Kullback–Leibler divergence.

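A common concrete realization of the projected target ${\mathsf{\Pi}}_{\mathbf{z}}{\left({f}_{r}\right)}_{\#}\eta$ above is the C51-style linear-interpolation projection onto the fixed support; the sketch below assumes equally spaced atoms, and the function name and arguments are illustrative:

```python
import numpy as np

def cramer_projection(z, p, r, gamma):
    """Cramer projection Pi_z of the pushforward (f_r)_# of a categorical
    distribution with support z and pmf p, where f_r(s) = r + gamma * s.
    Sketch of the standard C51-style projection; names are assumptions."""
    K = len(z)
    dz = z[1] - z[0]                           # Delta z: equal atom spacing
    tz = np.clip(r + gamma * z, z[0], z[-1])   # shifted/scaled atoms, clipped
    out = np.zeros(K)
    for j in range(K):
        b = (tz[j] - z[0]) / dz                # fractional index of atom j
        l, u = int(np.floor(b)), int(np.ceil(b))
        if l == u:                             # lands exactly on a grid point
            out[l] += p[j]
        else:                                  # split mass between neighbours
            out[l] += p[j] * (u - b)
            out[u] += p[j] * (b - l)
    return out
```

For example, with support $\{0,1,2,3,4\}$, a point mass at $2$ mapped through ${f}_{r}$ with $r=0.5$, $\gamma =1$ lands at $2.5$ and is split evenly between the atoms $2$ and $3$, so the projected pmf remains a probability distribution.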
Consider first a finite MDP and a tabular setting. Define
${\widehat{\eta}}_{t}^{(x,a)}:={\eta}_{t}^{(x,a)}$ whenever
$(x,a)\ne ({x}_{t},{a}_{t})$. Then, by the convexity of
$-\log\left(z\right)$, it is readily verified that updates of the form:
satisfy Step 4. In fact, if there exists a unique policy
${\pi}^{\ast}$ associated with the convergence of (
3), then this update yields an almost sure convergence, with respect to the supremumCramér metric, to a distribution in
$\widehat{\mathcal{D}}$ with
${\pi}^{\ast}$ as the greedy policy (under some additional assumptions on the step sizes ${\alpha}_{t}$ and sufficient support; see [10], Theorem 2, for details).
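In the tabular case, the update satisfying Step 4 is a simple mixture step at the visited pair $({x}_{t},{a}_{t})$; a minimal sketch, with all values assumed for illustration:

```python
import numpy as np

# One tabular CDRL update at the visited pair (x_t, a_t); the pmfs and
# step size below are illustrative assumptions, not taken from the paper.
alpha = 0.1                                    # step size alpha_t
eta_xa = np.array([0.25, 0.25, 0.25, 0.25])    # current pmf eta_t^{(x_t, a_t)}
target = np.array([0.0, 0.5, 0.5, 0.0])        # projected target pmf hat-eta_t

# Mixture update: move the stored pmf a step of size alpha toward the target;
# a convex combination of pmfs is again a pmf, so eta stays in the simplex.
eta_xa = (1.0 - alpha) * eta_xa + alpha * target
```

All other entries ${\eta}_{t}^{(x,a)}$ with $(x,a)\ne ({x}_{t},{a}_{t})$ are left unchanged, matching the definition of ${\widehat{\eta}}_{t}$ above.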
In practice, we are often forced to use function approximation of the form:
where
$\varphi $ is parameterized by some set of weights
$\mathbf{\theta}$. Gradient updates with respect to
$\mathbf{\theta}$ can then be made to minimize the loss:
where
${\widehat{\eta}}_{t}^{({x}_{t},{a}_{t})}={\mathsf{\Pi}}_{\mathbf{z}}{\left({f}_{{r}_{t}}\right)}_{\#}\varphi ({x}_{t}^{\prime},{a}^{\ast};{\mathbf{\theta}}_{\mathrm{fixed}})$ is the computed learning target of the transition
$({x}_{t},{a}_{t},{r}_{t},{x}_{t}^{\prime})$. However, convergence with the Kullback–Leibler loss and function approximation is still an open question. Theoretical progress has been made when considering other losses, although we may lose the stability benefits coming from the relative ease of minimizing (
6) [
9,
11,
16].
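Part of that relative ease comes from the softmax parametrization: the gradient of $\mathrm{KL}(\widehat{\eta}\,\|\,\mathrm{softmax}(\mathbf{\theta}))$ with respect to the logits $\mathbf{\theta}$ reduces to $\mathrm{softmax}(\mathbf{\theta})-\widehat{\eta}$. A sketch with assumed toy values:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a vector of logits."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def kl(t, p):
    """KL divergence KL(t || p) between two pmfs with full support."""
    return float(np.sum(t * np.log(t / p)))

# Illustrative values: theta are the logits of the learned pmf, target is a
# fixed (e.g. projected) pmf playing the role of hat-eta in the KL loss.
theta = np.array([0.2, -1.0, 0.5])
target = np.array([0.1, 0.6, 0.3])

# Gradient of KL(target || softmax(theta)) w.r.t. the logits.
grad = softmax(theta) - target

# A single small gradient step on the logits decreases the KL loss.
theta_new = theta - 0.5 * grad
```

Since both pmfs sum to one, the gradient components sum to zero, so the update leaves the softmax normalization well behaved.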
An algorithm implementing CDRL with function approximation is
C51 [
5]. It essentially uses the same neural network architecture and training procedure as DQN [
17]. To increase stability during training, this procedure also involves sampling transitions from an experience buffer and maintaining an older, periodically updated copy of the weights for target computation. However, instead of estimating Q-values,
C51 uses a finite support
$\mathbf{z}$ of 51 points and learns discrete probability distributions
$\varphi (x,a;\mathbf{\theta})$ over
$\mathbf{z}$ via a softmax transfer. Training is done by using the KL-divergence as the loss function over batches with the computed targets
${\widehat{\eta}}^{(x,a)}$ of CDRL.