F-DRL: Federated Dynamics Representation Learning for Robust Multi-Task Reinforcement Learning

Upadhyay, Anurag; Lu, Xin; Baradaranshokouhi, Yashar; Li, Jun; Jing, Yanguo

doi:10.3390/info17030232

Open AccessArticle

F-DRL: Federated Dynamics Representation Learning for Robust Multi-Task Reinforcement Learning

by

Anurag Upadhyay

¹,

Xin Lu

^1,*

,

Yashar Baradaranshokouhi

¹

,

Jun Li

²

and

Yanguo Jing

³

¹

School of Computing and Creative Industries, Leeds Trinity University, Brownberrie Lane Horsforth, Leeds LS18 5HD, UK

²

Centre for Computational Engineering Sciences, School of Aerospace, Transport and Manufacturing (SATM) Cranfield University, Cranfield, Bedfordshire MK43 0AL, UK

³

Institute of Business, Industry and Leadership, University of Cumbria, Cumbria CA1 2HH, UK

^*

Author to whom correspondence should be addressed.

Information 2026, 17(3), 232; https://doi.org/10.3390/info17030232

Submission received: 2 February 2026 / Revised: 24 February 2026 / Accepted: 26 February 2026 / Published: 1 March 2026

Download

Browse Figures

Versions Notes

Abstract

Reinforcement learning for robotic manipulation is often limited by poor sample efficiency and unstable training dynamics, challenges that are further amplified in federated settings due to data privacy constraints and task heterogeneity. To address these issues, we propose F-DRL, a federated dynamics-aware representation learning framework that enables multiple robotic tasks to collaboratively learn structured latent representations without sharing raw trajectories or policy parameters. The framework combines robotics priors with an action-conditioned latent dynamics model to learn low-dimensional state and state–action embeddings that explicitly capture task-relevant geometric and transition structure. Representation learning is performed locally at each client, while a central server aggregates encoder parameters using a similarity-weighted scheme based on second-order latent geometry. The learned representations are then used as frozen auxiliary inputs for downstream model-free reinforcement learning. We evaluate F-DRL on seven heterogeneous robotic manipulation tasks from the MetaWorld benchmark. While achieving performance comparable to centralized training and standard federated baseline, F-DRL substantially improves training stability relative to FedAvg on heterogeneous manipulation tasks with partially shared dynamics (e.g., Drawer-Open and Window-Open), reducing the mean across-seed standard deviation and the AUC of this deviation by over 60%. The method remains neutral on simple tasks and performs less consistently on contact-rich manipulation tasks with task-specific dynamics, indicating both the benefits and the practical limits of representation-level knowledge sharing in federated robotic learning.

Keywords:

robotics priors; action-conditioned system dynamics; state-action encode; deep reinforcement learning; robotic manipulation; dynamics aware embeddings

Graphical Abstract

1. Introduction

The combination of deep learning and Reinforcement Learning (RL), following the seminal work of [1], has accelerated the adoption of Deep Reinforcement Learning (DRL) across domains such as robotics, gaming, autonomous driving, and recommendation systems [2]. Although DRL enables end-to-end learning directly from high-dimensional raw observations without explicit feature engineering, these methods remain notoriously sample inefficient, often requiring tens of millions of interactions to solve even simple tasks. This inefficiency arises from several inherent challenges in RL, including the need for extensive exploration, long-horizon credit assignment, and instability in value-based and policy-gradient optimization. These issues are amplified in real-world robotic settings, where data collection is slow, expensive, and physically risky, and where even high-fidelity simulators suffer from sim-to-real gaps [3]. As a result, training instability and poor reproducibility remain fundamental barriers to deploying DRL systems in practice.

A promising direction for mitigating these challenges is representation learning [4], where compact, low-dimensional embeddings are learned explicitly rather than implicitly within the policy network. Such representations constrain the hypothesis space of downstream policy learning, improving sample efficiency and stability. Recent work on unsupervised representation learning [5] has demonstrated that structured latent spaces can capture meaningful factors of variation useful for robotic control. While much of the existing literature focuses on high-dimensional visual observations [6], representation learning is equally critical for proprioceptive inputs, since the resulting observation stream corresponds to a partially observable Markov decision process (POMDP). In this setting, representations must filter sensor noise while preserving task-relevant physical structure.

However, learning representations solely from interaction data is inherently challenging: Unconstrained encoders can capture spurious correlations or task-specific artefacts that fail to generalize. Robotics Priors [7] address this issue by introducing physics-inspired inductive biases—such as temporal coherence, causality, repeatability, and proportionality—that encourage physically meaningful latent spaces. By constraining representation learning with these priors, learned embeddings better reflect the underlying structure of robotic dynamics and improve robustness in downstream control.

Building on this foundation, an action-conditioned representation is introduced to explicitly capture system dynamics within a robotics-prior-constrained latent space. Unlike state-only embeddings, the proposed encoder models how actions transform states, enabling the latent space to encode transition structure rather than static observations alone. The resulting representation,

z (s, a)

, is therefore well suited for transfer across tasks with partially shared dynamics. However, exploiting this property in collaborative multi-task settings requires a mechanism for selective and stable knowledge sharing. In realistic deployments, robots often learn multiple tasks under privacy, communication, or operational constraints that preclude centralized training. In such settings, naive parameter sharing can be harmful: Tasks with incompatible dynamics can interfere, leading to unstable learning and negative transfer. This motivates a formulation in which knowledge is shared only to the extent that tasks are dynamically related and where the unit of sharing is the representation rather than the policy or value function.

Federated learning (FL) provides a natural framework for such collaboration, as it enables decentralized knowledge sharing without exposing raw data or requiring centralized optimization. However, existing federated reinforcement learning (FRL) approaches are poorly suited to heterogeneous robotic settings: most aggregate policy or value-function parameters directly, tightly coupling federation with unstable RL updates and making these methods highly sensitive to task diversity.

To address this limitation, a representation-centric federated learning framework is introduced in which federation is applied exclusively to representation learning and fully decoupled from downstream policy optimization. Multiple tasks performed by the same robot are treated as federated clients, reflecting realistic scenarios in manufacturing, service robotics, and household automation. Clients are aligned based on learned dynamics similarity rather than task identity or uniform averaging, enabling selective, stability-oriented knowledge transfer while avoiding direct synchronization of unstable policy updates. This design shifts the role of federation from policy coordination to representation alignment, which is more appropriate for heterogeneous robotic systems.

The core hypothesis is that tasks sharing the same state and action spaces, and exhibiting partially overlapping transition dynamics, should share similar latent representations. Federated aggregation guided by dynamics-based similarity encourages alignment across such tasks, improving generalization while mitigating negative transfer and remaining neutral when heterogeneity is limited. To the best of the authors’ knowledge, this work is the first to integrate dynamics-aware action-conditioned representations with representation-level federated aggregation within a unified FRL framework for robotic manipulation.

Contributions

The main contributions of this work are:

A federated dynamics-aware representation learning framework that learns shared state and action-conditioned embeddings across multiple robotic manipulation tasks without sharing raw data or policy parameters, explicitly targeting stability under task heterogeneity.
An extension of robotics-prior-constrained representation learning through the incorporation of action-conditioned system dynamics, enabling embeddings that capture task-relevant transition structure rather than static state features.
A similarity-weighted federated aggregation strategy based on second-order latent geometry, which selectively aligns clients according to dynamics similarity and mitigates negative transfer induced by uniform averaging.
An extensive empirical evaluation on heterogeneous MetaWorld manipulation tasks demonstrating that the learned representations act as a stabilizing auxiliary signal for downstream model-free reinforcement learning, substantially reducing variance across random seeds while achieving performance comparable to centralized training.

The remainder of this paper is organized as follows. Section 2 reviews related work on deep reinforcement learning, representation learning for control, and federated reinforcement learning. Section 3 presents the proposed federated dynamics-aware representation learning framework. Section 4 details the experimental setup and empirical evaluation. Section 5 concludes and outlines directions for future work.

2. Related Work

This section reviews prior work across three complementary research directions that motivate the proposed framework. We first discuss advances in deep reinforcement learning, highlighting persistent challenges such as sample inefficiency, training instability, and limited generalisation, which are particularly pronounced in robotic manipulation and motivate the need for structured and transferable representations. We then review representation learning methods for reinforcement learning, with a focus on robotics-prior-based and dynamics-aware approaches, which demonstrate that incorporating physical inductive biases and transition structure into latent representations can improve learning efficiency and robustness. Finally, we survey FRL methods that address privacy and data-sharing constraints, noting that most existing approaches rely on policy- or value-level aggregation and remain highly sensitive to task heterogeneity, while representation-level knowledge sharing is comparatively under-explored. Together, these works expose a gap at the intersection of dynamics-aware representation learning and federated aggregation, which our framework addresses by decoupling representation learning from control and enabling selective, stability-oriented knowledge sharing across heterogeneous robotic tasks.

2.1. Deep Reinforcement Learning and Sample Inefficiency

DRL achieved widespread attention following the seminal work of [1], which demonstrated that deep neural networks can learn control policies directly from raw observations. While this removed the need for hand-engineered features, DRL methods remain fundamentally limited by unstable optimization dynamics, high variance, and severe sample inefficiency. A key innovation introduced in the DQN framework is the experience replay buffer, which decorrelates training data and stabilizes off-policy learning. However, uniform sampling treats all transitions as equally informative, constraining learning efficiency. Several extensions addressed these limitations by improving representational and learning capacity. The duelling network architecture [8] decomposes the Q-function into state-value and advantage components for faster generalization, while Double DQN [9] reduces overestimation bias through decoupled action selection and evaluation. Prioritized Experience Replay (PER) [10] further enhances sample efficiency by sampling transitions according to temporal-difference error, with bias correction via importance sampling. Policy-gradient approaches take a complementary perspective by directly parametrizing the policy

π_{θ} (a ∣ s)

and optimizing it via gradient ascent. These methods provide expressive control policies but are prone to high variance and unstable updates. Trust Region Policy Optimization (TRPO) [11] addressed this by enforcing a trust-region constraint on policy updates, ensuring monotonic improvement. Proximal Policy Optimization (PPO) [12] introduced a clipped surrogate objective that prevents destructive policy shifts while remaining computationally efficient. Generalized Advantage Estimation (GAE) [13] reduced variance in advantage estimation by combining multi-step returns through an exponentially weighted scheme, offering a controllable bias-variance trade-off. Modern off-policy actor–critic methods unify value-based and policy-based principles. DDPG [14] extended deterministic policy gradients to continuous control using replay buffers, target networks, and exploration noise. TD3 [15] further improved stability and sample efficiency by introducing twin critics to mitigate overestimation, target policy smoothing, and delayed policy updates.

Although these advances substantially improve optimization stability and reduce estimation variance, they operate primarily at the level of value approximation and policy updates. They do not directly address the challenge of learning compact, structured, and transferable representations—a limitation that becomes critical in robotic manipulation, where data collection is expensive and task variations induce significant state–action distribution shifts. This motivates the study of explicit representation learning techniques, which we discuss next.

2.2. Representation Learning for Reinforcement Learning

2.2.1. Robotics Priors and Physical Constraints

A long-standing view in representation learning is that useful latent spaces should reflect general priors about how the physical world behaves [4]. In robotics, this idea was first introduced by [16], who proposed Robotics Priors, a set of physics-inspired constraints—temporal coherence, proportionality, causality, and repeatability—that impose local structure on the latent state space. These priors encourage representations that evolve smoothly over time, respond predictably to actions, and remain consistent with underlying physical dynamics. Several extensions have been proposed to enhance this framework. Position–Velocity Encoders (PVEs) [17] incorporate additional priors to induce an explicit position–velocity decomposition reflecting canonical quantities in robot motion. Other works broadened the applicability of Robotics Priors to multi-task reinforcement learning via task-gating mechanisms and task-coherence priors that cluster representations within an episode [18]. To improve cross-episode consistency [19] introduced a reference point prior that stabilizes latent representations across trajectories. Extensions to partially observable settings were explored by [20], who incorporated recurrent architectures to maintain temporally coherent latent states. Further, reward-shaped variants of Robotics Priors were proposed in [21], and adaptations to continuous-action spaces were introduced in [22]. Most prior work in this line of research focuses on imposing structural constraints on state-only representations and is studied in centralized or single-task settings. In contrast, our work considers action-conditioned, dynamics-aware representations and investigates how such structured embeddings can be selectively shared across tasks in a federated setting under privacy constraints.

2.2.2. Dynamics Modelling via Self-Prediction

Dynamics-modelling-based representation learning methods learn low-dimensional embeddings by approximating transition and reward distributions

p (s^{'} ∣ s, a)

and

p (r ∣ s, a)

. By jointly training the representation and predictive models, the resulting latent space is encouraged to encode task-relevant system dynamics. These approaches are typically categorised according to the downstream reinforcement learning paradigm in which the learned representations are used. Model-free methods employing deterministic encoders include [23,24,25,26,27]. In contrast, approaches using stochastic encoders learn parametrized distributions over latent states rather than point embeddings, as in [28,29,30]. Although often described as model-free, many of these methods still learn approximate transition or reward models during representation learning. In this context, “model-free” refers to the absence of model-based planning in the control pipeline rather than the absence of predictive modelling altogether. Model-based methods with deterministic encoders include [31,32,33,34,35]. Approaches with stochastic latent dynamics models, such as [36,37,38], employ probabilistic representations to capture uncertainty and support planning-based control. While these approaches demonstrate the benefits of dynamics-aware latent spaces for reinforcement learning, they are predominantly studied in centralized settings and do not address how dynamics representations can be aligned, aggregated, or selectively shared across heterogeneous agents, which is the focus of our framework.

2.3. Federated Reinforcement Learning

Federated reinforcement learning (FRL) has been widely studied as a means of addressing sequential decision-making problems in privacy-preserving and decentralized settings [39]. In most existing FRL approaches, federation is performed at the policy or value-function level, where the parameters of the actor, critic, or Q-network are periodically synchronized across clients using federated averaging or proximal variants [40,41,42,43,44,45]. While effective in settings with homogeneous tasks or aligned objectives, such approaches tightly couple federated synchronization with the inherently unstable optimization dynamics of reinforcement learning. As a result, aggregation noise, non-stationarity, and policy interference are amplified under task diversity, often leading to degraded performance or unstable training.

FRL methods have also been applied to robotic systems, motivated by scenarios in which robots must collaboratively learn while preserving data locality and privacy [46,47,48,49,50,51,52,53]. These works demonstrate the feasibility of federated training across robotic platforms or tasks and explore extensions such as lifelong learning, preference-based aggregation, and selective communication. A subset of studies explicitly considers task heterogeneity in robotic manipulation and control [54,55,56]. However, despite addressing heterogeneity at the algorithmic or optimization level, these methods remain fundamentally rooted in policy-level or value-function aggregation. Consequently, federated updates are directly entangled with unstable reinforcement learning gradients, rendering such approaches sensitive to inter-task mismatch and prone to negative transfer when task dynamics differ.

A key limitation shared by existing FRL methods lies in the choice of aggregation unit. Policies and value functions are task-specific, highly non-linear, and tightly coupled to reward structure and exploration dynamics. Aggregating these parameters across heterogeneous tasks implicitly assumes alignment at the control level, an assumption that rarely holds in realistic multi-task robotic settings. Even when tasks share identical state and action spaces, differences in transition dynamics or reward geometry can cause federated averaging to introduce destructive interference, undermining both convergence and training stability.

In contrast, representation-centric federated learning frameworks shift the unit of federation from control parameters to learned representations. When federation operates on dynamics-aware embeddings rather than policies, aggregation is decoupled from unstable policy optimization and can be guided by structural similarity in latent space. Aligning clients based on latent geometric similarity enables selective knowledge sharing under task heterogeneity, while avoiding direct synchronization of unstable gradients. This architectural distinction underpins the stability improvements observed in recent representation-level FRL approaches and motivates the framework introduced in this work.

3. Framework and Methodology

3.1. Design Rationale and Framework Overview

The proposed framework follows a representation-centric federated learning paradigm, in which representation learning is decoupled from downstream control and federated synchronization is applied exclusively in latent space. This design is motivated by the instability of policy- and value-level aggregation under task heterogeneity, which can amplify variance and induce negative transfer in reinforcement learning. By federating only representation parameters and keeping control optimization local, the framework enables selective knowledge sharing while preserving training stability. An overview of the modular framework is shown in Figure 1, highlighting the separation between federated representation learning and local policy optimization.

3.2. Federated Learning Setting and Assumptions

We consider a federated reinforcement learning setting in which each client is modelled as a Markov Decision Process (MDP) [57], defined by a state space

S_{i}

, action space

A_{i}

, transition dynamics

P_{i}

, reward function

R_{i}

, and discount factor

γ

. In our experiments, each client corresponds to a distinct robotic manipulation task from the MetaWorld benchmark [58]. All federated clients share identical observation and action spaces and operate on robots with comparable morphology. Accordingly, the heterogeneity considered in this work arises primarily from differences in task structure and transition dynamics rather than differences in embodiment. This assumption reflects realistic deployment scenarios such as multi-task industrial manipulators, service robots, or household robots, where a single physical platform performs multiple heterogeneous tasks under a fixed embodiment. Offline trajectories are required for representation learning, and no interaction data is shared across clients, preserving data privacy. Under these conditions, the framework is most effective when tasks share common dynamical structure but differ in task-specific objectives. This setting reflects realistic deployment scenarios in which a single robotic platform performs multiple heterogeneous tasks under a shared control interface. Extending the framework to cross-embodiment federation would require additional representation alignment mechanisms and remain an important direction for future research.

We adopt a client–server federated learning architecture [59,60], where clients perform local representation learning and periodically communicate model updates to a central server. The server aggregates these updates to produce a global representation model, which is then broadcast back to clients for subsequent local training rounds. Only representation parameters are shared between clients and the server. Policy and value networks, replay buffers, and environment interactions remain strictly local to each client throughout training. This design preserves data privacy while decoupling federated synchronization from unstable policy optimization.

Table 1 summarizes the key architectural differences between the proposed framework and existing federated reinforcement learning approaches.

This comparison highlights how shifting federation from policy parameters to latent representations fundamentally changes the stability and applicability of federated reinforcement learning under task heterogeneity.

3.3. Overview of the Proposed Framework

Our framework is composed of two tightly coordinated yet decoupled stages: (i) dynamics-aware representation learning (see Figure 2) and (ii) local downstream policy learning(see Figure 3).

In the federated training stage, each client privately learns a structured state encoder

z (s)

and an action-conditioned latent dynamics module

z (s, a)

from offline trajectories using Robotics Priors and auxiliary dynamics objectives. Instead of sharing raw data or policy parameters, each client periodically transmits only a compact latent dynamics summary

e_{i}

together with its local encoder parameters to a central federated server. The server computes inter-client similarity through a dedicated similarity estimator, which normalizes client summaries, constructs a reference mean embedding, and derives similarity scores via cosine alignment. These scores are transformed into aggregation weights and used by a similarity-weighted FedAvg aggregator to produce a global encoder that selectively integrates knowledge from dynamically related tasks. The aggregated encoder is then broadcast back to all clients for the next local training round.

In the downstream control stage, the learned encoders are frozen and used as auxiliary inputs for model-free reinforcement learning. Each client independently trains its own actor–critic policy using local environment interactions, replay buffers, and rewards, without any further federated synchronization. This strict decoupling ensures that unstable policy optimization remains fully local, while federation is applied exclusively to representation learning, where it improves stability, transferability, and robustness under task heterogeneity.

3.4. Federated Representation Learning

3.4.1. Architecture of the Local Client Module

Figure 2 illustrates the local client architecture. Each client trains its representation model using offline trajectories consisting of a mixture of random and expert demonstrations, ensuring sufficient coverage of the state–action space while preserving task-relevant structure.

We learn deterministic state and state–action embeddings using parametric neural encoders. To impose physics-inspired structure on the latent space, we incorporate Robotics Priors [22], which encode inductive biases such as temporal coherence, proportionality, repeatability, and causality. These priors constitute the primary geometric training objectives for the state encoder during representation learning.

The state encoder outputs a latent state representation

z (s)

and is trained using the following objectives.

Temporal Coherence Prior:

L_{temp} = E [{({∥Δ z_{t}∥}_{1} exp (- α {∥a_{t}∥}_{1}))}^{2}] .

(1)

Proportionality Prior:

L_{prop} = E [{({∥Δ z_{t_{2}}∥}_{2} - {∥Δ z_{t_{1}}∥}_{2})}^{2} exp (- β {∥a_{t_{1}} - a_{t_{2}}∥}_{2}^{2})] .

(2)

Repeatability Prior:

L_{rep} = E [{∥Δ z_{t_{2}} - Δ z_{t_{1}}∥}_{2}^{2} exp (- {∥z_{t_{2}} - z_{t_{1}}∥}_{2}^{2}) exp (- β {∥a_{t_{1}} - a_{t_{2}}∥}_{2}^{2})] .

(3)

Causality Prior:

L_{caus} = E [exp (- {∥z_{t_{2}} - z_{t_{1}}∥}_{2}^{2}) exp (- β {∥a_{t_{1}} - a_{t_{2}}∥}_{2}^{2})] .

(4)

To prevent representational collapse and encourage sufficient dispersion across latent dimensions, we additionally incorporate a variance regularization term inspired by VICReg [61]. The overall training objective for the state encoder is defined as

L_{state} = w_{temp} L_{temp} + w_{caus} L_{caus} + w_{prop} L_{prop} + w_{rep} L_{rep} + w_{var} L_{var} .

(5)

The state–action encoder takes the state embedding

z (s)

together with the executed action a and produces a joint embedding

z (s, a)

. A stop-gradient operation is applied to

z (s)

to prevent dynamics learning objectives from distorting the geometric structure induced by the Robotics Priors.

The dynamics module is trained using a latent dynamics consistency loss that aligns the encoded next state representation

z (s^{'})

with the predicted next latent state

\hat{z} (s, a)

produced by the action-conditioned dynamics module for the current state and action:

L_{dyn} = E_{(s, a, s^{'})} [{∥z (s^{'}) - \hat{z} (s, a)∥}^{2}] .

(6)

To enforce explicit action conditioning in the dynamics head, a mismatched (“wrong”) action is constructed by randomly permuting actions within a minibatch. Let

\tilde{a}

denote the permuted action corresponding to a within the minibatch, and let

\cos (\cdot, \cdot)

denote cosine similarity. The wrong-action loss is defined as

L_{wa} = E_{(s, a, s^{'})} [- log (σ (\frac{\cos (\hat{z} (s, a), z (s^{'})) - \cos (\hat{z} (s, \tilde{a}), z (s^{'}))}{T}) + ϵ)],

(7)

where

σ (\cdot)

is the sigmoid function,

T > 0

is a temperature parameter, and

ϵ

is a small constant for numerical stability.

The full objective for the state–action encoder is

L_{SA} = w_{dyn} L_{dyn} + w_{wa} L_{wa} .

(8)

3.4.2. Similarity-Weighted Federated Aggregation

The federated server module in Figure 1 illustrates the federated aggregation procedure. At each federated round, each client computes a compact summary vector from a held-out validation set, instantiated as a second-order Gram fingerprint of its state–action embeddings. These summaries capture latent geometric structure while remaining robust to sign ambiguities and representation noise.

Let

e_{i} \in R^{m}

denote the summary vector reported by client i, and let N denote the number of participating clients. Each summary is normalized:

{\tilde{e}}_{i} = \frac{e_{i}}{∥ e_{i} ∥_{2}} .

(9)

A reference vector is constructed as the normalized mean of the client summaries:

\tilde{r} = \frac{\frac{1}{N} \sum_{i = 1}^{N} {\tilde{e}}_{i}}{{∥\frac{1}{N} \sum_{i = 1}^{N} {\tilde{e}}_{i}∥}_{2}} .

(10)

Each client receives a similarity score:

{sim}_{i} = 〈 {\tilde{e}}_{i}, \tilde{r} 〉 .

(11)

Aggregation weights are obtained via a temperature-scaled softmax and interpolated with a uniform prior:

w_{i} = α (\frac{exp ({sim}_{i} / T)}{\sum_{k = 1}^{N} exp ({sim}_{k} / T)}) + (1 - α) \frac{1}{N} .

(12)

The uniform interpolation mitigates overly sharp weighting and improves robustness during early training stages when latent representations are still evolving.

The similarity interpolation coefficient and aggregation temperature control the degree to which federated updates emphasize task similarity versus uniform averaging. The interpolation coefficient regulates the balance between similarity-weighted aggregation and equal-weight sharing, thereby controlling the strength of selective knowledge transfer. The temperature parameter was employed to regulate the sharpness of the similarity weighting. Lower temperature values result in a more peaked similarity distribution, such that aggregation weights become concentrated on highly aligned clients whose representations or learning dynamics exhibit strong similarity. Conversely, higher temperature values have a smoother weighting distribution. It promotes more uniform contributions across clients and approximate standard averaging behaviour. In this way, the temperature parameter overlooks the degree of selectivity imposed by the similarity mechanism, and meanwhile, the interpolation coefficient determines how strongly this selective weighting influences the final aggregated update. In practice, the proposed method demonstrated limited sensitive to moderate variations of these parameters within a reasonable operating range. The experimental results provided that the interpolation coefficient remained within a range that prevents dominance by a single client and the temperature avoided overly sharp weighting early in training. The final parameter values were determined through preliminary validation experiments designed to identify stable operating regions rather than task-specific optima. To ensure reproducibility and fairness across different task comparison, the selected values were fixed across all benchmark tasks and experimental settings. This design choice supports the general applicability of the proposed aggregation mechanism.

The global encoder parameters

θ

are computed via similarity-weighted federated averaging:

θ = \sum_{i = 1}^{N} w_{i} θ_{i} .

(13)

Algorithm 1 describes the federated training procedure for learning dynamics-aware representations. At each federated round, every client samples offline trajectories and updates its state encoder

z (s)

using Robotics Priors and variance regularization (Equation (5)). The action-conditioned dynamics module

z (s, a)

is then updated using the latent dynamics consistency and action-contrastive losses (Equation (8)). Each client computes a latent dynamics summary by evaluating the

z (s, a)

embeddings on a held-out validation set and transmits this summary together with its local encoder parameters to the server. The server estimates inter-client similarity from these summaries and uses the resulting weights to perform similarity-weighted aggregation of the encoder parameters. The aggregated global encoder is then broadcast back to all clients and used to initialize the next round of local training.

Algorithm 1 Federated Dynamics-Aware Representation Learning

Require: Clients

{C_{i}}_{i = 1}^{N}

, initial encoder parameters

θ^{0}

1:: for federated round $r = 1, \dots, R$ do
2:: for all clients i in parallel do
3:: Sample offline trajectories $D_{i}$
4:: Update state encoder $z (s; θ_{i})$ using Robotics Priors and variance regularization
5:: Update action-conditioned module $z (s, a)$ using latent dynamics consistency and action-contrastive losses
6:: Compute latent dynamics summary $e_{i}$
7:: Send $(θ_{i}, e_{i})$ to the server
8:: end for
9:: Compute similarity weights ${w_{i}}$ from ${e_{i}}$
10:: Aggregate global encoder $θ^{r} \leftarrow \sum_{i = 1}^{N} w_{i} θ_{i}$
11:: Broadcast $θ^{r}$ to all clients
12:: end for

Upon receiving the aggregated global encoder, each client initializes the next local training round by updating its encoder parameters with the broadcast model, using a convex interpolation to retain task-specific structure defined as follows:

θ_{i}^{(r + 1)} \leftarrow α θ^{(r)} + (1 - α) θ_{i}^{(r)},

(14)

where

θ^{(r)}

denotes the aggregated global encoder parameters at federated round r,

θ_{i}^{(r)}

denotes the local encoder parameters of client i prior to synchronization, and

θ_{i}^{(r + 1)}

denotes the updated local encoder used to initialize the next round of local training. The scalar

α \in [0, 1]

controls the interpolation between global and local parameters, with

α = 1

corresponding to a full overwrite by the global model and

α = 0

corresponding to retaining the previous local parameters.

3.5. Downstream Policy Optimization with Learned Representations

Figure 3 illustrates the downstream reinforcement learning stage. This component is decoupled from representation learning, with encoder parameters kept fixed. Each client trains an actor–critic agent using locally collected trajectories.

In our experiments, we employ TD3 as the downstream reinforcement learning algorithm. The policy and value networks receive raw states together with frozen state and state–action embeddings as auxiliary inputs. All remaining components of policy optimization, including target networks, replay buffers, and exploration strategies, follow standard TD3 practice.

The pseudocode for local policy learning with frozen representations is defined in Algorithm 2. Each client independently optimizes an actor–critic policy using locally collected environment interactions, while keeping the representation encoders fixed. This stage is fully decoupled from federated training: No parameters are exchanged between clients, and all replay buffers and updates remain local. By isolating policy optimization from federation, the framework avoids synchronizing unstable policy gradients across heterogeneous tasks while still exploiting the shared, dynamics-aware representations learned in the federated stage.

Algorithm 2 TD3 Policy Learning with Frozen Dynamics-Aware Representations

Require: Frozen encoders

(z (\cdot), z (\cdot, \cdot))

with parameters

θ

, actor

π_{ϕ}

, critics

Q_{ψ_{1}}, Q_{ψ_{2}}

1:: for all clients i do
2:: Initialize actor and twin critics; initialize target networks $π_{ϕ^{'}}, Q_{ψ_{1}^{'}}, Q_{ψ_{2}^{'}}$
3:: Initialize replay buffer $B_{i}$
4:: for each training step do
5:: Observe state $s_{t}$
6:: Compute state embedding $z_{t} = z (s_{t})$
7:: Select action with exploration noise:
8:: $a_{t} = π_{ϕ} (s_{t}, z_{t}) + ϵ$ , $ϵ \sim N (0, σ)$
9:: Execute $a_{t}$ , observe reward $r_{t}$ and next state $s_{t + 1}$
10:: Store transition $(s_{t}, a_{t}, r_{t}, s_{t + 1})$ in $B_{i}$
11:: if update step then
12:: Sample minibatch $(s, a, r, s^{'})$ from $B_{i}$
13:: Compute embeddings $z = z (s)$ and $z^{a} = z (s, a)$
14:: Compute target action with smoothing:
15:: $\tilde{a} = π_{ϕ^{'}} (s^{'}, z (s^{'})) + clip (ϵ^{'}, - c, c)$
16:: Compute target value:
17:: $y = r + γ min (Q_{ψ_{1}^{'}} (s^{'}, \tilde{a}, z (s^{'}, \tilde{a}), z (s^{'})), Q_{ψ_{2}^{'}} (s^{'}, \tilde{a}, z (s^{'}, \tilde{a}), z (s^{'})))$
18:: Update both critics $Q_{ψ_{1}}$ and $Q_{ψ_{2}}$ by minimizing the TD error between $Q_{ψ_{j}} (s, a, z^{a}, z)$ and target y, for $j \in {1, 2}$
19:: if delayed policy update then
20:: Actor update using $a_{ϕ} = π_{ϕ} (s, z)$ and $z_{ϕ}^{a} = z (s, a_{ϕ})$
21:: Update target networks with Polyak averaging
22:: end if
23:: end if
24:: end for
25:: end for

3.6. Scope and Extensions

The proposed framework is designed for federated reinforcement learning settings in which all clients share identical observation and action spaces and operate on robots with comparable morphology, while exhibiting partially overlapping transition dynamics. This setting reflects many practical multi-task robotic scenarios, such as a single manipulator performing diverse manipulation behaviours in a shared workspace, where task heterogeneity arises primarily from differences in contact dynamics, constraints, and reward functions rather than embodiment.

Within this scope, the framework assumes access to offline trajectories for representation learning at each client. These trajectories remain private and are never shared across clients, preserving data privacy while enabling federated alignment of latent dynamics representations. The approach is therefore most effective when sufficient offline data is available to learn stable dynamics-aware embeddings before downstream policy optimization.

As in most representation learning paradigms, the quality and diversity of the offline dataset influence the usefulness of the resulting embedding. Trajectories that insufficiently cover relevant state–action regions may bias the representation toward frequently observed transitions while under-representing rare but task-critical dynamics, potentially limiting transfer across tasks. This dependence on dataset coverage and distribution alignment is well documented in offline reinforcement learning and imitation learning, where insufficient diversity or distributional mismatch can lead to extrapolation errors and degraded generalization performance [62,63]. In our experiments, we mitigated this risk by combining random exploration with demonstration data to improve coverage of both exploratory transitions and task-relevant behaviours. Investigating how representation quality scales with dataset size, diversity, and exploration strategy is an important direction for future work and may be addressed through improved data collection strategies or adaptive refinement of representations during training.

Several natural extensions of the proposed framework are possible. First, extending F-DRL to heterogeneous embodiments (e.g., different robot arms, grippers, or kinematic structures) would require task-specific front-end encoders followed by a shared latent projection space, enabling alignment at the representation level while preserving embodiment-specific features. Second, the current formulation assumes proprioceptive state inputs; incorporating visual observations would require combining the proposed dynamics-aware objectives with perceptual encoders and cross-modal alignment mechanisms.

Third, the similarity-based aggregation mechanism could be made adaptive over training time. In early federated rounds, when representations are still evolving and similarity estimates are noisy, aggregation naturally approaches uniform averaging. As representations stabilize, increasing the influence of similarity-weighted aggregation may further improve selectivity and reduce negative transfer. Finally, extending the framework to continual and lifelong federated learning settings, where new tasks arrive over time, would allow investigation of how latent dynamics geometry can support incremental knowledge accumulation without catastrophic interference.

These extensions are conceptually orthogonal to the core framework and can be incorporated without modifying the decoupled representation–control architecture, making F-DRL a flexible foundation for scalable and robust federated robotic learning.

4. Experimental Evaluation and Discussion

4.1. Experimental Setup

We evaluate the proposed framework on a suite of robotic manipulation tasks from the MetaWorld benchmark [58]. The tasks considered include button-press-topdown, door-open, drawer-open, drawer-close, reach, window-open, and window-close. These tasks span a range of contact-rich and articulated manipulation behaviours, introducing heterogeneity in transition dynamics while sharing identical state and action spaces.

We consider a federated learning setting with

N = 7

clients, where each client corresponds to a distinct task instance. Each episode is capped at 500 environment steps. Clients perform 20 local representation learning epochs per federated round, and representation models are aggregated over 100 federated rounds. Offline datasets for representation learning consist of a mixture of random and expert trajectories and remain local to each client throughout training.

For downstream control, each task is trained for 200,000 environment interaction steps using three independent random seeds. Performance is evaluated using task success rate, which directly reflects successful task completion.

4.2. Representation Networks

The state encoder

E_{θ}

is implemented as a multilayer perceptron with two hidden layers of 64 units and ReLU activations, mapping the raw state to a latent embedding of dimensionality

D = 16

. The action-conditioned module

F_{ϕ}

is also a multilayer perceptron with two hidden layers of 64 units. It takes as input the concatenation of the state embedding and the executed action, i.e.,

[z (s), a]

, and outputs a predicted next-latent embedding

\hat{z} (s, a) \in R^{16}

. Both modules are trained using the Adam optimizer with a learning rate of

10^{- 4}

. During local representation learning, a batch size of 128 is used. Client summary representations for federated aggregation are computed by averaging Gram-based fingerprints over 50 validation batches of size 64.

4.3. Downstream Reinforcement Learning Networks

For policy learning, we employ a TD3-style actor–critic architecture augmented with frozen representation inputs. The actor network is implemented as a multilayer perceptron with two hidden layers of 256 units. It takes as input the raw state together with the frozen state embedding

z (s)

and outputs a continuous action via a tanh activation. The critic consists of twin Q-networks following the TD3 formulation. Each critic receives the raw state–action pair together with frozen state and state–action embeddings, i.e.,

(s, a, z (s), z (s, a))

, and outputs a scalar Q-value.

We use standard TD3 hyperparameters: discount factor

γ = 0.99

, target smoothing coefficient

τ = 0.005

, policy delay of 2, batch size 256, and Gaussian exploration noise with standard deviation 0.2. Encoder parameters remain fixed during downstream reinforcement learning.

4.4. Communication Overhead and Scalability

In our implementation, each client transmits only the state encoder parameters together with a compact summary vector derived from latent Gram statistics. With a latent dimensionality of (D = 16), the summary vector contains (D(D + 1)/2 = 136) floating-point values, corresponding to approximately 0.53 KB per communication round in float32 precision. The federated model consists solely of the state encoder, which contains 7760 parameters (approx. 30 KB in float32), while the dynamics head, policy networks, replay buffers, and trajectories remain local to each client. This results in approximately 31 KB of communication per client per round, corresponding to roughly 3 MB per client over 100 federated rounds. Even for seven participating tasks, the total communication volume remains on the order of a few tens of megabytes, highlighting the lightweight nature of the proposed federation scheme.

4.5. Baselines and Evaluation Protocol

We compare the proposed method against the following baselines. All baselines use the same offline datasets, downstream reinforcement learning algorithm, network architectures, optimization hyperparameters, and random seeds; only the representation learning and aggregation mechanisms differ.

Centralized: Offline trajectories from all tasks are pooled to train a single state encoder and state–action encoder without privacy constraints. Training proceeds for 20 epochs per round over 100 rounds and serves as an upper bound on representation sharing.
FedAvg: Client state encoder parameters are aggregated using equal weights for each client at each federated round following the standard FedAvg procedure. All other training components are identical to the proposed method.
Local End-to-End Training: Representation learning and policy optimization are trained jointly for each task in an end-to-end manner, without federated aggregation.
Raw States Only:Policies are trained directly on raw state observations without any learned representation or auxiliary embedding signals.

4.6. Downstream Reinforcement Learning Performance

Figure 4 reports downstream reinforcement learning performance across the seven MetaWorld manipulation tasks. Although representation learning methods are often evaluated using latent-space proxy metrics, such measures do not necessarily correlate with control performance. We therefore assess learned representations solely through their impact on downstream reinforcement learning.

Across all tasks, the proposed method achieves performance that is competitive with centralized training and standard FedAvg, while consistently outperforming local end-to-end training and raw-state baselines. In tasks with relatively simple dynamics and limited heterogeneity, such as reach, performance differences between federated methods are minimal, indicating that uniform aggregation is sufficient when client representations are naturally aligned. In contrast, tasks involving articulated objects and constrained motion, including drawer-open, drawer-close, and window-close, exhibit clearer benefits from representation-aligned aggregation.

4.7. Stability and Effect of Task Heterogeneity

A consistent trend across tasks is that the proposed method exhibits more stable learning dynamics than FedAvg and, in most cases, than local end-to-end training.

While final success rates are often similar, the proposed approach produces smoother learning curves with reduced variance across random seeds. This effect is particularly pronounced in longer-horizon manipulation tasks, where heterogeneous dynamics can induce conflicting gradient updates under naive aggregation. Table 2 quantitatively summarizes stability metrics for representative heterogeneous tasks. Stability metrics are computed over the first 70% of evaluation checkpoints, corresponding to the primary learning phase prior to performance saturation.

Stability metrics are reported only for tasks exhibiting non-trivial inter-task heterogeneity; for tasks that saturate early across all methods, variance-based measures are not informative and are therefore omitted.

4.7.1. Mean Seed Standard Deviation

To quantify training stability across random initializations, we report the mean seed standard deviation, defined as the average standard deviation of task success rates across random seeds, computed at each evaluation checkpoint and averaged over a fixed training window. Lower values indicate more consistent and reproducible learning dynamics.

4.7.2. Integrated Dispersion (AUC of Seed Standard Deviation)

In addition to the mean seed standard deviation, we report the temporal integral of the across-seed standard deviation curve. Specifically, at each evaluation checkpoint we compute the standard deviation of success across random seeds, and then sum these values over the primary learning phase. This metric reflects the cumulative magnitude of divergence between learning trajectories over time. Lower values indicate that seeds remain closer throughout training, corresponding to more stable optimization dynamics.

To enhance statistical transparency, it is clarified that all reported performance and stability metrics were computed across independent random seeds, each corresponding to an independent training trajectory incorporating stochastic initialisation and sampling effects. Confidence intervals were estimated from the variability observed across these independent tests. Our primary objective is to characterize robustness of learning dynamics across random initializations, which is more directly captured by variability in learning trajectories than by statistical tests restricted to equality of final performance. Given the limited number of independent runs typically feasible in reinforcement learning experiments, we therefore reported dispersion-based statistics that quantify variability across seeds rather than relying on formal hypothesis tests whose assumptions may be difficult to justify in this setting. We further observed that reductions in variability appear consistently across seeds and across multiple tasks, suggesting that the reported trends are unlikely to be driven by individual outlier runs.

Importantly, the proposed method does not uniformly outperform FedAvg across all tasks. In settings with low inter-task heterogeneity, uniform aggregation performs competitively, suggesting that selective weighting is not always necessary. However, as task dynamics diverge, similarity-weighted aggregation becomes increasingly beneficial, mitigating negative transfer between incompatible tasks. This adaptive behaviour is desirable in federated robotic settings, where task similarity cannot be assumed a priori.

The observed performance trends are consistent with the behaviour of the representation-aligned aggregation mechanism. Clients whose latent spaces exhibit similar geometric structure naturally receive comparable aggregation weights, while clients with divergent dynamics are softly down-weighted. By relying on second-order latent geometry rather than first-order statistics, the aggregation process remains robust to representation noise and sign ambiguities.

4.8. Analysis of Similarity-Weighted Federated Aggregation

To interpret the behaviour of the proposed similarity-weighted aggregation strategy, we analyse the evolution of the server-side client weights across federated rounds. During training, the server logs the aggregation weights

{w_{i}^{(r)}}_{i = 1}^{N}

at each round r (see Section 3 for the weight computation). For analysis, we load the per-round weight vectors saved by the server (one file per round) and stack them into a matrix

W \in R^{R \times N}

, where

W_{r, i} = w_{i}^{(r)}

denotes the contribution of client i at federated round r.

For interpretability, we normalise weights within each round such that

\sum_{i = 1}^{N} W_{r, i} = 1

and sort clients by their mean weight across rounds, i.e., in descending order of

\frac{1}{R} \sum_{r = 1}^{R} W_{r, i}

. Figure 5 visualises the resulting matrix as a heat map, where rows correspond to tasks (clients) and columns correspond to federated rounds.

The heat map shows that aggregation is not uniform and exhibits structured, smooth evolution over training. In particular, a subset of tasks consistently receives higher weights, indicating that the similarity estimator assigns greater influence to clients whose latent dynamics summaries are more aligned with the global reference. This provides direct evidence that F-DRL performs selective representation sharing rather than uniform averaging, supporting the claim that similarity-weighted aggregation mitigates negative transfer under task heterogeneity.

4.9. Discussion

Overall, the experimental results indicate that representation-aligned aggregation enables stable and effective knowledge sharing in federated reinforcement learning, particularly under task heterogeneity. Importantly, the primary benefit of the proposed approach does not lie in uniformly improving peak task performance, but rather in shaping the learning dynamicsthemselves. Across heterogeneous manipulation tasks, similarity-weighted aggregation consistently reduces variance across random seeds and produces smoother, more predictable learning curves, indicating improved optimization stability and reproducibility.

This behaviour highlights a key distinction between representation-level and policy-level federation. While uniform aggregation (FedAvg) can be effective when client tasks are naturally aligned, it becomes fragile when transition dynamics diverge, as conflicting gradients are averaged indiscriminately. By contrast, the proposed method selectively aligns clients based on latent dynamics similarity, allowing shared structure to be exploited while softly down-weighting incompatible tasks. The resulting effect is not aggressive transfer, but controlled sharing, which mitigates negative transfer without suppressing task-specific specialization.

The task-wise analysis further reveals that the method is neutral when heterogeneity is low and beneficial when heterogeneity is high. In simple tasks such as reach, where representations are naturally similar, all aggregation strategies perform comparably. However, in articulated and contact-rich tasks such as drawer-open, drawer-close, and window-close, where dynamics differ more substantially, representation-aligned aggregation tends to produce more stable learning relative to uniform aggregation. Importantly, performance within contact-rich settings depends on the degree of shared transition structure: Tasks that involve similar contact dynamics benefit from representation alignment, while tasks with substantially different contact regimes exhibit weaker gains. This finding suggests that explicitly grouping clients by latent dynamics similarity, rather than aggregating across all clients uniformly, may further improve robustness and motivates clustering-based federation strategies as a direction for future work. This adaptive behaviour is desirable in realistic robotic deployments, where task similarity is rarely known a priori and may evolve over time.

The aggregation weight analysis provides direct evidence that the server does not converge to uniform averaging, but instead learns a structured weighting scheme that remains stable across rounds. The persistence of higher weights for a subset of tasks suggests that the learned latent geometry captures meaningful dynamical overlap, supporting the central hypothesis that dynamics-aware representations form a suitable substrate for federated knowledge sharing. Crucially, this selectivity emerges without explicit task labels, reward information, or policy synchronization, reinforcing the value of representation-centric federation.

From a broader perspective, these results suggest that stability, rather than peak performance, should be treated as a first-class objective in federated reinforcement learning. In safety-critical and long-horizon robotic settings, predictable and reproducible learning dynamics are often more valuable than marginal gains in final success rate. The proposed framework offers a practical mechanism to achieve this by decoupling representation alignment from control optimization and grounding aggregation decisions in latent dynamics geometry rather than parameter similarity.

The proposed framework is formulated under the assumption that participating clients share compatible state and action interfaces, while allowing their underlying dynamics to differ. This assumption reflects a common and practically relevant deployment scenarios in practical robotic systems, especially in industrial and laboratory settings where multiple robots operate under the same control specification but interact with different objects, environments, or task instances. In many practical cases, fleets of manipulators with identical control inputs may experience varying contact conditions, payloads, or workspace layouts, leading to heterogeneous transition dynamics despite shared observation and action spaces. The proposed formulation therefore targets the realistic regime of shared-interface, heterogeneous-dynamics federation, rather than fully cross-embodiment transfer. These extensions introduce additional modelling assumptions and optimisation complexities that fall outside the scope of the present work. Cross-embodiment federation is recognised as a promising and important direction for future research, particularly in the context of scalable heterogeneous robotic collectives.

5. Conclusions and Future Work

In this work, we introduced a dynamics-aware federated representation learning framework for robotic control that combines Robotics Priors, action-conditioned embeddings, and similarity-based aggregation. By aligning representations based on latent geometric structure rather than uniform parameter averaging, our approach enables stable and selective knowledge sharing across heterogeneous tasks while preserving a model-free downstream reinforcement learning setup. Empirical results on a suite of MetaWorld manipulation tasks demonstrate that the proposed method achieves competitive performance with improved training stability under task heterogeneity.

Despite these improvements, our approach has limitations. The computation of client summary representations introduces additional communication overhead, and similarity estimates can be noisy during early training when representations are still evolving. In such cases, aggregation weights tend to be close to uniform, effectively recovering FedAvg behaviour.

Future work may explore adaptive aggregation schedules that gradually increase the influence of similarity-based weighting as representations stabilize, as well as uncertainty-aware weighting strategies that explicitly account for confidence in client summaries. Extending the framework to more diverse robotic platforms and real-world deployments, and investigating tighter connections between representation geometry and theoretical generalization guarantees, are also promising directions.

Author Contributions

Conceptualization, A.U.; methodology, A.U.; software, A.U.; validation, A.U.; formal analysis, A.U. and X.L.; investigation, A.U.; resources, A.U.; data curation, A.U.; writing—original draft preparation, A.U.; writing—review and editing, A.U., X.L., Y.J., J.L. and Y.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are generated from publicly available benchmark environments (MetaWorld). The processed data and experimental logs are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602. [Google Scholar] [CrossRef]
Wang, X.; Wang, S.; Liang, X.; Zhao, D.; Huang, J.; Xu, X.; Dai, B.; Miao, Q. Deep Reinforcement Learning: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 5064–5078. [Google Scholar] [CrossRef] [PubMed]
Pitkevich, A.; Makarov, I. A Survey on Sim-to-Real Transfer Methods for Robotic Manipulation. In Proceedings of the 2024 IEEE 22nd Jubilee International Symposium on Intelligent Systems and Informatics (SISY), Pula, Croatia, 19–21 September 2024; pp. 259–266. [Google Scholar] [CrossRef]
Bengio, Y.; Courville, A.; Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828. [Google Scholar] [CrossRef]
Botteghi, N.; Poel, M.; Brune, C. Unsupervised Representation Learning in Deep Reinforcement Learning: A Review. IEEE Control Syst. 2025, 45, 26–68. [Google Scholar] [CrossRef]
Echchahed, A.; Castro, P.S. A Survey of State Representation Learning for Deep Reinforcement Learning. Trans. Mach. Learn. Res. 2025. Available online: https://openreview.net/forum?id=gOk34vUHtz (accessed on 25 February 2026).
Jonschkowski, R.; Brock, O. Learning state representations with robotic priors. Auton. Robot. 2015, 39, 407–428. [Google Scholar] [CrossRef]
Wang, Z.; Schaul, T.; Hessel, M.; van Hasselt, H.; Lanctot, M.; de Freitas, N. Dueling Network Architectures for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), New York, NY, USA, 20–22 June 2016; pp. 1995–2003. [Google Scholar]
van Hasselt, H.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI 2016), Phoenix, AZ, USA, 12–17 February 2016; pp. 2094–2100. [Google Scholar]
Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. arXiv 2016, arXiv:1511.05952. [Google Scholar] [CrossRef]
Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Lille, France, 7–9 July 2015; pp. 1889–1897. [Google Scholar]
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; Abbeel, P. High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv 2018, arXiv:1506.02438. [Google Scholar] [CrossRef]
Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2019, arXiv:1509.02971. [Google Scholar] [PubMed]
Fujimoto, S.; van Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor–Critic Methods. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596. [Google Scholar]
Jonschkowski, R.; Brock, O. State Representation Learning in Robotics: Using Prior Knowledge about Physical Interaction. In Proceedings of the Robotics: Science and Systems, Berkeley, CA, USA, 12–16 July 2014. [Google Scholar] [CrossRef]
Jonschkowski, R.; Hafner, R.; Scholz, J.; Riedmiller, M. Pves: Position-velocity encoders for unsupervised learning of structured state representations. arXiv 2017, arXiv:1705.09805. [Google Scholar] [CrossRef]
Raffin, A.; Höfer, S.; Jonschkowski, R.; Brock, O.; Stulp, F. Unsupervised Learning of State Representations for Multiple Tasks. 2017. Available online: https://openreview.net/forum?id=r1aGWUqgg (accessed on 25 February 2026).
Lesort, T.; Seurin, M.; Li, X.; Díaz-Rodríguez, N.; Filliat, D. Deep Unsupervised State Representation Learning with Robotic Priors: A Robustness Analysis. In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2019), Budapest, Hungary, 14–19 July 2019; pp. 1–8. [Google Scholar] [CrossRef]
Morik, M.; Rastogi, D.; Jonschkowski, R.; Brock, O. State Representation Learning with Robotic Priors for Partially Observable Environments. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2019), Macau, China, 3–8 November 2019; pp. 6693–6699. [Google Scholar] [CrossRef]
Botteghi, N.; Obbink, R.; Geijs, D.; Poel, M.; Sirmacek, B.; Brune, C.; Mersha, A.; Stramigioli, S. Low Dimensional State Representation Learning with Reward-Shaped Priors. In Proceedings of the International Conference on Pattern Recognition (ICPR 2020), Milan, Italy, 10–15 January 2021; pp. 3736–3743. [Google Scholar]
Botteghi, N.; Alaa, K.; Poel, M.; Sirmacek, B.; Brune, C.; Mersha, A.; Stramigioli, S. Low Dimensional State Representation Learning with Robotics Priors in Continuous Action Spaces. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 190–197. [Google Scholar] [CrossRef]
Munk, J.; Kober, J.; Babuška, R. Learning state representation for deep actor-critic control. In Proceedings of the 2016 IEEE 55th Conference on Decision and Control (CDC), Las Vegas, NV, USA, 12–14 December 2016; pp. 4667–4673. [Google Scholar] [CrossRef]
Zhang, A.; Satija, H.; Pineau, J. Decoupling Dynamics and Reward for Transfer Learning. 2018. Available online: https://openreview.net/forum?id=H1aoddyvM (accessed on 25 February 2026).
Gelada, C.; Kumar, S.; Buckman, J.; Nachum, O.; Bellemare, M.G. DeepMDP: Learning Continuous Latent Space Models for Representation Learning. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), Long Beach, CA, USA, 9–15 June 2019; pp. 2170–2179. [Google Scholar]
Fujimoto, S.; Chang, W.D.; Smith, E.J.; Gu, S.S.; Precup, D.; Meger, D. For SALE: State-action representation learning for deep reinforcement learning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 10–16 December 2023. [Google Scholar]
Fujimoto, S.; D’Oro, P.; Zhang, A.; Tian, Y.; Rabbat, M. Towards General-Purpose Model-Free Reinforcement Learning. In Proceedings of the The Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
van Hoof, H.; Chen, N.; Karl, M.; van der Smagt, P.; Peters, J. Stable reinforcement learning with autoencoders for tactile and visual data. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Republic of Korea, 9–14 October 2016; pp. 3928–3934. [Google Scholar] [CrossRef]
Lee, A.X.; Nagabandi, A.; Abbeel, P.; Levine, S. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. Adv. Neural Inf. Process. Syst. 2020, 33, 741–752. [Google Scholar]
Zintgraf, L.; Schulze, S.; Lu, C.; Feng, L.; Igl, M.; Shiarlis, K.; Gal, Y.; Hofmann, K.; Whiteson, S. VariBAD: Variational bayes-adaptive deep RL via meta-learning. J. Mach. Learn. Res. 2021, 22, 1–39. [Google Scholar]
Finn, C.; Tan, X.Y.; Duan, Y.; Darrell, T.; Levine, S.; Abbeel, P. Deep Spatial Autoencoders for Visuomotor Learning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2016), Stockholm, Sweden, 16–21 May 2016; pp. 512–519. [Google Scholar]
Schrittwieser, J.; Antonoglou, I.; Hubert, T.; Simonyan, K.; Sifre, L.; Schmitt, S.; Guez, A.; Lockhart, E.; Hassabis, D.; Graepel, T.; et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature 2020, 588, 604–609. [Google Scholar] [CrossRef] [PubMed]
Schrittwieser, J.; Hubert, T.; Mandhane, A.; Barekatain, M.; Antonoglou, I.; Silver, D. Online and offline reinforcement learning by planning with a learned model. Adv. Neural Inf. Process. Syst. 2021, 34, 27580–27591. [Google Scholar]
Ye, W.; Liu, S.; Kurutach, T.; Abbeel, P.; Gao, Y. Mastering atari games with limited data. Adv. Neural Inf. Process. Syst. 2021, 34, 25476–25488. [Google Scholar]
Hansen, N.; Su, H.; Wang, X. Td-mpc2: Scalable, robust world models for continuous control. arXiv 2023, arXiv:2310.16828. [Google Scholar]
Watter, M.; Springenberg, J.T.; Boedecker, J.; Riedmiller, M. Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images. In Proceedings of the Neural Information Processing Systems (NeurIPS 2015), Montreal, QC, Canada, 7–12 December 2015; pp. 2746–2754. [Google Scholar]
Karl, M.; Soelch, M.; Bayer, J.; Van der Smagt, P. Deep variational bayes filters: Unsupervised learning of state space models from raw data. arXiv 2016, arXiv:1605.06432. [Google Scholar]
Hafner, D.; Pasukonis, J.; Ba, J.; Lillicrap, T.P. Mastering Diverse Domains through World Models. arXiv 2023, arXiv:2301.04104. [Google Scholar]
Qi, J.; Zhou, Q.; Lei, L.; Zheng, K. Federated reinforcement learning: Techniques, applications, and open challenges. Intell. Robot. 2021, 1, 18–57. [Google Scholar] [CrossRef]
Jin, H.; Peng, Y.; Yang, W.; Wang, S.; Zhang, Z. Federated Reinforcement Learning with Environment Heterogeneity. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS 2022), Valencia, Spain, 28–30 March 2022; pp. 18–37. [Google Scholar]
Tehrani, P.; Restuccia, F.; Levorato, M. Federated Deep Reinforcement Learning for the Distributed Control of NextG Wireless Networks. In Proceedings of the 2021 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN), Online, 13–15 December 2021; pp. 248–253. [Google Scholar]
Wong, Y.J.; Tham, M.L.; Kwan, B.H.; Owada, Y. FedDdrl: Federated Double Deep Reinforcement Learning for Heterogeneous IoT with Adaptive Early Client Termination and Local Epoch Adjustment. Sensors 2023, 23, 2494. [Google Scholar] [CrossRef]
Rengarajan, D.; Ragothaman, N.; Kalathil, D.; Shakkottai, S. Federated Ensemble-Directed Offline Reinforcement Learning. 2024. Available online: https://openreview.net/forum?id=XdSYtriYfI (accessed on 25 February 2026).
Woo, J.; Shi, L.; Joshi, G.; Chi, Y. Federated offline reinforcement learning: Collaborative single-policy coverage suffices. arXiv 2024, arXiv:2402.05876. [Google Scholar] [CrossRef]
Zhao, H.; Li, Y.; Pang, Z.; Ma, Z. Federated Multi-Agent DRL for Task Offloading in Vehicular Edge Computing. Electronics 2025, 14, 3501. [Google Scholar] [CrossRef]
Na, S.; Rouček, T.; Ulrich, J.; Pikman, J.; Krajník, T.; Lennox, B.; Arvin, F. Federated Reinforcement Learning for Collective Navigation of Robotic Swarms. IEEE Trans. Cogn. Dev. Syst. 2023, 15, 2122–2131. [Google Scholar] [CrossRef]
Yuan, Z.; Xu, S.; Zhu, M. Federated reinforcement learning for robot motion planning with zero-shot generalization. Automatica 2024, 166, 111709. [Google Scholar] [CrossRef]
Liu, B.; Wang, L.; Liu, M. Lifelong federated reinforcement learning: A learning architecture for navigation in cloud robotic systems. IEEE Robot. Autom. Lett. 2019, 4, 4555–4562. [Google Scholar] [CrossRef]
An, X.; Lin, Y.; Lin, M.; Wu, C.; Murase, T.; Ji, Y. Federated Reinforcement Learning Framework for Mobile Robot Navigation Using ROS and Gazebo. IEEE Internet Things Mag. 2025, 8, 45–51. [Google Scholar] [CrossRef]
Lu, S.; Cai, Y.; Liu, Z.; Lian, Y.; Chen, L.; Wang, H. A Preference-Based Multi-Agent Federated Reinforcement Learning Algorithm Framework for Trustworthy Interactive Urban Autonomous Driving. IEEE Trans. Intell. Transp. Syst. 2025, 26, 10131–10145. [Google Scholar] [CrossRef]
Fu, Y.; Li, C.; Yu, F.R.; Luan, T.H.; Zhang, Y. A Selective Federated Reinforcement Learning Strategy for Autonomous Driving. IEEE Trans. Intell. Transp. Syst. 2023, 24, 1655–1668. [Google Scholar] [CrossRef]
Hafid, A.; Hocine, R.; Guezouli, L.; Abdessemed, M.R. Centralized and Decentralized Federated Learning in Autonomous Swarm Robots: Approaches, Algorithms, Optimization Criteria and Challenges: The Sixth Edition of International Conference on Pattern Analysis and Intelligent Systems (PAIS’24). In Proceedings of the 2024 6th International Conference on Pattern Analysis and Intelligent Systems (PAIS), El Oued, Algeria, 24–25 April 2024; pp. 1–8. [Google Scholar] [CrossRef]
Gutierrez, G.M.; Rincon, J.A.; Julian, V. Federated Learning for Collaborative Robotics: A ROS 2-Based Approach. Electronics 2025, 14, 1323. [Google Scholar] [CrossRef]
Kong, X.; Peng, L.; Rahim, S. Dynamic Service Function Chain (SFC) Deployment for Autonomous Intelligent Systems. IEEE Trans. Consum. Electron. 2025, 71, 10776–10785. [Google Scholar] [CrossRef]
Wang, Y.; Zhong, S.; Yuan, T. Grasp Control Method for Robotic Manipulator Based on Federated Reinforcement Learning. In Proceedings of the 2024 7th International Conference on Advanced Algorithms and Control Engineering (ICAACE), Shanghai, China, 1–3 March 2024; pp. 1513–1519. [Google Scholar] [CrossRef]
Yue, S.; Hua, X.; Deng, Y.; Chen, L.; Ren, J.; Zhang, Y. Momentum-Based Contextual Federated Reinforcement Learning. IEEE Trans. Netw. 2025, 33, 865–880. [Google Scholar] [CrossRef]
Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
McLean, R.; Chatzaroulas, E.; McCutcheon, L.; Röder, F.; Yu, T.; He, Z.; Zentner, K.; Julian, R.; Terry, J.K.; Woungang, I.; et al. Meta-World+: An Improved, Standardized, RL Benchmark. In Proceedings of the The Thirty-Ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, San Diego, CA, USA, 2–7 December 2025. [Google Scholar]
Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Nitin Bhagoji, A.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and Open Problems in Federated Learning. Found. Trends Mach. Learn. 2021, 14, 1–210. [Google Scholar] [CrossRef]
McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS 2017), Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282. [Google Scholar]
Bardes, A.; Ponce, J.; LeCun, Y. VICReg: Variance–Invariance–Covariance Regularization for Self-Supervised Learning. In Proceedings of the International Conference on Learning Representations (ICLR 2022). Available online: https://openreview.net/forum?id=xm6YD62D1Ub (accessed on 25 February 2026).
Schweighofer, K.; Dinu, M.-C.; Radler, A.; Hofmarcher, M.; Patil, V.P.; Bitto-Nemling, A.; Eghbal-Zadeh, H.; Hochreiter, S. A Dataset Perspective on Offline Reinforcement Learning. In Proceedings of the Conference on Lifelong Learning Agents (CoLLAs 2022), Montreal, QC, Canada, 22–24 August 2022; pp. 470–517. [Google Scholar]
Belkhale, S.; Cui, Y.; Sadigh, D. Data quality in imitation learning. Adv. Neural Inf. Process. Syst. 2023, 36, 80375–80395. [Google Scholar]

Figure 1. Overview of the proposed F-DRL framework.

Figure 2. Local Client Architecture.

Figure 3. Local policy learning.

Figure 4. Downstream reinforcement learning performance across seven MetaWorld manipulation tasks. Panels show task success rate as a function of environment steps for all the seven tasks. Curves are averaged over three random seeds and shaded regions denote 95% confidence intervals.

Figure 5. Evolution of task-wise aggregation weights across federated rounds.

Table 1. Comparison between existing federated reinforcement learning frameworks and the proposed F-DRL framework.

Aspect	Existing FRL	F-DRL (Ours)
Federated object	Policy/value networks	Representation encoder
Shared signal	Control parameters	Latent dynamics embeddings
Aggregation strategy	Uniform/proximal averaging	Similarity-weighted aggregation
Sensitivity to task heterogeneity	High	Reduced
Training stability focus	Implicit	Explicit
Local control optimisation	Federated	Fully local
Privacy exposure	Policy gradients	Latent summaries only
Extensibility	Limited	Modular

Table 2. Stability metrics for heterogeneous MetaWorld tasks. Arrows indicate the preferred direction of each metric, where ↓ denotes that lower values are better.

Task	Method	Final Success (%)	Mean Seed Std ↓	AUC of Seed Std ↓
Drawer-Open	Centralized	100.00	7.57	211.83
	FedAvg	99.79	20.11	562.94
	Local	79.43	18.93	530.10
	Ours	98.65	7.72	216.12
Window-Open	Centralized	100.00	10.11	283.04
	FedAvg	100.00	19.06	533.67
	Local	100.00	3.97	111.10
	Ours	100.00	6.49	181.79

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Upadhyay, A.; Lu, X.; Baradaranshokouhi, Y.; Li, J.; Jing, Y. F-DRL: Federated Dynamics Representation Learning for Robust Multi-Task Reinforcement Learning. Information 2026, 17, 232. https://doi.org/10.3390/info17030232

AMA Style

Upadhyay A, Lu X, Baradaranshokouhi Y, Li J, Jing Y. F-DRL: Federated Dynamics Representation Learning for Robust Multi-Task Reinforcement Learning. Information. 2026; 17(3):232. https://doi.org/10.3390/info17030232

Chicago/Turabian Style

Upadhyay, Anurag, Xin Lu, Yashar Baradaranshokouhi, Jun Li, and Yanguo Jing. 2026. "F-DRL: Federated Dynamics Representation Learning for Robust Multi-Task Reinforcement Learning" Information 17, no. 3: 232. https://doi.org/10.3390/info17030232

APA Style

Upadhyay, A., Lu, X., Baradaranshokouhi, Y., Li, J., & Jing, Y. (2026). F-DRL: Federated Dynamics Representation Learning for Robust Multi-Task Reinforcement Learning. Information, 17(3), 232. https://doi.org/10.3390/info17030232

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

F-DRL: Federated Dynamics Representation Learning for Robust Multi-Task Reinforcement Learning

Abstract

1. Introduction

Contributions

2. Related Work

2.1. Deep Reinforcement Learning and Sample Inefficiency

2.2. Representation Learning for Reinforcement Learning

2.2.1. Robotics Priors and Physical Constraints

2.2.2. Dynamics Modelling via Self-Prediction

2.3. Federated Reinforcement Learning

3. Framework and Methodology

3.1. Design Rationale and Framework Overview

3.2. Federated Learning Setting and Assumptions

3.3. Overview of the Proposed Framework

3.4. Federated Representation Learning

3.4.1. Architecture of the Local Client Module

3.4.2. Similarity-Weighted Federated Aggregation

3.5. Downstream Policy Optimization with Learned Representations

3.6. Scope and Extensions

4. Experimental Evaluation and Discussion

4.1. Experimental Setup

4.2. Representation Networks

4.3. Downstream Reinforcement Learning Networks

4.4. Communication Overhead and Scalability

4.5. Baselines and Evaluation Protocol

4.6. Downstream Reinforcement Learning Performance

4.7. Stability and Effect of Task Heterogeneity

4.7.1. Mean Seed Standard Deviation

4.7.2. Integrated Dispersion (AUC of Seed Standard Deviation)

4.8. Analysis of Similarity-Weighted Federated Aggregation

4.9. Discussion

5. Conclusions and Future Work

Author Contributions

Funding

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI