Article

Federated Decision Transformers for Scalable Reinforcement Learning in Smart City IoT Systems

by Laila AlTerkawi * and Mokhled AlTarawneh
Computer Engineering and Cybersecurity Department, College of Engineering and Computing, International University of Kuwait (IUK), Ardiya 92400, Kuwait
* Author to whom correspondence should be addressed.
Future Internet 2025, 17(11), 492; https://doi.org/10.3390/fi17110492
Submission received: 17 September 2025 / Revised: 12 October 2025 / Accepted: 16 October 2025 / Published: 27 October 2025
(This article belongs to the Special Issue Internet of Things (IoT) in Smart City)

Abstract

The rapid proliferation of devices on the Internet of Things (IoT) in smart city environments enables autonomous decision-making, but introduces challenges of scalability, coordination, and privacy. Existing reinforcement learning (RL) methods, such as Multi-Agent Actor–Critic (MAAC), depend on centralized critics and recurrent structures, which limit scalability and create single points of failure. This paper proposes a Federated Decision Transformer (FDT) framework that integrates transformer-based sequence modeling with federated learning. By replacing centralized critics with self-attention-driven trajectory modeling, the FDT preserves data locality, enhances privacy, and supports decentralized policy learning across distributed IoT nodes. We benchmarked the FDT against MAAC in a mobile edge computing (MEC) environment with identical hyperparameter configurations. The results demonstrate that the FDT achieves superior reward efficiency, scalability, and adaptability in dynamic IoT networks, although with slightly higher variance during early training. These findings highlight transformer-based federated RL as a robust and privacy-preserving alternative to critic-based methods for large-scale IoT systems.

1. Introduction

The rapid growth of smart city technologies has increased the reliance on IoT devices to support transportation, surveillance, energy management, and other critical urban operations [1,2,3,4,5,6,7]. These distributed systems generate large amounts of sequential data and require intelligent and adaptive methods for real-time decision-making. RL offers a promising framework for such tasks, yet traditional actor–critic and value-based approaches struggle with two persistent challenges: capturing long-term temporal dependencies and scaling across large heterogeneous IoT networks [8,9,10,11].
MAAC methods were developed to improve coordination among agents in complex settings [12,13,14]. However, MAAC retains a centralized critic that constrains scalability and creates a single point of failure. Its reliance on recurrent architectures further limits effectiveness as vanishing gradients hinder long-horizon reasoning. These shortcomings reduce the suitability of MAAC for high-dimensional dynamic smart city applications where both privacy and adaptability are essential [15,16].
Recent advances in deep learning, particularly transformer architectures, have reshaped sequential modeling. Using self-attention, transformers capture long-range dependencies more effectively than RNNs or LSTMs [17,18,19]. Decision Transformers (DTs) extend this capability to reinforcement learning by framing trajectory optimization as a sequence modeling problem [20,21,22]. DTs generate actions conditioned on the return-to-go, enabling efficient, value-free policy learning and improved generalization across tasks [20,23,24]. However, most DT frameworks assume centralized training and do not address privacy or scalability in federated contexts [25,26,27].
In this work, we propose an FDT that integrates transformer-based sequence modeling with a federated learning paradigm [28,29]. This framework eliminates the centralized critic, enabling multiple agents to collaboratively learn policies while keeping data local. By combining self-attention with federated aggregation, the FDT improves scalability, robustness, and privacy preservation in distributed IoT environments [30,31].
Our contributions are threefold:
  • We design a federated reinforcement learning framework that embeds Decision Transformers in place of MAAC’s centralized critic, capturing long-horizon dependencies without relying on recurrent networks [32,33].
  • We demonstrate how the FDT supports horizontal scalability and privacy by enabling decentralized policy learning through federated aggregation of local updates [34].
  • We empirically benchmark the FDT against MAAC in mobile edge computing simulations, showing improved reward efficiency, adaptability, and scalability, with a trade-off of slightly higher variance during early training [35,36].
The remainder of this paper is organized as follows. Section 2 reviews previous work on reinforcement learning in smart city applications. Section 3 introduces the proposed Federated Decision Transformer framework and its algorithms, followed in Section 4 by a conceptual comparison with the MAAC baseline. Section 5 outlines the experimental environment and evaluation protocol, and Section 6 presents and discusses the empirical results. Finally, Section 7 summarizes the main findings and outlines future research directions.

2. Related Work

2.1. Federated Learning in Smart Cities

Federated learning (FL) has emerged as a practical alternative to centralized training for privacy-sensitive urban analytics across domains such as transportation, energy, and public safety. Surveys highlight key challenges, including statistical heterogeneity, client drift, and communication costs, that affect real deployments [3,4,5,6]. Beyond surveys, concrete case studies illustrate the promise of FL: Bao et al. [7,37] applied federated RL for traffic signal control, achieving reduced congestion and improved travel times while preserving privacy by keeping intersection data local. Similarly, Yu et al. [11] demonstrated how federated RL supports resource allocation in multi-access edge computing, while Wang et al. [10] highlighted the role of in-edge AI for joint caching, communication, and learning. Recent work also integrates fairness and adversarial robustness into edge FL frameworks [15], underscoring the need for scalable and secure collaborative training in IoT-driven smart cities.

2.2. Reinforcement Learning for Autonomous Decision-Making

Reinforcement learning (RL) has proven to be an effective methodology for autonomous decision-making across a range of dynamic domains. Recent advances demonstrate that RL can adapt to complex, non-stationary environments such as traffic control [38], energy optimization [39], resource allocation in edge-IoT networks [40], and collaborative intelligence for IoT systems [35]. These studies highlight RL’s capacity for continuous adaptation, distributed coordination, and reward-driven optimization, key attributes for smart city applications. Recent work on multi-agent RL [33] and transformer-enhanced coordination [36] further confirms its growing relevance in achieving scalable, cooperative, and context-aware decision-making in large, interconnected environments.

2.3. Transformers for Reinforcement Learning and MARL

Transformers have recently expanded from natural language processing into RL, where their self-attention enables long-horizon credit assignment. Several surveys map this emerging landscape, covering representation learning, policy optimization, and dynamics modeling, while also noting challenges such as computational cost and data hunger [18,19]. In multi-agent reinforcement learning (MARL), attention-based designs such as the multi-agent transformer (MAT) [21] have achieved state-of-the-art performance in cooperative benchmarks. More recent work on Multi-Agent Decision Transformers (MADTs) demonstrates efficient trajectory modeling across agents, supporting transfer and adaptation from offline to online settings [22]. Extensions in 2024 propose selective or structured attention to reduce complexity and latency when scaling to large agent populations [41]. Applied studies also show the benefits of attention-based MARL in traffic control and dynamic environments [42], strengthening their suitability for smart city applications.

2.4. Federated Reinforcement Learning and Federated Transformers

Another research thread adapts reinforcement learning and transformers to federated or distributed settings. Qi et al. [32] provided one of the first surveys on federated reinforcement learning (FRL), highlighting techniques and open challenges. More recently, Li et al. introduced FedTP, a federated personalized transformer architecture that uses server-side hypernetworks to mitigate non-IID challenges [25]. Other works explore federated prompt- or split-based Decision Transformers for mobile edge computing, reducing bandwidth demands and improving personalization [43,44]. In security-sensitive contexts, Parra et al. [29] propose interpretable federated transformers for intrusion detection, aligning with privacy mandates in critical infrastructures. Robustness remains a central concern: recent surveys and reviews highlight threats such as poisoning and backdoors while proposing secure aggregation and anomaly detection as defenses [16,30,34,45]. Broader surveys on efficient FL with foundation and transformer models [28] further emphasize parameter-efficient tuning and update scheduling strategies, both relevant for resource-constrained IoT devices.
Compared with centralized-critic MAAC approaches and prior federated/multi-agent transformers, our contribution is threefold: (i) removing the centralized critic by embedding Decision Transformers per agent node, (ii) integrating transformer-specific federated personalization strategies, and (iii) evaluating scalability, variance, and communication trade-offs under MEC-like smart city constraints.

3. Methodology

We propose an FDT framework that replaces the centralized critic of MAAC with decentralized Decision Transformers trained within a federated learning loop. Each IoT agent maintains its own DT, trained locally on private trajectories, while the server aggregates model updates without ever accessing raw data. This section details the model design, the client-side procedure, and the server-side aggregation.

3.1. Model Design

Each client agent encodes sequences of $(s_t, a_t, r_t, \hat{G}_t)$ tokens, where $\hat{G}_t$ is the normalized return-to-go (RTG).
  • Embeddings. States, actions, and RTGs are projected into a common hidden dimension and concatenated with positional encodings.
  • Transformer encoder. Multi-head self-attention captures long-range dependencies, enabling agents to model temporal correlations and interaction patterns without recurrence.
  • Policy head. Outputs logits for discrete actions or mean values for continuous actions, conditioned on the contextualized sequence representation.
  • RTG normalization. A running normalizer stabilizes training in heterogeneous client environments. Normalization is performed locally at each client, while aggregated statistics across clients maintain consistency at the federated level.
In the federated multi-agent setting, the RTG acts as the implicit coordination signal across agents. Rather than depending on a centralized critic to exchange gradient information, each agent conditions its decision process on a tokenized sequence $[\hat{G}_t, s_t, a_{t-1}]$, where $\hat{G}_t$ represents the normalized cumulative future reward from step $t$. This conditioning allows agents to infer how their local actions contribute to the global objective through predicted future returns, rather than through explicit critic feedback. As all agents learn to optimize toward consistent normalized RTG expectations, coordinated behaviors emerge implicitly even without parameter sharing or direct communication of trajectories. This decentralized mechanism supports scalability and privacy preservation, while the self-attention structure ensures that long-horizon dependencies and inter-agent dynamics are effectively captured.
This design eliminates the need for a centralized critic by conditioning action prediction directly on returns-to-go.
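For illustration, the following minimal PyTorch sketch shows how a per-client model of this kind can be assembled from the components listed above. The class name, layer sizes, and the assumption that previous actions are supplied as one-hot (or continuous) vectors are illustrative simplifications, not the exact implementation used in our experiments.

```python
import torch
import torch.nn as nn

class FDTClientModel(nn.Module):
    """Minimal Decision-Transformer-style client model (illustrative sketch)."""
    def __init__(self, state_dim, action_dim, hidden_dim=128,
                 n_layers=3, n_heads=4, max_len=30):
        super().__init__()
        # Separate linear embeddings for RTG, state, and previous-action tokens.
        self.embed_rtg = nn.Linear(1, hidden_dim)
        self.embed_state = nn.Linear(state_dim, hidden_dim)
        self.embed_action = nn.Linear(action_dim, hidden_dim)
        # Learned positional encodings over the interleaved token sequence.
        self.embed_pos = nn.Embedding(3 * max_len, hidden_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Policy head: logits for discrete actions (or means for continuous ones).
        self.policy_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, rtg, states, prev_actions):
        # rtg: (B, L, 1), states: (B, L, state_dim), prev_actions: (B, L, action_dim)
        B, L, _ = states.shape
        # Interleave tokens per timestep as [RTG_t, s_t, a_{t-1}].
        tokens = torch.stack(
            [self.embed_rtg(rtg), self.embed_state(states),
             self.embed_action(prev_actions)], dim=2)            # (B, L, 3, H)
        tokens = tokens.view(B, 3 * L, -1)
        positions = torch.arange(3 * L, device=states.device)
        # Causal mask so each token only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(3 * L).to(states.device)
        h = self.encoder(tokens + self.embed_pos(positions), mask=mask)
        # Predict the next action from each state token's representation.
        state_repr = h[:, 1::3, :]                               # (B, L, H)
        return self.policy_head(state_repr)
```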

3.2. Client-Side Training

Each client interacts with its environment to collect trajectories, computes discounted RTGs, normalizes them, and stores the data in a replay buffer. Training proceeds in subsequences of length L, padded with masks to handle variable episode lengths. Teacher forcing is applied: the DT is conditioned on observed states, actions, and RTGs and predicts the next action token. The objective is as follows.
$$\mathcal{L} = \begin{cases} \mathrm{CE}\big(a_t,\; \pi_\theta(a_t \mid s_{1:t}, a_{1:t}, \hat{G}_{1:t})\big), & \text{discrete actions}, \\ \lVert a_t - \mu_\theta \rVert_2^2, & \text{continuous actions}, \end{cases}$$
with optional RTG consistency and weight regularization terms. For every $K$ local episodes, the client transmits its model delta $\Delta_c$ and sample count $n_c$ to the server.
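A compact sketch of one local training step is given below, assuming discrete actions (cross-entropy loss), integer-encoded action targets, and precomputed running RTG statistics; the helper names and batch layout are illustrative rather than the exact code used in our experiments.

```python
import torch
import torch.nn.functional as F

def discounted_rtg(rewards, gamma=0.99):
    """Return-to-go G_t = sum_{k>=t} gamma^(k-t) r_k for one episode."""
    rtg = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def local_training_step(model, optimizer, batch, rtg_mean, rtg_std):
    """One teacher-forced update on a padded batch of subsequences."""
    # states: (B, L, S); prev_actions: (B, L, A); target_actions: (B, L) long;
    # raw_rtg: (B, L); pad_mask: (B, L) with 1 for real steps, 0 for padding.
    states, prev_actions, target_actions, raw_rtg, pad_mask = batch
    # Normalize RTGs with the client's running statistics.
    rtg = (raw_rtg - rtg_mean) / (rtg_std + 1e-6)
    logits = model(rtg.unsqueeze(-1), states, prev_actions)      # (B, L, A)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_actions.reshape(-1),
        reduction="none")
    # Mask out padded timesteps before averaging.
    loss = (loss * pad_mask.reshape(-1)).sum() / pad_mask.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```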

Complexity

The per-client cost per round is
$$O\!\left(E \cdot \frac{|D_c|}{B} \cdot L^2 d H\right),$$
where $|D_c|$ is the number of trajectory tokens, $L$ is the sequence length, $d$ the hidden dimension, and $H$ the number of transformer layers. Communication is $O(|\theta|)$ parameters for every $K$ episodes.

3.3. Server-Side Aggregation

The server receives $(\Delta_c, n_c)$ from participating clients and applies sample-size-weighted aggregation:
$$\theta^{(r)} = \theta^{(r-1)} + \sum_{c \in S^{(r)}} w_c \Delta_c, \qquad w_c = \frac{n_c}{\sum_j n_j}.$$
Optional preprocessing includes decompression, unmasking for secure aggregation, adapter masking for PEFT, and norm clipping for differential privacy. Robust alternatives, such as coordinate-wise median, trimmed mean, or Krum, mitigate adversarial or corrupted updates.
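A minimal sketch of this sample-size-weighted aggregation step, with optional per-client norm clipping, is shown below; representing parameters and deltas as dictionaries keyed by layer name is an assumed convention for illustration only.

```python
import torch

def aggregate_fedavg(global_params, client_deltas, client_counts, clip_norm=None):
    """Sample-size-weighted FedAvg update: theta <- theta + sum_c w_c * delta_c."""
    total = float(sum(client_counts))
    new_params = {k: v.clone() for k, v in global_params.items()}
    for delta, n_c in zip(client_deltas, client_counts):
        w_c = n_c / total
        if clip_norm is not None:
            # Optional l2-norm clipping of each client delta (bounds sensitivity).
            norm = torch.sqrt(sum((d ** 2).sum() for d in delta.values()))
            scale = torch.clamp(clip_norm / (norm + 1e-12), max=1.0)
        else:
            scale = 1.0
        for k in new_params:
            new_params[k] += w_c * scale * delta[k]
    return new_params
```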

3.3.1. Complexity

The server cost is $O(|S^{(r)}| \cdot |\theta|)$ for FedAvg. Robust methods such as Krum require $O(|S^{(r)}|^2)$ distance computations. Communication is $O(|S^{(r)}| \cdot |\theta|)$ per round, reducible with compression or PEFT.
The proposed methodology integrates client-side Decision Transformer training with server-side federated aggregation, enabling scalable and privacy-preserving reinforcement learning across heterogeneous IoT environments. Algorithms 1 and 2 formalize the decentralized workflow, while complexity analysis highlights the trade-offs between computation ($O(L^2 d H)$ self-attention per sequence) and communication ($O(|\theta|)$ parameters per upload). To validate the effectiveness of this framework, we evaluated the FDT against the MAAC baseline in controlled MEC simulations. The following section describes the experimental setup, including environments, hyperparameters, and evaluation metrics, designed to assess convergence, scalability, communication efficiency, and robustness.
Algorithm 1 FDT Client-Side Training
Require: Clients $\mathcal{C}$; global params $\theta^{(0)}$; rounds $R$; local episodes $E$; cadence $K$; batch size $B$; sequence length $L$; learning rate $\eta$
Ensure: Updated global params $\theta^{(R)}$
1:  for $r = 1$ to $R$ do
2:      Server sends $\theta^{(r-1)}$ to clients
3:      for all $c \in S^{(r)}$ in parallel do
4:          $\theta_c \leftarrow \theta^{(r-1)}$; $n_c \leftarrow 0$
5:          for $e = 1$ to $E$ do
6:              Collect trajectory $(s_t, a_t, r_t)$
7:              Compute discounted RTG; normalize $\hat{G}_t$
8:              Store $(s_t, a_t, r_t, \hat{G}_t)$ in buffer
9:              Sample subsequences of length $L$ with padding/masks
10:             Train DT with teacher forcing; update $\theta_c$
11:             $n_c \leftarrow n_c +$ episode length
12:             if $e \bmod K = 0$ then
13:                 Send $(\Delta_c, n_c)$ to server; reset $n_c \leftarrow 0$
14:             end if
15:         end for
16:     end for
17:     Server aggregates (Algorithm 2)
18: end for
Algorithm 2 Server Aggregation with Privacy and Robustness
Require: Global params $\theta^{(r-1)}$; client updates $\{(\Delta_c, n_c)\}$
Ensure: Updated params $\theta^{(r)}$
1:  for all $(\Delta_c, n_c)$ do
2:      Optionally decompress, unmask, clip, or mask adapters
3:  end for
4:  Compute weights $w_c = n_c / \sum_j n_j$
5:  if robust aggregator enabled then
6:      $\tilde{\Delta} \leftarrow \mathrm{Robust}(\{\Delta_c\}, \{w_c\})$
7:  else
8:      $\tilde{\Delta} \leftarrow \sum_c w_c \Delta_c$
9:  end if
10: $\theta^{(r)} \leftarrow \theta^{(r-1)} + \tilde{\Delta}$
11: Broadcast $\theta^{(r)}$ to clients

3.3.2. Explanation of Algorithm 2

Algorithm 2 describes the server-side aggregation in each federated round. Upon receiving client updates $\Delta_c$ and their sample counts $n_c$, the server first applies optional preprocessing: decompression of quantized/sparse updates, unmasking for secure aggregation, masking to retain only adapter parameters in PEFT, and $\ell_2$-norm clipping to bound sensitivity. Differential privacy (DP) accounting can be maintained if Gaussian noise is added on the client side. The server then computes sample-size weights $w_c = n_c / \sum_j n_j$ to balance non-IID client contributions. If robustness is required, the server replaces FedAvg with robust aggregators such as coordinate-wise median, trimmed mean, or Krum, thus mitigating malicious or outlier updates. Finally, the global model is updated to $\theta^{(r)} = \theta^{(r-1)} + \tilde{\Delta}$, with an optional exponential moving average (EMA) to reduce variance. The updated parameters are broadcast to the clients for the next round.
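The robust aggregators mentioned above can be sketched as follows. The coordinate-wise median and trimmed-mean variants shown here are simplified illustrations (Krum, secure aggregation, and DP noise are omitted), and the default trim ratio is an assumed value rather than a tuned setting.

```python
import torch

def robust_aggregate(client_deltas, method="median", trim_ratio=0.1):
    """Coordinate-wise robust combination of client deltas (illustrative sketch)."""
    aggregated = {}
    for key in client_deltas[0]:
        stacked = torch.stack([d[key] for d in client_deltas], dim=0)   # (C, ...)
        if method == "median":
            aggregated[key] = stacked.median(dim=0).values
        elif method == "trimmed_mean":
            c = stacked.shape[0]
            # Drop the k smallest and k largest values per coordinate, then average.
            k = min(int(trim_ratio * c), (c - 1) // 2)
            sorted_vals, _ = stacked.sort(dim=0)
            aggregated[key] = sorted_vals[k:c - k].mean(dim=0)
        else:
            raise ValueError(f"unknown method: {method}")
    return aggregated
```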

3.3.3. Complexity

The server-side complexity is linear in the number of participants,
$$O\!\left(|S^{(r)}| \cdot |\theta|\right),$$
since each round requires aggregating updates from $|S^{(r)}|$ clients for a model of size $|\theta|$. Optional robust aggregation (e.g., Krum) introduces an additional $O(|S^{(r)}|^2)$ distance computation but remains practical for typical client counts. The communication cost is $O(|S^{(r)}| \cdot |\theta|)$ per round (downlink and uplink), which can be reduced by compression or PEFT.

4. Comparative Analysis with MAAC

To evaluate the effectiveness of our proposed framework, we compare it with the MAAC algorithm, a widely adopted baseline in MARL. MAAC supports cooperative learning across agents through a centralized critic, but this design introduces several limitations: (i) communication bottlenecks due to frequent gradient exchange, (ii) privacy risks arising from centralized data aggregation, and (iii) limited ability to capture long-term temporal dependencies as the critic is typically coupled with recurrent encoders.
Our proposed FDT addresses these issues by embedding self-attention-based sequence modeling within a federated learning loop. Unlike MAAC, which relies on a global critic for coordination, the FDT enables each agent to learn decentralized trajectory-based policies while keeping raw data local. This architecture improves scalability, robustness, and privacy preservation while eliminating the single point of failure associated with centralized critics.
From a coordination perspective, MAAC relies on explicit synchronization through its centralized critic, which computes joint value estimates based on global observations. This explicit mechanism requires agents to share intermediate representations or gradients, creating potential bottlenecks and privacy concerns. In contrast, the FDT achieves coordination implicitly: each agent conditions its policy on locally normalized return-to-go (RTG) trajectories that reflect cumulative future rewards. Because these normalized returns are aligned across agents through federated aggregation, coordination emerges naturally without centralized value propagation. This implicit synchronization mechanism allows the FDT to maintain coherent cooperative behavior even in non-IID and privacy-constrained environments, providing a key advantage over critic-based architectures.

Baseline Comparison Scope

Our quantitative evaluation focuses on the MAAC baseline, implemented using the authors’ publicly available codebase to ensure reproducibility with identical hyperparameter settings. Other attention-based MARL methods, such as Actor–Attention–Critic (AAC) [12], the multi-agent transformer (MAT) [21], and the Multi-Agent Decision Transformer (MADT) [22], represent important advances, but they generally assume centralized or offline training pipelines and do not directly address the federated coordination and privacy-preserving constraints targeted in this work. For this reason, we consider them at a conceptual level rather than re-implementing them; a detailed review of these methods and their limitations is provided in Section 2, and a direct empirical comparison remains an interesting avenue for future investigation.
Table 1 summarizes the key differences between MAAC and our proposed FDT on the algorithmic and system-level dimensions.
In the following sections, we detail our experimental setup, evaluation metrics (decision accuracy, convergence speed, communication overhead, and security robustness), and comparative results against MAAC baselines.

5. Experimental Setup

To provide a fair and reproducible evaluation, both MAAC and the proposed FDT were trained under identical environmental conditions. The MAAC baseline was implemented using the publicly available code from the original paper’s GitHub repository (Original MAAC implementation: https://github.com/shariqiqbal2810/MAAC, accessed on 15 October 2025), while our FDT model was adapted to operate under the same conditions, allowing a direct comparison between the two approaches.
Computational Environment. Experiments were carried out on an Apple Mac system equipped with an Apple M1 Max processor and 64 GB of unified memory, running macOS Sequoia 15.3.1. The software environment included Python 3.9, TensorFlow 2.9, and PyTorch 2.0.
Training Procedure. The MAAC model was trained using its original implementation, whereas the FDT model employed our modified Decision Transformer integrated with federated learning. Both models were trained in identical environments and for the same number of episodes, ensuring that performance differences arise solely from architectural innovations.
Hyperparameters. To eliminate confounding factors, both models use identical hyperparameter settings, as summarized in Table 2. These include learning rates, discount factors, batch sizes, and training durations.

5.1. Multi-Agent Environment

We employ a synthetic multi-agent environment inspired by distributed IoT/MEC systems, where each agent controls localized decisions and interacts with others through a shared global state. States include abstracted features such as task queues, waiting times, current decision phase, and short-horizon demand flows. Actions correspond to either maintaining the current policy or switching to an alternative configuration under safety constraints. Rewards penalize inefficiency in terms of delay, queue buildup, and communication overhead while encouraging balanced throughput across agents. For the FDT, trajectories are tokenized as $[\mathrm{RTG}_t, s_t, a_{t-1}]$ with sequence length $L \in \{20, 30\}$. The evaluation reports performance in terms of reward dynamics, convergence speed, scalability, and communication cost (bytes/round and bytes-to-convergence), averaged over five seeds.
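A simplified sketch of this tokenization step is shown below; it assumes states and actions are stored as per-timestep vectors and that subsequences are cut at fixed, non-overlapping strides, which is one of several reasonable slicing strategies rather than the exact pipeline used here.

```python
import numpy as np

def tokenize_trajectory(states, actions, rewards, seq_len=20, gamma=0.99):
    """Build [RTG_t, s_t, a_{t-1}] subsequences of fixed length with padding masks."""
    T = len(rewards)
    # Discounted return-to-go per timestep.
    rtg = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        rtg[t] = running
    # Previous action at step t (zero vector at t = 0).
    prev_actions = np.vstack([np.zeros_like(actions[:1]), actions[:-1]])
    subsequences = []
    for start in range(0, T, seq_len):
        end = min(start + seq_len, T)
        pad = seq_len - (end - start)
        mask = np.concatenate([np.ones(end - start), np.zeros(pad)])
        subsequences.append({
            "rtg": np.pad(rtg[start:end], (0, pad)),
            "states": np.pad(states[start:end], ((0, pad), (0, 0))),
            "prev_actions": np.pad(prev_actions[start:end], ((0, pad), (0, 0))),
            "targets": np.pad(actions[start:end], ((0, pad), (0, 0))),
            "mask": mask,
        })
    return subsequences
```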

5.2. Federated Setup

Agents are grouped by region into $M$ clients (default $M = 8$; also 16 and 32). Each round, clients run $K = 10$ local episodes and upload model deltas; the server aggregates with FedAvg (FedProx with $\mu = 0.001$ in an ablation). We study non-IID demand by assigning distinct load profiles per region and by varying the network scale (16/36/64 agents).
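For the FedProx ablation, the proximal term can be added to the client loss as in the following sketch, assuming the frozen global parameters are available as a dictionary keyed by parameter name; the function name is illustrative.

```python
import torch

def fedprox_regularizer(local_model, global_params, mu=0.001):
    """FedProx proximal term (mu/2) * ||theta_local - theta_global||^2."""
    prox = torch.zeros((), device=next(local_model.parameters()).device)
    for name, param in local_model.named_parameters():
        prox = prox + ((param - global_params[name].detach()) ** 2).sum()
    return 0.5 * mu * prox

# Usage (sketch): add to the client loss before backpropagation.
# loss = task_loss + fedprox_regularizer(model, global_params, mu=0.001)
```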
With this unified setup, we ensure that any observed performance differences reflect the architectural characteristics of MAAC and the FDT rather than experimental inconsistencies. The next section presents the results of this comparative evaluation and discusses their implications for scalability, privacy, and robustness in smart city environments.

6. Results and Discussion

The experiments evaluated how MAAC and the FDT process sequential decision-making tasks and adapt to dynamic multi-agent environments. Performance was assessed in terms of decision accuracy, convergence speed, scalability, communication overhead, and robustness.

6.1. Learning Behavior

Figure 1 illustrates the distribution of agent decision strategies. The FDT model promotes broader exploration, with agents spread throughout the environment, indicating higher adaptability. In contrast, MAAC exhibits denser agent clustering around predefined coordination zones, reflecting stronger cooperative stability but limited flexibility in dynamic settings.

6.2. Reward Efficiency and Stability

Figure 2 compares reward distributions. MAAC achieves stable performance, with most rewards concentrated in the 10–12 range, consistent with its critic-driven cooperative learning. The FDT surpasses MAAC in peak performance, achieving rewards above 22 but with wider variance. This highlights a trade-off: the FDT improves adaptability and long-horizon learning but introduces greater variability during early training.
The high performance variance observed in the FDT can be attributed to its trajectory-based credit assignment: long-horizon RTG conditioning introduces stochasticity in sequence sampling and gradient updates. In early training, this leads to greater fluctuations compared to the centralized critic in MAAC. However, variance gradually decreases as more trajectories are aggregated. This highlights a trade-off between adaptability and stability, suggesting that future optimizations could integrate variance-reduction strategies such as reward normalization, adaptive RTG scaling, or hybrid critic-transformer models.

6.3. Scalability

Figure 3 reports the mean reward per episode as the number of agents increases. The FDT consistently maintains higher performance across larger agent populations, demonstrating robustness to increasing system complexity. MAAC performance degrades as the number of agents increases, confirming that its centralized critic becomes a bottleneck in large-scale settings. These findings validate the scalability advantage of federated self-attention models in smart city contexts where agent participation is highly dynamic.
The consistent improvements observed across 16, 36, and 64 agents indicate that the FDT scales effectively as system complexity grows. These trends suggest that the FDT can extend naturally to even larger agent populations, and future studies could investigate networks with hundreds of agents to further substantiate scalability in more complex environments.

6.4. Comparative Summary

In general, the results reveal complementary strengths. MAAC excels in structured cooperative tasks, where stability and low variance are paramount. In contrast, the FDT provides superior scalability, adaptability, and privacy preservation, albeit with greater performance variance. This trade-off suggests that hybrid approaches that combine the stability of MAAC with the adaptability of the FDT could be a promising direction for future research.

6.5. Implications for Smart Cities

For practical deployment in smart city environments, the FDT offers significant advantages: (i) horizontal scalability supports dynamic IoT infrastructures, (ii) privacy is preserved by design through federated aggregation, and (iii) robustness is improved through decentralization, reducing single points of failure. These characteristics make the FDT particularly suitable for mobility, energy management, and anomaly detection tasks where adaptability and privacy are critical. It should be noted that the reported results stem from a synthetic multi-agent environment rather than domain-specific traffic datasets. Validating the FDT on real-world applications such as CityFlow or SUMO remains an important direction for future research.
  • Feasibility and Deployment Considerations. Practical IoT deployment introduces several operational constraints. Communication overhead is mitigated by transmitting compressed model deltas instead of raw gradients, combined with quantization and structured pruning to further reduce bandwidth requirements. Computational latency can be alleviated by partially offloading transformer inference to regional servers in a hierarchical FL architecture. Energy consumption challenges in battery-powered IoT devices are addressed through lightweight transformer variants (e.g., TinyBERT and MobileFormer) and adaptive client participation strategies. Together, these mechanisms enable scalable and energy-efficient implementations of the FDT across resource-constrained environments.
  • Scalability and Component Sensitivity. To better understand the FDT’s performance sources, future work should include ablation studies on the transformer encoder, RTG conditioning, and aggregation mechanisms. This would clarify the contribution of each component, improve interpretability, and reinforce the FDT’s scalability under diverse network and workload conditions.
  • Resilience and Cybersecurity. Beyond efficiency, the FDT inherently mitigates several key challenges in federated multi-agent learning. Client heterogeneity is handled through local RTG normalization and adaptive weighting, ensuring stable convergence across non-IID clients. Communication bottlenecks are alleviated through compressed delta updates, partial participation, and compatibility with quantization and structured pruning for bandwidth efficiency. Security and privacy vulnerabilities are mitigated using differential privacy, secure aggregation, and robust aggregation methods such as trimmed mean and Krum, which protect against poisoning and Byzantine attacks. Recent studies also emphasize blockchain-based auditability and trust management to strengthen security in smart city infrastructures [2,16,30,45,46]. Together, these strategies enhance the reliability, robustness, and trustworthiness of large-scale FDT deployments in urban IoT ecosystems.
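As one example of the compressed delta updates mentioned above, a simple uniform quantizer applied before upload could look as follows; the 8-bit setting and per-tensor scaling are illustrative assumptions rather than the configuration evaluated in this paper.

```python
import torch

def quantize_delta(delta, num_bits=8):
    """Uniform quantization of a model delta before upload (illustrative sketch)."""
    quantized = {}
    for key, tensor in delta.items():
        t_min, t_max = tensor.min(), tensor.max()
        scale = (t_max - t_min) / (2 ** num_bits - 1) + 1e-12
        q = torch.round((tensor - t_min) / scale).to(torch.uint8)
        quantized[key] = {"q": q, "scale": scale, "min": t_min}
    return quantized

def dequantize_delta(quantized):
    """Server-side reconstruction of the (approximate) delta."""
    return {k: v["q"].float() * v["scale"] + v["min"] for k, v in quantized.items()}
```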

7. Conclusions

This study compared the MAAC model with a transformer-based reinforcement learning framework integrated into a Federated Decision Transformer (FDT) for smart city applications. The experimental results showed that MAAC provides stable cooperative decision-making through centralized critics but suffers from scalability limitations in dynamic environments. In contrast, the FDT leverages self-attention and decentralized training to achieve higher adaptability, scalability, and reward efficiency, although at the cost of increased performance variance.
The key contribution of this work lies in the integration of Decision Transformers within a federated multi-agent framework, enabling decentralized coordination without a centralized critic. Unlike existing federated reinforcement learning or transformer-based MARL studies, the FDT unifies temporal sequence modeling with privacy-preserving learning, addressing scalability, non-IID data, and communication efficiency simultaneously. This dual innovation in architectural design and training paradigm advances the state of the art in scalable, privacy-aware decision intelligence for smart city IoT systems.
The findings highlight a clear trade-off: MAAC is well-suited for structured cooperative tasks requiring stability, whereas the FDT excels in decentralized, evolving environments where scalability and privacy are critical. These complementary strengths suggest that hybrid approaches that combine the stability of MAAC with the adaptability of the FDT represent a promising avenue for future research.
Future work will extend the FDT with variance-reduction strategies, modular aggregation schemes, and real-world benchmarking on traffic and energy datasets to further validate its generalizability and deployment readiness in smart city environments [47,48,49].

Author Contributions

L.A. was primarily responsible for the development of the proposed method, including the design, coding, and execution of experiments, as well as the initial drafting of the manuscript. M.A. provided supervision, critical review, and substantial revisions of the text. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
IoT     Internet of Things
RL      Reinforcement Learning
MAAC    Multi-Agent Actor–Critic
FDT     Federated Decision Transformer
MEC     Mobile Edge Computing
DT      Decision Transformer
RNN     Recurrent Neural Network
LSTM    Long Short-Term Memory
FL      Federated Learning
MARL    Multi-Agent Reinforcement Learning
MAT     Multi-Agent Transformer
MADT    Multi-Agent Decision Transformer
FRL     Federated Reinforcement Learning
FedTP   Federated Personalized Transformer

References

  1. Ilyas, M. IoT Applications in Smart Cities. In Proceedings of the 2021 International Conference on Electronic Communications, Internet of Things and Big Data (ICEIB), Yilan County, Taiwan, 10–12 December 2021; pp. 44–47. [Google Scholar] [CrossRef]
  2. Zhang, K.; Ni, J.; Yang, K.; Liang, X.; Ren, J.; Shen, X.S. Security and Privacy in Smart City Applications: Challenges and Solutions. IEEE Commun. Mag. 2017, 55, 122–129. [Google Scholar] [CrossRef]
  3. Pandya, S.; Srivastava, G.; Jhaveri, R.; Babu, M.R.; Bhattacharya, S.; Maddikunta, P.K.R.; Mastorakis, S.; Piran, M.J.; Gadekallu, T.R. Federated Learning for Smart Cities: A Comprehensive Survey. Sustain. Energy Technol. Assess. 2023, 55, 102987. [Google Scholar] [CrossRef]
  4. Al-Huthaifi, R.; Li, T.; Huang, W.; Gu, J.; Li, C. Federated Learning in Smart Cities: Privacy and Security Survey. Inf. Sci. 2023, 632, 833–857. [Google Scholar] [CrossRef]
  5. Nguyen, D.C.; Ding, M.; Pathirana, P.N.; Seneviratne, A.; Li, J.; Poor, H.V. Federated Learning for Internet of Things: A Comprehensive Survey. IEEE Commun. Surv. Tutor. 2021, 23, 1622–1658. [Google Scholar] [CrossRef]
  6. Zhang, Z.; Rath, S.; Xu, J.; Xiao, T. Federated Learning for Smart Grid: A Survey on Applications and Potential Vulnerabilities. ACM Trans. Cyber-Phys. Syst. 2025; accepted. [Google Scholar] [CrossRef]
  7. Fu, Y.; Di, X. Federated Reinforcement Learning for Adaptive Traffic Signal Control: A Case Study in New York City. In Proceedings of the 26th IEEE International Conference on Intelligent Transportation Systems (ITSC 2023), Bilbao, Spain, 24–28 September 2023; pp. 5738–5743. [Google Scholar] [CrossRef]
  8. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction. IEEE Trans. Neural Netw. 1998, 9, 1054. [Google Scholar] [CrossRef]
  9. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef]
  10. Wang, X.; Han, Y.; Wang, C.; Zhao, Q.; Chen, X.; Chen, M. In-Edge AI: Intelligentizing Mobile Edge Computing, Caching and Communication by Federated Learning. IEEE Netw. 2019, 33, 156–165. [Google Scholar] [CrossRef]
  11. Yu, S.; Chen, X.; Zhou, Z.; Gong, X.; Wu, D. When Deep Reinforcement Learning Meets Federated Learning: Intelligent Multitimescale Resource Management for Multiaccess Edge Computing in 5G Ultradense Network. IEEE Internet Things J. 2021, 8, 2238–2251. [Google Scholar] [CrossRef]
  12. Iqbal, S.; Sha, F. Actor-Attention-Critic for Multi-Agent Reinforcement Learning. arXiv 2019, arXiv:1810.02912. [Google Scholar]
  13. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative–Competitive Environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 6382–6393. Available online: https://papers.nips.cc/paper_files/paper/2017/hash/68a9750337a418a86fe06c1991a1d64c-Abstract.html (accessed on 15 September 2025).
  14. Buşoniu, L.; Babuška, R.; De Schutter, B. Multi-Agent Reinforcement Learning: An Overview. In Innovations in Multi-Agent Systems and Applications–1; Srinivasan, D., Jain, L.C., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; pp. 183–221. [Google Scholar] [CrossRef]
  15. Yang, H.; Huang, Y.; Shi, J.; Yang, Y. A Federated Framework for Edge Computing Devices with Collaborative Fairness and Adversarial Robustness. J. Grid Comput. 2023, 21, 36. [Google Scholar] [CrossRef]
  16. Feng, Y.; Guo, Y.; Hou, Y.; Wu, Y.; Lao, M.; Yu, T.; Liu, G. A survey of security threats in federated learning. Adv. Eng. Inform. 2025, 11, 165. [Google Scholar] [CrossRef]
  17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  18. Li, W.; Luo, H.; Lin, Z.; Zhang, C.; Lu, Z.; Ye, D. A Survey on Transformers in Reinforcement Learning. Trans. Mach. Learn. Res. 2023. Available online: https://openreview.net/forum?id=r30yuDPvf2 (accessed on 15 September 2025).
  19. Agarwal, P.; Abdul Rahman, A.; St-Charles, P.-L.; Prince, S.J.D.; Ebrahimi Kahou, S. Transformers in Reinforcement Learning: A Survey. arXiv 2023, arXiv:2307.05979. [Google Scholar] [CrossRef]
  20. Chen, L.; Lu, K.; Rajeswaran, A.; Lee, K.; Grover, A.; Laskin, M.; Abbeel, P.; Srinivas, A.; Mordatch, I. Decision Transformer: Reinforcement Learning via Sequence Modeling. arXiv 2021, arXiv:2106.01345. [Google Scholar] [CrossRef]
  21. Wen, M.; Grudzien Kuba, J.; Lin, R.; Zhang, W.; Wen, Y.; Wang, J.; Yang, Y. Multi-Agent Reinforcement Learning Is a Sequence Modeling Problem. arXiv 2022, arXiv:2205.14953. [Google Scholar] [CrossRef]
  22. Meng, L.; Wen, M.; Yang, Y.; Le, C.; Li, X.; Zhang, W.; Wen, Y.; Zhang, H.; Wang, J.; Xu, B. Offline Pre-Trained Multi-Agent Decision Transformer: One Big Sequence Model Tackles All SMAC Tasks. arXiv 2022, arXiv:2112.02845. [Google Scholar]
  23. Kapturowski, S.; Ostrovski, G.; Quan, J.; Munos, R.; Dabney, W. Recurrent Experience Replay in Distributed Reinforcement Learning. Available online: https://api.semanticscholar.org/CorpusID:59345798 (accessed on 15 September 2025).
  24. Park, H.; Shin, T.; Kim, S.; Lho, D.; Sim, B.; Song, J.; Kong, K.; Kim, J. Scalable Transformer Network-Based Reinforcement Learning Method for PSIJ Optimization in HBM. In Proceedings of the IEEE 31st Conference on Electrical Performance of Electronic Packaging and Systems (EPEPS 2022), San Jose, CA, USA, 9–12 October 2022; pp. 1–3. [Google Scholar] [CrossRef]
  25. Li, H.; Cai, Z.; Wang, J.; Tang, J.; Ding, W.; Lin, C.-T.; Shi, Y. FedTP: Federated Learning by Transformer Personalization. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 13426–13440. [Google Scholar] [CrossRef]
  26. Reddy, M.S.; Karnati, H.; Mohana Sundari, L. Transformer-Based Federated Learning Models for Recommendation Systems. IEEE Access 2024, 12, 109596–109607. [Google Scholar] [CrossRef]
  27. Sun, Z.; Xu, Y.; Liu, Y.; He, W.; Kong, L.; Wu, F.; Jiang, Y.; Cui, L. A Survey on Federated Recommendation Systems. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 6–20. [Google Scholar] [CrossRef] [PubMed]
  28. Woisetschlager, H.; Erben, A.; Wang, S.; Mayer, R.; Jacobsen, H.-A. A Survey on Efficient Federated Learning Methods for Foundation Model Training. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI 2024), Jeju, Republic of Korea, 3–9 August 2024; pp. 1–9. [Google Scholar] [CrossRef]
  29. De La Torre Parra, G.; Selvera, L.; Khoury, J.; Irizarry, H.; Bou-Harb, E.; Rad, P. Interpretable Federated Transformer Log Learning for Cloud Threat Forensics. In Proceedings of the Network and Distributed System Security Symposium (NDSS 2022), San Diego, CA, USA, 27 February–3 March 2022. [Google Scholar]
  30. Rane, N.; Mallick, S.; Kaya, O.; Rane, J. Federated learning for edge artificial intelligence: Enhancing security, robustness, privacy, personalization, and blockchain integration in IoT. In Future Research Opportunities for Artificial Intelligence in Industry 4.0 and 5.0; Deep Science Publishing: San Francisco, CA, USA, 2024. [Google Scholar]
  31. Carlini, N.; Wagner, D. Towards Evaluating the Robustness of Neural Networks. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, 22–26 May 2017; pp. 39–57. [Google Scholar] [CrossRef]
  32. Qi, J.; Zhou, Q.; Lei, L.; Zheng, K. Federated Reinforcement Learning: Techniques, Applications, and Open Challenges. arXiv 2021, arXiv:2108.11887. [Google Scholar] [CrossRef]
  33. Tang, X.; Yu, H. Competitive-Cooperative Multi-Agent Reinforcement Learning for Auction-Based Federated Learning. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI 2023), Macao, China, 19–25 August 2023; pp. 4262–4270. [Google Scholar]
  34. Uddin, M.P.; Xiang, Y.; Hasan, M.; Bai, J.; Zhao, Y.; Gao, L. A Systematic Literature Review of Robust Federated Learning: Issues, Solutions, and Future Research Directions. ACM Comput. Surv. 2025, 57, 245. [Google Scholar] [CrossRef]
  35. Zeng, T.; Semiari, O.; Chen, M.; Saad, W.; Bennis, M. Federated Learning for Collaborative Controller Design of Connected and Autonomous Vehicles. In Proceedings of the 60th IEEE Conference on Decision and Control (CDC 2021), Austin, TX, USA, 14–17 December 2021; pp. 5033–5038. [Google Scholar] [CrossRef]
  36. Zhao, R.; Hu, H.; Li, Y.; Fan, Y.; Gao, F.; Gao, Z. Sequence Decision Transformer for Adaptive Traffic Signal Control. Sensors 2024, 24, 6202. [Google Scholar] [CrossRef]
  37. Xing, X.; Zhou, Z.; Li, Y.; Xiao, B.; Xun, Y. Multi-UAV Adaptive Cooperative Formation Trajectory Planning Based on an Improved MATD3 Algorithm of Deep Reinforcement Learning. IEEE Trans. Veh. Technol. 2024, 73, 12484–12499. [Google Scholar] [CrossRef]
  38. Li, Z.; Xu, C.; Zhang, G. A Deep Reinforcement Learning Approach for Traffic Signal Control Optimization. arXiv 2021, arXiv:2107.06115. [Google Scholar] [CrossRef]
  39. Chen, S.; Liu, J.; Cui, Z.; Chen, Z.; Wang, H.; Xiao, W. A Deep Reinforcement Learning Approach for Microgrid Energy Transmission Dispatching. Appl. Sci. 2024, 14, 3682. [Google Scholar] [CrossRef]
  40. Liu, X.; Qin, Z.; Gao, Y. Resource Allocation for Edge Computing in IoT Networks via Reinforcement Learning. arXiv 2019, arXiv:1903.01856. [Google Scholar] [CrossRef]
  41. Daniel, J.; de Kock, R.J.; Ben Nessir, L.; Abramowitz, S.; Mahjoub, O.; Khlifi, W.; Formanek, J.C.; Pretorius, A. Multi-Agent Reinforcement Learning with Selective State-Space Models. In Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS ’25), Detroit, MI, USA, 19–23 May 2025; International Foundation for Autonomous Agents and Multiagent Systems: Richland, SC, USA, 2025; pp. 2481–2483. Available online: https://dl.acm.org/doi/10.5555/3709347.3743910 (accessed on 15 September 2025).
  42. Jiang, H.; Li, Z.; Wei, H.; Xiong, X.; Ruan, J.; Lu, J.; Mao, H.; Zhao, R. X-Light: Cross-City Traffic Signal Control Using Transformer on Transformer as Meta Multi-Agent Reinforcement Learner. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI 2024), Jeju, Republic of Korea, 3–9 August 2024; pp. 1–9. [Google Scholar] [CrossRef]
  43. Zhou, T.; Yu, J.; Zhang, J.; Tsang, D.H.K. Federated Prompt-based Decision Transformer for Resource Allocation of Customized VR Streaming in Mobile Edge Computing. IEEE Trans. Wireless Commun. 2025; early access. [Google Scholar] [CrossRef]
  44. Qiang, X.; Chang, Z.; Ye, C.; Hamalainen, T.; Min, G. Split Federated Learning Empowered Vehicular Edge Intelligence: Concept, Adaptive Design, and Future Directions. IEEE Wirel. Commun. 2025, 32, 90–97. [Google Scholar] [CrossRef]
  45. Lyu, L.; Yu, H.; Ma, X.; Chen, C.; Sun, L.; Zhao, J.; Yang, Q.; Yu, P.S. Privacy and Robustness in Federated Learning: Attacks and Defenses. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 8726–8746. [Google Scholar] [CrossRef]
  46. Dritsas, E.; Trigka, M. Machine Learning for Blockchain and IoT Systems in Smart Cities: A Survey. Future Internet 2024, 16, 324. [Google Scholar] [CrossRef]
  47. Zhang, H.; Feng, S.; Liu, C.; Ding, Y.; Zhu, Y.; Zhou, Z.; Zhang, W.; Yu, Y.; Jin, H.; Li, Z. CityFlow: A Multi-Agent Reinforcement Learning Environment for Large Scale City Traffic Scenario. In Proceedings of the World Wide Web Conference (WWW 2019), San Francisco, CA, USA, 13–17 May 2019; pp. 3620–3624. [Google Scholar] [CrossRef]
  48. CityFlow. CityFlow. Available online: https://github.com/cityflow-project/CityFlow (accessed on 15 September 2025).
  49. NYC Taxi and Limousine Commission (TLC). TLC Trip Record Data. 2024. Available online: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page (accessed on 15 September 2025).
Figure 1. Comparison of agent distribution strategies. Transformer RL promotes broader exploration while MAAC clusters agents in tighter formations.
Figure 2. Comparison of MAAC and transformer RL. (Left) Average pairwise distance among agents shows that transformer RL maintains a more distributed layout, promoting adaptive behavior. (Right) Reward distribution indicates that transformer RL achieves higher peak rewards with greater variance, while MAAC yields more stable but lower performance.
Figure 3. Scalability comparison between transformer-based RL and MAAC across varying agent counts. Transformer RL maintains high reward efficiency as the number of agents increases, while MAAC performance degrades due to centralized critic bottlenecks and coordination challenges.
Table 1. Comparison of MAAC (centralized) and FDT (federated, decentralized) across algorithmic and system-level aspects.
Dimension | MAAC (Centralized) | FDT (Federated, Decentralized)
Learning Architecture | Actor with centralized critic; coordination depends on critic access | Critic-free; decentralized trajectory modeling via self-attention
Temporal Dependencies | Captured through RNN/LSTM encoders; prone to vanishing gradients | Captured through multi-head self-attention; robust to long horizons
Training Paradigm | Centralized training with global critic gradients | Federated local training with periodic model aggregation
Scalability | Critic bottleneck limits performance in large networks | Horizontally scalable; resilient to node/server drop-outs
Privacy and Data Flow | Raw observations may be shared with critic | Raw data remains local; only encrypted model updates exchanged
Communication Overhead | Frequent critic gradient synchronization | Lightweight periodic updates (every K episodes)
Table 2. Experimental parameters for MAAC and transformer-based RL training.
Parameter | MAAC | Transformer RL
Number of Agents | 10 | 10
Training Episodes | 50,000 | 50,000
Batch Size | 64 | 64
Discount Factor (γ) | 0.99 | 0.99
Learning Rate (Actor) | 0.0005 | 0.0005
Learning Rate (Critic) | 0.0005 | 0.0005
Optimizer | Adam | Adam
Exploration Strategy | ε-greedy (decay from 1.0 to 0.1) | ε-greedy (decay from 1.0 to 0.1)
Target Update Frequency | Every 100 steps | Every 100 steps
Communication Frequency (FL) | N/A | Every 10 episodes
Replay Buffer Size | 1,000,000 | 1,000,000
Environment Type | MEC Simulation | MEC Simulation