1. Introduction
ITS combine sensing, communication, and control to improve safety, efficiency, and sustainability in road transport [1,2]. Traditionally, data generated by vehicles, roadside units (RSUs), and user devices have been uploaded to centralized traffic management centers, where ML models are trained and deployed [3,4]. While this centralized paradigm has enabled many successful applications, it raises well-known concerns regarding privacy [5] and bandwidth consumption, among others, especially when large volumes of raw data must be transferred to the cloud.
FL [6], in contrast, explicitly decouples model training from data centralization: models are updated where data are generated, and only model parameters or gradients are aggregated. This makes FL particularly well aligned with hierarchical ITS architectures, where vehicles, RSUs, and control centers already form a multi-tier structure that can naturally host local training, intermediate aggregation, and global coordination.
However, ITS presents distinctive constraints—heterogeneous devices, intermittent connectivity, real-time latency requirements, and multi-stakeholder governance—that make it unclear when an application can be federated and which FL framework is appropriate in practice. This work addresses these questions by combining taxonomies, an operational federability criterion, and architecture-level adaptations of concrete systems, as summarized below. In addition, we provide a compact quantitative synthesis of how FL-enabled ITS frameworks are empirically validated in the literature.
Building on our previously published literature review on ML and FL in ITS [7], this paper advances from a broad survey perspective toward a more architecture-oriented and adaptation-driven analysis. We begin by establishing a fundamental distinction between contributions that describe systems, with explicit architectures and data flows, and those that introduce only models, where the focus lies on algorithmic innovation rather than on deployment structure. This distinction enables a domain-aware organization of the literature—spanning road-, vehicle-, and user-centric applications—while clarifying how each type of contribution constrains or enables the transition from centralized to federated learning. In parallel, we organize existing FL proposals for ITS into three families that reflect their architectural intent and infrastructural assumptions: privacy-focused approaches that minimize changes to existing systems, integrable frameworks that can be coupled directly with deployed ITS pipelines, and advanced-infrastructure solutions that leverage hierarchical or resource-coordinated deployments. Within this structure, we articulate the operational notion of federability, defined as the degree to which an existing ITS application can be refactored into a federated-learning deployment without redesigning its sensing and decision loop. An application is considered federable when it already relies on naturally distributed data sources (e.g., vehicles, roadside units, user devices), can feasibly process the relevant features locally at these nodes, and presents an input–output organization that can be partitioned across node-level model updates and an aggregation layer.
Unlike recent surveys on FL for ITS and connected/automated vehicles—which primarily catalogue applications, datasets, and challenges—our aim is to derive a mapping methodology that rewrites concrete ITS architectures in terms of hierarchical FL frameworks. To illustrate this methodological shift, we conduct a qualitative comparison of three representative frameworks, FedGRU [8], DT + HFL [9], and TFL-CNN [10], using the classical client–server pattern as a baseline. Building on that comparison, we ultimately retain two complementary frameworks, DT + HFL and TFL-CNN, as candidates for detailed architectural adaptation.
We then apply these frameworks to four representative ITS applications—traffic prediction and management, real-time accident detection, transport mode identification, and driver profiling—demonstrating how each can be instantiated with realistic edge/cloud roles and hierarchical aggregation layers. This study is a purely qualitative, architecture-level examination and does not include new simulation or testbed experiments, nor an explicit analysis of economic and operational costs, which are left for future empirical and implementation-oriented work. To support such future efforts, we also outline an architecture-level benchmarking blueprint in Section 5.4.
Contributions. In summary, this work provides (1) a taxonomy connecting ITS systems vs. models and domain focus to their FL implications; (2) a consolidated map of FL frameworks by family (privacy-focused, integrable, advanced-infrastructure), including their base models and aggregation styles; (3) an evidence map and compact quantitative synthesis of evaluation practices across FL-enabled ITS frameworks; (4) a practical federability filter for selecting ITS systems suitable for FL adaptation; (5) an adaptation methodology—illustrated on four ITS systems—showing how DT + HFL and TFL-CNN can be instantiated with realistic edge/cloud roles and hierarchical aggregation, highlighting latency, connectivity, privacy, and edge-compute trade-offs; and (6) an architecture-level benchmarking blueprint to guide controlled architecture-to-architecture comparison in future FL-enabled ITS studies.
Paper organization. The remainder of this paper is organized as follows: Section 2 reviews the ITS, ML, and FL concepts needed in the rest of the paper. Section 3 summarizes related surveys and framework proposals for vehicular communications. Section 4 presents the systematic literature review (SLR)-based taxonomies and consolidates FL-enabled ITS frameworks, including an evidence map and a compact quantitative summary of validation practices. Section 5 compares three FL frameworks and introduces an architecture-level benchmarking blueprint. Section 6 reports the federability filter and the four adaptation case studies. Section 7 concludes and outlines future work.
4. ITS Applications and Frameworks
This section provides a synopsis of the findings from an earlier SLR [7]. The review initially compiled 283 documents, sourced from IEEE Xplore (194) and Scopus (89). After applying inclusion and exclusion criteria, along with removing duplicates and completing the final screening, 39 documents were selected for in-depth examination, as can be seen in Figure 3. The study was directed by two research questions: (i) Which ITS applications incorporate ML or FL? (ii) Which ITS frameworks utilize FL? The responses to these questions form the linkage between ITS applications and the frameworks that will be assessed in this paper.
4.1. Overview of the Literature Review Methodology
The ITS application and framework taxonomies used in this paper are grounded in our previously published SLR on ML and FL in ITS [7]. That SLR followed a standard protocol comprising database selection, query design, screening, and full-text assessment. Here, we summarize only the elements that are directly relevant for the present study.
Databases and time span. The SLR searched IEEE Xplore and Scopus for primary studies on ML- and FL-based ITS applications. Publications from 2016 onwards were considered, and only peer-reviewed journal articles and conference papers were included.
Search queries. For ML-based ITS applications, we combined ITS terms with generic machine-learning terminology. In both databases, the query followed the pattern
("Intelligent Transportation System") AND ("machine learning" OR "deep learning" OR "supervised learning" OR "unsupervised learning") AND "applications"
For FL-based ITS applications, we used the pattern
("Intelligent Transportation System") AND "federated learning" AND "applications"
Search terms were applied to titles, abstracts, and metadata in both databases.
Inclusion and exclusion criteria. The initial result set was filtered using the inclusion and exclusion criteria summarized in Table 2. In brief, studies had to (i) target road-transport ITS scenarios, (ii) implement or evaluate an ITS-relevant application, and (iii) employ a supervised or unsupervised ML/FL model. We excluded publications on maritime or air mobility, duplicate records, and works that did not propose or evaluate a concrete ITS application.
4.2. Applications Taxonomy
In our earlier research, we established a classification framework for ITS-related studies using two axes: (i) distinguishing whether the contribution is characterized as a system (involving architecture, components, and implementation) or as a model (concentrating on algorithmic or computational elements), and (ii) the functional domain being targeted, specifically road-centric, vehicle-centric, or user-centric. Within this taxonomy, we flag as federable those ITS systems whose architectures satisfy the three conditions introduced in Section 1: distributed data sources, feasibility of local feature processing at edge nodes, and node-level partitionability of the learning task. This federability tag is later used as a diagnostic filter when selecting the four applications adapted in Section 6, ensuring that the chosen case studies can realistically transition from centralized ML to federated learning without extensive architectural redesign.
Table 3 presents a compilation of representative examples aligned with the taxonomy categories, showing the breadth of applications, the corresponding ML algorithms, and their respective domains.
This taxonomy shows that road-centric applications dominate in terms of traffic management and congestion monitoring, vehicle-centric applications emphasize safety and perception, and user-centric applications focus on behavior and vulnerability. Importantly, contributions classified as systems provide architectural detail that makes them more readily adaptable to federated learning, while models often contribute algorithmic insights but lack explicit integration context. This distinction, together with the domain grouping, provides the foundation for the next step: evaluating which FL frameworks are most suitable for adapting these ITS applications. By keeping both models and systems in the taxonomy, we capture the full landscape of approaches reported in the literature. The subsequent analysis, however, concentrates primarily on systems, since they provide the architectural detail and data-flow specification needed to assess the suitability of federated learning frameworks, whereas models mainly contribute algorithmic insights without explicit deployment context.
4.3. FL Framework Selection
The selection of a federated learning framework in ITS is predominantly influenced by privacy constraints, network conditions, and the computational resources available [7]. Overall, 15 out of 39 papers from the systematic review are FL-enabled ITS frameworks, which are grouped in Table 4 according to their architectural family and deployment profile. These frameworks fall into three categories. Privacy-focused solutions (e.g., [48,49,50,51,52]) integrate mechanisms such as differential privacy, blockchain, or synthetic data generation to protect user information without degrading model utility. Integrable approaches (e.g., [8,53,54,55,56]) target specific challenges such as non-IID data, mobility-aware caching, vehicular trust, or concept drift, while remaining adaptable to broader ITS platforms. Finally, advanced-infrastructure frameworks (e.g., [9,10,57,58,59]) exploit high-performance architectures, digital twins (DT), hierarchical federated learning, and 6G contexts for large-scale, multi-level coordination. This classification ensures coverage of representative solutions that balance privacy, adaptability, and technical sophistication in FL-enabled ITS.
For each framework, Table 4 reports the deployment style (in parentheses, e.g., cross-device, cross-silo, hierarchical, decentralized, asynchronous), the base model, the aggregation mechanism, and a representative ITS application. It also provides an operational assessment in terms of Latency (T), Connectivity (C), Privacy (P), and Edge compute (E) (H/M/L), offering a consolidated view of strengths and limitations across framework families.
While Table 4 characterizes each framework in terms of architectural family, deployment profile, and intended operational fit, it does not capture how each proposal is empirically validated in the original study (e.g., whether evidence is conceptual, model-level, or system-level). To fill this gap, Table 5 provides an evidence map that summarizes (i) evaluation scope, (ii) data provenance, (iii) the main quantitative focus and baselines, and (iv) whether FL architectures are explicitly instantiated and/or compared (as opposed to only comparing models, aggregation variants, or application-specific algorithms).
Quantitative synthesis from Table 4 and Table 5 (see Figure 4 for a compact visual summary) shows the following: among the 15 FL-enabled ITS frameworks, 7/15 are reported as cross-device deployments, 4/15 as cross-silo, 3/15 as hierarchical (including an end–edge–cloud two-layer design), and 2/15 as decentralized (note that these labels are not mutually exclusive). In addition, 1/15 incorporate asynchronous update mechanisms (vs. synchronous baselines). Regarding empirical validation, 13/15 works report quantitative results, with 8/15 evaluated at the model/task level and 5/15 including system/network-level simulation metrics (e.g., delay, Packet Delivery Ratio [PDR], cache hit ratio), whereas 2/15 remain conceptual proposals without numerical results. Despite this prevalence of quantitative evaluation, no study performs a controlled architecture-to-architecture benchmark; reported comparisons mainly target model baselines, aggregation variants (FedAvg-derived schemes appear in 8/15 frameworks), or application-specific algorithms under a fixed orchestration setting.
This evidence pattern is critical for interpreting architectural claims in the FL–ITS literature: most quantitative results support model-level gains, aggregation variants, or application-specific pipelines under a fixed orchestration setting, but they do not establish controlled advantages across FL architectural families. To make this gap explicit, we outline a benchmarking blueprint that specifies the minimum controls, datasets, and system-level metrics required for architecture-level comparisons in ITS. In line with this scope, the remainder of this work emphasizes a structured adaptation analysis of representative frameworks rather than introducing new end-to-end simulation benchmarks.
Informed by the taxonomy in Table 4 and the evidence map in Table 5, we next examine three representative frameworks from the non-privacy-focused categories. This choice aligns with our focus on system-level integration and scalability rather than cryptographic privacy mechanisms. From the integrable approaches category, we include FedGRU [8], a recurrent-based federated framework optimized for traffic flow prediction using large-scale organizational datasets (e.g., Uber or Didi). FedGRU is closely aligned with the classical client–server architecture and thus acts as a conceptual baseline to contrast with more advanced hierarchical designs. From the advanced infrastructure family, two frameworks were selected. The first, DT + HFL [9], integrates hierarchical federated learning with digital twins, providing a scalable and modular structure that enables real-time simulation and monitoring of vehicular systems without the excessive complexity of other DT-based proposals. The second, TFL-CNN [10], employs a dual-layer architecture tailored for 6G vehicular environments, where RSUs perform intermediate aggregation before transmitting updates to a cloud server. As reported in the original TFL-CNN study [10], this design supports scalability and low-latency coordination in large, heterogeneous networks.
In summary, the three selected frameworks—FedGRU, DT + HFL, and TFL-CNN—cover distinct layers of abstraction within the federated learning landscape: FedGRU represents an integrable baseline, DT + HFL emphasizes hierarchical and simulation-driven coordination, and TFL-CNN illustrates next-generation vehicular edge integration. Together, they provide a balanced foundation for comparative evaluation and subsequent adaptation of ITS applications.
5. Comparative Analysis of Selected FL Frameworks
This section presents a qualitative comparative analysis of three representative FL frameworks—FedGRU [8], DT + HFL [9], and TFL-CNN [10]. These frameworks illustrate the evolution of FL solutions in ITS, spanning increasing levels of structural sophistication and deployment scope. Methodologically, the comparative analysis proceeds in two steps. First, we derive evaluation criteria (e.g., scalability, latency, privacy, integrability) from the taxonomies and empirical trends reported in Section 4.3. Second, we apply these criteria to the selected frameworks using the architectural properties and empirical evidence reported in the cited literature, as summarized in the comparative tables of this section. As a result, all statements about scalability, latency, or privacy reflect qualitative interpretations of previously published experimental results, rather than new measurements obtained in this work. Finally, to operationalize the architecture-level evidence gap identified in Section 4.3, Section 5.4 outlines a compact benchmarking blueprint for controlled architecture-to-architecture evaluation in FL-enabled ITS, provided as methodological guidance for future experimental testbed or emulation/simulation studies.
FedGRU [8] represents an integrable approach derived from the client–server paradigm, serving as a conceptual baseline that bridges traditional centralized learning and more distributed designs. Building upon this foundation, DT + HFL [9] and TFL-CNN [10] embody advanced-infrastructure frameworks that incorporate hierarchical aggregation and edge intelligence, enabling scalability and richer coordination across ITS environments. The following comparison examines their architectures, base models, aggregation mechanisms, privacy strategies, and scalability properties to identify their respective strengths and limitations.
5.1. Architecture and Infrastructure
The client–server architecture represents the most basic and widely used form of FL, serving as the reference point for more advanced frameworks. In this setup, individual devices—such as smartphones, sensors, or vehicles—train models locally on their own data and transmit only the learned parameters to a single central server. The server aggregates these parameters to update a global model, which is then redistributed to the clients for the next training round. Figure 5 illustrates this structure, where the simplicity of coordination comes at the cost of limited scalability and a potential bottleneck at the central server. Nevertheless, this approach remains effective in small or controlled environments with relatively few participating nodes.
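To make this baseline concrete, the following minimal sketch illustrates one synchronous client–server round with FedAvg-style, sample-size-weighted parameter averaging. The helper names (local_update, fedavg, run_round) and the toy data are ours for illustration; they do not reproduce any of the cited frameworks.

```python
import numpy as np

def local_update(global_weights, local_data, epochs=1, lr=0.01):
    """Placeholder for client-side training: start from the global weights
    and return locally updated weights plus the number of local samples."""
    weights = [w.copy() for w in global_weights]
    # ... gradient steps on local_data would go here ...
    return weights, len(local_data)

def fedavg(updates):
    """Sample-size-weighted average of (weights, num_samples) pairs (FedAvg)."""
    total = sum(n for _, n in updates)
    num_layers = len(updates[0][0])
    return [sum(w[i] * (n / total) for w, n in updates) for i in range(num_layers)]

def run_round(global_weights, clients):
    """One synchronous round: broadcast, local training, central aggregation."""
    updates = [local_update(global_weights, data) for data in clients]
    return fedavg(updates)

# Toy usage: a two-"layer" model and three clients with different data volumes.
global_w = [np.zeros((4, 4)), np.zeros(4)]
clients = [np.random.rand(n, 4) for n in (120, 80, 200)]
for _ in range(3):                       # three federated rounds
    global_w = run_round(global_w, clients)
```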
Building on this baseline, FedGRU [8] follows the same client–server paradigm but adapts it to an organizational context rather than to individual devices. Here, the clients are organizations such as municipal traffic agencies, private companies, or sensor stations. Each organization maintains its own infrastructure, collects traffic flow data (e.g., from cameras, radars, or mobile devices), and trains a local GRU-based model before sending updates to the central cloud. While conceptually similar to the client–server model, this framework emphasizes collaboration between institutions rather than end-user devices, allowing the integration of larger and more diverse datasets. FedGRU therefore extends the baseline by scaling its scope to organizational silos, while maintaining the simplicity of centralized aggregation.
In contrast, DT + HFL [9] introduces a hierarchical architecture that combines federated learning with DTs. As shown in Figure 6, vehicles and IoT sensors in this framework do not perform local training; instead, they act only as data collectors. The raw information is forwarded to edge cloudlets, where a digital twin is created to represent each vehicle and its environment. These cloudlets perform local training and serve as the first aggregation layer in the hierarchical process. A central cloud server then consolidates the parameters from all cloudlets to generate the global model. The framework operates in six phases (initial, functional, analytical, anomaly identification, collaborative, and decision-making), where the analytical phase is particularly important for transmitting data into the simulated DT environment. This design increases system complexity but enables powerful anomaly detection capabilities and richer context awareness that cannot be achieved with simpler client–server structures.
Finally, TFL-CNN [10] employs a two-layer architecture illustrated in Figure 7. Vehicles perform local training and send their model updates to RSUs, which act as intermediate aggregation nodes. RSUs are equipped with limited computing and caching capabilities but provide context-aware aggregation within their coverage area. The RSU outputs are then forwarded to a centralized cloud with high-performance computing resources, where final aggregation is performed. Unlike the purely hierarchical DT + HFL, this two-layer design explicitly leverages RSUs as edge nodes, making it well suited for 6G-enabled vehicular networks that demand fast object detection, low latency, and scalability across large fleets of vehicles.
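The two-tier pattern shared by DT + HFL and TFL-CNN can be summarized as nested aggregation: an edge aggregator (cloudlet or RSU) averages the updates of its attached nodes, and the cloud then averages the edge-level models. The sketch below is a minimal illustration of this idea; it assumes the local_update and fedavg helpers (and the toy global_w) from the client–server sketch above are in scope, and the grouping of clients per edge aggregator is illustrative.

```python
def hierarchical_round(global_weights, edge_groups):
    """One hierarchical round: local training per client, FedAvg at each
    edge aggregator (RSU/cloudlet), then FedAvg of edge models at the cloud."""
    edge_models = []
    for clients in edge_groups:                       # one entry per RSU/cloudlet
        updates = [local_update(global_weights, data) for data in clients]
        edge_weights = fedavg(updates)                # edge-level aggregation
        edge_samples = sum(n for _, n in updates)     # weight for cloud-level FedAvg
        edge_models.append((edge_weights, edge_samples))
    return fedavg(edge_models)                        # cloud-level aggregation

# Toy usage: two edge aggregators, each covering clients with different data volumes.
rsu_a = [np.random.rand(n, 4) for n in (50, 70)]
rsu_b = [np.random.rand(n, 4) for n in (30, 90, 60)]
global_w = hierarchical_round(global_w, [rsu_a, rsu_b])
```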
In summary, the baseline client–server model establishes the simplest FL configuration, FedGRU extends it to organizational-level collaboration, DT + HFL adds hierarchical layers with digital twins for anomaly detection, and TFL-CNN introduces RSUs for scalable edge aggregation. These architectural choices directly influence scalability, computational requirements, and the type of ITS applications each framework can support. Beyond infrastructure, the next step is to compare the frameworks in terms of their purpose, data requirements, and base algorithms, which are summarized in Table 6.
5.2. Purpose, Data, and Base Algorithms
Table 6 presents a concise overview of the selected frameworks, highlighting their core purpose, the type of data considered, and the base algorithms implemented. The table is designed to provide a quick reference, while the accompanying text elaborates on the broader implications and nuances of each case.
FedGRU [8] extends the baseline client–server model by involving organizations as federated nodes. Its main contribution lies in aggregating mobility data from agencies and companies into GRU-based models for traffic prediction. This organizational scope allows for larger and more heterogeneous datasets than individual devices could provide.
DT + HFL [9], in contrast, adopts a hierarchical structure where cloudlets equipped with digital twins handle local training before global aggregation. By combining IoT sensor streams with contextual information such as weather and regulations, this framework emphasizes anomaly detection and system resilience. Its modular design admits various base models, although CNN and Recurrent Neural Network (RNN) hybrids are highlighted.
TFL-CNN [10] situates itself in the context of 6G vehicular networks. Vehicles train local CNN models, which are aggregated at RSUs and then at the central cloud. This two-layer structure enables scalability and supports object recognition tasks such as traffic sign identification and pedestrian detection, while leveraging edge processing for latency reduction.
5.3. Aggregation, Privacy, and Limitations
Table 7 summarizes how each framework addresses model aggregation, whether it incorporates privacy mechanisms beyond the default of federated learning, and their main limitations. The table is concise by design, while the accompanying text highlights distinctive choices and constraints.
FedGRU [8] employs FedAvg for small deployments and a joint announcement protocol to reduce overhead at scale. It does not integrate additional privacy mechanisms, making it dependent on the inherent safeguards of FL. Its specialization in traffic prediction and reliance on homogeneous data limit its adaptability to broader ITS domains.
DT + HFL [9] uses hierarchical FedAvg across cloudlets and a central server, with digital twins providing an intermediate abstraction that protects raw data. This strengthens anomaly detection but introduces significant system complexity and resource requirements, constraining scalability in practice.
TFL-CNN [10] combines RSU-level aggregation with FedAvg at the cloud, improving scalability in edge-enabled vehicular networks. It assumes secure RSUs with encrypted communication, but its dependence on RSUs and sensitivity to data heterogeneity may delay convergence in less advanced infrastructures.
In summary, the three frameworks exhibit complementary trade-offs between simplicity, fidelity, and scalability. FedGRU retains the straightforward client–server organization, facilitating rapid deployment but limiting extensibility beyond traffic-flow prediction; thus, it is regarded as a reference architecture for comparison. DT + HFL enhances analytical depth through hierarchical aggregation and digital twins, offering richer contextual modeling while potentially increasing computational demand. TFL-CNN achieves efficient coordination across vehicular networks through RSU-based aggregation and is conceptually better aligned with scalability requirements in edge-enabled 6G contexts.
5.4. Architecture-Level Benchmarking Blueprint for FL-Enabled ITS
Table 4, Table 5, and Figure 4 show that, while quantitative evaluation is common in FL–ITS studies, controlled architecture-to-architecture benchmarking is largely missing. To make future architectural comparisons actionable and reproducible, we propose a compact benchmarking blueprint tailored to ITS constraints (mobility, intermittent Vehicle-to-Everything (V2X) links, heterogeneous edge hardware), using the classical client–server orchestration as the baseline.
Benchmark objective and controls. The objective is to isolate the impact of the orchestration architecture (client–server, hierarchical, or decentralized/asynchronous) from confounding factors. Thus, each experimental condition must fix (i) the learning task and dataset partitioning (including a controlled non-IID profile), (ii) the model architecture and training hyperparameters, and (iii) the aggregation rule when possible (e.g., FedAvg at each aggregation point). Only the orchestration topology (aggregation locations), update scheduling (synchronous vs. asynchronous), and compute placement across vehicle/RSU/cloud tiers should vary.
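As an illustration of these controls, the sketch below encodes an experimental condition in which only the orchestration factors and network regime vary while the task, data partitioning, model, and aggregation rule stay fixed. The field names and example values (e.g., the Dirichlet non-IID profile or the model tag) are ours, not taken from any specific study.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class BenchmarkCondition:
    # Fixed factors (identical across all conditions).
    task: str = "traffic_flow_prediction"
    dataset_partition: str = "dirichlet_alpha_0.3"   # controlled non-IID profile
    model: str = "gru_2x64"
    aggregation_rule: str = "fedavg"
    # Varied factors (the architectural treatment under study).
    orchestration: str = "client_server"             # or "hierarchical", "decentralized"
    scheduling: str = "synchronous"                  # or "asynchronous"
    compute_placement: str = "vehicle_only"          # or "vehicle_rsu", "rsu_cloud"
    network_regime: str = "good"                     # or "medium", "poor"

# Enumerate a two-factor design: architecture x network regime.
conditions = [
    BenchmarkCondition(orchestration=arch, network_regime=regime)
    for arch, regime in product(["client_server", "hierarchical"],
                                ["good", "medium", "poor"])
]
```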
Approach A (gold standard): minimal physical testbed. A compact but representative ITS testbed can use one central server, two RSU/MEC edge aggregators, and at least three clients per aggregator (9 nodes total). Hardware heterogeneity should be explicit: (i) lightweight sensing/tiny-model endpoints (ESP32-class), (ii) vehicle/On-Board Unit (OBU)-grade clients (Raspberry Pi-class SBCs), (iii) higher-compute RSU/MEC aggregators (Jetson-class), plus an x86 cloud server. This setup supports measuring end-to-end training latency, communication overhead, and device-level compute/energy under realistic V2X variability.
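On such a testbed, each role can log per-round wall-clock time and compute footprint locally, with energy approximated from external meters or compute-time proxies. The sketch below is a generic per-round monitor based on the psutil library; it is our own instrumentation suggestion, not code from any of the cited frameworks.

```python
import csv
import time

import psutil  # third-party: pip install psutil

def log_round(round_id, train_fn, logfile="round_metrics.csv"):
    """Run one local training round and append wall-clock time,
    CPU utilization, and memory footprint to a CSV log."""
    proc = psutil.Process()
    psutil.cpu_percent(interval=None)          # prime the CPU counter
    start = time.perf_counter()
    train_fn()                                 # the actual local FL update
    elapsed = time.perf_counter() - start
    cpu = psutil.cpu_percent(interval=None)    # average CPU % since priming
    mem_mb = proc.memory_info().rss / 1e6
    with open(logfile, "a", newline="") as f:
        csv.writer(f).writerow([round_id, f"{elapsed:.3f}", cpu, f"{mem_mb:.1f}"])

# Example: wrap a dummy training step for three rounds.
for r in range(3):
    log_round(r, lambda: sum(i * i for i in range(1_000_000)))
```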
Approach B: software-based benchmarking. We distinguish two complementary software-based tracks based on whether the goal is to study static, repeatable regimes or dynamic, mobility-driven variability.
B1—Controlled network emulation (static/stress-test regimes). FL roles (clients/aggregators/server) run as containers with the actual training and communication stack. Network conditions are imposed using traffic control mechanisms (delay, loss, jitter, rate limits) [60] and, when needed, explicit multi-hop topologies via Mininet [61] or container-aware variants such as Containernet [62]. This track yields architecture-sensitive measurements under repeatable conditions: time-to-target, bytes-per-round (including control signaling), and the impact of orchestration choices (e.g., intermediate aggregation placement, asynchronous updates) under predefined “typical” vs. “worst-case” V2X impairment profiles. Emulation thus supports highly reproducible what-if testing, but does not generate mobility-driven variability unless scripted.
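As an example of how such impairment profiles can be imposed in track B1, the sketch below applies Linux tc/netem settings to a container interface from Python. The profile values are illustrative placeholders rather than calibrated V2X measurements, and root privileges are required inside the target namespace.

```python
import subprocess

# Illustrative "typical" vs. "worst-case" V2X impairment profiles.
PROFILES = {
    "typical":    {"delay": "20ms 5ms",   "loss": "0.5%"},
    "worst_case": {"delay": "120ms 40ms", "loss": "5%"},
}

def apply_profile(interface, profile_name):
    """Replace the netem qdisc on `interface` with the selected profile."""
    p = PROFILES[profile_name]
    cmd = ["tc", "qdisc", "replace", "dev", interface, "root", "netem",
           "delay", *p["delay"].split(), "loss", p["loss"]]
    subprocess.run(cmd, check=True)

def clear_profile(interface):
    """Remove the impairment and restore the default qdisc."""
    subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"], check=True)

# Example (requires root): degrade the aggregator-facing link of a client container.
# apply_profile("eth0", "worst_case")
```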
B2—Mobility-aware network simulation (dynamic/large-scale regimes). Mobility is generated with SUMO [63] and coupled to packet-level network simulators (e.g., Veins over OMNeT++/INET [64] or ns-3 [65]) to capture realistic temporal variability. This enables controlled experiments with time-varying connectivity—handovers, intermittent links, client churn, and changing link quality—as vehicles move. FL traffic is injected as application flows parameterized by the learning setup (e.g., update size/compression, clients per round, update periodicity, and synchronization/asynchrony policies), supporting analysis of distributions (median/95th percentile) and averages across many mobility realizations and fleets larger than any physical testbed. Thus, simulation offers dynamic realism and scalability at the cost of relying on abstracted network/compute models rather than executing all system components natively.
B2 can be used to extract representative distributions for delay/loss/availability (e.g., typical and extreme percentiles) that inform the impairment profiles for B1, while B1 can provide measured update sizes, serialization overheads, and processing times that parameterize B2’s traffic models.
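A simple way to parameterize the injected FL traffic is to derive per-round payload sizes and offered load from the learning setup, as in the sketch below. The class name, fields, and numeric values are placeholders of ours, intended to be replaced by sizes measured in B1 or taken from the chosen model.

```python
from dataclasses import dataclass

@dataclass
class FlTrafficProfile:
    num_parameters: int            # trainable parameters of the local model
    bytes_per_param: float = 4.0   # float32; lower if quantization is used
    compression_ratio: float = 1.0
    clients_per_round: int = 10
    round_period_s: float = 30.0   # how often a synchronous round is triggered

    def uplink_bytes_per_client(self) -> float:
        return self.num_parameters * self.bytes_per_param / self.compression_ratio

    def uplink_bytes_per_round(self) -> float:
        return self.uplink_bytes_per_client() * self.clients_per_round

    def mean_uplink_rate_bps(self) -> float:
        """Average offered uplink load the simulator should inject."""
        return 8 * self.uplink_bytes_per_round() / self.round_period_s

# Example: a small CNN (~1.2 M parameters), 20 clients per round, 2x compression.
profile = FlTrafficProfile(num_parameters=1_200_000, clients_per_round=20,
                           compression_ratio=2.0, round_period_s=60.0)
print(profile.uplink_bytes_per_round(), profile.mean_uplink_rate_bps())
```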
Minimal metric set. To avoid over-parameterization, we recommend reporting a compact metric set:
Learning outcome: Task metric (e.g., Mean Absolute Error (MAE)/Root Mean Square Error (RMSE) for traffic time-series; F1/precision/recall for detection/classification).
End-to-end efficiency: Time-to-target (time to reach a fixed performance threshold) and rounds-to-target.
Communication cost: Total bytes transmitted (uplink/downlink) and bytes-per-round (including control signaling when applicable).
Compute/energy footprint: Per-role CPU/GPU utilization and memory; energy-per-round for Approach A (or compute-time proxies for Approach B).
Latency-related metrics should be reported as distributions (median and 95th percentile) due to mobility-induced variability; a minimal computation sketch for these metrics is given below.
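The sketch below shows how the efficiency and latency metrics can be computed from per-round logs. The log format (elapsed seconds and metric value per round) is an assumption of ours; the task metrics (MAE/RMSE) and percentile summaries are standard.

```python
import numpy as np

def mae_rmse(y_true, y_pred):
    """Task metrics for traffic time-series prediction."""
    err = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.abs(err)), np.sqrt(np.mean(err ** 2))

def time_and_rounds_to_target(round_log, target, higher_is_better=True):
    """round_log: list of (elapsed_seconds, metric_value) per round.
    Returns (time_to_target, rounds_to_target) or (None, None) if never reached."""
    for i, (elapsed, metric) in enumerate(round_log, start=1):
        reached = metric >= target if higher_is_better else metric <= target
        if reached:
            return elapsed, i
    return None, None

def latency_summary(latencies_s):
    """Report latency as median and 95th percentile, as recommended above."""
    arr = np.asarray(latencies_s)
    return np.median(arr), np.percentile(arr, 95)

# Example: rounds-to-target on a decreasing MAE curve and a latency distribution.
log = [(30, 9.1), (61, 7.4), (92, 6.0), (124, 5.2)]
print(time_and_rounds_to_target(log, target=6.0, higher_is_better=False))
print(latency_summary([0.8, 1.1, 0.9, 2.5, 1.0, 3.2]))
```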
Experimental design and statistical reporting. A practical design is a two-factor experiment: Architecture (baseline client–server vs. one alternative architecture at a time) × Network regime (e.g., good/medium/poor connectivity or low/high mobility). For each condition, run multiple independent seeds (≥3) and report confidence intervals and effect sizes. When assumptions hold, a two-way ANOVA can be used; otherwise, a non-parametric alternative (e.g., Kruskal–Wallis with corrected post hoc pairwise tests) is appropriate.
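The non-parametric route can be sketched with SciPy and statsmodels as below; the architecture labels, metric (time-to-target), and seed values are illustrative toy data, not results from this work.

```python
from itertools import combinations

from scipy.stats import kruskal, mannwhitneyu
from statsmodels.stats.multitest import multipletests

# Time-to-target (s) for >=3 independent seeds per architecture (toy numbers).
results = {
    "client_server": [410, 395, 430, 405],
    "hierarchical":  [350, 362, 340, 371],
    "decentralized": [500, 470, 512, 489],
}

# Omnibus test across architectures.
h_stat, p_value = kruskal(*results.values())
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={p_value:.4f}")

# Post hoc pairwise Mann-Whitney U tests with Holm correction.
pairs = list(combinations(results, 2))
raw_p = [mannwhitneyu(results[a], results[b]).pvalue for a, b in pairs]
reject, corrected_p, _, _ = multipletests(raw_p, method="holm")
for (a, b), p, rej in zip(pairs, corrected_p, reject):
    print(f"{a} vs {b}: corrected p={p:.4f}, significant={rej}")
```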
Executing either Approach A or Approach B requires a dedicated engineering and experimental campaign. Approach A involves procuring and configuring heterogeneous devices, implementing and instrumenting cross-tier orchestration, measuring energy/resource footprints, and running repeated trials across multiple regimes. Approach B requires containerizing the pipeline, calibrating impairment profiles and mobility scenarios, integrating FL update-traffic generation consistent with client participation and round timing, and performing factorial sweeps to achieve statistically robust conclusions. Ensuring reproducibility (configuration management, artifact packaging, trace release, and consistent baselines) further increases the effort. For these reasons, a controlled architecture-to-architecture benchmark would constitute a standalone experimental contribution; accordingly, Section 6 focuses on a structured adaptation analysis of representative frameworks rather than introducing new simulation-based benchmarks.
6. Results
This section evaluates how selected ITS applications, originally implemented with traditional ML, can be restructured to operate under an FL paradigm. The analysis focuses on architectural feasibility, data distribution, and the specific adaptations required to integrate these systems into decentralized frameworks. The evaluation that follows is qualitative and architecture-oriented; no new simulations or testbed measurements are performed in this work. The applications were chosen from the taxonomy developed in the systematic review, but not all were suitable for federation. To make this notion precise, we applied the federability criteria introduced in Section 1, formulated as three diagnostic questions: (i) Are the main data sources naturally distributed across vehicles, roadside units, or user devices? (ii) Is it feasible to execute the core learning task locally at those nodes, without centralizing raw data? (iii) Can the application’s learning pipeline be decomposed into node-level model updates plus an aggregation step at an edge or cloud coordinator? Only applications answering all three questions positively, i.e., those combining multi-source data collection with the potential for local processing, were retained as federable candidates. As a result, four ITS applications were selected for adaptation: traffic prediction and management [36,66], real-time accident detection [12], transport mode identification [53], and driver profiling and behavior detection [67]. Applications such as vehicle type identification by sound, driver fatigue recognition, and railway fault prediction were excluded due to their reliance on fixed or centralized data capture, which prevents distributed training without significant architectural redesign.
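The three diagnostic questions can be captured as a simple screening helper. The sketch below is our own illustrative encoding of the federability filter, not an artifact from the reviewed studies; the example instances mirror the screening outcome described above.

```python
from dataclasses import dataclass

@dataclass
class FederabilityCheck:
    distributed_sources: bool     # (i) data naturally produced at vehicles/RSUs/devices
    local_processing: bool        # (ii) core learning task feasible at those nodes
    partitionable_pipeline: bool  # (iii) decomposes into local updates + aggregation

    def is_federable(self) -> bool:
        return (self.distributed_sources
                and self.local_processing
                and self.partitionable_pipeline)

# Examples: a retained application vs. one excluded for centralized data capture.
traffic_prediction = FederabilityCheck(True, True, True)
railway_fault_prediction = FederabilityCheck(False, False, True)
print(traffic_prediction.is_federable())        # True  -> retained
print(railway_fault_prediction.is_federable())  # False -> excluded
```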
Since FedGRU [8] served as a baseline (a meta-framework extension of the client–server model), the comparative framework analysis focuses on the two architectures deemed most suitable for adaptation: DT + HFL [9] and TFL-CNN [10]. These frameworks represent complementary paradigms of hierarchical and edge-enabled federation, presenting distinct but compatible routes for embedding federated learning into ITS. In addition, they provide sufficient implementation detail for mapping ITS applications.
Each chosen ITS application is concurrently evaluated using DT + HFL and TFL-CNN within a standardized matrix-based analysis format. This format emphasizes common architectural traits, unique adaptation needs, and balances between latency, scalability, and privacy. The methodology maintains consistency across examples, demonstrating how conventional ITS systems can evolve into practical federated deployments in real-world scenarios.
6.1. Traffic Prediction and Management Applications
Two traffic prediction studies were considered for adaptation analysis: Najada [66] and Lakshna [36]. Both share common architectural features—heterogeneous sensing devices, real-time data requirements, and a need for hierarchical coordination—which make them suitable for federated settings. Table 8 summarizes their integration possibilities under the two selected frameworks, DT + HFL and TFL-CNN.
In both traffic prediction studies, the DT + HFL structure lets each vehicle or sensor cluster have a digital twin to simulate real-time traffic conditions. Cloudlets act as intermediate aggregation nodes, merging data from edge devices. The analytical phase involves executing models with algorithms like linear regression or random forest within the digital twin. A global model is formed via a Lambda server by combining parameters from all cloudlets. This setup offers scalability and privacy but increases infrastructure costs and synchronization needs between physical and digital nodes.
TFL-CNN resembles conventional vehicular structures, with RSUs acting as coordination layers linking vehicles and the cloud, creating a two-tier system. RSUs locally train models like random forests or lightweight regressions and send parameters to a central server using FedAvg. The addition of 6G edge connectivity ensures quick synchronization, enabling real-time congestion alerts. This efficient method depends on RSUs with adequate computing power and communication bandwidth.
Overall, our qualitative analysis suggests that DT + HFL is conceptually better aligned with detailed, simulation-based decision-making for dense urban networks, while TFL-CNN is conceptually better aligned with scalability and communication efficiency in 6G vehicular contexts. The choice between them depends on the deployment goal: DT + HFL for predictive precision and proactive management, or TFL-CNN for lightweight, real-time adaptation across wide vehicular grids. On the other hand, in a traditional client-server setup, ITS applications rely on a main server to handle communication and processing among nodes or vehicles. Local devices train partial models and send their parameters or data to the server, which aggregates them into a global model. With no intermediate RSU or Cloudlets, all coordination and aggregation depend on the central server. To ensure data privacy, encryption of identifiers like MAC addresses is advised before transmission. This setup simplifies management but centralizes core tasks, reducing the distributed benefits of federated applications.
6.2. Real-Time Accident Detection
The accident detection system from [12] was assessed for potential adaptation to DT + HFL and TFL-CNN frameworks. This application, originally built on a VANET-based decentralized architecture, is crucial for ITS scenarios demanding ultra-low latency and continuous vehicle data exchange. Table 9 maps the system’s components to the frameworks.
Within the DT + HFL framework, an accident detection system can be reorganized into a hierarchy where vehicles with onboard computing serve as edge nodes that update local models with sensed data like speed, position, and ID. These nodes send summaries or weights to upper-level entities such as cloudlets or a central server, allowing digital twins to simulate vehicles for anomaly detection without raw data sharing. Although local aggregation could produce inconsistent models, a semi-centralized server layer can rectify this, albeit with added deployment complexity.
The system can integrate into the TFL-CNN framework with moderate adjustments. Vehicles using VANET modules serve as local training units, while RSU-like nodes handle local aggregation and send model updates to the cloud. This setup maintains quick local inference and supports 6G-ready edge growth. Its hierarchical coordination is conceptually better aligned with low-latency requirements, though it lacks flexibility for detailed simulation or digital replication.
In summary, our qualitative comparison indicates that the DT + HFL framework provides richer modeling and privacy control through digital twins and simulation layers, making it attractive for analytical extensions albeit with additional resource overhead. Conversely, TFL-CNN emphasizes rapid communication and simpler coordination and is conceptually better aligned with low-latency requirements, albeit at the cost of reduced representational fidelity. The choice between them depends on whether the ITS deployment prioritizes accuracy and contextual insight (DT + HFL) or real-time responsiveness (TFL-CNN).
6.3. Transport Mode Identification
Below is a summary of the assessment conducted on the transport mode identification application from [45], as outlined in Table 10. This application uses mobile devices to gather and categorize travel modes using contextual and sensor data, featuring two data acquisition approaches that influence how the system could be adapted to federated learning.
In the DT + HFL framework, the system could leverage mobile devices as local nodes to collect sensor data (accelerometer, GPS, gyroscope) and transfer it to digital twin entities on edge cloudlets. These twins model individual user mobility for privacy-preserving local training. Cloudlets serve as initial aggregators before central servers update the global model. The DT layer can emulate user mobility in different modes (walking, driving, cycling) to enhance model calibration. While the original setup uses centralized training, only slight adjustments are needed for cloudlet-based processing. Synchronizing DT instances and dealing with varying sensor data rates remain key challenges.
The TFL-CNN framework facilitates transport mode detection via hierarchical aggregation. RSUs or micro-edge servers can obtain features from smartphones or wearables, conduct local training, and then transmit model parameters to a central server. The CNN component of the framework is naturally suited to learning spatio-temporal patterns from sensor data. Unlike DT + HFL, digital twins are not needed; aggregation occurs directly at the RSU with a lightweight FedAvg routine. Thus, the mobile sensing architecture can be extended cost-effectively, assuming stable network connectivity. This setup enables real-time classification with minimal added latency.
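As a concrete illustration of the CNN component that such an RSU-level pipeline could train locally, the sketch below defines a small 1D CNN over fixed-length accelerometer/gyroscope windows in PyTorch. The channel count, window length, and class set are assumptions of ours rather than the configuration used in the cited work.

```python
import torch
import torch.nn as nn

class TransportModeCNN(nn.Module):
    """Lightweight 1D CNN for classifying sensor windows into travel modes
    (e.g., walk, bike, car, bus); suitable for on-device or RSU-side training."""
    def __init__(self, in_channels=6, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):          # x: (batch, channels, window_len)
        return self.classifier(self.features(x).squeeze(-1))

# One illustrative local training step on a random mini-batch.
model = TransportModeCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
x = torch.randn(8, 6, 128)                     # 8 windows, 6 sensor channels
y = torch.randint(0, 4, (8,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```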
In our qualitative assessment, the DT + HFL framework offers richer contextual modeling through digital twins but requires higher computational resources and synchronization overhead at the edge. In contrast, TFL-CNN adopts a lighter-weight architecture with simpler hierarchical aggregation and a low-latency coordination path, which makes it conceptually better aligned with the requirements of real-time multimodal mobility detection in urban environments.
6.4. Driver Profiling and Behavior Detection
Below, we summarize the evaluation of the driver profiling and behavior detection application from [67], as shown in Table 11. This tool assesses telemetry and driver behavior to detect risk, patterns, and anomalies using on-board and mobile data.
The original system architecture collects data from OBD-II sensors, radar sensors, and a mobile app linked to a central cloud. In the DT + HFL framework, the mobile app and on-board unit (OBU) can function as local nodes, each linked to a digital twin simulating driver behavior. Cloudlets act as intermediaries, handling local model training and sending aggregated parameters to a global server. Digital twins incorporate contextual data such as traffic density, vehicle type, and environment to improve behavioral modeling. The cloudlet-layer aggregated model estimates driver risk levels, while the central server combines these to create population-level profiles. Privacy is maintained as personal data stays within local digital twin instances.
In the TFL-CNN framework, vehicles serve as edge nodes, with nearby RSUs or mobile base stations acting as aggregation points. The mobile app acts as a supplemental RSU, sending summarized behavioral parameters rather than raw sensor data. RSUs consolidate driver-level models (e.g., CNN-based classifiers for risky maneuvers) and transmit parameters to the central cloud for global aggregation via the FedAvg algorithm. This hierarchical training setup provides real-time feedback and lowers communication costs. Thus, the dual approach of CNN feature extraction at vehicles and hierarchical aggregation at RSUs supports scalable monitoring of driver behavior across fleets.
Overall, our qualitative analysis suggests that DT + HFL enables deeper semantic analysis of driver behavior through digital twins and multi-layer aggregation, which is suitable for detailed behavioral studies. In contrast, TFL-CNN prioritizes low-latency hierarchical inference, which is conceptually better aligned with the requirements of large-scale, connected vehicle networks.
In brief, the comparative examination of the chosen ITS applications highlights how incorporating federated learning is feasible across various transportation sectors, from infrastructure-level traffic management to driver profiling. DT + HFL offers enhanced contextual modeling via its hierarchical, simulation-based setup, while TFL-CNN is conceptually better aligned with low-latency and scalability requirements in vehicle networks. Taken together, these qualitative observations form a basis for assessing the balance between model accuracy and deployment efficiency.
7. Conclusions and Future Work
This study examined ITS applications that employ ML and analyzed their potential transition toward FL. Through a systematic review and a comparative framework analysis, we identified the implementation patterns, aggregation algorithms, and structural requirements that determine the feasibility of adapting traditional ITS systems to a federated paradigm. Moreover, the consolidated evidence map (Table 5) and the compact counts summary (Figure 4) indicate that FedAvg and FedAvg-derived schemes dominate reported FL-enabled ITS frameworks, whereas hierarchical and asynchronous mechanisms appear less frequently and, crucially, are not assessed through controlled architecture-to-architecture benchmarks. In particular, despite the prevalence of quantitative evaluation in the literature, we found no study that performs a controlled architecture-level comparison under fixed tasks, models, and network regimes. No claim of empirical performance superiority is made; all architectural comparisons are conceptual and grounded in previously published evidence.
While our previous work [7] provided a comprehensive literature review on ML/FL-based ITS applications, the present paper goes one step further by deriving taxonomies, task-to-framework mappings, and a component-level adaptation methodology for concrete ITS systems. In this sense, our contribution complements existing surveys by offering guidelines that can be directly used by practitioners when selecting and tailoring FL architectures for specific ITS deployments.
Our analysis confirmed that traditional ITS applications typically collect data from multiple distributed sensors or vehicles but centralize model training in a single server. Although this approach improves predictive performance by aggregating large datasets, it also introduces vulnerabilities such as bottlenecks, single points of failure, and risks to data privacy. In contrast, federated learning mitigates these issues by decentralizing training—sharing model parameters instead of raw data—and thus enhancing privacy and robustness. However, FL also faces challenges related to latency, communication overhead, and node heterogeneity, which remain open questions for large-scale deployment.
According to the distinction between models and systems, ITS applications characterized as systems—detailing architecture, data flow, and component interaction—are prime candidates for federation. Conversely, models only utilize datasets for predictions without implementation details. Applications with nodes capable of some level of local processing adapt well, whereas those with centralized architectures need significant restructuring, including local computing and distributed coordination enhancements.
The architecture-oriented adaptation analysis demonstrated that ITS applications can effectively integrate with frameworks such as DT + HFL and TFL-CNN by introducing intermediate aggregation layers and edge-computing components (e.g., cloudlets or RSUs). Both frameworks improve scalability and privacy, though they differ in complexity and implementation context: DT + HFL leverages digital twins for virtual simulation and monitoring, while TFL-CNN employs a two-layer edge–cloud hierarchy suited for vehicular and mobile networks. The integration of DTs proved particularly beneficial, enabling realistic modeling of dynamic traffic conditions without exposing sensitive information.
This work—structured as a literature-based, architecture-level analysis—synthesizes and critically examines existing research without introducing new quantitative, simulation, or testbed experiments. Consequently, the discussion on scalability, latency, and accuracy reflects previously published empirical evidence rather than measurements collected for this study. Likewise, economic and operational costs, including communication overhead, energy consumption, and infrastructure requirements, fall outside the scope of this review. These aspects are more appropriately evaluated in concrete deployments and therefore remain an important direction for future replication and implementation studies.
The most immediate direction for future work is to operationalize the architecture-level benchmarking blueprint (Section 5.4) as a standalone experimental effort. This can follow two complementary tracks: (i) a minimal physical testbed on heterogeneous edge hardware, using either an FL stack (e.g., Flower) or a lightweight custom coordinator to enable instrumentation, and (ii) mobility-aware simulation/emulation campaigns that sweep controlled network regimes and dynamic connectivity (e.g., SUMO traces coupled with Veins or ns-3). Both tracks would quantify architecture-level trade-offs under ITS variability, including latency distributions, communication overhead, and compute/energy footprint proxies.
Beyond benchmarking, an additional research avenue is to study how hierarchical orchestration interacts with privacy-enhancing mechanisms (e.g., differential privacy or trust layers) under the same controlled architecture-level protocol.
Overall, the study establishes a conceptual basis for the gradual migration of ITS applications toward federated learning, emphasizing adaptability, scalability, and privacy as key enablers for future deployments.