Article

AptEVS: Adaptive Edge-and-Vehicle Scheduling for Hierarchical Federated Learning over Vehicular Networks

1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 University of Chinese Academy of Sciences, Nanjing 211135, China
4 Nanjing Institute of InforSuperBahn, Nanjing 211100, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(2), 479; https://doi.org/10.3390/electronics15020479
Submission received: 28 December 2025 / Revised: 12 January 2026 / Accepted: 20 January 2026 / Published: 22 January 2026
(This article belongs to the Special Issue Technology of Mobile Ad Hoc Networks)

Abstract

Hierarchical federated learning (HFL) has emerged as a promising paradigm for distributed machine learning over vehicular networks. Despite recent advances in vehicle selection and resource allocation, most existing approaches still adopt a fixed Edge-and-Vehicle Scheduling (EVS) configuration that keeps the number of participating edge nodes and vehicles per node constant across training rounds. However, given the diverse training tasks and dynamic vehicular environments, our experiments confirm that such static configurations struggle to efficiently meet task-specific requirements across model accuracy, time delay, and energy consumption. To address this, we first formulate a unified, long-term training cost metric that balances these conflicting objectives. We then propose AptEVS, an adaptive scheduling framework based on deep reinforcement learning (DRL), designed to minimize this cost. The core of AptEVS is its phase-aware design, which adapts the scheduling strategy by first identifying the current training phase and then switching to specialized strategies accordingly. Extensive simulations demonstrate that AptEVS learns an effective scheduling policy online from scratch, consistently outperforming baselines and reducing the long-term training cost by up to 66.0%. Our findings demonstrate that phase-aware DRL is both feasible and highly effective for resource scheduling over complex vehicular networks.

1. Introduction

Vehicular networks are evolving into the core infrastructure for a data-driven ecosystem within intelligent transportation systems (ITS), built upon a unified communication plane connecting vehicles, distributed edge nodes (ENs), and a central cloud server (CS) [1,2]. This data-rich ecosystem lays the foundation for a new frontier of large-scale machine learning (ML) applications, such as real-time road hazard detection and traffic object recognition [3,4]. However, the training process for these powerful applications fundamentally conflicts with the traditional centralized paradigm due to insurmountable communication and privacy barriers. Federated learning (FL) has emerged as the key distributed paradigm to overcome these obstacles by enabling collaborative training directly on vehicles without exposing raw data [5].
Nonetheless, organizing FL in large-scale deployment scenarios, such as within an entire city, remains a daunting challenge that imposes significant network coordination and communication overhead. This inherent complexity drives academia and industry to pragmatically focus on optimizing services within well-defined and representative hotspots or critical road segments, as illustrated in Figure 1. It is within these dense and dynamic road segments that the Hierarchical FL (HFL) framework becomes particularly attractive, utilizing an intermediate edge layer to create a scalable and efficient training architecture tailored to vehicular networks [6].
Despite its architectural advantages, deploying HFL in such vehicular networks, referred to as VHFL, unveils a dual challenge stemming from task diversity and vehicle mobility. First, as vehicular networks evolve into multi-service platforms, they are expected to support a broad spectrum of ML training tasks, such as autonomous driving perception, cooperative hazard detection, and infotainment recommendation. These tasks vary not only in computational complexity and communication overhead [7], but more importantly, each entails a multi-objective optimization problem that involves inherent trade-offs among model accuracy, time delay, and energy consumption [8]. To address this, the system necessitates establishing a principled and configurable metric that consolidates these conflicting objectives into a quantifiable training cost, and accordingly design task-specific scheduling strategies to satisfy performance requirements.
Second, the challenge of adapting to task diversity is further compounded by the inherent environmental dynamics of vehicular networks. In real-world deployments, vehicles continuously arrive at and depart from road segments covered by densely deployed ENs [9], giving rise to a multi-layered dynamic execution environment for VHFL. These dynamics include the quasi-static diversity of network topology across ENs, the heterogeneity of locally available data on vehicles, and the stochastic nature of vehicle mobility. Such fluctuations not only undermine communication stability but also reshape the statistical properties of training data at each EN, ultimately affecting the model convergence behavior [10].
The intricate interplay between task diversity and tightly coupled environmental dynamics underscores the need for jointly optimizing the number of participating ENs and the number of scheduled vehicles per EN, namely, the configuration of Edge-and-Vehicle Scheduling (EVS). However, this edge–vehicle collaborative scheduling optimization remains largely underexplored. Most existing works focus on optimizing client selection or resource allocation under a fixed EVS configuration, without adaptively optimizing the EVS configuration itself. Although fixed scheduling policies, exemplified by static 30% client selection, have been effective in less dynamic or quasi-static Internet of Things (IoT) settings [11], we contend that the effectiveness of such static approaches is significantly limited by the profound heterogeneity and high mobility of vehicular networks.
To address these challenges, this paper proposes AptEVS, a phase-aware adaptive scheduling framework based on deep reinforcement learning (DRL) that minimizes the long-term training cost in VHFL systems. The main contributions of this work are summarized as follows:
(1)
Comprehensive Cost Metric: We formulate a holistic training cost metric for VHFL systems, integrating accuracy deviation cost (ADC), time delay cost (TDC), and energy consumption cost (ECC) to quantitatively evaluate performance against task-specific requirements.
(2)
Dynamic Configuration Analysis: Through systematic experiments, we analyze optimal EVS configurations. Our findings reveal that the optimal configuration is a dynamic variable, exhibiting high sensitivity not only to intrinsic task attributes and environmental dynamics, but crucially, to the evolution of the training phase. This establishes the necessity for a phase-aware, learning-based approach.
(3)
Phase-aware DRL Framework (AptEVS): We propose AptEVS, a novel phase-aware DRL framework for online EVS adaptation. Its core employs a lightweight mechanism to detect the current training phase and dynamically switches between two specialized algorithms: Structured Exploration for the Initial Training Phase (ITP) and Priority-Enhanced DQN Training for the Medium Training Phase (MTP).
(4)
Experimental Validation: Extensive simulations demonstrate that AptEVS consistently outperforms baseline methods, achieving significant reductions in long-term training cost. These results prove the feasibility and effectiveness of online, phase-aware scheduling in complex vehicular environments.
The remainder of the paper is structured as follows. Section 2 describes the related work. Section 3 presents the system model of VHFL along with the problem formulation. Section 4 discusses the motivation. Section 5 details the design of AptEVS. Section 6 then provides the performance evaluation. Finally, conclusions are drawn in Section 7.

2. Related Work

Resource management is a critical factor for ensuring the efficiency of FL in vehicular networks. Prior research has explored this problem from various perspectives, with most studies focusing on client selection and resource allocation in conventional single-layer FL architectures.
Early works primarily aimed to optimize a single performance metric. For instance, a variety of approaches have been proposed to improve model accuracy through vehicle selection [12,13,14], or to minimize training time delay by adjusting aggregation intervals and deadlines [15,16]. Recognizing the limitations of single-objective optimization, recent studies have shifted toward multi-objective formulations, introducing composite metrics to jointly balance time delay, energy consumption, and other system factors [17,18,19,20,21]. While these methods provide valuable insights into managing performance trade-offs, they often lack sufficient configurability to address the diverse and task-specific requirements found in vehicular scenarios.
The emergence of HFL and the challenges posed by vehicular mobility have introduced a more complex optimization landscape. Many foundational HFL studies have been conducted under relatively stable network settings, such as smartphones or stationary IoT sensors [6,11,22,23,24,25,26,27,28]. Only a few pioneering works have begun to investigate the unique challenges introduced by high-speed and persistent mobility in vehicular networks. These efforts can be broadly categorized into two directions: (i) enhancing model performance by mitigating data distribution shift [29,30,31], and (ii) improving system efficiency by analyzing the impact of mobility on time delay and convergence [10,32].
Notably, previous work [11] demonstrated that scheduling 30% of clients can significantly reduce training overhead while maintaining accuracy. Meanwhile, the work [33] proposed that scheduling fewer clients in the initial training phase and increasing participation in the final training phase can effectively save resources. However, most existing works concentrate on client-level resource optimization, overlooking a fundamental challenge in hierarchical settings: the joint optimization of the number of active ENs and the number of scheduled vehicles per EN, referred to as the EVS configuration. While prior studies have acknowledged the importance of adapting resource allocation to the training process in FL [34,35], scheduling strategies that simultaneously consider task characteristics, environmental dynamics, and the temporal evolution of training phases remain largely unexplored, particularly under multi-objective EVS optimization in dynamic vehicular networks.

3. System Model

Similar to existing works [6,20,36], we consider a VHFL system implemented within a representative road segment, as shown in Figure 2. The system consists of a cloud server (CS), a set of ENs $\mathcal{M} = \{1, 2, \ldots, M_{\max}\}$, and a set of vehicles $\mathcal{N} = \{1, 2, \ldots, N_{\max}\}$. The CS orchestrates the three-layer HFL process until the cloud model converges and achieves the target accuracy. Following hierarchical federated averaging [6], the CS aggregates edge models every $\kappa_1$ edge rounds, while each EN aggregates local models from its connected vehicles every $\kappa_2$ local epochs. Each vehicle $i \in \mathcal{N}$ trains its local model using a private dataset $\mathcal{D}_i = \{(x_{i,j} \in \mathbb{R}^s,\ y_{i,j} \in \mathbb{R})\}_{j=1}^{|\mathcal{D}_i|}$. The key notations in this paper are summarized in Table 1.

3.1. Training Procedure of VHFL

To formalize the training procedure, we define a time slice $(r, \tau)$ as the $\tau$-th edge round within the $r$-th cloud round. The proposed VHFL framework incorporates dynamic state monitoring at both the edge and vehicle layers: vehicles upload only the necessary non-sensitive state information, protected by lightweight encryption [37] or differential privacy techniques [38], while existing incentive mechanisms [39] ensure that vehicles share these state updates with low overhead. This design is practically feasible and provides the CS with the key information needed to steer the HFL process and adapt its scheduling decisions throughout training. Prior to each time slice, the CS performs EVS based on the current system state. The training process of VHFL in the $r$-th cloud round is described as follows.
(1) Information Collection and Adaptive Scheduling: Before each cloud round, the CS first collects non-sensitive state information from the edge and vehicle layers, as detailed in Section 5. Based on this state, the CS dynamically determines the scheduling configuration $(M_r, N_r)$, where $M_r$ is the number of scheduled ENs and $N_r$ is the number of scheduled vehicles per EN; the specific scheduling method is described in Section 5. Let $\mathcal{M}_r = \{1, 2, \ldots, M_r\}$ denote the set of scheduled ENs in the $r$-th cloud round. In the $\tau$-th edge round, each scheduled EN $m \in \mathcal{M}_r$ randomly selects a subset of vehicles $\mathcal{N}_m^s(r,\tau)$ from its connected set $\mathcal{N}_m^c(r,\tau)$ for local training.
(2) Hierarchical Distribution: The CS distributes the cloud model parameters $w^r$, obtained from the previous cloud round, to all scheduled ENs, where $w^0$ denotes the initialized cloud model parameters. During time slice $(r, \tau)$, each scheduled EN $m$ broadcasts its edge model $w_m^{r,\tau}$ to the available vehicles within its coverage area, where $w_m^{r,0} = w^r$.
(3) Local Training and Model Uploading: During time slice $(r, \tau)$, only the scheduled vehicles participate in local training. Specifically, each scheduled vehicle $i \in \mathcal{N}_m^s(r,\tau)$ initializes its local model with the corresponding edge model, $w_i^{r,\tau}(0) = w_m^{r,\tau}$, and then updates it at the $l$-th local iteration via mini-batch stochastic gradient descent (SGD):

$$w_i^{r,\tau}(l+1) = w_i^{r,\tau}(l) - \eta \nabla f_i\big(w_i^{r,\tau}(l),\ \mathcal{D}_{i,l}\big), \tag{1}$$

where $\eta$ is the learning rate, and $\nabla f_i(w_i^{r,\tau}(l), \mathcal{D}_{i,l})$ denotes the stochastic gradient of the loss function computed on mini-batch $\mathcal{D}_{i,l}$ sampled from the local dataset $\mathcal{D}_i$. The local loss is estimated during training as

$$F_i^{\mathrm{local}}(r,\tau) = \frac{1}{bL} \sum_{l=1}^{L} f_i\big(w_i^{r,\tau}(l),\ \mathcal{D}_{i,l}\big), \tag{2}$$

where $L = \kappa_2 \lceil |\mathcal{D}_i| / b \rceil$ is the number of local iterations, $b$ is the batch size, and the edge interval $\kappa_2$ is the number of local epochs. Once training is complete, vehicle $i$ uploads its final local model $w_i^{r,\tau} = w_i^{r,\tau}(L)$ and the corresponding local loss $F_i^{\mathrm{local}}(r,\tau)$ to its associated EN. After $\kappa_1$ edge rounds, the scheduled EN $m$ uploads its edge model $w_m^r = w_m^{r,\kappa_1}$ and edge loss to the CS.
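To make the local update rule of Eqs. (1)–(2) concrete, the following sketch runs $\kappa_2$ epochs of mini-batch SGD on a toy one-dimensional least-squares problem. The objective, parameter names, and hyperparameter values are our own illustration, not the paper's training setup.

```python
import random

def local_train(w, data, eta=0.1, kappa2=2, b=2, rng=None):
    """Mini-batch SGD as in Eq. (1); returns final model and mean local loss (Eq. (2))."""
    rng = rng or random.Random(0)
    samples = list(data)
    losses = []
    for _ in range(kappa2):                  # kappa_2 local epochs
        rng.shuffle(samples)
        for k in range(0, len(samples), b):  # mini-batches of size b
            batch = samples[k:k + b]
            # squared-error loss on the batch and its gradient w.r.t. w
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            losses.append(sum((w * x - y) ** 2 for x, y in batch) / len(batch))
            w -= eta * grad                  # gradient step with learning rate eta
    return w, sum(losses) / len(losses)

data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]  # ground truth: w = 3
w_final, mean_loss = local_train(0.0, data)
```

After two local epochs the model has moved most of the way from its initialization toward the optimum, mirroring the rapid early progress discussed in Section 4.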
(4) Hierarchical Aggregation: The scheduled ENs exchange reception confirmations of local models in the form of non-sensitive data. Once receiving all available local models, each scheduled EN computes the edge loss and updates its edge model by performing edge aggregation as follows:
$$F_m^{\mathrm{edge}}(r,\tau) = \frac{\sum_{i \in \mathcal{N}_m^s(r,\tau)} |\mathcal{D}_i|\, F_i^{\mathrm{local}}(r,\tau)}{\sum_{i \in \mathcal{N}_m^s(r,\tau)} |\mathcal{D}_i|}, \tag{3}$$

$$w_m^{r,\tau+1} = \frac{\sum_{i \in \mathcal{N}_m^s(r,\tau)} |\mathcal{D}_i|\, w_i^{r,\tau}}{\sum_{i \in \mathcal{N}_m^s(r,\tau)} |\mathcal{D}_i|}. \tag{4}$$
Upon receiving all edge model parameters after κ 1 edge rounds, the CS computes the cloud loss and updates the cloud model by performing cloud aggregation as follows:
$$F^{\mathrm{cloud}}(r) = \frac{\sum_{m=1}^{M_r} \sum_{\tau=1}^{\kappa_1} \sum_{i \in \mathcal{N}_m^s(r,\tau)} |\mathcal{D}_i|\, F_m^{\mathrm{edge}}(r,\tau)}{\sum_{m=1}^{M_r} \sum_{\tau=1}^{\kappa_1} \sum_{i \in \mathcal{N}_m^s(r,\tau)} |\mathcal{D}_i|}, \tag{5}$$

$$w^{r+1} = \frac{\sum_{m=1}^{M_r} \sum_{\tau=1}^{\kappa_1} \sum_{i \in \mathcal{N}_m^s(r,\tau)} |\mathcal{D}_i|\, w_m^{r}}{\sum_{m=1}^{M_r} \sum_{\tau=1}^{\kappa_1} \sum_{i \in \mathcal{N}_m^s(r,\tau)} |\mathcal{D}_i|}. \tag{6}$$
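The dataset-size-weighted averaging in Eqs. (3)–(6) can be sketched as follows; models are plain parameter lists and all numbers are illustrative.

```python
def weighted_average(models, sizes):
    """Dataset-size-weighted average of parameter vectors, as in Eqs. (4) and (6)."""
    total = float(sum(sizes))
    avg = [0.0] * len(models[0])
    for model, size in zip(models, sizes):
        weight = size / total
        avg = [a + weight * p for a, p in zip(avg, model)]
    return avg

# Edge aggregation: an EN averages the local models of its scheduled vehicles,
# weighted by each vehicle's dataset size |D_i|.
edge_model = weighted_average([[1.0, 2.0], [3.0, 4.0]], [100, 300])

# Cloud aggregation: the CS averages edge models, weighted by the total
# number of samples that contributed to each edge model.
cloud_model = weighted_average([edge_model, [0.0, 0.0]], [400, 400])
```

The same routine serves both layers; only the weights change, which is what makes the hierarchical scheme a direct extension of flat federated averaging.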

3.2. Mobility Model

We assume that all ENs managed by the CS are uniformly distributed along a road segment [40], with the distance between adjacent ENs defined as the edge spacing $d_{\mathrm{edge}}$. The set of road segments covered by all ENs is referred to as the candidate training section. As shown in Figure 2, the road segments covered by scheduled ENs form the active training section, where only the vehicles passing through during the current time slice are eligible to participate in the HFL process. We further assume that vehicle arrivals at the entrance of the candidate training section follow a Poisson process [20,41,42] with arrival rate $\lambda$ (in vehicles per second, veh/s).
Considering that the contribution of vehicle data to the cloud model decreases over time, the candidate training section is generally short. Speed variations across this section are therefore negligible, and each vehicle is assumed to maintain a constant speed during the training period. Specifically, each vehicle generates its speed $v_i$ (in meters per second, m/s) randomly and independently from a truncated Gaussian distribution [13,20,43]. Let $\bar{v}$ and $\sigma_v$ denote the mean and standard deviation of vehicle speeds within the coverage area of the CS at a given time; the pair $(\bar{v}, \sigma_v)$ thus characterizes the speed profile of these vehicles. The probability density function of $v_i$ is given in [20] as

$$f(v_i) = \begin{cases} \dfrac{2\, e^{-\frac{(v_i - \bar{v})^2}{2\sigma_v^2}}}{\sigma_v \sqrt{2\pi}\left[\operatorname{erf}\!\left(\frac{v_{\max} - \bar{v}}{\sigma_v \sqrt{2}}\right) - \operatorname{erf}\!\left(\frac{v_{\min} - \bar{v}}{\sigma_v \sqrt{2}}\right)\right]}, & v_{\min} \le v_i \le v_{\max}, \\ 0, & \text{otherwise}, \end{cases} \tag{7}$$

where $\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, dt$ is the Gaussian error function, and the minimum and maximum speeds are set to $v_{\min} = \bar{v} - 2\sigma_v$ and $v_{\max} = \bar{v} + 2\sigma_v$, respectively.
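A minimal sampler for this mobility model can be sketched with rejection sampling for the truncated Gaussian of Eq. (7) and exponential inter-arrival gaps for the Poisson arrivals; function names and parameter values are our own illustration.

```python
import random

def sample_speed(v_bar, sigma_v, rng):
    """One speed from the Gaussian of Eq. (7), truncated to
    [v_bar - 2*sigma_v, v_bar + 2*sigma_v], via rejection sampling."""
    v_min, v_max = v_bar - 2.0 * sigma_v, v_bar + 2.0 * sigma_v
    while True:
        v = rng.gauss(v_bar, sigma_v)
        if v_min <= v <= v_max:
            return v

def sample_arrival_times(lam, horizon_s, rng):
    """Poisson arrivals at rate lam (veh/s) over [0, horizon_s],
    generated from exponential inter-arrival gaps."""
    t, times = 0.0, []
    while True:
        t += rng.expovariate(lam)
        if t > horizon_s:
            return times
        times.append(t)

rng = random.Random(42)
speeds = [sample_speed(20.0, 3.0, rng) for _ in range(1000)]  # v_bar=20 m/s
arrivals = sample_arrival_times(0.5, 600.0, rng)              # lambda=0.5 veh/s
```

Rejection sampling is adequate here because the ±2σ truncation keeps the acceptance probability above 95%, so the loop rarely iterates more than once.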

3.3. Performance Metrics

In VHFL, scheduling optimization entails navigating trade-offs among multiple performance objectives, namely model accuracy, training delay, and energy consumption. To address this multi-objective optimization in a unified manner, we formulate a composite performance metric that quantifies the long-term training cost. This metric integrates the three core dimensions using a weighted formulation, enabling it to flexibly represent and enforce task-specific performance preferences. We next formally define the individual cost terms of this metric.
(1) Accuracy deviation cost (ADC): To enhance optimization sensitivity near the target threshold, we follow the exponential penalty design proposed in [44,45] and define the ADC in cloud round r as
$$C_A(r) = \max\left\{0,\ 1 - \Xi^{(A_r - A_{\mathrm{target}})}\right\}, \tag{8}$$
where $\Xi > 1$ is a scaling factor that controls the penalty sensitivity. This function has two desirable properties: (i) when $A_r \ge A_{\mathrm{target}}$, the ADC is zero due to the max operator, thereby avoiding unnecessary optimization effort beyond the required accuracy; (ii) when $A_r < A_{\mathrm{target}}$, the ADC increases exponentially with the accuracy shortfall, reinforcing the system's responsiveness to underperformance.
The ADC can be interpreted as a normalized penalty score within the range [ 0 , 1 ] , quantifying the degree to which the current accuracy violates the target. A value of 0 indicates the accuracy target is satisfied, while values closer to 1 reflect more severe deviation. By adjusting Ξ , the system can flexibly control the tolerance to accuracy deviation. For example, setting Ξ = 1000 enforces stringent adherence to the accuracy requirement, which is well-suited for safety-critical tasks.
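Eq. (8) transcribes directly into a one-line penalty function; the sample accuracies below are illustrative.

```python
def accuracy_deviation_cost(acc, acc_target, xi=1000.0):
    """ADC of Eq. (8): zero at or above the target, an exponentially
    growing penalty (bounded by 1) below it. xi > 1 sets the sensitivity."""
    return max(0.0, 1.0 - xi ** (acc - acc_target))
```

For instance, with $\Xi = 1000$, an accuracy 0.05 below the target already incurs a penalty of roughly 0.29, while any accuracy at or above the target costs exactly zero.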
(2) Time delay cost (TDC): For each scheduled vehicle $i$ at time slice $(r, \tau)$, let $f_i(r,\tau)$ denote its allocated computational frequency, and $C_i$ the required CPU cycles per data sample. The local computation delay is expressed as [15,41,46]:

$$T_i^{\mathrm{cmp}}(r,\tau) = \frac{\kappa_2 C_i |\mathcal{D}_i|}{f_i(r,\tau)}. \tag{9}$$
Each vehicle uploads its model via Orthogonal Frequency Division Multiple Access (OFDMA) channels [46]. Let $b_i(r,\tau)$ be the number of allocated resource blocks (RBs), and $B$ the bandwidth per RB, so that the uplink bandwidth is $B_i(r,\tau) = b_i(r,\tau) B$. Given the allocated transmit power $P_i(r,\tau)$, average channel power gain $h_i(r,\tau)$, model size $Z$, and noise power spectral density $N_0$, the communication delay is expressed as [41,47]

$$T_i^{\mathrm{comm}}(r,\tau) = \frac{Z}{B_i(r,\tau) \log_2\!\left(1 + \frac{h_i(r,\tau)\, P_i(r,\tau)}{N_0 B_i(r,\tau)}\right)}. \tag{10}$$
Due to vehicle heterogeneity, the cloud round delay is determined by the slowest participant [41]:

$$T^{\mathrm{vel}}(r) = \sum_{\tau=1}^{\kappa_1} T^{\mathrm{vel}}(r,\tau) = \sum_{\tau=1}^{\kappa_1} \max_{m \in \mathcal{M}_r,\ i \in \mathcal{N}_m^s(r,\tau)} \left( T_i^{\mathrm{cmp}}(r,\tau) + T_i^{\mathrm{comm}}(r,\tau) \right). \tag{11}$$
After $\kappa_1$ edge rounds, each EN uploads its aggregated model to the CS at a transmission rate $R_e$. Following the assumptions in prior works [48], we consider these wired links to be stable and high-capacity. Since the ENs upload edge models in parallel, the wired time delay is $T^{\mathrm{edge}}(r) = Z / R_e$. Thus, the TDC in cloud round $r$ is

$$C_T(r) = T^{\mathrm{vel}}(r) + T^{\mathrm{edge}}(r). \tag{12}$$
It is noted that downlink time delay is negligible compared to uplink time delay [49].
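The two delay terms of Eqs. (9)–(10) can be evaluated numerically as below; all parameter values are illustrative stand-ins, not the paper's simulation settings.

```python
import math

def computation_delay(kappa2, C_i, D_i, f_i):
    """Local computation delay of Eq. (9): kappa_2 * C_i * |D_i| / f_i."""
    return kappa2 * C_i * D_i / f_i

def communication_delay(Z, b_i, B, h_i, P_i, N0):
    """Uplink delay of Eq. (10) over b_i OFDMA resource blocks of bandwidth B."""
    B_i = b_i * B                                          # allocated bandwidth (Hz)
    rate = B_i * math.log2(1.0 + h_i * P_i / (N0 * B_i))   # achievable rate (bit/s)
    return Z / rate

# Example: 2 local epochs, 1e6 cycles/sample, 500 samples, 2 GHz CPU.
t_cmp = computation_delay(kappa2=2, C_i=1e6, D_i=500, f_i=2e9)
# Example: 8 Mbit model over 2 RBs of 180 kHz each.
t_comm = communication_delay(Z=8e6, b_i=2, B=180e3, h_i=1e-7, P_i=0.2, N0=1e-13)
```

Note that increasing the number of scheduled vehicles shrinks $b_i(r,\tau)$ per vehicle under a fixed RB budget, which is exactly the congestion effect analyzed in Section 4.3.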
(3) Energy consumption cost (ECC): For each scheduled vehicle $i$ at time slice $(r, \tau)$, let $c_i$ denote the effective capacitance factor of its computing chipset. The local energy consumption consists of two parts, the computation energy $E_i^{\mathrm{cmp}}$ and the communication energy $E_i^{\mathrm{comm}}$, given by

$$E_i^{\mathrm{cmp}}(r,\tau) = c_i \kappa_2 C_i |\mathcal{D}_i| f_i^2(r,\tau), \tag{13}$$

$$E_i^{\mathrm{comm}}(r,\tau) = P_i(r,\tau)\, T_i^{\mathrm{comm}}(r,\tau). \tag{14}$$
The total energy consumed by all scheduled vehicles in cloud round $r$ is

$$E^{\mathrm{vel}}(r) = \sum_{\tau=1}^{\kappa_1} \sum_{m \in \mathcal{M}_r} \sum_{i \in \mathcal{N}_m^s(r,\tau)} \left( E_i^{\mathrm{cmp}}(r,\tau) + E_i^{\mathrm{comm}}(r,\tau) \right). \tag{15}$$
Let $P_e$ denote the EN's transmit power. The wired transmission energy for one EN is $E^{\mathrm{edge}}(r) = P_e Z / R_e$. Thus, the ECC in cloud round $r$ is

$$C_E(r) = E^{\mathrm{vel}}(r) + M_r E^{\mathrm{edge}}(r). \tag{16}$$
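The vehicle-side energy terms of Eqs. (13)–(14) follow the standard CMOS model, where computation energy scales with the square of the CPU frequency. A numerical sketch with illustrative values:

```python
def computation_energy(c_i, kappa2, C_i, D_i, f_i):
    """Eq. (13): effective capacitance times total cycles times f_i^2."""
    return c_i * kappa2 * C_i * D_i * f_i ** 2

def communication_energy(P_i, t_comm):
    """Eq. (14): transmit power times uplink transmission time."""
    return P_i * t_comm

# Illustrative values: c_i = 1e-28, 2 epochs, 1e6 cycles/sample, 500 samples, 2 GHz.
e_cmp = computation_energy(c_i=1e-28, kappa2=2, C_i=1e6, D_i=500, f_i=2e9)
e_comm = communication_energy(P_i=0.2, t_comm=3.0)
```

The quadratic dependence on $f_i$ captures the core delay–energy tension: raising the CPU frequency cuts the computation delay of Eq. (9) linearly but inflates energy quadratically.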
Training cost: As discussed at the beginning of this section, scheduling optimization in VHFL must navigate trade-offs among model accuracy, training delay, and energy consumption. To unify optimization objectives, we define a long-term training cost that integrates the three dimensions using a weighted formulation. Similar weighted-sum approaches have been widely adopted in FL to balance performance metrics [11,18,44]. Let α 1 , α 2 , and  α 3 denote the weight coefficients for ADC, TDC, and ECC, respectively, with  α 1 + α 2 + α 3 = 1 . These weights capture the task-specific performance preferences: (i) safety-critical tasks emphasize high-reliability accuracy, thus assigning a larger α 1 ; (ii) time-sensitive tasks prioritize fast responses, leading to a higher α 2 ; (iii) non-critical background tasks focus on energy efficiency, hence favoring a larger α 3 . To ensure numerical comparability across the cost components, we also introduce scaling factors β 1 , β 2 , and  β 3 to normalize the magnitudes of ADC, TDC, and ECC. Accordingly, the training cost in cloud round r is defined as
$$C(r) = \alpha_1 \beta_1 C_A(r) + \alpha_2 \beta_2 C_T(r) + \alpha_3 \beta_3 C_E(r). \tag{17}$$
In practice, for a given task, the weights α i are fixed according to its characteristics, system policies, or service-level agreements. In contrast, the scaling factors β i are dynamically computed at the start of training to normalize the cost components based on their absolute magnitudes, as detailed in Section 5.
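Eq. (17) is a convex combination of the three normalized cost terms. The sketch below assumes the $\beta_i$ have already been precomputed to bring the terms onto a comparable scale; weights and raw costs are made up for illustration.

```python
def training_cost(c_a, c_t, c_e, alphas, betas):
    """Composite training cost of Eq. (17)."""
    a1, a2, a3 = alphas
    assert abs(a1 + a2 + a3 - 1.0) < 1e-9, "alpha weights must sum to 1"
    b1, b2, b3 = betas
    return a1 * b1 * c_a + a2 * b2 * c_t + a3 * b3 * c_e

# Safety-critical task: accuracy dominates, so alpha_1 is largest.
# beta_2 and beta_3 normalize TDC (seconds) and ECC (joules) toward O(1).
cost = training_cost(c_a=0.3, c_t=120.0, c_e=45.0,
                     alphas=(0.6, 0.2, 0.2),
                     betas=(1.0, 1.0 / 120.0, 1.0 / 45.0))
```

Shifting weight between the $\alpha_i$ is how the same framework serves safety-critical, time-sensitive, and background tasks without changing the cost definition itself.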

3.4. Problem Formulation

For a VHFL system, we aim to reduce the long-term training cost (LTTC) as much as possible while meeting the target accuracy of the ML task. Specifically, at the beginning of each cloud round, the CS adaptively determines the EVS configuration $(M_r, N_r)$. The problem formulation is as follows:
$$\mathbf{P1}: \quad \min_{\{(M_r, N_r)\}} \sum_{r=1}^{R} C(r) \tag{18}$$

$$\text{s.t.} \quad \alpha_1 + \alpha_2 + \alpha_3 = 1, \tag{19}$$

$$0 \le \alpha_1, \alpha_2, \alpha_3 \le 1, \tag{20}$$

$$\beta_1, \beta_2, \beta_3 > 0, \tag{21}$$

$$M_r \le M_{\max},\quad M_r \in \mathbb{Z}^{+}, \tag{22}$$

$$N_r \le N_m^{\max},\ \forall m \in \mathcal{M}_r,\quad N_r \in \mathbb{Z}^{+}, \tag{23}$$

$$\left| F^{\mathrm{cloud}}(r-1) - F^{\mathrm{cloud}}(r) \right| < \delta,\quad \forall r \in \{R-4, \ldots, R\}, \tag{24}$$

$$A_R \ge A_{\mathrm{target}}. \tag{25}$$
The objective of the ML task is to minimize the long-term training cost over $R$ cloud rounds, as formulated in Equation (18), where $R$ denotes the number of cloud rounds required for the cloud model to reach the target accuracy on the test dataset. The training cost $C(r)$ incurred in each cloud round $r$ is defined in Equation (17). Constraints (19)–(21) specify the bounds on the weighting factors $\alpha_i$ and scaling factors $\beta_i$, which balance and normalize the performance metrics. Constraint (22) restricts the number of scheduled ENs $M_r$ in cloud round $r$ to be no greater than the total number of available ENs, $M_{\max}$. Similarly, constraint (23) limits the number of vehicles scheduled per EN, $N_r$, to be no more than $N_m^{\max} = b_m$, where $b_m$ is the number of available RBs allocated to the ML task at each scheduled EN $m \in \mathcal{M}_r$. Both $M_r$ and $N_r$ are integer variables, reflecting discrete scheduling decisions. Constraints (24) and (25) ensure that the cloud model converges and meets the specified target accuracy $A_{\mathrm{target}}$, where $\delta$ is the convergence threshold and $A_R$ denotes the model accuracy in the $R$-th cloud round.
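The discrete scheduling constraints (22)–(23) and the stopping conditions (24)–(25) can be encoded as a small feasibility checker; the window of 5 rounds mirrors the $\{R-4, \ldots, R\}$ range in constraint (24), while `delta` and all sample values are illustrative.

```python
def is_valid_evs(M_r, N_r, M_max, N_m_max):
    """Constraints (22)-(23): integer bounds on the EVS configuration."""
    return 1 <= M_r <= M_max and 1 <= N_r <= N_m_max

def meets_stopping_rule(cloud_losses, acc_R, acc_target, delta=1e-3, window=5):
    """Constraint (24): the last `window` per-round loss changes fall below
    delta; constraint (25): final accuracy reaches the target."""
    if len(cloud_losses) < window + 1:
        return False
    tail = cloud_losses[-(window + 1):]
    plateau = all(abs(tail[k] - tail[k + 1]) < delta for k in range(window))
    return plateau and acc_R >= acc_target
```

In a scheduler loop, `is_valid_evs` would prune the DRL action space before each cloud round, and `meets_stopping_rule` would decide when $R$ has been reached.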

4. Motivation

This section shows that the optimal EVS configuration is a dynamic variable, highly sensitive to multiple hierarchical factors. We begin by analyzing the intrinsic training dynamics of HFL, specifically how static task attributes and training phase evolution shape a baseline strategy. We then examine how this baseline is further influenced by external vehicular environments. This multi-level analysis provides key insights into system optimization and underpins the design of our environment-aware adaptive scheduling framework.

4.1. Experimental Setup

This subsection outlines the simulation setup, including the general VHFL system and the controlled experimental design for evaluating EVS performance under various tasks and environmental conditions.

4.1.1. VHFL System Simulation Setup

We simulate the VHFL system described in Section 3. To emulate realistic vehicular communication conditions, the simulation incorporates a composite channel model that dynamically determines the channel power gain h i ( r , τ ) . This model integrates three standard physical effects: (i) a 3GPP-compliant Non-Line-of-Sight (NLOS) path loss model to account for signal obstruction in urban and highway environments [50]; (ii) a standard log-normal shadow fading component for large-obstacle attenuation [50]; and (iii) a Rayleigh fading model to capture multipath fluctuations consistent with the NLOS assumption [51]. The general parameters are listed in Table 2.

4.1.2. Controlled Simulation Settings

We evaluate the system on three representative learning tasks of varying complexity: MNIST [52], FMNIST [53], and CIFAR-10 [54]. The specific CNN models, their computational complexities, communication load, and target accuracies are detailed in Table 3. To emulate data heterogeneity, we employ two standard non-IID data partitioning schemes [55] to control for quantity-based imbalance (parameterized by ϕ ) and distribution-based imbalance (parameterized by ψ ) across vehicles. Default data distribution parameters are listed in Table 2.
This observational study aims to systematically characterize the performance landscape. To this end, for each scenario, we simulate a range of fixed EVS configurations ( M , N ) for each cloud round, varying the number of scheduled ENs M and vehicles per EN N within the set { 2 , 4 , 6 , 8 } . This allows us to identify the ground-truth optimal EVS for each condition by comparing the long-term training cost. To isolate the impact of EVS configuration and ensure both reproducibility and fair comparison, we adopt random vehicle selection and uniform bandwidth allocation as controlled baselines, and fix the random seed across all experiments to eliminate stochastic variations such as model initialization and data shuffling.
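The grid sweep described above can be sketched as follows: every fixed $(M, N)$ pair from $\{2,4,6,8\}^2$ is evaluated and the lowest long-term training cost identifies the ground-truth optimal EVS. The cost function here is a toy stand-in, not a full VHFL simulation run.

```python
import itertools

def sweep_evs(cost_fn, grid=(2, 4, 6, 8)):
    """Evaluate every fixed (M, N) configuration and return the best one."""
    results = {(m, n): cost_fn(m, n) for m, n in itertools.product(grid, grid)}
    best = min(results, key=results.get)   # configuration with lowest cost
    return best, results

# Toy stand-in cost: bowl-shaped with a unique optimum at (4, 6),
# loosely mimicking the under-/over-provisioning penalties observed.
best, results = sweep_evs(lambda m, n: (m - 4) ** 2 + (n - 6) ** 2)
```

In the actual study each `cost_fn(m, n)` call corresponds to a complete training run under the fixed configuration, so the 16-point sweep is repeated per task and per environmental setting.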

4.2. Impact of Intrinsic Task Attributes

(1) Impact of Static Task Characteristics: The static task characteristics, specifically the computational complexity quantified by the required cycles per data sample $C_i$ and the communication load represented by the model size $Z$, define the baseline optimization challenge. Our experiments first confirm that the optimal EVS is highly task-dependent. As shown in Figure 3, tasks with higher complexity (e.g., CIFAR-10) benefit more significantly from an increased EVS size, whereas simpler tasks (e.g., MNIST) reach performance saturation earlier. Furthermore, the results in Figure 4 reveal that the EVS configuration minimizing the long-term training cost reflects a complex trade-off and does not necessarily align with the configuration that optimizes any single performance metric in isolation. These findings highlight that a scheduler must be sensitive to the static task characteristics.
(2) Impact of Training Phase Evolution: As shown in Figure 3, the training process exhibits time-varying dynamics. We characterize this training phase evolution by analyzing the first- and second-order differences of the learning curve, namely the learning speed $\Delta A_r = A_r - A_{r-1}$ and the learning acceleration $\Delta^2 A_r = \Delta A_r - \Delta A_{r-1}$. Here,
  • Initial training phase (ITP): Accuracy increases rapidly as the model is far from optimal and gradients are large. This yields significant performance gains and a concave learning curve: $A_r \ll A_{\mathrm{target}}$, $\Delta A_r \gg 0$, $\Delta^2 A_r < 0$.
  • Medium training phase (MTP): Accuracy improvement slows and begins to fluctuate. The model enters a local convergence region, where training shifts from coarse adjustment to fine-tuning and becomes more sensitive to data-induced variations: $\Delta A_r \to 0^{+}$, and $\Delta^2 A_r$ changes sign frequently.
  • Final training phase (FTP): The model approaches convergence and accuracy saturates. Gradients diminish, performance gains become negligible, and variations are mainly due to noise: $A_r \approx A_{\mathrm{target}}$, $\Delta A_r \approx 0$, and $\mathrm{Var}(A_r)$ becomes stable or slightly increases.
This observed three-phase learning behavior aligns with established SGD theory on non-convex landscapes [34]. The effect is especially evident in FL, where non-IID data slows convergence and often triggers distinct performance phase transitions [35]. Crucially, the phase transition reflects a fundamental shift in the optimization objective: from aggressively minimizing the ADC in the ITP to prioritizing the conservation of TDC and ECC in the MTP and FTP. To monitor this evolution accurately, model accuracy evaluations are conducted on the cloud server using a fixed, pre-sampled test dataset. This standard approach ensures an unbiased assessment of the cloud model and avoids evaluation errors caused by heterogeneous local data.
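A simple classifier over the $\Delta A_r$ and $\Delta^2 A_r$ signals described above might look as follows. The threshold and decision rules are our own hypothetical simplification, not the lightweight detection mechanism used inside AptEVS.

```python
def detect_phase(acc_history, acc_target, speed_thresh=0.005):
    """Classify the current training phase from the accuracy curve."""
    if len(acc_history) < 3:
        return "ITP"                                   # too little history: assume start
    d1 = acc_history[-1] - acc_history[-2]             # learning speed Delta A_r
    d2 = d1 - (acc_history[-2] - acc_history[-3])      # acceleration Delta^2 A_r
    if acc_history[-1] >= acc_target and abs(d1) < speed_thresh:
        return "FTP"                                   # saturated at the target accuracy
    if d1 > speed_thresh and d2 < 0:
        return "ITP"                                   # rapid but decelerating gains
    return "MTP"                                       # slow, fluctuating improvement
```

Because the detector only needs the last three cloud-round accuracies, its overhead is negligible relative to a training round, which is what makes a phase-aware scheduler practical.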

4.3. Impact of Environmental Dynamics

Beyond the intrinsic task attributes and training phase evolution, the external environment introduces another layer of complexity. We decompose these environmental dynamics into three distinct but interconnected layers:
(1) Impact of Quasi-Static Network Topology (Edge Spacing). We first analyze the network topology, using edge spacing d edge as a representative parameter. Our results in Table 4 reveal a non-monotonic structural shift in the optimal EVS ( M , N ) configuration as spacing increases from 200 m to 400 m, flipping from a vehicle-centric ( 2 , 6 ) to an edge-centric ( 6 , 2 ) . This reflects a change in the dominant performance bottleneck. At small spacing, handover-induced straggler risk becomes the primary challenge, inflating the TDC. The optimal strategy therefore favors a more centralized structure (fewer ENs) to minimize instability. Conversely, at large spacing, the bottleneck shifts to communication inefficiency, prompting a more distributed topology (more ENs) to improve average connectivity. This indicates that the optimal EVS is highly sensitive to the physical infrastructure layout of ENs.
(2) Impact of Baseline Data Heterogeneity. Operating on this physical topology, the statistical distribution of data across vehicles introduces the fundamental learning challenge. Our experiments in Table 4 confirm that a higher degree of non-IID data makes it more difficult to reduce the ADC. To ensure stable convergence, the scheduler is incentivized to increase the EVS size. However, this directly conflicts with the goal of minimizing the TDC and ECC. The optimal EVS is therefore a direct reflection of the trade-off between lowering the ADC and containing the TDC and ECC.
(3) Synergistic Impact of Mobility Patterns. Finally, vehicle mobility patterns act as the dynamic mechanism that reshapes the data landscape within the constraints of the network topology. As observed in Table 4, mobility patterns synergistically influence the optimal EVS. We dissect these effects by analyzing the macro-level vehicle arrival rate and the micro-level vehicle speed characteristics.
  • Vehicle arrival rate: The arrival rate λ reveals a non-monotonic impact on the optimal EVS. At low rates, the system is data-starved, and the optimal EVS is small due to a lack of diverse candidates. At very high rates, the system becomes data-congested, as high vehicle density creates severe contention for limited communication bandwidth, increasing the TDC. This again forces the scheduler to select a smaller EVS to manage congestion. A moderate arrival rate provides the optimal balance of a rich data supply without prohibitive resource contention.
  • Vehicle speed characteristics: The speed characteristics $(\bar{v}, \sigma_v)$ illustrate the duality of mobility. The mean speed $\bar{v}$ dictates the primary trade-off: at low mean speeds, the bottleneck is stagnant data diversity, hindering ADC reduction; at high mean speeds, the dominant issue is system instability, increasing the TDC. The standard deviation $\sigma_v$ further acts as a data diversity amplifier; greater speed variation enhances vehicle mixing, allowing the system to achieve its learning goals with a potentially smaller EVS.
These findings collectively demonstrate that the optimal EVS must adapt not only to the edge spacing and the data on the vehicles but also to the dynamic evolution of the data landscape, which is shaped by complex mobility patterns.

4.4. Theoretical Analysis

The necessity for dynamic EVS adjustment stems from the fundamental conflict between convergence gain and system overhead. On the one hand, expanding the scale of participation generates significant convergence gains. According to the theoretical convergence bound for FedAvg on non-IID data derived in [56] and extended to hierarchical settings, the number of communication rounds R required to reach a target accuracy is inversely proportional to the number of participating nodes, satisfying $R \propto \frac{1}{MN}$. This implies that increasing the number of ENs and vehicles per EN accelerates global convergence. On the other hand, expanding the scale incurs substantial system overheads. The cost per training round is positively correlated with the participation scale: the ADC and TDC typically exhibit weak linear growth with respect to M and N, while the ECC often exhibits super-linear growth due to cumulative aggregation and transmission burdens. Because of task diversity and environmental dynamics, the task attributes that determine convergence gain and the environmental state that determines resource overhead differ significantly across scenarios, causing the optimal balance point between convergence gains and system costs to drift. Therefore, the long-term training cost is closely tied to the EVS configuration, and the optimal configuration is strongly scenario-dependent.
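To make this conflict concrete, the following toy model illustrates how the cost-minimizing $(M, N)$ drifts with the scenario. All constants are hypothetical: a convergence bound $R \propto 1/(MN)$ with scale factor kappa, a delay term linear in the participation scale, and a super-linear energy term. The sketch is illustrative, not the paper's actual cost model.

```python
import itertools


def rounds_to_target(M, N, kappa=400.0):
    # Convergence bound: required rounds scale as R ∝ 1/(M*N) (hypothetical constant kappa).
    return kappa / (M * N)


def cost_per_round(M, N, c_delay=1.0, c_energy=0.05):
    # Delay grows roughly linearly with the participation scale; energy super-linearly.
    return c_delay * (M + N) + c_energy * (M * N) ** 1.5


def best_config(c_energy, M_max=8, N_max=8):
    # Exhaustively search the (M, N) grid for the lowest long-term training cost.
    grid = itertools.product(range(1, M_max + 1), range(1, N_max + 1))
    return min(grid, key=lambda mn: rounds_to_target(*mn)
               * cost_per_round(*mn, c_energy=c_energy))
```

With a small energy weight the optimum sits at a large, convergence-friendly configuration; raising the energy weight tenfold pulls the optimum down to a frugal one, mirroring the "drifting balance point" argued above.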

5. AptEVS: Adaptive Edge-and-Vehicle Scheduling

Section 4 reveals that the optimal EVS configuration is highly sensitive to task dynamics, environmental variability, and training phases, exhibiting strong non-stationarity and nonlinearity. Such complexity limits the effectiveness of static or heuristic strategies. Given the problem’s nature as sequential decision-making under uncertainty with no tractable model, we adopt DRL to exploit its strength in learning adaptive policies in high-dimensional dynamic environments.

5.1. Framework Design Overview

A key challenge in applying DRL to VHFL is determining whether an agent can learn an effective scheduling policy online, within a dynamic and previously unseen vehicular environment, without relying on pre-trained models. Before a generalizable solution can be developed, it is necessary to first validate the feasibility of online learning in such settings. To this end, AptEVS is designed as a proof-of-concept framework aimed at demonstrating this feasibility. We intentionally configure the agent to learn from scratch in each distinct environment. If the agent can consistently converge from random initialization to a high-performing policy, it indicates that our DRL formulation has captured the essential dynamics of the problem, suggesting that the scheduling task is fundamentally learnable.
To tackle this learnable problem, a powerful DRL approach is required. While classic reinforcement learning algorithms like tabular Q-learning are effective for problems with discrete and manageable state spaces [57], they are ill-suited for our VHFL setting due to the curse of dimensionality. The system state s t that our agent must process is high-dimensional and composed of numerous continuous variables. Discretizing such a space would lead to a combinatorially explosive number of states, rendering a Q-table computationally infeasible. Therefore, we leverage a DQN-based approach, which employs a Deep Neural Network (DNN) as a powerful function approximator to learn the mapping from the high-dimensional, continuous state space to the optimal action values.
Building on this motivation, we present AptEVS, a phase-aware DRL framework that dynamically adjusts the agent’s learning strategy according to the current phase of the HFL process, specifically, the ITP, MTP, or FTP. The framework comprises two key components: (i) a lightweight module for detecting training phase transitions, and (ii) a set of phase-adaptive algorithms that jointly enhance exploration and convergence. The remainder of this section details the DRL environment formulation and the implementation of the phase-aware scheduling strategy.

5.2. Design of DRL Environment

Figure 5 shows the environment design overview for AptEVS. The DRL environment is defined by three standard elements (state, action, and reward), where each cloud round in VHFL corresponds to a discrete time step t. Below, we detail each component.
(1) State: The state $s_t$ is a structured vector composed of two sub-components, defined as $s_t = [s^{\mathrm{task}}, s_t^{\mathrm{env}}]$, allowing the agent to perceive multi-level dynamics for context-aware scheduling. Here,
  • $s^{\mathrm{task}} = \{C, Z\}$ is the set of static, intrinsic task attributes, where C is the computational complexity and Z is the communication load. While the model accuracies $\{A_t, A_{t-1}\}$ are available, they are not included in the state but are used by a separate phase detection module.
  • $s_t^{\mathrm{env}} = \{d_{\mathrm{edge}}, s_t^{\mathrm{data}}, s_t^{\mathrm{mobility}}\}$ captures the dynamics of the external environment.
    $d_{\mathrm{edge}}$ is the current distance between neighboring ENs.
    $s_t^{\mathrm{data}} = \{D_{\mathrm{avg}}, D_{\mathrm{std}}, S_{\mathrm{avg}}, S_{\mathrm{std}}\}$ captures the real-time statistical characteristics of the distributed data, where $D_{\mathrm{avg}}$ and $D_{\mathrm{std}}$ denote the mean and standard deviation of the local data sizes across vehicles, and $S_{\mathrm{avg}}$ and $S_{\mathrm{std}}$ represent the corresponding statistics of the local label skew. These statistics are reported by vehicles in a privacy-preserving form. Specifically, the label skew for client i is defined as $S_i = \mu \sum_{k=1}^{K} \left( \alpha_i^k - \frac{1}{K} \right)^2$ [58].
    $s_t^{\mathrm{mobility}} = \{\lambda, \bar{v}, \sigma_v\}$ is the vehicle mobility pattern in the current candidate training segment. $\lambda$ denotes the current vehicle arrival rate, and $\bar{v}$ and $\sigma_v$ are the mean and standard deviation of vehicle speeds.
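As an illustration, the data statistics in the state could be computed as in the following sketch. It assumes the label skew of client i is the squared deviation of its label proportions $\alpha_i^k$ from the uniform distribution, with the scaling constant μ set to 1 for simplicity; the function names are hypothetical.

```python
import numpy as np


def label_skew(counts, mu=1.0):
    """Label skew S_i = mu * sum_k (alpha_ik - 1/K)^2, where alpha_ik is the
    fraction of client i's samples carrying label k (mu: scaling constant)."""
    counts = np.asarray(counts, dtype=float)
    K = counts.size
    alpha = counts / counts.sum()
    return mu * np.sum((alpha - 1.0 / K) ** 2)


def data_state(client_label_counts):
    """Aggregate the per-vehicle statistics (D_avg, D_std, S_avg, S_std)."""
    sizes = np.array([c.sum() for c in client_label_counts], dtype=float)
    skews = np.array([label_skew(c) for c in client_label_counts])
    return sizes.mean(), sizes.std(), skews.mean(), skews.std()
```

A perfectly balanced client yields a skew of 0, while a client holding a single label out of K approaches $1 - 1/K$, so the statistic grows with the degree of non-IIDness.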
(2) Action: The agent selects an EVS configuration as the action $a_t = (M_t, N_t)$, where $M_t$ is the number of selected ENs and $N_t$ is the number of scheduled vehicles per EN.
(3) Reward: To provide a stable and effective learning signal, we design a novel reward function based on a short-term target accuracy $A_{\mathrm{target}}^{ST}$. This serves as a dynamic reference point to measure the model's relative performance improvement. We define $A_{\mathrm{target}}^{ST}$ as the peak accuracy achieved within the last $R_{\mathrm{cycle}}$ rounds:
$$A_{\mathrm{target}}^{ST} = \max \{ A_{t - R_{\mathrm{cycle}}}, \ldots, A_{t-1}, A_t \}.$$
Based on this dynamic target, we then define a short-term accuracy deviation cost (ST-ADC), which penalizes deviations from this short-term peak:
$$C_{\mathrm{STA}}(t) = \begin{cases} \frac{1}{2} \, \Xi^{(A_{\mathrm{target}}^{ST} - A_t)}, & A_t > A_{\mathrm{target}}^{ST}, \\ 1 - \frac{1}{2} \, \Xi^{(A_t - A_{\mathrm{target}}^{ST})}, & A_t \le A_{\mathrm{target}}^{ST}. \end{cases}$$
The final reward $r_t$ at time step t is computed as a weighted sum of the ST-ADC, TDC, and ECC:
$$r_t = -\alpha_1 \beta_1 C_{\mathrm{STA}}(t) - \alpha_2 \beta_2 T(t) - \alpha_3 \beta_3 E(t),$$
where this raw reward is subsequently normalized to a consistent scale to ensure stable learning throughout the training process.
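A minimal sketch of this reward computation follows. The base Ξ, window length, and cost weights are placeholder values, not the paper's tuned settings, and the moving target is taken over the previous rounds (excluding the current one) so that surpassing the peak remains distinguishable.

```python
from collections import deque

XI = 64.0  # Ξ: exponential base (hypothetical value)


class ShortTermReward:
    """Sketch of the ST-ADC reward: anchor to the peak accuracy of the
    last r_cycle rounds and penalize deviation from that moving target."""

    def __init__(self, r_cycle=10):
        self.history = deque(maxlen=r_cycle)

    def st_adc(self, acc):
        # Peak of the previous r_cycle rounds serves as the short-term target.
        target = max(self.history, default=acc)
        self.history.append(acc)
        if acc > target:
            return 0.5 * XI ** (target - acc)    # above the peak: low cost
        return 1.0 - 0.5 * XI ** (acc - target)  # below the peak: high cost

    def reward(self, acc, delay, energy, a=(1.0, 1.0, 1.0), b=(1.0, 0.1, 0.1)):
        # r_t = -α1 β1 C_STA(t) - α2 β2 T(t) - α3 β3 E(t)  (placeholder weights)
        return -(a[0] * b[0] * self.st_adc(acc)
                 + a[1] * b[1] * delay
                 + a[2] * b[2] * energy)
```

Because the anchor is a recent peak rather than an absolute target, the cost stays centered around 0.5 throughout training instead of saturating once accuracy plateaus.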
This novel formulation is motivated by the limitations of more straightforward approaches. A simple method, such as directly using the negative training cost C(t) from our global objective (P1) in (18), is problematic due to the sparse and delayed nature of the learning signal. Furthermore, conventional reward designs in FL that rely on instantaneous accuracy metrics, such as $A_t$ [59], $\Xi^{A_t}$ [44], or $(A_t - A_{t-1})$ [57], are suboptimal in dynamic environments like VHFL. As our analysis in Section 4.2 revealed, such signals can become monotonic, less informative, or highly susceptible to noise from non-IID data.
This dynamic-target-guided design offers three benefits: (i) it avoids reward saturation by leveraging a moving accuracy baseline; (ii) it enhances action discriminability via relative performance feedback; and (iii) it suppresses noise sensitivity by anchoring to a periodically updated peak accuracy.

5.3. Phase-Aware Scheduling Algorithms and Workflow

The adaptive capability of AptEVS is realized through a set of specialized algorithms, orchestrated by the main coordinating workflow detailed in Algorithm 1. This main algorithm functions as a high-level scheduler. At the beginning of each training episode, the agent's components are initialized (Lines 1–2). Then, in each round r, namely, time step t, the agent's core logic is as follows. First, it observes the comprehensive system state $s_t$ (Line 5). It then calls the Adaptive Phase Detection Mechanism in Algorithm 2 to identify the current training phase based on recent performance (Lines 7–8). Based on the detected phase, the agent invokes the appropriate action selection subroutine: structured exploration in Algorithm 3 for the ITP (Lines 9–13), priority-enhanced DQN training in Algorithm 4 for the MTP (Lines 14–21), or the pure exploitation policy for the FTP (Lines 22–26). After executing the action and receiving a reward, the experience is stored and used to train the DQN networks via the classic Mixed Experience Replay (MER) mechanism (Lines 13 and 18). This entire process repeats, allowing the agent to continuously learn and converge towards a stable, high-performance scheduling policy tailored to the specific dynamics of the current environment.

5.3.1. Adaptive Phase Detection Mechanism

A cornerstone of AptEVS is its ability to perceive the training phase evolution of HFL. Based on our observations in Section 4, we designed the lightweight, rule-based mechanism detailed in Algorithm 2 to detect transitions between phases based on accuracy stagnation and policy stability.
Algorithm 1 AptEVS: phase-aware DRL scheduling
Input: Discount factor γ, learning rate α, target network update frequency J, cloud interval κ1, initial epsilon ϵ0, decay coefficient ζ_ϵ, minimum epsilon ϵ_min, and mini-batch size b_DQN.
Output: Trained online network parameters θ.
1: Initialize the DQN online and target networks θ, θ′ and the replay buffer V.
2: Initialize counters k1 ← 0, k2 ← 0, the best action a_best, and counter thresholds K1 and K2.
3: for each cloud round r = 1, 2, …, R do
4:     Set the current time step t to r.
5:     Observe the current system state s_t.
6:     // Adaptive phase detection
7:     Apply Algorithm 2 to obtain P_t and {k1, k2}.
8:     // Phase-specific action selection
9:     if P_t is ITP then ▹ DQN training
10:        Apply Algorithm 3 to obtain and execute a_t.
11:        Calculate the normalized reward r_t.
12:        Store the experience (s_t, a_t, r_t, s_{t+1}) in V.
13:        Train the DQN networks θ, θ′ using MER from V.
14:    else if P_t is MTP then ▹ DQN training
15:        Apply Algorithm 4 to obtain and execute a_t.
16:        Calculate the normalized reward r_t.
17:        Store the experience (s_t, a_t, r_t, s_{t+1}) in V.
18:        Train the DQN networks θ, θ′ using MER from V.
19:        if a_t is a new best action then
20:            a_best ← a_t.
21:        end if
22:    else
23:        // Phase switches to the FTP
24:        a_t ← argmax_{a ∈ A_t} Q(s_t, a; θ) ▹ DQN inference
25:        Execute a_t.
26:    end if
27: end for
28: return θ
ITP → MTP Transition: Theoretically, the magnitude of the accuracy gain (ΔA) alone serves as a sufficient indicator to distinguish between training phases. However, strictly applying this theoretical threshold to single-round data is prone to error due to the stochastic volatility inherent in federated learning. To bridge the gap between this macroscopic theoretical trend and microscopic observational noise, we introduce a cumulative counting mechanism. Instead of relying on instantaneous values, this mechanism triggers a phase shift only when the accuracy gain satisfies the condition for a cumulative number of rounds. This effectively acts as a temporal filter, smoothing out transient fluctuations and ensuring that the detected phase transition reflects a stable trend rather than stochastic noise. The transition from the ITP to the MTP is triggered when the accuracy gain stagnates, which we detect when the accuracy gain $\Delta A_t$ remains below a predefined threshold δ for $K_1$ consecutive rounds (Lines 4–6). Upon detecting this transition, a critical adaptation is performed: the action space is pruned for the subsequent MTP. This step leverages the exploration from the ITP to establish an empirical lower bound for effective EVS configurations, significantly accelerating the subsequent policy learning. Let $a_{\min}^{ITP} = (M_{\min}^{ITP}, N_{\min}^{ITP})$ be the final EVS configuration from the ITP. The original action space $\mathcal{A}$ is defined as $\mathcal{A} = \{(M, N) \mid M \in \{1, \ldots, M_{\max}\}, N \in \{1, \ldots, N_{\max}\}\}$. The new, refined action space for the MTP is then formally constructed by retaining only the configurations that dominate this empirically found lower bound: $\mathcal{A}_t \leftarrow \{(M, N) \in \mathcal{A} \mid M \ge M_{\min}^{ITP}, N \ge N_{\min}^{ITP}\}$.
Algorithm 2 Adaptive phase detection mechanism
Input: Current phase P_t, accuracy history {A_t, A_{t−1}}, counters {k1, k2}, best action a_best, current action a_t, thresholds {δ, K1, A_ITP, K2}.
Output: Next phase P_{t+1}, updated counters {k1, k2}.
1: Initialize P_{t+1} ← P_t.
2: if P_t is ITP then
3:     // Check for the ITP → MTP transition
4:     if (A_t − A_{t−1}) < δ then
5:         k1 ← k1 + 1.
6:     end if
7:     if (k1 ≥ K1 or A_t > A_ITP) and t > 1 then
8:         P_{t+1} ← MTP. ▹ Transition to the MTP
9:         Construct the new action space A_t.
10:    end if
11: else if P_t is MTP then
12:    // Check for the MTP → FTP transition
13:    if a_t = a_best then
14:        k2 ← k2 + 1.
15:    else
16:        k2 ← 0. ▹ Reset the counter
17:    end if
18:    if k2 ≥ K2 then
19:        P_{t+1} ← FTP. ▹ Transition to the inference phase
20:    end if
21: end if
22: return P_{t+1}, {k1, k2}.
Algorithm 3 Structured exploration for ITP
1: Input: Accuracy history {A_{t−1}, A_{t−2}}, last action a_{t−1} = (M_{t−1}, N_{t−1}).
2: Output: Action a_t = (M_t, N_t).
3: if t == 1 then
4:     (M_t, N_t) ← (M_max, N_max). ▹ Ensure a robust initial performance
5: else if t == 2 then
6:     (M_t, N_t) ← (M_min, N_min). ▹ Probe the baseline
7: else
8:     (M_t, N_t) ← (M_{t−1}, N_{t−1}). ▹ Retain the previous configuration
9:     if ΔA_{t−1} ≤ 0 then ▹ If accuracy stagnates
10:        if M_t < M_max then
11:            M_t ← M_t + 1. ▹ Preferentially increase ENs
12:        else if N_t < N_max then
13:            N_t ← N_t + 1.
14:        end if
15:    end if
16: end if
17: return (M_t, N_t).
MTP → FTP Transition: The MTP is the agent’s primary learning phase, where it refines its scheduling policy. The transition to the final, inference-only FTP is determined by the convergence of this learned policy. We assess convergence based on policy stability. Specifically, the agent tracks the current best-known action a best (Line 13). A counter is incremented only when the agent’s chosen action a t in a round matches a best (Line 14). If the agent discovers a new, superior action, a best is updated, and the counter is reset (Line 16). The policy is considered to have converged when this counter reaches a predefined threshold K 2 , signifying that the optimal action has remained stable for a sufficient number of consecutive rounds (Line 18).
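The two transition rules above, together with the ITP→MTP action-space pruning, can be sketched as a pair of small functions. The thresholds (δ, K1, K2, A_ITP) are placeholder values, not the paper's tuned settings.

```python
def detect_phase(phase, acc_t, acc_prev, a_t, a_best, k1, k2,
                 delta=0.002, K1=5, K2=8, A_ITP=0.6):
    """Rule-based phase detection (a sketch of Algorithm 2).
    Returns (next_phase, k1, k2), updating the stagnation/stability counters."""
    if phase == "ITP":
        if acc_t - acc_prev < delta:   # accuracy gain stagnates this round
            k1 += 1
        if k1 >= K1 or acc_t > A_ITP:  # sustained stagnation or early accuracy target
            return "MTP", k1, k2
    elif phase == "MTP":
        if a_t == a_best:              # chosen action matches the best-known action
            k2 += 1
        else:
            k2 = 0                     # a different action resets the stability counter
        if k2 >= K2:
            return "FTP", k1, k2       # policy stable: switch to pure inference
    return phase, k1, k2


def prune_action_space(M_min_itp, N_min_itp, M_max=8, N_max=8):
    """ITP→MTP pruning: keep only configurations dominating the ITP lower bound."""
    return [(M, N) for M in range(M_min_itp, M_max + 1)
                   for N in range(N_min_itp, N_max + 1)]
```

The cumulative counters act as the temporal filter described above: a single noisy round never triggers a transition, only a sustained trend does.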
Algorithm 4 Priority-enhanced DQN training for MTP
1: Input: State s_t, epsilon ϵ.
2: Output: Action a_t.
3: if rand() < ϵ then ▹ Exploration
4:     Calculate the prioritization probability p(a) in (29) for each action a ∈ A_t.
5:     Sample action a_t from A_t according to p(a).
6: else ▹ Exploitation
7:     a_t ← argmax_{a ∈ A_t} Q(s_t, a; θ).
8: end if
9: Update ϵ with the decay coefficient ζ_ϵ.
10: return a_t.

5.3.2. Phase-Aware DRL Algorithms

AptEVS employs customized DQN algorithms for the ITP and MTP, aiming to maximize learning efficiency while minimizing exploration overhead.
Structured Exploration for ITP. In the ITP, the objective is to quickly identify a good-enough EVS configuration. We design Algorithm 3, which, after an initial probing round at the maximal setting, starts from the minimal EVS setting and increases its scale only when accuracy improvement stagnates. This not only reduces the long-term training cost during the ITP but, more importantly, identifies a constrained, high-potential action space for the subsequent MTP, significantly accelerating the main DRL training process. The joint state–action space of HFL is vast, and large-scale configurations often yield high penalties. In the ITP, random exploration in these regions causes extreme variance in the target Q-values, destabilizing training. Our "start small" strategy constrains exploration to a low-cost feasible region, reducing reward variance and providing a robust warm start for the policy.
Priority-Enhanced DQN Training for MTP. In the MTP, the agent performs fine-grained policy optimization within the reduced action space identified during the ITP. As illustrated in Algorithm 4, we propose a priority-enhanced epsilon-greedy strategy, where the probability of exploring a specific action is inversely proportional to its resource cost $(M_t \cdot N_t)$. This design ensures that lower-cost actions are explored more frequently, further improving training efficiency. The probability of selecting action a is
$$p(a) = \frac{1 / (M_t \cdot N_t)}{\sum_{(\hat{M}_t, \hat{N}_t) \in \mathcal{A}_t} 1 / (\hat{M}_t \cdot \hat{N}_t)}.$$
In the MTP, the priority-enhanced strategy acts as guided exploration. By weighting exploration probabilities inversely to resource cost, it effectively performs importance sampling. This concentrates computational resources on potentially optimal subspaces rather than invalid high-cost regions, thereby improving sample efficiency and accelerating convergence.
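A sketch of this sampling rule and the surrounding epsilon-greedy choice, using only Python's standard library (the function names are illustrative):

```python
import random


def exploration_probs(actions):
    """Priority-enhanced exploration: p(a) ∝ 1/(M*N), so cheaper
    configurations are sampled more often."""
    inv = [1.0 / (M * N) for (M, N) in actions]
    total = sum(inv)
    return [w / total for w in inv]


def select_action(actions, q_values, epsilon, rng=random):
    """Epsilon-greedy with cost-prioritized exploration (sketch of Algorithm 4)."""
    if rng.random() < epsilon:
        # Explore: sample according to the inverse-cost distribution p(a).
        return rng.choices(actions, weights=exploration_probs(actions), k=1)[0]
    # Exploit: a_t = argmax_a Q(s_t, a; θ), here a lookup table standing in for the DQN.
    return max(actions, key=lambda a: q_values[a])
```

With, say, candidate actions (1, 1), (2, 2), and (4, 4), the sampling weights are proportional to 1, 1/4, and 1/16, so the cheapest configuration dominates exploration while the Q-network still drives exploitation.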
Policy Deployment for FTP. During the final deployment phase, the agent shifts from training to pure inference. It ceases exploration and exclusively exploits the learned Q-network. Unlike the ITP (heuristic acceleration) and MTP (DRL-based learning), the scheduling design for the FTP focuses on exploitation rather than exploration. Our empirical analysis indicates that during the MTP, the global model’s accuracy fluctuations are highly sensitive to the scheduling actions ( M , N ) . This sensitivity provides rich feedback (gradients) for the DQN agent, enabling it to efficiently explore the state space and converge to an optimal policy. Once the DQN agent converges, the learned policy is deemed robust for the remaining convergence process. Consequently, in the FTP, AptEVS terminates the training of the DQN and switches to Inference Mode. The ENs and vehicles are scheduled directly based on the inference results of the trained DQN model.

6. Performance Evaluation

This section presents a comprehensive evaluation of the proposed AptEVS framework. We first establish the overall superiority of AptEVS through a performance comparison against a static baseline. Following this, we conduct two rigorous ablation studies to isolate and validate the individual contributions of the phase-aware mechanism and novel reward function. The evaluation concludes with an analysis of the framework’s robustness and computational overhead.

6.1. Experimental Setup

Consider a VHFL system with $M_{\max} = 8$ ENs. It is evaluated on the MNIST, FMNIST, and CIFAR-10 tasks under two types of non-IID data partitioning. The DRL agent is realized as a fully connected DQN comprising two hidden layers of 128 and 64 neurons, respectively. Each layer adopts the ReLU activation function, and the network parameters are optimized with the Adam algorithm. The simulation parameters for these HFL tasks align with those specified in Section 4.1. Additionally, the DRL hyperparameters of AptEVS are detailed in Table 5.

6.2. Overall Performance Comparison

To evaluate the overall performance of AptEVS, we compare it with two representative baselines:
  • Fixed proportional Edge-and-Vehicle Scheduling (FpEVS): A static baseline used to reflect uniform and task-independent scheduling commonly adopted in earlier studies. It schedules all ENs and 30% of vehicles in every training round, following the empirical setup in [11].
  • Phase-aware Edge-and-Vehicle Scheduling (PahEVS): A dynamic baseline that adjusts the vehicle participation ratio across different training phases, 20% in ITP, 30% in MTP, and 50% in FTP, while keeping all ENs scheduled. This design is based on the core insight from [33] that scheduling fewer clients in early stages and more in later stages can reduce training cost.
Although FpEVS and PahEVS do not rely on explicit optimization, they reflect two common scheduling paradigms in FL and provide useful references for evaluating the effectiveness of our adaptive approach. Given the absence of prior studies employing DRL for the joint optimization of edge-and-vehicle scheduling quantities in HFL, no direct state-of-the-art DRL baseline exists for this specific problem formulation. Therefore, we benchmark AptEVS against representative static (FpEVS) and dynamic heuristic (PahEVS) methods. To further validate that the performance gains are intrinsic to our proposed mechanisms rather than the generic RL backbone, we conduct rigorous ablation studies. These studies isolate the contributions of the novel reward function and the phase-aware strategy, verifying their role in ensuring robust performance without relying on complex model-based assumptions.
To ensure the reproducibility and reliability of the reported performance, all experiments were conducted over five independent trials with distinct random seeds. In the accuracy plots, the solid curves represent the average accuracy, while the shaded regions reflect the fluctuation range across trials. For the bar charts, the bar height represents the mean value, and the black error bars explicitly denote the standard deviation.
As shown in Figure 6(a1–c1), AptEVS does not always require the fewest communication rounds to reach the target accuracy across different tasks. However, it consistently achieves the lowest long-term training cost, demonstrating superior efficiency under diverse task settings. As illustrated in Figure 6(a2–c2), AptEVS reduces the long-term training cost by up to 44.4%, 37.5%, and 29.2% compared with the static FpEVS policy for the MNIST, FMNIST, and CIFAR-10 tasks, respectively.
Interestingly, the dynamic PahEVS policy, despite its phase awareness, incurs an even higher long-term training cost than the static FpEVS. This counter-intuitive result highlights a critical weakness of heuristic-based scheduling: PahEVS rigidly enforces a high participation rate (50%) in the FTP, causing a surge in TDC and ECC that far outweighs the marginal gains in model accuracy. This underscores that coarse-grained adaptivity can be inefficient.
In contrast, by intelligently managing this trade-off, AptEVS achieves further cost reductions of 62.4%, 56.3%, and 66.0% compared to PahEVS on the same tasks. This superior efficiency stems from its adaptive scheduling policy, which dynamically balances the cost and benefits among ADC, TDC, and ECC. These results validate our core hypothesis: static or coarse-grained heuristic strategies are insufficient to address the heterogeneous and dynamic nature of VHFL. Instead, AptEVS confirms that fine-grained, cost-aware adaptivity is essential for true optimization in VHFL.

6.3. Ablation Study: Efficacy of the Reward Function Design

The design of the reward function is paramount as it directly guides the DRL agent toward the desired performance objectives. To rigorously validate the effectiveness of our proposed reward structure in AptEVS, we conduct a comprehensive ablation study. We compare the performance of our complete AptEVS algorithm against four distinct variants. Each variant preserves the core architecture of AptEVS but substitutes our reward function with an alternative design.
  • AptEVS-A: Employs a reward based on the direct model accuracy $A_t$, as in [59]. The reward is defined as $r_t = \alpha_1 \beta_1 A_t - \alpha_2 \beta_2 T(t) - \alpha_3 \beta_3 E(t)$.
  • AptEVS-EA: Uses an exponential form of the accuracy, $\Xi^{A_t}$, as the primary reward component, following [44]. The reward is defined as $r_t = \alpha_1 \beta_1 \Xi^{A_t} - \alpha_2 \beta_2 T(t) - \alpha_3 \beta_3 E(t)$.
  • AptEVS-ADC: Utilizes the ADC $C_A(t)$ as the reward signal, a method used in [45]. The reward is defined as $r_t = -\alpha_1 \beta_1 C_A(t) - \alpha_2 \beta_2 T(t) - \alpha_3 \beta_3 E(t)$.
  • AptEVS-ACAR: A variant based on [57], which uses a weighted sum of the current accuracy and the change in accuracy from the previous round. Let $\mu_1$ and $\mu_2$ be weights such that $\mu_1 + \mu_2 = 1$, and set $U_A(t) = \mu_1 A_t + \mu_2 (A_t - A_{t-1})$. The reward function is given by
$$r_t = \begin{cases} A_t + r_1, & A_t < A_{t-1}, \\ \alpha_1 \beta_1 U_A(t) - \alpha_2 \beta_2 T(t) - \alpha_3 \beta_3 E(t), & A_t \ge A_{t-1}. \end{cases}$$
As shown in Figure 7, AptEVS consistently achieves the lowest long-term training cost. This confirms our earlier analysis in Section 5.2: traditional rewards suffer from saturation, noise sensitivity, or limited discriminability. Our feedback signal, based on a short-term target accuracy anchor, provides more stable and informative learning guidance, enabling more efficient policy learning.
To further validate the effectiveness of our reward design, we observe the reward progression of AptEVS across edge rounds. Empirical results in Figure 8 show that the DRL-based policy converges quickly, within the early stages of HFL training, requiring less than 10% of the total training rounds across all evaluated tasks. This highlights that the exploration cost induced by our reward function remains limited and that the learned policy generalizes effectively in later stages.

6.4. Ablation Study: Efficacy of the Phase-Aware Mechanism

To verify the impact of our proposed phase-aware mechanism, we conduct an ablation study comparing the full AptEVS framework against a degenerated version, AptEVS-vanilla, which removes this mechanism. As illustrated in Figure 9, the full AptEVS agent exhibits similar or even accelerated convergence while consistently achieving the lowest long-term training cost. This significant performance advantage stems from the phase-aware design. The ITP encourages structured exploration, allowing the agent to build a robust understanding of the environment and effectively prune the vast action space. Subsequently, the main MTP capitalizes on this knowledge, enabling focused policy refinement within a more promising region of the action space. This leads to a more stable and efficient learning process, culminating in a superior final policy.

6.5. Performance Robustness Across Diverse Environments

Having verified the core mechanisms, we now investigate how well AptEVS adapts under diverse and challenging environments. Rather than exhaustive benchmarks, we select representative scenarios to examine the behavior of the learned policy and its ability to navigate dynamic bottlenecks.
Table 6 summarizes the long-term training cost across different edge spacings and mobility patterns. A striking observation is that AptEVS learns efficient scheduling strategies from scratch in the vast majority of scenarios, including unfamiliar environments, underscoring its robustness. In a few specific circumstances, the fixed configuration of FpEVS happens to coincide with the near-optimal operating point. Because AptEVS is an online learning agent, its inherent exploration overhead incurs a slight performance cost, so the zero-overhead FpEVS yields a marginally lower cost in these cases. However, this advantage is merely situational: AptEVS achieves superior performance in the vast majority of scenarios. Its ability to remain effective globally, despite being slightly suboptimal in rare specific cases, strongly validates the robustness of the proposed method.
More importantly, the results reveal how AptEVS learns to dynamically adapt to diverse performance bottlenecks. First, the agent learns to modulate its EVS configuration in response to edge spacings, strategically balancing the handover-induced straggler risks of small edge spacings against the vehicle edge communication inefficiency of large spacings. Second, it learns to manage the trade-off between data sparsity at low arrival rates and resource congestion at high arrival rates. Finally, the agent demonstrates a sophisticated understanding of the duality of vehicle speed. It learns to counter the high instability risk of fast-moving environments by adopting a conservative, small-EVS policy to control the TDC. Concurrently, it learns to recognize and exploit the speed standard deviation σ v as a data diversity amplifier. The results show that when σ v is high, the agent strategically reduces the EVS size. This indicates it has learned that the increased vehicle mixing naturally helps satisfy the ADC requirement, freeing the agent to select a smaller, more resource-efficient configuration that minimizes TDC and ECC.
Adaptability to Diverse Preference Weights: It is worth noting that the weighting factors in Equation (28) embody the diverse service requirements of vehicular tasks. A key advantage of the proposed AptEVS framework is its inherent preference-aware adaptability. Since the DRL agent updates its action value estimates to derive a policy that maximizes the cumulative reward derived directly from these weighted costs, changes in weighting factors automatically reshape the reward landscape. Consequently, the agent dynamically adjusts its learned policy to align with the new weights without requiring modification to the algorithm’s structure. This ensures that AptEVS functions as a robust, general-purpose solver capable of satisfying varying task objectives.

6.6. Overhead Analysis

The cloud servers typically possess high computational capabilities, with an aggregate processing frequency equivalent to 50 GHz. The computational complexity of the DQN model in AptEVS is approximately $10^4$ cycles/sample. During the DQN training phase, whenever an HFL cloud round is completed, a mini-batch of $b_{\mathrm{DQN}} = 16$ experience samples is used to train the DQN model, resulting in a runtime of approximately 3 µs. Additionally, the time taken by the DQN model for a decision-making action is approximately 1 µs, so the total additional time overhead of AptEVS is only about 4 µs. Given that the execution time of each cloud round in HFL is approximately 1 s, the time overhead introduced by AptEVS is virtually negligible. Owing to the minimal computation time required, the corresponding energy consumption overhead is also minimal, exerting an insignificant impact on overall system performance.
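The training-step figure follows directly from the numbers quoted above; a quick arithmetic check:

```python
# Back-of-the-envelope check of the DQN training overhead quoted above.
CLOUD_FREQ_HZ = 50e9        # aggregate cloud processing frequency (50 GHz)
CYCLES_PER_SAMPLE = 1e4     # approximate DQN complexity per experience sample
BATCH = 16                  # mini-batch size b_DQN

train_step_s = BATCH * CYCLES_PER_SAMPLE / CLOUD_FREQ_HZ
print(f"DQN training step: {train_step_s * 1e6:.1f} us")  # ≈ 3.2 µs, matching the ~3 µs figure
```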
The state information required by AptEVS consists of low-dimensional scalars. The total size of this metadata is negligible compared to the high-dimensional model parameters exchanged in HFL. Therefore, the bandwidth consumed by state monitoring is insignificant. Additionally, processing this state information primarily involves calculating means and standard deviations. These basic statistical operations incur minimal CPU load on the powerful cloud server compared to the complex gradient calculations in HFL.
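To make "negligible" concrete, the metadata can be compared against a single model upload. This is a sketch under an assumption: the DRL state is taken to hold a few dozen float32 scalars (the exact count is hypothetical), while the model size is the MNIST communication load from Table 3:

```python
# Rough size comparison: AptEVS state metadata vs. one HFL model upload.
# Assumption (hypothetical): the DRL state holds a few dozen float32 scalars.
STATE_SCALARS = 32
state_bits = STATE_SCALARS * 32      # 1024 bits of monitoring metadata

model_bits = 13_492_544              # MNIST communication load from Table 3
ratio = state_bits / model_bits      # fraction of a single model transfer
```

Even with a generous scalar count, the monitoring traffic is four orders of magnitude below a single model exchange.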

7. Conclusions

This paper investigated the critical problem of adaptive EVS in VHFL systems, where static strategies fall short due to the interplay between diverse task-specific requirements and dynamic vehicular environments. To address this, we proposed a unified long-term training cost metric that jointly captures ADC, TDC, and ECC, providing a principled basis for evaluating performance under diverse tasks. Our experimental analysis demonstrated that the optimal EVS configuration is not fixed but evolves dynamically with task complexity, environmental conditions, and training phase progression. Based on this observation, we introduced AptEVS, a DRL-based framework that learns optimal EVS configurations online. AptEVS integrates a lightweight training phase detection mechanism with a phase-aware scheduling strategy, employing structured exploration in the ITP and priority-enhanced DQN in the MTP. Extensive simulations verified that AptEVS consistently derives high-performance scheduling policies from scratch, achieving significant reductions in long-term training cost compared with the baselines.
Future work may focus on improving sample efficiency for faster convergence, potentially through transfer learning by augmenting the DRL state space with model structural information. This approach aims to facilitate robust adaptation across varying tasks and dynamic environments. In conclusion, this study demonstrates the feasibility and benefits of phase-aware DRL-based scheduling, offering a promising direction for efficient and robust HFL in realistic vehicular networks.

Author Contributions

Conceptualization, Y.T.; Methodology, Y.T.; Software, Y.T.; Validation, Y.T.; Formal Analysis, Y.T.; Investigation, Y.T.; Resources, L.T.; Data Curation, Y.T.; Writing—Original Draft Preparation, Y.T.; Writing—Review and Editing, W.Z., L.Z. and S.L.; Visualization, Y.T.; Supervision, N.W. and Z.Z.; Project Administration, N.W.; Funding Acquisition, L.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 62120106007).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yan, H.; Li, Y. Generative AI for Intelligent Transportation Systems: Road Transportation Perspective. Acm Comput. Surv. 2025, 57, 315. [Google Scholar] [CrossRef]
  2. Talpur, A.; Gurusamy, M. Machine Learning for Security in Vehicular Networks: A Comprehensive Survey. IEEE Commun. Surv. Tutor. 2022, 24, 346–379. [Google Scholar] [CrossRef]
  3. Khan, M.W.; Obaidat, M.S.; Mahmood, K.; Sadoun, B.; Badar, H.M.S.; Gao, W. Real-Time Road Damage Detection Using an Optimized YOLOv9s-Fusion in IoT Infrastructure. IEEE Internet Things J. 2025, 12, 17649–17660. [Google Scholar] [CrossRef]
  4. Huang, Y.; Wang, F. D-TLDetector: Advancing Traffic Light Detection With a Lightweight Deep Learning Model. IEEE Trans. Intell. Transp. Syst. 2025, 26, 3917–3933. [Google Scholar] [CrossRef]
  5. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A.Y. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282. [Google Scholar]
  6. Liu, L.; Zhang, J.; Song, S.; Letaief, K.B. Client-edge-cloud hierarchical federated learning. In Proceedings of the ICC 2020–2020 IEEE International Conference on Communications (ICC), Virtual, 7–11 June 2020; pp. 1–6. [Google Scholar]
  7. Deng, Y.; Lyu, F.; Ren, J.; Wu, H.; Zhou, Y.; Zhang, Y.; Shen, X. AUCTION: Automated and quality-aware client selection framework for efficient federated learning. IEEE Trans. Parallel Distrib. Syst. 2021, 33, 1996–2009. [Google Scholar] [CrossRef]
  8. Tao, M.; Zhou, Y.; Shi, Y.; Lu, J.; Cui, S.; Lu, J.; Letaief, K.B. Federated Edge Learning for 6G: Foundations, Methodologies, and Applications. Proc. IEEE, 2024; early access. [Google Scholar]
  9. Banafaa, M.; Shayea, I.; Din, J.; Azmi, M.H.; Alashbi, A.; Daradkeh, Y.I.; Alhammadi, A. 6G mobile communication technology: Requirements, targets, applications, challenges, advantages, and opportunities. Alex. Eng. J. 2023, 64, 245–274. [Google Scholar] [CrossRef]
  10. Chen, T.; Yan, J.; Sun, Y.; Zhou, S.; Gündüz, D.; Niu, Z. Mobility accelerates learning: Convergence analysis on hierarchical federated learning in vehicular networks. IEEE Trans. Veh. Technol. 2025, 74, 1657–1673. [Google Scholar] [CrossRef]
  11. Zhang, T.; Lam, K.Y.; Zhao, J. Device Scheduling and Assignment in Hierarchical Federated Learning for Internet of Things. IEEE Internet Things J. 2024, 11, 18449–18462. [Google Scholar] [CrossRef]
  12. Zhao, J.; Chang, X.; Feng, Y.; Liu, C.H.; Liu, N. Participant selection for federated learning with heterogeneous data in intelligent transport system. IEEE Trans. Intell. Transp. Syst. 2022, 24, 1106–1115. [Google Scholar] [CrossRef]
  13. Taik, A.; Mlika, Z.; Cherkaoui, S. Clustered vehicular federated learning: Process and optimization. IEEE Trans. Intell. Transp. Syst. 2022, 23, 25371–25383. [Google Scholar] [CrossRef]
  14. Wu, Q.; Wang, S.; Fan, P.; Fan, Q. Deep reinforcement learning based vehicle selection for asynchronous federated learning enabled vehicular edge computing. In Proceedings of the International Congress on Communications, Networking, and Information Systems, Guilin, China, 25–27 March 2023; pp. 3–26. [Google Scholar]
  15. Tang, X.; Zhang, J.; Fu, Y.; Li, C.; Cheng, N.; Yuan, X. A Fair and Efficient Federated Learning Algorithm for Autonomous Driving. In Proceedings of the 2023 IEEE 98th Vehicular Technology Conference (VTC2023-Fall), Hong Kong, China, 10–13 October 2023; pp. 1–5. [Google Scholar]
  16. Sangdeh, P.K.; Li, C.; Pirayesh, H.; Zhang, S.; Zeng, H.; Hou, Y.T. CF4FL: A communication framework for federated learning in transportation systems. IEEE Trans. Wirel. Commun. 2022, 22, 3821–3836. [Google Scholar] [CrossRef]
  17. Saputra, Y.M.; Hoang, D.T.; Nguyen, D.N.; Tran, L.N.; Gong, S.; Dutkiewicz, E. Dynamic federated learning-based economic framework for internet-of-vehicles. IEEE Trans. Mob. Comput. 2021, 22, 2100–2115. [Google Scholar] [CrossRef]
  18. Li, Z.; Wu, H.; Lu, Y. Coalition based utility and efficiency optimization for multi-task federated learning in Internet of Vehicles. Future Gener. Comput. Syst. 2023, 140, 196–208. [Google Scholar] [CrossRef]
  19. Pervej, M.F.; Jin, R.; Dai, H. Resource constrained vehicular edge federated learning with highly mobile connected vehicles. IEEE J. Sel. Areas Commun. 2023, 41, 1825–1844. [Google Scholar] [CrossRef]
  20. Xiao, H.; Zhao, J.; Pei, Q.; Feng, J.; Liu, L.; Shi, W. Vehicle selection and resource optimization for federated learning in vehicular edge computing. IEEE Trans. Intell. Transp. Syst. 2021, 23, 11073–11087. [Google Scholar] [CrossRef]
  21. Wang, G.; Xu, F.; Zhang, H.; Zhao, C. Joint resource management for mobility supported federated learning in Internet of Vehicles. Future Gener. Comput. Syst. 2022, 129, 199–211. [Google Scholar] [CrossRef]
  22. Lin, F.P.C.; Hosseinalipour, S.; Michelusi, N.; Brinton, C.G. Delay-aware hierarchical federated learning. IEEE Trans. Cogn. Commun. Netw. 2023, 10, 674–688. [Google Scholar] [CrossRef]
  23. Luo, S.; Chen, X.; Wu, Q.; Zhou, Z.; Yu, S. HFEL: Joint edge association and resource allocation for cost-efficient hierarchical federated edge learning. IEEE Trans. Wirel. Commun. 2020, 19, 6535–6548. [Google Scholar] [CrossRef]
  24. Wang, Z.; Xu, H.; Liu, J.; Huang, H.; Qiao, C.; Zhao, Y. Resource-efficient federated learning with hierarchical aggregation in edge computing. In Proceedings of the IEEE INFOCOM 2021-IEEE Conference on Computer Communications, Vancouver, BC, Canada, 10–13 May 2021; pp. 1–10. [Google Scholar]
  25. Qi, T.; Zhan, Y.; Li, P.; Guo, J.; Xia, Y. Hwamei: A learning-based synchronization scheme for hierarchical federated learning. In Proceedings of the 2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS), Hong Kong, China, 18–21 July 2023; pp. 534–544. [Google Scholar]
  26. Wei, X.; Liu, J.; Shi, X.; Wang, Y. Participant selection for hierarchical federated learning in edge clouds. In Proceedings of the 2022 IEEE International Conference on Networking, Architecture and Storage (NAS), Philadelphia, PA, USA, 3–4 October 2022; pp. 1–8. [Google Scholar]
  27. Lim, W.Y.B.; Ng, J.S.; Xiong, Z.; Niyato, D.; Miao, C.; Kim, D.I. Dynamic edge association and resource allocation in self-organizing hierarchical federated learning networks. IEEE J. Sel. Areas Commun. 2021, 39, 3640–3653. [Google Scholar] [CrossRef]
  28. Su, L.; Zhou, R.; Wang, N.; Chen, J.; Li, Z. Low-latency hierarchical federated learning in wireless edge networks. IEEE Internet Things J. 2023, 11, 6943–6960. [Google Scholar] [CrossRef]
  29. Nguyen, T.D.; Tong, N.A.; Nguyen, B.P.; Nguyen, Q.V.H.; Le Nguyen, P.; Huynh, T.T. Hierarchical Federated Learning in MEC Networks with Knowledge Distillation. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–8. [Google Scholar]
  30. Kou, W.B.; Wang, S.; Zhu, G.; Luo, B.; Chen, Y.; Ng, D.W.K.; Wu, Y.C. Communication resources constrained hierarchical federated learning for end-to-end autonomous driving. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 9383–9390. [Google Scholar]
  31. Zhou, X.; Liang, W.; She, J.; Yan, Z.; Kevin, I.; Wang, K. Two-layer federated learning with heterogeneous model aggregation for 6g supported internet of vehicles. IEEE Trans. Veh. Technol. 2021, 70, 5308–5317. [Google Scholar] [CrossRef]
  32. Feng, C.; Yang, H.H.; Hu, D.; Zhao, Z.; Quek, T.Q.; Min, G. Mobility-aware cluster federated learning in hierarchical wireless networks. IEEE Trans. Wirel. Commun. 2022, 21, 8441–8458. [Google Scholar] [CrossRef]
  33. Lai, F.; Zhu, X.; Madhyastha, H.V.; Chowdhury, M. Oort: Efficient federated learning via guided participant selection. In Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI’21), Virtual, 14–16 July 2021; pp. 19–35. [Google Scholar]
  34. Swenson, B.; Murray, R.; Poor, H.V.; Kar, S. Distributed stochastic gradient descent: Nonconvexity, nonsmoothness, and convergence to local minima. J. Mach. Learn. Res. 2022, 23, 1–62. [Google Scholar]
  35. Zhao, Y.; Li, M.; Lai, L.; Suda, N.; Civin, D.; Chandra, V. Federated learning with non-iid data. arXiv 2018, arXiv:1806.00582. [Google Scholar] [CrossRef]
  36. Xie, B.; Sun, Y.; Zhou, S.; Niu, Z.; Xu, Y.; Chen, J.; Gunduz, D. MOB-FL: Mobility-aware federated learning for intelligent connected vehicles. In Proceedings of the ICC 2023-IEEE International Conference on Communications, Rome, Italy, 28 May–1 June 2023; pp. 3951–3957. [Google Scholar]
  37. Cui, J.; Chen, Y.; Zhong, H.; He, D.; Wei, L.; Bolodurina, I.; Liu, L. Lightweight encryption and authentication for controller area network of autonomous vehicles. IEEE Trans. Veh. Technol. 2023, 72, 14756–14770. [Google Scholar] [CrossRef]
  38. Ma, Z.; Zhang, T.; Liu, X.; Li, X.; Ren, K. Real-time privacy-preserving data release over vehicle trajectory. IEEE Trans. Veh. Technol. 2019, 68, 8091–8102. [Google Scholar] [CrossRef]
  39. Fu, Y.; Li, C.; Yu, F.R.; Luan, T.H.; Zhao, P. An incentive mechanism of incorporating supervision game for federated learning in autonomous driving. IEEE Trans. Intell. Transp. Syst. 2023, 24, 14800–14812. [Google Scholar] [CrossRef]
  40. Dai, P.; Hu, K.; Wu, X.; Xing, H.; Teng, F.; Yu, Z. A probabilistic approach for cooperative computation offloading in MEC-assisted vehicular networks. IEEE Trans. Intell. Transp. Syst. 2020, 23, 899–911. [Google Scholar] [CrossRef]
  41. Zhang, X.; Chang, Z.; Hu, T.; Chen, W.; Zhang, X.; Min, G. Vehicle selection and resource allocation for federated learning-assisted vehicular network. IEEE Trans. Mob. Comput. 2023, 23, 3817–3829. [Google Scholar] [CrossRef]
  42. Yu, Z.; Hu, J.; Min, G.; Zhao, Z.; Miao, W.; Hossain, M.S. Mobility-aware proactive edge caching for connected vehicles using federated learning. IEEE Trans. Intell. Transp. Syst. 2020, 22, 5341–5351. [Google Scholar] [CrossRef]
  43. Zhang, C.; Zhang, W.; Wu, Q.; Fan, P.; Fan, Q.; Wang, J.; Letaief, K.B. Distributed deep reinforcement learning based gradient quantization for federated learning enabled vehicle edge computing. IEEE Internet Things J. 2024, 12, 4899–4913. [Google Scholar] [CrossRef]
  44. Mao, W.; Lu, X.; Jiang, Y.; Zheng, H. Joint client selection and bandwidth allocation of wireless federated learning by deep reinforcement learning. IEEE Trans. Serv. Comput. 2024, 17, 336–348. [Google Scholar] [CrossRef]
  45. Wang, H.; Kaplan, Z.; Niu, D.; Li, B. Optimizing federated learning on non-iid data with reinforcement learning. In Proceedings of the IEEE INFOCOM 2020-IEEE Conference on Computer Communications, Virtual, 6–9 July 2020; pp. 1698–1707. [Google Scholar]
  46. Wu, C.; Ren, Y.; So, D.K. Adaptive User Scheduling and Resource Allocation in Wireless Federated Learning Networks: A Deep Reinforcement Learning Approach. In Proceedings of the ICC 2023-IEEE International Conference on Communications, Rome, Italy, 28 May–1 June 2023; pp. 1219–1225. [Google Scholar]
  47. Peng, Y.; Tang, X.; Zhou, Y.; Hou, Y.; Li, J.; Qi, Y.; Liu, L.; Lin, H. How to tame mobility in federated learning over mobile networks? IEEE Trans. Wirel. Commun. 2023, 22, 9640–9657. [Google Scholar] [CrossRef]
  48. You, C.; Guo, K.; Yang, H.H.; Quek, T.Q. Hierarchical personalized federated learning over massive mobile edge computing networks. IEEE Trans. Wirel. Commun. 2023, 22, 8141–8157. [Google Scholar] [CrossRef]
  49. Liu, S.; Guan, P.; Yu, J.; Taherkordi, A. Fedssc: Joint client selection and resource management for communication-efficient federated vehicular networks. Comput. Netw. 2023, 237, 110100. [Google Scholar] [CrossRef]
  50. Study on Channel Model for Frequencies from 0.5 to 100 GHz (Release 18); Technical Report TR 38.901 V18.0.0, 3GPP; ETSI: Sophia Antipolis, France, 2024.
  51. Zeng, T.; Semiari, O.; Chen, M.; Saad, W.; Bennis, M. Federated learning on the road autonomous controller design for connected and autonomous vehicles. IEEE Trans. Wirel. Commun. 2022, 21, 10407–10423. [Google Scholar] [CrossRef]
  52. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  53. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms. arXiv 2017, arXiv:1708.07747. [Google Scholar] [CrossRef]
  54. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Computer Science University of Toronto: Toronto, ON, Canada, 2009. [Google Scholar]
  55. Li, Q.; Diao, Y.; Chen, Q.; He, B. Federated learning on non-iid data silos: An experimental study. In Proceedings of the 2022 IEEE 38th International Conference on Data Engineering (ICDE), Kuala Lumpur, Malaysia, 9–12 May 2022; pp. 965–978. [Google Scholar]
  56. Li, X.; Huang, K.; Yang, W.; Wang, S.; Zhang, Z. On the convergence of fedavg on non-iid data. arXiv 2019, arXiv:1907.02189. [Google Scholar]
  57. Kim, Y.G.; Wu, C.J. Autofl: Enabling heterogeneity-aware energy efficient federated learning. In Proceedings of the MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, Virtual, 18–22 October 2021; pp. 183–198. [Google Scholar]
  58. Tian, Y.; Wang, N.; Zhang, Z.; Zou, W.; Zou, G.; Tian, L.; Li, W. Joint Client Selection and Bandwidth Allocation Algorithm for Time-Sensitive Federated Learning over Wireless Networks. In Proceedings of the 2024 IEEE 99th Vehicular Technology Conference (VTC2024-Spring), Singapore, 24–27 June 2024; pp. 1–6. [Google Scholar]
  59. Yu, X.; Gao, Z.; Xiong, Z.; Zhao, C.; Yang, Y. DDPG-AdaptConfig: A deep reinforcement learning framework for adaptive device selection and training configuration in heterogeneity federated learning. Future Gener. Comput. Syst. 2025, 163, 107528. [Google Scholar] [CrossRef]
Figure 1. A representative HFL deployment in vehicular networks. The common strategy of deploying tasks along key transportation arteries, such as major highways or the illustrated urban road, naturally establishes a linear topology of ENs for efficient hierarchical aggregation.
Figure 2. Overview of the VHFL system.
Figure 3. Test accuracy of HFL tasks with different (M, N) values under the configuration of edge spacing d_edge = 300 m, data partitioning parameter ϕ = 5, data size D_i = 200, arrival rate λ = 1.0 veh/s, and speed characteristics (v̄, σ_v) = (20, 3) m/s.
Figure 4. ADC, time delay, energy consumption, and training cost of HFL tasks with different (M, N) values under the configuration of edge spacing d_edge = 300 m, data partitioning parameter ϕ = 5, data size D_i = 200, arrival rate λ = 1.0 veh/s, and speed characteristics (v̄, σ_v) = (20, 3) m/s. (The downward arrow indicates the corresponding minimum value.)
Figure 5. Overview of the environment design for AptEVS.
Figure 6. Overall performance comparison of HFL tasks under the configuration of edge spacing d_edge = 300 m, data partitioning parameter ψ = 0.4, arrival rate λ = 1.0 veh/s, and speed characteristics (v̄, σ_v) = (20, 3) m/s. (a1–c1) Test accuracy. (a2–c2) Long-term training cost (objective value) and ADC. (a3–c3) Time delay and energy consumption.
Figure 7. Ablation study on reward function design under the configuration of edge spacing d_edge = 300 m, data partitioning parameter ψ = 0.4, arrival rate λ = 1.0 veh/s, and speed characteristics (v̄, σ_v) = (20, 3) m/s. (a1–c1) Test accuracy. (a2–c2) Long-term training cost (objective value) and ADC. (a3–c3) Time delay and energy consumption.
Figure 8. Reward progression over edge rounds in AptEVS under the following conditions: edge spacing d_edge = 300 m, data partitioning parameter ψ = 0.4, arrival rate λ = 1.0 veh/s, and speed characteristics (v̄, σ_v) = (20, 3) m/s.
Figure 9. Ablation study on the phase-aware mechanism under the configuration of edge spacing d_edge = 300 m, data partitioning parameter ψ = 0.4, arrival rate λ = 1.0 veh/s, and speed characteristics (v̄, σ_v) = (20, 3) m/s. (a1–c1) Test accuracy. (a2–c2) Long-term training cost (objective value) and ADC. (a3–c3) Time delay and energy consumption.
Table 1. Summary of key notations.
Symbol | Definition
N_m^max | maximum number of vehicles in EN m
𝒩 | set of vehicles
M_max | maximum number of ENs
ℳ | set of ENs managed by the CS
m | index of an EN
i | index of a vehicle
D_i | local data size of vehicle i
𝒟_i | local dataset of vehicle i
κ_1 | cloud interval
κ_2 | edge interval
(r, τ) | time slice of edge round τ in cloud round r
R | total number of cloud rounds
ℛ | set of cloud rounds
ℳ_r | set of ENs scheduled by the CS
M_r | edge scheduling number
N_r | vehicle scheduling number per EN
𝒩_m^c(r, τ) | vehicles connected to EN m at time slice (r, τ)
𝒩_m^s(r, τ) | vehicles scheduled by EN m at time slice (r, τ)
w | training model of vehicles
w^r | cloud model in the r-th cloud round
w_m^{r,τ} | edge model at time slice (r, τ)
w_i^{r,τ}(j) | local model of vehicle i after the j-th local iteration
L | total number of local iterations
b | batch size
η | learning rate
d_edge | edge spacing
λ | vehicle arrival rate
ρ(v_i) | probability density function of v_i
v_i | speed of vehicle i
v̄ | mean speed of vehicles
σ_v | standard deviation of vehicle speeds
A^r | model accuracy after the r-th cloud round
A_target | target accuracy
f_i(r, τ) | actual computing frequency of vehicle i
C_i | computational cycles per sample for vehicle i
c_i | effective capacitance factor of the computing chipset
P_i(r, τ) | actual transmit power of vehicle i
Z | model size
b_i(r, τ) | number of RBs allocated to vehicle i at (r, τ)
B | bandwidth of one RB
Table 2. Simulation parameters for HFL.
Parameter | Value
Computational frequency of vehicle for the three HFL tasks, f_i^max(r, τ) | 0.5/1/4 GHz
Transmit power of vehicle, P_i^max(r, τ) | [20, 30] dBm
Bandwidth of one RB for the three HFL tasks, B | 180/180/360 kHz
Number of RBs at each EN, b_m | 20
Carrier frequency, f_c | 3.5 GHz
Effective capacitance factor of the computing chipset on vehicle i, c_i | 10^−27
Noise power spectral density, N_0 | −174 dBm/MHz
Edge spacing, d_edge | 300 m
Vehicle arrival rate, λ | 1 veh/s
Speed characteristics of vehicles in the training area, (v̄, σ_v) | (20, 3) m/s
Uplink transmission rate of the ENs, R_e | 50 Mbps
Cloud interval, κ_1 | 5
Edge interval, κ_2 | 5
Learning rate, η | 0.01
Decay rate, ζ_η | 0.995
Batch size, b | 20
Convergence threshold for the three HFL tasks, δ | 0.01/0.1/0.1
Data partitioning parameter, ϕ | 5
Data size per vehicle, D_i | 200
Weighting factors for ADC/TDC/ECC, α_1/α_2/α_3 | 0.1/0.5/0.4
Scaling factor of ADC, Ξ | 100
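The computation-related entries above map onto the standard per-iteration delay and energy models t_cmp = C_i · D_i / f_i and E_cmp = c_i · C_i · D_i · f_i². The following sketch evaluates them for the MNIST task, assuming the paper's delay and energy equations take these standard forms:

```python
# Per-iteration local computation delay and energy under the standard models
# t = C_i * D_i / f_i and E = c_i * C_i * D_i * f_i^2, using the MNIST values
# from Tables 2 and 3. Assumes the paper's equations take these standard forms.
C_i = 102_412        # cycles per sample (MNIST)
D_i = 200            # samples per vehicle
f_i = 0.5e9          # computing frequency for the MNIST task: 0.5 GHz
c_i = 1e-27          # effective capacitance factor

t_cmp = C_i * D_i / f_i             # seconds per local iteration (~41 ms)
e_cmp = c_i * C_i * D_i * f_i ** 2  # joules per local iteration (~5.1 mJ)
```

These magnitudes illustrate why vehicle-side computation, not the DQN scheduler, dominates the per-round delay and energy budget.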
Table 3. Parameter setup of different HFL tasks.
Task Class | Computational Complexity (cycles) | Communication Load (bits) | Target Accuracy (%)
MNIST | 102,412 | 13,492,544 | 98
FMNIST | 867,804 | 13,504,832 | 85
CIFAR-10 | 6,291,652 | 83,333,632 | 75
Table 4. Impact of environmental dynamics on the optimal EVS configuration.
Variable | Value | Optimal EVS | Cloud Rounds | ADC | TDC (s) | ECC (J) | LTTC
d_edge (m) | 200 | (2, 6) | 58 | 17.2 | 438.5 | 4732.7 | 14,575.2
d_edge (m) | 300 | (4, 4) | 53 | 16.0 | 421.8 | 7663.9 | 15,213.7
d_edge (m) | 400 | (6, 2) | 52 | 18.1 | 368.1 | 8094.2 | 14,257.4
ϕ | 5 | (4, 4) | 53 | 16.0 | 421.8 | 7663.9 | 15,213.7
ϕ | 2 | (8, 2) | 65 | 32.6 | 457.6 | 13,474.3 | 20,089.3
ψ | 0.6 | (6, 2) | 121 | 33.7 | 981.7 | 17,097.2 | 31,999.4
ψ | 0.4 | (8, 2) | 135 | 37.1 | 1161.1 | 25,348.6 | 40,517.0
λ (veh/s) | 0.6 | (4, 2) | 65 | 22.1 | 452.0 | 6888.8 | 16,264.4
λ (veh/s) | 1.0 | (4, 4) | 53 | 16.0 | 421.8 | 7663.9 | 15,213.7
λ (veh/s) | 1.4 | (4, 2) | 55 | 15.5 | 436.6 | 7937.2 | 15,636.3
(v̄, σ_v) (m/s) | (10, 3) | (2, 4) | 56 | 19.1 | 436.3 | 4036.5 | 14,438.7
(v̄, σ_v) (m/s) | (20, 1) | (8, 2) | 52 | 15.9 | 365.6 | 10,784.7 | 15,046.4
(v̄, σ_v) (m/s) | (20, 3) | (4, 4) | 53 | 16.0 | 421.8 | 7663.9 | 15,213.7
(v̄, σ_v) (m/s) | (20, 5) | (2, 2) | 68 | 28.3 | 467.5 | 3520.0 | 15,926.8
(v̄, σ_v) (m/s) | (30, 3) | (2, 2) | 68 | 29.5 | 466.8 | 3562.4 | 16,046.3
Table 5. Simulation parameters for AptEVS.
Parameter | Value
Learning rate of DQN, α | 0.01
Discount factor of DQN, γ | 0.90
Initial epsilon of the epsilon-greedy strategy, ϵ_0 | 0.95
Decay coefficient of the epsilon-greedy strategy, ζ_ϵ | 0.98
Minimum epsilon of the epsilon-greedy strategy, ϵ_min | 0.01
Experience replay buffer size, V | 32
Mini-batch size of DQN, b_DQN | 16
Update frequency of target network, J | 4
Minimum time step, t_min | 40
ADC factor, Ξ | 100
Update cycle for short-term target accuracy, R_cycle | 4
Accuracy improvement threshold for ITP, δ | 5%
Counter threshold for ITP, K_1 | 3
Counter threshold for the DQN training phase, K_2 | 5
Termination accuracy for ITP, A_ITP | 60%
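Under the ϵ-greedy parameters above, exploration decays geometrically toward its floor. The sketch below shows the resulting schedule; the assumption that ϵ is decayed once per DQN update (rather than per round) is ours:

```python
# Epsilon-greedy schedule from Table 5: eps_k = max(eps_min, eps_0 * zeta^k).
# Assumes (hypothetically) that epsilon decays once per DQN update.
EPS_0, ZETA, EPS_MIN = 0.95, 0.98, 0.01

def epsilon(k):
    """Exploration probability after k decay steps."""
    return max(EPS_MIN, EPS_0 * ZETA ** k)

# Early updates explore almost uniformly; after a few hundred updates the
# policy is nearly greedy, matching the phase-aware design of AptEVS.
schedule = [epsilon(k) for k in (0, 50, 150, 300)]
```

The geometric decay means the agent spends its exploration budget in the early MTP rounds, where the phase detector has just handed control to the DQN.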
Table 6. Robustness evaluation of the proposed algorithm under systematic variations in key environmental parameters against a default configuration: edge spacing d_edge = 300 m, data partitioning parameter ϕ = 5, arrival rate λ = 1.0 veh/s, and speed characteristics (v̄, σ_v) = (20, 3) m/s. The best results are highlighted in bold.
Variable | Value | Method | Edge Rounds | ADC | TDC (s) | ECC (J) | Long-Term Training Cost
d_edge (m) | 200 | AptEVS (Ours) | 428 ± 10 | 53.1 ± 6.1 | 521.9 ± 11.2 | 8677.0 ± 1992.0 | 7895.8 ± 798.7
d_edge (m) | 200 | FpEVS | 403 ± 8 | 20.0 ± 0.2 | 486.6 ± 10.6 | 10,918.3 ± 219.5 | 7318.1 ± 148.3
d_edge (m) | 200 | PahEVS | 352 ± 6 | 16.9 ± 0.2 | 512.2 ± 10.9 | 17,423.3 ± 458.9 | 10,686.0 ± 375.3
d_edge (m) | 400 | AptEVS (Ours) | 395 ± 19 | 68.7 ± 8.6 | 523.8 ± 21.8 | 6663.3 ± 709.3 | 7313.5 ± 113.8
d_edge (m) | 400 | FpEVS | 387 ± 8 | 17.4 ± 0.2 | 624.6 ± 12.3 | 20,026.9 ± 315.3 | 13,481.3 ± 229.5
d_edge (m) | 400 | PahEVS | 345 ± 11 | 15.5 ± 0.2 | 536.6 ± 18.8 | 17,663.0 ± 799.6 | 10,739.0 ± 310.2
λ (veh/s) | 0.6 | AptEVS (Ours) | 483 ± 9 | 71.4 ± 5.4 | 597.3 ± 23.4 | 9222.9 ± 1735.2 | 8635.2 ± 1113.4
λ (veh/s) | 0.6 | FpEVS | 425 ± 22 | 21.9 ± 0.4 | 506.3 ± 23.8 | 10,281.2 ± 383.3 | 6350.7 ± 252.1
λ (veh/s) | 0.6 | PahEVS | 402 ± 9 | 18.3 ± 0.1 | 619.5 ± 16.0 | 21,030.0 ± 695.3 | 12,761.4 ± 381.3
λ (veh/s) | 1.4 | AptEVS (Ours) | 357 ± 19 | 49.8 ± 4.3 | 439.8 ± 21.4 | 5200.0 ± 977.5 | 5832.4 ± 455.1
λ (veh/s) | 1.4 | FpEVS | 277 ± 6 | 13.4 ± 0.0 | 461.9 ± 10.0 | 16,362.1 ± 391.7 | 10,761.3 ± 280.8
λ (veh/s) | 1.4 | PahEVS | 320 ± 0 | 14.9 ± 0.2 | 478.3 ± 0.7 | 15,710.2 ± 22.7 | 9499.8 ± 104.8
(v̄, σ_v) (m/s) | (10, 3) | AptEVS (Ours) | 427 ± 18 | 75.8 ± 15.1 | 543.0 ± 17.6 | 8718.7 ± 2412.4 | 8382.5 ± 1153.9
(v̄, σ_v) (m/s) | (10, 3) | FpEVS | 383 ± 5 | 16.8 ± 0.2 | 742.3 ± 11.9 | 33,054.9 ± 498.5 | 22,237.3 ± 508.0
(v̄, σ_v) (m/s) | (10, 3) | PahEVS | 335 ± 4 | 15.6 ± 0.1 | 503.0 ± 8.2 | 16,784.4 ± 287.6 | 10,263.2 ± 168.4
(v̄, σ_v) (m/s) | (20, 1) | AptEVS (Ours) | 373 ± 6 | 64.7 ± 8.5 | 465.9 ± 27.3 | 5774.5 ± 484.4 | 6500.9 ± 449.4
(v̄, σ_v) (m/s) | (20, 1) | FpEVS | 283 ± 2 | 14.8 ± 0.3 | 393.0 ± 4.6 | 11,331.6 ± 147.2 | 7352.6 ± 85.9
(v̄, σ_v) (m/s) | (20, 1) | PahEVS | 345 ± 4 | 15.4 ± 0.1 | 519.7 ± 5.7 | 17,656.7 ± 378.6 | 10,718.1 ± 189.2
(v̄, σ_v) (m/s) | (20, 3) | AptEVS (Ours) | 391 ± 29 | 57.5 ± 0.8 | 475.0 ± 55.0 | 5100.0 ± 1100.6 | 6163.7 ± 442.5
(v̄, σ_v) (m/s) | (20, 3) | FpEVS | 298 ± 5 | 14.6 ± 0.1 | 416.7 ± 6.0 | 12,131.6 ± 197.3 | 7825.1 ± 116.9
(v̄, σ_v) (m/s) | (20, 3) | PahEVS | 335 ± 12 | 15.9 ± 0.2 | 502.6 ± 22.4 | 16,843.9 ± 825.7 | 10,289.0 ± 478.6
(v̄, σ_v) (m/s) | (20, 5) | AptEVS (Ours) | 411 ± 46 | 70.0 ± 5.3 | 521.3 ± 52.8 | 8106.6 ± 1333.1 | 7813.3 ± 800.0
(v̄, σ_v) (m/s) | (20, 5) | FpEVS | 367 ± 2 | 17.9 ± 0.1 | 511.2 ± 3.9 | 14,975.8 ± 99.2 | 9632.3 ± 61.3
(v̄, σ_v) (m/s) | (20, 5) | PahEVS | 336.7 ± 8 | 15.9 ± 0.2 | 507.0 ± 15.2 | 16,897.1 ± 719.5 | 10,339.3 ± 383.6
(v̄, σ_v) (m/s) | (30, 3) | AptEVS (Ours) | 369 ± 26 | 60.2 ± 5.5 | 458.6 ± 36.2 | 5247.8 ± 1169.6 | 6155.5 ± 478.5
(v̄, σ_v) (m/s) | (30, 3) | FpEVS | 310 ± 15 | 16.9 ± 0.2 | 380.6 ± 17.8 | 8248.9 ± 395.8 | 5626.4 ± 258.4
(v̄, σ_v) (m/s) | (30, 3) | PahEVS | 345 ± 7 | 16.8 ± 0.2 | 518.7 ± 11.9 | 17,108.7 ± 442.6 | 10,513.5 ± 256.7
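Relative cost reductions can be recomputed directly from the means reported in Table 6. For example, in the (v̄, σ_v) = (10, 3) m/s setting (a sketch over the reported means only, ignoring the ± deviations):

```python
# Relative long-term training cost reduction, recomputed from the reported
# means in Table 6 for (v_mean, sigma_v) = (10, 3) m/s (deviations ignored).
apt = 8382.5       # AptEVS (Ours)
fp = 22237.3       # FpEVS
pah = 10263.2      # PahEVS

baseline = max(fp, pah)                          # higher-cost baseline (FpEVS)
reduction = 100 * (baseline - apt) / baseline    # ~62.3% in this setting
```

Reductions vary by setting; the headline figure of up to 66.0% quoted in the abstract corresponds to the most favorable configuration.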