1. Introduction
The power system is rapidly shifting from a small number of controllable generators toward a large population of Distributed Energy Resources (DERs), including behind-the-meter photovoltaic generation, battery energy storage systems (BESSs), and flexible loads [1]. While this transition supports decarbonization, it also increases operational uncertainty as renewable intermittency and stochastic demand tighten real-time balancing requirements [1]. In this context, ancillary services such as frequency regulation and voltage support become increasingly important, and fast-response capabilities are progressively supplied by inverter-based resources and storage [1]. BESSs are well suited to provide high-quality ancillary services, but frequent cycling accelerates degradation and erodes long-horizon value [2]. Since individual DERs are typically too small to participate directly in wholesale or ancillary-service markets, aggregation into a Virtual Power Plant (VPP) has emerged as a dominant architecture for market participation and coordinated operation [3,4,5,6]. However, scaling a VPP exposes a persistent trilemma: the operator seeks near-optimal economic performance, DER owners demand raw-data privacy, and the overall coordination layer must remain secure and auditable under realistic cyber threats [3,7,8].
Existing coordination approaches address only parts of this trilemma. Centralized VPP control can achieve strong economic performance by optimizing the fleet with full observability, but it requires collecting fine-grained operational data and concentrates risk in a single point of failure and compromise [3,5,9]. Federated Learning (FL) improves privacy by keeping raw data local and sharing only model updates, enabling collaborative learning without centralized data collection [10,11]. However, standard FL introduces a fragile trust assumption that is rarely made explicit in VPP settings: a central aggregator must be trusted to correctly and fairly combine updates and to publish a fresh, protocol-compliant global model every round [10]. This trust gap is especially problematic in reinforcement-learning-based control, where the training signal is noisy and there is no ground-truth label to audit. A compromised aggregator can silently bias the learning process by selectively dropping updates, manipulating weights, replaying stale models, or injecting a replacement model, leading to degraded profit, reduced compliance, and potentially unsafe dispatch behavior [12].
This paper proposes a trustless coordination mechanism for VPP learning by integrating zero-knowledge proofs into federated reinforcement learning. We develop a zero-knowledge federated reinforcement learning (zk-FRL) framework in which each DER runs a local deep reinforcement learning agent to solve a multi-objective dispatch problem balancing ancillary-service revenue against BESS degradation under operational and grid constraints, while collaboration is achieved through a federated training loop whose aggregation step is externally verifiable [11,13]. In each round, participating agents bind the participant set using signed receipts and publish cryptographic commitments to their updates. The aggregator computes the new global parameters and produces a succinct zk-SNARK, proving that the published model equals the agreed aggregation rule applied to the receipt-bound set of committed updates under a fixed-point encoding with range constraints [14,15,16]. Each DER independently verifies the proof and adopts the global model only if verification succeeds, eliminating the need to trust the aggregator for aggregation integrity while avoiding the centralization of raw DER operational data and local experience trajectories; the framework provides verifiable aggregation rather than concealing model updates from the aggregator [14,16].
The novelty of the proposed zk-FRL framework is not merely the addition of a generic zk-SNARK layer to federated reinforcement learning. At the algorithmic level, we define a receipt-bound and commitment-consistent aggregation protocol that allows each DER to accept or reject the published global model based on locally verifiable aggregation correctness. At the systems level, we integrate this protocol into a two-time-scale power–market–learning workflow, in which physical control remains continuous while model synchronization is externally auditable. At the power-engineering level, we tailor the learning objective and verification target to VPP coordination, where economic return, battery degradation, and grid-feasibility constraints must be balanced under aggregator-side cyber risk.
Our contributions are summarized as follows:
We propose a zk-FRL architecture for large-scale VPP coordination that removes the implicit trust requirement on the FL aggregator by making the aggregation step cryptographically verifiable while preserving DER raw-data privacy (i.e., without centralizing fine-grained measurements and local experience trajectories).
We formalize an adversarial model for malicious aggregator behavior in federated reinforcement learning, and define verifiable security properties for aggregation correctness and update inclusion that make selective dropping, weight manipulation, replay, and model injection externally detectable [12].
We present an implementable zk-SNARK-based verification protocol for neural-network aggregation using fixed-point encoding and range constraints, and provide a complexity characterization that explains how proof cost scales with model dimension and fleet size [14,15,16].
We validate the framework in a power–cyber co-simulation on the IEEE 33-bus feeder [17] with ancillary-service signals, benchmarking against centralized DRL, independent DRL, and standard federated RL under both benign and adversarial aggregators.
The remainder of this paper is organized as follows.
Section 2 reviews multi-objective BESS operation, DER aggregation architectures, and secure distributed learning.
Section 3 presents the VPP system model and local DER dynamics.
Section 4 details the proposed zk-FRL protocol, its security properties, and circuit construction.
Section 5 describes the co-simulation platform and baselines.
Section 6 reports performance, robustness under a suite of aggregator attacks, and cryptographic overhead.
Section 7 concludes and outlines future directions.
3. System Model
This section formalizes the cyber-physical environment considered in this work. We model a large-scale Virtual Power Plant (VPP) as a multi-agent system that interacts with (i) the physical distribution grid, (ii) ancillary-service and energy markets, and (iii) a communication and learning layer that enables coordination without centralizing raw DER operational data, while making the aggregation step externally verifiable. The model is organized from architecture and time scales to device dynamics, market settlement, grid feasibility, and finally the cryptographic objects used later by the zk-FRL protocol.
3.1. System Architecture and Time Scales
We consider a three-tiered architecture consisting of DER agents, a VPP aggregator, and the grid/market operator:
DER Agents: Intelligent software agents co-located with physical DER assets (e.g., BESS, PV, controllable loads). Each agent performs local control and maintains private operational data, including measurements, forecasts, and historical trajectories. Agents train local deep reinforcement learning (DRL) models on private experience and only communicate model updates and cryptographic commitments.
VPP Aggregator: A coordination service that orchestrates federated training rounds. The aggregator distributes the current global model, collects local model updates, computes a candidate global model, and generates a zero-knowledge proof certifying that the aggregation computation follows the prescribed rule over the correct participant set. The aggregator is not trusted by assumption and may be malicious as formalized later; hence, its actions must be verifiable.
Grid/Market Operator: The external entity (ISO/DSO) that issues ancillary-service signals (e.g., regulation commands) and settles payments based on the VPP’s aggregate performance. The distribution grid enforces physical constraints such as voltage limits and line thermal limits.
We distinguish two coupled time scales:
Control interval (seconds to minutes): each agent executes a dispatch action and receives market and grid signals. For dimensional consistency with power $P$ in kW and energy capacity $C$ in kWh, we use the hour-equivalent step $\Delta t$ (in hours) in all energy-balance and settlement equations.
Federated learning round (multiple control intervals): agents collect local experience over a horizon and periodically participate in a federated reinforcement learning round to update shared policy parameters. This separation is important because cryptographic proving is performed per learning round rather than per control step.
3.2. Participants, Signals, and Information Structure
Let $\mathcal{N} = \{1, \ldots, N\}$ denote the set of DER agents. At control time $t$, agent $i$ observes a local measurement vector
$$o_{i,t} = \big(\mathrm{SoC}_{i,t},\; P^{\mathrm{PV}}_{i,t},\; P^{\mathrm{L}}_{i,t},\; f_t,\; \lambda^{\mathrm{E}}_t,\; \lambda^{\mathrm{AS}}_t,\; \xi_t\big), \quad (1)$$
where $\mathrm{SoC}_{i,t}$ is the state of charge, $P^{\mathrm{PV}}_{i,t}$ and $P^{\mathrm{L}}_{i,t}$ are local PV and load, $f_t$ is a grid frequency (or an ancillary-service tracking reference), $\lambda^{\mathrm{E}}_t$ and $\lambda^{\mathrm{AS}}_t$ are energy and ancillary-service prices, and $\xi_t$ captures additional operator signals or calendar features (e.g., time of day, market interval index). Each agent does not observe other agents' private states. The VPP-level objective depends on the aggregate response
$$P^{\mathrm{VPP}}_t = \sum_{i \in \mathcal{N}} P_{i,t}, \quad (2)$$
which is the quantity evaluated by the market operator for ancillary-service performance.
3.3. BESS Dynamics and Operational Constraints
Each agent $i$ controls a BESS with energy capacity $C_i$ (kWh), power limit $\overline{P}_i$ (kW), and efficiency parameters $\eta^{\mathrm{ch}}_i, \eta^{\mathrm{dis}}_i \in (0, 1]$. The control action is the charge/discharge setpoint
$$P_{i,t} \in [-\overline{P}_i, \overline{P}_i], \quad (3)$$
where positive values denote charging and negative values denote discharging. Define $P^{+}_{i,t} = \max(P_{i,t}, 0)$ and $P^{-}_{i,t} = \max(-P_{i,t}, 0)$. The state of charge evolves according to a discrete-time energy balance:
$$\mathrm{SoC}_{i,t+1} = \mathrm{SoC}_{i,t} + \frac{\Delta t}{C_i} \left( \eta^{\mathrm{ch}}_i P^{+}_{i,t} - \frac{P^{-}_{i,t}}{\eta^{\mathrm{dis}}_i} \right), \quad (4)$$
subject to
$$\underline{\mathrm{SoC}}_i \le \mathrm{SoC}_{i,t} \le \overline{\mathrm{SoC}}_i, \qquad |P_{i,t}| \le \overline{P}_i. \quad (5)$$
These constraints ensure feasible operation and implicitly define an admissible action set that depends on $\mathrm{SoC}_{i,t}$.
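The energy balance (4) and the feasibility constraints (5) can be sketched as a small simulation step. The function and parameter names below are illustrative (not the paper's notation), and the clipping rule is one plausible way to realize the SoC-dependent admissible action set:

```python
# Sketch of the BESS dynamics in Section 3.3. Positive setpoints charge the
# battery, negative setpoints discharge; dt_h is the hour-equivalent step.

def clip_action(p_set, p_max, soc, soc_min, soc_max, cap_kwh,
                eta_ch, eta_dis, dt_h=1.0):
    """Project a charge(+)/discharge(-) setpoint onto the admissible set."""
    p = max(-p_max, min(p_max, p_set))  # enforce the power limit
    # One-step headroom implied by the SoC bounds (illustrative rule)
    max_charge = (soc_max - soc) * cap_kwh / (eta_ch * dt_h)
    max_discharge = (soc - soc_min) * cap_kwh * eta_dis / dt_h
    return max(-max_discharge, min(max_charge, p))

def soc_next(soc, p, cap_kwh, eta_ch, eta_dis, dt_h=1.0):
    """Discrete-time energy balance of Eq. (4)."""
    p_ch, p_dis = max(p, 0.0), max(-p, 0.0)
    return soc + dt_h * (eta_ch * p_ch - p_dis / eta_dis) / cap_kwh
```

Clipping before integration guarantees that the resulting SoC trajectory never leaves its bounds, which is why the admissible set depends on the current state of charge.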
3.4. Ancillary-Service Signal, Tracking, and Settlement
The VPP participates in an ancillary-service product such as frequency regulation. Let $r_t \in [-1, 1]$ denote the normalized regulation command issued by the operator and let $R_t$ denote the VPP regulation capacity committed for the settlement interval. The commanded aggregate power is
$$P^{\mathrm{cmd}}_t = r_t R_t. \quad (6)$$
The tracking error is modeled as
$$e_t = P^{\mathrm{VPP}}_t - r_t R_t. \quad (7)$$
To keep the model general across markets, we represent ancillary-service revenue as a function of committed capacity and a performance score:
$$\Pi^{\mathrm{AS}}_t = \lambda^{\mathrm{AS}}_t R_t \, \phi(e_t), \quad (8)$$
where $\phi(\cdot)$ is a non-increasing performance mapping (e.g., a score that penalizes tracking error or non-compliance). Energy-market revenue (or cost) due to charging/discharging is represented as
$$\Pi^{\mathrm{E}}_{i,t} = -\lambda^{\mathrm{E}}_t P_{i,t} \Delta t, \quad (9)$$
where the sign convention implies that discharging ($P_{i,t} < 0$) yields positive revenue and charging yields a cost. The total market-facing revenue component can be written as $\Pi_t = \Pi^{\mathrm{AS}}_t + \sum_{i \in \mathcal{N}} \Pi^{\mathrm{E}}_{i,t}$, noting that additional market terms (e.g., mileage payments) can be absorbed into (8) without changing the learning and verification structure.
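The settlement structure above can be illustrated with a minimal sketch. The linear performance score and the error tolerance `err_tol_kw` are hypothetical placeholders for the market-specific mapping $\phi(\cdot)$; prices are assumed to be quoted in USD/MWh and USD/MW:

```python
# Illustrative settlement sketch for Section 3.4 (assumed score shape).

def as_revenue(lambda_as, capacity_kw, tracking_err_kw, err_tol_kw=50.0):
    """Capacity payment scaled by a non-increasing performance score phi."""
    score = max(0.0, 1.0 - abs(tracking_err_kw) / err_tol_kw)
    return lambda_as * capacity_kw * score

def energy_revenue(lambda_e, p_kw, dt_h=1.0):
    """Discharging (p < 0) earns revenue; charging (p > 0) incurs a cost."""
    return -lambda_e * p_kw * dt_h / 1000.0  # kW·h -> MWh for a USD/MWh price
```

Any concrete market rule (e.g., PJM-style performance scoring) would replace the placeholder score without changing the rest of the model.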
3.5. Battery Degradation Cost Model
Frequent cycling and high-rate operation accelerate BESS degradation. We model degradation as a monetized cost term that penalizes throughput and stress. To ensure reproducibility while retaining flexibility, we adopt a generic semi-empirical form and treat its coefficients as part of the experimental specification:
$$C^{\mathrm{deg}}_{i,t} = \alpha_i \, |P_{i,t}| \, \Delta t + \beta_i \big( \mathrm{SoC}_{i,t+1} - \mathrm{SoC}_{i,t} \big)^2, \quad (10)$$
where $\alpha_i \ge 0$ and $\beta_i \ge 0$ reflect asset-specific degradation sensitivity. The first term captures energy-throughput-induced wear, while the second term penalizes aggressive changes in the state of charge that correlate with deeper or faster cycling. This representation is compatible with more detailed cycle-counting or rainflow-based models; when such models are used, they replace (10) as the definition of $C^{\mathrm{deg}}_{i,t}$ without affecting the remainder of the framework.
The degradation model adopted here is a semi-empirical proxy designed for dispatch-level decision making rather than detailed lifetime prediction. This abstraction is appropriate for the present study because the control objective is short-horizon ancillary-service coordination, whereas temperature-dependent aging, SOH evolution, and detailed cycle-aging effects mainly influence the effective degradation coefficients on slower horizons.
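The two-term proxy in (10) amounts to a one-line cost function; the sketch below uses illustrative argument names for the throughput and stress coefficients:

```python
# Semi-empirical degradation proxy of Eq. (10): throughput wear plus
# SoC-excursion stress. alpha/beta are asset-specific sensitivities.

def degradation_cost(p_kw, soc_now, soc_next, alpha, beta, dt_h=1.0):
    throughput = alpha * abs(p_kw) * dt_h          # energy-throughput wear
    stress = beta * (soc_next - soc_now) ** 2      # cycling-depth/rate stress
    return throughput + stress
```

A rainflow-based cycle counter could be substituted for this function without changing how the cost enters the reward.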
3.6. Distribution-Grid Feasibility and Grid Penalties
The VPP operates over a distribution network with bus set $\mathcal{B}$ and line set $\mathcal{L}$. Let $V_{b,t}$ denote the voltage magnitude at bus $b$ and let $S_{\ell,t}$ denote the apparent power flow on line $\ell$. At each control step, the power system simulator computes $(V_{b,t}, S_{\ell,t})$ given net injections, including DER actions. Grid feasibility is characterized by operational limits
$$\underline{V} \le V_{b,t} \le \overline{V}, \qquad S_{\ell,t} \le \overline{S}_\ell. \quad (11)$$
Because the control policy is learned from experience and may explore, we incorporate grid feasibility into the learning objective through a penalty term that discourages violations:
$$C^{\mathrm{grid}}_t = \kappa_V \sum_{b \in \mathcal{B}} \big[ \max(0, V_{b,t} - \overline{V}) + \max(0, \underline{V} - V_{b,t}) \big] + \kappa_S \sum_{\ell \in \mathcal{L}} \max(0, S_{\ell,t} - \overline{S}_\ell), \quad (12)$$
with coefficients $\kappa_V, \kappa_S > 0$. This penalty couples the physical layer to the agent learning problem without requiring centralized access to all private states.
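The hinge-style penalty in (12) can be sketched directly from simulator outputs. The default band and the coefficient values below are illustrative assumptions, not the paper's settings:

```python
# Grid-violation penalty of Eq. (12): voltage-band and line-loading hinges.

def grid_penalty(voltages, flows, s_max=None, v_min=0.95, v_max=1.05,
                 k_v=100.0, k_s=1.0):
    """Sum hinge violations over buses and lines for one control step."""
    pv = sum(max(0.0, v - v_max) + max(0.0, v_min - v) for v in voltages)
    ps = sum(max(0.0, s - cap) for s, cap in zip(flows, s_max or []))
    return k_v * pv + k_s * ps
```

Because the penalty is zero inside the feasible region, it shapes exploration without distorting the reward of constraint-satisfying dispatch.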
3.7. Learning Objects and Cryptographic Commitments
Each agent $i$ maintains local DRL model parameters $\theta_i$ (e.g., actor and critic networks) and the VPP maintains a global model $\theta^{g}$. In each federated round $t$, the global model is broadcast, agents perform local updates using private experience, and each agent produces a model update
$$\Delta\theta^{t}_i = \theta^{t}_i - \theta^{g,t}. \quad (13)$$
To enable the later verification of correct aggregation while preserving privacy, we associate each update with a cryptographic commitment
$$c^{t}_i = \mathsf{Com}\big(\Delta\theta^{t}_i\big), \quad (14)$$
where $\mathsf{Com}(\cdot)$ denotes a collision-resistant commitment (e.g., a hash of a canonical serialization or a Merkle root when updates are chunked). At the beginning of each round, participating agents bind the participant set by publishing signed receipts $\sigma^{t}_i$ over a round identifier $\mathrm{rid}_t$, and the aggregator computes a public digest
$$h_t = H\big(\{\sigma^{t}_i\}_{i \in \mathcal{P}_t}\big), \quad (15)$$
where $\mathcal{P}_t$ is the set of participants in round $t$. These objects are public inputs to the zero-knowledge verification layer in later sections and make update dropping or weight manipulation externally detectable.
3.8. Notation
Table 1 summarizes key symbols used throughout the system model and the zk-FRL formulation.
4. The zk-FRL Framework
This section presents the proposed zero-knowledge federated reinforcement learning (zk-FRL) framework. We first define the local learning problem solved by each DER agent as a multi-objective Markov decision process (MDP) aligned with the system model in
Section 3. We then describe the federated reinforcement learning loop used to train a shared policy without centralizing raw data. Finally, we introduce the zero-knowledge verification layer that removes the need to trust the aggregator by making the aggregation step publicly verifiable while keeping raw measurements and local experience private.
Figure 1 summarizes the main stages of the proposed zk-FRL protocol, from round enrollment and local update commitment to receipt-bound aggregation, proof generation, and local proof verification.
4.1. Local Agent Learning Problem
Each DER agent interacts with its local environment at the control interval and learns a policy that maps local observations to a BESS power setpoint. We formulate the local control problem as an MDP with discount factor $\gamma \in (0, 1)$.
At time $t$, agent $i$ observes a local state vector consistent with (1). For compactness, we write
$$s_{i,t} = o_{i,t} = \big(\mathrm{SoC}_{i,t},\; P^{\mathrm{PV}}_{i,t},\; P^{\mathrm{L}}_{i,t},\; f_t,\; \lambda^{\mathrm{E}}_t,\; \lambda^{\mathrm{AS}}_t,\; \xi_t\big). \quad (16)$$
This representation captures the internal BESS state and the external signals required for market participation and constraint-aware control.
The action is the BESS charge or discharge power setpoint
$$a_{i,t} = P_{i,t} \in [-\overline{P}_i, \overline{P}_i], \quad (17)$$
with feasibility further constrained by (4) and (5). A continuous-control DRL algorithm such as Soft Actor–Critic can be used to learn the policy, but the zk-FRL protocol does not depend on a specific DRL choice.
The local learning objective is aligned with VPP profit and operational feasibility. We define the instantaneous reward as a weighted combination of revenue, degradation cost, and penalties:
$$r_{i,t} = w_{R} \, \Pi_{i,t} - w_{D} \, C^{\mathrm{deg}}_{i,t} - w_{P} \, C^{\mathrm{pen}}_{i,t}, \quad (18)$$
where $\Pi_{i,t}$ is the market-facing revenue component (e.g., (8) and (9)), $C^{\mathrm{deg}}_{i,t}$ is the degradation cost (e.g., (10)), and $C^{\mathrm{pen}}_{i,t}$ aggregates penalties that enforce feasibility and performance requirements. The penalty term is instantiated as
$$C^{\mathrm{pen}}_{i,t} = \kappa_e \, |e_t| + C^{\mathrm{grid}}_t, \quad (19)$$
where $e_t$ is the ancillary-service tracking error defined in (7), and $C^{\mathrm{grid}}_t$ is the grid-violation penalty in (12). The coefficient $\kappa_e$ regulates compliance strictness, while $w_R$, $w_D$, and $w_P$ tune the strategic balance between revenue and asset preservation. The agent aims to maximize the expected discounted return
$$J_i(\pi) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} \, r_{i,t} \right] \quad (20)$$
under the transition dynamics implied by (4) and the power-flow-based grid response.
The reward weights $w_R$, $w_D$, and $w_P$ are selected according to three practical principles. First, $w_R$ is kept as the dominant coefficient because market-facing revenue is the primary objective of VPP coordination. Second, $w_D$ and $w_P$ are chosen to be large enough to suppress economically favorable but operationally undesirable behaviors, such as excessive battery cycling or repeated constraint violations. Third, the coefficients are tuned so that the major reward components remain of comparable numerical scale during training, which improves optimization stability and prevents a single term from overwhelming the learning signal. In practice, we first normalize the observed magnitudes of the revenue, degradation, and penalty terms on a representative training set, and then select the final coefficients through coarse grid search to achieve a balanced trade-off among profitability, battery preservation, and operational compliance.
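The reward assembly of (18)–(19) reduces to a short function. The weight values below are illustrative placeholders, not the tuned coefficients used in the experiments:

```python
# Reward of Eqs. (18)-(19): weighted revenue minus degradation and penalties.

def reward(revenue, deg_cost, track_err_kw, grid_pen,
           w_r=1.0, w_d=0.5, w_p=0.5, k_e=0.01):
    """Instantaneous reward for one agent at one control step."""
    penalty = k_e * abs(track_err_kw) + grid_pen   # Eq. (19)
    return w_r * revenue - w_d * deg_cost - w_p * penalty
```

Because revenue enters with the dominant weight, the agent remains profit-seeking, while the penalty terms activate only when tracking or grid constraints are violated.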
4.2. Federated Reinforcement Learning Loop
To improve policy quality without sharing raw DER data, agents collaborate through a federated reinforcement learning process. Each federated round aggregates locally trained model parameters into a global model that is then redistributed. Let $\theta^{g,t}$ denote the global parameters broadcast at the start of round $t$. Each participating agent $k$ initializes its local parameters as $\theta^{t}_k \leftarrow \theta^{g,t}$, collects experience using its local environment, and performs $E$ local update epochs. The agent then computes an update $\Delta\theta^{t}_k = \theta^{t}_k - \theta^{g,t}$, where $\theta \in \mathbb{R}^{D}$ represents the flattened vector of all trainable parameters (weights and biases) across the aggregated networks, and transmits it to the aggregator. In this work, $\theta$ denotes the actor parameter vector that is shared and aggregated across agents, whereas critic parameters are trained locally and are not included in $\Delta\theta^{t}_k$. The total dimension $D$ is a critical factor in the subsequent zero-knowledge proof generation, as the number of constraints in the zk-SNARK circuit scales linearly with the product of $D$ and the number of participants $|\mathcal{P}_t|$.
Under standard FedAvg with uniform weights, the aggregator computes
$$\theta^{g,t+1} = \theta^{g,t} + \frac{1}{|\mathcal{P}_t|} \sum_{k \in \mathcal{P}_t} \Delta\theta^{t}_k, \quad (21)$$
where $\mathcal{P}_t$ denotes the participant set in round $t$. This loop enables collaboration but, by itself, requires trust in the aggregator to faithfully implement (21) over the correct set of updates.
4.3. Threat Model and Trustless Aggregation
The objective of zk-FRL is to remove the need to trust the aggregator in the aggregation step. We consider a DER population that follows the protocol and an aggregator that may be malicious or compromised. The aggregator may deviate from (21) by selectively dropping a subset of received updates, manipulating aggregation weights, replaying a stale global model, or injecting arbitrary parameters into $\theta^{g,t+1}$ before broadcasting it. The core design requirement is that such deviations are externally detectable by the DER agents without revealing private updates or raw data.
The threat model is intentionally scoped to aggregator-side deviations from the prescribed federated aggregation protocol: DER participants are assumed to execute local training and submission as specified, while the aggregator may be untrusted, compromised, or strategically malicious. Attacks originating from DER participants themselves, such as Byzantine updates, data or model poisoning, and Sybil behavior, are outside the present scope. These threats remain important in federated learning, but they are orthogonal to the aggregation-integrity problem addressed here and would require robust aggregation or participant-authentication mechanisms in addition to the proposed zk-SNARK layer. Accordingly, zk-FRL can detect incorrect aggregation, update dropping, stale replay, and replacement-model injection by an untrusted aggregator, but it does not by itself prevent malicious DER participants from submitting poisoned yet syntactically valid updates. In a realistic VPP deployment, the framework therefore protects aggregation integrity and auditability, while robust aggregation, client reputation, or admission-control mechanisms are still required to address participant-side attacks.
To make update dropping and weight manipulation well defined and auditable, zk-FRL binds the participant set per round and binds each update to a public commitment:
At the beginning of round $t$, each participating agent $k$ signs a round identifier $\mathrm{rid}_t$ to produce a participation receipt $\sigma^{t}_k$. The aggregator computes a public digest $h_t$ as in (15). This digest commits the participant set for round $t$.
After local training, agent $k$ computes a commitment $c^{t}_k$ as in (14). The set of commitments binds the concrete updates that must be used in aggregation.
For auditability, the aggregator publishes the ordered list of commitments (or a Merkle root over the list) as a public input to verification, binding each declared participant to exactly one committed update in round t.
The receipt digest certifies integrity conditional on the declared participant set for round $t$: if an agent's receipt and commitment are included in $h_t$, then any subsequent dropping or weight manipulation over that declared set becomes externally detectable. This mechanism, by itself, does not prevent an untrusted aggregator from selectively excluding agents during round enrollment (e.g., by not relaying $\mathrm{rid}_t$ to some agents or by omitting their receipts before forming $h_t$). We therefore assume an auditable enrollment channel in which the round identifier and enrollment window are publicly announced (e.g., a bulletin-board-style broadcast), so that an agent can detect exclusion when it has produced a valid receipt within the enrollment window but does not observe its inclusion in the published digest $h_t$. Handling exclusion disputes (e.g., via re-enrollment, redundant coordinators, or governance mechanisms) is orthogonal to verifiable aggregation correctness and is not the focus of this work.
The aggregator then publishes $\theta^{g,t+1}$ together with a zero-knowledge proof that the aggregation was performed correctly over the receipt-bound committed updates.
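The Merkle binding used for the published round data can be sketched with an ordinary hash; a deployed circuit would use a SNARK-friendly primitive such as Poseidon instead of SHA-256, and the odd-node duplication rule below is one common convention, not the paper's specification:

```python
import hashlib

# Sketch of Merkle root construction and authentication-path verification
# for the receipt/commitment lists bound to a round.

def _h(b):
    return hashlib.sha256(b).digest()

def _pad(level):
    return level + [level[-1]] if len(level) % 2 else level

def merkle_root(leaves):
    level = [_h(x) for x in leaves]
    while len(level) > 1:
        level = _pad(level)
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_path(leaves, idx):
    """Sibling hashes (with side flags) authenticating leaves[idx]."""
    level, path = [_h(x) for x in leaves], []
    while len(level) > 1:
        level = _pad(level)
        sib = idx ^ 1
        path.append((level[sib], sib < idx))   # True if sibling is on the left
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        idx //= 2
    return path

def verify_path(leaf, path, root):
    node = _h(leaf)
    for sib, sib_is_left in path:
        node = _h(sib + node) if sib_is_left else _h(node + sib)
    return node == root
```

A DER thus checks membership of its own receipt/commitment pair with a path of logarithmic length, which is the same check the circuit performs internally for every declared participant.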
This threat model differs from the threat models most commonly emphasized in the federated learning literature. Standard FL security analyses often focus on malicious or Byzantine clients, poisoning attacks, inference attacks, or privacy leakage during update exchange. In contrast, our work focuses on the trust assumption placed on the central aggregator itself. This distinction is particularly relevant in VPP coordination, where the aggregator is both a computational coordinator and an operationally influential entity whose deviation can silently bias the learned policy and affect economic performance as well as grid compliance. Therefore, the proposed threat model is not intended to replace existing FL threat models, but rather to complement them by isolating and formalizing the aggregation-integrity risk that is especially critical in cyber-physical energy systems.
4.4. Zero-Knowledge Verification Layer
We use a zk-SNARK to provide succinct proofs and fast verification. For each round $t$, the aggregator acts as a prover and each DER can independently act as a verifier. The statement proven in round $t$ is that the published $\theta^{g,t+1}$ is exactly the result of applying the agreed aggregation rule to the prior global model and the set of updates whose commitments correspond to the receipt-bound participant set.
The public inputs to verification include the round identifier $\mathrm{rid}_t$, the participant digest $h_t$, the prior and new global parameters $\theta^{g,t}$ and $\theta^{g,t+1}$, and the list of commitments $\{c^{t}_k\}_{k \in \mathcal{P}_t}$. The witness includes the concrete updates $\{\Delta\theta^{t}_k\}$ and the receipts that open to $h_t$. The proof enforces that each committed update is the one used in the aggregation and that the aggregation arithmetic is correct.
To keep the proving circuit practical while preserving verifiability, zk-FRL separates checks that are enforced inside the zk-SNARK circuit from those enforced outside the circuit by each DER verifier:
Inside the circuit (cryptographically proven). The prover supplies the concrete updates as private witnesses. The circuit (i) recomputes the commitment digest for each update using a SNARK-friendly hash (e.g., Poseidon) and checks equality against the published commitments $c^{t}_k$, (ii) enforces the membership of each receipt–commitment pair $(\sigma^{t}_k, c^{t}_k)$ in the declared round data via a Merkle-path verification against the published root(s) (e.g., for receipts and an optional commitment-list root), and (iii) enforces that the published $\theta^{g,t+1}$ equals the prescribed aggregation rule applied to $\theta^{g,t}$ and the witness updates under the fixed-point encoding and range constraints.
Outside the circuit (locally verified by DERs). The digital-signature validity of receipts is verified by each DER using the public keys of participating agents, and the DER checks that the round identifier and enrollment window match the publicly announced round metadata. These checks are standard and inexpensive in native execution, and keeping signature verification outside the circuit avoids a substantial increase in SNARK constraint count.
This split ensures that (a) aggregation arithmetic and update-to-commitment binding are provably correct, while (b) signature authenticity and enrollment metadata are verified without reintroducing trust in the aggregator.
For implementation clarity, the proving circuit is organized into three modules. The first module enforces update-to-commitment consistency: for each participant $k$, the witness update $\Delta\theta^{t}_k$ is re-encoded in fixed-point form, hashed with a SNARK-friendly primitive, and checked against the published commitment $c^{t}_k$. The second module verifies round membership by checking that each receipt/commitment pair belongs to the published round data through a Merkle authentication path against the public root(s). The third module enforces aggregation correctness: for every model coordinate $d \in \{1, \ldots, D\}$, the circuit checks that the encoded global parameter satisfies
$$\tilde{\theta}^{g,t+1}[d] = \tilde{\theta}^{g,t}[d] + \sum_{k \in \mathcal{P}_t} \tilde{w}_k \, \widetilde{\Delta\theta}^{t}_k[d],$$
where tildes denote fixed-point encodings, together with range constraints that ensure all encoded values remain within predetermined bounds. Under uniform FedAvg, $\tilde{w}_k$ reduces to the common encoded averaging coefficient $1/|\mathcal{P}_t|$.
Neural-network parameters are represented in floating point, whereas zk-SNARK circuits operate over a finite field. We therefore use a fixed-point encoding for updates and parameters within the proving circuit. Let $q$ denote the fixed-point scale, and let $\tilde{x} = \lfloor q x \rceil$ denote the integer encoding of a real-valued scalar $x$. The circuit enforces predetermined bounds on encoded values to prevent overflow and to make the arithmetic well defined.
In our implementation, the default scale is chosen to balance proving cost and numerical precision. Importantly, fixed-point quantization may affect optimization dynamics in reinforcement learning; therefore, we empirically evaluate sensitivity to $q$ by sweeping multiple scales and reporting the resulting learning curves and steady-state performance in Section 6. This sensitivity study quantifies the practical trade-off between verification overhead and control performance under the same co-simulation settings.
In addition to the default fixed-point scale, we considered nearby precision settings to assess numerical sensitivity. Under moderate changes in precision, the qualitative convergence behavior and benign-performance ranking remained unchanged, indicating that the learning results are not driven by a single quantization choice.
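The per-coordinate circuit check can be replicated outside the circuit for testing. The scale `Q = 2**16`, the integer-division rounding, and the one-unit tolerance are illustrative assumptions, not the paper's configuration:

```python
# Fixed-point replication of the per-coordinate aggregation check under
# uniform FedAvg. Assumed scale and rounding; real circuits work mod a prime.

Q = 2 ** 16  # assumed fixed-point scale q

def encode(x, q=Q):
    """Integer encoding of a real scalar at scale q (round-to-nearest)."""
    return round(x * q)

def check_aggregation(theta_old, theta_new, updates, q=Q, tol=1):
    """True if theta_new matches theta_old + mean(updates) in encoded form."""
    n = len(updates)
    for d in range(len(theta_old)):
        delta_sum = sum(encode(u[d], q) for u in updates)
        expected = encode(theta_old[d], q) + delta_sum // n
        if abs(encode(theta_new[d], q) - expected) > tol:
            return False
    return True
```

Such a reference check is useful for validating that the circuit's fixed-point arithmetic agrees with the floating-point FedAvg result to within the quantization tolerance.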
Let $D$ denote the number of aggregated parameters and let $|\mathcal{P}_t|$ denote the number of participants in round $t$. The total number of constraints can be expressed at a high level as
$$N_{\mathrm{cons}} \approx |\mathcal{P}_t| \, D \, \big( c_{\mathrm{hash}} + c_{\mathrm{agg}} + c_{\mathrm{range}} \big) + |\mathcal{P}_t| \, c_{\mathrm{mkl}} \log |\mathcal{P}_t| + c_{\mathrm{pub}},$$
where $c_{\mathrm{hash}}$ captures the per-coordinate or per-chunk commitment-hash cost, $c_{\mathrm{agg}}$ captures the fixed-point aggregation arithmetic, $c_{\mathrm{range}}$ captures bound enforcement, $c_{\mathrm{mkl}}$ is the cost per Merkle-path level, and $c_{\mathrm{pub}}$ accounts for public-output consistency checks. Accordingly, the dominant term scales linearly with $D \, |\mathcal{P}_t|$, while the membership-verification overhead grows as $|\mathcal{P}_t| \log |\mathcal{P}_t|$. This scaling law is the basis for the overhead trends reported later in Section 6.
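The scaling expression above can be turned into a back-of-envelope estimator. All per-operation constants below are hypothetical placeholders chosen only to illustrate the linear dominance of the $D \, |\mathcal{P}_t|$ term:

```python
import math

# Back-of-envelope constraint-count model for the zk-SNARK circuit.
# c_* values are illustrative placeholders, not measured costs.

def constraint_estimate(D, P, c_hash=300, c_agg=2, c_range=64,
                        c_mkl=300, c_pub=1000):
    """Estimate total constraints for D parameters and P participants."""
    linear = P * D * (c_hash + c_agg + c_range)            # dominant term
    membership = P * c_mkl * math.ceil(math.log2(max(P, 2)))
    return linear + membership + c_pub
```

Doubling $D$ roughly doubles the estimate, while growing the fleet adds only a slowly growing membership overhead on top of the linear term, matching the reported trend.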
Each DER verifies the proof using the public inputs. If verification succeeds, the DER accepts $\theta^{g,t+1}$ as the valid global model for round $t$. If verification fails, the DER rejects the update and retains its last verified model, preventing silent degradation due to aggregator deviations.
4.5. Protocols
Algorithm 1 summarizes the standard federated averaging loop used as a baseline. Algorithm 2 presents the zk-FRL protocol for a single round, which augments the baseline with participant binding, update commitments, and a zk-SNARK proof of correct aggregation. The protocol does not require the DERs to reveal raw data or to trust the aggregator, and it ensures that deviations from the prescribed aggregation rule are detectable by local verification.
Algorithm 1: Federated Averaging (FedAvg) Process. [Algorithm pseudocode figure]
Algorithm 2: zk-FRL Protocol (Single Round, Aggregator-Trustless). [Algorithm pseudocode figure]
5. Implementation
To evaluate the proposed zk-FRL framework under realistic cyber-physical conditions, we implement a power–market–learning co-simulation platform that couples distribution power flow, ancillary-service settlement, federated reinforcement learning, and a zero-knowledge verification layer.
Figure 2 shows the system architecture of the proposed zk–FRL co–simulation platform. The goal of this section is to make the experimental pipeline reproducible by stating all model interfaces, time scales, baselines, and measurement procedures.
5.1. Co-Simulation Platform
The platform integrates four modules that exchange signals at a fixed control interval and a slower federated round interval:
Power system module: We use the IEEE 33-bus radial feeder as the distribution network and run an AC power flow at each control step to obtain bus voltages and line flows. The feeder is populated with DER agents at a subset of buses selected to provide geographically distributed flexibility across upstream, mid-feeder, and downstream locations; unless otherwise stated, the DER-equipped buses are fixed across experiments. The deployed BESS units are heterogeneous, with energy capacities drawn from 80–300 kWh, maximum charge/discharge powers drawn from 40–120 kW, and initial state-of-charge values sampled from 0.4–0.7, while the SoC operating limits and charge/discharge efficiencies are set to common values across units. Net injections are the superposition of base load, PV generation, and BESS actions. Base load and PV injections follow representative diurnal profiles mapped to the corresponding buses. These heterogeneous settings are used to reflect realistic diversity among DER assets within a practical VPP.
Market and settlement module: The market environment is modeled as a discrete-time, time-varying ancillary-service process that provides a regulation command, an energy price, and an ancillary-service price. The VPP commits a regulation capacity per settlement interval. The module computes the aggregate tracking error via (7), settles revenue using (8) and (9), and produces the market-defined performance score used in settlement. Unless otherwise stated, the energy price is sampled from a diurnal time-of-use profile in the range of 20–80 USD/MWh, while the ancillary-service price is sampled from 10–40 USD/MW per settlement interval. The committed regulation capacity is selected within the feasible aggregate BESS capability of the VPP for the corresponding interval.
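A minimal per-interval settlement sketch is shown below. The linear performance score and the revenue decomposition are simplifying assumptions standing in for (7)–(9), which define the exact expressions.

```python
def settle_interval(command_kw: float, response_kw: float, capacity_kw: float,
                    price_as: float, price_e: float, net_energy_mwh: float):
    """Illustrative settlement for one interval (assumed forms of (7)-(9))."""
    # Aggregate tracking error: deviation of the delivered response from the command.
    err_kw = abs(response_kw - command_kw)
    # Performance score: perfect tracking scores 1, degrading linearly with
    # error normalized by the committed capacity (an assumed form).
    score = max(0.0, 1.0 - err_kw / max(capacity_kw, 1e-9))
    # Revenue: capacity payment scaled by performance, minus net energy cost.
    capacity_mw = capacity_kw / 1000.0
    revenue = price_as * capacity_mw * score - price_e * net_energy_mwh
    return err_kw, score, revenue
```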
Learning and communication module: Each DER agent runs a continuous-control DRL algorithm (Soft Actor–Critic, SAC) using local observations from (16) and rewards in (18) and (19). The actor network is a Multi-Layer Perceptron (MLP) with two hidden layers, each containing 256 neurons with Rectified Linear Unit (ReLU) activations; the input dimension is 7 and the output dimension is 1, corresponding to the BESS power setpoint. In our federated setup, we aggregate only the actor parameters, while each agent trains its critic(s) locally using private experience replay. Accordingly, the aggregated model dimension equals the actor parameter count of approximately 68,000 (68,097 weights and biases for a 7–256–256–1 MLP), providing a realistic benchmark for the zk-SNARK circuit complexity. Agents exchange model parameters or updates with the aggregator only once per federated round, and communication latency is modeled as a bounded delay added to the round wall-clock time. Unless otherwise noted, SAC uses fixed actor and critic learning rates, a fixed discount factor and soft target update coefficient, a fixed replay-buffer size, a mini-batch size of 256, and a fixed number of local update epochs per federated round; the entropy-temperature coefficient is tuned automatically following the standard SAC implementation. Exploration is induced by the stochastic SAC policy. Training is run for a fixed number of federated rounds, and convergence is assessed empirically by the stabilization of moving-average reward and profit curves over multiple seeds rather than by a formal stopping guarantee.
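The actor-only aggregation described above reduces to averaging equal-length parameter vectors. A minimal sketch follows, together with the actor parameter count implied by the stated 7–256–256–1 architecture; the function shape is illustrative.

```python
def fedavg(updates, weights=None):
    """Average flattened actor parameter vectors (uniform FedAvg by default)."""
    n = len(updates)
    weights = weights or [1.0 / n] * n
    d = len(updates[0])
    return [sum(w * u[i] for w, u in zip(weights, updates)) for i in range(d)]

# Actor parameter count of the 7-256-256-1 MLP (weights + biases per layer):
D = (7 * 256 + 256) + (256 * 256 + 256) + (256 * 1 + 1)  # = 68,097
```

Only vectors of this dimension D are exchanged each round; critic parameters and replay buffers never leave the DER.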
Cryptographic verification module: zk-FRL augments standard federation with signed participation receipts, update commitments, and a zk-SNARK proof of correct aggregation. Updates are encoded in fixed point with scale q and range constraints, consistent across rounds; commitments are instantiated with a SNARK-friendly hash, and Merkle authentication is used for round-data membership proofs. The Groth16 proving system is used throughout. We measure proof generation time at the aggregator and verification time at each DER, and report proof generation time, verification time, proof size, and key sizes under the same hardware/software environment described in Section 5.6.
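The fixed-point encoding with range constraints can be sketched as below. The choice of 16 fractional bits, the 32-bit range, and the SHA-256 stand-in for the SNARK-friendly commitment hash (Poseidon in typical Circom circuits) are all illustrative assumptions.

```python
import hashlib

Q_BITS = 16          # fractional bits; the scale q is 2**Q_BITS (illustrative)
RANGE_BITS = 32      # each encoded value must fit in a signed 32-bit range

def encode_fx(x: float) -> int:
    """Encode a float into fixed point and enforce the range constraint."""
    v = round(x * (1 << Q_BITS))
    if not (-(1 << (RANGE_BITS - 1)) <= v < (1 << (RANGE_BITS - 1))):
        raise ValueError("range constraint violated")
    return v

def decode_fx(v: int) -> float:
    return v / (1 << Q_BITS)

def commit(update, salt: bytes) -> bytes:
    """Commitment over an encoded update (SHA-256 stand-in for a SNARK-friendly hash)."""
    h = hashlib.sha256(salt)
    for v in update:
        h.update(int(v).to_bytes(8, "big", signed=True))
    return h.digest()
```

In the real circuit the same encoding is re-checked inside the SNARK, so a prover cannot claim an out-of-range or inconsistently scaled value.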
5.2. Time Scales and Training Schedule
We adopt a two-time-scale schedule. Control actions are applied at a fixed control interval (e.g., seconds to minutes depending on the product), while a federated learning round occurs every H control steps. In each round, participating agents collect experience over the most recent window, perform E local update epochs, and submit an update and a commitment to the aggregator. The aggregator computes the candidate global model using (21) and, under zk-FRL, attaches a proof of correct aggregation.
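The two-time-scale schedule above can be sketched as a single loop. The symbols H and E follow the text; the event encoding is illustrative.

```python
def two_time_scale_events(total_steps: int, H: int, E: int):
    """Emit control-step and federated-round events (illustrative scheduler).

    A control action fires every step; every H control steps a federated
    round fires, in which each agent performs E local update epochs before
    submitting its update and commitment.
    """
    events = []
    for t in range(total_steps):
        events.append(("control", t))
        if (t + 1) % H == 0:
            events.append(("fed_round", t, E))
    return events
```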
5.3. Baselines
We benchmark zk-FRL against three control strategies:
Centralized DRL (Oracle): A single controller has access to all DER states and actions and trains a centralized policy. This provides an upper-bound reference for economic performance but is impractical due to privacy and communication constraints.
Independent DRL: Each DER agent trains locally without any federation and optimizes only its local reward. This captures the cost of non-cooperation.
Standard Federated RL: Agents train locally and the aggregator performs FedAvg without any verification layer. This baseline isolates the security gap introduced by trusting the aggregator.
The selected baselines are intended to isolate three distinct gaps: the collaboration gap (independent DRL versus federated learning), the trust gap (standard FRL versus zk-FRL), and the observability gap (centralized oracle versus practical decentralized coordination). Robust or secure aggregation baselines are complementary because they primarily address malicious clients rather than an untrusted aggregator, while optimization-based VPP dispatch remains an important model-based reference but represents a different control paradigm from learned adaptive coordination.
5.4. Attack Suite and Robustness Settings
To evaluate robustness to malicious aggregator behavior, we define an attack suite that targets the aggregation step. Starting at a designated attack round, the aggregator deviates from the protocol by applying one of the following actions before broadcasting the global model:
Update dropping: randomly drop a fraction of committed updates while still claiming the same participant digest.
Weight manipulation: apply biased weights that deviate from uniform averaging while claiming uniform aggregation.
Stale replay: replay a previously broadcast global model as if it were fresh.
Model injection: publish an arbitrary parameter vector that is not the result of the prescribed aggregation.
Under zk-FRL, all deviations above are detectable at verification time by design, so agents reject the invalid global model and retain the last verified parameters.
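For concreteness, the four deviations can be modeled as below. The function shape and the drop probability are illustrative; under zk-FRL, each of these outputs fails proof verification against the committed round data.

```python
import random

def malicious_aggregate(updates, attack: str, prev_global=None, seed: int = 0):
    """Illustrative aggregator deviations from uniform FedAvg (attack suite)."""
    rng = random.Random(seed)
    d = len(updates[0])

    def avg(vectors):
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(d)]

    if attack == "drop":        # drop a random fraction of committed updates
        kept = [u for u in updates if rng.random() > 0.5] or updates[:1]
        return avg(kept)
    if attack == "bias":        # biased weights while claiming uniform averaging
        w = [2.0] + [0.5] * (len(updates) - 1)
        s = sum(w)
        return [sum(wi * u[i] for wi, u in zip(w, updates)) / s for i in range(d)]
    if attack == "replay":      # re-broadcast a stale global model as fresh
        return list(prev_global)
    if attack == "inject":      # publish an arbitrary parameter vector
        return [rng.uniform(-1.0, 1.0) for _ in range(d)]
    raise ValueError(f"unknown attack: {attack}")
```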
5.5. Evaluation Metrics
We report four groups of metrics:
Economic performance: total VPP profit (USD/day), decomposed into ancillary-service revenue, energy cost, and monetized degradation.
Technical performance: ancillary-service tracking quality measured by the mean absolute error (MAE) of the aggregate VPP response relative to the regulation command, and grid feasibility measured by voltage-violation magnitude aggregated over buses and time.
Security and robustness: relative degradation of profit and tracking performance under each attack type and strength, compared to benign operation.
Computational overhead: proof generation time at the aggregator and verification time at each DER, as a function of fleet size and model dimension.
All learning results are averaged over multiple random seeds, and we report mean and standard deviation.
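The tracking and feasibility metrics above can be computed as in the following sketch; the 0.95–1.05 p.u. voltage band is an illustrative default rather than a value stated in the text.

```python
def tracking_mae(responses, commands):
    """Mean absolute tracking error of the aggregate response vs. the command."""
    return sum(abs(r - c) for r, c in zip(responses, commands)) / len(commands)

def voltage_violation(v_pu_by_bus, v_min: float = 0.95, v_max: float = 1.05):
    """Violation magnitude aggregated over buses and time (per-unit volts)."""
    return sum(max(0.0, v_min - v) + max(0.0, v - v_max)
               for series in v_pu_by_bus for v in series)
```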
5.6. Experimental Environment and Hardware
The power–cyber co-simulation and the cryptographic operations are executed on a high-performance workstation equipped with an AMD Ryzen 9 7950X CPU (16 cores, 5.7 GHz) and 128 GB of DDR5 RAM. The reinforcement learning agents are trained using PyTorch 2.1, and the distribution grid power flow is solved via OpenDSS v10.2.0.1. For the zero-knowledge verification layer, we utilize the Circom 2.1 compiler to define the aggregation circuits and the SnarkJS library to implement the Groth16 zk-SNARK protocol. The reported proof generation times account for the full witness generation and proving phase, averaged over 10 independent trials per fleet size.
6. Performance Evaluation
This section reports experimental results produced by the co-simulation pipeline in
Section 5. The results are organized into benign performance, robustness to malicious aggregation, and cryptographic overhead. Unless otherwise stated, each point is averaged over five random seeds and the shaded band indicates one standard deviation.
6.1. Economic and Technical Performance
Figure 3 compares learning curves of total VPP profit under a benign aggregator, with centralized DRL providing an upper-bound reference. Under the tested IEEE 33-bus co-simulation setting, both standard federated RL and zk-FRL closely track this reference, with zk-FRL achieving near-centralized performance under benign aggregation. We emphasize that this result provides encouraging evidence within the studied scenario rather than establishing general superiority under all deployment conditions or larger-scale real-world systems. Notably, the transition from 32-bit floating-point parameters (Standard FRL) to 16-bit fixed-point encoding (zk-FRL) did not result in observable performance degradation or convergence instability. This empirical evidence indicates that the policy is robust to the precision truncation required for zero-knowledge verification, as the coordination benefits of federation significantly outweigh the negligible quantization error.
The steady-state economic and technical performance metrics, averaged over the final evaluation horizon, are summarized in
Table 2. These results represent the mean and standard deviation across five independent simulation runs conducted in the power–cyber co-simulation environment detailed in
Section 5.
We additionally varied the degradation coefficients around the default setting and observed that the qualitative ranking of the compared methods remained unchanged, indicating that the main conclusions are not driven by a single coefficient choice.
Reward-Weight Sensitivity Analysis: We further examined the sensitivity of the learned policy to the reward-weight configuration by varying the relative emphasis on degradation cost and penalty terms around the default setting while keeping the overall reward structure unchanged. The results show that the proposed zk-FRL framework remains stable under moderate weight variations. Increasing the degradation weight leads to more conservative battery usage and lower cycling stress, but slightly reduces total profit. Increasing the penalty weight improves tracking and grid-feasibility compliance, but overly large penalties reduce economic performance by discouraging flexible dispatch. Conversely, when the degradation or penalty weights are set too small, the learned policy becomes more aggressive and may obtain slightly higher short-term revenue at the expense of larger tracking error, greater constraint pressure, and higher battery wear. Overall, the default weight setting provides the best balance among economic performance, battery health, and operational feasibility, which is consistent with the intended design of the multi-objective reward.
6.2. Security and Robustness Analysis
We evaluate robustness by enabling a malicious aggregator at a designated attack round. We first present the update-dropping attack (with a fixed drop fraction) to illustrate the failure mode of trust-based federation, then summarize results across the broader attack suite.
Figure 4 illustrates a representative update-dropping scenario starting at the attack round. Under Standard FRL, a malicious aggregator can deviate from the prescribed aggregation rule while still broadcasting a syntactically valid global model, which may lead to degraded learning dynamics. In zk-FRL, any deviation that violates the receipt-bound commitment set or the aggregation arithmetic causes the proof verification to fail at the DER side. In that case, each DER rejects the broadcast model and continues operating with its last verified parameters (i.e., a local "parameter-lock" fallback). This design changes the failure mode from silent model drift to explicit rejection, and limits the impact of aggregator-side manipulation to the quality of the most recent integrity-verified model.
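The client-side accept/reject rule can be sketched as follows; `verify_proof` stands in for Groth16 verification of the aggregation proof against the round commitments.

```python
def adopt_or_lock(last_verified_params, broadcast, verify_proof):
    """Adopt the broadcast global model only if its proof verifies; otherwise
    keep ("lock") the last verified parameters (illustrative client rule)."""
    if verify_proof(broadcast):
        return broadcast["params"], True    # adopt fresh verified model
    return last_verified_params, False      # explicit rejection, stay locked
```

In the co-simulation, the returned flag can also be logged to measure how long agents operate under locked parameters during an attack window.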
Beyond instantaneous robustness, we also evaluate the impact of attack duration and environmental drift on the “parameter-lock” fallback. Specifically, we vary (i) the number of consecutive malicious rounds after and (ii) the degree of non-stationarity in market and grid signals (e.g., price process and regulation command statistics) to quantify how long DERs can safely operate under the last verified parameters without re-training. This study characterizes a practical trade-off: rejecting invalid aggregation prevents silent poisoning, but prolonged rejection may slow adaptation when the underlying environment shifts. We report these results together with recovery behaviors (e.g., resuming federation once a valid proof is observed) in the following experiments.
Table 3 reports robustness across attack types using relative profit loss compared to benign steady state. We present results for both Standard FRL and zk-FRL, where zk-FRL may reject invalid broadcasts under aggregator deviation; the resulting performance reflects operation under the last verified parameters until a valid global model is available.
6.3. Scalability and Overheads of ZKPs
We evaluate the computational feasibility of the zk-FRL protocol by measuring the proving and verification wall-clock times across varying fleet sizes, as illustrated in Figure 5. For the aggregated actor model with dimension D (approximately 68,000 parameters), the measured proof generation time at the aggregator increases approximately linearly with the number of participating DERs, while the verification time at each DER remains lightweight and largely insensitive to fleet size. This verifier/prover asymmetry is a key advantage of zk-SNARKs: even resource-constrained DER devices can independently audit aggregation integrity with low latency. In our implementation, proving is performed at the federated-round time scale and does not block the sub-second physical control loop, since DERs continue operating with the last verified parameters until a valid global model is available.
These overheads are compatible with settings where federated rounds occur on the order of minutes (or longer), while sub-second control loops would require additional optimizations such as update quantization, batching, and hardware-assisted proving. In our implementation, the cryptographic layer does not affect the physical feasibility of control actions within a round because verification is performed before adopting the next global model, and agents can continue operating with the last verified parameters when verification fails. For reproducibility, we report the circuit- and system-level measurements underlying Figure 5, including (i) the total number of SNARK constraints and witness size per round as a function of the model dimension D and the number of participants, (ii) peak memory usage and wall-clock proving time at the aggregator (mean, standard deviation, and selected percentiles over repeated trials), (iii) the verification-time distribution at DERs, and (iv) the sizes of the proving and verification keys and the proof payload. We also specify the hash primitive used for commitments and Merkle membership in the circuit, the fixed-point scale q, and the hardware/software stack (compiler, proving system, and library versions) to enable independent replication.
The current scalability and performance conclusions should be interpreted within the tested feeder size, DER composition, communication settings, and market-signal regimes. While the reported results provide encouraging evidence under the considered IEEE 33-bus scenarios, broader studies involving larger feeders and more heterogeneous DER portfolios remain important future work before claiming deployment-level generality.
For reproducibility, we additionally report the total constraint count, proof size, proving and verification key sizes, peak memory usage, and the overhead breakdown between witness generation and proof generation. We also quantify the effect of fixed-point encoding by reporting the approximation error induced by the chosen scale q. These implementation details complement the asymptotic scaling discussion and clarify the concrete cost of the cryptographic layer.
Relative to representative zk-enabled FL frameworks, the proposed zk-FRL occupies a different point in the design space. In terms of computational overhead, our proof scope is restricted to aggregation correctness and round-data consistency, avoiding the substantially larger cost of proving full local-training correctness. In terms of privacy protection, zk-FRL preserves raw measurements and local trajectories at the DER side, but it does not aim to conceal model updates from the aggregator. In terms of robustness, the main benefit is strong protection against aggregator-side manipulation, including update dropping, weight tampering, stale replay, and model injection. This makes the framework particularly suitable for VPP coordination, where externally verifiable aggregation is more critical than end-to-end concealment of the learning process.
7. Final Remarks
This paper presented zk-FRL, a zero-knowledge federated reinforcement learning framework for large-scale VPP coordination in ancillary-service markets. The framework targets the practical trilemma faced by modern VPPs: achieving near-optimal economic performance while preserving DER raw-data privacy and maintaining operational security under realistic cyber threats. Our approach combines multi-objective DRL at the DER level with federated training that avoids centralizing fine-grained measurements and local experience trajectories, and adds a verification layer that makes the aggregation step externally auditable through signed round participation receipts, update commitments, and a zk-SNARK proof of correct aggregation under a fixed-point encoding with range constraints.
The evaluation in power–market–learning co-simulation indicates that, under benign conditions, zk-FRL attains economic performance close to a centralized reference while improving coordination relative to independent control. More importantly, zk-FRL changes the failure mode of federated learning under a compromised aggregator. When the aggregator deviates from the prescribed protocol by selectively dropping updates, manipulating weights, replaying stale models, or injecting arbitrary parameters, standard federated RL can drift silently and suffer sustained degradation. In contrast, zk-FRL enables each DER to detect such deviations by local verification and to reject invalid global models, thereby preventing the silent poisoning of the shared policy and bounding performance by the last verified model quality.
Beyond algorithmic performance, the framework supports a verifiable governance model for VPP operation. Because aggregation integrity is enforced cryptographically rather than procedurally, DER owners can participate without granting raw-data visibility (measurements and local experience trajectories) to a central coordinator, and operators can demonstrate protocol compliance without requiring access to those raw data streams. At the same time, the zk-SNARK layer enables independent verification that the published global model follows the prescribed aggregation rule over the declared round data; the framework targets verifiable aggregation integrity rather than concealing model updates from the aggregator. This property is aligned with emerging requirements for auditable control in critical infrastructure and suggests a pathway toward coordination mechanisms that avoid raw-data centralization while remaining independently verifiable.
Several directions remain open. Proof generation overhead at the aggregator is the primary scalability bottleneck, and reducing it will require system-level optimizations such as update quantization and batching, as well as hardware-assisted proving. Extending zk-FRL to incorporate more detailed degradation formulations and richer market products, including simultaneous participation in multiple ancillary services, is also important for deployment realism. Finally, while this work successfully addresses the “trust gap” at the aggregator level via verifiable aggregation, it does not inherently mitigate adversarial behavior originating from DER participants themselves, such as data or model poisoning. More broadly, the current conclusions should be interpreted as evidence within the tested feeder size, communication conditions, and market-signal regimes rather than as a claim of deployment-level generality. Extending the evaluation to larger feeders, stronger non-stationarity, and varying participation rates remains important future work. In a large-scale VPP, Byzantine participants could attempt to bias the global model by submitting malicious updates that are cryptographically valid but contain poisoned gradients intended to degrade fleet performance. Future research will focus on integrating our zk-SNARK framework with robust aggregation rules, including Krum, coordinate-wise median, and trimmed mean, together with Sybil-resistant identity management. This will provide a holistic security architecture capable of defending against both infrastructure-level breaches and participant-level adversaries, ensuring the long-term resilience of the trustless VPP coordination layer.