1. Introduction
The power system is rapidly shifting from a small number of controllable generators toward a large population of Distributed Energy Resources (DERs), including behind-the-meter photovoltaic generation, battery energy storage systems (BESSs), and flexible loads [1]. While this transition supports decarbonization, it also increases operational uncertainty as renewable intermittency and stochastic demand tighten real-time balancing requirements [1]. In this context, ancillary services such as frequency regulation and voltage support become increasingly important, and fast-response capabilities are progressively supplied by inverter-based resources and storage [1]. BESSs are well suited to provide high-quality ancillary services, but frequent cycling accelerates degradation and erodes long-horizon value [2]. Since individual DERs are typically too small to participate directly in wholesale or ancillary-service markets, aggregation into a Virtual Power Plant (VPP) has emerged as a dominant architecture for market participation and coordinated operation [3,4,5,6]. However, scaling a VPP exposes a persistent trilemma: the operator seeks near-optimal economic performance, DER owners demand raw-data privacy, and the overall coordination layer must remain secure and auditable under realistic cyber threats [3,7,8].
Existing coordination approaches address only parts of this trilemma. Centralized VPP control can achieve strong economic performance by optimizing the fleet with full observability, but it requires collecting fine-grained operational data and concentrates risk in a single point of failure and compromise [3,5,9]. Federated Learning (FL) improves privacy by keeping raw data local and sharing only model updates, enabling collaborative learning without centralized data collection [10,11]. However, standard FL introduces a fragile trust assumption that is rarely made explicit in VPP settings: a central aggregator must be trusted to correctly and fairly combine updates and to publish a fresh, protocol-compliant global model every round [10]. This trust gap is especially problematic in reinforcement-learning-based control, where the training signal is noisy and there is no ground-truth label to audit. A compromised aggregator can silently bias the learning process by selectively dropping updates, manipulating weights, replaying stale models, or injecting a replacement model, leading to degraded profit, reduced compliance, and potentially unsafe dispatch behavior [12].
This paper proposes a trustless coordination mechanism for VPP learning by integrating zero-knowledge proofs into federated reinforcement learning. We develop a zero-knowledge federated reinforcement learning (zk-FRL) framework in which each DER runs a local deep reinforcement learning agent to solve a multi-objective dispatch problem balancing ancillary-service revenue against BESS degradation under operational and grid constraints, while collaboration is achieved through a federated training loop whose aggregation step is externally verifiable [11,13]. In each round, participating agents bind the participant set using signed receipts and publish cryptographic commitments to their updates. The aggregator computes the new global parameters and produces a succinct zk-SNARK, proving that the published model equals the agreed aggregation rule applied to the receipt-bound set of committed updates under a fixed-point encoding with range constraints [14,15,16]. Each DER independently verifies the proof and adopts the global model only if verification succeeds, eliminating the need to trust the aggregator for aggregation integrity while avoiding the centralization of raw DER operational data and local experience trajectories; the framework provides verifiable aggregation rather than concealing model updates from the aggregator [14,16].
The novelty of the proposed zk-FRL framework is not merely the addition of a generic zk-SNARK layer to federated reinforcement learning. At the algorithmic level, we define a receipt-bound and commitment-consistent aggregation protocol that allows each DER to accept or reject the published global model based on locally verifiable aggregation correctness. At the systems level, we integrate this protocol into a two-time-scale power–market–learning workflow, in which physical control remains continuous while model synchronization is externally auditable. At the power-engineering level, we tailor the learning objective and verification target to VPP coordination, where economic return, battery degradation, and grid-feasibility constraints must be balanced under aggregator-side cyber risk.
Our contributions are summarized as follows:
We propose a zk-FRL architecture for large-scale VPP coordination that removes the implicit trust requirement on the FL aggregator by making the aggregation step cryptographically verifiable while preserving DER raw-data privacy (i.e., without centralizing fine-grained measurements and local experience trajectories).
We formalize an adversarial model for malicious aggregator behavior in federated reinforcement learning, and define verifiable security properties for aggregation correctness and update inclusion that make selective dropping, weight manipulation, replay, and model injection externally detectable [12].
We present an implementable zk-SNARK-based verification protocol for neural-network aggregation using fixed-point encoding and range constraints, and provide a complexity characterization that explains how proof cost scales with model dimension and fleet size [14,15,16].
We validate the framework in a power–cyber co-simulation on the IEEE 33-bus feeder [17] with ancillary-service signals, benchmarking against centralized DRL, independent DRL, and standard federated RL under both benign and adversarial aggregators.
The remainder of this paper is organized as follows.
Section 2 reviews multi-objective BESS operation, DER aggregation architectures, and secure distributed learning.
Section 3 presents the VPP system model and local DER dynamics.
Section 4 details the proposed zk-FRL protocol, its security properties, and circuit construction.
Section 5 describes the co-simulation platform and baselines.
Section 6 reports performance, robustness under a suite of aggregator attacks, and cryptographic overhead.
Section 7 concludes and outlines future directions.
3. System Model
This section formalizes the cyber-physical environment considered in this work. We model a large-scale Virtual Power Plant (VPP) as a multi-agent system that interacts with (i) the physical distribution grid, (ii) ancillary-service and energy markets, and (iii) a communication and learning layer that enables coordination without centralizing raw DER operational data, while making the aggregation step externally verifiable. The model is organized from architecture and time scales to device dynamics, market settlement, grid feasibility, and finally the cryptographic objects used later by the zk-FRL protocol.
3.1. System Architecture and Time Scales
We consider a three-tiered architecture consisting of DER agents, a VPP aggregator, and the grid/market operator:
DER Agents: Intelligent software agents co-located with physical DER assets (e.g., BESS, PV, controllable loads). Each agent performs local control and maintains private operational data, including measurements, forecasts, and historical trajectories. Agents train local deep reinforcement learning (DRL) models on private experience and only communicate model updates and cryptographic commitments.
VPP Aggregator: A coordination service that orchestrates federated training rounds. The aggregator distributes the current global model, collects local model updates, computes a candidate global model, and generates a zero-knowledge proof certifying that the aggregation computation follows the prescribed rule over the correct participant set. The aggregator is not trusted by assumption and may be malicious as formalized later; hence, its actions must be verifiable.
Grid/Market Operator: The external entity (ISO/DSO) that issues ancillary-service signals (e.g., regulation commands) and settles payments based on the VPP’s aggregate performance. The distribution grid enforces physical constraints such as voltage limits and line thermal limits.
We distinguish two coupled time scales:
Control interval (seconds to minutes): each agent executes a dispatch action and receives market and grid signals. For dimensional consistency with power $P$ in kW and energy capacity $C$ in kWh, we use the hour-equivalent step $\Delta t$ (in hours) in all energy-balance and settlement equations.
Federated learning round (multiple control intervals): agents collect local experience over a horizon and periodically participate in a federated reinforcement learning round to update shared policy parameters. This separation is important because cryptographic proving is performed per learning round rather than per control step.
3.2. Participants, Signals, and Information Structure
Let $\mathcal{N} = \{1, \ldots, N\}$ denote the set of DER agents. At control time $t$, agent $i$ observes a local measurement vector
$$o_{i,t} = \big(\mathrm{SoC}_{i,t},\; P^{\mathrm{PV}}_{i,t},\; P^{\mathrm{L}}_{i,t},\; f_t,\; \lambda^{\mathrm{E}}_t,\; \lambda^{\mathrm{AS}}_t,\; \xi_t\big), \quad (1)$$
where $\mathrm{SoC}_{i,t}$ is the state of charge, $P^{\mathrm{PV}}_{i,t}$ and $P^{\mathrm{L}}_{i,t}$ are local PV and load, $f_t$ is a grid frequency (or an ancillary-service tracking reference), $\lambda^{\mathrm{E}}_t$ and $\lambda^{\mathrm{AS}}_t$ are energy and ancillary-service prices, and $\xi_t$ captures additional operator signals or calendar features (e.g., time of day, market interval index). Each agent does not observe other agents' private states. The VPP-level objective depends on the aggregate response
$$P^{\mathrm{VPP}}_t = \sum_{i \in \mathcal{N}} P_{i,t}, \quad (2)$$
which is the quantity evaluated by the market operator for ancillary-service performance.
3.3. BESS Dynamics and Operational Constraints
Each agent $i$ controls a BESS with energy capacity $C_i$ (kWh), power limit $\overline{P}_i$ (kW), and efficiency parameters $\eta^{\mathrm{ch}}_i, \eta^{\mathrm{dis}}_i \in (0, 1]$. The control action is the charge/discharge setpoint
$$P_{i,t} \in [-\overline{P}_i, \overline{P}_i], \quad (3)$$
where positive values denote charging and negative values denote discharging. Define $P^{+}_{i,t} = \max(P_{i,t}, 0)$ and $P^{-}_{i,t} = \max(-P_{i,t}, 0)$. The state of charge evolves according to a discrete-time energy balance:
$$\mathrm{SoC}_{i,t+1} = \mathrm{SoC}_{i,t} + \frac{\Delta t}{C_i} \left( \eta^{\mathrm{ch}}_i P^{+}_{i,t} - \frac{P^{-}_{i,t}}{\eta^{\mathrm{dis}}_i} \right), \quad (4)$$
subject to
$$\underline{\mathrm{SoC}}_i \le \mathrm{SoC}_{i,t} \le \overline{\mathrm{SoC}}_i, \qquad |P_{i,t}| \le \overline{P}_i. \quad (5)$$
These constraints ensure feasible operation and implicitly define an admissible action set that depends on $\mathrm{SoC}_{i,t}$.
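The energy balance (4) and the feasibility constraints (5) can be sketched as a small simulation step. The function and parameter names below are illustrative (not the paper's notation), and the clipping rule is one plausible way to realize the SoC-dependent admissible action set:

```python
# Sketch of the BESS dynamics in Section 3.3. Positive setpoints charge the
# battery, negative setpoints discharge; dt_h is the hour-equivalent step.

def clip_action(p_set, p_max, soc, soc_min, soc_max, cap_kwh,
                eta_ch, eta_dis, dt_h=1.0):
    """Project a charge(+)/discharge(-) setpoint onto the admissible set."""
    p = max(-p_max, min(p_max, p_set))  # enforce the power limit
    # One-step headroom implied by the SoC bounds (illustrative rule)
    max_charge = (soc_max - soc) * cap_kwh / (eta_ch * dt_h)
    max_discharge = (soc - soc_min) * cap_kwh * eta_dis / dt_h
    return max(-max_discharge, min(max_charge, p))

def soc_next(soc, p, cap_kwh, eta_ch, eta_dis, dt_h=1.0):
    """Discrete-time energy balance of Eq. (4)."""
    p_ch, p_dis = max(p, 0.0), max(-p, 0.0)
    return soc + dt_h * (eta_ch * p_ch - p_dis / eta_dis) / cap_kwh
```

Clipping before integration guarantees that the resulting SoC trajectory never leaves its bounds, which is why the admissible set depends on the current state of charge.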
3.4. Ancillary-Service Signal, Tracking, and Settlement
The VPP participates in an ancillary-service product such as frequency regulation. Let $r_t \in [-1, 1]$ denote the normalized regulation command issued by the operator and let $R_t$ denote the VPP regulation capacity committed for the settlement interval. The commanded aggregate power is
$$P^{\mathrm{cmd}}_t = r_t R_t. \quad (6)$$
The tracking error is modeled as
$$e_t = P^{\mathrm{VPP}}_t - r_t R_t. \quad (7)$$
To keep the model general across markets, we represent ancillary-service revenue as a function of committed capacity and a performance score:
$$\Pi^{\mathrm{AS}}_t = \lambda^{\mathrm{AS}}_t R_t \, \phi(e_t), \quad (8)$$
where $\phi(\cdot)$ is a non-increasing performance mapping (e.g., a score that penalizes tracking error or non-compliance). Energy-market revenue (or cost) due to charging/discharging is represented as
$$\Pi^{\mathrm{E}}_{i,t} = -\lambda^{\mathrm{E}}_t P_{i,t} \Delta t, \quad (9)$$
where the sign convention implies that discharging ($P_{i,t} < 0$) yields positive revenue and charging yields a cost. The total market-facing revenue component can be written as $\Pi_t = \Pi^{\mathrm{AS}}_t + \sum_{i \in \mathcal{N}} \Pi^{\mathrm{E}}_{i,t}$, noting that additional market terms (e.g., mileage payments) can be absorbed into (8) without changing the learning and verification structure.
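The settlement structure above can be illustrated with a minimal sketch. The linear performance score and the error tolerance `err_tol_kw` are hypothetical placeholders for the market-specific mapping $\phi(\cdot)$; prices are assumed to be quoted in USD/MWh and USD/MW:

```python
# Illustrative settlement sketch for Section 3.4 (assumed score shape).

def as_revenue(lambda_as, capacity_kw, tracking_err_kw, err_tol_kw=50.0):
    """Capacity payment scaled by a non-increasing performance score phi."""
    score = max(0.0, 1.0 - abs(tracking_err_kw) / err_tol_kw)
    return lambda_as * capacity_kw * score

def energy_revenue(lambda_e, p_kw, dt_h=1.0):
    """Discharging (p < 0) earns revenue; charging (p > 0) incurs a cost."""
    return -lambda_e * p_kw * dt_h / 1000.0  # kW·h -> MWh for a USD/MWh price
```

Any concrete market rule (e.g., PJM-style performance scoring) would replace the placeholder score without changing the rest of the model.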
3.5. Battery Degradation Cost Model
Frequent cycling and high-rate operation accelerate BESS degradation. We model degradation as a monetized cost term that penalizes throughput and stress. To ensure reproducibility while retaining flexibility, we adopt a generic semi-empirical form and treat its coefficients as part of the experimental specification:
$$C^{\mathrm{deg}}_{i,t} = \alpha_i \, |P_{i,t}| \, \Delta t + \beta_i \big( \mathrm{SoC}_{i,t+1} - \mathrm{SoC}_{i,t} \big)^2, \quad (10)$$
where $\alpha_i \ge 0$ and $\beta_i \ge 0$ reflect asset-specific degradation sensitivity. The first term captures energy-throughput-induced wear, while the second term penalizes aggressive changes in the state of charge that correlate with deeper or faster cycling. This representation is compatible with more detailed cycle-counting or rainflow-based models; when such models are used, they replace (10) as the definition of $C^{\mathrm{deg}}_{i,t}$ without affecting the remainder of the framework.
The degradation model adopted here is a semi-empirical proxy designed for dispatch-level decision making rather than detailed lifetime prediction. This abstraction is appropriate for the present study because the control objective is short-horizon ancillary-service coordination, whereas temperature-dependent aging, SOH evolution, and detailed cycle-aging effects mainly influence the effective degradation coefficients on slower horizons.
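The two-term proxy in (10) amounts to a one-line cost function; the sketch below uses illustrative argument names for the throughput and stress coefficients:

```python
# Semi-empirical degradation proxy of Eq. (10): throughput wear plus
# SoC-excursion stress. alpha/beta are asset-specific sensitivities.

def degradation_cost(p_kw, soc_now, soc_next, alpha, beta, dt_h=1.0):
    throughput = alpha * abs(p_kw) * dt_h          # energy-throughput wear
    stress = beta * (soc_next - soc_now) ** 2      # cycling-depth/rate stress
    return throughput + stress
```

A rainflow-based cycle counter could be substituted for this function without changing how the cost enters the reward.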
3.6. Distribution-Grid Feasibility and Grid Penalties
The VPP operates over a distribution network with bus set $\mathcal{B}$ and line set $\mathcal{L}$. Let $V_{b,t}$ denote the voltage magnitude at bus $b$ and let $S_{\ell,t}$ denote the apparent power flow on line $\ell$. At each control step, the power system simulator computes $(V_{b,t}, S_{\ell,t})$ given net injections, including DER actions. Grid feasibility is characterized by operational limits
$$\underline{V} \le V_{b,t} \le \overline{V}, \qquad S_{\ell,t} \le \overline{S}_\ell. \quad (11)$$
Because the control policy is learned from experience and may explore, we incorporate grid feasibility into the learning objective through a penalty term that discourages violations:
$$C^{\mathrm{grid}}_t = \kappa_V \sum_{b \in \mathcal{B}} \big[ \max(0, V_{b,t} - \overline{V}) + \max(0, \underline{V} - V_{b,t}) \big] + \kappa_S \sum_{\ell \in \mathcal{L}} \max(0, S_{\ell,t} - \overline{S}_\ell), \quad (12)$$
with coefficients $\kappa_V, \kappa_S > 0$. This penalty couples the physical layer to the agent learning problem without requiring centralized access to all private states.
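The hinge-style penalty in (12) can be sketched directly from simulator outputs. The default band and the coefficient values below are illustrative assumptions, not the paper's settings:

```python
# Grid-violation penalty of Eq. (12): voltage-band and line-loading hinges.

def grid_penalty(voltages, flows, s_max=None, v_min=0.95, v_max=1.05,
                 k_v=100.0, k_s=1.0):
    """Sum hinge violations over buses and lines for one control step."""
    pv = sum(max(0.0, v - v_max) + max(0.0, v_min - v) for v in voltages)
    ps = sum(max(0.0, s - cap) for s, cap in zip(flows, s_max or []))
    return k_v * pv + k_s * ps
```

Because the penalty is zero inside the feasible region, it shapes exploration without distorting the reward of constraint-satisfying dispatch.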
3.7. Learning Objects and Cryptographic Commitments
Each agent $i$ maintains local DRL model parameters $\theta_i$ (e.g., actor and critic networks) and the VPP maintains a global model $\theta^{g}$. In each federated round $t$, the global model is broadcast, agents perform local updates using private experience, and each agent produces a model update
$$\Delta\theta^{t}_i = \theta^{t}_i - \theta^{g,t}. \quad (13)$$
To enable the later verification of correct aggregation while preserving privacy, we associate each update with a cryptographic commitment
$$c^{t}_i = \mathsf{Com}\big(\Delta\theta^{t}_i\big), \quad (14)$$
where $\mathsf{Com}(\cdot)$ denotes a collision-resistant commitment (e.g., a hash of a canonical serialization or a Merkle root when updates are chunked). At the beginning of each round, participating agents bind the participant set by publishing signed receipts $\sigma^{t}_i$ over a round identifier $\mathrm{rid}_t$, and the aggregator computes a public digest
$$h_t = H\big(\{\sigma^{t}_i\}_{i \in \mathcal{P}_t}\big), \quad (15)$$
where $\mathcal{P}_t$ is the set of participants in round $t$. These objects are public inputs to the zero-knowledge verification layer in later sections and make update dropping or weight manipulation externally detectable.
3.8. Notation
Table 1 summarizes key symbols used throughout the system model and the zk-FRL formulation.
4. The zk-FRL Framework
This section presents the proposed zero-knowledge federated reinforcement learning (zk-FRL) framework. We first define the local learning problem solved by each DER agent as a multi-objective Markov decision process (MDP) aligned with the system model in
Section 3. We then describe the federated reinforcement learning loop used to train a shared policy without centralizing raw data. Finally, we introduce the zero-knowledge verification layer that removes the need to trust the aggregator by making the aggregation step publicly verifiable while keeping raw measurements and local experience private.
Figure 1 summarizes the main stages of the proposed zk-FRL protocol, from round enrollment and local update commitment to receipt-bound aggregation, proof generation, and local proof verification.
4.1. Local Agent Learning Problem
Each DER agent interacts with its local environment at the control interval and learns a policy that maps local observations to a BESS power setpoint. We formulate the local control problem as an MDP with discount factor $\gamma \in (0, 1)$.
At time $t$, agent $i$ observes a local state vector consistent with (1). For compactness, we write
$$s_{i,t} = o_{i,t} = \big(\mathrm{SoC}_{i,t},\; P^{\mathrm{PV}}_{i,t},\; P^{\mathrm{L}}_{i,t},\; f_t,\; \lambda^{\mathrm{E}}_t,\; \lambda^{\mathrm{AS}}_t,\; \xi_t\big). \quad (16)$$
This representation captures the internal BESS state and the external signals required for market participation and constraint-aware control.
The action is the BESS charge or discharge power setpoint
$$a_{i,t} = P_{i,t} \in [-\overline{P}_i, \overline{P}_i], \quad (17)$$
with feasibility further constrained by (4) and (5). A continuous-control DRL algorithm such as Soft Actor–Critic can be used to learn the policy, but the zk-FRL protocol does not depend on a specific DRL choice.
The local learning objective is aligned with VPP profit and operational feasibility. We define the instantaneous reward as a weighted combination of revenue, degradation cost, and penalties:
$$r_{i,t} = w_{R} \, \Pi_{i,t} - w_{D} \, C^{\mathrm{deg}}_{i,t} - w_{P} \, C^{\mathrm{pen}}_{i,t}, \quad (18)$$
where $\Pi_{i,t}$ is the market-facing revenue component (e.g., (8) and (9)), $C^{\mathrm{deg}}_{i,t}$ is the degradation cost (e.g., (10)), and $C^{\mathrm{pen}}_{i,t}$ aggregates penalties that enforce feasibility and performance requirements. The penalty term is instantiated as
$$C^{\mathrm{pen}}_{i,t} = \kappa_e \, |e_t| + C^{\mathrm{grid}}_t, \quad (19)$$
where $e_t$ is the ancillary-service tracking error defined in (7), and $C^{\mathrm{grid}}_t$ is the grid-violation penalty in (12). The coefficient $\kappa_e$ regulates compliance strictness, while $w_R$, $w_D$, and $w_P$ tune the strategic balance between revenue and asset preservation. The agent aims to maximize the expected discounted return
$$J_i(\pi) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} \, r_{i,t} \right] \quad (20)$$
under the transition dynamics implied by (4) and the power-flow-based grid response.
The reward weights $w_R$, $w_D$, and $w_P$ are selected according to three practical principles. First, $w_R$ is kept as the dominant coefficient because market-facing revenue is the primary objective of VPP coordination. Second, $w_D$ and $w_P$ are chosen to be large enough to suppress economically favorable but operationally undesirable behaviors, such as excessive battery cycling or repeated constraint violations. Third, the coefficients are tuned so that the major reward components remain of comparable numerical scale during training, which improves optimization stability and prevents a single term from overwhelming the learning signal. In practice, we first normalize the observed magnitudes of the revenue, degradation, and penalty terms on a representative training set, and then select the final coefficients through coarse grid search to achieve a balanced trade-off among profitability, battery preservation, and operational compliance.
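The reward assembly of (18)–(19) reduces to a short function. The weight values below are illustrative placeholders, not the tuned coefficients used in the experiments:

```python
# Reward of Eqs. (18)-(19): weighted revenue minus degradation and penalties.

def reward(revenue, deg_cost, track_err_kw, grid_pen,
           w_r=1.0, w_d=0.5, w_p=0.5, k_e=0.01):
    """Instantaneous reward for one agent at one control step."""
    penalty = k_e * abs(track_err_kw) + grid_pen   # Eq. (19)
    return w_r * revenue - w_d * deg_cost - w_p * penalty
```

Because revenue enters with the dominant weight, the agent remains profit-seeking, while the penalty terms activate only when tracking or grid constraints are violated.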
4.2. Federated Reinforcement Learning Loop
To improve policy quality without sharing raw DER data, agents collaborate through a federated reinforcement learning process. Each federated round aggregates locally trained model parameters into a global model that is then redistributed. Let $\theta^{g,t}$ denote the global parameters broadcast at the start of round $t$. Each participating agent $k$ initializes its local parameters as $\theta^{t}_k \leftarrow \theta^{g,t}$, collects experience using its local environment, and performs $E$ local update epochs. The agent then computes an update $\Delta\theta^{t}_k = \theta^{t}_k - \theta^{g,t}$, where $\theta \in \mathbb{R}^{D}$ represents the flattened vector of all trainable parameters (weights and biases) across the aggregated networks, and transmits it to the aggregator. In this work, $\theta$ denotes the actor parameter vector that is shared and aggregated across agents, whereas critic parameters are trained locally and are not included in $\Delta\theta^{t}_k$. The total dimension $D$ is a critical factor in the subsequent zero-knowledge proof generation, as the number of constraints in the zk-SNARK circuit scales linearly with the product of $D$ and the number of participants $|\mathcal{P}_t|$.
Under standard FedAvg with uniform weights, the aggregator computes
$$\theta^{g,t+1} = \theta^{g,t} + \frac{1}{|\mathcal{P}_t|} \sum_{k \in \mathcal{P}_t} \Delta\theta^{t}_k, \quad (21)$$
where $\mathcal{P}_t$ denotes the participant set in round $t$. This loop enables collaboration but, by itself, requires trust in the aggregator to faithfully implement (21) over the correct set of updates.
4.3. Threat Model and Trustless Aggregation
The objective of zk-FRL is to remove the need to trust the aggregator in the aggregation step. We consider a DER population that follows the protocol and an aggregator that may be malicious or compromised. The aggregator may deviate from (21) by selectively dropping a subset of received updates, manipulating aggregation weights, replaying a stale global model, or injecting arbitrary parameters into $\theta^{g,t+1}$ before broadcasting it. The core design requirement is that such deviations are externally detectable by the DER agents without revealing private updates or raw data.
The threat model is intentionally scoped to aggregator-side deviations from the prescribed federated aggregation protocol: DER participants are assumed to execute local training and submission as specified, while the aggregator may be untrusted, compromised, or strategically malicious. Attacks originating from DER participants themselves, such as Byzantine updates, data or model poisoning, and Sybil behavior, are outside the present scope. These threats remain important in federated learning, but they are orthogonal to the aggregation-integrity problem addressed here and would require robust aggregation or participant-authentication mechanisms in addition to the proposed zk-SNARK layer. Accordingly, zk-FRL can detect incorrect aggregation, update dropping, stale replay, and replacement-model injection by an untrusted aggregator, but it does not by itself prevent malicious DER participants from submitting poisoned yet syntactically valid updates. In a realistic VPP deployment, the framework therefore protects aggregation integrity and auditability, while robust aggregation, client reputation, or admission-control mechanisms are still required to address participant-side attacks.
To make update dropping and weight manipulation well defined and auditable, zk-FRL binds the participant set per round and binds each update to a public commitment:
At the beginning of round $t$, each participating agent $k$ signs a round identifier $\mathrm{rid}_t$ to produce a participation receipt $\sigma^{t}_k$. The aggregator computes a public digest $h_t$ as in (15). This digest commits the participant set for round $t$.
After local training, agent $k$ computes a commitment $c^{t}_k$ as in (14). The set of commitments binds the concrete updates that must be used in aggregation.
For auditability, the aggregator publishes the ordered list of commitments (or a Merkle root over the list) as a public input to verification, binding each declared participant to exactly one committed update in round t.
The receipt digest certifies integrity conditional on the declared participant set for round $t$: if an agent's receipt and commitment are included in $h_t$, then any subsequent dropping or weight manipulation over that declared set becomes externally detectable. This mechanism, by itself, does not prevent an untrusted aggregator from selectively excluding agents during round enrollment (e.g., by not relaying $\mathrm{rid}_t$ to some agents or by omitting their receipts before forming $h_t$). We therefore assume an auditable enrollment channel in which the round identifier and enrollment window are publicly announced (e.g., a bulletin-board-style broadcast), so that an agent can detect exclusion when it has produced a valid receipt within the enrollment window but does not observe its inclusion in the published digest $h_t$. Handling exclusion disputes (e.g., via re-enrollment, redundant coordinators, or governance mechanisms) is orthogonal to verifiable aggregation correctness and is not the focus of this work.
The aggregator then publishes $\theta^{g,t+1}$ together with a zero-knowledge proof that the aggregation was performed correctly over the receipt-bound committed updates.
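The Merkle binding used for the published round data can be sketched with an ordinary hash; a deployed circuit would use a SNARK-friendly primitive such as Poseidon instead of SHA-256, and the odd-node duplication rule below is one common convention, not the paper's specification:

```python
import hashlib

# Sketch of Merkle root construction and authentication-path verification
# for the receipt/commitment lists bound to a round.

def _h(b):
    return hashlib.sha256(b).digest()

def _pad(level):
    return level + [level[-1]] if len(level) % 2 else level

def merkle_root(leaves):
    level = [_h(x) for x in leaves]
    while len(level) > 1:
        level = _pad(level)
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_path(leaves, idx):
    """Sibling hashes (with side flags) authenticating leaves[idx]."""
    level, path = [_h(x) for x in leaves], []
    while len(level) > 1:
        level = _pad(level)
        sib = idx ^ 1
        path.append((level[sib], sib < idx))   # True if sibling is on the left
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        idx //= 2
    return path

def verify_path(leaf, path, root):
    node = _h(leaf)
    for sib, sib_is_left in path:
        node = _h(sib + node) if sib_is_left else _h(node + sib)
    return node == root
```

A DER thus checks membership of its own receipt/commitment pair with a path of logarithmic length, which is the same check the circuit performs internally for every declared participant.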
This threat model differs from the threat models most commonly emphasized in the federated learning literature. Standard FL security analyses often focus on malicious or Byzantine clients, poisoning attacks, inference attacks, or privacy leakage during update exchange. In contrast, our work focuses on the trust assumption placed on the central aggregator itself. This distinction is particularly relevant in VPP coordination, where the aggregator is both a computational coordinator and an operationally influential entity whose deviation can silently bias the learned policy and affect economic performance as well as grid compliance. Therefore, the proposed threat model is not intended to replace existing FL threat models, but rather to complement them by isolating and formalizing the aggregation-integrity risk that is especially critical in cyber-physical energy systems.
4.4. Zero-Knowledge Verification Layer
We use a zk-SNARK to provide succinct proofs and fast verification. For each round $t$, the aggregator acts as a prover and each DER can independently act as a verifier. The statement proven in round $t$ is that the published $\theta^{g,t+1}$ is exactly the result of applying the agreed aggregation rule to the prior global model and the set of updates whose commitments correspond to the receipt-bound participant set.
The public inputs to verification include the round identifier $\mathrm{rid}_t$, the participant digest $h_t$, the prior and new global parameters $\theta^{g,t}$ and $\theta^{g,t+1}$, and the list of commitments $\{c^{t}_k\}_{k \in \mathcal{P}_t}$. The witness includes the concrete updates $\{\Delta\theta^{t}_k\}$ and the receipts that open to $h_t$. The proof enforces that each committed update is the one used in the aggregation and that the aggregation arithmetic is correct.
To keep the proving circuit practical while preserving verifiability, zk-FRL separates checks that are enforced inside the zk-SNARK circuit from those enforced outside the circuit by each DER verifier:
Inside the circuit (cryptographically proven). The prover supplies the concrete updates as private witnesses. The circuit (i) recomputes the commitment digest for each update using a SNARK-friendly hash (e.g., Poseidon) and checks equality against the published commitments $c^{t}_k$, (ii) enforces the membership of each receipt–commitment pair $(\sigma^{t}_k, c^{t}_k)$ in the declared round data via a Merkle-path verification against the published root(s) (e.g., for receipts and an optional commitment-list root), and (iii) enforces that the published $\theta^{g,t+1}$ equals the prescribed aggregation rule applied to $\theta^{g,t}$ and the witness updates under the fixed-point encoding and range constraints.
Outside the circuit (locally verified by DERs). The digital-signature validity of receipts is verified by each DER using the public keys of participating agents, and the DER checks that the round identifier and enrollment window match the publicly announced round metadata. These checks are standard and inexpensive in native execution, and keeping signature verification outside the circuit avoids a substantial increase in SNARK constraint count.
This split ensures that (a) aggregation arithmetic and update-to-commitment binding are provably correct, while (b) signature authenticity and enrollment metadata are verified without reintroducing trust in the aggregator.
For implementation clarity, the proving circuit is organized into three modules. The first module enforces update-to-commitment consistency: for each participant $k$, the witness update $\Delta\theta^{t}_k$ is re-encoded in fixed-point form, hashed with a SNARK-friendly primitive, and checked against the published commitment $c^{t}_k$. The second module verifies round membership by checking that each receipt/commitment pair belongs to the published round data through a Merkle authentication path against the public root(s). The third module enforces aggregation correctness: for every model coordinate $d \in \{1, \ldots, D\}$, the circuit checks that the encoded global parameter satisfies
$$\tilde{\theta}^{g,t+1}[d] = \tilde{\theta}^{g,t}[d] + \sum_{k \in \mathcal{P}_t} \tilde{w}_k \, \widetilde{\Delta\theta}^{t}_k[d],$$
where tildes denote fixed-point encodings, together with range constraints that ensure all encoded values remain within predetermined bounds. Under uniform FedAvg, $\tilde{w}_k$ reduces to the common encoded averaging coefficient $1/|\mathcal{P}_t|$.
Neural-network parameters are represented in floating point, whereas zk-SNARK circuits operate over a finite field. We therefore use a fixed-point encoding for updates and parameters within the proving circuit. Let $q$ denote the fixed-point scale, and let $\tilde{x} = \lfloor q x \rceil$ denote the integer encoding of a real-valued scalar $x$. The circuit enforces predetermined bounds on encoded values to prevent overflow and to make the arithmetic well defined.
In our implementation, the default scale is chosen to balance proving cost and numerical precision. Importantly, fixed-point quantization may affect optimization dynamics in reinforcement learning; therefore, we empirically evaluate sensitivity to $q$ by sweeping multiple scales and reporting the resulting learning curves and steady-state performance in Section 6. This sensitivity study quantifies the practical trade-off between verification overhead and control performance under the same co-simulation settings.
In addition to the default fixed-point scale, we considered nearby precision settings to assess numerical sensitivity. Under moderate changes in precision, the qualitative convergence behavior and benign-performance ranking remained unchanged, indicating that the learning results are not driven by a single quantization choice.
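The per-coordinate circuit check can be replicated outside the circuit for testing. The scale `Q = 2**16`, the integer-division rounding, and the one-unit tolerance are illustrative assumptions, not the paper's configuration:

```python
# Fixed-point replication of the per-coordinate aggregation check under
# uniform FedAvg. Assumed scale and rounding; real circuits work mod a prime.

Q = 2 ** 16  # assumed fixed-point scale q

def encode(x, q=Q):
    """Integer encoding of a real scalar at scale q (round-to-nearest)."""
    return round(x * q)

def check_aggregation(theta_old, theta_new, updates, q=Q, tol=1):
    """True if theta_new matches theta_old + mean(updates) in encoded form."""
    n = len(updates)
    for d in range(len(theta_old)):
        delta_sum = sum(encode(u[d], q) for u in updates)
        expected = encode(theta_old[d], q) + delta_sum // n
        if abs(encode(theta_new[d], q) - expected) > tol:
            return False
    return True
```

Such a reference check is useful for validating that the circuit's fixed-point arithmetic agrees with the floating-point FedAvg result to within the quantization tolerance.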
Let $D$ denote the number of aggregated parameters and let $|\mathcal{P}_t|$ denote the number of participants in round $t$. The total number of constraints can be expressed at a high level as
$$N_{\mathrm{cons}} \approx |\mathcal{P}_t| \, D \, \big( c_{\mathrm{hash}} + c_{\mathrm{agg}} + c_{\mathrm{range}} \big) + |\mathcal{P}_t| \, c_{\mathrm{mkl}} \log |\mathcal{P}_t| + c_{\mathrm{pub}},$$
where $c_{\mathrm{hash}}$ captures the per-coordinate or per-chunk commitment-hash cost, $c_{\mathrm{agg}}$ captures the fixed-point aggregation arithmetic, $c_{\mathrm{range}}$ captures bound enforcement, $c_{\mathrm{mkl}}$ is the cost per Merkle-path level, and $c_{\mathrm{pub}}$ accounts for public-output consistency checks. Accordingly, the dominant term scales linearly with $D \, |\mathcal{P}_t|$, while the membership-verification overhead grows as $|\mathcal{P}_t| \log |\mathcal{P}_t|$. This scaling law is the basis for the overhead trends reported later in Section 6.
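The scaling expression above can be turned into a back-of-envelope estimator. All per-operation constants below are hypothetical placeholders chosen only to illustrate the linear dominance of the $D \, |\mathcal{P}_t|$ term:

```python
import math

# Back-of-envelope constraint-count model for the zk-SNARK circuit.
# c_* values are illustrative placeholders, not measured costs.

def constraint_estimate(D, P, c_hash=300, c_agg=2, c_range=64,
                        c_mkl=300, c_pub=1000):
    """Estimate total constraints for D parameters and P participants."""
    linear = P * D * (c_hash + c_agg + c_range)            # dominant term
    membership = P * c_mkl * math.ceil(math.log2(max(P, 2)))
    return linear + membership + c_pub
```

Doubling $D$ roughly doubles the estimate, while growing the fleet adds only a slowly growing membership overhead on top of the linear term, matching the reported trend.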
Each DER verifies the proof using the public inputs. If verification succeeds, the DER accepts $\theta^{g,t+1}$ as the valid global model for round $t$. If verification fails, the DER rejects the update and retains its last verified model, preventing silent degradation due to aggregator deviations.
4.5. Protocols
Algorithm 1 summarizes the standard federated averaging loop used as a baseline. Algorithm 2 presents the zk-FRL protocol for a single round, which augments the baseline with participant binding, update commitments, and a zk-SNARK proof of correct aggregation. The protocol does not require the DERs to reveal raw data or to trust the aggregator, and it ensures that deviations from the prescribed aggregation rule are detectable by local verification.
Algorithm 1: Federated Averaging (FedAvg) Process. [Algorithm pseudocode figure]
Algorithm 2: zk-FRL Protocol (Single Round, Aggregator-Trustless). [Algorithm pseudocode figure]
5. Implementation
To evaluate the proposed zk-FRL framework under realistic cyber-physical conditions, we implement a power–market–learning co-simulation platform that couples distribution power flow, ancillary-service settlement, federated reinforcement learning, and a zero-knowledge verification layer.
Figure 2 shows the system architecture of the proposed zk–FRL co–simulation platform. The goal of this section is to make the experimental pipeline reproducible by stating all model interfaces, time scales, baselines, and measurement procedures.
5.1. Co-Simulation Platform
The platform integrates four modules that exchange signals at a fixed control interval and a slower federated round interval:
Power system module: We use the IEEE 33-bus radial feeder as the distribution network and run an AC power flow at each control step to obtain bus voltages and line flows. The feeder is populated with DER agents at a subset of buses selected to provide geographically distributed flexibility across upstream, mid-feeder, and downstream locations; unless otherwise stated, the DER-equipped buses are fixed across experiments. The deployed BESS units are heterogeneous, with energy capacities drawn from 80–300 kWh, maximum charge/discharge powers drawn from 40–120 kW, and initial state-of-charge values sampled from 0.4–0.7, while the SoC operating limits and charge/discharge efficiencies are set to common values across units. Net injections are the superposition of base load, PV generation, and BESS actions. Base load and PV injections follow representative diurnal profiles mapped to the corresponding buses. These heterogeneous settings are used to reflect realistic diversity among DER assets within a practical VPP.
Market and settlement module: The market environment is modeled as a discrete-time, time-varying ancillary-service process that provides a regulation command, an energy price, and an ancillary-service price. The VPP commits a regulation capacity per settlement interval. The module computes the aggregate tracking error via (7), settles revenue using (8) and (9), and produces the market-defined performance score used in settlement. Unless otherwise stated, the energy price is sampled from a diurnal time-of-use profile in the range of 20–80 USD/MWh, while the ancillary-service price is sampled from 10–40 USD/MW per settlement interval. The committed regulation capacity is selected within the feasible aggregate BESS capability of the VPP for the corresponding interval.
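A minimal per-interval settlement sketch is shown below. The linear performance score and the revenue decomposition are simplifying assumptions standing in for (7)–(9), which define the exact expressions.

```python
def settle_interval(command_kw: float, response_kw: float, capacity_kw: float,
                    price_as: float, price_e: float, net_energy_mwh: float):
    """Illustrative settlement for one interval (assumed forms of (7)-(9))."""
    # Aggregate tracking error: deviation of the delivered response from the command.
    err_kw = abs(response_kw - command_kw)
    # Performance score: perfect tracking scores 1, degrading linearly with
    # error normalized by the committed capacity (an assumed form).
    score = max(0.0, 1.0 - err_kw / max(capacity_kw, 1e-9))
    # Revenue: capacity payment scaled by performance, minus net energy cost.
    capacity_mw = capacity_kw / 1000.0
    revenue = price_as * capacity_mw * score - price_e * net_energy_mwh
    return err_kw, score, revenue
```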
Learning and communication module: Each DER agent runs a continuous-control DRL algorithm (Soft Actor–Critic, SAC) using local observations from (16) and rewards in (18) and (19). The actor network is a Multi-Layer Perceptron (MLP) with two hidden layers, each containing 256 neurons with Rectified Linear Unit (ReLU) activations; the input dimension is 7 and the output dimension is 1, corresponding to the BESS power setpoint. In our federated setup, we aggregate only the actor parameters, while each agent trains its critic(s) locally using private experience replay. Accordingly, the aggregated model dimension equals the actor parameter count of approximately 68,000 (68,097 weights and biases for a 7–256–256–1 MLP), providing a realistic benchmark for the zk-SNARK circuit complexity. Agents exchange model parameters or updates with the aggregator only once per federated round, and communication latency is modeled as a bounded delay added to the round wall-clock time. Unless otherwise noted, SAC uses fixed actor and critic learning rates, a fixed discount factor and soft target update coefficient, a fixed replay-buffer size, a mini-batch size of 256, and a fixed number of local update epochs per federated round; the entropy-temperature coefficient is tuned automatically following the standard SAC implementation. Exploration is induced by the stochastic SAC policy. Training is run for a fixed number of federated rounds, and convergence is assessed empirically by the stabilization of moving-average reward and profit curves over multiple seeds rather than by a formal stopping guarantee.
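The actor-only aggregation described above reduces to averaging equal-length parameter vectors. A minimal sketch follows, together with the actor parameter count implied by the stated 7–256–256–1 architecture; the function shape is illustrative.

```python
def fedavg(updates, weights=None):
    """Average flattened actor parameter vectors (uniform FedAvg by default)."""
    n = len(updates)
    weights = weights or [1.0 / n] * n
    d = len(updates[0])
    return [sum(w * u[i] for w, u in zip(weights, updates)) for i in range(d)]

# Actor parameter count of the 7-256-256-1 MLP (weights + biases per layer):
D = (7 * 256 + 256) + (256 * 256 + 256) + (256 * 1 + 1)  # = 68,097
```

Only vectors of this dimension D are exchanged each round; critic parameters and replay buffers never leave the DER.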
Cryptographic verification module: zk-FRL augments standard federation with signed participation receipts, update commitments, and a zk-SNARK proof of correct aggregation. Updates are encoded in fixed point with scale q and range constraints, consistent across rounds; commitments are instantiated with a SNARK-friendly hash, and Merkle authentication is used for round-data membership proofs. The Groth16 proving system is used throughout. We measure proof generation time at the aggregator and verification time at each DER, and report proof generation time, verification time, proof size, and key sizes under the same hardware/software environment described in Section 5.6.
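The fixed-point encoding with range constraints can be sketched as below. The choice of 16 fractional bits, the 32-bit range, and the SHA-256 stand-in for the SNARK-friendly commitment hash (Poseidon in typical Circom circuits) are all illustrative assumptions.

```python
import hashlib

Q_BITS = 16          # fractional bits; the scale q is 2**Q_BITS (illustrative)
RANGE_BITS = 32      # each encoded value must fit in a signed 32-bit range

def encode_fx(x: float) -> int:
    """Encode a float into fixed point and enforce the range constraint."""
    v = round(x * (1 << Q_BITS))
    if not (-(1 << (RANGE_BITS - 1)) <= v < (1 << (RANGE_BITS - 1))):
        raise ValueError("range constraint violated")
    return v

def decode_fx(v: int) -> float:
    return v / (1 << Q_BITS)

def commit(update, salt: bytes) -> bytes:
    """Commitment over an encoded update (SHA-256 stand-in for a SNARK-friendly hash)."""
    h = hashlib.sha256(salt)
    for v in update:
        h.update(int(v).to_bytes(8, "big", signed=True))
    return h.digest()
```

In the real circuit the same encoding is re-checked inside the SNARK, so a prover cannot claim an out-of-range or inconsistently scaled value.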
5.2. Time Scales and Training Schedule
We adopt a two-time-scale schedule. Control actions are applied at a fixed control interval (e.g., seconds to minutes depending on the product), while a federated learning round occurs every H control steps. In each round, participating agents collect experience over the most recent window, perform E local update epochs, and submit an update and a commitment to the aggregator. The aggregator computes the candidate global model using (21) and, under zk-FRL, attaches a proof of correct aggregation.
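The two-time-scale schedule above can be sketched as a single loop. The symbols H and E follow the text; the event encoding is illustrative.

```python
def two_time_scale_events(total_steps: int, H: int, E: int):
    """Emit control-step and federated-round events (illustrative scheduler).

    A control action fires every step; every H control steps a federated
    round fires, in which each agent performs E local update epochs before
    submitting its update and commitment.
    """
    events = []
    for t in range(total_steps):
        events.append(("control", t))
        if (t + 1) % H == 0:
            events.append(("fed_round", t, E))
    return events
```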
5.3. Baselines
We benchmark zk-FRL against three control strategies:
Centralized DRL (Oracle): A single controller has access to all DER states and actions and trains a centralized policy. This provides an upper-bound reference for economic performance but is impractical due to privacy and communication constraints.
Independent DRL: Each DER agent trains locally without any federation and optimizes only its local reward. This captures the cost of non-cooperation.
Standard Federated RL: Agents train locally and the aggregator performs FedAvg without any verification layer. This baseline isolates the security gap introduced by trusting the aggregator.
The selected baselines are intended to isolate three distinct gaps: the collaboration gap (independent DRL versus federated learning), the trust gap (standard FRL versus zk-FRL), and the observability gap (centralized oracle versus practical decentralized coordination). Robust or secure aggregation baselines are complementary because they primarily address malicious clients rather than an untrusted aggregator, while optimization-based VPP dispatch remains an important model-based reference but represents a different control paradigm from learned adaptive coordination.
5.4. Attack Suite and Robustness Settings
To evaluate robustness to malicious aggregator behavior, we define an attack suite that targets the aggregation step. Starting at a designated attack round, the aggregator deviates from the protocol by applying one of the following actions before broadcasting the global model:
Update dropping: randomly drop a fraction of committed updates while still claiming the same participant digest.
Weight manipulation: apply biased weights that deviate from uniform averaging while claiming uniform aggregation.
Stale replay: replay a previously broadcast global model as if it were fresh.
Model injection: publish an arbitrary parameter vector that is not the result of the prescribed aggregation.
Under zk-FRL, all deviations above are detectable at verification time by design, so agents reject the invalid global model and retain the last verified parameters.
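For concreteness, the four deviations can be modeled as below. The function shape and the drop probability are illustrative; under zk-FRL, each of these outputs fails proof verification against the committed round data.

```python
import random

def malicious_aggregate(updates, attack: str, prev_global=None, seed: int = 0):
    """Illustrative aggregator deviations from uniform FedAvg (attack suite)."""
    rng = random.Random(seed)
    d = len(updates[0])

    def avg(vectors):
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(d)]

    if attack == "drop":        # drop a random fraction of committed updates
        kept = [u for u in updates if rng.random() > 0.5] or updates[:1]
        return avg(kept)
    if attack == "bias":        # biased weights while claiming uniform averaging
        w = [2.0] + [0.5] * (len(updates) - 1)
        s = sum(w)
        return [sum(wi * u[i] for wi, u in zip(w, updates)) / s for i in range(d)]
    if attack == "replay":      # re-broadcast a stale global model as fresh
        return list(prev_global)
    if attack == "inject":      # publish an arbitrary parameter vector
        return [rng.uniform(-1.0, 1.0) for _ in range(d)]
    raise ValueError(f"unknown attack: {attack}")
```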
5.5. Evaluation Metrics
We report four groups of metrics:
Economic performance: total VPP profit (USD/day), decomposed into ancillary-service revenue, energy cost, and monetized degradation.
Technical performance: ancillary-service tracking quality measured by the mean absolute error (MAE) of the aggregate VPP response relative to the regulation command, and grid feasibility measured by voltage-violation magnitude aggregated over buses and time.
Security and robustness: relative degradation of profit and tracking performance under each attack type and strength, compared to benign operation.
Computational overhead: proof generation time at the aggregator and verification time at each DER, as a function of fleet size and model dimension.
All learning results are averaged over multiple random seeds, and we report mean and standard deviation.
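The tracking and feasibility metrics above can be computed as in the following sketch; the 0.95–1.05 p.u. voltage band is an illustrative default rather than a value stated in the text.

```python
def tracking_mae(responses, commands):
    """Mean absolute tracking error of the aggregate response vs. the command."""
    return sum(abs(r - c) for r, c in zip(responses, commands)) / len(commands)

def voltage_violation(v_pu_by_bus, v_min: float = 0.95, v_max: float = 1.05):
    """Violation magnitude aggregated over buses and time (per-unit volts)."""
    return sum(max(0.0, v_min - v) + max(0.0, v - v_max)
               for series in v_pu_by_bus for v in series)
```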
5.6. Experimental Environment and Hardware
The power–cyber co-simulation and the cryptographic operations are executed on a high-performance workstation equipped with an AMD Ryzen 9 7950X CPU (16 cores, 5.7 GHz) and 128 GB of DDR5 RAM. The reinforcement learning agents are trained using PyTorch 2.1, and the distribution grid power flow is solved via OpenDSS v10.2.0.1. For the zero-knowledge verification layer, we utilize the Circom 2.1 compiler to define the aggregation circuits and the SnarkJS library to implement the Groth16 zk-SNARK protocol. The reported proof generation times account for the full witness generation and proving phase, averaged over 10 independent trials per fleet size.
6. Performance Evaluation
This section reports experimental results produced by the co-simulation pipeline in
Section 5. The results are organized into benign performance, robustness to malicious aggregation, and cryptographic overhead. Unless otherwise stated, each point is averaged over five random seeds and the shaded band indicates one standard deviation.
6.1. Economic and Technical Performance
Figure 3 compares learning curves of total VPP profit under a benign aggregator, with centralized DRL providing an upper-bound reference. Under the tested IEEE 33-bus co-simulation setting, both standard federated RL and zk-FRL closely track this reference, with zk-FRL achieving near-centralized performance under benign aggregation. We emphasize that this result provides encouraging evidence within the studied scenario rather than establishing general superiority under all deployment conditions or larger-scale real-world systems. Notably, the transition from 32-bit floating-point parameters (Standard FRL) to 16-bit fixed-point encoding (zk-FRL) did not result in observable performance degradation or convergence instability. This empirical evidence indicates that the policy is robust to the precision truncation required for zero-knowledge verification, as the coordination benefits of federation significantly outweigh the negligible quantization error.
The steady-state economic and technical performance metrics, averaged over the final evaluation horizon, are summarized in
Table 2. These results represent the mean and standard deviation across five independent simulation runs conducted in the power–cyber co-simulation environment detailed in
Section 5.
We additionally varied the degradation coefficients around the default setting and observed that the qualitative ranking of the compared methods remained unchanged, indicating that the main conclusions are not driven by a single coefficient choice.
Reward-Weight Sensitivity Analysis: We further examined the sensitivity of the learned policy to the reward-weight configuration by varying the relative emphasis on degradation cost and penalty terms around the default setting while keeping the overall reward structure unchanged. The results show that the proposed zk-FRL framework remains stable under moderate weight variations. Increasing the degradation weight leads to more conservative battery usage and lower cycling stress, but slightly reduces total profit. Increasing the penalty weight improves tracking and grid-feasibility compliance, but overly large penalties reduce economic performance by discouraging flexible dispatch. Conversely, when the degradation or penalty weights are set too small, the learned policy becomes more aggressive and may obtain slightly higher short-term revenue at the expense of larger tracking error, greater constraint pressure, and higher battery wear. Overall, the default weight setting provides the best balance among economic performance, battery health, and operational feasibility, which is consistent with the intended design of the multi-objective reward.
6.2. Security and Robustness Analysis
We evaluate robustness by enabling a malicious aggregator at a designated attack round. We first present the update-dropping attack (with a fixed drop fraction) to illustrate the failure mode of trust-based federation, then summarize results across the broader attack suite.
Figure 4 illustrates a representative update-dropping scenario starting at the attack round. Under Standard FRL, a malicious aggregator can deviate from the prescribed aggregation rule while still broadcasting a syntactically valid global model, which may lead to degraded learning dynamics. In zk-FRL, any deviation that violates the receipt-bound commitment set or the aggregation arithmetic causes the proof verification to fail at the DER side. In that case, each DER rejects the broadcast model and continues operating with its last verified parameters (i.e., a local "parameter-lock" fallback). This design changes the failure mode from silent model drift to explicit rejection, and limits the impact of aggregator-side manipulation to the quality of the most recent integrity-verified model.
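The client-side accept/reject rule can be sketched as follows; `verify_proof` stands in for Groth16 verification of the aggregation proof against the round commitments.

```python
def adopt_or_lock(last_verified_params, broadcast, verify_proof):
    """Adopt the broadcast global model only if its proof verifies; otherwise
    keep ("lock") the last verified parameters (illustrative client rule)."""
    if verify_proof(broadcast):
        return broadcast["params"], True    # adopt fresh verified model
    return last_verified_params, False      # explicit rejection, stay locked
```

In the co-simulation, the returned flag can also be logged to measure how long agents operate under locked parameters during an attack window.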
Beyond instantaneous robustness, we also evaluate the impact of attack duration and environmental drift on the “parameter-lock” fallback. Specifically, we vary (i) the number of consecutive malicious rounds after and (ii) the degree of non-stationarity in market and grid signals (e.g., price process and regulation command statistics) to quantify how long DERs can safely operate under the last verified parameters without re-training. This study characterizes a practical trade-off: rejecting invalid aggregation prevents silent poisoning, but prolonged rejection may slow adaptation when the underlying environment shifts. We report these results together with recovery behaviors (e.g., resuming federation once a valid proof is observed) in the following experiments.
Table 3 reports robustness across attack types using relative profit loss compared to benign steady state. We present results for both Standard FRL and zk-FRL, where zk-FRL may reject invalid broadcasts under aggregator deviation; the resulting performance reflects operation under the last verified parameters until a valid global model is available.
6.3. Scalability and Overheads of ZKPs
We evaluate the computational feasibility of the zk-FRL protocol by measuring the proving and verification wall-clock times across varying fleet sizes, as illustrated in Figure 5. For the aggregated actor model with dimension D (approximately 68,000 parameters), the measured proof generation time at the aggregator increases approximately linearly with the number of participating DERs, while the verification time at each DER remains lightweight and largely insensitive to fleet size. This verifier/prover asymmetry is a key advantage of zk-SNARKs: even resource-constrained DER devices can independently audit aggregation integrity with low latency. In our implementation, proving is performed at the federated-round time scale and does not block the sub-second physical control loop, since DERs continue operating with the last verified parameters until a valid global model is available.
These overheads are compatible with settings where federated rounds occur on the order of minutes (or longer), while sub-second control loops would require additional optimizations such as update quantization, batching, and hardware-assisted proving. In our implementation, the cryptographic layer does not affect the physical feasibility of control actions within a round because verification is performed before adopting the next global model, and agents can continue operating with the last verified parameters when verification fails. For reproducibility, we report the circuit- and system-level measurements underlying Figure 5, including (i) the total number of SNARK constraints and witness size per round as a function of the model dimension D and the number of participants, (ii) peak memory usage and wall-clock proving time at the aggregator (mean, standard deviation, and selected percentiles over repeated trials), (iii) the verification-time distribution at DERs, and (iv) the sizes of the proving and verification keys and the proof payload. We also specify the hash primitive used for commitments and Merkle membership in the circuit, the fixed-point scale q, and the hardware/software stack (compiler, proving system, and library versions) to enable independent replication.
The current scalability and performance conclusions should be interpreted within the tested feeder size, DER composition, communication settings, and market-signal regimes. While the reported results provide encouraging evidence under the considered IEEE 33-bus scenarios, broader studies involving larger feeders and more heterogeneous DER portfolios remain important future work before claiming deployment-level generality.
For reproducibility, we additionally report the total constraint count, proof size, proving and verification key sizes, peak memory usage, and the overhead breakdown between witness generation and proof generation. We also quantify the effect of fixed-point encoding by reporting the approximation error induced by the chosen scale q. These implementation details complement the asymptotic scaling discussion and clarify the concrete cost of the cryptographic layer.
Relative to representative zk-enabled FL frameworks, the proposed zk-FRL occupies a different point in the design space. In terms of computational overhead, our proof scope is restricted to aggregation correctness and round-data consistency, avoiding the substantially larger cost of proving full local-training correctness. In terms of privacy protection, zk-FRL preserves raw measurements and local trajectories at the DER side, but it does not aim to conceal model updates from the aggregator. In terms of robustness, the main benefit is strong protection against aggregator-side manipulation, including update dropping, weight tampering, stale replay, and model injection. This makes the framework particularly suitable for VPP coordination, where externally verifiable aggregation is more critical than end-to-end concealment of the learning process.
7. Final Remarks
This paper presented zk-FRL, a zero-knowledge federated reinforcement learning framework for large-scale VPP coordination in ancillary-service markets. The framework targets the practical trilemma faced by modern VPPs: achieving near-optimal economic performance while preserving DER raw-data privacy and maintaining operational security under realistic cyber threats. Our approach combines multi-objective DRL at the DER level with federated training that avoids centralizing fine-grained measurements and local experience trajectories, and adds a verification layer that makes the aggregation step externally auditable through signed round participation receipts, update commitments, and a zk-SNARK proof of correct aggregation under a fixed-point encoding with range constraints.
The evaluation in power–market–learning co-simulation indicates that, under benign conditions, zk-FRL attains economic performance close to a centralized reference while improving coordination relative to independent control. More importantly, zk-FRL changes the failure mode of federated learning under a compromised aggregator. When the aggregator deviates from the prescribed protocol by selectively dropping updates, manipulating weights, replaying stale models, or injecting arbitrary parameters, standard federated RL can drift silently and suffer sustained degradation. In contrast, zk-FRL enables each DER to detect such deviations by local verification and to reject invalid global models, thereby preventing the silent poisoning of the shared policy and bounding performance by the last verified model quality.
Beyond algorithmic performance, the framework supports a verifiable governance model for VPP operation. Because aggregation integrity is enforced cryptographically rather than procedurally, DER owners can participate without granting raw-data visibility (measurements and local experience trajectories) to a central coordinator, and operators can demonstrate protocol compliance without requiring access to those raw data streams. At the same time, the zk-SNARK layer enables independent verification that the published global model follows the prescribed aggregation rule over the declared round data; the framework targets verifiable aggregation integrity rather than concealing model updates from the aggregator. This property is aligned with emerging requirements for auditable control in critical infrastructure and suggests a pathway toward coordination mechanisms that avoid raw-data centralization while remaining independently verifiable.
Several directions remain open. Proof generation overhead at the aggregator is the primary scalability bottleneck, and reducing it will require system-level optimizations such as update quantization and batching, as well as hardware-assisted proving. Extending zk-FRL to incorporate more detailed degradation formulations and richer market products, including simultaneous participation in multiple ancillary services, is also important for deployment realism. Finally, while this work successfully addresses the “trust gap” at the aggregator level via verifiable aggregation, it does not inherently mitigate adversarial behavior originating from DER participants themselves, such as data or model poisoning. More broadly, the current conclusions should be interpreted as evidence within the tested feeder size, communication conditions, and market-signal regimes rather than as a claim of deployment-level generality. Extending the evaluation to larger feeders, stronger non-stationarity, and varying participation rates remains important future work. In a large-scale VPP, Byzantine participants could attempt to bias the global model by submitting malicious updates that are cryptographically valid but contain poisoned gradients intended to degrade fleet performance. Future research will focus on integrating our zk-SNARK framework with robust aggregation rules, including Krum, coordinate-wise median, and trimmed mean, together with Sybil-resistant identity management. This will provide a holistic security architecture capable of defending against both infrastructure-level breaches and participant-level adversaries, ensuring the long-term resilience of the trustless VPP coordination layer.