Article

Scenario-Guided Temporal Prototypes in Reinforcement Learning

1
Elektro Gorenjska d. d., Distribution System Operator, Ulica Mirka Vadnova 3a, 4000 Kranj, Slovenia
2
Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, 1000 Ljubljana, Slovenia
*
Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2026, 8(1), 21; https://doi.org/10.3390/make8010021
Submission received: 22 November 2025 / Revised: 10 January 2026 / Accepted: 14 January 2026 / Published: 16 January 2026
(This article belongs to the Section Learning)

Abstract

Deep reinforcement learning policies are hard to deploy in safety-critical settings because they fail to explain why a sequence of actions is taken. We introduce an intrinsically interpretable framework that learns compact summaries of recurring behavior and uses them for case-based decision making. Our method (i) discovers global regimes by grouping trajectories into a small set of recurrent patterns and (ii) learns a prototype-conditioned local policy that maps the current short-horizon pattern to an action (“this matches prototype X → take action Y”). Each action is accompanied by similarity scores to the relevant prototypes, which provide the explanations. We evaluate our approach on two domains: (1) CarRacing (pixel-based continuous control) and (2) a real voltage-control problem in low-voltage distribution networks. Our results indicate that the method provides clear pre hoc explanations while keeping task performance close to the reference policy.

1. Introduction

Deep reinforcement learning (DRL) has achieved strong empirical performance across different domains, from games to real-world control. In safety-critical settings, operators must be able to verify why an action was taken. Standard deep policies rarely provide such verifiable rationales. Post hoc explainability can highlight salient inputs or approximate policy behavior, but it seldom provides a faithful time-consistent explanation for why a particular sequence of actions is taken [1]. Electrification and connected mobility further increase the need for interpretable control, because the grid operation is increasingly coupled to mobility infrastructure (e.g., EV charging and IoV-enabled services). Recent IoV-based deep learning frameworks focus on extracting driving and mobility behavior patterns from large-scale data streams and highlight open challenges related to safety, robustness, and deployment [2]. These trends motivate reinforcement learning controllers whose decisions can be explained consistently over time rather than only post hoc. In sequential decision-making, explanations should reflect both global context (the current state of the agent) and how recent states justify the chosen action.
We address this gap by combining Scenario-Based eXplainability (SBX) [3] with temporal prototypes inspired by prototype-based policies [4]. First, we use SBX to discover global regimes by grouping trajectories into a small set of typical scenarios. Each scenario is represented by one or a few real example trajectories. For example, in the CarRacing domain, a scenario can be “driving straight” or “approaching a left turn” as shown in Figure 1; in the power-grid domain, it can be described as “normal operation with all voltages within limits” or “midday PV peak with high voltages at the feeder end”. We use these representative trajectories as temporal prototypes. Second, we build an interpretable policy that, at each decision step, compares the recent short-horizon history with a library of prototypes and estimates their similarities. The action is produced as a similarity-weighted combination of prototype influences. This yields concrete case-based explanations: the agent acts because the current situation most closely matches prototype X (and Y), which are associated with specific control patterns.
This yields a two-tier view:
  • Global layer (scenarios): SBX discovers a small number of recurring scenarios that describe how the system typically behaves over longer time spans.
  • Local layer (temporal prototypes): within this structure, temporal prototypes explain individual actions based on short recent histories: “this window looks like prototype #3, so we reduce consumption at these nodes”.
For example, in voltage control, the method provides the examples based on similar previously seen scenarios: “the last few time steps match the typical midday PV-peak prototype, therefore we activate flexibility primarily at end-of-line customers to prevent over-voltage”.
We evaluate our approach in two settings:
  • In CarRacing, a continuous-control domain with high-dimensional pixel-based observations, we show that temporal prototypes can preserve task fidelity while providing intuitive examples (e.g., turning right/turning left),
  • In a real-world voltage-control problem for power networks, we demonstrate how temporal prototypes reveal different operating regimes and local control patterns.
Across both settings, we quantify fidelity to a reference policy and illustrate the interpretability benefits through qualitative visualizations. Our contributions are as follows:
  • We introduce a framework that combines SBX for scenario discovery with temporal prototypes for exemplar-based explanations of DRL policies.
  • We provide a set of evaluation metrics that assess fidelity (task reward and action-level discrepancy) and prototype locality (nearest-neighbor coherence in an encoder embedding space).
  • We empirically validate the approach in continuous control and power-network voltage control, showing that it provides logical global structure and local explanations.
Prior prototype-based RL approaches typically explain actions using prototypes of single states (local explanations without temporal context), while scenario-based approaches summarize behavior at the trajectory level but do not explain each decision step. SGTP combines these two levels: SBX discovers a small set of global regimes and selects representative temporal medoids, and a prototype-mediated policy then produces step-wise actions as a function of similarity to these temporal prototypes. This yields time-consistent case-based explanations. To our knowledge, SGTP is the first framework that combines scenario-level trajectory clustering with prototype-mediated step-wise action explanations in deep reinforcement learning.
The paper is organized as follows: Section 2 reviews the related work on explainability in DRL and power systems. Section 3 summarizes the deep reinforcement learning background. Section 4 formalizes SBX for global scenario discovery, and Section 5 introduces the prototype-based interpretable policy mechanism. Section 6 combines the two into the proposed SGTP framework. Section 7 presents experiments and analyses, followed by a discussion and conclusions.

2. Related Work

Explainable Artificial Intelligence (XAI) facilitates the adoption of complex models by making their decision processes more understandable. Research on eXplainable Reinforcement Learning (XRL) can be broadly categorized into transparent (pre hoc) approaches and post hoc explanations [5]. Transparent approaches employ inherently interpretable models (e.g., prototype-based or rule-based structures), enabling direct inspection of their internal reasoning. Post hoc methods, by contrast, explain an already trained black-box model via auxiliary analyses without changing its architecture or training [6].
Post hoc explainability for deep RL often adapts ideas from supervised learning. Saliency-map style techniques attribute importance to input features or pixels and have been popular in vision settings [7]. Others leverage the agent’s interactions and time-local structure, for instance, framing interestingness around agent–environment interactions [8] or building self-explainable predictors of episode-level returns using structured probabilistic models with interpretable kernels [9]. Causal and agent-centric explanations have also been explored [10,11]. While such approaches can highlight relevant inputs or time steps, they typically provide explanations of a trained black-box policy rather than constraining the policy to reason with human-understandable components.
Prototype-based explanations provide an alternative that is interpretable by design. In supervised learning, prototype networks explain predictions via similarity to learned or human-selected exemplars [12,13]. Extending this paradigm to reinforcement learning, prototype-wrapper policies force decisions to be mediated by human-friendly prototypes; a recent example is the Prototype-Wrapper Network (PW-Net), which wraps a pre-trained agent and maps latent states to action decisions through prototype similarities [4]. Beyond interpretability, prototypes have been used to improve representation learning and exploration efficiency: Proto-RL pre-trains prototypical embeddings and uses prototype-driven intrinsic motivation to accelerate downstream policy learning in pixel-based control [14]. In model-based RL, prototypical context learning has also been explored for dynamics generalization, where learned prototypes summarize environment contexts and regularize latent world models for improved zero-shot generalization [15]. Complementary concept-based methods quantify human-interpretable concepts [16,17] in supervised learning. These directions are complementary to our aim: we target interpretability with temporal prototypes that capture the behavior of the agent.
Temporal and trajectory-level explanations in RL remain comparatively under-explored. Work on agents’ typical behaviors has begun to examine prototypical strategies [18], but most prior use of trajectories focuses on accelerating learning through demonstrations rather than explaining an agent’s actions [19]. Our study contributes to this gap by combining global scenario discovery with local time-resolved prototype explanations. We demonstrate the explanations in the power system domain, where safety is critical. In power systems and voltage control, explainability efforts have largely centered on post hoc feature attributions. For example, SHAP has been applied to quantify feature contributions in protective load shedding or voltage-related decisions [20], with DeepSHAP improving the computational efficiency [21]. Compared to single-step feature attribution, our approach explicitly models the temporal structure by discovering temporal prototypes to explain action sequences.
Positioning our contribution, (i) relative to saliency-style post hoc methods [7], we provide pre hoc exemplar-based reasoning; (ii) relative to self-explainable outcome predictors [9], we target the policy’s action decisions directly; and (iii) relative to prior prototype-learning and trajectory work [12,13,18,19], we emphasize scenario-level structure and temporal prototypes that together yield explanations for RL agents.

3. Deep Reinforcement Learning Background

We consider a Markov Decision Process (MDP) defined by $(S, A, P, r, \gamma)$, where $S$ is the state space, $A$ the action space, $P$ the transition dynamics, $r$ the reward function, and $\gamma \in (0, 1)$ the discount factor. A policy $\pi(a \mid s)$ maps states to a distribution over actions (or deterministically to actions in continuous control). The objective is to maximize the expected discounted return $\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right]$.
We use a strong neural policy trained with Proximal Policy Optimization (PPO) [22] as the black-box reference π bb . PPO is an on-policy actor–critic method that improves the training stability by constraining policy updates using a clipped surrogate objective. In this work, PPO is used only to obtain a high-performing reference policy and a latent representation (the penultimate-layer embedding) for SBX clustering and prototype matching. SGTP does not modify the PPO training or the environment; it learns an interpretable prototype-mediated policy by supervised imitation of π bb .

4. SBX

Scenario-Based eXplainability (SBX) summarizes the policy into a small number of semantically meaningful scenarios and uses the scenario medoids as global explanations of the agent’s decision context [3]. A scenario is a trajectory that summarizes the characteristic observations–action evolution over time. SBX produces (i) a discrete scenario assignment for each trajectory; (ii) representative medoids per scenario; and (iii) human-facing aggregates (e.g., typical action curves and observations) to form explanations.
Given a trained policy π , we collect trajectories of observations–action pairs:
$$w_t = \big((o_t, a_t), \ldots, (o_{t+L-1}, a_{t+L-1})\big), \qquad t = 0, \ldots, T - L.$$
In practice, consecutive observations are mapped by a fixed policy network to latent vectors $x_t \in \mathbb{R}^d$, and SBX operates on latent trajectories $X_t \in \mathbb{R}^{L \times d}$. Consecutive actions are then concatenated to the latents before clustering.
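As an illustrative sketch (array names are hypothetical), the overlapping window construction above, with actions concatenated to the latents, can be written as:

```python
import numpy as np

def make_windows(latents, actions, L):
    """Build overlapping length-L windows w_t, t = 0, ..., T - L.

    latents: (T, d) array of per-step latent vectors x_t
    actions: (T, m) array of per-step actions a_t
    Returns an array of shape (T - L + 1, L, d + m) where each window
    stacks the latent sub-trajectory with the action sub-trajectory.
    """
    T = len(latents)
    windows = []
    for t in range(T - L + 1):
        x = latents[t:t + L]                     # (L, d) latent slice
        a = actions[t:t + L]                     # (L, m) action slice
        windows.append(np.concatenate([x, a], axis=1))
    return np.stack(windows)
```
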

4.1. Embedding and Clustering

Trajectories are embedded with a fixed policy network $g_\theta: \mathbb{R}^{L \times d} \to \mathbb{R}^p$, yielding embeddings $z_t = g_\theta(X_t)$. SBX discovers $n$ scenarios by k-means clustering in the latent space. Let $Z = \{z_i\}_{i=1}^{N}$ be the set of trajectory vectors; SBX solves
$$\min_{C, \mu} \sum_{i=1}^{N} \left\lVert z_i - C_{\mu(i)} \right\rVert_2^2, \qquad C \in \mathbb{R}^{n \times p}, \quad \mu: \{1, \ldots, N\} \to \{1, \ldots, n\},$$
and assigns each trajectory to a scenario $\mu(i)$. The number of scenarios $n$ is selected by a Dynamic Time Warping (DTW) score. For each scenario, SBX selects medoid trajectories (nearest to the scenario center in embedding space) as representatives. These medoids serve as temporal prototypes that are later used directly by the interpretable policy.
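A minimal sketch of the clustering and medoid-selection step, assuming trajectory embeddings are already computed and using scikit-learn's `KMeans` as a stand-in (the actual SBX implementation may differ):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_medoids(Z, n_scenarios, seed=0):
    """Cluster trajectory embeddings and pick one medoid per scenario.

    Z: (N, p) array of trajectory embeddings z_i.
    Returns (labels, medoid_indices): the scenario assignment mu(i) and,
    per scenario, the index of the embedding nearest its cluster center.
    """
    km = KMeans(n_clusters=n_scenarios, n_init=10, random_state=seed).fit(Z)
    medoids = []
    for c in range(n_scenarios):
        idx = np.where(km.labels_ == c)[0]
        d = np.linalg.norm(Z[idx] - km.cluster_centers_[c], axis=1)
        medoids.append(idx[np.argmin(d)])    # trajectory closest to the center
    return km.labels_, medoids
```
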

4.2. Human-Facing Summaries

For each scenario, we summarize the typical behavior by aggregating the state and action trajectories from trajectories representing individual clusters (e.g., mean ± std bands). For each selected medoid, we further present its own action trajectory and the mean ± std of the top-N nearest neighbor trajectories in embedding space, providing an actionable explanation based on prototypical trajectories and the corresponding actions.

5. Prototypes

We adopt a prototype-based inherently interpretable policy inspired by the Prototype-Wrapper Network (PW-Net) [4]. The idea is to force decisions to be mediated by a small set of human-friendly prototypes associated with action semantics. The resulting policy explains each decision by reference to its similarity to these prototypical states, yielding case-based pre hoc interpretability.

5.1. Markov Setting and Notation

We assume a Markov Decision Process with states $s \in S$ and actions $a \in \mathbb{R}^M$ (continuous) or $a \in \{1, \ldots, M\}$ (discrete). A pre-trained policy $\pi_{\text{bb}}$ is a neural network with a final linear layer, and it is decomposed as
$$\pi_{\text{bb}}(s) = W\, f_{\text{enc}}(s) + b,$$
where $f_{\text{enc}}$ is the encoder and $W, b$ the final-layer parameters.

5.2. Prototype-Wrapper Architecture

Given a state $s$, let $z = f_{\text{enc}}(s)$ be its latent representation. For each action dimension $i \in \{1, \ldots, M\}$, we introduce $K_i$ prototypes, and for each prototype, we define a small projection network $h_{i,j}$ mapping $z$ into a prototype-specific subspace:
$$z_{i,j} = h_{i,j}(z), \qquad j \in \{1, \ldots, K_i\}.$$
Action outputs are linear combinations of per-prototype similarities via a designer-specified weight matrix $W$:
$$a_i = \sum_{j=1}^{K_i} W_{i,j}\, \mathrm{sim}\!\left(z_{i,j}, p_{i,j}\right), \qquad i = 1, \ldots, M.$$
For continuous control, post-processing enforces valid ranges (e.g., tanh for bounded signals, ReLU for nonnegative signals). For discrete control, parameterized class logits are followed by a softmax.
Prototypes are specified as prototypical states $s^{p}_{i,j}$ that embody an interpretable concept for action $a_i$. Their latent representatives are $p_{i,j} = h_{i,j}\!\left(f_{\text{enc}}(s^{p}_{i,j})\right)$. The matrix $W$ encodes how each concept contributes to each action (e.g., opposing weights for “turn left” vs. “turn right”), ensuring each prototype has a clear semantic effect on the outputs.
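A minimal NumPy sketch of this prototype-mediated action head. The linear projection maps stand in for the networks $h_{i,j}$, and the RBF-style similarity is an illustrative placeholder (the choice of sim is left open at this point in the text):

```python
import numpy as np

def pw_net_action(z, projections, prototypes, W):
    """Prototype-mediated action head (illustrative sketch of Sec. 5.2).

    z           : (d,) encoder latent f_enc(s)
    projections : list of K matrices (q, d), linear stand-ins for h_{i,j}
    prototypes  : list of K latent prototypes p_{i,j}, each (q,)
    W           : (M, K) designer-specified mixing weights
    Returns the (M,) raw actions a_i = sum_j W[i, j] * sim(z_{i,j}, p_{i,j}),
    here with an RBF-style similarity exp(-||.||^2) as a placeholder choice.
    """
    sims = np.array([np.exp(-np.sum((P @ z - p) ** 2))
                     for P, p in zip(projections, prototypes)])
    return W @ sims   # post-process per dimension afterwards (tanh / ReLU)
```

A prototype whose projected latent matches the current state contributes its full weight; distant prototypes contribute almost nothing.
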

5.3. Training Objective

PW-Net is trained to imitate the black-box policy while constraining decisions to flow through prototypes. We form a dataset of state–action pairs $D = \{(s, \pi_{\text{bb}}(s))\}$ by rolling out the black-box policy and optimize the projection networks $\{h_{i,j}\}$ (keeping $f_{\text{enc}}$, $W$, and the prototypes fixed):
$$\min_{\{h_{i,j}\}} \; \mathbb{E}_{s \sim D} \left\lVert a(s) - \pi_{\text{bb}}(s) \right\rVert_2^2.$$
This pre hoc design yields transparency: at inference, the policy’s action is explicitly explained by similarities to a small set of prototypes.
Explanations are case-based: “the agent chose this action because the current state is similar to prototype(s) X.” Locally, the most influential prototypes are those with the largest contributions W i , j   sim ( z i , j , p i , j ) . Globally, the set of prototypes and their semantics (as encoded in W ) communicate the agent’s conceptual structure.
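An illustrative PyTorch sketch of this imitation objective. Shapes, the random stand-ins for rolled-out latents and reference actions, and the linear projections are all hypothetical; only the projection networks receive gradients, matching the frozen-encoder setup described above:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical dimensions; f_enc, W, and the prototypes are frozen (Sec. 5.3).
d, q, K, M = 8, 4, 3, 2

projections = nn.ModuleList([nn.Linear(d, q, bias=False) for _ in range(K)])
prototypes = torch.randn(K, q)        # frozen latent prototypes p_{i,j}
W_mix = torch.randn(M, K)             # frozen designer weight matrix

def prototype_policy(z):
    """a(s): actions mediated by prototype similarities."""
    d2 = torch.stack([((h(z) - p) ** 2).sum(-1)
                      for h, p in zip(projections, prototypes)], dim=-1)
    sims = torch.log((d2 + 1.0) / (d2 + 1e-4))
    return sims @ W_mix.T             # (batch, M)

# Stand-ins for rolled-out latents f_enc(s) and reference actions pi_bb(s).
z_batch = torch.randn(64, d)
a_bb = torch.randn(64, M)

opt = torch.optim.Adam(projections.parameters(), lr=1e-3)
initial_loss = ((prototype_policy(z_batch) - a_bb) ** 2).mean().item()
for _ in range(50):                   # minimize E ||a(s) - pi_bb(s)||^2
    opt.zero_grad()
    loss = ((prototype_policy(z_batch) - a_bb) ** 2).mean()
    loss.backward()
    opt.step()
final_loss = loss.item()
```
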

6. Scenario-Guided Temporal Prototypes (SGTP)

We integrate Scenario-Based eXplainability (SBX) with an extension of the PW-Net to temporal prototypes (prototypes of trajectories) to provide global scenario-level structure and local time-resolved explanations for a trained control policy. SBX is used to partition the behavior trajectories and select representative temporal prototypes. On top of the SBX-selected prototypes, we train a temporal prototype policy that maps latent trajectories to actions.

6.1. Data Preparation and Latent Extraction

We consider a trained policy π acting in discrete time. Observations are first mapped by the frozen policy network to latent vectors x t R d . We denote the latent trajectory by X t = ( x t , , x t + L 1 ) R L × d . We collect an offline dataset by rolling out the trained Proximal Policy Optimization (PPO) [22] agent and recording, at each time step, the policy’s penultimate-layer latent vector and the corresponding environment action. This yields per-episode sequences of latents and actions, which are then converted into trajectories of length L with left-padding for the first L 1 positions. The supervised target for each trajectory is the action at its last real-time step.

6.2. Temporal Prototype Policy

We introduce $K$ temporal prototypes $\{P_k\}_{k=1}^{K}$, each a length-$L$ latent trajectory $P_k \in \mathbb{R}^{L \times d}$ selected by SBX (medoids). A shared temporal encoder $g_\theta: \mathbb{R}^{L \times d} \to \mathbb{R}^p$ maps trajectories to embeddings $z_t = g_\theta(X_t)$ and prototypes to $e_k = g_\theta(P_k)$. Following PW-Net, prototype activations use an L2-to-activation mapping:
$$a_k(t) = \log \frac{\lVert z_t - e_k \rVert_2^2 + 1}{\lVert z_t - e_k \rVert_2^2 + \varepsilon}, \qquad \varepsilon > 0.$$
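A quick numeric check of this mapping (illustrative only): the activation peaks at $\log(1/\varepsilon)$ for an exact match and decays monotonically toward zero as the window moves away from the prototype.

```python
import numpy as np

def proto_activation(d2, eps=1e-4):
    """L2-to-activation mapping a_k = log((d2 + 1) / (d2 + eps)),
    where d2 = ||z_t - e_k||^2 is the squared embedding distance."""
    return np.log((d2 + 1.0) / (d2 + eps))

# Activations for increasing squared distances to a prototype.
d2 = np.array([0.0, 0.5, 2.0, 100.0])
act = proto_activation(d2)
```
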

6.3. SBX Prototype Selection

We select the number of scenarios k by evaluating DTW-based within-cluster distances for k { 2 , , K max } and choosing the smallest k that yields (i) a clear reduction in within-cluster DTW distance (elbow-style stabilization), (ii) stable cluster assignments across random initializations, and (iii) interpretable scenario summaries (distinct voltage/action profiles). This procedure avoids over-fragmentation while ensuring that each scenario corresponds to a recurring operating regime.
Trajectory length L is chosen to match the characteristic time scale of the underlying domain. In voltage control, the dominant exogenous drivers (load and PV generation) follow a daily cycle; therefore, we set L = 96 (one day at 15 min resolution), so that each trajectory window captures a complete operating regime including morning ramp, midday PV peak, and evening demand peak. In CarRacing, we set L = 100 to capture short maneuver segments (straight driving, corner entry, and turning) while keeping the temporal encoder efficient to train.
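The DTW score used for this selection can be sketched with a textbook $O(L^2)$ dynamic-programming implementation (illustrative; in practice a dedicated library would typically be used):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping between two series
    (steps may be scalars or latent vectors)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.atleast_1d(a[i - 1] - b[j - 1]))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def mean_within_cluster_dtw(trajs, labels, medoids):
    """Score a candidate k: mean DTW distance of each trajectory to its
    scenario medoid, used for the elbow-style stabilization check."""
    return float(np.mean([dtw_distance(t, trajs[medoids[c]])
                          for t, c in zip(trajs, labels)]))
```
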

6.4. Inference and Explanations

At test time, we maintain a sliding window of the last L latent vectors, map this window to an embedding z t , compute its similarity to all temporal prototypes, and obtain the action as a linear combination of these similarities. Scenario assignments from SBX are used only to select prototypes. Our proposed approach yields the following:
  • Scenario-level explanations (global): SBX-obtained medoids summarize typical behaviors.
  • Temporal prototype-level explanations (local): per-prototype nearest trajectories illustrate characteristic action trajectories.
The key hyperparameters are L (prototype length), encoder size p, and the learning rate. We report (i) fidelity by validating MSE on the policy’s actions and (ii) qualitative nearest/self prototype visualizations to assess interpretability.
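The sliding-window inference loop might be sketched as follows, with a hypothetical encoder callable and precomputed prototype embeddings; the returned activations are the per-step explanation:

```python
import numpy as np
from collections import deque

class SlidingWindowExplainer:
    """Maintains the last L latents and reports prototype activations
    alongside each action (illustrative sketch; the encoder g_theta and
    prototype embeddings e_k are assumed precomputed)."""

    def __init__(self, L, encode, proto_embeddings, W_mix, eps=1e-4):
        self.window = deque(maxlen=L)
        self.L, self.encode = L, encode
        self.e, self.W = proto_embeddings, W_mix   # (K, p) and (M, K)
        self.eps = eps

    def step(self, x_t):
        self.window.append(x_t)
        if len(self.window) < self.L:              # left-pad short histories
            pad = [self.window[0]] * (self.L - len(self.window))
            X = np.stack(pad + list(self.window))
        else:
            X = np.stack(self.window)
        z = self.encode(X)                         # window embedding z_t
        d2 = np.sum((z - self.e) ** 2, axis=1)
        act = np.log((d2 + 1.0) / (d2 + self.eps)) # prototype activations
        return self.W @ act, act                   # action + explanation
```
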

7. Experiments

We consider a continuous-control domain with high-dimensional observations (CarRacing) and a voltage control problem (electricity power network), reflecting complementary challenges. In each case, a pre-trained policy π bb serves as the encoder and as a reference behavior to be explained, and explanations are generated post-training without modifying the task environment.

7.1. Experimental Process

For each domain, we (i) collect observation–action trajectories from π bb ; (ii) form overlapping trajectories of length L; (iii) run SBX in an embedding space to select the number of scenarios and representative medoids; (iv) train a temporal prototype policy with a fixed number of prototypes (from SBX) to imitate the reference policy’s actions over these trajectories; and (v) evaluate the fidelity and interpretability on held-out episodes. We compare our approach against a black-box policy π bb , which provides an upper bound on task reward and serves as the action-reference for fidelity. We assess the quality of our temporal prototype explanations through nearest-neighbor coherence and scenario coverage metrics. We assess the results by the following metrics:
  • Task fidelity: task reward on held-out episodes; action-level discrepancy (mean-squared error for continuous actions or accuracy for discrete actions) between the prototype policy and π bb ,
  • Scenario quality: clustering quality and per-scenario support; qualitative inspection of scenario summaries (mean ± std of state/action trajectories),
  • Prototype locality: average embedding-space distance between prototypes and their top-N nearest trajectories; visual nearest-neighbor aggregates to assess exemplar quality.
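The prototype-locality metric from the last bullet can be sketched as follows (hypothetical names; top-N nearest neighbors in the encoder embedding space):

```python
import numpy as np

def prototype_locality(embeddings, proto_indices, top_n=25):
    """Average embedding-space distance from each prototype to its
    top-N nearest trajectory embeddings (smaller = denser neighborhood,
    i.e., the prototype represents frequently occurring behavior)."""
    scores = []
    for p in proto_indices:
        d = np.linalg.norm(embeddings - embeddings[p], axis=1)
        d = np.sort(d)[1:top_n + 1]      # skip the prototype itself
        scores.append(d.mean())
    return np.array(scores)
```
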

7.2. CarRacing

We used the OpenAI Gym CarRacing-v2 continuous control task with pixel observations. Each observation is an RGB frame, and the agent produces a continuous action $a = (\text{steer}, \text{gas}, \text{brake})$, where $\text{steer} \in [-1, 1]$ and $\text{gas}, \text{brake} \in [0, 1]$. The goal is to complete a procedurally generated track efficiently while staying on the road. The reward encourages forward progress and penalizes off-track behavior and idling; episodes terminate upon finishing the track, when the car stays off the track or fails to make progress for a fixed number of steps (as defined in the CarRacing-v2 environment), or when reaching the step limit (typically 1000). Because the track forms a closed loop, behavior is dominated by straight driving and turning/transition segments, which motivates the use of two scenarios.
Figure 2 illustrates frame grabs for two prototypical behaviors (turning left and driving forward), while Figure 3 summarizes the corresponding nearest trajectories in latent space as mean ± std action trajectories.
We used trajectories of length L = 100 , which cover sufficiently long segments to capture turns and transitions between straight driving and cornering, while remaining short enough for efficient training (the choice of L is domain-dependent).
Task fidelity. On held-out episodes, the temporal prototype policy achieves an aggregated trajectory-level MSE of 2.418, where we sum squared errors over the full $L = 100$ window and all three action components (steer, gas, brake) relative to $\pi_{\text{bb}}$. This value is not a percentage; it is expressed in squared action units. Normalizing by the trajectory length and action dimensions gives a per-step per-dimension MSE of $2.418 / (100 \cdot 3) = 0.00806$, i.e., an RMSE of $\sqrt{0.00806} \approx 0.0898$. In relative terms, this corresponds to $4.5\%$ of full-scale for steering (range $[-1, 1]$) and $9.0\%$ of full-scale for gas/brake (range $[0, 1]$).
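The normalization above can be verified directly:

```python
import math

total_mse = 2.418          # aggregated squared error over one window
L, action_dims = 100, 3    # window length and (steer, gas, brake)

per_step = total_mse / (L * action_dims)   # per-step per-dimension MSE
rmse = math.sqrt(per_step)

# Relative to full-scale ranges: steering spans 2.0, gas/brake span 1.0.
rel_steer = rmse / 2.0
rel_gas = rmse / 1.0
```
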
Scenario quality. We inspected DTW-based within-cluster distances for k { 2 , , 8 } and the corresponding average action and state profiles. While additional clusters slightly reduce within-cluster distances (as expected), k = 2 already yields a clear and interpretable separation between predominantly straight-driving segments and turning/transition segments on this track, and we adopt this value for the CarRacing study.
Prototype locality. In the encoder embedding space, for each prototype, we computed the average distance to its top-25 nearest trajectories. These distances are small compared to typical pairwise distances in the dataset, indicating that prototypes represent dense frequently occurring behavior patterns rather than isolated outliers. Using 25 neighbors provides a balance between locality and stable aggregates.

7.3. Power Network Voltage Control

The voltage-control case study uses a real low-voltage network operated by Elektro Gorenjska (Slovenia). The network contains seven controllable customers (“active consumers”) equipped with PV and battery energy storage systems (BESS). We use two years of historical 15 min measurements to construct daily episodes of 96 steps. Due to confidentiality constraints, the network is anonymized; nevertheless, the full network model (topology and line parameters) is derived from the DSO asset database and used consistently for all simulations.
We model the network in steady-state power flow at each 15 min step using the measured (or reconstructed) load and PV profiles as exogenous injections. The controllable action vector a = [ α 1 , , α 7 ] represents flexibility commands per active consumer, applied subject to device bounds (PV availability and BESS power/energy limits). Voltage operational limits are enforced at 1.05 p.u. (upper bound) and 0.95 p.u. (lower bound).
An observation/state is the vector of the per-bus voltage magnitudes s = [ v 1 , , v n ] (in per unit). Actions are per-active-consumer flexibility commands a = [ α 1 , , α m ] with α i [ 1 , 1 ] : negative values decrease consumption (or increase net export), and positive values decrease the generation for active consumers (bounded by their instantaneous battery output). The agent acts every 15 min; episodes comprise 96 steps (one day).
At each 15 min step t, the controlled variable is the vector of bus voltage magnitudes v ( t ) = [ v 1 ( t ) , , v n ( t ) ] in per-unit (p.u.). The simulated voltages v b ( t ) are obtained by (i) applying the agent action to modify the net injections of controllable assets (PV curtailment and/or BESS charge/discharge) and (ii) solving a steady-state power-flow model for the resulting network operating point. The control objective is to keep the voltages within operational bounds:
$$V_{\min} \le v_b(t) \le V_{\max}, \qquad \forall b, \; \forall t,$$
where we use V max = 1.05 p.u., and V min = 0.95 p.u. These bounds are more conservative than the statutory voltage-quality tolerance bands specified in EN 50160 [23] and reflect typical operational practice in distribution networks, where control actions are triggered before operational limits are reached in order to preserve the equipment lifetime and power quality.
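A minimal sketch of the corresponding violation check over a simulated voltage series (example values are hypothetical):

```python
import numpy as np

def voltage_violations(v, v_min=0.95, v_max=1.05):
    """Count and total magnitude of per-bus limit violations
    for a (T, n_bus) series of voltage magnitudes in p.u."""
    over = np.clip(v - v_max, 0.0, None)    # over-voltage excess
    under = np.clip(v_min - v, 0.0, None)   # under-voltage deficit
    excess = over + under
    return int(np.count_nonzero(excess)), float(excess.sum())

v = np.array([[1.00, 1.06],   # bus 2 over-voltage at t = 0
              [0.94, 1.02]])  # bus 1 under-voltage at t = 1
count, magnitude = voltage_violations(v)
```
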
Following prior work on distribution–voltage control [24], we use a reward that balances the voltage quality, activation effort, and network losses. Trajectories are generated by a PPO policy trained in this environment. The control objective is to keep the voltages within operational limits while minimizing interventions and losses.
A reference PPO policy π bb is trained in this environment and serves both as the black-box policy to be explained and as a source of latent representations. As in the CarRacing setting, we collect observation–action trajectories from π bb and construct overlapping trajectories of fixed length L.
We set the trajectory length to L = 96 (one full day at 15 min resolution), which captures the full daily demand and PV generation patterns. Scenario-Based eXplainability (SBX) is then run on the latent trajectory embeddings. Based on DTW-based within-cluster distances and qualitative inspection of voltage and action profiles for k { 2 , , 8 } , we select k = 3 scenarios. For each scenario, SBX identifies the medoid trajectories, which are used as temporal prototypes in the SGTP model.
Figure 4 shows the voltage profiles for the three scenarios at a critical bus. The critical bus is defined as the bus with the maximum voltage magnitude observed over the episode. Figure 5 illustrates the corresponding representative prototype-based action patterns for all of the active consumers that are contributing to the activations.
Task fidelity. The temporal prototype policy is trained to imitate π bb on the collected trajectories, analogously to the CarRacing setting. On held-out days, it closely tracks the reference policy in terms of (i) the overall reward, (ii) the number and magnitude of voltage limit violations, and (iii) the total activated flexibility. In practice, we observe only a modest degradation relative to π bb , indicating that mediating decisions through temporal prototypes preserves the essential control behavior. The results for the final reward are displayed in Table 1.
Scenario quality. The three SBX scenarios correspond to intuitive and recurring operating regimes (as identified by the network operators): (i) low-load periods with voltages close to nominal, (ii) typical daytime variable operation, and (iii) PV-dominated periods with elevated voltages, especially at feeder ends. Each regime has comparable support in the dataset (to the other regimes), ensuring that the selected scenarios represent frequently occurring events.
Prototype locality. For each temporal prototype, we compute the nearest neighbors in the encoder embedding space. The nearest-neighbor trajectories align with the intended semantics of each prototype (e.g., “midday PV peak with targeted curtailment at end-of-line buses”). At decision time, explanations take a case-based form: the system reports which prototypes are most activated and how their contributions combine to yield the final action. Domain experts at Elektro Gorenjska confirmed that these explanations match their intuition about typical operating regimes and corresponding control responses, making the SGTP behavior easier to audit and trust.
The results demonstrate several key insights about our approach. The base policy achieves the best rewards. The PW-Net policy shows comparable performance, indicating that prototype-based explanations can be achieved without significant performance degradation. Our approach achieves a mean reward of 211.47 ± 14.60 (Table 1), representing a modest performance trade-off in exchange for enhanced interpretability through temporal prototypes and scenario-guided explanations.
Although our controllable resources are PV and stationary BESS, the resulting control problem is structurally analogous to multi-source energy systems in electric-vehicle (EV) applications, where heterogeneous energy sources and buffers must be coordinated in real time under safety and performance constraints. Smart cyber-physical multi-source EV architectures emphasize real-time sensing, actuation, and supervisory energy dispatch across interacting subsystems [25]. In this context, SGTP provides an interpretable decision layer by grounding each control action in similarity to representative temporal trajectories, which can improve operator trust and auditability.

8. Discussion

Our main goal is to provide clear and understandable explanations for reinforcement learning policies in settings where operators must trust and verify automated decisions. We introduce the methodology using a standard continuous-control benchmark CarRacing. As a representative safety-critical case study, we use voltage control in the distribution networks, which demonstrates the generality of the proposed mechanism of interpretability.
SGTP shows the following:
  • Temporal prototype policies can approximate a strong black-box policy while exposing which prototypical patterns influence each decision.
  • SBX-derived scenarios reveal a compact global structure of behavior (e.g., straight vs. cornering segments; distinct daily regimes in the power grid), which helps domain experts reason about the policy at a higher level.
  • Prototype neighborhoods in latent space provide a systematic way to check whether the explanations are grounded in frequently occurring behaviors rather than isolated examples.
A limitation of the current study is the absence of a universally accepted quantitative metric for “explanation quality”. Instead, we propose a practical evaluation methodology: task fidelity (reward and action discrepancy), scenario coverage and distinctness, and prototype locality via nearest-neighbor analysis. Together, these measures indicate whether the learned explanations are consistent, data-supported, and aligned with domain intuition, even if they do not reduce to a single scalar score.
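The task-fidelity part of this evaluation can be sketched as direct comparisons between the reference and prototype policies. The arrays below are random placeholders for episode rewards and per-step actions; the function names are illustrative, not from the paper's codebase.

```python
import numpy as np

def fidelity_metrics(base_rewards, sgtp_rewards, base_actions, sgtp_actions):
    """Task-fidelity measures: mean reward relative to the base policy (%)
    and mean absolute per-step action discrepancy."""
    rel_reward = np.mean(sgtp_rewards) / np.mean(base_rewards) * 100.0
    action_mae = np.mean(np.abs(np.asarray(base_actions)
                                - np.asarray(sgtp_actions)))
    return rel_reward, action_mae

# Toy data standing in for 20 evaluation episodes and 1000 control steps.
rng = np.random.default_rng(2)
base_r = rng.normal(220, 12, size=20)
sgtp_r = rng.normal(211, 15, size=20)
base_a = rng.normal(size=(1000, 2))
sgtp_a = base_a + rng.normal(scale=0.05, size=(1000, 2))

rel, mae = fidelity_metrics(base_r, sgtp_r, base_a, sgtp_a)
```

Scenario coverage/distinctness and prototype locality complete the evaluation; they are computed over the clustered trajectories rather than per-step signals.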
Post hoc methods such as saliency maps or feature attribution can highlight correlations between inputs and actions, but they do not provide a consistent, case-based rationale grounded in representative behavior over time. Prototype-based wrappers provide pre hoc interpretability via similarities to exemplars, but they are typically defined over single states. SBX provides global scenario summaries over trajectories but does not by itself explain each decision step. Table 2 summarizes these differences.
Although this paper does not present a field deployment, the proposed SGTP mechanism is compatible with practical operational constraints. Inference requires (i) computing the policy encoder latent for the current observation, (ii) maintaining a sliding window of length L, and (iii) evaluating similarities to a fixed library of K prototypes followed by a linear readout. These operations are lightweight compared to training and can be executed within typical control-cycle budgets (e.g., 15 min dispatch in our voltage-control case). In safety-critical operations, SGTP is intended to complement—not replace—standard safeguards: action bounding, rule-based overrides, and human-in-the-loop approval for novel or low-similarity situations. The contribution of this work is therefore the interpretability and auditability of a learned controller, demonstrated in a realistic simulation setting commonly used in the voltage-control literature.
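Steps (i)–(iii) of the inference path can be sketched as follows. This is a schematic under stated assumptions: the encoder latents, prototype library, readout weights, and the exponential similarity kernel are hypothetical placeholders, not the trained components.

```python
import numpy as np

def sgtp_act(obs_window, prototypes, W, b, temperature=1.0):
    """One SGTP decision step (sketch).

    obs_window : (L, d) encoder latents for the current sliding window
    prototypes : (K, L, d) fixed library of temporal prototypes
    W, b       : linear readout mapping K similarities to an action
    Returns the action and the per-prototype similarity scores
    that serve as the case-based explanation.
    """
    # (iii) similarity of the current window to each prototype
    dists = np.linalg.norm(
        (prototypes - obs_window).reshape(len(prototypes), -1), axis=1)
    sims = np.exp(-dists / temperature)   # higher = more similar
    action = W @ sims + b                 # linear readout
    return action, sims

# Toy dimensions: window length L=4, latent dim d=8, K=3 prototypes,
# a 2-dimensional continuous action.
rng = np.random.default_rng(1)
L, d, K = 4, 8, 3
window = rng.normal(size=(L, d))
protos = rng.normal(size=(K, L, d))
W, b = rng.normal(size=(2, K)), np.zeros(2)

action, sims = sgtp_act(window, protos, W, b)
top = int(np.argmax(sims))  # "this matches prototype `top`"
```

Per step, the cost is one encoder forward pass plus K window comparisons, which is why the mechanism fits comfortably within a 15 min dispatch cycle.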
SGTP explanations are grounded in historical trajectories generated by a reference policy π_bb under the same environment dynamics and operational constraints. Prototype correctness is therefore tied to the correctness of π_bb, which we assume is already well-trained and operates under stationary environment dynamics, and to how representative the selected prototypes are of frequently occurring behavior. Rare but safety-critical events may therefore require additional safeguards or explicit inclusion as prototypes.
Future work includes the following: identifying and incorporating landmark or rare-but-critical states (which SBX does not select as prototypes), allowing variable-length prototypes, and developing human-in-the-loop tools for editing and labeling prototypes, so that explanations can be further aligned with expert mental models. Furthermore, EV chargers (and potentially V2G assets) could be incorporated as additional controllable devices: SBX would discover regimes driven by joint PV–load–mobility patterns, and temporal prototypes would explain which historical multi-source operating pattern the controller is currently matching.

9. Conclusions

We presented SGTP, a pre hoc interpretability framework that (i) discovers scenario structure from trajectories and (ii) explains actions via temporal prototypes. The approach yields faithful explanations without materially degrading the control quality, as demonstrated in CarRacing and Power Network voltage control. Explanations take a case-based form—“this situation is similar to prototype X”—and are grounded by scenario summaries and prototype locality. In addition, the qualitative comparison in Table 2 positions SGTP relative to post hoc attribution, concept-based explanations, and prior prototype/scenario approaches, highlighting that SGTP jointly provides a temporal, case-based, and scenario-level structure.
SGTP also offers practical diagnostics: scenario detection, per-scenario prototypes, and nearest-neighbor coherence checks expose where explanations are strong and where they require refinement. Looking ahead, we plan to enable interactive prototype curation, incorporate uncertainty-aware explanation scores, and explore joint training schemes that couple prototype-based interpretability with context-aware latent dynamics. Together, these steps can help bridge the gap between high-performing DRL policies and the trust and insight required for their deployment in real systems.

Author Contributions

Conceptualization, B.D. and J.Ž.; methodology, B.D.; software, B.D.; validation, B.D. and J.Ž.; formal analysis, B.D.; writing—original draft preparation, B.D.; writing—review and editing, J.Ž.; supervision, J.Ž. All authors have read and agreed to the published version of the manuscript.

Funding

Jure Žabkar was partially supported by the Slovenian Research Agency (ARIS) (L2-4436-“Deep Reinforcement learning for optimization of LV distribution network operation with Integrated Flexibility in real-Time (DRIFT)”) and has received support from the Slovenian Research Agency (ARIS) as a member of the research program Artificial Intelligence and Intelligent Systems (Grant No. P2-0209).

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from the electricity distribution operator Elektro Gorenjska d.d. and are available from Dobravec B. with the permission of the aforementioned company, in compliance with general GDPR guidelines.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT from OpenAI, Model GPT-5, Version—2025 October, for the purposes of grammar checking, reference formatting, and manuscript formatting. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Alharin, A.; Doan, T.N.; Sartipi, M. Reinforcement Learning Interpretation Methods: A Survey. IEEE Access 2020, 8, 171058–171077. [Google Scholar] [CrossRef]
  2. Li, H.; Sahrani, S.; Sarker, M.R.; Xiao, Y. State-of-the-Art on IoV-Based Deep Learning Framework for Enhanced Driving Behavior Recognition: Recent Progress, Technology Updates, Challenges, and Future Direction. IEEE Access 2025, 13, 135969–135989. [Google Scholar] [CrossRef]
  3. Dobravec, B.; Žabkar, J. Explaining Voltage Control Decisions: A Scenario-Based Approach in Deep Reinforcement Learning. In Foundations of Intelligent Systems; Springer: Cham, Switzerland, 2024; pp. 216–230. [Google Scholar]
  4. Kenny, E.M.; Tucker, M.; Shah, J.A. Towards Interpretable Deep Reinforcement Learning with Human-Friendly Prototypes. In Proceedings of the ICLR, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  5. Milani, S.; Topin, N.; Veloso, M.; Fang, F. Explainable Reinforcement Learning: A Survey and Comparative Review. ACM Comput. Surv. 2024, 56, 1–36. [Google Scholar] [CrossRef]
  6. Puiutta, E.; Veith, E.M.S.P. Explainable Reinforcement Learning: A Survey. arXiv 2020, arXiv:2005.06247. [Google Scholar] [CrossRef]
  7. Simonyan, K.; Vedaldi, A.; Zisserman, A. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv 2013, arXiv:1312.6034. [Google Scholar]
  8. Sequeira, P.; Gervasio, M.T. Interestingness Elements for Explainable Reinforcement Learning: Understanding Agents’ Capabilities and Limitations. Artif. Intell. 2019, 288, 103367. [Google Scholar] [CrossRef]
  9. Guo, W.; Wu, X.; Khan, U.; Xing, X. EDGE: Explaining Deep Reinforcement Learning Policies. In Advances in Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: New York, NY, USA, 2021; Volume 34, pp. 12222–12236. [Google Scholar]
  10. Madumal, P.; Miller, T.; Sonenberg, L.; Vetere, F. Explainable Reinforcement Learning through a Causal Lens. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; pp. 2493–2500. [Google Scholar]
  11. Greydanus, S.; Koul, A.; Dodge, J.; Fern, A. Visualizing and Understanding Atari Agents. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 1787–1796. [Google Scholar]
  12. Chen, C.; Li, O.; Tao, C.; Barnett, A.; Rudin, C.; Su, J. This Looks Like That: Deep Learning for Interpretable Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8930–8939. [Google Scholar]
  13. Nauta, M.; van Bree, S.; Seifert, C. Neural Prototype Trees for Interpretable Fine-Grained Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14933–14943. [Google Scholar]
  14. Yarats, D.; Fergus, R.; Lazaric, A.; Pinto, L. Reinforcement Learning with Prototypical Representations. In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; PMLR: Cambridge, MA, USA, 2021. [Google Scholar]
  15. Wang, J.; Zhang, Q.; Mu, Y.; Li, D.; Zhao, D.; Zhuang, Y.; Luo, P.; Wang, B.; Hao, J. Prototypical Context-Aware Dynamics for Generalization in Visual Control With Model-Based Reinforcement Learning. IEEE Trans. Ind. Inform. 2024, 20, 10717–10727. [Google Scholar] [CrossRef]
  16. Kim, B.; Wattenberg, M.; Gilmer, J.; Cai, C.; Wexler, J.; Viegas, F.; Sayres, R. Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 2673–2682. [Google Scholar]
  17. Ghorbani, A.; Wexler, J.; Zou, J.; Kim, B. Towards Automatic Concept-Based Explanations. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; pp. 9273–9282. [Google Scholar]
  18. Ragodos, R.; Wang, T.; Lin, Q.; Zhou, X. ProtoX: Explaining a Reinforcement Learning Agent via Prototyping. In Advances in Neural Information Processing Systems; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: New York, NY, USA, 2022; Volume 35, pp. 27239–27252. [Google Scholar]
  19. Christiano, P.F.; Leike, J.; Brown, T.; Martic, M.; Legg, S.; Amodei, D. Deep Reinforcement Learning from Human Preferences. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30. [Google Scholar]
  20. Zhang, K.; Xu, P.; Zhang, J. Explainable AI in Deep Reinforcement Learning Models: A SHAP Method Applied in Power System Emergency Control. In Proceedings of the 2020 IEEE 4th Conference on Energy Internet and Energy System Integration (EI2), Wuhan, China, 30 October–1 November 2020; pp. 711–716. [Google Scholar] [CrossRef]
  21. Zhang, K.; Zhang, J.; Xu, P.D.; Gao, T.; Gao, D.W. Explainable AI in Deep Reinforcement Learning Models for Power System Emergency Control. IEEE Trans. Comput. Soc. Syst. 2022, 9, 419–427. [Google Scholar] [CrossRef]
  22. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  23. SIST EN 50160; Voltage Characteristics of Electricity Supplied by Public Distribution Networks. iTeh Standard: San Francisco, CA, USA, 2007.
  24. Wang, J.; Xu, W.; Gu, Y.; Song, W.; Green, T.C. Multi-Agent Reinforcement Learning for Active Voltage Control on Power Distribution Networks. IEEE Trans. Power Syst. 2021, 34, 3271–3284. [Google Scholar] [CrossRef]
  25. Tehrani, K. A smart cyber physical multi-source energy system for an electric vehicle prototype. J. Syst. Archit. 2020, 111, 101804. [Google Scholar] [CrossRef]
Figure 1. Handwritten illustration of a trajectory as it would have been driven by a human operator. The dotted line indicates the intended real-world trajectory.
Figure 2. Illustrative frame grabs corresponding to two temporal prototypes in CarRacing ( L = 100 ). The vertical separator highlights the boundary between prototypes; the arrow indicates the direction of progression within each prototype segment.
Figure 3. Prototype 1 and 2: nearest windows (mean ± std actions). Left represents the notion of “executing a left turn”, and the right represents the “driving forward” characteristics. Solid curves show the mean action trajectory; shaded regions denote the 5th–95th percentile envelopes over the nearest-neighbor windows.
Figure 4. Scenario medoid voltage trajectories. Each panel shows the SBX-selected medoid day (cluster representative) for all detected scenarios. Shaded regions denote the 5th–95th percentile envelope across the trajectories and their corresponding scenarios. Curves correspond to voltage magnitudes v b ( t ) (p.u.) at the critical bus. The horizontal line marks the operational upper bound V max = 1.05 p.u.; values above indicate over-voltage risk.
Figure 5. Representative prototype-based action profiles in the power network environment. Each curve shows the average activation pattern of the specific active consumer associated with one temporal prototype, illustrating distinct control regimes over the day.
Table 1. Policy performance comparison over 20 episodes in voltage control. The “Rel to Base (%)” column reports mean reward relative to the base policy.
Policy         | Mean   | Std   | Median | Min    | Max    | Rel to Base (%)
Base Policy    | 221.80 | 12.60 | 223.25 | 201.00 | 257.50 | 100.0
PW-Net Policy  | 220.67 | 17.89 | 221.95 | 185.80 | 249.50 | 99.0
SGTP           | 211.47 | 14.60 | 216.55 | 168.40 | 231.80 | 95.0
Table 2. Qualitative comparison of explanation approaches in deep reinforcement learning.
Approach                        | Pre Hoc | Temporal | Case-Based | Global Regimes
Saliency/attribution (post hoc) |    –    |    –     |     –      |       –
Concept-based explanations      |    –    |    –     |     –      |       ✓
PW-Net-style prototypes (state) |    ✓    |    –     |     ✓      |       –
SBX (scenario summaries)        |    ✓    |    ✓     |     –      |       ✓
SGTP (ours)                     |    ✓    |    ✓     |     ✓      |       ✓
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

