Agentic AI for Price-Only 15 min SDAC Market Diagnostics in Central and Eastern Europe

Ali, Șener; Oprea, Simona-Vasilica; Bâra, Adela

doi:10.3390/asi9050093


by Șener Ali, Simona-Vasilica Oprea * and Adela Bâra

Department of Economic Informatics and Cybernetics, Bucharest University of Economic Studies, 010374 Bucharest, Romania

* Author to whom correspondence should be addressed.

Appl. Syst. Innov. 2026, 9(5), 93; https://doi.org/10.3390/asi9050093

Submission received: 31 March 2026 / Revised: 27 April 2026 / Accepted: 28 April 2026 / Published: 29 April 2026


Abstract

The shift to 15 min market time units (MTUs) in single-day-ahead coupling (SDAC) increases temporal granularity, but complicates the interpretation of intra-hour electricity price spikes and rapid ramps. This paper examines whether architectural decomposition improves the reliability of large language model (LLM)-based diagnostics in price-only settings, rather than causal market analytics, under severe information constraints. We compare a proposed agentic workflow featuring structured context extraction, spike/ramp detection, hypothesis generation, consistency checks, and explicit uncertainty calibration against non-agentic baselines. The paper contributes: (i) a reproducible benchmark for 15 min diagnostic question answering in day-ahead markets, (ii) an agentic architecture tailored to structured time-series reasoning with explicit uncertainty handling, and (iii) empirical evidence that decomposition and verification improve evidence grounding and trustworthiness in market analytics. The evaluation includes 360 price-only cases sampled across autumn 2025, winter 2025–2026, and early spring 2026, balanced by bidding zone, temporal period, event type, and impact tier, comprising 180 spike and 180 ramp cases from six Central and Eastern European bidding zones (Bulgaria, Czechia, Hungary, Poland, Romania, and Slovakia). Using identical inputs, we assess automatic reliability metrics and human ratings. The agentic workflow improves reliability (∆ = +0.067, 95% CI [+0.049, +0.085]) and significantly increases calibrated price-only disclaimers (∆ = +0.500) relative to the monolithic LLM baseline. Human evaluation confirms higher overall quality (+0.74), helpfulness (+1.06), and correctness (+0.94), with a 65.5% pairwise win rate. 
Overall, the results support a narrower conclusion: structured decomposition and verification improve calibration and perceived explanation quality relative to a simple monolithic LLM baseline, but their advantages are not uniform across stronger non-agentic baselines and remain limited by the absence of exogenous market data.

Keywords:

agentic workflows; tool-augmented LLMs; SDAC; 15 min MTUs; day-ahead electricity-market analytics

1. Introduction

The adoption of 15 min market time units (MTUs) in single-day-ahead coupling (SDAC) has structurally changed the European electricity market by increasing temporal granularity [1]. While this transition aims to provide more insights about intra-hour system conditions, it also raises certain challenges for market participants, analysts, and system operators, such as fast ramps in residual load or more sensitive prices to short-lived imbalances, making it more difficult to explain why a specific quarter hour exhibits a spike or a rapid price change [2]. However, it is important for economic agents to understand the reasons that cause these changes, because they can impact bidding strategies, risk management, and post-trade analysis, especially in interconnected regions where cross-border effects propagate quickly [3].

Due to their ability to generate coherent explanations, summarise heterogeneous inputs, and support interactive question answering (QA), large language models (LLMs) have emerged as a useful tool for analytics and decision support [4]. Despite their progress, LLMs’ use in electricity-market diagnostics remains problematic. First, quarter-hour market analysis is strongly numeric and time-indexed: explanations must align with time stamps, computed deltas, and short windows of contextual evidence [5]. Second, LLMs can produce plausible, but weakly grounded causal narratives and can be unstable across repeated runs, limitations that directly affect trustworthiness when users need reliable reasoning [6].

The motivation for this study arose from the structural transformation of European electricity markets following the adoption of 15 min MTUs in SDAC. While increased temporal granularity improves market efficiency and better reflects intra-hour system conditions, it also makes price dynamics more volatile and harder to interpret. Intra-hour spikes and rapid ramps can emerge within narrow time windows, and small imbalances may translate into sharp price movements. For traders, analysts, and system operators, understanding why a specific quarter hour exhibits an unusual level or change has therefore become both more important and more challenging. At the same time, LLMs are increasingly used for analytical support because of their ability to summarise data and generate explanations. However, in high-frequency, price-only environments, LLMs often produce plausible, but weakly grounded causal narratives, sometimes attributing price changes to exogenous drivers without evidence. This creates a trust and reliability gap, particularly in regulated and high-stakes markets.

This paper proposes an agentic workflow for SDAC day-ahead analytics that aims to address these limitations. Specifically, we target the task of explaining why a given 15 min SDAC day-ahead price interval exhibits an unusual level or change (e.g., an intra-hour spike or rapid ramp), using only a bounded local window of the published day-ahead price series. Accordingly, the task should be interpreted as grounded descriptive diagnosis and controlled explanation generation over observed prices, not causal market analysis. Because no exogenous variables such as load, generation, outages, flows, or capacity constraints are provided, the system cannot establish true market drivers. We focus on price-only inputs to isolate the effect of architectural decomposition and verification without confounding gains with access to exogenous variables, and to explicitly test calibrated uncertainty under limited observability. The main idea is that planned, targeted extraction of time-window evidence, consistency checks, and explicit uncertainty handling can yield more reliable diagnostic answers than a single-pass LLM response. Our research question is whether architectural decomposition and verification improve evidence grounding (time stamps and numeric deltas), interval focus, and calibration of uncertainty in price-only settings. To isolate the effect of the architecture, we compare the agentic system against an LLM that receives the same case inputs and questions, but lacks explicit decomposition and verification.

Our evaluation relies on a benchmark of cases constructed from public market data, with emphasis on situations where 15 min resolution materially changes the interpretation (intra-hour spikes and rapid ramping patterns). We construct cases from publicly available SDAC day-ahead prices published via the ENTSO-E Transparency Platform API, focusing on six Central and Eastern Europe (CEE) bidding zones (Bulgaria, Czechia, Hungary, Poland, Romania, and Slovakia) across autumn 2025, winter 2025–2026, and early spring 2026. Each case defines a zone and a target quarter hour, provides a bounded time window of price series, and asks for a diagnosis with evidence anchored in time stamps and numeric values. Concretely, the input contains the target time stamp, a fixed local context window at 15 min granularity with prices in EUR/MWh, and derived indicators used to select and characterise cases (e.g., quarter-hour price deltas and spike/ramp flags).

Therefore, our study brings three contributions to the literature:

(a): First, we define a reproducible evaluation setup for 15 min day-ahead diagnostic QA, including a structured case format and a protocol for paired comparisons between non-agentic baselines and an agentic workflow.
(b): Second, we propose an agentic design tailored to price-only diagnostic explanation over structured time series, emphasising mechanism-aware reasoning and evidence extraction.
(c): Third, we provide a comparison focused on European Union member states, highlighting how agentic workflows affect reliability and evidence quality when only bounded price-window evidence is available.

In the EU, the transition from hourly to 15 min products has changed how price signals and events are diagnosed. Patterns that were previously smoothed within an hour become visible as short spikes, step changes, or rapid ramps across adjacent quarter hours [7]. Short-lived constraints, ramping capability limits, forecast errors, and changes in residual demand can drive intra-hour variability in a high-frequency electricity market [8]. In interconnected markets, these effects propagate across borders [9]. Despite these domain characteristics, the intersection between 15 min market design and LLM-driven explanatory analytics remains underexplored, particularly for EU-coupled day-ahead contexts.

The remainder of the paper is structured as follows. Section 2 reviews the relevant literature on 15 min market design in SDAC, diagnostic analytics in electricity markets, and the use of LLMs for structured reasoning and decision support, with particular emphasis on reliability and uncertainty calibration. Section 3 introduces the proposed agentic architecture, detailing the decomposition into local context extraction, spike/ramp identification, hypothesis generation, consistency verification, and explicit uncertainty handling. This section also formalises the research question and explains the controlled comparison with non-agentic baselines. Section 4 reports results from automatic reliability indicators, zone-level heterogeneity analysis, and human evaluation and provides a discussion on the practical implications and utility of the findings. Finally, Section 5 concludes by summarising the main findings, outlining practical implications for market participants and system operators and suggesting directions for future research on trustworthy AI in high-frequency electricity markets.

2. Literature Review

The exponential development of LLMs has accelerated interest in their integration as a decision support tool across technical domains, including finance, operations and energy systems [10]. While they have great ability to generate natural-language explanations and support interactive analysis, research also highlights their limitations such as hallucinations, weak calibration of uncertainty, and inconsistent citation or evidence usage, particularly in settings requiring precise numeric grounding [11]. In electricity-market analytics, where explanations often depend on time-indexed signals and short-lived system states, these limitations can significantly reduce trust [12]. This section reviews the relevant literature on LLMs for market analytics, agentic workflows, and evaluation methodologies.

2.1. LLMs for Energy and Electricity Markets: Applications, Numeric Grounding, and Time-Series Reasoning

LLMs serve as natural-language interfaces for energy data exploration, forecasting assistance, and operational decision support [13]. Prior studies highlight LLMs’ ability to interpret complex dashboards, summarise multi-source signals, and generate narratives about market events [14]. However, in wholesale electricity markets, structured numeric inputs (prices, volumes, load) are an important component of analysis [15], and the analysis must be carefully aligned with market timelines and product granularity [16]. As a result, a model's ability to generate plausible text does not guarantee that it can also provide meaningful and grounded analysis [17].

Numeric evidence and time-series context are central to electricity-market explanations. The literature shows that precise numeric reasoning is a weakness of LLMs, especially when the task requires tracking multiple time stamps, computing differences, or maintaining consistency across related statements [18]. These limitations can be mitigated by using tools that offload arithmetic and enforce explicit references to data points [19]. For evaluation, researchers increasingly operationalise grounding via measurable indicators such as numeric consistency, time-stamp coverage, and the ability to cite evidence aligned with the question scope [20].

2.2. Agentic Workflows and Tool-Augmented LLM Systems

Agentic workflows, or tool-augmented pipelines, aim to improve reliability by replacing a single-step answer with planning, tool calls, validation of intermediate results, and synthesis of a final response [21]. Multistep architectures have shown improvements in task completion and factual consistency in domains where decomposition and verification matter [22]. The main benefit of such designs is the separation of responsibilities and the forced generation of structured intermediate outputs [23]. Nevertheless, agentic systems also introduce new failure modes: orchestration errors, compounding tool mistakes, and overly rigid pipelines that may miss salient signals if the planning step is poor [24].

2.3. Structural Decomposition, Schema-Grounded Evidence Extraction and Verification

A related body of work shows that LLM reliability can be improved by decomposing complex analytical tasks into structured intermediate steps rather than generating final answers in a single pass [25]. Schema-grounded extraction constrains the model to produce predefined fields such as synthesis methods, stability data, and electrical performance metrics, making outputs formatted, normalised, and easier to use directly in subsequent data analysis [26]. Attribution-oriented approaches then check whether generated claims are accurately supported by the available source documents [27]. These ideas are particularly relevant for explainable time-series classification, where explanations are often organised around time points, subsequences, or lookback windows, and may assess the importance of temporal changes or shifts in model behaviour [28].

2.4. Cross-Domain Agentic AI for Financial Market Diagnostics and Time-Series Analysis

Recent financial-domain studies show that agentic AI is increasingly used for complex data diagnostics beyond energy systems. Park proposes an LLM-based multi-agent framework for financial-market anomaly detection, where detected anomalies are converted into LLM-suitable questions and then reviewed by specialised agents for web research, institutional knowledge, cross-checking, consolidation, and management-level discussion [29]. This is closely related to our diagnostic setting, because both approaches use role-specialised agents to validate anomalous time-indexed market observations; however, our study differs by restricting the task to price-only 15 min SDAC evidence and evaluating grounding, calibration, and unsupported-claim control rather than external anomaly validation. QuantAgents is a multi-agent financial trading system that integrates simulated trading, risk-control analysis, market-news analysis, and manager-led coordination to evaluate investment strategies and support more forward-looking decisions in dynamic financial markets [30].

2.5. Evaluating Reliability: Evidence Focus, Stability, and Unsupported Claims

The quality of analytic systems depends on several factors (correctness, usefulness, evidence support, and robustness to paraphrases or repeated runs), which makes the evaluation of LLM-based systems challenging [31]. One approach is automated evaluation, which approximates these dimensions through proxy metrics but can be sensitive to prompt choices and may not fully capture domain correctness [32]. A more effective strategy is therefore to combine (i) automatic metrics with (ii) human assessment on a subset of cases using a structured rubric [33]. This motivates our evaluation design, which computes comparable automatic metrics for baseline-versus-agentic outputs and complements them with manual scoring for plausibility and evidence quality. A structured comparative analysis synthesising the most directly relevant LLM-based, agentic, reliability-oriented, and time-series diagnostic works is provided in Table 1.

Prior work suggests that LLMs can be used as an analytic tool, but they also present some limitations [34]. Agentic workflows and tool use promise improvements, yet there is limited evidence isolating the effect of agentic decomposition for EU day-ahead market diagnostics at 15 min resolution [35]. Moreover, 15 min products increase the need for interval-specific reasoning [36].

Our study addresses these gaps by (i) defining a structured benchmark of 15 min SDAC-related diagnostic cases constructed from public market data, (ii) implementing an agentic workflow that emphasises planning, evidence extraction, verification, and explicit uncertainty handling, and (iii) comparing it against non-agentic baselines under identical inputs to quantify reliability improvements attributable to the architecture.

3. Methodology

This study aimed to determine whether an agentic LLM workflow outperforms non-agentic baselines in day-ahead market analytics at 15 min resolution. Our main hypothesis was that splitting the workflow into multiple steps (planning, structured extraction, verification, synthesis) produces outputs that are (i) more consistently grounded in the time series provided, (ii) more focused on the target 15 min interval, and (iii) less prone to unsupported claims than a monolithic prompt–response baseline.

Let $z \in Z$ denote a bidding zone (CEE focus) and let the day-ahead price be observed on a 15 min grid:

$$T = \{t_{1}, t_{2}, \dots, t_{n}\}, \quad \Delta t = 15\ \text{min} \tag{1}$$

with prices $p_{z}(t) \in \mathbb{R}$ in EUR/MWh. Each query analyses a target time stamp $t^{*} \in T$ and a local context window.

3.1. Case Construction Using ENTSO-E Day-Ahead Prices

We queried the ENTSO-E Transparency Platform API [37] to obtain the public day-ahead price series $\{(t, p_{z}(t))\}_{t \in T_{z,d}}$ for each zone $z$ and day $d$. A case $c$ is defined by a tuple:

$$c = (z, d, t^{*}, p_{c}, q_{c}) \tag{2}$$

where $t^{*}$ is an event time stamp, $q_{c}$ is a natural-language question template, and:

$$p_{c} = \{(t, p_{z}(t)) \mid t \in [t^{*} - H_{\mathrm{pre}},\ t^{*} + H_{\mathrm{post}}]\} \tag{3}$$

is the local price window around $t^{*}$ with fixed $H_{\mathrm{pre}}$ and $H_{\mathrm{post}}$ (2 h each). To avoid near-duplicate cases, we apply a minimum separation constraint. If two candidates $t_{i}^{*}$, $t_{j}^{*}$ satisfy:

$$|t_{i}^{*} - t_{j}^{*}| < \tau_{\min}, \quad \tau_{\min} = 60\ \text{min} \tag{4}$$

we retain at most one.
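The window extraction and the minimum-separation filter can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the window extends $H_{pre}$ before and $H_{post}$ after the target, and it assumes a greedy rule that keeps the earliest of any two conflicting candidates (the text only states that at most one is retained). The function names `local_window` and `enforce_min_separation` are hypothetical.

```python
from datetime import datetime, timedelta

H_PRE = timedelta(hours=2)      # window length before the target (2 h, from the text)
H_POST = timedelta(hours=2)     # window length after the target (2 h, from the text)
TAU_MIN = timedelta(minutes=60)  # minimum separation between retained candidates


def local_window(series, t_star, h_pre=H_PRE, h_post=H_POST):
    """Return the (t, price) pairs with t in [t* - H_pre, t* + H_post]."""
    return [(t, p) for t, p in series if t_star - h_pre <= t <= t_star + h_post]


def enforce_min_separation(candidates, tau_min=TAU_MIN):
    """Assumed greedy rule: scan in time order and keep a candidate only if
    it lies at least tau_min after the last kept candidate."""
    kept = []
    for t in sorted(candidates):
        if not kept or t - kept[-1] >= tau_min:
            kept.append(t)
    return kept
```

With a 15 min grid and 2 h on each side, a complete window holds 17 quarter-hour observations (8 before, the target, and 8 after).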

Our case suite contains only instances where the 15 min granularity matters. For each zone–day series, define the daily empirical quantile function $Q_{z,d}(\alpha)$ of $\{p_{z}(t)\}$. A time stamp $t$ is a spike candidate if:

$$p_{z}(t) \geq Q_{z,d}(0.95) \tag{5}$$

Define the discrete 15 min price change:

$$\Delta p_{z}(t) = p_{z}(t) - p_{z}(t - \Delta t) \tag{6}$$

and its absolute magnitude $|\Delta p_{z}(t)|$. A time stamp $t$ is a ramp candidate if:

$$|\Delta p_{z}(t)| \geq Q_{z,d}^{\Delta}(0.90) \tag{7}$$

where $Q_{z,d}^{\Delta}(\alpha)$ is the daily empirical quantile of $\{|\Delta p_{z}(t)|\}$ over valid $t$.
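The two detectors can be illustrated with a short sketch. The quantile convention is an assumption: the text does not specify one, so a linear-interpolation empirical quantile is used here, and other conventions would shift the thresholds slightly. The names `empirical_quantile` and `detect_candidates` are hypothetical.

```python
def empirical_quantile(values, alpha):
    """Linear-interpolation empirical quantile (one of several common conventions)."""
    xs = sorted(values)
    if not xs:
        raise ValueError("empty sample")
    pos = alpha * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def detect_candidates(prices, spike_q=0.95, ramp_q=0.90):
    """prices: one zone-day of 15 min prices in time order.

    Returns (spike_indices, ramp_indices) following the daily-quantile rules:
    a spike has price >= Q(0.95); a ramp has |delta| >= Q_delta(0.90).
    """
    spike_thr = empirical_quantile(prices, spike_q)
    # deltas[i - 1] is the change from quarter hour i - 1 to quarter hour i
    deltas = [prices[i] - prices[i - 1] for i in range(1, len(prices))]
    ramp_thr = empirical_quantile([abs(d) for d in deltas], ramp_q)
    spikes = [i for i, p in enumerate(prices) if p >= spike_thr]
    ramps = [i for i in range(1, len(prices)) if abs(deltas[i - 1]) >= ramp_thr]
    return spikes, ramps
```

Note that a single isolated spike produces two ramp candidates (the step up and the step down), which is one reason a separation constraint between candidates is useful.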

Spike and ramp candidates are not automatically included in the final benchmark. Instead, each candidate is characterised using impact-oriented price indicators computed from the local window: the absolute target price, the absolute 15 min price change, the price premium over the local-window median, and the absolute deviation from the local-window mean. These indicators define a price-impact score:

$$S_{c}^{\mathrm{impact}} = \max\left(|\Delta p_{z}(t^{*})|,\ \max\left(p_{z}(t^{*}) - \operatorname{median}(p_{c}),\ 0\right),\ |p_{z}(t^{*}) - \operatorname{mean}(p_{c})|\right) \tag{8}$$

where $S_{c}^{\mathrm{impact}}$ denotes the impact score of candidate case $c$. Candidates are assigned to low-, medium-, and high-impact tiers according to this score within each event type. Final benchmark inclusion is then balanced across temporal period, bidding zone, event type, and impact tier, using tier quotas of 40% high-impact, 40% medium-impact, and 20% low-impact cases. Impact scores and impact tiers were used only for benchmark construction and balancing and were not provided to any evaluated system as part of the case input.
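As a worked illustration of the impact score in Equation (8), the following sketch computes the three terms for a small window. It assumes the window is passed as a plain list of prices with the target at a known index; the tier thresholds are not given in the text and are therefore not modelled. The function name `impact_score` is hypothetical.

```python
from statistics import mean, median


def impact_score(window_prices, target_idx):
    """Impact score of Equation (8) for the price at target_idx.

    Assumption: if the target is the first point of the window, the 15 min
    change is undefined and is treated as 0.
    """
    p_star = window_prices[target_idx]
    delta = p_star - window_prices[target_idx - 1] if target_idx > 0 else 0.0
    premium = max(p_star - median(window_prices), 0.0)  # premium over local median
    deviation = abs(p_star - mean(window_prices))       # deviation from local mean
    return max(abs(delta), premium, deviation)
```

For the window `[50, 52, 51, 120, 60]` with the target at index 3, the three terms are 69 (quarter-hour change), 68 (premium over the median of 52), and 53.4 (deviation from the mean of 66.6), so the score is 69.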

For each case, we define a label $l_{c} \in \{\text{spike}, \text{ramp}\}$ based on the detector that produced it.

3.2. Systems Under Comparison

Let $I_{c}$ denote the structured input for case $c$, consisting of $(z, q_{c}, p_{c})$ encoded in JSON-like form.

3.2.1. Baseline Monolithic LLM

The baseline variant for our study is a single LLM call:

$$a_{c}^{(B)} = f_{\theta}(I_{c}) \tag{9}$$

where $f_{\theta}$ is the chosen Gemini model (gemini-2.5-flash-lite) with fixed configuration (temperature = 0).

3.2.2. Baseline Fixed-Tool LLM

To provide a stronger non-agentic reference than the monolithic single-call baseline, we define a fixed-tool LLM baseline. Let $x_{c}$ denote the structured input for case $c$, consisting of the target time stamp $t_{c}$, the bounded local price window $W_{c}$, and the natural-language diagnostic question. The fixed-tool baseline augments $x_{c}$ with a deterministic statistical summary $s_{c}$ computed from the same window:

$$s_{c} = \psi(W_{c}) \tag{10}$$

where $\psi(\cdot)$ is a fixed preprocessing function that extracts descriptive summaries, local extrema, quarter-hour deltas, and simple event indicators from the observed price series. The final answer is then generated by a single LLM synthesis step:

$$\hat{a}_{c}^{FT} = f_{LLM}(x_{c}, s_{c}) \tag{11}$$

where $f_{LLM}$ denotes the chosen Gemini model. Unlike the proposed agentic workflow, this baseline does not perform dynamic planning, role separation, iterative verification, or critic-guided revision. The sequence of operations is static and predetermined: deterministic summary extraction followed by one final generative step.

3.2.3. Baseline Statistical Diagnostic System

We also introduce a non-LLM statistical baseline designed specifically for price-only electricity-market explanation. For each case $c$, let $W_{c}$ denote the target-centred local price window and let:

$$r_{c} = \phi(W_{c}) \tag{12}$$

be a deterministic rule-based feature representation derived from that window, where $\phi(\cdot)$ extracts statistical descriptors such as local maxima and minima, quarter-hour ramps, abrupt price changes, short-window volatility, and anomaly indicators. The final response is then produced by the following function:

$$\hat{a}_{c}^{S} = \tau(r_{c}) \tag{13}$$

where $\tau(\cdot)$ maps the detected signals into a fixed diagnostic template. Thus, the output is entirely determined by explicit rules and pre-specified textual structure without any generative reasoning step.

This baseline serves as a practically relevant non-LLM comparator for a task that is fundamentally structured and time series-based. It also helps determine whether the full agentic workflow adds value beyond straightforward event detection and deterministic narrative construction. Because this baseline is non-generative, it cannot benefit from flexible language synthesis, but it also avoids hallucinated causal narratives by construction.
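A minimal sketch of such a deterministic baseline is given below, with `phi` playing the role of $\phi(\cdot)$ and `tau` the role of $\tau(\cdot)$. The specific features and the wording of the template are assumptions chosen for illustration; the paper's actual rules and template are not reproduced here.

```python
from statistics import pstdev


def phi(window):
    """window: list of (iso_timestamp, price) in time order, with at least two points.

    Returns a dict of rule-based descriptors (illustrative subset).
    """
    prices = [p for _, p in window]
    # Each delta is stamped with the time of the later quarter hour.
    deltas = [(window[i][0], prices[i] - prices[i - 1]) for i in range(1, len(window))]
    t_ramp, d_ramp = max(deltas, key=lambda td: abs(td[1]))
    t_max, p_max = max(window, key=lambda tp: tp[1])
    t_min, p_min = min(window, key=lambda tp: tp[1])
    return {
        "max": (t_max, p_max),
        "min": (t_min, p_min),
        "largest_ramp": (t_ramp, d_ramp),
        "volatility": pstdev(prices),
    }


def tau(r):
    """Map the detected signals into a fixed, hypothetical diagnostic template."""
    t_max, p_max = r["max"]
    t_ramp, d_ramp = r["largest_ramp"]
    return (
        f"Local maximum {p_max:.2f} EUR/MWh at {t_max}. "
        f"Largest quarter-hour change {d_ramp:+.2f} EUR/MWh at {t_ramp}. "
        "Drivers cannot be identified from prices alone."
    )
```

Because every sentence is filled from computed features, the output cannot assert an exogenous driver, which mirrors the property described above.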

3.2.4. Agentic Workflow Mapping

Figure 1 summarises the proposed agentic workflow, showing the sequence of specialised agents used to plan, structure the data, perform quantitative analysis, reason about mechanisms, verify claims, and synthesise the final response.

The agentic workflow splits the process into multiple steps:

$$a_{c}^{(A)} = S(V(K(Q(M(P(I_{c})))))) \tag{14}$$

where $a_{c}^{(A)}$ is the final structured answer for case $c$ and each operator corresponds to a role-specific agent, as follows.

Planner Agent ( $P$ ) represents the first layer of our architecture. Its role is to interpret the query and produce a task plan that specifies what evidence is required and which intermediate computations should be performed. Concretely, $P$ understands the user’s intent (e.g., spike explanation versus ramping explanation), identifies the target interval $I_{c} = [t^{*}, t^{*} + Δ t]$ , and defines constraints.
Market-Data Agent ( $M$ ) transforms the raw local price window $p_{c}$ into a deterministic feature table $x_{c} (t)$ : it standardises data ordering and computes derived columns, including the discrete quarter-hour change $\Delta p_{z} (t) = p_{z} (t) - p_{z} (t - \Delta t)$ , so that later stages have a structured representation in which to ground the explanation.
Quantitative Analysis Agent ( $Q$ ) aims to make the results less dependent on the LLM’s free-form arithmetic. Thus, $Q$ performs lightweight quantitative checks by identifying the largest $| ∆ p |$ within a target neighbourhood, ranking candidate intervals by magnitude and generating summary statistics.
Mechanism Agent ( $K$ ) assesses the credibility of the current draft diagnosis and detected patterns under the market mechanism context and the resolution of analysis (15 min MTUs) and outputs plausibility checks, candidate mechanisms, and warnings based only on the provided inputs. In particular, $K$ enforces constraints such as (i) avoiding causal statements that require missing exogenous signals and (ii) ensuring that the narrative remains anchored to the specified target interval rather than drifting to unrelated periods.
Verifier Agent ( $V$ ) forces justifications for statements. In this way, the agent performs a verification over intermediate claims and planned conclusions, explicitly flagging (i) unsupported assertions, (ii) missing inputs that would be required to justify a stronger claim, and (iii) inconsistencies between stated drivers and the computed feature table.
Synthesizer Agent ( $S$ ) generates the final answer in a strict JSON schema. The synthesis is constrained to preserve the constructed evidence lines (time stamps and numeric deltas). The narrative components (drivers, uncertainty, summary) are phrased to reflect the available information.
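The composition in Equation (14) can be sketched as a chain of functions that pass a shared state forward. The sketch below is purely structural: every agent is a deterministic stub standing in for what would be an LLM-backed role in the described system, and all function names, state keys, and output fields are hypothetical.

```python
import json
from functools import reduce


def planner(state):
    # P: record which intermediate computations are required (stub plan).
    state["plan"] = ["build_feature_table", "rank_deltas", "check_claims"]
    return state


def market_data(state):
    # M: order the window and derive the quarter-hour delta column.
    rows = sorted(state["window"])  # ISO time stamps sort chronologically
    state["table"] = [
        {"t": t, "price": p, "delta": None if i == 0 else p - rows[i - 1][1]}
        for i, (t, p) in enumerate(rows)
    ]
    return state


def quantitative(state):
    # Q: find the largest |delta| without relying on free-form LLM arithmetic.
    valid = [r for r in state["table"] if r["delta"] is not None]
    state["largest_delta"] = max(valid, key=lambda r: abs(r["delta"]))
    return state


def mechanism(state):
    # K: attach warnings implied by the price-only setting (stub).
    state["warnings"] = ["price-only input: causal drivers cannot be confirmed"]
    return state


def verifier(state):
    # V: placeholder check that the evidence time stamp exists in the table.
    known = {r["t"] for r in state["table"]}
    state["verified"] = state["largest_delta"]["t"] in known
    return state


def synthesizer(state):
    # S: emit the final answer as JSON, preserving the evidence line.
    top = state["largest_delta"]
    return json.dumps({
        "evidence": [f"{top['t']}: delta={top['delta']:+.2f} EUR/MWh"],
        "uncertainty": state["warnings"],
        "verified": state["verified"],
    })


PIPELINE = [planner, market_data, quantitative, mechanism, verifier, synthesizer]


def run_agentic(case_input):
    """Apply P, M, Q, K, V, S in order, i.e. S(V(K(Q(M(P(I_c))))))."""
    return reduce(lambda acc, agent: agent(acc), PIPELINE, dict(case_input))
```

Keeping the numeric steps (`market_data`, `quantitative`) as plain code reflects the stated design goal of making results less dependent on the model's own arithmetic.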

3.2.5. Component Ablations Against the Full Agentic Workflow

To isolate the contribution of three selected components, we evaluate one-component-removed ablations against the full agentic workflow: without Planner, without Market Data, and without Verifier.

3.2.6. Implementation Details

Appendix A outlines the structured prompting architecture that operationalises the agentic workflow and explains why it produces more reliable and calibrated outputs than a monolithic LLM. The Planner Agent first decomposes each user query into explicit analytic steps, ensuring that the diagnostic task is framed in a structured and transparent manner rather than answered in a single narrative pass. The Market-Data Agent then restricts itself strictly to the numeric inputs provided, summarising patterns without inventing values and preserving the original feature table unchanged. This constraint enforces numeric grounding and prevents data hallucination. The Mechanism Agent introduces plausible market explanations, but based only on detected patterns and provides quantitative notes, explicitly separating hypotheses from facts. The Verifier Agent acts as a critic, systematically identifying unsupported causal claims, numeric inconsistencies, and missing data, thereby injecting epistemic discipline into the workflow. Finally, the Synthesizer Agent assembles the response under strict rules: evidence lines must remain unchanged, exogenous drivers cannot be stated as facts without inputs, and any causal interpretation must be labelled low-confidence when only price data are available.

These prompts reveal a deliberate separation of reasoning roles (planning, grounding, hypothesising, auditing, and synthesising), designed to reduce hallucinated causality, enforce time-stamp alignment, and calibrate uncertainty in 15 min SDAC price diagnostics.

3.3. Automated Evaluation

Given the fact that exogenous causal attribution cannot be validated from price-only inputs, the automated metrics are designed to measure reliability prerequisites rather than complete economic explanation quality. They quantify grounding, target-interval focus, calibration and unsupported-claim control, and are complemented by checks for genericity and actionability. Let $\varepsilon_{c}$ be the set of evidence lines produced for case $c$ and let $|\varepsilon_{c}|$ be its cardinality.

3.3.1. Time-Stamp Coverage

Let $1_{ts}(e) = 1$ if evidence line $e$ contains an ISO time stamp, else 0. Then:

$$\mathrm{TSCov}(c) = \frac{1}{|\varepsilon_{c}|} \sum_{e \in \varepsilon_{c}} 1_{ts}(e) \tag{15}$$

with $\mathrm{TSCov}(c) = 0$ if $|\varepsilon_{c}| = 0$.

3.3.2. Delta-Format Coverage

Let $1_{\Delta}(e) = 1$ if $e$ contains a standardised delta format, else 0:

$$\mathrm{DeltaCov}(c) = \frac{1}{|\varepsilon_{c}|} \sum_{e \in \varepsilon_{c}} 1_{\Delta}(e) \tag{16}$$
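Both coverage metrics are fractions of evidence lines matching a pattern, so they can share one helper. The regular expressions below are assumptions: the text does not define the exact ISO time-stamp or "standardised delta" format, so a date-time prefix and a signed `delta=` or `Δ=` token are used as stand-ins. For simplicity, the sketch also applies the empty-set convention stated for TSCov to DeltaCov.

```python
import re

# Assumed patterns; the exact formats used in the paper are not specified here.
ISO_TS = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}")
DELTA_FMT = re.compile(r"(?:Δ|delta)\s*=\s*[+-]\d+(?:\.\d+)?")


def coverage(evidence_lines, pattern):
    """Fraction of evidence lines matching pattern; 0.0 for an empty evidence set."""
    if not evidence_lines:
        return 0.0
    return sum(1 for e in evidence_lines if pattern.search(e)) / len(evidence_lines)


def ts_cov(evidence_lines):
    return coverage(evidence_lines, ISO_TS)


def delta_cov(evidence_lines):
    return coverage(evidence_lines, DELTA_FMT)
```

For example, among four evidence lines of which two carry a time stamp and two carry a signed delta, both metrics equal 0.5.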

3.3.3. Target-Neighbourhood Focus

We define an indicator that at least one evidence time stamp lies in the neighbourhood $N_{c}(\delta)$:

$$\mathrm{NearTarget}(c) = 1\left(\exists\, t \in T(\varepsilon_{c})\ \text{s.t.}\ |t - t^{*}| \leq \delta\right) \tag{17}$$

where $T(\varepsilon_{c})$ extracts time stamps present in evidence lines.
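The indicator in Equation (17) can be sketched as follows. The neighbourhood radius $\delta$ is not given in this section, so 30 min is used as a placeholder, and time stamps are parsed as naive date-times at minute precision (time zones are ignored), which is a simplifying assumption. The function name `near_target` is hypothetical.

```python
import re
from datetime import datetime, timedelta

ISO_TS = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}")


def near_target(evidence_lines, t_star, delta=timedelta(minutes=30)):
    """Return 1 if any time stamp found in the evidence lies within delta of t_star."""
    for line in evidence_lines:
        for match in ISO_TS.findall(line):
            t = datetime.strptime(match, "%Y-%m-%dT%H:%M")
            if abs(t - t_star) <= delta:
                return 1
    return 0
```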

3.3.4. Calibration

Because each case provides only a day-ahead price series, the true causal drivers cannot be confirmed from prices alone, and a well-calibrated answer should acknowledge this. We operationalise this as a simple binary indicator that checks whether the output contains a price-only limitation disclaimer. Let the model output for case $c$ include an uncertainty field composed of a list of strings and a final_summary field represented by a string. We define the pool of text where a disclaimer may appear as:

$$T(c) = \left(\bigcup_{u \in \mathrm{uncertainty}_{c}} \{u\}\right) \cup \{\mathrm{final\_summary}_{c}\} \tag{18}$$

We then define a set of disclaimer patterns $D$ that indicate price-only limitations. Next, we define the calibration indicator:

$$\mathrm{Cal}(c) = 1\left(\exists\, x \in T(c)\ \text{s.t.}\ \exists\, p \in D : p \subset x\right) \tag{19}$$

where $p \subset x$ denotes that pattern $p$ appears in the text $x$. Thus, $\mathrm{Cal}(c) \in \{0, 1\}$ equals 1 if the output explicitly states that causal attribution is limited by price-only inputs and 0 otherwise.
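A minimal sketch of the calibration indicator is shown below. The disclaimer patterns are illustrative placeholders, since the actual set $D$ is not listed in this section, and matching is done by case-insensitive substring search, which is an assumption about how "appears in the text" is implemented.

```python
# Hypothetical disclaimer patterns standing in for the set D.
DISCLAIMER_PATTERNS = ("price-only", "prices alone", "no exogenous data")


def cal(output, patterns=DISCLAIMER_PATTERNS):
    """Return 1 if any uncertainty string or the final summary contains a disclaimer."""
    pool = list(output.get("uncertainty", [])) + [output.get("final_summary", "")]
    return int(any(p in text.lower() for text in pool for p in patterns))
```

A substring test of this kind rewards the presence of a disclaimer, not its quality, which is why the text describes the indicator as a simple binary check.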

3.3.5. Unsupported Exogenous Assertions

Although the model may mention exogenous concepts as missing information, we penalise only asserted exogenous drivers stated without hedging. Let $\mathrm{UEA}(c)$ be the count of exogenous keywords appearing in non-hedged sentences within the top_drivers and final_summary fields:

$$\mathrm{UEA}(c) = \sum_{s \in S_{c}} \sum_{k \in K} 1(k \in s) \cdot 1(\neg \mathrm{Hedge}(s)) \tag{20}$$

where $S_{c}$ is the set of claim-bearing sentences, $K$ is the exogenous keyword set, and $\mathrm{Hedge}(s)$ detects uncertainty markers, so that only sentences without such markers contribute to the count.
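The count of unsupported exogenous assertions can be sketched as below, following the description in the text: exogenous keywords are counted only in sentences that carry no hedge. The keyword list, the hedge list, and the sentence splitter are all illustrative assumptions; the paper's actual lists are not reproduced here.

```python
import re

# Hypothetical keyword and hedge lists standing in for K and the Hedge detector.
EXOGENOUS = ("outage", "wind", "solar", "load", "congestion", "demand")
HEDGES = ("may", "might", "could", "possibly", "cannot be confirmed")


def _has_word(text, word):
    return re.search(rf"\b{re.escape(word)}\b", text) is not None


def is_hedged(sentence):
    s = sentence.lower()
    return any(_has_word(s, h) for h in HEDGES)


def split_sentences(text):
    # Naive splitter on terminal punctuation; adequate for short summaries.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def uea(output):
    """Count exogenous keywords in non-hedged sentences of top_drivers and final_summary."""
    fields = list(output.get("top_drivers", [])) + [output.get("final_summary", "")]
    count = 0
    for field in fields:
        for sentence in split_sentences(field):
            if is_hedged(sentence):
                continue
            low = sentence.lower()
            count += sum(1 for k in EXOGENOUS if _has_word(low, k))
    return count
```

In the example "A wind drop and a plant outage caused the spike." contributes 2, whereas "Demand may have increased." contributes 0 because of the hedge "may".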

3.3.6. Reliability Score

We combine the above into a reliability-oriented composite metric:

$$\mathrm{Rel}(c) = \omega_{1}\,\mathrm{TSCov}(c) + \omega_{2}\,\mathrm{DeltaCov}(c) + \omega_{3}\,\mathrm{NearTarget}(c) + \omega_{4}\,\mathrm{Cal}(c) - \omega_{5}\, g(\mathrm{UEA}(c)) \tag{21}$$

where $g(\cdot)$ is a saturating penalty and weights $\omega_{i}$ are fixed a priori. The key comparison is the paired difference:

$$\Delta \mathrm{Rel}(c) = \mathrm{Rel}^{(A)}(c) - \mathrm{Rel}^{(B)}(c) \tag{22}$$

3.3.7. Reliability Score Without Calibration

Because the agentic workflow explicitly encourages uncertainty calibration, we also report a calibration-free sensitivity score that removes the calibration component. This tests whether the agentic workflow improves grounding and unsupported-claim control beyond the architecturally encouraged calibration behaviour.

$$\mathrm{Rel}_{\text{no-cal}}(c) = \omega_1\,\mathrm{TSCov}(c) + \omega_2\,\mathrm{DeltaCov}(c) + \omega_3\,\mathrm{NearTarget}(c) - \omega_5\,g(\mathrm{UEA}(c))$$

(23)
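A minimal Python sketch of the composite score and its calibration-free variant is given below. The weights, the exponential form of the saturating penalty $g$, and all metric values are assumptions chosen for illustration; the text states only that the weights are fixed a priori and that $g$ saturates.

```python
import math

# Hypothetical weights; the text fixes omega_1..omega_5 a priori but does not list them here.
WEIGHTS = {"ts_cov": 0.25, "delta_cov": 0.25, "near_target": 0.25, "cal": 0.25, "uea": 0.25}

def saturating_penalty(uea: float, scale: float = 2.0) -> float:
    """One possible saturating g: rises from 0 towards 1 as the count grows."""
    return 1.0 - math.exp(-uea / scale)

def reliability(m: dict, include_calibration: bool = True) -> float:
    """Composite score as in Equation (21); drops the Cal term for Equation (23)."""
    score = (
        WEIGHTS["ts_cov"] * m["ts_cov"]
        + WEIGHTS["delta_cov"] * m["delta_cov"]
        + WEIGHTS["near_target"] * m["near_target"]
        - WEIGHTS["uea"] * saturating_penalty(m["uea"])
    )
    if include_calibration:
        score += WEIGHTS["cal"] * m["cal"]
    return score

# Illustrative per-case metric values, not taken from the benchmark.
agentic = {"ts_cov": 0.9, "delta_cov": 0.8, "near_target": 1.0, "cal": 1.0, "uea": 2.0}
baseline = {"ts_cov": 0.7, "delta_cov": 0.6, "near_target": 1.0, "cal": 0.0, "uea": 1.0}

# Paired difference as in Equation (22), with and without the calibration component.
delta_rel = reliability(agentic) - reliability(baseline)
delta_rel_no_cal = reliability(agentic, False) - reliability(baseline, False)
print(round(delta_rel, 3), round(delta_rel_no_cal, 3))
```

Comparing the two deltas for the same case shows how much of a reliability gain is attributable to the calibration term alone.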

3.3.8. Genericity

Because stronger calibration may come at the cost of overly vague or non-specific explanations, we include a genericity metric. Let $O_c$ denote the full output text for case $c$ and let $G$ be a predefined set of generic patterns. The genericity score counts how many such patterns appear in the output text:

$$\mathrm{Gen}_c = \sum_{\gamma \in G} I[\gamma \subseteq O_c]$$

(24)

where $I[\cdot]$ is the indicator function and $\gamma \subseteq O_c$ denotes that pattern $\gamma$ appears in the output text.

3.3.9. Actionability

To complement the reliability-oriented metrics, we also report a simple actionability indicator. The motivation is that grounded and calibrated explanations are useful only if they still support practical interpretation. Let $O_c$ denote the full output text for case $c$, and let $A$ be a predefined set of action-oriented markers. The actionability metric is defined as:

$$\mathrm{Act}_c = I[\exists a \in A : a \subseteq O_c]$$

(25)

Thus, $\mathrm{Act}_c = 1$ if the output contains at least one action-oriented term and 0 otherwise. This metric should be interpreted as a proxy indicator rather than a direct measure of real-world decision usefulness, because it is based on predefined action-oriented markers.
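Both pattern-based indicators, genericity (Equation (24)) and actionability (Equation (25)), can be sketched in a few lines of Python. The pattern lists below are hypothetical stand-ins for the predefined sets $G$ and $A$, and case-insensitive substring matching is an assumption about how "appears in the output text" is implemented.

```python
# Hypothetical pattern sets; the predefined sets G and A are not listed in the text.
GENERIC_PATTERNS = ["various factors", "market conditions", "supply and demand"]
ACTION_MARKERS = ["monitor", "check", "compare", "verify"]

def genericity(output_text: str) -> int:
    """Gen_c: number of generic patterns that occur in the output text."""
    text = output_text.lower()
    return sum(pattern in text for pattern in GENERIC_PATTERNS)

def actionability(output_text: str) -> int:
    """Act_c: 1 if at least one action-oriented marker occurs, else 0."""
    text = output_text.lower()
    return int(any(marker in text for marker in ACTION_MARKERS))

# Illustrative output text, not taken from the benchmark.
text = (
    "The ramp likely reflects market conditions and various factors. "
    "Compare the prices of adjacent intervals before drawing conclusions."
)
print(genericity(text), actionability(text))  # 2 1
```

Because genericity is a count and actionability is binary, a long output can raise the former without changing the latter, which is consistent with the proxy interpretation given above.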

3.4. Human Evaluation

The human evaluation was conducted on a subset of 50 cases sampled from the benchmark, covering target intervals from 2025-10-01 to 2025-10-13. The two annotators had experience in data analysis and AI, but limited domain-specific expertise in electricity markets. They evaluated anonymised and randomly ordered outputs from the monolithic LLM baseline and the agentic workflow, without system labels, using a structured rubric covering overall quality, helpfulness, correctness, clarity, and pairwise preference. They were instructed to assess whether responses were grounded in the time stamps and numeric price evidence provided, avoided unsupported causal claims, and clearly stated the limitations of price-only inputs. For each case $c$ and rubric dimension $j$, each annotator $m$ rated the baseline and agentic outputs (systems $s$) as follows:

$$h_{c,j}^{(m)}(s) \in \{1, 2, 3, 4, 5\}, \quad s \in \{A, B\}, \quad m \in \{1, \dots, M\}$$

(26)

We define the per-system mean rating:

$$\bar{h}_{c,j}(s) = \frac{1}{M} \sum_{m=1}^{M} h_{c,j}^{(m)}(s)$$

(27)

and compute paired deltas:

$$\Delta h_{c,j} = \bar{h}_{c,j}(A) - \bar{h}_{c,j}(B)$$

(28)
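The aggregation in Equations (27) and (28) amounts to averaging over annotators and then differencing the two systems. A minimal Python sketch for a single case follows; the data layout, the dimension names used as dictionary keys, and all ratings are made up for illustration.

```python
from statistics import mean

# ratings[system][dimension] holds one score per annotator (M = 2 here), on a 1-5 scale.
# All numbers are made up for illustration.
ratings = {
    "A": {"overall": [4, 5], "helpfulness": [4, 4], "correctness": [5, 4]},
    "B": {"overall": [3, 4], "helpfulness": [3, 3], "correctness": [4, 3]},
}

def paired_deltas(case_ratings: dict) -> dict:
    """Mean rating per system and dimension, then system A minus system B."""
    deltas = {}
    for dim in case_ratings["A"]:
        mean_a = mean(case_ratings["A"][dim])  # Equation (27) for system A
        mean_b = mean(case_ratings["B"][dim])  # Equation (27) for system B
        deltas[dim] = mean_a - mean_b          # Equation (28)
    return deltas

deltas = paired_deltas(ratings)
print(deltas)
```

Averaging such per-case deltas over the evaluated subset gives one summary value per rubric dimension.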

3.5. Statistical Analysis

For each metric $m(c)$, we compute paired differences $\Delta m(c)$ across all cases and report the mean difference:

$$\hat{\mu}_{\Delta} = \frac{1}{N} \sum_{c=1}^{N} \Delta m(c)$$

(29)

and bootstrap confidence intervals (CIs):

$$\mathrm{CI}_{1-\alpha}(\hat{\mu}_{\Delta}) = [q_{\alpha/2},\ q_{1-\alpha/2}]$$

(30)

where $q$ represents bootstrap quantiles over resampled cases. All analyses are performed both overall and by zone to verify generalisation across CEE markets.

3.6. Validation of Results

To validate our statistical results, we summarise all metrics using paired case-level differences and report nonparametric uncertainty estimates together with standardised effect sizes. For each case $c$ and metric $m$, we compute the paired difference as:

$$\Delta m_c = m_c^{(\text{agentic})} - m_c^{(\text{baseline})}$$

(31)

Then, we report the baseline and agentic sample means:

$$\bar{m}^{(\text{baseline})} = \frac{1}{N} \sum_{c=1}^{N} m_c^{(\text{baseline})}$$

(32)

$$\bar{m}^{(\text{agentic})} = \frac{1}{N} \sum_{c=1}^{N} m_c^{(\text{agentic})}$$

(33)

and the mean paired effect:

$$\overline{\Delta m} = \frac{1}{N} \sum_{c=1}^{N} \Delta m_c$$

(34)

Uncertainty in $\overline{\Delta m}$ is estimated via a case-resampling bootstrap with 5000 resamples. Each bootstrap replicate draws cases with replacement and recomputes $\overline{\Delta m}$. The 95% confidence interval is obtained from the 2.5th and 97.5th percentiles of the bootstrap distribution. To provide a standardised measure of effect magnitude that is comparable across metrics, we report paired Cohen's $d$, defined as:

$$d = \frac{\overline{\Delta m}}{s_{\Delta m}}$$

(35)

$$s_{\Delta m} = \sqrt{\frac{1}{N-1} \sum_{c=1}^{N} \left(\Delta m_c - \overline{\Delta m}\right)^2}$$

(36)
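The validation procedure can be sketched in plain Python as below. This is a minimal illustration under stated assumptions: the function names, the fixed seed, the nearest-rank percentile rule, and the example deltas are all hypothetical, and the study's own implementation may differ in these details.

```python
import random
from statistics import mean, stdev

def bootstrap_ci(deltas, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (case resampling)."""
    rng = random.Random(seed)
    n = len(deltas)
    # Each replicate draws n cases with replacement and recomputes the mean delta.
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    # Simple nearest-rank percentiles of the sorted bootstrap distribution.
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

def paired_cohens_d(deltas):
    """Mean paired difference divided by its sample standard deviation (N - 1)."""
    return mean(deltas) / stdev(deltas)

# Made-up per-case deltas (agentic minus baseline), for illustration only.
deltas = [0.10, 0.05, -0.02, 0.08, 0.12, 0.03, 0.07, 0.00]
ci_low, ci_high = bootstrap_ci(deltas)
print(f"mean={mean(deltas):.3f}, CI=[{ci_low:.3f}, {ci_high:.3f}], d={paired_cohens_d(deltas):.2f}")
```

Resampling whole cases keeps each agentic and baseline score paired, which is why the interval is computed on the deltas rather than on the two systems separately.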

The methodological flowchart is presented in Figure 2.

4. Results and Discussion

4.1. Case Set and Evaluation Coverage

Our study aimed to compare the performance of the agentic workflow against the non-agentic baselines across 360 price-only cases from Central and Eastern Europe spanning six bidding zones: Bulgaria, Czechia, Hungary, Poland, Romania, and Slovakia. Table 2 reports the distribution across zones.

4.2. Performance on Automatic Reliability Metrics

The comparison against non-agentic baselines yields mixed results. Relative to the monolithic LLM baseline, the full agentic workflow improves the composite reliability score by +0.067 (95% CI [+0.049, +0.085]) and increases calibration by +0.500. However, once stronger non-agentic baselines are introduced, the full agentic workflow no longer has the highest reliability score. Compared with the fixed-tool baseline, the full agentic workflow is lower by −0.092 (95% CI [−0.099, −0.085]), and compared with the statistical baseline, it is lower by −0.109 (95% CI [−0.114, −0.103]). Compared with the original monolithic baseline, the full agentic workflow is markedly more action-oriented (+0.778) and less generic (−0.219), while target-neighbourhood focus remains unchanged because both systems already achieve a ceiling value of 1.000. Against the stronger baselines, calibration and actionability are effectively tied at their maximum values under the present binary indicators, whereas the full agentic workflow shows slightly higher genericity (+0.047). Table 3 summarises the overall paired deltas for the full agentic workflow against all three baselines.

Figure 3 summarises the distribution of per-case reliability scores for baselines and agentic systems (Figure 3a) and visualises the paired per-case deltas (Figure 3b), highlighting both overall performance and case-level heterogeneity.

These results indicate that architectural design affects the reliability and practical usability of LLM-based diagnostics in 15 min electricity markets. Relative to the monolithic baseline, the agentic workflow yields better outputs under limited observability, particularly through stronger calibration. However, the comparison with stronger non-agentic baselines shows that these gains should be interpreted selectively rather than as uniform superiority across all reference systems.

4.3. Unsupported Causal Claims vs. “Hypothesis-Style” Mentions

An important requirement in price-only settings is to avoid unsupported exogenous attribution (e.g., presenting load as a factual driver without evidence). The full agentic workflow substantially increases exogenous mentions relative to every baseline: +29.592 against the monolithic baseline, +24.339 against the fixed-tool baseline, and +24.628 against the statistical baseline. However, this increase is not accompanied by fewer unsupported exogenous assertions. Instead, unsupported exogenous assertions are higher for the full agentic workflow than for all three baselines: +1.022 versus the monolithic baseline, +1.358 versus the fixed-tool baseline, and +1.625 versus the statistical baseline. Figure 4 visualises the trade-off between how often a system mentions exogenous concepts and how often it states them as unsupported assertions, showing baseline and agentic outputs as paired points per case.

The rise in exogenous mentions suggests that the agentic workflow produces richer interpretive narratives, but the concurrent increase in unsupported exogenous assertions indicates that richer explanation does not always translate into stronger evidential discipline. In price-only settings, this trade-off highlights the importance of clearly separating observed price patterns from low-confidence explanatory hypotheses. This result should therefore be interpreted as a limitation of the current agentic design rather than as an unqualified improvement. Although the workflow encourages broader hypothesis generation, the automatic metric indicates that some exogenous drivers are still formulated with insufficient evidential caution.

4.4. Target-Neighbourhood Grounding–Ceiling Effect

The paired difference was ∆ = 0.000 for the target-neighbourhood focus metric compared to all three baselines. This suggests a ceiling effect: all systems reliably referenced time stamps near the target interval under the current prompting and/or post-processing constraints. Figure 5 complements the binary target-neighbourhood metric by showing the distribution of absolute time distances between all time stamps mentioned in the evidence and the case's target time stamp.

4.5. Heterogeneity Across Bidding Zones

Relative to the monolithic LLM baseline, improvements in reliability were observed consistently across zones, with mean ∆ reliability scores ranging from +0.049 for Slovakia to +0.088 for Hungary. Calibration gains were also consistent, ranging from +0.400 for Slovakia to +0.583 for Hungary. Unsupported exogenous assertions were higher for the full agentic workflow in every zone, ranging from +0.767 in Poland to +1.300 in Romania. Exogenous mentions also increased sharply in every zone, with deltas between +29.033 and +30.633. Table 4 reports per-zone paired deltas (agentic minus monolithic LLM baseline) for the primary metrics together with 95% bootstrap confidence intervals, highlighting cross-zone heterogeneity in the observed effects.

Figure 6 shows cross-zone heterogeneity by plotting the mean per-zone improvement in reliability (agentic minus monolithic LLM baseline) together with 95% bootstrap confidence intervals.

The direction of the effect was stable across all six bidding zones, supporting the practical relevance of the architecture beyond a single market. This cross-zone consistency suggests that the approach generalises reasonably well within the CEE context, although the magnitude of improvement remains heterogeneous across zones.

4.6. Ablation Study: Component Contributions Against the Full Agentic Workflow

Table 5 presents the performance of the full agentic workflow relative to three ablation variants: no Planner, no Market Data, and no Verifier. To isolate the incremental contribution of each component, we report paired per-case differences for each metric, defined as $\Delta m = m_{\text{full}} - m_{\text{ablation}}$. A positive $\Delta m$ therefore indicates that the full workflow scores higher than the corresponding ablated variant on that metric. For metrics where higher is better, this suggests that the removed component contributes positively; for unsupported exogenous assertions and genericity, where lower is better, the interpretation is reversed.

The ablation results indicate that the Market-Data Agent is the dominant contributor to the measurable gains in evidence grounding and target-window traceability. Removing this component produces the largest degradation in reliability and reliability without calibration, and it eliminates the target-neighbourhood advantage, suggesting that deterministic structuring of the local price window is a core dependency of the workflow. By contrast, removing the Planner produces negligible changes across most metrics, indicating that its contribution is not strongly captured by the present reliability indicators. The Verifier also shows mixed effects: it reduces unsupported exogenous assertions and genericity, but its removal does not reduce calibration or target-neighbourhood focus and slightly increases the composite reliability score. Therefore, the ablation study suggests that the observed improvements should be interpreted as arising primarily from structured market-data extraction, while the Planner and Verifier provide more limited or metric-dependent benefits under the current evaluation framework.

4.7. Efficiency and Cost Analysis

The efficiency and cost analysis covers LLM usage, token consumption, and estimated cost in US dollars. As expected, the full agentic workflow is the most expensive of the evaluated systems, with a mean wall-clock time of 12.201 s per case, a mean active LLM time of 12.182 s, a mean of 9063.350 total tokens, and an estimated mean cost of USD 0.002 per case (Table 6). The original monolithic baseline and the fixed-tool baseline are much cheaper and faster, requiring about 2.135 s and 2.038 s per case, respectively.

The ablations reduce runtime and token usage to different extents. Removing the Market-Data Agent lowers mean wall-clock time to 7.021 s and mean total tokens to 3671.989. Removing the Planner lowers mean wall-clock time to 9.966 s and mean total tokens to 8235.067. Removing the Verifier lowers mean wall-clock time to 6.509 s and mean total tokens to 3932.997. Workflow overhead outside active LLM calls remains negligible for all LLM-based methods.

4.8. Human Evaluation

Automatic metrics quantify structural properties (time-stamp presence, calibration signals, unsupported-claim heuristics), but they do not fully capture whether the explanation is useful or credible for market analysts. Thus, we also report the results obtained following human evaluation in Table 7.

Overall, the practical contribution of the proposed workflow should be interpreted as diagnostic-risk reduction rather than proof of economic gains, causal validity, or trading profitability. The findings suggest that deploying LLMs in high-frequency electricity-market analytics requires architectural safeguards that enforce grounding and uncertainty calibration.

The Krippendorff $\alpha$ values indicate moderate inter-annotator agreement for overall quality, helpfulness, and correctness, but lower agreement for more subjective dimensions such as clarity and pairwise preference. Therefore, the human evaluation should be interpreted as supportive evidence of perceived usefulness rather than definitive proof of practical decision value.

4.9. Validation and Robustness Across All Metrics

Table 8 summarises baseline and agentic performance across all studied metrics, reporting mean values and the paired mean effect with 95% bootstrap confidence intervals and paired Cohen's $d$.

The validated estimates support the primary findings, indicating that the agentic workflow improves the reliability score and calibration. However, the agentic workflow also increases both exogenous mentions and unsupported exogenous assertions. In contrast, the target-neighbourhood metric remains unchanged due to a ceiling effect.

5. Conclusions

This study evaluated the performance of an agentic workflow against multiple non-agentic baselines for 15 min SDAC day-ahead diagnostic question answering based on price-only inputs. To isolate the architectural contribution, we held the underlying model and case information constant.

Relative to the monolithic baseline, the agentic workflow achieved higher composite reliability and stronger calibration across 360 price-only cases from six Central and Eastern European bidding zones. However, the comparison with stronger fixed-tool and statistical baselines shows that this advantage is not uniform across reference systems, and that deterministic preprocessing accounts for an important share of measurable reliability gains.

Human evaluation favoured the agentic outputs in overall quality, helpfulness, and correctness. Taken together, these results indicate that structured decomposition and verification can produce more grounded and cautious explanations in bounded price-only settings.

At the same time, the findings should be interpreted within the limits of the present study. Because the benchmark uses day-ahead prices alone and the reliability score is heuristic, this study does not establish full market causality, operational reliability, or trading value. Rather, it supports the narrower claim that agentic design improves descriptive diagnostic performance relative to a simple monolithic baseline under constrained observability.

Future work will incorporate additional public signals (e.g., ENTSO-E Transparency data on load, generation, outages, flows, and capacity) to enable stronger causal tests.

Author Contributions

Ș.A., S.-V.O.: conceptualisation, methodology, formal analysis, investigation, writing—original draft, writing—review and editing, visualisation, project administration. A.B., Ș.A.: validation, formal analysis, investigation, resources, data curation, writing—original draft, writing—review and editing, visualisation, supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a grant from the Ministry of Research, Innovation and Digitization, CNCS/CCCDI—UEFISCDI, project number COFUND-DUT-OPEN4CEC-1, within PNCDI IV. This project was funded by the UEFISCDI under the Driving Urban Transitions Partnership, which is co-funded by the European Commission.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

https://github.com/sener3/energy-agents.

Acknowledgments

This work was supported by a grant from the Ministry of Research, Innovation and Digitization, CNCS/CCCDI—UEFISCDI, project number COFUND-DUT-OPEN4CEC-1, within PNCDI IV. This project was funded by the UEFISCDI under the Driving Urban Transitions Partnership, which is co-funded by the European Commission.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Appendix A. Agent Prompts

Appendix A.1. Planner Agent Prompt

System message

You are the planner agent for EU day-ahead electricity-market analysis at 15 min resolution.

Decompose the user’s question into short analytic tasks. Output STRICT JSON.

User message template

Case:

- zone: {case[“zone”]}

- resolution: {case.get(“resolution”, “15 m”)}

- question: {case[“question”]}

Return JSON with keys: intent, target_window (optional), tasks (list of short steps).

Appendix A.2. Market-Data Agent Prompt

System message

You are the market-data agent. Summarise patterns based strictly on the numeric table provided.

Do not invent values. Output STRICT JSON.

User message template

Summary stats: {summary_stats}

Detected patterns (heuristic): {detected_patterns}

Return JSON with keys:

- summary_stats (object)

- detected_patterns (list of strings)

- features_table (keep as-is; do not rewrite)

IMPORTANT: Put the original features_table back in output unchanged.

Appendix A.3. Mechanism Agent Prompt

System message

You are the market-mechanism agent for EU day-ahead markets (SDAC, 15 min MTUs).

Given numeric patterns and relevant public events, propose plausible mechanisms and sanity checks.

Do not cite documents. Only reason from provided inputs. Output STRICT JSON.

User message template

Zone: {case[“zone”]}

Question: {case[“question”]}

Market data summary:

{state[“market_data”][“summary_stats”]}

Detected patterns:

{state[“market_data”][“detected_patterns”]}

Quant notes:

{state[“quant”][“key_effects”]}

{state[“quant”][“counterfactual_notes”]}

Return JSON with:

- plausibility_checks (list)

- potential_market_mechanisms (list)

- warnings (list)

Appendix A.4. Verifier Agent Prompt

System message

You are the critic/verifier agent. Your job: identify unsupported causal claims, numeric issues, and what data would be needed to be more certain. Be strict. Output STRICT JSON.

User message template

Zone: {case[“zone”]}

Question: {case[“question”]}

Inputs available:

- Market data summary: {state[“market_data”][“summary_stats”]}

- Detected patterns: {state[“market_data”][“detected_patterns”]}

- Quant: {state[“quant”]}

- Mechanism: {state[“mechanism”]}

Return JSON with keys:

- unsupported_claims (list)

- numeric_inconsistencies (list)

- missing_data_requests (list)

- revised_guidance (list) # how to phrase conclusions safely

Appendix A.5. Synthesizer Agent Prompt

System message

Return ONLY JSON matching the schema.

Do NOT change evidence lines. Copy them exactly.

Do NOT state exogenous causes (load/RES/outages/congestion/flows) as facts unless provided in inputs.

If only price series is provided, describe observed patterns and label any causes as hypotheses with low confidence.

User message template

Zone: {zone}

Question: {case.get(“question”, “”)}

Target time stamp: {target_ts}

KEEP these evidence lines EXACTLY (do not edit):

{evidence}

Observed patterns (put into top_drivers; do NOT make them causal):

{top_drivers}

Uncertainty (keep calibrated; no contradictions like “time stamps missing”):

{uncertainty}

Draft final_summary (include explicit low-confidence hypotheses, not facts):

{final_summary}

Return JSON with keys:

- top_drivers: list[str]

- evidence: list[str] (must be exactly as provided above)

- event_links: list (empty unless case.public_events is used)

- uncertainty: list[str]

- final_summary: string
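Because the Synthesizer prompt requires a fixed set of keys and verbatim evidence lines, a caller could check the returned JSON before scoring it. The following Python sketch is one possible check under stated assumptions: the function name and the error messages are hypothetical and are not taken from the study's code.

```python
import json

# Keys and types requested by the Synthesizer prompt above.
REQUIRED_KEYS = {
    "top_drivers": list,
    "evidence": list,
    "event_links": list,
    "uncertainty": list,
    "final_summary": str,
}

def validate_synthesizer_output(raw: str, provided_evidence: list[str]) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not isinstance(data[key], expected_type):
            problems.append(f"wrong type for {key}")
    # The prompt asks for the evidence lines to be copied exactly.
    if data.get("evidence") != provided_evidence:
        problems.append("evidence lines were altered")
    return problems

# Illustrative evidence line and model output, not taken from the benchmark.
evidence = ["2025-10-04T18:15Z price=210.5"]
good = json.dumps({
    "top_drivers": ["Sharp intra-hour increase"],
    "evidence": evidence,
    "event_links": [],
    "uncertainty": ["Price-only inputs; causes are hypotheses."],
    "final_summary": "A spike is visible at the target interval.",
})
print(validate_synthesizer_output(good, evidence))  # []
```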

References

[1] Mentens, L.; Peremans, H.; Springael, J.; Nimmegeers, P. Flexibility in short-term electricity markets for renewable integration and uncertainty mitigation: A comprehensive review. Smart Energy 2025, 18, 100183.
[2] Kiesel, R.; Paraschiv, F. Econometric analysis of 15-minute intraday electricity prices. Energy Econ. 2017, 64, 77–90.
[3] Parisio, L.; Bosco, B. Electricity prices and cross-border trade: Volume and strategy effects. Energy Econ. 2008, 30, 1760–1775.
[4] Handler, A.; Larsen, K.R.; Hackathorn, R. Large language models present new questions for decision support. Int. J. Inf. Manag. 2024, 79, 102811.
[5] Wang, B.; Zhou, Y.; Ge, L.; Kung, S.-Y. Large-model-based smart agent for time series anomaly detection in power systems. Expert Syst. Appl. 2026, 296, 128917.
[6] Dokas, I.M. From hallucinations to hazards: Benchmarking LLMs for hazard analysis in safety-critical systems. Saf. Sci. 2026, 194, 107056.
[7] Märkle-Huß, J.; Feuerriegel, S.; Neumann, D. Contract durations in the electricity market: Causal impact of 15 min trading on the EPEX SPOT market. Energy Econ. 2018, 69, 367–378.
[8] Cramer, E.; Witthaut, D.; Mitsos, A.; Dahmen, M. Multivariate probabilistic forecasting of intraday electricity prices using normalizing flows. Appl. Energy 2023, 346, 121370.
[9] Keles, D.; Dehler-Holland, J.; Densing, M.; Panos, E.; Hack, F. Cross-border effects in interconnected electricity markets—An analysis of the Swiss electricity prices. Energy Econ. 2020, 90, 104802.
[10] Fan, Z.; Ghaddar, B.; Wang, X.; Xing, L.; Zhang, Y.; Zhou, Z. Artificial intelligence for optimization: Unleashing the potential of parameter generation, model formulation, and solution methods. Eur. J. Oper. Res. 2025, 332, 1–30.
[11] Ndum, Z.N.; Tao, J.; Ford, J.; Liu, Y. Automating Monte Carlo simulations in nuclear engineering with domain knowledge-embedded large language model agents. Energy AI 2025, 21, 100555.
[12] Heistrene, L.; Machlev, R.; Perl, M.; Belikov, J.; Baimel, D.; Levy, K.; Mannor, S.; Levron, Y. Explainability-based Trust Algorithm for electricity price forecasting models. Energy AI 2023, 14, 100259.
[13] Zhang, C.; Zhang, J.; Lu, J.; Zhao, Y. Large Language Models Meet Energy Systems: Opportunities, Challenges, and Future Perspectives. Appl. Energy 2026, 403, 127076.
[14] Madani, S.; Tavasoli, A.; Khoshtarash Astaneh, Z.; Pineau, P.-O. Large Language Models Integration in Smart Grids. Energy Rep. 2025, 14, 1562–1577.
[15] Ziel, F.; Steinert, R. Electricity price forecasting using sale and purchase curves: The X-Model. Energy Econ. 2016, 59, 435–454.
[16] Fridgen, G.; Michaelis, A.; Rinck, M.; Schöpf, M.; Weibelzahl, M. The search for the perfect match: Aligning power-trading products to the energy transition. Energy Policy 2020, 144, 111523.
[17] Lavrinovics, E.; Biswas, R.; Bjerva, J.; Hose, K. Knowledge Graphs, Large Language Models, and Hallucinations: An NLP Perspective. J. Web Semant. 2025, 85, 100844.
[18] Majumder, S.; Dong, L.; Doudi, F.; Cai, Y.; Tian, C.; Kalathil, D.; Ding, K.; Thatte, A.A.; Li, N.; Xie, L. Exploring the capabilities and limitations of large language models in the electric energy sector. Joule 2024, 8, 1544–1549.
[19] Li, Y.; Ji, M.; Chen, J.; Wei, X.; Gu, X.; Tang, J. A large language model-based building operation and maintenance information query. Energy Build. 2025, 334, 115515.
[20] Gan, A.; Yu, H.; Zhang, K.; Liu, Q.; Yan, W.; Huang, Z.; Tong, S.; Hu, G. Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey. arXiv 2025, arXiv:2504.14891.
[21] Zhang, L.; Fu, X.; Li, Y.; Chen, J. Large language model-based agent Schema and library for automated building energy analysis and modeling. Autom. Constr. 2025, 176, 106244.
[22] Ni, B.; Cai, X.; Shen, Z.; Meng, Z.; Zhao, J.; Cheng, Y.; Gui, X. Intelli-Dispatch-SQL: An LLM-based agent for reliable Text-to-SQL in power dispatching. Energy AI 2025, 22, 100591.
[23] Zhang, L.; Ford, V.; Chen, Z.; Chen, J. Automatic building energy model development and debugging using large language models agentic workflow. Energy Build. 2025, 327, 115116.
[24] Sapkota, R.; Roumeliotis, K.I.; Karkee, M. AI Agents vs. Agentic AI: A Conceptual taxonomy, applications and challenges. Inf. Fusion 2026, 126, 103599.
[25] Chen, B.; Zhang, Z.; Langrené, N.; Zhu, S. Unleashing the potential of prompt engineering for large language models. Patterns 2025, 6, 101260.
[26] Xie, T.; Wan, Y.; Zhou, Y.; Huang, W.; Liu, Y.; Linghu, Q.; Wang, S.; Kit, C.; Grazian, C.; Zhang, W.; et al. Creation of a structured solar cell material dataset and performance prediction using large language models. Patterns 2024, 5, 100955.
[27] Rashkin, H.; Nikolaev, V.; Lamm, M.; Aroyo, L.; Collins, M.; Das, D.; Petrov, S.; Tomar, G.S.; Turc, I.; Reitter, D. Measuring Attribution in Natural Language Generation Models. arXiv 2021, arXiv:2112.12870.
[28] Theissler, A.; Spinnato, F.; Schlegel, U.; Guidotti, R. Explainable AI for Time Series Classification: A Review, Taxonomy and Research Directions. IEEE Access 2022, 10, 100700–100724.
[29] Park, T. Enhancing Anomaly Detection in Financial Markets with an LLM-based Multi-Agent Framework. arXiv 2024, arXiv:2403.19735.
[30] Li, X.; Zeng, Y.; Xing, X.; Xu, J.; Xu, X. QuantAgents: Towards Multi-agent Financial System via Simulated Trading. arXiv 2025, arXiv:2510.04643.
[31] Buster, G.; Pinchuk, P.; Barrons, J.; McKeever, R.; Levine, A.; Lopez, A. Supporting energy policy research with large language models: A case study in wind energy siting ordinances. Energy AI 2024, 18, 100431.
[32] Jiang, G.; Ma, Z.; Zhang, L.; Chen, J. Prompt engineering to inform large language model in automated building energy modeling. Energy 2025, 316, 134548.
[33] Zouhar, V.; Cui, P.; Sachan, M. How to Select Datapoints for Efficient Human Evaluation of NLG Models? arXiv 2025, arXiv:2501.18251.
[34] Zhang, L.; Chen, Z. Opportunities of applying Large Language Models in building energy sector. Renew. Sustain. Energy Rev. 2025, 214, 115558.
[35] Antonesi, G.; Cioara, T.; Anghel, I.; Michalakopoulos, V.; Sarmas, E.; Toderean, L. A systematic review of transformers and large language models in the energy sector: Towards agentic digital twins. Appl. Energy 2025, 401, 126670.
[36] Kath, C.; Ziel, F. The value of forecasts: Quantifying the economic gains of accurate quarter-hourly electricity price forecasts. Energy Econ. 2018, 76, 411–423.
[37] ENTSO-E. ENTSO-E Transparency Platform Web API Endpoint. Available online: https://transparency.entsoe.eu/ (accessed on 1 February 2026).

Figure 1. Agentic workflow overview.

Figure 2. Methodological flowchart.

Figure 3. (a) Distribution of per-case reliability scores (baseline versus agentic). (b) Paired per-case reliability differences (agentic minus baseline).

Figure 4. Unsupported exogenous assertions versus exogenous mentions.

Figure 5. Evidence for time-stamp proximity to the target interval (baseline versus agentic).

Figure 6. Forest plot of per-zone ∆ reliability score with 95% bootstrap confidence intervals.


Table 1. Comparison of LLM-based, agentic, and reliability-oriented work relevant to the proposed diagnostic workflow.

Ref	Objective	Methods	Brief Findings	Generative Models
[11]	Automate Monte Carlo nuclear simulations (FLUKA workflow)	Domain-embedded LLM agents, RAG, GUI integration	Reduced error resolution time from days to <1 min; high accuracy (<0.001% uncertainty)	LLM agents + RAG
[13]	Review LLM applications in energy systems	Literature review of 22 LLM types and enhancement techniques	Identifies 13 LLM roles; highlights opportunities and challenges	GPT, LLaMA, ChatGLM, etc.
[14]	Review LLM applications in power systems	Analysis of 30 real-world applications	LLMs enhance grid ops, markets, security; reliability challenges	LLMs
[17]	Mitigate hallucinations in LLMs using knowledge graphs	KG integration with LLMs; evaluation benchmarks	KGs enhance reliability but open challenges remain	LLM + knowledge graph
[21]	Standardise LLM agents in building energy sector	JSON-based agent schema + open-source library	Enables reproducible and shareable LLM agents	LLM agents
[22]	Improve Text-to-SQL for power dispatching	Agent-based LLM framework with intent recognition + SQL validation	Significant gains in execution accuracy; robust across LLMs	LLM agents
[23]	Automate building energy modelling (BEM)	Multi-agent LLM planning workflow	Outperforms naive prompting and manual modelling in accuracy and efficiency	LLM agent workflow
[24]	Differentiate AI agents vs. agentic AI	Conceptual taxonomy and comparative framework	Agentic AI enables dynamic decomposition, autonomy, collaboration	LLM-based agentic systems
[32]	Auto-building energy modelling via prompt engineering	Prompt engineering (few-shot, CoT strategies)	Effective ABEM without fine-tuning; compact LLMs viable	LLM (prompt-based)
[34]	Review LLM applications in building energy	Literature review + survey	LLMs enhance control, automation, compliance; challenges remain	LLMs
[35]	Review transformers and LLMs in energy; propose agentic digital twin	Review + conceptual framework	Introduces agentic digital twin integrating FMs	LLMs, foundation models

Table 2. Case composition by zone and case type.

Zone	Total Cases	Spike Cases	Ramp Cases	Target Time Range (UTC)
Bulgaria	60	30	30	2025-10-04 to 2026-04-15
Czechia	60	30	30	2025-10-01 to 2026-04-11
Hungary	60	30	30	2025-10-04 to 2026-04-15
Poland	60	30	30	2025-10-01 to 2026-04-16
Romania	60	30	30	2025-10-04 to 2026-04-15
Slovakia	60	30	30	2025-10-01 to 2026-04-19
All	360	180	180	2025-10-01 to 2026-04-19

Table 3. Overall metric deltas for the full agentic workflow against the revised baselines with 95% bootstrap confidence intervals.

Comparison	Reliability	Calibration	Unsupported Exogenous Assertions	Exogenous Mentions	Target-Neighbourhood	Reliability (No Calibration)	Genericity	Actionability
Agentic—Monolithic LLM	+0.067 [+0.049, +0.085]	+0.500 [+0.447, +0.553]	+1.022 [+0.831, +1.214]	+29.592 [+29.111, +30.086]	+0.000 [+0.000, +0.000]	−0.083 [−0.094, −0.073]	−0.219 [−0.272, −0.169]	+0.778 [+0.733, +0.822]
Agentic—Fixed-Tool LLM	−0.092 [−0.099, −0.085]	+0.000 [+0.000, +0.000]	+1.358 [+1.222, +1.506]	+24.339 [+23.908, +24.770]	+0.000 [+0.000, +0.000]	−0.092 [−0.099, −0.085]	+0.047 [+0.028, +0.069]	+0.000 [+0.000, +0.000]
Agentic—Statistical	−0.109 [−0.114, −0.103]	+0.000 [+0.000, +0.000]	+1.625 [+1.508, +1.756]	+24.628 [+24.192, +25.058]	+0.000 [+0.000, +0.000]	−0.109 [−0.114, −0.103]	+0.047 [+0.028, +0.069]	+0.000 [+0.000, +0.000]
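The bracketed intervals in Table 3 are described as 95% bootstrap confidence intervals on per-case metric deltas. A minimal sketch of one common way to obtain such an interval, a paired percentile bootstrap over cases, is given below. The function name `paired_bootstrap_ci`, the number of resamples, the seed, and the synthetic scores are illustrative assumptions and are not taken from the paper's implementation.

```python
import random
import statistics


def paired_bootstrap_ci(agentic, baseline, n_boot=2000, alpha=0.05, seed=0):
    """Return (mean_delta, lower, upper) for per-case deltas agentic - baseline.

    Assumed procedure: percentile bootstrap, resampling cases with replacement.
    """
    if len(agentic) != len(baseline) or not agentic:
        raise ValueError("need two non-empty score lists of equal length")
    # Pair the two systems case by case so each delta uses identical inputs.
    deltas = [a - b for a, b in zip(agentic, baseline)]
    rng = random.Random(seed)
    n = len(deltas)
    # Mean delta of each bootstrap resample, sorted for percentile lookup.
    boot_means = sorted(
        statistics.fmean(rng.choices(deltas, k=n)) for _ in range(n_boot)
    )
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return statistics.fmean(deltas), boot_means[lo_idx], boot_means[hi_idx]


# Synthetic binary scores for 360 cases (hypothetical, for illustration only).
agentic_scores = [1.0] * 360
baseline_scores = [1.0, 0.0] * 180
mean_delta, lower, upper = paired_bootstrap_ci(agentic_scores, baseline_scores)
print(f"{mean_delta:+.3f} [{lower:+.3f}, {upper:+.3f}]")
```

When every per-case delta is identical, each resample has the same mean and the interval collapses to a single point, which is consistent with the +0.000 [+0.000, +0.000] entries in the Target-Neighbourhood column.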

Table 4. Per-zone performance deltas (agentic minus monolithic LLM baseline) with 95% bootstrap confidence intervals.

Zone	N	∆ Reliability	∆ Calibration	∆ Unsupported Exogenous Assertions	∆ Exogenous Mentions	∆ Target-Neighbourhood
Bulgaria	60	+0.070 [+0.024, +0.116]	+0.500 [+0.367, +0.633]	+0.950 [+0.550, +1.300]	+30.633 [+29.717, +31.583]	+0.000 [+0.000, +0.000]
Czechia	60	+0.074 [+0.028, +0.118]	+0.550 [+0.433, +0.667]	+1.083 [+0.617, +1.567]	+29.033 [+27.767, +30.367]	+0.000 [+0.000, +0.000]
Hungary	60	+0.088 [+0.045, +0.130]	+0.583 [+0.450, +0.700]	+1.200 [+0.600, +1.833]	+29.433 [+28.217, +30.617]	+0.000 [+0.000, +0.000]
Poland	60	+0.064 [+0.018, +0.109]	+0.450 [+0.317, +0.583]	+0.767 [+0.267, +1.283]	+29.117 [+27.900, +30.300]	+0.000 [+0.000, +0.000]
Romania	60	+0.056 [+0.013, +0.100]	+0.517 [+0.400, +0.650]	+1.300 [+0.866, +1.733]	+30.183 [+29.033, +31.367]	+0.000 [+0.000, +0.000]
Slovakia	60	+0.049 [+0.008, +0.090]	+0.400 [+0.267, +0.533]	+0.833 [+0.483, +1.167]	+29.150 [+28.050, +30.217]	+0.000 [+0.000, +0.000]
All	360	+0.067 [+0.049, +0.085]	+0.500 [+0.447, +0.553]	+1.022 [+0.831, +1.214]	+29.592 [+29.111, +30.086]	+0.000 [+0.000, +0.000]

Table 5. Ablation performance differences relative to the full agentic workflow, with 95% bootstrap confidence intervals.

Comparison	Reliability	Calibration	Unsupported Exogenous Assertions	Exogenous Mentions	Target-Neighbourhood	Reliability (No Calibration)	Genericity	Actionability
Full agentic—no market data	+0.631 [+0.627, +0.634]	+0.000 [+0.000, +0.000]	+1.189 [+1.078, +1.317]	+11.497 [+10.892, +12.056]	+1.000 [+1.000, +1.000]	+0.931 [+0.927, +0.934]	+0.047 [+0.028, +0.069]	+0.000 [+0.000, +0.000]
Full agentic—no planner	−0.003 [−0.006, +0.001]	+0.000 [+0.000, +0.000]	+0.033 [−0.133, +0.197]	−0.167 [−0.803, +0.475]	+0.000 [+0.000, +0.000]	−0.003 [−0.006, +0.001]	+0.006 [−0.025, +0.036]	+0.000 [+0.000, +0.000]
Full agentic—no verifier	−0.006 [−0.009, −0.003]	+0.000 [+0.000, +0.000]	+0.217 [+0.108, +0.342]	+13.081 [+12.631, +13.517]	+0.000 [+0.000, +0.000]	−0.006 [−0.009, −0.003]	+0.042 [+0.019, +0.067]	+0.000 [+0.000, +0.000]

Table 6. Efficiency summary by method (mean values).

Method	Wall Time (s)	Active LLM Time (s)	Workflow Overhead (s)	Total Tokens	Estimated Cost (USD)
Agentic	12.201	12.182	0.019	9063.350	0.002
Agentic—no market data	7.021	7.021	0.001	3671.989	0.001
Agentic—no planner	9.966	9.948	0.018	8235.067	0.002
Agentic—no verifier	6.509	6.492	0.018	3932.997	0.001
Baseline	2.135	2.127	0.008	1305.942	<0.001
Baseline fixed tool	2.038	2.009	0.029	1765.295	<0.001
Baseline statistical	0.030	0.000	0.000	0.000	0.000

Table 7. Human evaluation ratings.

Metric	Agentic (Mean, 95% CI)	Baseline (Mean, 95% CI)	∆ Agentic − Baseline (95% CI)	Krippendorff's α
Overall (1–5)	3.77 [3.63, 3.90]	3.02 [2.83, 3.21]	0.74 [0.52, 0.97]	0.66
Helpfulness (1–5)	3.93 [3.78, 4.09]	2.88 [2.72, 3.02]	1.06 [0.86, 1.26]	0.61
Correctness (1–5)	3.93 [3.84, 4.02]	2.99 [2.80, 3.17]	0.94 [0.76, 1.14]	0.67
Clarity (1–5)	3.70 [3.61, 3.78]	3.21 [2.98, 3.43]	0.49 [0.24, 0.74]	0.43
Pairwise preference (win rate)	65.5% [52.2%, 77.8%]	6.7% [2.2%, 13.3%]	58.8% [41.1%, 75.6%]	0.48

Table 8. Monolithic LLM baseline and agentic framework performance.

Metric	Monolithic Baseline Mean	Agentic Mean	Mean ∆ (95% CI)	Cohen's d
Reliability score	+0.643	+0.710	+0.067 [+0.049, +0.085]	0.385
Price-only disclaimer presence (calibration)	+0.500	+1.000	+0.500 [+0.447, +0.553]	0.999
Unsupported exogenous assertions	+2.194	+3.217	+1.022 [+0.831, +1.214]	0.548
Exogenous mentions	+3.628	+33.219	+29.592 [+29.111, +30.086]	6.418
Target-neighbourhood focus	+1.000	+1.000	0.000 [0.000, 0.000]	0.000
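The Cohen's d column in Table 8 can be read as a standardized mean delta. The sketch below assumes the paired form (mean per-case delta divided by the sample standard deviation of the deltas); the table does not state which variant was used, so this is an assumption. The function name `paired_cohens_d` and the reconstructed delta vector are hypothetical. Under that assumption, a binary calibration metric on which the agentic workflow scores 1 in all 360 cases and the baseline scores 1 in half of them yields a value matching the 0.999 reported in the calibration row.

```python
import statistics


def paired_cohens_d(deltas):
    """Paired effect size: mean of per-case deltas over their sample SD."""
    if len(deltas) < 2:
        raise ValueError("need at least two paired deltas")
    sd = statistics.stdev(deltas)  # sample SD (n - 1 denominator)
    if sd == 0:
        # Convention assumed here: constant deltas are reported as d = 0.
        return 0.0
    return statistics.fmean(deltas) / sd


# Hypothetical reconstruction of the calibration row: 180 cases where only the
# agentic output carries the disclaimer (delta 1) and 180 where both do (delta 0).
calibration_deltas = [1.0] * 180 + [0.0] * 180
print(round(paired_cohens_d(calibration_deltas), 3))  # 0.999
```

The zero-variance branch mirrors the Target-neighbourhood focus row, where both systems score 1.000 on every case and the effect size is reported as 0.000.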

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Published by MDPI on behalf of the International Institute of Knowledge Innovation and Invention. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.

Share and Cite

MDPI and ACS Style

Ali, Ș.; Oprea, S.-V.; Bâra, A. Agentic AI for Price-Only 15 min SDAC Market Diagnostics in Central and Eastern Europe. Appl. Syst. Innov. 2026, 9, 93. https://doi.org/10.3390/asi9050093

