1. Background & Summary
Large language models (LLMs) are increasingly being deployed as autonomous agents that operate within closed-loop systems requiring reasoning, decision-making, tool invocation, and status updates. Early agent frameworks showed that explicitly intertwining reasoning steps with concrete actions lets models break down complex goals, use external tools, and adjust their behavior based on intermediate results [1,2]. Subsequent work extended these principles to persistent and open-ended interaction, showing that LLM-based agents can maintain an internal state, explore environments over long periods of time, and learn from iterative feedback rather than isolated messages [3,4]. More recent studies have consolidated these developments, emphasizing the rapid expansion of agent-based LLM architectures across a wide range of application domains [5].
As LLM agents gain autonomy, concerns have emerged regarding transparency, auditability, and causal accountability. In particular, understanding why an agent produced a specific action or decision requires more than access to its final result. Provenance frameworks have long addressed similar challenges in distributed and computing systems by formally representing the origins, dependencies, and transformations of data and actions [6,7]. In AI systems, this structured provenance is closely linked to interpretability, as it allows for the inspection of internal decision-making processes rather than relying solely on post-hoc explanations [8]. From a human-centered perspective, principles of transparent interaction further highlight the importance of traceability of system behavior, particularly when users must trust, supervise, or correct autonomous agents [9].
Beyond interpretability, the notion of causality plays a central role in assessing how responsible an agent's reasoning actually is for its actions. Causal models differentiate between correlations and true explanatory factors, providing a theoretical basis for assessing whether observed outcomes arise from valid decision-making processes [10]. This distinction is particularly relevant for LLM-based agents, which are known to generate fluent but potentially misleading reasoning traces. Recent studies have raised concerns about fairness, accountability, and governance in synthetic and AI-generated datasets, emphasizing the need for explicit documentation of generation processes and normative assumptions [11,12,13]. Large-scale audits of data provenance and attribution in AI systems further underscore the growing demand for transparency and verifiability of data traceability throughout the lifecycle of models and datasets [14].
At the same time, the use of synthetic data has expanded rapidly as a means of evaluating, comparing, and stress-testing the resilience of intelligent systems under controlled conditions. Analyses of synthetic data generation methods underscore their value for reproducibility and targeted evaluation, while cautioning against unexamined biases and hidden artifacts [15,16]. As part of the broader debate on AI reliability, reproducibility, traceability, and accountability are now recognized as fundamental requirements rather than optional features [17].
These challenges are particularly pressing for LLM-based agents. Empirical tests have shown that, despite impressive language proficiency, LLMs often struggle with multi-step planning and fail to produce executable sequences of actions in complex tasks [18]. Furthermore, the reasoning traces generated by language models may appear coherent while remaining causally unrelated to the actions actually taken, calling into question their reliability as explanations [19]. Recent evaluation efforts have therefore begun to explicitly treat LLMs as agents, measuring not only task success but also decision consistency, tool use, and behavioral reliability over time [20].
Within this context, the dataset introduced in this work is designed not as an open-ended collection of examples, but as a structured and finite set of analytically distinct agent interaction scenarios. Each scenario targets a specific class of agent behavior, such as decision rollback, tool failure recovery, memory inconsistency, or provenance divergence, enabling systematic coverage of agent-level reasoning phenomena rather than ad hoc illustration.
Motivated by these gaps, the dataset presented in this work provides a structured, provenance-focused collection of synthetic agent interaction traces designed to support systematic analysis of decision-making, tool invocation, memory updates, and causal attribution in LLM-based agents. The dataset is organized around a predefined set of scenarios, each representing a distinct and non-redundant analytical case under a fixed interaction schema. By combining deterministic generation, explicit provenance graphs, and scenario-level reference data, the dataset facilitates reproducible research on agent auditability, reasoning fidelity, and reliable autonomous behavior, without conflating behavioral analysis with downstream task performance.
To situate AgentSec within the broader landscape of agent evaluation benchmarks, tool-use datasets, and provenance frameworks, Table 1 summarizes key structural differences. Unlike performance-oriented benchmarks, AgentSec is designed as a structured diagnostic dataset emphasizing explicit decision modeling, memory state evolution, and W3C PROV-DM compliant provenance graphs under deterministic generation.
Despite recent advances in agent benchmarking and tool-use evaluation, several important limitations remain. Most existing benchmarks focus primarily on task performance or final outputs rather than the internal decision-making process. Tool-use datasets often log execution outcomes without explicitly modeling intermediate reasoning states or memory evolution. Similarly, provenance-aware logging frameworks capture event traces but do not typically encode structured decision nodes aligned with agent reasoning steps. As a result, systematic auditing of reasoning fidelity, causal accountability, and internal state transitions remains challenging. These gaps motivate the need for structured, schema-validated datasets that explicitly capture decision traces, tool interactions, and provenance links in a unified and reproducible format.
2. Methods
2.1. Dataset Design and Generation Pipeline
The dataset was designed to capture fine-grained traces of agentic reasoning and action execution in large language model-based agents, with an explicit focus on decision transparency, tool interaction outcomes, memory evolution, and provenance consistency. The overall data generation pipeline is illustrated in Figure 1, which summarizes the interaction flow, the logged trace layers, and the resulting provenance graph.
Each dataset instance originates from a user-issued request that triggers an agent interaction session. Upon receiving the request, the agent generates an internal decision trace representing the reasoning state prior to action execution. This trace records alternative options considered by the agent, the rationale guiding the selected option, and structured comparative elements (e.g., pros and cons of considered options). These decision traces are explicitly logged before any external tool is invoked, allowing reasoning steps to be analyzed independently from execution outcomes, as shown in Figure 1A.
Following the decision stage, the agent may invoke one or more external tools. Tool calls are logged together with their execution status, including success, failure, or fallback behavior. This separation between decision intent and tool execution enables the dataset to represent cases where plausible reasoning leads to incorrect or incomplete actions, as well as recovery or retry strategies. The outcomes of tool calls directly influence the agent’s internal memory or state, which is subsequently updated, left unchanged, or marked as conflicted depending on the execution result.
All interaction components generated during a session are recorded as structured traces, including user and assistant interactions, decision traces, tool calls, and memory or state updates. These trace types form a sequential log that mirrors the temporal execution of the agent pipeline, as illustrated in Figure 1B. This layered logging approach ensures that intermediate reasoning steps, side effects, and final responses remain observable and auditable.
To ensure trace-level accountability and causal interpretability, each session is additionally represented as a provenance graph following a structured entity–relation model. As shown in Figure 1C, user interactions, decision traces, tool calls, memory updates, and assistant responses are modeled as distinct entities connected by explicit causal relations such as triggered, caused, derived from, and used by. This provenance representation enables the reconstruction of causal chains across reasoning and action steps and supports downstream analyses of faithfulness, error propagation, and decision reversibility.
Dataset instances were generated using predefined interaction scenarios encoded in a manifest file, which specifies the intended scientific objective, expected agent behavior, and success or failure conditions for each scenario. The generation process produces self-contained session records that conform to a shared JSON schema, ensuring structural consistency across all examples while preserving variability in agent behavior and outcomes.
2.2. Scenario Specification and Session Generation
Agent interaction sessions in the dataset are generated from a predefined set of scenarios designed to exercise distinct reasoning, tool-use, and memory behaviors. Each scenario specifies a controlled experimental setting in which an agent is expected to process a user request, reason over available options, interact with external tools if required, and update its internal state accordingly. The scenario-driven design ensures that all recorded sessions are intentional, reproducible, and aligned with clearly defined scientific objectives.
Scenarios are defined in a centralized manifest file that serves as the authoritative specification layer for the dataset. Each scenario entry includes a unique identifier and version number, together with a concise description of its scientific intent. This intent describes the targeted agent capability or failure mode, such as successful tool execution, tool failure recovery, memory conflicts, hallucination detection, or provenance branching. In addition, scenarios explicitly define the expected agent behavior, along with formal success criteria and failure conditions. These elements enable systematic validation of whether an interaction session conforms to its intended design.
For the present dataset release, the scenario specification defines a closed and finite set of 30 scenario instances. Each scenario is designed as a structurally distinct realization of agent behavior under the fixed session schema, rather than as a parametric variation of a shared template.
The scenario set was constructed to ensure coverage of predefined behavioral categories, including variations in decision trace structure, tool-call outcomes, memory operations, and provenance graph topology. Within each category, scenarios differ analytically by their internal execution paths and state transitions, ensuring that no two scenarios encode the same behavioral pattern. This design enables focused analysis of agent decision-making behaviors while maintaining a compact and interpretable dataset.
For each scenario, one or more interaction sessions are generated by instantiating a user request that matches the scenario specification. The agent processes this request following the interaction and reasoning pipeline described in Figure 1, producing a sequence of decision traces, tool calls, memory updates, and final responses. Sessions may terminate successfully, partially succeed with recovery actions, or fail according to the predefined criteria. Importantly, all outcomes are retained in the dataset, allowing downstream users to study both correct and erroneous agent behaviors.
Session generation is deterministic at the structural level but allows controlled variability in reasoning paths and execution outcomes. While the schema and trace types remain fixed across all sessions, the content of decision rationales, the ordering of tool calls, and the resulting memory states may differ depending on the scenario and agent behavior. This balance between structural consistency and behavioral diversity is intended to support comparative analysis across scenarios without introducing uncontrolled noise.
Each generated session is serialized as an independent JSON file that conforms to a shared session schema. The schema enforces the presence of core components, including session metadata, interaction logs, decision traces, tool call records, memory traces, and provenance information. By validating all sessions against the same schema, the dataset guarantees interoperability and simplifies automated parsing and analysis.
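As an illustration of the schema contract described here, the required top-level components can be checked with a lightweight structural audit. The released dataset uses an AJV-based JSON Schema pipeline; the stdlib sketch below only approximates that check, and the `schema_version` location follows the description given later in this paper.

```python
# Required top-level components of a session record, as listed in this paper.
REQUIRED_FIELDS = [
    "session_metadata", "interactions", "decision_traces",
    "tool_calls", "memory_traces", "provenance",
]

def check_session_structure(session: dict) -> list[str]:
    """Return a list of structural errors (empty if the session passes)."""
    errors = [f"missing required field: {f}" for f in REQUIRED_FIELDS
              if f not in session]
    # Every released session must declare the schema contract it was
    # generated under (see the schema_version requirement).
    meta = session.get("session_metadata", {})
    if "schema_version" not in meta:
        errors.append("session_metadata.schema_version is missing")
    return errors
```

This is a pre-flight sketch only; full validation of nested objects and data types is delegated to the frozen JSON Schema.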
To maintain traceability between high-level scenario definitions and concrete session data, each session record includes explicit references to the originating scenario identifier and version. This linkage enables users to group sessions by scenario type, reproduce specific experimental settings, and evaluate agent behavior relative to the intended design objectives. Together, the scenario specification and session generation process provide a controlled yet flexible framework for capturing agentic reasoning and action traces at scale.
In addition to the released static dataset, the repository provides a deterministic session generation script (dataset_generator.py) that programmatically instantiates predefined scenario types under the frozen schema. The generator produces structurally valid session files using fixed random seeds and deterministic identifiers, ensuring reproducibility of generated traces.
Each generated session is automatically validated against the project’s JSON schema using an AJV-based validation pipeline prior to being written to disk. While the current release defines a finite and curated set of 30 analytically distinct scenarios, the generator supports controlled extension by allowing developers to implement additional scenario types within the same structural contract.
This design separates scenario specification from structural enforcement, enabling AgentSec to function not only as a static diagnostic dataset but also as a reproducible session-generation framework for controlled auditing experiments.
2.3. Provenance Modeling and Causal Trace Representation
To ensure that each agent interaction can be examined, audited, and reused in a scientifically meaningful manner, the dataset adopts an explicit provenance-centric representation of agent behavior. Provenance is treated not as auxiliary metadata but as a first-class structural component that links user inputs, internal reasoning steps, tool invocations, memory updates, and final responses into a coherent causal trace.
The provenance model is grounded in the principles of causal dependency and traceability formalized in the PROV Data Model (PROV-DM), which provides a standardized vocabulary for describing entities, activities, and their relationships [6]. Each session is represented as a directed acyclic graph in which nodes correspond to semantically distinct events occurring during agent execution, and edges encode causal relations between these events. This design enables the reconstruction of how an agent response emerged from a specific sequence of decisions and actions, rather than treating the response as an isolated output.
Within each session, provenance entities are instantiated for user interactions, decision traces, tool calls, memory or state updates, and assistant responses. These entities are connected through explicit relations such as triggered, caused, derived from, and used by, reflecting the temporal and causal ordering of the interaction flow. For example, a user interaction entity triggers one or more decision trace entities, which may in turn cause tool call activities whose outputs are derived into memory updates and ultimately used by the assistant response. Alternative reasoning paths, including rejected options or fallback strategies, are preserved as distinct provenance entities to avoid collapsing divergent internal behaviors into a single linear narrative.
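The directed-acyclic property of these per-session graphs can be verified mechanically. The sketch below uses Python's standard-library graphlib; the dictionary shape of the links (`source`, `relation`, `target`) is an illustrative assumption, not the frozen schema.

```python
from graphlib import TopologicalSorter, CycleError

def check_acyclic(entities, links):
    """Verify that provenance links form a DAG over the declared entities.

    `entities` is a list of entity ids; `links` is a list of
    {"source": ..., "relation": ..., "target": ...} dicts (assumed shape).
    Returns (is_dag, topological_order).
    """
    # Map each node to the set of nodes it causally depends on.
    graph = {e: set() for e in entities}
    for link in links:
        graph[link["target"]].add(link["source"])
    try:
        return True, list(TopologicalSorter(graph).static_order())
    except CycleError:
        return False, []
```

A successful topological sort doubles as a replay order: entities can be re-derived in exactly the sequence the causal chain prescribes.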
This causal representation is intentionally aligned with foundational work on provenance as a mechanism for explaining complex computational processes [7]. By encoding not only what occurred but also why and how it occurred, the dataset supports downstream analyses concerned with interpretability, debugging, and accountability. In particular, separating decision traces from tool executions makes it possible to distinguish reasoning failures from execution failures, a distinction that is critical when analyzing agent reliability.
The provenance graph also reflects concepts from causal inference, where understanding cause–effect relationships is essential for explanation and intervention [10]. Rather than assuming that reasoning traces are faithful by construction, the dataset records the minimal causal structure required to assess whether an action or response genuinely depended on a prior reasoning step. This design choice directly addresses concerns raised in prior work showing that language models may produce fluent but unfaithful explanations that are not causally connected to their actions [19]. In practical terms, the provenance structure explicitly encodes a decision-to-action chain linking (i) a decision trace entity, (ii) its stated rationale and selected option, and (iii) the subsequent tool calls, memory updates, and final response entities that depend on it. Because each relation is materialized as an explicit edge in the directed acyclic graph, it becomes possible to verify whether a final action or response is causally connected to its declared decision basis. This structured representation supports the analysis of potential “leaps in reasoning,” defined as cases where an action appears without a corresponding causal decision node, or where the declared rationale does not propagate to the executed outcome. By preserving both accepted and rejected decision branches as distinct provenance entities, the dataset enables fine-grained inspection of reasoning fidelity and causal consistency.
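The first class of “leap in reasoning” (an action with no causal decision node) can be detected directly from the edge list. In the sketch below, the edge field names and the prefix-based node predicates are illustrative assumptions; real analyses would derive node types from the entity records themselves.

```python
def find_unjustified_actions(links, is_action, is_decision):
    """Flag action nodes lacking any incoming edge from a decision node.

    `links` is a list of {"source": ..., "target": ...} dicts (assumed
    shape); `is_action` and `is_decision` are predicates on node ids.
    """
    actions = {l["target"] for l in links if is_action(l["target"])}
    justified = {l["target"] for l in links
                 if is_action(l["target"]) and is_decision(l["source"])}
    # Actions never caused by a decision node are potential reasoning leaps.
    return sorted(actions - justified)
```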
All provenance information is stored explicitly within each session file under a dedicated provenance field, alongside indexed lists of entities and links. This representation allows researchers to reconstruct execution graphs programmatically, compare alternative agent behaviors across scenarios, and evaluate causal consistency at scale. By formalizing provenance at the dataset level, the proposed design enables systematic study of agent reasoning faithfulness and causal accountability without imposing assumptions about the correctness or optimality of the agent’s behavior.
Recent work such as PROV-AGENT [21] also proposes extending W3C PROV for modeling agentic workflows in dynamic and distributed environments. PROV-AGENT focuses on runtime provenance capture, representing model invocations, prompts, and tool interactions as first-class provenance components within operational systems.
AgentSec differs in scope and objective. Rather than proposing a runtime instrumentation framework, AgentSec provides a structured and deterministic dataset of agent interaction traces designed for auditing, structural validation, and causal analysis. In this sense, AgentSec is complementary to frameworks such as PROV-AGENT: while PROV-AGENT defines how agent provenance may be captured in live systems, AgentSec offers controlled scenario instances that can serve as diagnostic inputs for testing provenance reconstruction, reasoning-trace verification, and auditing pipelines.
The dataset schema is compatible with W3C PROV principles and can be mapped to extended provenance frameworks. However, AgentSec does not depend on any specific runtime provenance architecture and remains framework-agnostic by design.
2.4. Schema Definition and Data Validation
To ensure structural consistency, reproducibility, and long-term usability, the dataset is governed by a formally defined and immutable schema that specifies the exact structure of each session file. The schema defines the admissible fields, data types, required components, and permissible relationships among user interactions, decision traces, tool calls, memory updates, assistant responses, and provenance elements. By enforcing a strict schema, the dataset avoids ambiguity in interpretation and enables automated validation across all scenarios.
Each session is represented as a single JSON document whose structure is constrained by a frozen JSON Schema specification. Core components—such as session metadata, ordered interaction turns, and provenance graphs—are explicitly declared as required fields. Nested objects, including decision traces and tool call records, are defined with clearly scoped properties to prevent undocumented or implicit fields. This design choice ensures that every session adheres to the same semantic contract, facilitating comparative analysis across scenarios and enabling reliable downstream parsing by third-party tools.
Schema validation is performed using a deterministic validation pipeline based on a fixed validator configuration. All session files are validated against the schema prior to release, and validation is treated as a mandatory acceptance criterion rather than a best-effort check. Any deviation from the schema—such as missing required fields, invalid data types, or broken provenance references—results in rejection of the session file. This guarantees that all distributed data instances are schema-compliant and structurally sound.
Special attention is given to provenance integrity. The schema enforces referential consistency by requiring that every provenance link references valid entity identifiers defined within the same session. This constraint prevents dangling or ambiguous causal links and ensures that provenance graphs can be reconstructed without external assumptions. In addition, enumerated relation types are restricted to a controlled vocabulary to avoid semantic drift in causal interpretation.
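The two integrity constraints above, referential consistency and a controlled relation vocabulary, translate into a simple audit. The snake_case relation spellings below are an assumed serialization of the relation names used in this paper, and the entity shape (`{"id": ...}`) is illustrative.

```python
# Assumed serialization of the controlled relation vocabulary.
ALLOWED_RELATIONS = {"triggered", "caused", "derived_from", "used_by"}

def audit_provenance(entities, links):
    """Check that every link endpoint exists and every relation is allowed."""
    ids = {e["id"] for e in entities}
    problems = []
    for link in links:
        for endpoint in ("source", "target"):
            if link[endpoint] not in ids:
                problems.append(f"dangling {endpoint}: {link[endpoint]}")
        if link["relation"] not in ALLOWED_RELATIONS:
            problems.append(f"unknown relation: {link['relation']}")
    return problems
```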
To support transparency and reproducibility, the schema is explicitly frozen at the dataset release version and is not modified retroactively. Any future extensions or revisions are required to follow a versioned schema evolution policy, ensuring backward compatibility or explicit version differentiation. By combining a strict schema definition with systematic validation, the dataset provides a robust foundation for reproducible research on agent behavior, reasoning processes, and causal provenance analysis.
For long-term maintenance and structural traceability, each session instance includes a required schema_version field within session_metadata. This ensures that every released session explicitly declares the schema contract under which it was generated, supporting backward compatibility and future dataset evolution.
3. Data Records
3.1. Dataset Organization
An overview of the dataset directory structure, the linkage between scenario specifications and session files, and the internal structure of a representative session record is provided in Figure 2.
As illustrated in Figure 2a, the dataset root directory contains documentation files (README.md, DATASET_METADATA.md, and REVIEWER_CHECKLIST.md), a global scenario manifest (dataset_manifest.json), schema definitions, validation tools, and example session records. The schema/ directory contains the frozen JSON schema (session_schema.json) that defines the structure and constraints of all session files. The examples/ directory contains one JSON file per scenario instance, each representing a complete agent interaction session.
3.2. Scenario Manifest
The AgentSec dataset includes a total of 30 scenario instances, each represented as an individual JSON file and referenced explicitly in a central dataset manifest. Each scenario corresponds to a complete and self-contained agent session trace, capturing decision-making processes, tool interactions, memory operations, and provenance relations under a fixed and frozen schema.
The scenarios are intentionally organized to cover a predefined taxonomy of agent behavioral patterns, including tool success and failure modes, fallback strategies, memory conflicts and overwrites, decision rollbacks, and provenance branching structures. An overview of the scenario categories and their distribution within the dataset is provided in Table 2. For the current dataset release, each behavioral category is represented by at least two distinct scenarios, ensuring analytical coverage while avoiding structural redundancy. No scenario is a parametric variation of another; instead, each contributes a unique combination of decision trace structure, tool-call sequence, memory behavior, and provenance topology. By “parametric variation,” we refer to scenarios that would share the same underlying decision structure and provenance topology while differing only in superficial parameters such as prompt wording, tool identifiers, or surface-level content. In contrast, the scenarios included in AgentSec differ at the structural level: they encode distinct causal graphs, decision branching patterns, tool invocation sequences, and memory state transitions. Thus, non-parametric here denotes structural non-equivalence rather than variation in textual content.
3.3. Relation to Real-World Agent Complexity
Although the dataset is synthetic and structurally controlled, the scenario taxonomy was deliberately aligned with behavioral patterns documented in contemporary LLM-based agent frameworks. The selected categories—including cascading tool failures, decision rollbacks, memory conflicts, fallback strategies, and provenance branching—reflect typical execution irregularities observed in practical agent systems such as ReAct-style agents, tool-augmented LLM pipelines, and multi-step planning architectures. Rather than modeling domain-specific tasks, the scenarios abstract recurring structural complexities of real agent execution flows. This design ensures that the dataset captures representative agent-level reasoning dynamics while maintaining deterministic reproducibility.
The dataset manifest enumerates all scenario files and provides a stable reference for reproducibility and downstream analysis.
The dataset size (n = 30) is intentionally constrained to represent structurally distinct and analytically non-redundant behavioral patterns rather than statistically scaled repetitions. Each scenario encodes a unique causal configuration of decision traces, tool interactions, memory updates, and provenance topology. Expanding the dataset through parametric variations of existing scenarios would increase volume without increasing structural diversity. As such, the dataset is positioned as a controlled diagnostic suite for logic auditing and provenance reconstruction, rather than as a benchmark optimized for model training or statistical performance evaluation.
Because the scenario taxonomy is finite and explicitly enumerated (Table 2), the 30 scenarios collectively cover all predefined structural categories of agent decision behaviors defined in this release. No additional structural behavior class remains unrepresented under the current schema. Thus, dataset sufficiency is defined in terms of complete coverage of the behavioral taxonomy rather than statistical sampling density.
3.4. Dataset Micro-Statistics
To further justify the analytical scale of the dataset, we report structural micro-statistics computed across all 30 scenario files. These statistics are derived directly from the released JSON session records using the provided compute_micro_stats.py script. Rather than emphasizing volume, AgentSec prioritizes structural diversity and causal coverage.
Table 3 summarizes the minimum, average, and maximum values observed across key structural indicators.
Although the dataset contains 30 scenarios, these statistics demonstrate meaningful structural variability in reasoning depth, tool interaction patterns, and provenance topology. AgentSec is therefore positioned as a diagnostic and unit-test suite for auditing agent logic and causal consistency, rather than as a large-scale statistical benchmark for model training.
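A simplified version of such micro-statistics can be derived directly from the parsed session records. The sketch below is illustrative and is not the released compute_micro_stats.py; it assumes only the component field names documented in Section 3.5.

```python
def micro_stats(sessions):
    """Min/avg/max of per-session decision-trace and tool-call counts."""
    def summary(values):
        return {"min": min(values),
                "avg": round(sum(values) / len(values), 2),
                "max": max(values)}
    return {
        "decision_traces": summary([len(s["decision_traces"]) for s in sessions]),
        "tool_calls": summary([len(s["tool_calls"]) for s in sessions]),
    }
```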
3.5. Session Records
Each session file represents a single, self-contained agent interaction episode and is stored in JSON format. A sliced view of a representative session file is shown in Figure 2c. All session files conform strictly to the frozen schema defined in schema/session_schema.json.
A session record consists of the following main components:
session_metadata, containing deterministic identifiers, timestamps, and configuration parameters;
interactions, representing the user and assistant message turns;
decision_traces, documenting intermediate reasoning steps and alternative options considered by the agent;
tool_calls, recording all external tool invocations and their outcomes;
memory_traces, capturing state updates, conflicts, or no-op memory operations;
provenance, encoding a causal graph of entities and links that describe how decisions, actions, and responses are related.
All identifiers within a session are deterministic and reproducible across regenerations of the dataset given the same seed.
Figure 3 presents a concrete visualization of the provenance graph extracted directly from a representative scenario file. Unlike the conceptual illustration in Figure 1C, this graph is reconstructed strictly from the structured provenance field stored in the session JSON record. The visualization highlights branching and reconvergence patterns between decision traces and tool calls, illustrating the structured causal topology that underpins the dataset.
Example Provenance Query
To illustrate the operational usability of the provenance structure, the dataset enables programmatic queries over decision nodes and their causal dependencies. For example, a researcher may retrieve all upstream tool outputs contributing to a specific decision node:
# Example pseudo-query over a session JSON
target_node = "decision_3"

# Retrieve provenance edges pointing into the target node
upstream_entities = [
    edge["source"]
    for edge in provenance_edges
    if edge["target"] == target_node
]

print(upstream_entities)
Such queries allow reconstruction of causal chains linking tool outputs, memory updates, and decision nodes. This supports auditing tasks such as detecting unsupported reasoning steps, identifying missing dependencies, or analyzing propagation of faulty tool responses through the decision trace.
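Going one step beyond direct predecessors, the full set of transitive causal ancestors of a node can be collected with a breadth-first traversal. The edge shape (`source`/`target` dicts) matches the pseudo-query above and remains an illustrative assumption.

```python
from collections import deque

def upstream_closure(provenance_edges, target):
    """Return all transitive causal ancestors of `target`.

    `provenance_edges` is a list of {"source": ..., "target": ...} dicts.
    """
    # Index edges by target for O(1) predecessor lookup.
    incoming = {}
    for edge in provenance_edges:
        incoming.setdefault(edge["target"], []).append(edge["source"])
    seen, queue = set(), deque([target])
    while queue:
        node = queue.popleft()
        for parent in incoming.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

Applied to a final response entity, the closure yields every user input, decision, tool call, and memory update that the response causally depends on.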
3.6. Validation Tools
The tools/ directory contains auxiliary scripts used to support dataset integrity. These include a consistency audit script that verifies the alignment between the manifest and session files, and a schema validation script based on AJV that checks all session records against the frozen JSON schema. These tools are provided to facilitate independent verification but are not required to use the dataset.
4. Technical Validation
The technical validity of the AgentSec dataset was ensured through a combination of formal schema validation, cross-file consistency checks, and deterministic regeneration controls. These procedures were designed to verify that all released data files conform to the declared specifications, remain internally coherent, and can be reliably reused by external researchers.
First, all session files were validated against a frozen JSON schema that explicitly defines the structure of agent interactions, including turn ordering, decision records, tool invocations, memory updates, and provenance links. Schema validation was performed using a standard JSON Schema validator, ensuring that mandatory fields are present, data types are respected, and no undeclared attributes appear in the dataset. The schema file is versioned and immutable for the current dataset release, preventing silent structural drift across versions.
Second, a dataset-wide consistency audit was conducted to verify the alignment between scenario specifications, session files, and manifest metadata. This audit checks that each session correctly references an existing scenario identifier, that scenario versions match those declared in the global manifest, and that the reported number of interaction turns corresponds to the actual content of each session file. All provenance references were additionally verified to ensure that each causal link points to a valid entity within the same session context, with no missing or dangling identifiers.
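A minimal version of this audit can be sketched as follows. The record layout (`scenario_id`, `declared_turns`, `entities`, `provenance_edges`) is an assumption for illustration; the frozen schema in the release is authoritative:

```python
def audit_session(session, known_scenarios):
    """Return a list of consistency problems found in one session record."""
    problems = []
    # 1. The session must reference an existing scenario identifier.
    if session["scenario_id"] not in known_scenarios:
        problems.append("unknown scenario_id")
    # 2. The declared turn count must match the actual content.
    if session["declared_turns"] != len(session["turns"]):
        problems.append("turn count mismatch")
    # 3. Every provenance edge must point at entities defined in the session.
    entities = {e["id"] for e in session["entities"]}
    for edge in session["provenance_edges"]:
        if edge["source"] not in entities or edge["target"] not in entities:
            problems.append("dangling reference: %s -> %s" % (edge["source"], edge["target"]))
    return problems

session = {
    "scenario_id": "sc_01",
    "declared_turns": 2,
    "turns": ["t1", "t2"],
    "entities": [{"id": "tool_call_1"}, {"id": "decision_1"}],
    "provenance_edges": [{"source": "tool_call_1", "target": "decision_1"}],
}
print(audit_session(session, {"sc_01"}))  # an empty list means the session passed
```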
Third, reproducibility was supported through deterministic dataset generation procedures. All synthetic sessions were produced using fixed random seeds and explicitly documented generation parameters. This guarantees that the dataset can be regenerated identically from the same inputs and configuration, allowing independent verification of the released files. The generation logic itself was not modified after dataset freezing, and no post hoc manual edits were applied to the session data.
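The determinism property amounts to a simple invariant: regenerating from the same seed and parameters must yield byte-identical output. The generator below is a stand-in used only to make the invariant concrete, not the actual generation logic of the dataset:

```python
import json
import random

def generate_session(seed: int, n_turns: int) -> str:
    """Stand-in generator: fully determined by (seed, n_turns)."""
    rng = random.Random(seed)  # local RNG; no dependence on global state
    turns = [{"turn": i, "value": rng.randint(0, 99)} for i in range(n_turns)]
    # sort_keys makes the serialization itself deterministic as well.
    return json.dumps({"seed": seed, "turns": turns}, sort_keys=True)

# Regeneration with identical inputs yields an identical serialization.
assert generate_session(7, 3) == generate_session(7, 3)
```

Using a local `random.Random` instance and a canonical serialization (sorted keys) are the two details that make such regeneration checks reliable in practice.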
Finally, version control and dataset freezing policies were enforced prior to public release. Once the technical validation steps were completed, all scenario files, schemas, and example sessions were locked, and a single versioned snapshot was archived on Zenodo. Subsequent changes, if any, will require a new dataset version with an independent identifier, ensuring long-term traceability and citation stability.
Together, these validation procedures provide assurance that the AgentSec dataset is structurally sound, internally consistent, and suitable for reuse in research on LLM-based agents, decision tracing, and provenance-aware evaluation.
4.1. Quantitative Characterization of the Dataset
To complement schema-level validation, we provide a quantitative characterization of the released AgentSec dataset to make its structural complexity and behavioral diversity explicit. The current release contains 30 scenarios (30 sessions), comprising 67 decision nodes and 45 tool calls in total. Decision-making depth varies across sessions, with decision-trace lengths ranging from 0 to 5 (average 2.23), indicating that the dataset includes both minimal/no-decision cases and multi-step decision processes. Tool usage is also heterogeneous: the average number of tool calls per session is 1.50, with 33 successful tool calls and 11 failures (73.3% success rate), enabling analysis of both nominal and failure-mode executions. For provenance complexity, the resulting provenance graphs exhibit an average depth of 4.53 edges (maximum 7) and a branching factor close to 1 on average (maximum 3), reflecting mostly linear causal chains with occasional branching structures.
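Since all metrics are derived from the released JSON files, they can be recomputed independently. A per-session sketch follows; the field names (`decisions`, `tool_calls`, `status`) are illustrative assumptions rather than the frozen schema:

```python
def session_stats(session: dict) -> dict:
    """Compute basic trace statistics for one session record."""
    tool_calls = session.get("tool_calls", [])
    successes = sum(1 for c in tool_calls if c["status"] == "success")
    return {
        "decisions": len(session.get("decisions", [])),
        "tool_calls": len(tool_calls),
        "tool_success_rate": successes / len(tool_calls) if tool_calls else None,
    }

session = {
    "decisions": [{"id": "d1"}, {"id": "d2"}],
    "tool_calls": [{"status": "success"}, {"status": "failure"}],
}
print(session_stats(session))
```

Aggregating these per-session dictionaries across the 30 released sessions reproduces the dataset-level figures (e.g., average decision-trace length and tool success rate).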
Table 4 summarizes these statistics, and Figure 4 visualizes the distribution of decision-trace lengths across sessions. All metrics are computed directly from the released JSON files.
4.2. Illustrative Empirical Evaluation of Reasoning Fidelity
To demonstrate the practical utility of AgentSec for auditing reasoning fidelity, we conducted a small illustrative experiment using two controlled session variants derived from the same scenario specification. In the first variant (faithful case), the decision selected by the agent is fully supported by the declared justification and tool outputs recorded in the provenance graph. In the second variant (unfaithful case), the selected decision is intentionally misaligned with the documented tool result, creating a causal inconsistency. Using the structured provenance links, we programmatically evaluated whether each decision node had a traceable and semantically aligned supporting basis in prior tool calls or memory updates. In the unfaithful variant, the provenance chain revealed a mismatch between the selected_option field and the referenced tool output, resulting in a detectable inconsistency in the causal graph. This simple experiment illustrates how AgentSec enables automated detection of reasoning misalignment through graph traversal and structural verification, thereby supporting research on reasoning fidelity and causal accountability in LLM-based agents.
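The core of this check is a structural comparison between a decision's `selected_option` and the tool output it cites. The sketch below illustrates the idea under assumed field names (`supported_by`, `recommended_option`); only `selected_option` appears in the dataset description above:

```python
def is_faithful(decision: dict, tool_outputs: dict) -> bool:
    """A decision is faithful if the tool output it references
    actually recommends the selected option."""
    referenced = tool_outputs.get(decision["supported_by"])
    if referenced is None:
        return False  # dangling reference: no supporting basis at all
    return referenced["recommended_option"] == decision["selected_option"]

tool_outputs = {"tool_call_1": {"recommended_option": "A"}}
faithful = {"supported_by": "tool_call_1", "selected_option": "A"}
unfaithful = {"supported_by": "tool_call_1", "selected_option": "B"}
print(is_faithful(faithful, tool_outputs), is_faithful(unfaithful, tool_outputs))
# prints: True False
```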
5. Usage Notes
The AgentSec dataset is intended for research on large language model-based agents, with a particular focus on decision tracing, tool usage, provenance modeling, and causal consistency. The dataset can be used without any proprietary software and relies exclusively on standard data formats.
Each session file represents a complete and self-contained interaction trace and can be processed independently. Researchers are encouraged to begin by loading the session JSON files together with the associated schema definition to ensure structural compliance before analysis. Standard JSON parsing libraries are sufficient to inspect and manipulate the data.
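In practice, loading the traces requires only the standard library. The sketch below uses a temporary directory and a minimal record so it is self-contained; the file naming and fields are illustrative, not the released layout:

```python
import json
import tempfile
from pathlib import Path

def load_sessions(directory: Path) -> list:
    """Load every session JSON file in a directory (one trace per file)."""
    return [
        json.loads(p.read_text(encoding="utf-8"))
        for p in sorted(directory.glob("*.json"))
    ]

# Self-contained demonstration with a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "session_001.json").write_text(
        json.dumps({"session_id": "s1", "turns": []}), encoding="utf-8"
    )
    sessions = load_sessions(Path(tmp))
    print(len(sessions), sessions[0]["session_id"])  # prints: 1 s1
```

Sorting the paths gives a deterministic load order, which keeps downstream analyses reproducible across file systems.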
The dataset is particularly suitable for tasks such as comparing declared reasoning traces with executed actions, analyzing tool invocation patterns, studying recovery behaviors after failed actions, and examining provenance graphs that link decisions, tools, and outcomes. Because success and failure conditions are explicitly encoded in the dataset manifest, automated evaluation pipelines can be implemented without requiring manual annotation or interpretation.
The explicit inclusion of failure-mode scenarios also enables counterfactual analysis. Researchers may reverse-engineer erroneous decision paths, simulate alternative choices, and evaluate how modifications at specific decision nodes alter downstream actions and provenance structures. Although the dataset is not designed as a training benchmark, its structured failure traces can support robustness-oriented methodological research.
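A counterfactual variant can be produced by deep-copying a session and altering a single decision node, leaving the original trace intact for comparison. The field names below (`decisions`, `id`, `selected_option`) are illustrative assumptions:

```python
import copy

def counterfactual_session(session: dict, node_id: str, new_option: str) -> dict:
    """Return a deep copy of a session with one decision node altered,
    leaving the original trace untouched."""
    alt = copy.deepcopy(session)
    for decision in alt["decisions"]:
        if decision["id"] == node_id:
            decision["selected_option"] = new_option
    return alt

session = {"decisions": [{"id": "d1", "selected_option": "A"}]}
alt = counterfactual_session(session, "d1", "B")
print(session["decisions"][0]["selected_option"],
      alt["decisions"][0]["selected_option"])
# prints: A B
```

Re-running a provenance audit on the altered copy then reveals which downstream entities the modified decision would have affected.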
When using the dataset for benchmarking or comparative studies, users should ensure that evaluations are performed at the session level rather than aggregating partial traces across scenarios, as each scenario encodes a distinct decision structure and objective. Mixing sessions from different scenarios without accounting for their specifications may lead to misleading interpretations.
The size and scope of the AgentSec dataset are intentionally constrained to prioritize analytical clarity over volume. Each scenario represents a distinct decision-making configuration, defined by a unique combination of decision traces, tool invocation outcomes, memory behaviors, and provenance structures. Expanding the dataset without introducing new structural patterns would increase redundancy rather than analytical value. As a result, the dataset is designed to support focused, scenario-level analysis rather than large-scale statistical aggregation.
The dataset is fully synthetic and structurally deterministic. While this design ensures clean ground-truth causal relationships, reproducible execution traces, and controlled behavioral isolation, it does not capture the ambiguity, noise, or distributional variability of real-world user interactions and deployment environments. AgentSec is therefore not intended to model large-scale operational agent behavior or ecological complexity. Instead, it is positioned as a controlled diagnostic and unit-test suite for validating auditing pipelines, provenance reconstruction methods, reasoning-trace consistency analyses, and structural causal verification under precisely defined experimental conditions.
While the current release does not explicitly model adversarial attacks such as false data injection or communication disruptions, the structured separation between decision traces, tool outputs, memory updates, and provenance links enables controlled simulation of such perturbations. Researchers may inject manipulated tool responses or altered memory states and analyze how these disruptions propagate through the causal graph. This design makes the dataset suitable as a diagnostic framework for studying robustness and fault propagation in LLM-based agent architectures.
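Such a perturbation can be injected in the same copy-and-mutate style, tagging the manipulated tool call so that fault propagation remains traceable through the causal graph. The `output` and `injected` fields are illustrative assumptions:

```python
import copy

def inject_faulty_tool_output(session: dict, call_id: str, payload) -> dict:
    """Replace one tool call's output with a manipulated payload and
    flag the perturbation for downstream fault-propagation analysis."""
    perturbed = copy.deepcopy(session)
    for call in perturbed["tool_calls"]:
        if call["id"] == call_id:
            call["output"] = payload
            call["injected"] = True  # marker so audits can track the fault
    return perturbed

session = {"tool_calls": [{"id": "tc1", "output": {"temp": 21}}]}
perturbed = inject_faulty_tool_output(session, "tc1", {"temp": 999})
print(perturbed["tool_calls"][0]["output"])  # prints: {'temp': 999}
```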
No special hardware is required to work with the dataset. All files can be processed in standard computing environments, and no network access is needed once the data have been downloaded.
6. Limitations
AgentSec is intentionally synthetic and structurally deterministic. This design enables precise control over causal structure, clean ground-truth provenance alignment, and full reproducibility of execution traces. However, these strengths also define its boundaries.
First, the dataset does not reflect the ambiguity, inconsistency, or distributional variability typical of real-world user prompts. Real deployments involve incomplete instructions, shifting objectives, adversarial inputs, and contextual noise that are not represented in the current scenarios.
Second, the controlled scenario design isolates specific behavioral patterns (e.g., memory conflicts, decision rollbacks, tool misuse) in a decomposed and analyzable form. While this isolation is beneficial for structural auditing, it does not capture emergent multi-factor interactions that may occur in complex operational environments.
Third, AgentSec is not suitable for evaluating statistical robustness, large-scale behavioral generalization, or performance under open-world uncertainty. It is not a benchmark for measuring model accuracy, safety under adversarial pressure, or deployment-level reliability.
Instead, the dataset is explicitly designed as a diagnostic and unit-test suite for verifying auditing pipelines, provenance reconstruction methods, reasoning-trace consistency mechanisms, and structural causal validation procedures under controlled experimental conditions.
Future extensions may incorporate semi-synthetic or hybrid interaction traces to bridge structural auditing with ecological variability, while preserving reproducibility guarantees.
7. Data Availability
The AgentSec dataset described in this Data Descriptor is publicly available via Zenodo under the persistent DOI https://doi.org/10.5281/zenodo.18369965 (accessed on 1 March 2026). The archived release provides a frozen and versioned snapshot of the dataset corresponding to this manuscript and includes all scenario files, the dataset manifest, schema definitions, validation tools, and accompanying documentation required for data reuse and verification.
The source repository for the dataset is hosted on GitHub at https://github.com/yasserhmimou9/AgentSec-Dataset (accessed on 1 March 2026). The GitHub repository provides open access to the dataset structure, supporting materials, and future updates, while Zenodo ensures long-term preservation, citation stability, and reproducibility.