1. Introduction
The smart grid is the next generation of power systems, promising a wide variety of benefits in efficiency, reliability, and safety. Some of the key features of smart grids are agile reconfigurability and dynamic optimization of grid operations, rapid detection of and response to faults in the system, integration of renewable power sources with conventional fossil fuels, and pervasive monitoring of power systems. An important step in Industrial Revolution 4.0 is the digitization of Industry 3.0, bringing together Information and Communication Technology (ICT) and Operational Technology (OT) for the control, monitoring, and maintenance of physical processes [
1,
2,
3]. In the case of power systems, smart grids are the emerging point of Industrial Revolution 4.0. IT systems in smart grid technology include different ICT servers, Communication Technology, supervisory and monitoring infrastructure, etc. On the other hand, OT comprises Programmable Logic Controllers (PLCs), Remote Terminal Units (RTUs), Intelligent Electronic Devices (IEDs), Phasor Measurement Units (PMUs), relays, Human–Machine Interfaces (HMIs), etc. [
Seamlessly combining ICT and OT can provide efficient methods to augment the capabilities of smart grids. However, the resulting expanded connectivity and remote programmability/reconfigurability can broaden the attack surface and increase cybersecurity vulnerabilities [
5,
6]. Recent examples of attacks on power grids and industrial control systems (ICS) show the crucial importance of these systems and the extensive impacts that can result if the security of these systems is compromised. For example, the coordinated cyber attack on the Ukrainian power grid in 2015 caused a power loss for 6 h, affecting about 225,000 customers in Ukraine [
7]. As another example, the well-known Stuxnet malware, which exploited zero-day vulnerabilities in covert attacks, was used to attack nuclear industrial systems in 2010 [
8]. Other recent examples of sophisticated attacks on ICS/CPS include TRITON [
9] (also known as Trisis and HatMan), which targeted safety instrumented systems in a petrochemical plant in 2017; Industroyer2/Sandworm [
10], which targeted IEC-104 SCADA communications in Ukrainian power infrastructure in 2022; the Oldsmar water treatment facility attack [
11] in Florida, where an attacker gained remote access and attempted to alter the sodium hydroxide (lye) levels to dangerous values; FrostyGoop [
12], which caused heating system service disruptions in Ukraine in 2024 by injecting unauthorized Modbus TCP commands; Fuxnet/Blackjack [
13], which performed flood attacks on sensor networks in a Russian underground water and sewage and communications infrastructure company in 2024; ELECTRUM-linked attack [
14] on the power grid in Poland in 2025; and the emergence of modular “malware frameworks” such as Pipedream [
15] (often described as a “Swiss army knife” of malware) that are designed to be able to target a wide range of industrial control devices. These incidents underscore the evolving ICS/CPS threat landscape and the urgent need for robust security frameworks for smart grids and other ICS/CPS.
Industrial control systems, including smart grids, are complex systems consisting of embedded device nodes interconnected by communication networks and interfaced to physical processes. Relevant devices for smart grids include, for example, MTUs (Master Terminal Units), RTUs, RTACs (Real-Time Automation Controllers), PLCs, relays, PMUs, PDCs (Phasor Data Concentrators), and HMIs. These devices communicate using various protocols such as DNP3, IEEE C37.118, Modbus, IEC61850, SEL Fast Msg, OPC-UA (Open Platform Communications Unified Architecture), IEC 60870-5, etc., which are all industrial communication standards primarily developed several decades ago without a specific focus on security. The temporal evolution of the cyber–physical system is governed by the device behaviors (e.g., logic/rules programmed on the devices), the communications/interactions between the devices, and the physical dynamics of the system. A malicious manipulation of a device behavior or of the communication network (e.g., device spoofing, packet injection or manipulation, etc.) by an intruder/adversary can lead to catastrophic consequences in the power grid, including destabilization of the grid and damage to physical components in the grid. Therefore, techniques to monitor these interactions and processes in real time and flag any anomalies with low latency and high accuracy would be vitally beneficial to the security of the grid. Such a technology should not only consider the basic communication specifications between the grid devices but also whether the observed temporal processes are consistent with the expected behaviors and dynamics of the grid. Since the smart grid is a composite of the controller devices, power system dynamics, network communication channels, and the interplay between these components, a comprehensive monitoring system should be able to track the temporal behaviors in real time and detect any abnormalities.
Analogously to anomaly monitoring systems for other cyber–physical systems [
16] and autonomous vehicles [
17], such a comprehensive monitoring system should span controller-focused anomaly monitoring (CFAM) for validating behaviors of controllers and other devices in the grid, network-focused anomaly monitoring (NFAM) for validating network-level transactions and statistics, system-focused anomaly monitoring (SFAM) for validating temporal process dynamics, and cross-domain anomaly monitoring (CDAM) for validating the interplay between controller/system/network components (
Figure 1).
To address this crucial need, in this paper we develop a real-time integrity verification methodology (TRAPS—Tracking Real-time Anomalies in Cyber-Physical Systems) that detects abnormal behaviors in the power system through continuous dynamic behavioral analysis of the cyber–physical system. TRAPS is a Signal Temporal Logic (STL) condition-based anomaly monitoring and intrusion detection system that processes real-time observations from communication network packet captures through a hierarchical semantic extraction and Directed Acyclic Graph (DAG)-based tag processing pipeline, transforming them into time series of semantic events and observations, collectively referred to as semantic tags. These semantic tag time series are then evaluated against expected temporal properties to detect and localize anomalies, which are visualized in a dashboard graphical user interface (GUI). The contributions of this paper are as follows:
Development of a flexible and scalable end-to-end framework that can directly operate on streaming raw network traffic, map back in real time to the spatio-temporal operation of the CPS, and evaluate integrity relative to high-level semantic behavioral properties of the CPS.
Development of a DAG-based hierarchical tag processing framework enabling iterative transformations of time series of semantic tags and an integrity verification framework that enables real-time monitoring of configurable STL-based behavioral specifications defined over the hierarchically computed time series of semantic tags.
Demonstration of the efficacy of the proposed methodology with several attack scenarios on a hardware-in-the-loop (HIL) testbed, which includes both physical and virtual power devices, interfaced to a dynamic power system simulator.
The proposed TRAPS approach essentially enables a distributed sequence-of-events monitor that outputs time-series observations of semantic variables, which can be iteratively processed according to configurable DAG-based definitions and dynamically queried for integrity verification against configurable STL-based behavioral specifications. Furthermore, a key aspect of the approach is that the set of behavioral specifications to be monitored is open and extensible, so as to be customizable to meet the needs of a particular CPS. The generality and flexibility of the behavioral specification structure facilitate the encoding of various types of expected semantic properties of the CPS (e.g., device control logic implying that some events should be followed by other events, physics models indicating changes in physical signals based on the operation of devices such as relays, communication configurations implying that separate communications between certain device pairs should have either temporal or value-based dependencies, etc.) to any desired level of detail. The hierarchical tag processing and STL-based integrity verification engine provide a real-time semantic view of the CPS operation, on top of which any STL-based behavioral specifications can be configured for monitoring, and the proposed framework enables a fully automated data flow from raw traffic to real-time semantic integrity verification and alert generation based on the configured specifications.
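As a minimal sketch of this hierarchical tag idea (the tag names and derivation rules below are hypothetical illustrations, not the configuration used by TRAPS), derived semantic tags can be computed by propagating raw observations through a dependency DAG in topological order:

```python
from graphlib import TopologicalSorter

# Hypothetical tag DAG: each derived tag lists the tags it depends on and a
# function that computes its value from their current values.
tag_defs = {
    # derived_tag: (parent_tags, compute_fn)
    "breaker_open":  (("relay_status",), lambda s: s == 0),
    "freq_dev":      (("bus_freq",),     lambda f: abs(f - 60.0)),
    "trip_expected": (("breaker_open", "freq_dev"),
                      lambda open_, dev: open_ and dev > 0.5),
}

def evaluate_tags(raw_obs):
    """Propagate one time step of raw observations through the tag DAG."""
    values = dict(raw_obs)  # raw semantic tags parsed from network traffic
    order = TopologicalSorter(
        {t: set(parents) for t, (parents, _) in tag_defs.items()}
    ).static_order()
    for tag in order:
        if tag in tag_defs:  # raw tags appear in the order but have no rule
            parents, fn = tag_defs[tag]
            values[tag] = fn(*(values[p] for p in parents))
    return values

obs = evaluate_tags({"relay_status": 0, "bus_freq": 59.2})
# here both breaker_open and trip_expected evaluate to True
```

Because derivation rules are data rather than code, new tags can be added or redefined at configuration time, which is the property the open/extensible specification structure relies on.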
This paper is organized as follows. The related literature is reviewed in
Section 2. We describe the proposed methodology in
Section 3, including the threat model and problem formulation (
Section 3.1), network packet parsing (
Section 3.2), observation set extraction and processing (
Section 3.3 and
Section 3.4), integrity verification (
Section 3.5), anomaly localization (
Section 3.6), and the visualization dashboard (
Section 3.7).
Section 4 presents experimental results on a hardware-in-the-loop testbed demonstrating the efficacy of the proposed framework under various attack scenarios.
Section 5 provides concluding remarks with a summary and directions for future work.
2. Related Works
Several types of attacks on smart grid systems have been considered in the literature (e.g., [
5,
6,
18,
19,
20]), including measurement integrity attacks, false data injection (FDI), false command injection (FCI), control logic modification, and denial of service (DoS) attacks, including time-delay and jamming attacks. Coordinated attacks, which use complicated multi-stage patterns to increase attack efficacy, and cascading attacks, which exploit a single point of failure to propagate effects to other parts of the system, have also been considered [
3,
21]. To defend against these various attacks, defenses (generally termed Intrusion Detection Systems or Anomaly Detection Systems) have been developed using a variety of approaches and underlying techniques, as discussed below.
Signature-based methods use a “blacklist” of signatures of prior attack/anomaly events to detect intrusions of the same category, but cannot detect unknown/zero-day attacks with new signatures. Examples of methods of this type include [
18], which detected machine-in-the-middle (MITM) attacks on the DNP3 protocol through Snort rules; [
22], which applied ML-based fusion of cyber and physical sensors to detect FCI/FDI attacks on the DNP3 protocol; and [
23], which applied Suricata, an open-source network IDS, to detect anomalies in network protocols, including IEC61850, based on software rules.
Specification-based methods model the system’s behavior using its specifications, especially at the network level, and analyze the observed behavior such as the communication protocol details to detect abnormalities. Typical limitations in the available methods of this type include support for only specific protocols, limited scalability, and the ability to monitor only specific types of behaviors/events. Specification/behavior-based IDS have been developed considering various protocols such as IEEE C37.118 [
24,
25], DNP3 [
26], IEC60870 [
27], IEC61850 [
28,
29], and Modbus/TCP [
30]. Combinations of multiple IDS approaches have also been studied, such as the combination of signature-based and model-based methods using Snort in [
27]; combination of access-control, protocol whitelisting, model-based, and multi-parameter-based detection methods in [
31]; combinations of host-based and network-based detection methods in [
32]; and the combination of access-control, protocol-based, and behavioral whitelists [
33]. Process-aware monitoring methods based on knowledge of the underlying CPS behavior have been studied [
21]. Other specification-based approaches in the literature include the monitoring of values of process variables in terms of rules defined in a specific description language in [
34], state tracking methods [
35,
36], and sequence of events monitoring using a Discrete-Time Markov Chain model in [
37]. Moving-target defense methods against false data injection attacks in the context of state estimators have been studied [
38,
39,
40,
41] based on dynamically altering some aspects of the system configuration, such as changing line impedances using distributed flexible AC transmission system (D-FACTS) devices. In the broad context of CPS across different domains, STL-based methods have been developed for monitoring and analysis for both continuous-time and discrete-time signals. Recent advances in this direction include cumulative-time extensions to STL and associated monitoring algorithms [
42] based on evaluating the sum of all timesteps for which an STL formula is true, informative online monitoring for STL [
43] based on causation and relevance evaluations to provide more informative context for STL violations, and the formally proved compilation of STL fragments into synchronous observer implementations [
44]. STL methods have also been developed for hybrid systems with both discrete and continuous components using SMT-based robust model-checking techniques [
45]. Formal control system approaches have been leveraged to enable model predictive monitoring of dynamical systems under STL specifications [
46] by assuming that the observed state signal traces are generated by a dynamical system with a known model but unknown control signal. Tool support for STL-based monitoring is also maturing, with tools such as RTAMT [
47], which provides online/offline monitor implementations designed to integrate with the Robot Operating System (ROS) and with Matlab/Simulink. Recent work has also begun to address privacy/security aspects of monitoring itself, such as oblivious monitoring for discrete-time STL using fully homomorphic encryption [
48]. The proposed TRAPS framework is synergistic and complementary with these works, which address monitors over explicit system signals/models; the primary focus of TRAPS is the complementary problem of extracting semantically meaningful time-series observations from heterogeneous OT network traffic and then monitoring open and configurable STL behavioral specifications over those derived tags for real-time integrity verification of power grid CPS.
Learning-based methods use data-driven machine learning to detect anomalous or abnormal patterns in the system’s traffic/signals. Challenges when applying these methods include the difficulty of obtaining extensive training datasets, the lack of explainability of ML prediction results, which can also make it difficult to localize the underlying causes of detected anomalies and guide appropriate remediations, and limited generalizability under changes in data distribution (domain shift). Learning-based methods [
49,
50,
51] have been applied, for example, to the detection of false data injection attacks [
52,
53,
54,
55], jamming [
56], and time-delay attacks [
57] and anomaly detection in transmission protective relays [
58], wide-area protection systems [
59], distribution systems [
60,
61], and Modbus communications [
62]. Host-based anomaly detectors using analog/digital side channels such as system calls and Hardware Performance Counters (HPCs) have been developed (e.g., [
63,
64,
65]). Recent work has also explored
hybrid approaches that combine formal/specification-based monitoring with data-driven learning, aiming to retain the interpretability and well-definedness provided by the specification approach while leveraging data-driven methods to improve adaptivity and robustness in complex CPS environments. For example, hybrid knowledge-driven and data-driven techniques have been proposed to synthesize run-time monitors for CPS by combining prior domain knowledge with models learned from data in [
66]. Learning-based time-series anomaly detection methods have also been applied to extract informative representations from raw signals. Approaches in this direction include self-supervised disentangled reconstruction-based representation learning for time-series anomaly detection by learning both recurrent/consistent patterns and irregular variations in the latent space [
67] and transformer-based architectures combined with probabilistic filtering to identify anomalous CPS signals [
68] by capturing the dynamics and temporal dependencies in CPS within a dynamic state-space model. In related research, methods have also been developed to make monitoring more efficient and adaptive at run-time, e.g., self-triggered strategies for STL monitoring tasks that reduce monitoring effort (and thereby computational burden and energy expenditure) when the system appears to be behaving nominally [
69].
Recent work has addressed real-time processing methods that operate on streaming data to transform raw telemetry and event streams into structured higher-level representations (
semantic streaming) to improve interpretability and facilitate downstream applications to CPS monitoring and intrusion/anomaly detection. For example, in [
70], a semantic analysis approach combined with self-supervised embeddings and geospatial context features was proposed to enhance intrusion detection for IoT and sensor networks by extracting more meaningful representations from streaming observations. In a broader streaming analytics context under varying data distributions (concept drift), comparative evaluations and benchmark-driven analyses have been addressed in [
71,
72], studying practical trade-offs among different anomaly detection methods in online settings. Also, in a broader CPS context, semantic event-handling architectures aimed at building explainable CPS (ExpCPS) have been developed in [
73] by structuring event processing pipelines around semantic abstractions rather than raw signals based on a semantic event-handling module that is designed to be integrated into ExpCPS architectures across different domains.
In contrast to the prior methods discussed above, the key benefits of TRAPS are: a unified, protocol-agnostic, and extensible framework for monitoring at-scale heterogeneous communication traffic against an open and extensible set of behavioral properties using hierarchical semantic tag processing and STL-based monitoring, providing real-time validation of the entire cyber–physical loop; an end-to-end pipeline from raw network packet capture to protocol-agnostic semantic parsing, semantic tag extraction, situational awareness, integrity verification, anomaly detection, localization, and visualization; and computational simplicity and scalability enabling real-time processing of high-bandwidth traffic and simultaneous monitoring of several hundred tags and STL conditions. Unlike approaches that focus on specific attack types (e.g., delays, DoS, false data injection), TRAPS verifies semantic event sequences across multiple devices and domains, enabling dynamic end-to-end auditing of behavioral specifications that can span correlations, causations, and other CPS behavioral properties.
TRAPS is synergistic and complementary with emerging CPS/OT security trends such as the incorporation of verifiable data flow and data query techniques. In particular, blockchain-based mechanisms [
74], such as verifiable decentralized identities and data integrity, can enable secure and verifiable data flows for the cyber–physical Web 3.0 [
74]. Also, advanced data query systems, such as VQL (Verifiable Query Layer) [
75] and TeLEx (Two-Level Learned Index for Secure Queries) [
76], offer enhanced efficiency and security for querying large-scale distributed and blockchain systems. While VQL provides cloud-deployable, efficient, and cryptographically verifiable data query services for blockchain systems, TeLEx introduces a two-level learned indexing methodology for enabling rich query functionalities on enclave-based blockchain systems by leveraging Trusted Execution Environment (TEE) and oblivious RAM techniques. These emerging techniques, which utilize the immutable and decentralized nature of blockchain systems to facilitate robust and secure data management/queries within CPS, can provide vital benefits synergistic with CPS monitoring solutions, such as TRAPS, by ensuring trust in the data shared/queried across distributed components. Real-time monitoring frameworks such as TRAPS complement blockchain technologies by enabling the dynamic validation of behavioral properties to flag anomalies, intruders, or other compromises of the CPS. These technologies collectively strengthen defenses against dynamic threat actors with emerging adversarial tactics, thereby offering robust defense-in-depth solutions.
3. The Proposed Method
The proposed framework is based on the pipeline architecture shown in
Figure 2, where raw data is processed through a sequence of layers to extract time-series observations of hierarchically defined semantic tags, which are then used for anomaly detection relative to a set of STL-based behavioral specifications and for visualization in an operator dashboard. The threat model and problem formulation are summarized in
Section 3.1. The individual components of the proposed framework are then discussed in the following subsections.
3.1. Threat Model and Problem Formulation
The threat model addressed is formally defined below along with the key elements of the considered problem formulation.
Protected Assets and Security Objectives: The protected assets in a CPS such as the smart grid include: (i) the correct operation of devices (e.g., relays, RTACs, PMUs, PDCs), (ii) the integrity of network communications between devices, and (iii) the integrity of physical process behavior. The security objective of the defender is to detect deviations from expected CPS behavior in real time with low latency and high accuracy, so as to enable rapid response to mitigate potential damage.
System State and Observations: Let x(t) = (x_D(t), x_P(t), x_C(t)) denote the system state at time t, where x_D(t) represents the internal states of devices (e.g., relay logic states, controller variables), x_P(t) represents the physical system states (e.g., voltages, currents, power flows), and x_C(t) models a representation of the “communication states” (e.g., message sequences, timing). The defender does not have direct access to x(t), but instead observes network traffic through a monitoring point (e.g., an RSPAN port). Let {(t_i, s_i, d_i, p_i, c_i)} denote the time series of observed network packets, where t_i is the timestamp, s_i and d_i are source and destination identifiers, p_i is the protocol/message type, and c_i is the payload content. Through the semantic extraction pipeline described in
Section 3.3 and
Section 3.4, the raw packets are transformed into an observation set O = {(τ_j, v_j)} of semantic tags, where τ_j is a tag identifier and v_j is the corresponding value.
Adversary Model: We consider an adversary who gains unauthorized access to the OT network of the CPS and introduces perturbations to either device behavior or network communications. Formally, the adversary can modify the effective system state by adding a perturbation, as x̃(t) = x(t) + δ(t), where δ(t) represents adversarial perturbations such as:
Firmware/logic modifications on devices such as relays or RTACs (thereby affecting x_D(t)), e.g., altering relay control logic or masking/delaying commands (and therefore possibly also indirectly affecting x_P(t) and x_C(t));
Insertion of MITM devices to manipulate communications between devices (thereby affecting x_C(t) and possibly indirectly x_D(t) and x_P(t)), e.g., modifying, delaying, replaying, or dropping messages;
Injection of unauthorized network traffic, e.g., false commands or flood attacks for denial of service (thereby affecting x_C(t) and possibly indirectly x_D(t) and x_P(t)).
Adversary Capability Bounds: The adversary may compromise one or more devices or communication links, potentially simultaneously (e.g., modifying firmware/logic of a device while simultaneously injecting network traffic, masking sensor messages from multiple devices, etc.). However, we assume that the adversary cannot manipulate all observations relevant to a given behavioral specification so as to completely mask the existence of an anomaly. Formally, for each STL-based behavioral specification φ defined over a subset of tags T_φ, we assume there exists at least one tag in T_φ whose observations remain uncompromised. This assumption is consistent with typical CPS attack vectors where adversarial access originates from specific entry points (e.g., compromised firmware on specific devices, MITM on specific communication links, an intruder device sending spurious commands). Hence, adversarial effects are localized to parts of the CPS that the adversary has gained access to and cannot feasibly affect all observable network communications. Furthermore, the heterogeneity of devices and protocols in real-world CPS deployments typically makes it infeasible for an attacker to simultaneously control all observable communications. Also, note that there is no assumption that any specific measurements or communication channels are trusted a priori. Rather, the proposed framework’s robustness derives from the ability to define behavioral specifications that span multiple independent observation sources, requiring attackers to compromise multiple independent parts of the CPS to evade detection (e.g., both the command path and the measurement path). As with any anomaly detection approach, we assume that the adversary is not so powerful that they can manipulate all relevant measurements so as to completely mask the existence of an anomaly, since such an adversary of unlimited capacity who can manipulate all observations can always elude detection.
Attack Classification: A broad categorization of attack types relevant to power grid CPS/OT environments is summarized in
Table 1, along with representative examples of the attack types, their effects on system behavior, and the corresponding STL-based specifications whose violations the TRAPS framework aims to detect. These attack categories map naturally to the MITRE ATT&CK Matrix for ICS [
77] and MITRE EMB3D [
78] threat model frameworks, which provide detailed taxonomies of cyber kill chain elements (tactics, techniques, and procedures—TTPs) and device vulnerabilities, respectively, in the ICS/CPS context. MITRE ATT&CK lists TTPs across the stages of a cyber-attack lifecycle, ranging from initial access to eventual impact. Components of various stages such as network connection enumeration (in the Discovery stage), adversary-in-the-middle (in the Collection stage), denial of service (in the Inhibit Response Function stage), unauthorized command message (in the Impair Process Control stage), and loss of control (in the Impact stage) map directly to the attack categories in
Table 1. The MITRE EMB3D framework organizes embedded device vulnerabilities into a threat heat map across networking, hardware, system software, and application software domains. The attack types in
Table 1 primarily draw from scenarios modeled on the networking (e.g., TID-404—Remotely Triggerable Deadlock/DoS, TID-406—Unauthorized Messages or Connections, TID-407—Missing Message Replay Protection, TID-412—Network Routing Capability Abuse), system software (e.g., TID-202—Exploitable System Network Stack Component, TID-204—Untrusted Programs Can Access Privileged OS Functions, TID-205—Existing OS Tools Maliciously Used for Device Manipulation, TID-211—Device Allows Unauthenticated Firmware Installation, TID-213—Faulty FW/SW Update Integrity Verification, TID-215—Unencrypted SW/FW Updates), and application software (e.g., TID-301—Applications Binaries Modified, TID-304—Manipulate Run-Time Environment, TID-309—Device Exploits Engineering Workstation, TID-311—Default Credentials, TID-328—Hardcoded Credentials) device vulnerability categories in the MITRE EMB3D framework.
System and Trust Assumptions:
Network Observability: The defender has access to a network monitoring point (e.g., RSPAN) that provides visibility into all relevant OT network traffic in the CPS.
Timing: Observations are timestamped at the monitoring point (e.g., an RSPAN port) based on packet arrival times, thereby providing a common reference clock, which is not derived from device-local clocks. The framework does not require clock synchronization across distributed CPS devices since all timing is relative to the monitoring point’s clock, thereby avoiding issues with clock desynchronization or drift across devices. Furthermore, since the timing thresholds in timing-based behavioral specifications (e.g., time windows in pre-/post-conditions) are configurable, they can be set to accommodate typical network latencies and timestamp jitter in the specific deployments.
System Behavioral Specification: The behavioral specifications of the CPS are defined as a set of STL properties based on the expected behavior of the CPS (e.g., device control logic, physics constraints, communication configurations). These properties are configured based on CPS design documentation (e.g., DNP3 point map lists, relay control logic documentation) and formal specifications and verified through historical “golden” traces. However, the behavioral specifications can, in general, be incomplete (e.g., missing characterizations of the control logic of some devices), in which case the proposed framework enables the detection of deviations from the specifications that are included. To handle potential incompleteness, the framework adopts a “safety envelope” approach, enforcing the defined subset of critical properties (e.g., safety constraints) rather than requiring a complete model of all CPS behaviors. This structure supports incremental maintenance, allowing operators to refine or add behavioral specifications over time without system downtime. Additionally, the framework facilitates robustness by allowing the definition of behavioral properties that span diverse, independent domains (e.g., physical states, network timings, control logic), thereby enabling robust and sensitive anomaly detection that is not reliant on overly constrained or brittle specifications of any particular single-point/single-sensor behavioral properties.
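As a hypothetical illustration of such a behavioral specification (the tag names below are ours and not tied to any particular deployment), a relay-protection property might require that every observed overcurrent condition be followed by a breaker-open event within a bounded time window, which can be written in STL as:

```latex
% Globally, an overcurrent tag must be answered by a breaker_open tag within 2 s.
\varphi_{\mathrm{trip}} \;=\; \Box \left( \mathit{overcurrent} \;\rightarrow\; \Diamond_{[0,\,2]}\, \mathit{breaker\_open} \right)
```

A violation of such a property could indicate either a masked trip command or a suppressed breaker status report, since the specification spans both the measurement and the command paths.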
Defender Objective: The task for the defender is to enable real-time mapping from the raw network traffic (which may comprise multiple OT communication protocols) to the higher-level semantic observation set O, and to enable continuous evaluation of O against a set of STL-based behavioral specifications to detect and localize any deviations as anomalies.
Defender and Attacker Success Criteria: The success criteria for the defender and attacker are defined as follows:
Defender success: The defender succeeds when the framework achieves (1) high detection rate, i.e., all adversarial actions that cause a violation of at least one behavioral specification are detected; (2) low false positive rate, i.e., normal CPS operation that satisfies all behavioral specifications does not trigger anomaly alerts; and (3) low detection latency, i.e., anomalies are flagged within a short time window after the occurrence of the violating observation.
Attacker success: The attacker succeeds if they achieve their operational objective (e.g., manipulating physical process behavior, injecting false commands) while evading detection by the anomaly monitoring framework.
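To make the detection-latency criterion concrete, the following minimal sketch (our own simplified illustration, not the TRAPS verification engine; the tag names p and q are placeholders) checks a bounded-response property of the form G(p -> F_[0,T] q) online over a timestamped boolean tag stream and reports, for each violated obligation, the time of the unanswered p-event and the time at which the violation was flagged:

```python
# Minimal online check of G(p -> F_[0,T] q): a pending p-event becomes a
# violation once no q arrives within the horizon T.
def monitor(events, horizon):
    """events: iterable of (timestamp, tag_name, bool value), time-ordered."""
    pending = []   # timestamps of p-events still awaiting a matching q
    alerts = []    # (p_time, detect_time) pairs for violated obligations
    for t, tag, val in events:
        # Obligations whose horizon has elapsed are flagged as violations.
        while pending and t - pending[0] > horizon:
            alerts.append((pending[0], t))
            pending.pop(0)
        if tag == "p" and val:
            pending.append(t)
        elif tag == "q" and val:
            pending.clear()  # all outstanding obligations are discharged
    return alerts

trace = [(0.0, "p", True), (0.5, "q", True),   # satisfied within the horizon
         (3.0, "p", True), (6.0, "p", True)]   # p at t=3.0 is never answered
alerts = monitor(trace, horizon=2.0)
# → [(3.0, 6.0)]
```

In this sketch a violation is flagged lazily, when the first observation after the horizon arrives; a production monitor would instead use a timer so that the detection latency is bounded by the horizon itself.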
3.2. Network Packet Parsing
The raw network traffic (either live or as a pcap) that is the input to TRAPS comprises the communications between the various devices in the smart grid using several different protocols; those supported by our current implementation are DNP3, IEEE C37.118, Modbus, IEC61850 GOOSE, IEC61850 MMS, SEL Fast Msg (a proprietary protocol of Schweitzer Engineering Laboratories), IEC60870-5, OPC-UA, and Telnet. Two versions of our framework were implemented, as discussed further in
Section 3.8 and
Section 4.2.4. In the first version, which was primarily Python-based (with some computational hot spots implemented in C++), the network packet parsers were implemented as a set of scripts based on open-source libraries such as Scapy [
79], Pyshark [
80], and the Hammer library [
81] to process the network traffic to parse and extract the payload contents using methodologies analogous to [
82,
83]. For parallel processing, the parser components were structured as a set of separate Docker containers for each communication protocol with a front-end ingest module to detect the application layer protocol for each incoming packet and forward it to the appropriate protocol-specific parser for extracting the payload contents. The outputs of the protocol-specific parsers were combined into an MQTT streaming feed that is then used by the semantic tag processing component. In addition to the MQTT feed, the combined output stream from the parsers was also exposed via a REST API interface, with both the push (streaming) and pull (REST API) interfaces to the parser outputs supporting filtering on properties such as IP addresses, protocols, and message types. In the second version of our framework, which is primarily Go-based with the architecture discussed in
Section 3.8, the network packet parsers are instead generated via declarative specifications using the Kaitai framework, which yields highly efficient binary payload parsers. These Kaitai-generated codes are then run in parallel using lightweight Go green threads (goroutines) with channel-based message passing and synchronization. This architecture yields significantly higher performance as discussed further in
Section 4.2.4. The overall TRAPS prototype is structured such that the downstream components including the semantic tag processing can run as separate threads in the same process (obtaining data from the parsers via in-process channels) or as a separate process (in which case the data is passed via MQTT as before). When running as a separate process, the downstream components can run on a separate machine for even more efficient parallelization (as well as potentially enabling a distributed network of network capture nodes with local parser components feeding to a centralized machine for semantic tag processing and anomaly detection).
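As a concrete illustration of the front-end ingest described above, the following minimal Python sketch (not the TRAPS implementation; the port-based mapping and field names are assumptions for illustration) classifies each incoming packet by application-layer protocol and forwards it to a per-protocol parser queue:

```python
# Minimal sketch of a front-end ingest module that classifies packets by
# application-layer protocol and hands them to the matching protocol-specific
# parser queue. Port numbers are common defaults (an assumption); a real
# deployment would need configurable mappings and deeper payload inspection
# for protocols sharing a transport port.
from queue import Queue

PORT_TO_PROTOCOL = {
    20000: "dnp3",
    502: "modbus",
    102: "iec61850-mms",
    4712: "c37.118",
    2404: "iec60870-5-104",
    4840: "opc-ua",
    23: "telnet",
}

parser_queues = {proto: Queue() for proto in PORT_TO_PROTOCOL.values()}

def classify(packet):
    """Return the protocol name for a packet dict, or None if unknown."""
    return PORT_TO_PROTOCOL.get(packet.get("dst_port"))

def ingest(packet):
    """Forward a packet to the queue of its protocol-specific parser."""
    proto = classify(packet)
    if proto is not None:
        parser_queues[proto].put(packet)
    return proto
```

In the containerized first version, each queue would correspond to a protocol-specific parser container; in the Go version, each corresponds to a goroutine fed via a channel.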
3.3. Observation Set Extraction
The output of the protocol-specific parsers is a time series of records of the form $(t, s, d, p, V)$ where
$t$ is the packet’s timestamp;
$s$ and $d$ are the source and destination of the packet, respectively, which could be IP and/or MAC addresses depending on the protocol;
$p$ is the protocol and message type of the packet;
$V$ is the set of measurements/values in the packet’s payload such as analog and digital values in IEEE C37.118, Modbus coils, Modbus holding registers, DNP3 analog inputs and outputs, etc.
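For illustration, a single parsed record might look as follows (a hypothetical Python representation; the field names are illustrative rather than the exact parser output schema):

```python
# Hypothetical example of a parsed-packet record (field names illustrative,
# not the exact TRAPS schema): timestamp, source, destination,
# protocol/message type, and the payload measurements.
record = {
    "t": 1710000000.123456,           # packet timestamp (epoch seconds)
    "src": "192.168.1.10",            # source address
    "dst": "192.168.1.20",            # destination address
    "proto": ("c37.118", "DATA"),     # protocol and message type
    "values": {                       # measurements in the payload
        "phasor.VA": (7200.0, 0.01),  # magnitude (V), angle (rad)
        "analog.freq": 60.001,
        "digital.status": 0,
    },
}
```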
The specific information mapping (i.e., which fields in a DNP3 message correspond to which physical quantities) is installation-specific and can vary widely. Hence, after the parsing of raw fields in the network packets, a key step is mapping the fields to semantic variables. For this purpose, TRAPS uses a flexible query set structure wherein functions defined over the raw fields are used to populate the values of semantic variables as appropriate for the particular installation and the particular network communication protocol. For example, in IEEE C37.118, the constituent fields are typically phasors, analogs (e.g., currents and voltages), and digitals (e.g., status values). The query set to extract semantic variables (“raw tags”) from the time series of parsed records is defined as a set of packet filtering rules of the form $(s, d, p, a, g)$, where $s$, $d$, and $p$ have the same meaning as in the parsed packet records above, $a$ is an attribute address specifier (e.g., index of the data in DNP3 binary inputs/outputs, address of an input register in Modbus, etc.), and $g$ is a tag identifier to be raised (along with the corresponding timestamp) whenever the filtering rule is triggered by a matching packet. Note that multiple packet filtering rules could be triggered by a single packet. The algorithmic structure of this component is shown in Algorithm 1.
| Algorithm 1 Filtering time series of parsed packets to generate time series of raw tags |
1: for each parsed packet $(t, s, d, p, V)$ do
2:  for $j = 1$ to $m$ do
3:   Let $(s_j, d_j, p_j, a_j, g_j)$ be the $j$-th packet filtering rule.
4:   if $(s, d, p)$ matches $(s_j, d_j, p_j)$ then
5:    if attributes addressed by $a_j$ exist in $V$ then
6:     Extract attributes addressed by $a_j$ from $V$
7:     Push $g_j$ into output time series (with corresponding timestamp)
8:    end if
9:   end if
10:  end for
11: end for
|
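The filtering loop of Algorithm 1 can be sketched in Python as follows (the rule and packet representations are simplified assumptions for illustration, not the TRAPS data model):

```python
# Minimal sketch of the Algorithm 1 filtering loop: each rule is a tuple
# (src, dst, proto, attr_address, tag_id); each parsed packet is a dict
# with "t", "src", "dst", "proto", and a "values" mapping keyed by
# attribute address.
def filter_packets(parsed_packets, rules):
    """Apply each filtering rule to each packet; emit (timestamp, tag, value)."""
    raw_tags = []
    for pkt in parsed_packets:
        for src, dst, proto, addr, tag in rules:
            if (pkt["src"], pkt["dst"], pkt["proto"]) == (src, dst, proto):
                if addr in pkt["values"]:  # attribute exists in payload
                    raw_tags.append((pkt["t"], tag, pkt["values"][addr]))
    return raw_tags

rules = [("rtu1", "scada", "dnp3", "AI.3", "feeder1.voltage")]
packets = [
    {"t": 0.0, "src": "rtu1", "dst": "scada", "proto": "dnp3",
     "values": {"AI.3": 7200.0, "AI.4": 12.5}},
    {"t": 0.1, "src": "rtu2", "dst": "scada", "proto": "dnp3",
     "values": {"AI.3": 7180.0}},
]
tags = filter_packets(packets, rules)  # only the rtu1 packet matches
```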
To reduce the manual effort in configuring the packet filtering rules which map raw fields in network traffic packets to semantic variables, TRAPS includes utility scripts that automatically ingest substation-specific configuration files in standard formats (such as CSV-based point lists for DNP3/Modbus, CSV-based phasor and analog/digital element lists for C37.118, and Substation Configuration Language (SCL) files for IEC 61850) and generate the corresponding packet processing and semantic tag extraction rules. These scripts internally use the Python-based API of our framework to enter the rules into the underlying system. This automated ingestion ensures that the installation-specific mappings are derived directly from the power system design documents, thereby streamlining deployment and reducing the risk of configuration errors.
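The ingestion utilities can be illustrated with a minimal sketch (the CSV column names and helper below are hypothetical, not an actual utility point-list format):

```python
# Sketch of automated rule generation from a CSV point list (column names
# are hypothetical): each row maps a protocol point address to a semantic
# tag name, producing packet filtering rules.
import csv
import io

def rules_from_point_list(csv_text, src, dst, proto):
    """Yield (src, dst, proto, attr_address, tag_id) rules from a point list."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(src, dst, proto, row["address"], row["tag"]) for row in reader]

point_list = """address,tag
AI.0,bus1.voltage
AI.1,bus1.current
BI.0,breaker1.status
"""
rules = rules_from_point_list(point_list, "rtu1", "scada", "dnp3")
```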
3.4. Observation Set Processing
The behavioral properties of the CPS might be most naturally described not in terms of raw tags but in terms of variables that are computed as functions of multiple tags over multiple time instants. Hence, TRAPS includes a hierarchical tag processing engine that allows definitions of computed tags as functions of other raw/computed tags. To facilitate a flexible structure for defining hierarchical dependencies of computed tags, a DAG is used in which each node represents a time series of a particular tag and is constructed as a function of the previously extracted time series of tags in the node’s dependency list. The functional dependency structure is represented as a set of filtering rules of the form $(L, f, g)$, where $L$ denotes the dependency list (of raw/computed tags), $f$ is a function encoding the calculations required to obtain updated values of the computed tag (along with their corresponding timestamps) from the time-series values of the tags in the dependency list, and $g$ is a tag identifier to be raised whenever the filtering rule is triggered, similar to the corresponding designator for raw tags. The algorithmic structure of this component is shown in Algorithm 2, where the input queue $Q$ holds both raw tags raised by Algorithm 1 and computed tags pushed as part of Algorithm 2 (to iteratively process downstream dependencies in the DAG). The time series of observations for each semantically extracted tag is of the form $\{(t_i, v_i)\}$ with $t_i$ and $v_i$ being the timestamp and value, respectively, of that tag.
| Algorithm 2 Extracting time series of computed tags |
1: while True do
2:  Get next tag $q$ from queue $Q$ (or wait until there is one).
3:  for each child node $k$ of $q$ in the DAG do
4:   Compute $f_k$ using the time-series values of tags in the dependency list $L_k$ and add the computed value to the time series of observations for tag $k$.
5:   Push $k$ to $Q$.
6:  end for
7: end while
|
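The queue-driven DAG propagation of Algorithm 2 can be sketched as follows (a simplified single-threaded Python illustration with a hypothetical two-tag dependency; the actual engine processes these updates through a worker pool):

```python
# Sketch of Algorithm 2: raw tag updates propagate through a DAG of
# computed tags via a work queue. Each computed tag has a dependency list
# and a function over the latest values of its dependencies.
from collections import deque

# Hypothetical DAG: computed tag -> (dependency list, function).
dag = {
    "bus1.power": (["bus1.voltage", "bus1.current"],
                   lambda v: v["bus1.voltage"] * v["bus1.current"]),
}

def children_of(tag):
    """Computed tags that list `tag` among their dependencies."""
    return [k for k, (deps, _) in dag.items() if tag in deps]

def process(raw_observations):
    """Propagate raw tag updates through the DAG, emitting computed tags."""
    latest, series = {}, {}
    queue = deque(raw_observations)  # holds raw and computed (t, tag, value)
    while queue:
        t, tag, value = queue.popleft()
        latest[tag] = value
        series.setdefault(tag, []).append((t, value))
        for child in children_of(tag):
            deps, fn = dag[child]
            if all(d in latest for d in deps):  # all inputs available
                queue.append((t, child, fn(latest)))
    return series

out = process([(0.0, "bus1.voltage", 7200.0), (0.1, "bus1.current", 2.0)])
```

Note that computed tags are pushed back onto the same queue, so tags depending on other computed tags are handled by the same loop.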
3.5. Observation Set Static & Temporal Integrity Verification
A crucial property of the CPS is that its expected behavior (as defined by device logic, system dynamics, etc.) implies various correlations/causalities among the time series of tags. These include both dependencies/relationships between values of two time series (e.g., expected relationships between relay open/closed status and voltage values) and temporal properties (e.g., an event in one time series expected to happen before or after an event in a second time series). These relations could stem from physics-based and behavior-based properties of the CPS. For example, physics-based properties result from power system physical laws such as Kirchhoff’s voltage and current laws (e.g., interdependencies between PMU measurements at different locations) while behavior-based properties relate to device configurations, network characteristics, etc. For example, in the case of devices such as proxies, protocol converters, or command forwarders in the smart grid, the values of corresponding time series from before-proxy and after-proxy traffic have expected relationships, both in terms of the numerical values of payload contents and temporal relationships between messages (e.g., time delays before retransmission). Deviations from expected correlation/causality relations indicate anomalies or abnormal behavior that could stem from cyber attacks, physical attacks, or physical malfunctions.
To enable flexible monitoring spanning these various types of correlation/causality relations, we define several types of condition structures, discussed below, that could hold between time series of different tags. To show examples of the condition structures, we use the notations $X = \{(t_i, x_i)\}$, $Y = \{(t'_j, y_j)\}$, and $Z = \{(t''_k, z_k)\}$ to denote the time series of observations of various tags.
Threshold conditions such as
$x_i \leq \theta \quad \forall i \;\text{such that}\; \tau_\ell \leq t_i \leq \tau_u \qquad (1)$
where $\theta$, $\tau_\ell$, and $\tau_u$ denote the value threshold, lower timing threshold, and upper timing threshold, respectively, for the observations.
Match conditions such as
$x_i = y_j \qquad (2)$
where $t_i$ and $t'_j$ denote matching time instants of the time series of observations $X$ and $Y$ (e.g., time values such that $|t_i - t'_j| \leq \epsilon$ for a matching tolerance $\epsilon$). More generally, functional match conditions (with an arbitrary function $f$) across multiple time series of observations can be defined by the form
$f(x_i, y_j, z_k, \ldots) = 0 \qquad (3)$
where $t_i, t'_j, t''_k, \ldots$ denote matching time instants across the different time series of observations (e.g., time values such that $|t_i - t'_j| \leq \epsilon$, $|t_i - t''_k| \leq \epsilon$, etc.).
Pre-conditions are conditions that an event (defined in general in terms of values from one or more time series of observations) should have been preceded by some other defined event within some time interval; for example,
$g(y_j) \implies \exists\, t_i \in [t'_j - \tau,\, t'_j] : h(x_i, y_j) \qquad (4)$
where $g$ and $h$ are arbitrary functions and $\tau$ is a threshold on timing. For example, a condition that the time series of observations $Y$ should track (possibly with a delay) the time series of observations $X$ would be represented with
$g(y_j) \equiv \text{True} \quad \text{and} \quad h(x_i, y_j) \equiv (|x_i - y_j| \leq \delta) \qquad (5)$
where $\delta$ is a threshold for the matching of $x_i$ and $y_j$.
Post-conditions are conditions that some specified event should be followed by some other defined event within some time interval; for example,
$g(x_i) \implies \exists\, t'_j \in [t_i,\, t_i + \tau] : h(x_i, y_j) \qquad (6)$
where $g$ and $h$ are arbitrary functions and $\tau$ is a threshold on timing.
The algorithmic structure of the integrity verification component for flagging condition violations is shown in Algorithm 3, where $\mathcal{C}$ denotes the set of all conditions defined in a particular deployment configuration. Besides the example structures above, the conditions can involve dependencies on arbitrary numbers of tags as well as the time history of the tags. Also, the conditions can involve other similarity measures between time series such as $\ell_p$ norms/distances, dynamic time warping, correlation measures, etc.
| Algorithm 3 Algorithm for flagging condition violations |
1: for each condition $c \in \mathcal{C}$ do
2:  if deviation detected in condition check as in (1)–(6) then
3:   Push anomaly detection flag on $c$ to anomaly queue with metadata on timestamp and variables used in computation of condition $c$
4:  end if
5: end for
|
As discussed above, the integrity verification engine operates by evaluating time series of semantic tags against the configurable set of STL-based specifications, encompassing both static constraints (such as instantaneous value thresholds, allowed enumerated states, or invariants enforced across tags) and temporal properties (such as event orderings, delay ranges, periodicities, and temporal correlations). Each STL specification encodes a behavioral property that, when violated, triggers an anomaly alert. The verification process is multi-stage: initially, individual tag values are checked for compliance with their defined invariants (e.g., within physical safety bounds or protocol value constraints), followed by the evaluation of temporal patterns across tags (e.g., verifying event sequences such as a command issuance resulting in a corresponding state change within an expected time window). To ensure that real-time performance scales effectively to large-scale CPS, the integrity verification algorithm utilizes efficient data structures chosen specifically for the purpose. Specifically, match and threshold conditions are evaluated in constant time ($O(1)$) for each incoming observation using direct hash-table-based lookups. Temporal conditions (pre- and post-conditions) utilize time-indexed queues to efficiently manage the active time windows, ensuring that the processing complexity remains bounded by the number of active temporal dependencies and does not increase with operating time. This streaming time-window-based processing architecture ensures bounded resource usage (e.g., memory utilization), since incoming observations are processed in a streaming fashion without requiring any growing buffers/state that could cause cumulative errors or degradation over time, thereby facilitating robust operation of the integrity verification pipeline over long time periods. The satisfaction or violation of each STL formula is tracked in real time, and violation events are logged together with contextual information such as which tags were involved and which property was violated (as discussed further in
Section 3.6). This approach allows for fine-grained distinction between transient anomalies and persistent specification violations, and also facilitates further analysis by operators. Furthermore, this condition-based verification provides a unified mechanism to enforce cross-domain semantic consistency. For instance, a physical state change (e.g., a breaker opening observed via PMU measurements) can be correlated with a cyber command (e.g., a DNP3 operate command) and the corresponding network traffic patterns (e.g., the underlying packets from the relevant RTAC to the relay), with these diverse observations abstracted into tags and their relationships encoded as conditions. This abstraction enables the framework to efficiently track semantic behavior across the cyber–physical loop to detect anomalies that might be semantically consistent within a single domain (e.g., a valid relay open command) but violate cross-domain consistency (e.g., missing corresponding PMU voltage drop or abnormal network traffic).
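A bounded-memory streaming check of this kind can be sketched as follows (an illustrative Python stand-in for a post-condition monitored with a time-indexed queue, not the TRAPS engine itself; the event names are hypothetical):

```python
# Sketch of a streaming post-condition check with bounded state: a
# triggering event must be followed by a matching event within `tau`
# seconds, tracked with a time-indexed deque whose length is bounded by
# the number of currently active windows (not by operating time).
from collections import deque

class PostConditionMonitor:
    def __init__(self, tau):
        self.tau = tau
        self.pending = deque()  # timestamps of unmatched triggering events

    def on_trigger(self, t):
        """A triggering event (e.g., a breaker-open command) at time t."""
        self.pending.append(t)

    def on_followup(self, t):
        """The expected follow-up event (e.g., observed state change)."""
        if self.pending and t - self.pending[0] <= self.tau:
            self.pending.popleft()  # oldest active window satisfied

    def expired(self, now):
        """Pop and return triggers whose window elapsed unmatched."""
        violations = []
        while self.pending and now - self.pending[0] > self.tau:
            violations.append(self.pending.popleft())
        return violations

mon = PostConditionMonitor(tau=0.5)
mon.on_trigger(1.0)
mon.on_followup(1.2)         # satisfied within the window
mon.on_trigger(2.0)
late = mon.expired(now=3.0)  # the trigger at t=2.0 went unanswered
```

Because satisfied and expired windows are popped as observations stream in, memory stays proportional to the number of simultaneously open windows.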
3.6. Anomaly Localization
Each raw tag and computed tag maintains provenance information as to which specific underlying communication observations were involved in the observed value of the particular tag. Hence, considering the devices in the CPS and the observed communications between devices as a communication graph $G = (N, E)$ with node set $N$ and edge set $E$, the flagging of a condition violation directly indicates potential physical locations of the anomaly based on the edges (communication links) related to the constituent underlying tags in the condition check and the corresponding adjoining nodes. The DAG structure of raw and computed tags enables efficient retrieval of the underlying raw tags corresponding to any flagged anomaly. Hence, anomaly scores are maintained for each node and edge, and these scores are incremented each time a related anomaly is flagged. To enable the operator to rapidly see the most likely anomalous nodes/edges, the anomaly scores are normalized over the graph $G$ and used for color coding in a graphical visualization (
Figure 3). The algorithmic structure of this component is shown in Algorithm 4.
| Algorithm 4 Algorithm for anomaly provenance scoring |
1: Set anomaly scores to 0 ∀ nodes and edges in graph $G$.
2: for each anomaly flag $c$ in anomaly queue do
3:  Look up all underlying raw tags for $c$ using the DAG.
4:  for each tag $g$ in the effective raw tags do
5:   Look up the corresponding nodes and edge for $g$ and increment their anomaly scores.
6:  end for
7: end for
8: Normalize the anomaly scores so that $\max_{n \in N} a_n = 1$ and $\max_{e \in E} a_e = 1$, where $N$ and $E$ are the sets of nodes and edges in graph $G$ and $a_x$ is the anomaly score for a node/edge $x$.
|
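The provenance scoring of Algorithm 4 can be sketched as follows (with a hypothetical tag-to-edge mapping; max-normalization of the scores to [0, 1] for color coding is an illustrative choice):

```python
# Sketch of anomaly provenance scoring: each raw tag is associated with a
# (src_node, dst_node) communication edge; scores are accumulated per
# node/edge and max-normalized for color-coded visualization.
tag_to_edge = {  # hypothetical mapping from raw tags to graph edges
    "feeder1.voltage": ("rtu1", "scada"),
    "breaker1.status": ("relay1", "scada"),
}

def score_anomalies(anomaly_raw_tags):
    """Accumulate and max-normalize node/edge anomaly scores."""
    node_scores, edge_scores = {}, {}
    for tag in anomaly_raw_tags:
        src, dst = tag_to_edge[tag]
        for node in (src, dst):
            node_scores[node] = node_scores.get(node, 0) + 1
        edge_scores[(src, dst)] = edge_scores.get((src, dst), 0) + 1
    for scores in (node_scores, edge_scores):
        peak = max(scores.values(), default=0)
        for k in scores:
            scores[k] /= peak  # normalize so the maximum score is 1
    return node_scores, edge_scores

nodes, edges = score_anomalies(
    ["feeder1.voltage", "feeder1.voltage", "breaker1.status"])
```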
3.7. Visualization Dashboard
The user front-end of TRAPS is a dashboard GUI implemented using Grafana to visualize summaries of detected anomalies and the overall semantically parsed observations, with a hierarchical interactive interface that provides an easy-to-use top-level summary as a broad overview and on-demand mechanisms to access additional details when desired. The visualization dashboard shows various elements of situational awareness and anomaly detection, such as observed nodes and communications (along with salient communication properties such as request/response timing), tag values and tag histories, detection of anomalies in expected match/pre-/post-conditions, and the provenance of detected anomalies. Also, a graph of the network architecture with a color-coded visualization of detected anomalies is embedded in the GUI (sample screenshot in
Figure 3). The dashboard also provides plots of tag histories and tabular views of communications and tag values (screenshots omitted for brevity).
3.8. Implementation Architecture
The algorithmic structure of TRAPS offers multiple avenues for parallelization. Leveraging the modular structure of the TRAPS pipeline from the initial raw network ingest to the semantic processing and anomaly detection/localization components, the implementation architecture of the current prototype is shown in
Figure 4.
At the network ingest front-end, the packet parser component uses a producer–consumer worker pool architecture where one thread reads from the source (either a PCAP or live network traffic) and dispatches packets to a set of worker threads via buffered channels. A consistent hashing strategy based on connection tuples (using source and destination identifiers) is used which guarantees that all packets belonging to the same flow are processed by the same worker. This flow affinity enables each worker thread to maintain an isolated parser state for protocols like DNP3 (which require fragment reassembly) without requiring complex locks or shared memory, significantly reducing contention. To ensure that the aggregated output from the threads is accurately time-ordered, a decoupled writer pattern is used where workers accumulate results into reusable batch buffers and send them to a dedicated output thread of the packet parser component. The output thread implements a resequencing buffer using a min-heap, enabling reconstruction of the original packet order from the asynchronously processed batches before writing the JSON stream output from the packet parser component.
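The flow-affinity hashing and min-heap resequencing described above can be sketched as follows (a simplified single-threaded Python illustration of the pattern; the actual implementation uses Go goroutines and channels, and the hashing/field choices here are assumptions):

```python
# Sketch of (1) flow-affinity dispatch: hashing the connection tuple pins
# every packet of a flow to the same worker, and (2) a resequencing buffer:
# a min-heap restores original packet order from asynchronously completed
# results before output.
import heapq
from zlib import crc32

def worker_for(src, dst, n_workers):
    """Stable flow-affinity assignment: same flow -> same worker index."""
    return crc32(f"{src}->{dst}".encode()) % n_workers

class Resequencer:
    """Emit results strictly in ascending sequence-number order."""
    def __init__(self):
        self.heap = []
        self.next_seq = 0

    def push(self, seq, result):
        heapq.heappush(self.heap, (seq, result))
        ready = []
        # Release the contiguous in-order prefix that is now available.
        while self.heap and self.heap[0][0] == self.next_seq:
            ready.append(heapq.heappop(self.heap)[1])
            self.next_seq += 1
        return ready

rs = Resequencer()
out = []
for seq, res in [(1, "b"), (0, "a"), (3, "d"), (2, "c")]:  # out of order
    out.extend(rs.push(seq, res))
```

The heap holds only results awaiting earlier sequence numbers, so its size stays bounded by the degree of out-of-order completion across workers.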
The JSON intermediate representation (IR) is protocol-agnostic, enabling the unified processing of heterogeneous protocols (DNP3, Modbus, IEC61850, C37.118, etc.) by the following stages of the TRAPS pipeline. The JSON IR stream is ingested by the semantic tag processor, which is based on a DAG computation model in which both the incoming JSON messages and the tags emitted during computation are processed through a pool of worker threads based on dependency queues to allow efficient recursive computation of dependent tags while maintaining time-consistent ordering. The semantic tag processor outputs a time series of tags, which are then ingested by the anomaly monitor component, which utilizes a thread pool to process groups of STL property verifications over the incoming tag time series. The anomaly monitor implements match/threshold conditions using constant-time hash-table lookups and temporal conditions (pre-/post-conditions) using time-indexed queues whose memory usage is bounded by the number of active STL conditions rather than operating time. The anomaly monitor outputs a time series of violation events with timestamps and contextual metadata (tags involved, violated property). This time series is then used by the anomaly localizer, which runs as a separate thread, to track provenance information (node and edge anomaly scores) for anomaly indicators by referring to the DAG structure to recursively identify the underlying raw tags that contributed to any flagged computed tag or condition violation. The REST API server runs as a separate thread to handle on-demand requests from the dashboard (Grafana-based in a browser) and/or third-party systems by fetching information as needed from the other components. The REST API server thread maintains local data caches to reduce queries to the other components.
While the initial prototype version of the pipeline was implemented primarily in Python 3 (with some hot spots in C++), an optimized implementation was then developed in Go (while keeping only the user-facing configuration API in Python). Although C++ can typically offer slightly higher single-threaded performance, the choice of Go over C/C++ was primarily based on the more lightweight multi-threading (green threads, i.e., goroutines) and more efficient inter-thread communication primitives (channels) in Go, which were found in tests of some of the more computationally intensive, but parallelizable, parts (packet payload parsing, tag computations) to yield around 10–15% higher throughput compared to C++.