4.1. Problem Formulation
To lay a rigorous foundation for subsequent method design, this section first explicitly defines the black-box cyber-physical system (CPS) falsification problem and identifies its core challenges.
4.1.1. Black-Box CPS and Safety Property Definition
A black-box CPS is formally characterized as a tuple (U, Y, f, y0), where each component is defined as follows:
U: bounded input space, encompassing all feasible control signals (e.g., throttle or brake commands for automotive systems) that the agent can generate. Y: observable output space, which constitutes the sole accessible information in black-box scenarios; internal states (e.g., engine torque) remain unobservable, while observable signals include metrics such as vehicle speed and control error. f: unknown deterministic transition function, which maps the historical output sequence and the current input to the next output. y0: initial output of the system, serving as the starting point for each experimental episode.
The target safety property, referred to as the finite future reach property, is defined using metric temporal logic (MTL) as □_[0,T] φ. The components of this formula are elaborated below:
□_[0,T]: temporal operator denoting “always hold”, meaning the base formula φ must be satisfied at every time step within the interval [0, T]. φ: basic MTL formula, composed of atomic propositions, Boolean connectives, and short-interval temporal operators. T: finite time horizon, corresponding to the termination time L used in subsequent algorithms.
4.1.2. Falsification Problem Objective
The primary objective of black-box CPS falsification is to identify an input sequence x such that the resulting output sequence y violates the safety property. Mathematically, this violation condition is equivalent to ρ(y) < 0.
Here, ρ(y) represents the robustness value of the output trace y with respect to the MTL property. If such an input sequence exists, it is termed a counterexample; otherwise, the safety property is considered satisfied within the time horizon T.
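The violation condition ρ(y) < 0 can be sketched as a robustness-guided random search loop. The toy system, robustness function, and input bounds below are illustrative assumptions, not the benchmark models of this work:

```python
import random

def robustness(trace):
    # Toy robustness: distance of the trace minimum above the unsafe level 1.0.
    # Negative exactly when "output always stays above 1.0" is violated.
    return min(trace) - 1.0

def simulate(inputs):
    # Stand-in black-box system: the output decays toward the bounded input.
    y, trace = 5.0, []
    for u in inputs:
        y = 0.8 * y + 0.2 * u
        trace.append(y)
    return trace

def falsify(horizon=20, episodes=200, seed=0):
    rng = random.Random(seed)
    for _ in range(episodes):
        x = [rng.uniform(0.0, 2.0) for _ in range(horizon)]  # bounded input space
        rho = robustness(simulate(x))
        if rho < 0:            # violation condition: robustness < 0
            return x, rho      # x is a counterexample
    return None, None          # property considered satisfied within the horizon

x, rho = falsify()
```

DRL-based falsifiers replace the random input sampling with a learned policy, but the termination criterion ρ < 0 is identical.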
4.1.3. Core Challenges in Black-Box Falsification
Existing deep reinforcement learning (DRL)-based falsification methods struggle to tackle two critical challenges inherent to black-box scenarios, which form the focus of this research.
Inadequate temporal modeling due to unobservable states: Black-box CPS only provides output signals (e.g., robustness values, sensor measurements), while hiding internal states such as vehicle acceleration. Traditional unidirectional temporal networks (e.g., single-layer LSTM) fail to capture bidirectional temporal dependencies between historical and future outputs, resulting in imprecise state inference.
Sparse reward signals: Conventional reward functions only offer feedback when the safety property is violated (i.e., robustness ρ < 0), leading to prolonged “zero-reward” periods during the agent’s exploration phase. This sparsity slows policy convergence and increases the risk of the agent becoming trapped in local optima.
To address these dual challenges, we propose the DRL-BiT-MPR framework. Its overall architecture is illustrated in
Figure 1. The framework integrates a bidirectional temporal network to remedy inadequate temporal modeling and a multi-granularity reward function to overcome reward sparsity, operating within a cohesive offline–online workflow.
4.4. Pretraining of the LSTM Prediction Module
The LSTM predictor is trained offline before the online falsification process begins. It is trained to predict future observable outputs based on historical sequences. The training uses a dataset of 10,000 unlabeled random input–output sequences collected from each benchmark model in offline simulation. The goal is to learn a model that maps the historical sequence of length L to the future K steps.
The input to the predictor is the sequence of historical observable signals, with the sequence length L specific to each model. The output is the future K-step observable signals ŷ_{t+1}, …, ŷ_{t+K}, where K is determined by the maximum dynamic response delay of the system.
For the CARS model, the response delay between throttle/brake input and inter-vehicle distance robustness change is two steps, so K = 2. The predictor outputs the robustness predictions ρ̂_{t+1} and ρ̂_{t+2}.
For the AT model, engine speed and vehicle speed have a maximum response delay of three steps, so K = 3. The predictor outputs the engine speed and vehicle speed robustness predictions at steps t+1, t+2, and t+3.
For the PTC model, the convergence delay of the control error is two steps, so K = 2. The predictor outputs the control error predictions ê_{t+1}, ê_{t+2} and the operating mode predictions m̂_{t+1}, m̂_{t+2}.
The predictor is trained by minimizing the mean squared error (MSE) between its predictions and the true future outputs from the offline simulation data:
MSE = (1/K) Σ_{k=1}^{K} (ŷ_{t+k} − y_{t+k})²
The trained LSTM model is the output of this offline phase. It is frozen and integrated into the bidirectional temporal network during the online phase, where it provides a data-driven projection of future states to enrich the agent’s limited observation.
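The offline pretraining loop can be sketched as follows. To keep the example self-contained, a one-step linear autoregressive predictor trained by stochastic gradient descent on the MSE loss stands in for the LSTM, and the data-generating dynamics are an illustrative assumption:

```python
import random

def collect_dataset(n_seqs=200, length=12, seed=1):
    # Offline phase: random bounded inputs drive a toy stand-in system,
    # producing unlabeled output sequences.
    rng = random.Random(seed)
    data = []
    for _ in range(n_seqs):
        y, trace = 0.0, []
        for _ in range(length):
            u = rng.uniform(-1.0, 1.0)
            y = 0.7 * y + 0.3 * u
            trace.append(y)
        data.append(trace)
    return data

def train_predictor(data, hist=3, lr=0.05, epochs=40):
    # Learn w, b so that w . (y_{t-2}, y_{t-1}, y_t) + b approximates y_{t+1},
    # by SGD on the squared error (linear stand-in for the LSTM predictor).
    w, b = [0.0] * hist, 0.0
    for _ in range(epochs):
        for trace in data:
            for t in range(hist, len(trace)):
                window = trace[t - hist:t]
                pred = sum(wi * yi for wi, yi in zip(w, window)) + b
                err = pred - trace[t]          # gradient of 0.5 * err^2
                w = [wi - lr * err * yi for wi, yi in zip(w, window)]
                b -= lr * err
    return w, b

def mse(data, w, b, hist=3):
    errs = []
    for trace in data:
        for t in range(hist, len(trace)):
            pred = sum(wi * yi for wi, yi in zip(w, trace[t - hist:t])) + b
            errs.append((pred - trace[t]) ** 2)
    return sum(errs) / len(errs)

data = collect_dataset()
w, b = train_predictor(data)
```

After training, the predictor is frozen, exactly as the LSTM is frozen before the online phase.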
4.5. Bidirectional Temporal Network
The bidirectional temporal (BiT) network is the core perception module, designed to address the state uncertainty inherent in partially observable black-box CPS falsification. In this setting, the agent must reason about and violate temporal logic constraints using only sequences of past observations, while the internal system dynamics remain hidden. Conventional unidirectional recurrent networks process only historical data, so their state representation is purely retrospective; this fails to capture the bidirectional temporal causality intrinsic to physical systems, where the current hidden state is both a consequence of past inputs and a constraint on the evolution of future outputs. Because temporal logic properties require reasoning about future system evolution, this limitation is critical. The BiT network therefore constructs an augmented state representation that integrates two complementary information streams: a compressed encoding of the past trajectory and a predicted trajectory of future outputs. This bidirectional context lets the agent disambiguate the current system context more effectively than from history alone, explicitly addressing partial observability and reducing reliance on a precise internal system model. Operationally, the network performs input sequence construction, bidirectional convolutional feature extraction, feature fusion, and action generation, mining temporal correlations from limited observational data to support policy decision-making.
The architecture of the network and the flow of data through its components are detailed in Figure 2. The network operates through a sequence of stages: it begins with the formulation of an input sequence from historical outputs, proceeds to generate future state predictions via a pre-trained LSTM module, and then processes the combined temporal context through parallel forward and backward convolutional pathways. The outputs of these pathways are fused to form a unified feature vector that is passed to the policy network for action generation.
4.5.1. Input Sequence Definition for Benchmark Models
Input sequences are customized for different benchmark CPS models based on their observable output characteristics, ensuring effective capture of temporal dynamics:
CARS model: The only observable output is the robustness value. The input sequence is defined as (ρ_{t−2}, ρ_{t−1}, ρ_t), where ρ_t denotes the robustness value of the CPS safety property at time step t.
AT model: Observable outputs include engine speed robustness, vehicle speed robustness, and gear state. The input sequence combines a 3-step engine speed robustness sequence, a 2-step vehicle speed robustness sequence, and a 2-step gear state sequence, where ρ_ω is engine speed robustness, ρ_v is vehicle speed robustness, and g represents the discrete gear state.
PTC model: Observable outputs consist of control error and operating mode. The input sequence combines a 4-step control error sequence (e_{t−3}, e_{t−2}, e_{t−1}, e_t) and a 2-step operating mode sequence (m_{t−1}, m_t), forming a 6-dimensional input vector.
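For the PTC case, the 6-dimensional input vector can be assembled from rolling histories. The buffer sizes follow the text (4-step control error, 2-step operating mode); the class and variable names are illustrative:

```python
from collections import deque

class PTCInputBuffer:
    """Rolling buffers for the PTC observables: a 4-step control-error
    history and a 2-step operating-mode history (6-dim input vector)."""
    def __init__(self):
        self.err = deque(maxlen=4)   # control error e_{t-3} .. e_t
        self.mode = deque(maxlen=2)  # operating mode m_{t-1}, m_t

    def push(self, error, mode):
        self.err.append(error)
        self.mode.append(mode)

    def vector(self):
        if len(self.err) < 4 or len(self.mode) < 2:
            raise ValueError("not enough history yet")
        return list(self.err) + list(self.mode)

buf = PTCInputBuffer()
for t in range(5):
    buf.push(0.1 * t, t % 2)  # illustrative observation stream
vec = buf.vector()            # 6-dimensional input for the BiT network
```

The CARS and AT buffers differ only in the number and lengths of the per-signal deques.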
4.5.2. Future Output Prediction for Input Sequences
To enable the backward convolution to capture future temporal constraints, the LSTM predictor pretrained offline (Section 4.4) is integrated into the BiT network. This predictor generates future observable signals to complement the historical input sequence, addressing the unobservability of future states in black-box CPS. Its input is the historical observable signals, with the sequence length matching the L value of each model; its output is the future K-step observable signals ŷ_{t+1}, …, ŷ_{t+K}, where K is determined by the maximum dynamic response delay of each model: K = 2 for the CARS model, K = 3 for the AT model, and K = 2 for the PTC model, as detailed in Section 4.4. The predictor is trained by minimizing the mean squared error (MSE) between its predicted outputs and the true future outputs from offline simulation data, as defined in Equation (19).
During real-time operation, the LSTM predictor outputs future K-step signals. These signals are concatenated with the historical sequence to form the complete bidirectional input sequence for subsequent convolution operations.
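The concatenation of the L-step history with the K-step predictions can be sketched as follows; the predictor here is a deliberately naive last-value extrapolation standing in for the pretrained LSTM:

```python
def predict_future(history, K):
    # Placeholder for the pretrained LSTM predictor:
    # naive last-value extrapolation over K future steps.
    return [history[-1]] * K

def bidirectional_sequence(history, K):
    """Complete input sequence for the BiT network:
    L historical observations followed by K predicted ones."""
    return list(history) + predict_future(history, K)

# CARS-style settings: L = 3 historical robustness values, K = 2 predictions.
seq = bidirectional_sequence([0.9, 0.7, 0.4], K=2)
```

The resulting length-(L+K) sequence is what the forward and backward convolutions consume.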
A precise understanding of model dependency in this context is necessary. The framework aims to reduce reliance on a precise internal system model, such as governing equations or state variables, which are unavailable in black-box settings. The pre-trained LSTM predictor employed here is not such an internal model. It functions as a general-purpose temporal sequence learner, trained on input–output pairs to capture statistical patterns of correlation over time. Its purpose is to provide plausible future signal trends based on recent history, not to replicate the underlying system dynamics. The requirement for offline simulation data for training is a common foundation for data-driven methods. Crucially, the predictor operates without accessing or approximating the system’s internal equations. Therefore, the framework maintains adherence to the black-box assumption by utilizing only observable data and learned temporal correlations.
This complete sequence forms the augmented state representation s̃_t. It is this representation, explicitly enriched with predicted future information, that is passed to the policy network for action generation, thereby closing the loop between perception and control.
4.5.3. Bidirectional Convolution and Feature Fusion
Bidirectional convolution is applied to the complete input sequence to process temporal correlations in parallel, overcoming the limitations of unidirectional temporal modeling. The specific operation process is detailed below:
Forward convolution takes the ordered complete sequence (history followed by predicted future) as input. A 3 × 1 convolution kernel with a stride of 1 is used, and the ReLU function serves as the activation function. This module learns causal dependencies between historical observations and current/future outputs. Backward convolution first reverses the complete sequence, then applies the same kernel and ReLU activation as the forward convolution. This module captures the constraints of future observations on current decision-making.
After convolution, the forward feature map (produced by 64 convolution filters) and the backward feature map (of the same dimension) are concatenated along the feature dimension to form a fused feature vector. This vector integrates bidirectional temporal information and is fed into the policy network to generate the final CPS control input.
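A minimal sketch of the bidirectional convolution follows, with a single width-3 kernel, stride 1, ReLU activation, and concatenation as the fusion step. The kernel weights are illustrative assumptions (the actual network learns 64 filters per direction):

```python
def conv1d_relu(seq, kernel):
    # Width-3 kernel, stride 1, ReLU activation.
    k = len(kernel)
    out = []
    for i in range(len(seq) - k + 1):
        s = sum(kernel[j] * seq[i + j] for j in range(k))
        out.append(max(0.0, s))
    return out

def bit_features(seq, kernel=(0.5, 1.0, -0.5)):  # illustrative weights
    forward = conv1d_relu(seq, kernel)        # causal order: past -> future
    backward = conv1d_relu(seq[::-1], kernel) # reversed order: future -> past
    return forward + backward                 # fused feature vector

feats = bit_features([0.9, 0.7, 0.4, 0.4, 0.4])
```

Each direction yields (L + K) − 2 features here; the real network produces 64 such maps per direction before fusion.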
4.5.4. BiT Network Parameter Selection Basis
Key parameters of the BiT network, including the historical sequence length L and the convolution kernel size, are determined based on the dynamic characteristics of each benchmark CPS model. This ensures effective mining of temporal correlations while avoiding redundant information.
The historical sequence length is set to cover the minimum dynamic cycle of the model’s observable signals. For the CARS model, inter-vehicle distance robustness stabilizes 3 steps after an input adjustment, so L = 3. For the AT model, engine speed robustness (3 steps to stabilize) is the slowest among the multiple signals, so L = 3 for ρ_ω; 2-step ρ_v and g sequences are added to balance the other signals’ dynamics. For the PTC model, the control error stabilizes 4 steps after an input adjustment, so L = 4 for e; a 2-step operating mode sequence m is added to capture discrete state changes.
For the convolution kernel size, a unified 3 × 1 kernel is adopted for two reasons. The time dimension of 3 matches the “cause→effect” chain of CPS dynamics, fully capturing causal dependencies: a smaller kernel misses intermediate dynamic links, while a larger kernel introduces redundancy and increases computational complexity. The feature dimension of 1 avoids cross-dimension interference between multiple signals.
The enriched state representation s̃_t addresses the theoretical requirement for Markovian state inputs within the partially observable black-box CPS environment. In a standard Markov decision process, the state must encapsulate all relevant historical information for optimal decision-making; the raw observation under partial observability violates this Markov property. The proposed representation is explicitly designed as an information state: it integrates a compressed history with a predicted future trajectory to form an approximate sufficient statistic of the interaction history. This construction recovers an approximate Markov property within the learning framework, enabling the problem to be treated as an MDP with s̃_t as the effective state input. Consequently, the application of standard policy gradient methods is theoretically justified.
4.5.5. Prediction Error Analysis and Mitigation Strategy
The use of predicted future outputs within the state representation warrants conceptual justification. In a strict online black-box setting, true future information is indeed inaccessible. The predictor is not intended to circumvent this fundamental constraint. Instead, it serves as an inductive bias or an internal simulation module that, based on learned temporal patterns from historical data, generates plausible hypotheses about immediate future trajectories. This provides a richer, forward-looking context that aids in disambiguating the current hidden state under partial observability, a function analogous to planning or foresight in biological agents. The optimality of the resulting policy is therefore contingent upon the accuracy of these predictions. The following analysis formally addresses the impact of prediction error and introduces mechanisms to mitigate its effects, ensuring robust falsification performance even when predictions are imperfect.
A critical methodological concern for any learning-based falsification approach that relies on predicted future information is its robustness to prediction inaccuracies. The BiT network relies on a pre-trained LSTM to predict the future K-step outputs, and prediction errors may degrade state perception accuracy. This subsection therefore quantifies the impact of prediction errors on falsification performance and proposes a dynamic weight adjustment mechanism combined with an error compensation strategy, safeguarding the falsification performance of the DRL-BiT-MPR framework even when state predictions are imperfect.
Prediction error is defined with the LSTM-predicted outputs denoted as ŷ_{t+k} for k = 1, …, K and the true system outputs as y_{t+k}. The mean squared prediction error is formulated as
ε_t = (1/K) Σ_{k=1}^{K} (ŷ_{t+k} − y_{t+k})²
Prediction errors affect BiT network performance through two pathways. Biases in the predicted future features ŷ_{t+k} lead to inaccurate bidirectional convolutional feature extraction, and incorrect future context disrupts temporal dependency modeling, especially in the falsification of complex properties such as nested MTL constraints.
To mitigate the negative impacts of prediction errors, a two-stage mitigation strategy is proposed. First is the dynamic weight adjustment mechanism, which adaptively adjusts the fusion weight of historical and future features based on the real-time prediction error ε_t. The weight formula is
w_t = exp(−β ε_t),
where β is a sensitivity coefficient determined via grid search to balance sensitivity and stability, and ε_t is the prediction error at the current time step. The feature fusion method is updated to
F_t = concat(F_h, w_t · F_f).
Here, F_h denotes features extracted from historical sequences via forward convolution, and F_f denotes features extracted from future predicted sequences via backward convolution. As ε_t increases, w_t decreases, and the network automatically reduces its reliance on predicted features.
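The down-weighting of predicted features can be sketched as follows. The exponential decay form of the weight and the β value are assumptions for illustration; the only property the mechanism requires is that the weight decrease monotonically with the prediction error:

```python
import math

def fusion_weight(eps, beta=2.0):
    # Monotonically decreasing in the prediction error eps:
    # w -> 1 as eps -> 0, w -> 0 as eps grows (assumed exponential form).
    return math.exp(-beta * eps)

def fuse(hist_feats, pred_feats, eps, beta=2.0):
    w = fusion_weight(eps, beta)
    # Historical features pass through unchanged;
    # predicted features are discounted by the error-dependent weight.
    return hist_feats + [w * f for f in pred_feats]

low = fuse([1.0, 2.0], [1.0, 1.0], eps=0.01)   # accurate predictor: kept
high = fuse([1.0, 2.0], [1.0, 1.0], eps=1.0)   # noisy predictor: discounted
```

With an accurate predictor the fused vector is nearly the plain concatenation; with a poor one it degrades gracefully toward history-only features.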
Second is the embedding of an error compensation term. The prediction error sequence is treated as an additional feature and concatenated to the input sequence of the BiT network, enabling the network to learn error patterns and adaptively correct them. The updated input sequence is
s̃_t′ = concat(s̃_t, (ε_{t−K+1}, …, ε_t)),
where (ε_{t−K+1}, …, ε_t) is the sequence of prediction errors over the last K steps, supplementing the temporal correlation information of the errors.
In summary, the proposed mitigation strategies ensure that the framework does not rely on perfect predictions. The dynamic weight adjustment allows the agent to automatically discount unreliable predictions, while the error compensation term enables learning of systematic biases. The predicted future information is thus utilized as an informative temporal feature rather than a ground-truth signal. The augmented state retains its superiority over a purely historical state because it provides a richer, if imperfect, context for decision-making, as evidenced by the performance gains in our ablation studies.
4.6. Multi-Granularity Reward Function
To address the issues of insufficient single-step reward feedback and sparse signals in black-box environments, this research decomposes the reward function into three levels: fine-grained, medium-grained, and coarse-grained mechanisms. This design ensures the agent receives feedback across different time scales—immediate step-by-step guidance, mid-term temporal correlation feedback, and long-term goal feedback—thereby avoiding local optima and accelerating policy convergence.
4.6.1. Fine-Grained Reward
Operating at the single-step scale, the fine-grained reward provides immediate robustness feedback. It takes the current time step’s robustness value as its core and incorporates a temporal-difference correction term to strengthen feedback on robustness changes between adjacent steps. The formula is defined as
r_fine(t) = exp(−ρ_t) + λ (ρ_{t−1} − ρ_t)
In this formula, the exponential term exp(−ρ_t) implies that a smaller robustness value leads to a larger reward. λ is the temporal-difference weight coefficient, set to 0.2 in experiments. The term ρ_{t−1} − ρ_t represents the robustness change between consecutive steps: if robustness increases (ρ_t > ρ_{t−1}), the temporal-difference correction term becomes negative, reducing the total reward and prompting the agent to further optimize the input; if ρ_t < 0, exp(−ρ_t) increases significantly, resulting in a substantial boost to the total reward and thus reinforcing the learning of counterexample inputs.
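The fine-grained reward can be sketched directly; the combination of the exponential robustness term with the temporal-difference correction λ(ρ_{t−1} − ρ_t) is an assumed reconstruction of the formula, with λ = 0.2 as stated in the text:

```python
import math

LAMBDA = 0.2  # temporal-difference weight coefficient (from the text)

def fine_grained_reward(rho, rho_prev):
    # exp(-rho): smaller (or negative) robustness -> larger reward.
    # LAMBDA * (rho_prev - rho): positive when robustness decreased this step.
    return math.exp(-rho) + LAMBDA * (rho_prev - rho)

improving = fine_grained_reward(0.5, 0.8)   # robustness dropped: bonus
worsening = fine_grained_reward(0.8, 0.5)   # robustness rose: penalty
violation = fine_grained_reward(-0.2, 0.1)  # counterexample region: boost
```

The ordering violation > improving > worsening is what drives the agent toward counterexample inputs step by step.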
4.6.2. Medium-Grained Reward
Operating at the window scale, the medium-grained reward provides feedback on temporal correlations. Using a sliding event window of size W (set based on experimental validation) as the calculation unit, it is computed every W time steps based on the cumulative robustness decrease within the window. This ensures the agent captures mid-term temporal trends in robustness.
4.6.3. Coarse-Grained Reward
Operating at the episode scale, the coarse-grained reward provides long-term goal feedback. Calculated at the end of each episode (when the time step reaches T or a counterexample is found), it is determined by the minimum robustness value throughout the episode and the proximity to a counterexample, guiding the agent toward the ultimate falsification goal.
4.6.4. Total Reward Calculation
The total reward is a weighted sum of the three levels of rewards. A weight of 0.5 is assigned to the fine-grained reward to prioritize immediate feedback. The medium-grained and coarse-grained rewards are assigned weights of 0.3 and 0.2, respectively, to complement mid-term correlation capture and long-term objective guidance, thus preventing the agent from falling into local optima.
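The weighted sum can be written down directly from the stated weights; only the function and variable names are illustrative:

```python
WEIGHTS = {"fine": 0.5, "medium": 0.3, "coarse": 0.2}  # from the text

def total_reward(r_fine, r_medium, r_coarse):
    # Weighted sum of the three granularity levels.
    return (WEIGHTS["fine"] * r_fine
            + WEIGHTS["medium"] * r_medium
            + WEIGHTS["coarse"] * r_coarse)

r = total_reward(1.0, 0.5, 0.2)
```

Because the weights sum to 1, the total reward stays on the same scale as its components, which keeps the policy gradient magnitudes comparable across episodes.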
The hierarchical interplay of these three reward levels and their integration into the overall learning loop is illustrated in
Figure 3. The diagram visualizes how fine-grained rewards provide per-step feedback, medium-grained rewards operate over a sliding window, and coarse-grained rewards assess the complete episode, ultimately converging into a single scalar reward that guides policy updates.
4.6.5. Design Rationale and Justification
The hierarchical architecture of the multi-granularity reward function is a deliberate design response to the core challenges of signal sparsity and inefficient exploration in black-box falsification. Each tier of the reward mechanism addresses a distinct facet of the learning problem, and their integrated operation provides structured guidance across multiple temporal scales.
The fine-grained reward component directly counteracts the problem of sparse feedback by supplying a dense, stepwise learning signal. It operates on the robustness degree at each time step, ensuring that every action receives evaluative feedback. The exponential term establishes a continuous mapping between robustness values and rewards, while the temporal-difference component provides immediate directional cues by incentivizing reductions in robustness. This transforms a static performance metric into a dynamic gradient for policy optimization.
Medium-grained rewards address the challenge of policy stagnation in local plateaus. By evaluating performance over a sliding temporal window and rewarding the maximum robustness improvement observed within that horizon, this component introduces a medium-term perspective. It encourages the agent to develop strategies that yield sustained progress over sequences of actions, which is critical for falsifying properties with extended temporal dependencies.
The coarse-grained reward ensures alignment with the global falsification objective. Calculated upon episode termination and based on the overall trace robustness, this component provides a stable, long-term signal that grounds the entire exploration process. It mitigates potential misdirection from transient fluctuations in the finer-grained rewards and consistently reinforces the ultimate goal of identifying a violating trace.
The synthesis of these components through a weighted sum creates a cohesive feedback system. This multi-scale architecture effectively converts the sparse, binary outcome of traditional temporal logic falsification into a rich and continuous learning signal. The resultant guidance is instrumental in achieving the demonstrated improvements in both sample efficiency and falsification success rate.
The three-layer granularity structure is a principled decomposition of the long-horizon falsification task into discrete temporal scales. Fine, medium, and coarse grains correspond to the immediate stepwise, the phased multi-step, and the global episodic temporal scales inherent to sequential decision-making under temporal logic constraints. This tripartite structure provides necessary and sufficient coverage of the feedback spectrum. It delivers dense guidance for local optimization, counters policy stagnation in intermediate phases, and maintains alignment with the terminal objective. Introducing additional layers would not yield commensurate benefits but would increase computational and tuning complexity.
The layer count is a fixed architectural feature of the framework, derived from the fundamental temporal hierarchy described above. Generalization across diverse CPS models or specifications is achieved by scaling the temporal parameters within each layer, not by altering the number of layers. Specifically, the window size for medium-grained evaluation and the episode horizon for coarse-grained assessment are calibrated according to the characteristic time constants and the dominant temporal operators of the specific system-property pair. This ensures the framework’s adaptability while preserving a consistent and interpretable reward topology.
4.6.6. Parameter Selection and Evaluation
The parameters within the reward function, including the temporal-difference coefficient λ in Equation (24) and the layer weights for the multi-granularity reward, are determined through a combination of principled design and empirical validation.
The coefficient λ in the fine-grained reward balances the influence of the absolute robustness value and its temporal derivative. A value of zero would ignore the direction of change, while a value too large could destabilize learning by overemphasizing single-step fluctuations. The chosen value of 0.2 was found to provide stable and consistent learning progress across all benchmark models during preliminary empirical studies.
The weights for the multi-granularity reward summation are set to 0.5 (fine-grained), 0.3 (medium-grained), and 0.2 (coarse-grained). This distribution reflects a hierarchical prioritization where immediate, stepwise feedback is most critical for guiding local search, followed by medium-term trend evaluation to escape plateaus, with global episodic guidance providing foundational direction. These weights were not subjected to an exhaustive grid search to avoid overfitting to a specific model or property. Instead, they were established based on their conceptual alignment with the respective importance of each time scale and then validated by observing consistent performance improvements across all three diverse CPS benchmarks (CARS, AT, PTC). The robustness of the results to minor variations in these weights confirms that the framework is not overly sensitive to their precise values, provided the hierarchical relationship is maintained.
4.7. Parameter Selection Methodology
The selection of hyperparameters within the DRL-BiT-MPR framework follows a systematic methodology grounded in the physical dynamics of cyber-physical systems, rather than arbitrary per-model empirical tuning. This principled approach ensures both scalability across system complexities and transferability across application domains.
4.7.1. Unified Selection Principles
Parameter selection is governed by three interconnected principles. First, time constant alignment ensures that the historical sequence length L and prediction horizon K correspond to measurable system temporal characteristics. Specifically, L must encompass the dominant stabilization period of the observed signals, while K should match the inherent response delay between control inputs and their observable effects on system outputs.
Second, multi-scale balance dictates the distribution of reward weights in the MPR function. The fine, medium, and coarse granularity rewards are allocated to provide immediate search guidance, maintain exploration momentum, and ensure alignment with the long-term falsification objective, respectively. This structured reward shaping addresses the exploration-exploitation trade-off inherent in temporal logic falsification tasks.
Third, robustness by design guides the selection of parameter values that reside within flat regions of the performance landscape, where minor deviations induce minimal degradation in falsification success. This principle prioritizes stable performance over fragile optimality, a consideration validated through the systematic sensitivity analysis presented in
Section 5.9.
4.7.2. Application to Case Studies
These unified principles are consistently applied across the three distinct cyber-physical system case studies examined in this work. For the CARS model, analysis of inter-vehicle distance dynamics indicates stabilization within three simulation steps, leading to the selection L = 3. The observed two-step delay between throttle or brake inputs and corresponding changes in distance robustness justifies K = 2.
In the AT model, a differential analysis of signal dynamics is employed. Engine speed requires three steps to stabilize, motivating L = 3 for this signal, while vehicle speed v and gear state g exhibit faster dynamics, leading to L = 2 for these components. The maximum system response delay of three steps determines K = 3.
For the PTC model, the control error demonstrates a four-step stabilization period, resulting in L = 4, whereas the operating mode m evolves more slowly, warranting L = 2. A two-step convergence delay for the control error underpins the selection K = 2.
The consistent application of these dynamical principles across models characterized by differing continuous, hybrid, and temporal behaviors confirms that the parameter selection methodology is not an artifact of model-specific tuning. Instead, it reflects a generalizable approach based on fundamental CPS characteristics, supporting the method’s potential for broader application.
4.8. Analysis of Method Properties
This section presents a systematic analysis of three fundamental properties of the proposed DRL-BiT-MPR framework: convergence behavior, soundness guarantee, and completeness consideration. These properties are essential for evaluating the reliability and applicability of any falsification method in safety-critical cyber-physical systems.
4.8.1. Convergence Analysis
The proposed DRL-BiT-MPR framework employs deep reinforcement learning in a black-box environment, which precludes the provision of formal mathematical convergence guarantees. Nevertheless, we examine its convergence characteristics through architectural design choices and empirical observations.
First, the multi-granularity reward function provides dense-shaped rewards rather than sparse binary signals. This reward design offers continuous learning guidance throughout the exploration process, which has been established in the literature to facilitate more efficient policy gradient optimization.
Second, the bidirectional temporal network architecture maintains gradient flow by incorporating both historical observations and predicted future states. This design mitigates the vanishing gradient problem that commonly impedes training in long-horizon temporal tasks, thereby promoting more stable learning dynamics.
Third, empirical evidence supports the practical convergence of our method. The experimental results demonstrate consistent performance improvement across training episodes. Our method typically achieves stable performance within 100 to 150 episodes, representing faster and more reliable convergence compared to both traditional DRL baselines and the advanced PPO-LSTM agent.
4.8.2. Soundness Guarantee
Soundness represents the assurance that any reported counterexample constitutes a genuine violation of the specified property. This property is paramount for safety-critical applications where false positives could lead to erroneous conclusions.
Our method ensures operational soundness through a rigorous verification mechanism. Every candidate counterexample generated by the policy undergoes validation through re-simulation of the cyber-physical system model. The robustness value is computed using the exact metric temporal logic semantics defined in Section 3. The falsification process terminates and reports a counterexample only when the computed robustness value is strictly negative. This verification step is integral to Algorithm 1 and guarantees that no false positives are reported.
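The verification mechanism described above can be sketched as a simple acceptance test; `simulate` and `robustness` below are stand-ins for the real system model and MTL monitor, and are assumptions of this sketch rather than the paper's implementation.

```python
def verify_counterexample(candidate, simulate, robustness) -> bool:
    """Re-run the system on the candidate input and re-evaluate robustness;
    report a violation only for a strictly negative value."""
    trace = simulate(candidate)
    return robustness(trace) < 0.0

# Toy stand-ins: the "system" echoes its input, robustness is the trace minimum.
assert verify_counterexample([0.3, -0.2], simulate=lambda x: x, robustness=min)
assert not verify_counterexample([0.3, 0.1], simulate=lambda x: x, robustness=min)
```

The key design point is that acceptance depends only on the recomputed robustness of the re-simulated trace, not on any intermediate estimate produced during learning, which is what rules out false positives.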
Algorithm 1 Falsification by Reinforcement Learning

Note: The algorithm begins after the offline phase (Section 4.4) has provided the pretrained LSTM predictor; the online phase described here uses this fixed model.

Require: a finite future reach safety property, its monitoring formula, a system f, an agent a
Ensure: a counterexample input signal x, if one exists

1:  Parameters: the end time L, the maximum number of episodes N
2:  for numEpisode = 1 to N do
3:      y_0 ← the initial (output) state of f
4:      i ← 0; rwd ← the initial reward
5:      x ← the empty input sequence; y ← (y_0)
6:      while i < L do
7:          x_{i+1} ← the input chosen by agent a from its current state and rwd
8:          y_{i+1} ← f(y, x_{i+1}); append x_{i+1} to x and y_{i+1} to y
9:          rwd ← the updated reward (Section 4.6)
10:         i ← i + 1
11:         if the robustness of y is strictly negative then
12:             return x as a falsifying input
13:         end if
14:     end while
15:     reset the agent a
16: end for
Empirical validation across all experimental runs confirms this soundness guarantee: 100% of reported counterexamples exhibited robustness values below −0.01, demonstrating the absence of erroneous violations in our experimental evaluation.
4.8.3. Completeness Considerations
For black-box cyber-physical systems with continuous or hybrid dynamics, achieving formal completeness, defined as the guaranteed discovery of any existing counterexample, is computationally intractable. Our method instead adopts the well-established notion of probabilistic completeness from the sampling-based falsification literature.
The theoretical foundation of this approach states that, as the sampling budget grows, that is, as the number of episodes approaches infinity, the probability of discovering an existing counterexample approaches one. This is the standard completeness notion for sampling-based methods.
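A back-of-envelope calculation illustrates this notion: if each independently sampled episode finds a counterexample with some probability p > 0, the chance of missing it after N episodes is (1 − p)^N, which vanishes as N grows. The value p = 0.05 below is purely illustrative and not an empirical estimate from this work.

```python
def miss_probability(p: float, episodes: int) -> float:
    """Probability that N independent episodes, each succeeding with
    probability p, all fail to find an existing counterexample."""
    return (1.0 - p) ** episodes

p = 0.05  # illustrative per-episode discovery probability
assert miss_probability(p, 10) > miss_probability(p, 100)
assert miss_probability(p, 1000) < 1e-20  # discovery is effectively certain
```

Reward shaping and focused search effectively raise the per-episode probability p, which is how the mechanisms discussed next strengthen this guarantee in practice.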
Our method enhances this probabilistic completeness through two mechanisms. The multi-granularity reward function naturally guides exploration toward regions of low robustness where counterexamples are more likely to reside. Simultaneously, the bidirectional temporal network focuses the search on temporally plausible input sequences, avoiding wasted exploration on behaviorally impossible patterns.
Empirical performance metrics substantiate this enhanced completeness. The consistently superior success rates and reduced sample counts documented in the experimental evaluation demonstrate that our method achieves higher falsification efficiency compared to all baseline methods. This performance gain indicates more efficient coverage of the counterexample space and consequently a higher probability of discovery for any fixed sampling budget.
4.8.4. Comparative Analysis
Table 1 provides a comparative summary of how different falsification approaches address these key properties. The comparison elucidates the fundamental trade-off between formal guarantees and practical applicability that characterizes contemporary falsification methodologies.
The analysis reveals a clear methodological positioning. While the proposed framework sacrifices the formal guarantees available to white-box verification methods, it gains the capability to handle complex black-box cyber-physical system models that lie beyond the reach of formal verification techniques. This trade-off represents not merely a practical necessity but a well-justified methodological choice for real-world cyber-physical system falsification, where system complexity often precludes formal analysis while rigorous safety validation remains imperative.
4.9. Algorithm Overview
The DRL-BiT-MPR falsification process follows a two-phase design. After the offline phase (Section 4.4) provides the pretrained LSTM predictor, the online phase executes the reinforcement learning algorithm described here to find a counterexample for the finite future reach safety property.
Algorithm 1 defines the core inputs: the safety property, its monitoring formula, the system under test f, and the RL agent a. It also uses two key parameters: the per-episode time-step limit L and the maximum episode count N. The agent a incorporates the bidirectional temporal network, which utilizes the pretrained LSTM predictor to construct its state representation.
The algorithm executes N episodes. Each episode initializes the time-step counter i to 0, records the system's initial output as the starting state, calculates the initial reward, and initializes the empty input and output sequences x and y.
An inner loop runs while i stays below L. In each iteration, the agent selects the next input based on its current state and the previous reward, feeds that input into f to generate the next output, appends the output to y, updates the reward per Section 4.6, and increments i. The agent's state is built using the BiT network, which integrates historical outputs with predictions from the pretrained LSTM. Any violation of the safety property by the output trace y prompts the algorithm to return x as the falsifying input and terminate immediately.
If no violation is detected, the agent resets for the next episode. The algorithm concludes after all N episodes; finding no counterexample suggests that the system likely satisfies the safety property within the explored input space and time horizon.
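The control flow just described can be condensed into a simplified, executable skeleton. The `system`, `agent`, and `robustness` callables below are stand-ins for the paper's components (f, a, and the MTL monitor), and the reward update is a placeholder for the multi-granularity scheme of Section 4.6.

```python
def falsify(system, agent, robustness, L: int, N: int):
    """Run up to N episodes of at most L steps; return the first input
    sequence x whose output trace y has strictly negative robustness."""
    for _ in range(N):
        y = [system.initial_output()]    # y_0: the system's initial output state
        x, rwd, i = [], 0.0, 0
        while i < L:
            u = agent.act(y, rwd)        # state built from history + predictions
            y.append(system.step(y, u))  # next output from the black-box system
            x.append(u)
            rwd = -robustness(y)         # placeholder for the Section 4.6 reward
            i += 1
            if robustness(y) < 0.0:      # sound: verified negative robustness
                return x
        agent.reset()
    return None                          # no counterexample within the budget
```

A toy instantiation, where the "system" accumulates its inputs and robustness is the trace minimum, finds a falsifying input within a single episode once the trace dips below zero.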
4.10. Comparative Analysis with Conventional DRL Methods
Having detailed the components of the DRL-BiT-MPR framework, we now elucidate the fundamental distinctions between our approach and conventional DRL methods in the context of CPS robustness falsification. These distinctions are designed to address the core limitations outlined in Section 1.
In terms of temporal state representation, conventional DRL falsifiers typically employ unidirectional recurrent networks to encode a history of observations. This results in a latent state representation that is inherently past-dependent and may be insufficient under partial observability. In contrast, our bidirectional temporal (BiT) network actively constructs a state representation by fusing encoded historical observations with predicted future outputs. This design explicitly models the bidirectional temporal dependencies characteristic of CPS, thereby achieving a more informed and accurate state estimation without requiring access to internal system dynamics.
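The fusion of past and predicted information can be sketched in a few lines. The mean-based encodings below are deliberately crude placeholders for the paper's learned LSTM encoders, and `predictor` stands in for the pretrained forward model; only the overall shape of the idea is shown.

```python
def bit_state(history, predictor, horizon: int):
    """Fuse a summary of past outputs with a predictor's future rollout
    into a single state representation for the agent."""
    future = predictor(history, horizon)      # e.g. pretrained LSTM rollout
    past_code = sum(history) / len(history)   # placeholder history encoding
    future_code = sum(future) / len(future)   # placeholder future encoding
    return (past_code, future_code)

# Toy predictor: repeat the last observation for `horizon` steps.
state = bit_state([1.0, 2.0, 3.0], lambda h, k: [h[-1]] * k, horizon=2)
assert state == (2.0, 3.0)
```

A unidirectional recurrent encoder would produce only the first component; the second component is what lets the policy anticipate where the trajectory is heading before committing to an input.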
Regarding the reward mechanism, a major bottleneck for conventional DRL is the sparse, often binary reward signal based solely on the final specification violation. This sparse feedback leads to inefficient exploration. Our multi-granularity reward (MPR) function is specifically engineered to overcome this by providing structured, dense feedback across three distinct time scales. The continuous guidance from step-level to phase-level rewards dramatically improves exploration efficiency, directly tackling the reward sparsity problem that plagues black-box falsification.
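One natural way to blend signals at several time scales is a weighted sum; the weights and component definitions below are assumptions of this sketch, not the exact MPR formula, which is given in Section 4.6.

```python
def mpr_reward(step_r: float, phase_r: float, episode_r: float,
               w=(0.5, 0.3, 0.2)) -> float:
    """Blend step-level, phase-level, and episode-level signals so the agent
    receives dense feedback even when the coarser signals are silent."""
    return w[0] * step_r + w[1] * phase_r + w[2] * episode_r

# A nonzero step-level signal keeps learning going mid-episode:
assert mpr_reward(0.4, 0.0, 0.0) > 0.0
assert abs(mpr_reward(1.0, 1.0, 1.0) - 1.0) < 1e-12
```

Because the step-level term fires every simulation step, the combined reward is never uniformly zero across an episode, which is precisely how the structure avoids the sparsity problem described above.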
Concerning model dependency, many advanced DRL methods rely on system models for reward shaping or belief state updates. In contrast, both the BiT network, which learns from input–output sequences, and the MPR function, defined on observable robustness, operate without internal model knowledge. This design choice enhances the framework’s applicability to genuine black-box scenarios.
These targeted innovations are expected to translate into superior empirical performance. Specifically, more accurate state-aware policies are anticipated to yield higher falsification success rates, while the guided exploration is expected to significantly reduce the number of required simulations, thereby improving sample efficiency. The experimental results presented in the following section quantitatively validate these expectations.