Finding More Property Violations in Model Checking via the Restart Policy

Abstract: Model checking is an efficient formal verification technique that has been applied to a wide spectrum of applications in software engineering. Popular model checking algorithms include Bounded Model Checking (BMC) and Incremental Construction of Inductive Clauses for Indubitable Correctness/Property Directed Reachability (IC3/PDR). The recently proposed Complementary Approximate Reachability (CAR) model checking algorithm performs close to BMC in bug-finding, but its depth-first strategy sometimes leads the algorithm into a trap, which wastes a great deal of computation. In this paper, we enhance CAR by integrating the restart policy, which yields a restartable CAR algorithm (abbreviated as r-CAR). The restart policy helps avoid the trap problem caused by the depth-first strategy and has played an important role in modern SAT-solving algorithms in the search for a satisfying solution. As bug-finding in model checking is reducible to a similar search problem, the restart policy can be useful for enhancing the bug-finding capability. We conducted an extensive experiment to evaluate the new algorithm. Our results show that, out of the 749 industrial instances, r-CAR finds 13 instances that the state-of-the-art BMC technique cannot find and solves 11 more instances than the original CAR. The new algorithm successfully contributes to the current model-checking portfolio in practice.

Given a software design M as the model and a formal specification (property) P, which is often written in some temporal logic [28], model checking checks whether P holds for all behaviors of M. To achieve this goal, a model-checking algorithm explores the state space of M, starting from the initial states and proceeding to all states reachable from them. Model-checking techniques terminate the exploration as soon as (1) a counterexample witnessing the property violation is detected (in general, finding property violations means the same thing as finding bugs or counterexamples), or (2) a proof is completed that the initial states can never reach the states that violate P. If P is a safety property [29], every counterexample has finite length. As a result, safety model checking can be reduced to the reachability analysis problem [30], and we focus on safety model checking in this paper.
Although model checking has been widely used in software and hardware verification, performance improvements are still in high demand to help solve more industrial instances. It is well known that no single model-checking technique dominates all others, and different algorithms perform differently on different benchmarks [31]. Although invented nearly three decades ago, Bounded Model Checking (BMC) [32,33] is still considered the most efficient technique for detecting property violations, i.e., bugs. Meanwhile, Interpolation Model Checking (IMC) [34] and Incremental Construction of Inductive Clauses for Indubitable Correctness (IC3) [35], also known as Property Directed Reachability (PDR) [36], have been shown to be better suited to proving correctness. Therefore, both academic and industrial model checkers often maintain a portfolio of model-checking techniques to solve different problems.
Recently, a new model-checking algorithm named Complementary Approximate Reachability (CAR) [37] was shown to complement BMC on bug-finding, i.e., detecting property violations, and IC3/PDR on correctness proofs. That is, CAR is able to solve instances that BMC or IC3/PDR cannot solve within the given time and hardware resources. The success of CAR suggests that, even though the relevant techniques have been investigated in depth for decades, it is still possible to improve model-checking performance and make it more useful for industry. In this paper, we focus on CAR and present an improved search strategy inside the algorithm to gain better bug-finding performance.
CAR was inspired by IC3/PDR and the traditional reachability analysis [37]. It maintains an over-approximate state sequence for correctness proofs and an under-approximate state sequence for bug-finding. CAR uses a depth-first search strategy to find new states that meet the constraints; these states are used to refine the under-approximate state sequence, and if the search for such a state fails, the relevant information is collected to refine the over-approximate state sequence. The algorithm terminates as soon as either a bad state is in the under-approximate sequence, which indicates that a counterexample has been detected, or an invariant has been computed from the over-approximate sequence, which indicates that correctness has been proved. More details are given in the high-level description of CAR later in the paper. CAR can be performed in both forward and backward directions. Since evidence has shown that Backward-CAR is better than Forward-CAR at bug-finding [38], we follow this observation and focus on improving Backward-CAR. In the rest of the paper, all mentions of "CAR" refer to Backward-CAR unless stated otherwise.
Although CAR has shown an advantage in detecting bugs for safety model checking and outperforms IC3/PDR in bug-finding, it currently cannot solve as many unsafe instances (those with bugs) as BMC [39]. For the unsafe cases it is unable to solve, the depth-first strategy may lead the algorithm into a trap. As a result, continuing to search for new states is very unlikely to locate the bad states and only wastes computational resources. A similar phenomenon occurs when solving the satisfiability of Boolean formulas (SAT) [40], in which the search can also become trapped if the order of variable assignments is not chosen properly. To tackle this issue, researchers proposed a restart policy under which the current search path is discarded and a new one is selected to escape the trap [41]. Their experiments show that this simple strategy is very effective in speeding up SAT solving, particularly for satisfiable instances.
Inspired by the results achieved by applying the restart policy to modern SAT solvers [41], we leverage a similar idea to enhance the performance of CAR in bug-finding. The new algorithm is named r-CAR (restartable CAR). In our design, the restart policy is invoked as soon as the number of new elements added to the over-approximate state sequence in a single search reaches the frequency k × t, where k is the length of the over-approximate sequence and t is a given threshold, which can be updated dynamically during the search according to a given growth rate gr. That is, if the current threshold is t and the growth rate is gr, the threshold is updated to (t × gr) when a restart is invoked, so the next restart is triggered as soon as the number of new elements of the over-approximate sequence reaches (k × t × gr). As a result, the restart frequency depends on the threshold and on the growth rate used to update it. Once a restart is triggered, CAR deletes all state information collected in the current search and immediately starts a new one. Notably, the information about previous paths is retained in the over-approximate sequence, which guarantees that the new search finds a path different from all those found before.
We conduct a comprehensive experimental evaluation on the 749 industrial instances from the Hardware Model Checking Competition in 2015 [42] and 2017 [43]. We implement our new algorithm on top of the SimpleCAR model checker [38,44] and compare its bug-finding performance to that of the original CAR in SimpleCAR, as well as the BMC and IC3/PDR algorithms implemented in the state-of-the-art model checker ABC [45]. The results show that, given the same time and hardware resources, r-CAR solves 13 unsafe instances that BMC cannot solve and finds 11 more counterexamples than the original CAR when run with different restart configurations. Moreover, r-CAR outperforms IC3/PDR on bug-finding (checking unsafe instances). The new algorithm increases the diversity of the portfolio and thus helps solve more instances. Therefore, we show that combining the restart policy with CAR increases the power of the current model-checking portfolio in industry.
In summary, this paper makes the following contributions:

• We propose r-CAR, an enhanced CAR-based model-checking algorithm, to detect more property violations, i.e., bugs/counterexamples.

• We implement r-CAR in a practical model checker called RestartCAR and conduct an extensive experimental evaluation to show that RestartCAR improves the capability of the SimpleCAR model checker to solve more unsafe instances when run with different restart configurations.

• We further identify practical restart configurations for RestartCAR and study the effectiveness of r-CAR.

Related Work
Compared to Theorem Proving [46], another mainstream formal verification technique, the advantage of model checking is that it avoids massive manual work and enables automatic verification, which is accomplished by performing an exhaustive search on the graph constructed from the model together with the property. However, the main challenge in model checking is the exponential growth of the model's state space, the so-called "state-explosion problem" [47].
Early approaches to model checking [48,49] were based on an explicit search of the model's transition graph, where nodes represent states and edges represent system transitions. Such explicit-state techniques typically do not scale well beyond models with a few million states [50]. A major breakthrough, in the early 1990s, was the introduction of symbolic techniques, which replaced explicit search with Boolean reasoning techniques. The development of Binary Decision Diagrams (BDDs) [51] led to BDD-based symbolic model checking, which enabled the verification of systems with $10^{20}$ states [52]. Yet BDD-based techniques rarely scale to models with more than 1000 Boolean state variables, which limits their applicability to verification in industry [53].
In the late 1990s, SAT solving emerged as a highly effective Boolean reasoning technique [54]. The first application of SAT solving to model checking was in the context of bounded model checking (BMC), in which the search over model behavior is subject to a depth bound [32]. This approach, where model checking is reduced to a sequence of SAT-solving calls, one for each depth bound, has been shown to be highly effective in practice, particularly for detecting property violations (bugs) [49]. Yet BMC is incomplete, as it can only reveal the presence of counterexample behaviors but cannot prove their absence, which led to a quest to develop SAT-based complete model-checking techniques. This is still a very active area, as no single approach has proven to be superior to all others, cf. [40]. While some approaches have tried to find ways to extend BMC to make it complete, e.g., [55], others have tried to follow the approach of BDD-based model checking.
There are two ideas in BDD-based model checking [52]: (1) a set of states can be represented by a Boolean formula (a BDD is a special case), and (2) a key operation in searching the state space is the image/pre-image operation, in which we symbolically compute the set of successor/predecessor states of a given set S of states. Much of the research in BDD-based symbolic model checking has focused on the efficient implementation of the image operation, cf. [56]. One direction of research on SAT-based complete model-checking techniques has been on a SAT-based implementation of the image operation. While standard SAT solving returns a single satisfying assignment when the formula under test is satisfiable, there is a variant, called All-SAT, that returns a representation of all satisfying assignments, cf. [57].
All-SAT-based symbolic model checking did not, however, prove to be a highly scalable approach. Two other SAT-based approaches emerged in the following years. Interpolation-based model checking [34] combines the use of Craig interpolation as an abstraction technique with the use of BMC as a search technique. IC3/PDR starts with an over-approximation, which is gradually refined to become more and more precise [35,36]. Both approaches have proven to be highly scalable and are today part of the algorithmic portfolio of modern model checkers, such as ABC [45]. Normally, users prefer BMC for bug-finding (checking unsafety) and IC3/PDR and IMC for correctness proving (checking safety). Recently, a new model checking algorithm, CAR, was presented, and preliminary results showed that it succeeds in complementing BMC, IC3/PDR, and IMC by solving instances that cannot be solved by those three algorithms [37,44].
Although BMC, IC3/PDR, IMC, and CAR all use SAT techniques to carry out the search, they follow different strategies. Specifically, BMC and IMC use a breadth-first strategy, while IC3/PDR and CAR use a depth-first search strategy. Since IMC is built upon BMC, its bug-finding performance is completely dominated by that of BMC, according to previous literature [44]. In addition, IC3/PDR spends more effort on generating the so-called minimal inductive clauses for proving correctness, so its overall bug-finding performance is not as good as that of CAR. Therefore, BMC and CAR are the two best options for bug-finding. On the one hand, BMC was proposed decades ago, and its performance is now very difficult to improve. On the other hand, CAR is a new algorithm that leaves much room for improvement. This paper therefore presents one candidate solution for improving the bug-finding performance of CAR. Compared to the original CAR [44], the new algorithm r-CAR introduces the restart policy, so that the depth-first search inside CAR can be restarted if the current path is judged unpromising for finding a solution. Moreover, the introduction of restarts significantly improves CAR's performance in bug-finding while preserving its performance in proving correctness, as shown in the experimental section.
In general, among all the modern model checking algorithms mentioned above, BMC can find far more counterexamples than IC3/PDR and IMC within the same time and memory limits, but BMC cannot prove correctness. The CAR algorithm can prove correctness and performs close to BMC in bug-finding. IC3/PDR and IMC focus more on proving correctness and therefore perform better in proving correctness and relatively poorly in bug-finding compared to CAR and BMC. By introducing the restart policy into the CAR algorithm, we obtain the r-CAR algorithm, which retains CAR's ability to prove correctness and enhances its ability to find counterexamples. Moreover, r-CAR can find more counterexamples by running different parameter combinations in parallel, which is not available to the other algorithms. See Table 1. All the aforementioned model checking algorithms are originally bit-level techniques that can only handle Boolean transition systems. Recently, several efforts have been made to migrate such bit-level algorithms to the so-called word-level model checking, which uses an SMT engine instead of a SAT one, owing to the increasing interest in the SMT domain [58][59][60][61]. Generally speaking, bit-level model checking techniques are used mainly in hardware verification, while word-level model checking techniques focus on software verification.

Boolean Transition System
Modern SAT-based model checking techniques consider the Boolean transition system as the model. A Boolean transition system Sys is defined as a tuple $(V, I, T)$, where $V$ is a set of Boolean variables and each state $s$ of the system is in $2^V$, the set of truth assignments to the variables in $V$. $I$ is the set of initial states. If we write $V'$ for the copy of $V$ that represents the set of primed variables, $T$ is the transition relation of the system over $V \cup V'$. We say that state $s_2$ is a successor of state $s_1$ if $(s_1, s_2) \in T$. A path of length $k$ is a finite state sequence $s_0, s_1, \ldots, s_k$ in which each $s_{i+1}$ is a successor of $s_i$. We say that state $t$ is reachable from state $p$ in (resp. within) $k$ steps if there exists a finite path of length $k$ (resp. smaller than $k$) whose first state is $p$ and whose last state is $t$. All states that are reachable from the initial states $I$ are called the reachable states of Sys. Given a safety property $P$ (represented as a Boolean formula) and a Boolean transition system $Sys = (V, I, T)$, we say the system is safe for $P$ if every reachable state $s$ of Sys satisfies $P$, i.e., $s \models P$. Otherwise, the system is unsafe. We call a state violating $P$ a bad state and use $\neg P$ to denote the set of all bad states. A path from an initial state in $I$ to one of the bad states in $\neg P$ is called a counterexample.
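To make the definitions of reachability and counterexample concrete, the following is a minimal explicit-state sketch. It is only an illustration under stated assumptions: the function name `find_counterexample` and the 2-bit counter system are hypothetical and do not come from the paper, and a real SAT-based checker represents states symbolically rather than enumerating them.

```python
from collections import deque

def find_counterexample(init, trans, is_safe):
    """Breadth-first search for a path from an initial state to a bad state.

    init    : iterable of initial states (hashable), playing the role of I
    trans   : function mapping a state to an iterable of its successors (T)
    is_safe : predicate playing the role of the safety property P
    Returns the path as a list of states, or None if every reachable
    state satisfies P.
    """
    parent = {s: None for s in init}
    queue = deque(parent)
    while queue:
        s = queue.popleft()
        if not is_safe(s):
            # Walk the parent links back to an initial state.
            path = []
            while s is not None:
                path.append(s)
                s = parent[s]
            return path[::-1]
        for t in trans(s):
            if t not in parent:
                parent[t] = s
                queue.append(t)
    return None

# Hypothetical 2-bit counter: a state is the tuple (b1, b0).
def counter_step(state):
    value = (2 * state[0] + state[1] + 1) % 4
    return [(value // 2, value % 2)]

# P: "the counter never shows (1, 1)"; the system is unsafe for P.
trace = find_counterexample([(0, 0)], counter_step, lambda s: s != (1, 1))
print(trace)
```

The returned trace is a path from the initial state to a bad state, that is, a counterexample in the sense defined above; when `None` is returned, the toy system is safe for the given property.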
Let $X \subseteq 2^V$ be a set of states in Sys. We define $R(X)$ to be the set of successors of the states in $X$, i.e., $R(X) = \{s' \mid (s, s') \in T \text{ for some } s \in X\}$. We define $R^i(X) = R(R^{i-1}(X))$ for $i > 1$. Similarly, we define $R^{-1}(X)$ as the set of predecessors of the states in $X$, and $R^{-i}(X)$ analogously for $i > 1$.
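The image and pre-image operators can be sketched over an explicitly listed transition relation. This is an illustrative assumption only: the helper names `image`, `preimage`, and `iterate`, and the four-state relation `T`, are hypothetical, and SAT-based checkers compute these sets symbolically.

```python
def image(states, T):
    """R(X): all successors of states in X; T is a set of (s, s') pairs."""
    return {t for (s, t) in T if s in states}

def preimage(states, T):
    """R^{-1}(X): all predecessors of states in X."""
    return {s for (s, t) in T if t in states}

def iterate(op, states, T, i):
    """Apply image or preimage i times, i.e. R^i(X) or R^{-i}(X)."""
    for _ in range(i):
        states = op(states, T)
    return states

# Hypothetical chain 0 -> 1 -> 2 -> 3 with a self-loop on 3.
T = {(0, 1), (1, 2), (2, 3), (3, 3)}
print(image({0}, T))                   # successors of state 0
print(iterate(preimage, {3}, T, 2))    # states that reach 3 in two steps
```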
We call a Boolean variable $a$ or its negation $\neg a$ a literal. Let $L$ be a set of literals. A cube is a Boolean formula of the form $\bigwedge_{l \in L} l$. Analogously, a clause is a Boolean formula of the form $\bigvee_{l \in L} l$. It is easy to see that a state of Sys is a cube. In the rest of the paper, we use the terms state and cube interchangeably for convenience.

The High-Level Description of CAR
Derived from the traditional reachability analysis, CAR can be performed in both the forward and backward directions. As Backward CAR has been shown to be better than Forward CAR [38], in the rest of the paper we focus on Backward CAR, and all mentions of "CAR" refer to Backward CAR. The CAR algorithm maintains a finite under-approximate state sequence $F = F_0, F_1, \ldots, F_k$ ($k \geq 0$) starting from $I$ (the set of initial states), i.e., $F_0 = I$, and each $F_i$ is a subset of the states reachable from $I$ in $i$ steps. Such an under-approximate sequence is called the F-sequence. In addition, CAR maintains an over-approximate sequence $B = B_0, B_1, \ldots, B_k$ ($k \geq 0$) starting from the bad states, i.e., $B_0 = \neg P$, and a state is included in $B_i$ ($i \geq 0$) if it can reach $\neg P$ in $i$ steps. This sequence is called the B-sequence. Each element $B_i$ of the B-sequence is called a frame, and $i$ is the frame level. States in the F-sequence are represented as a disjunction of cubes, while states in the B-sequence are represented as a conjunction of clauses.
A summary of both the F- and B-sequences, including the initialization, constraints, and safety/unsafety checking conditions, is given in Table 2. The F-sequence is defined recursively: (1) $F_0 = I$, i.e., the first element of the sequence is the set of all initial states, and (2) $F_{i+1} \subseteq R(F_i)$ for $i \geq 0$, i.e., the element at position $i + 1$ is a subset of the successors of the states at position $i$. Since each $F_i$ represents only a part of the real states at position $i$, the F-sequence is under-approximate. Because the F-sequence does not include all state information, we can only use it to check unsafety. That is, if a state in some $F_i$ is also a bad state in $\neg P$, a counterexample is found and the unsafe result can be reported. Analogously, the B-sequence is defined recursively: (1) $B_0 = \neg P$, i.e., the first element of the sequence is the set of all bad states (represented by $\neg P$), and (2) $B_{i+1} \supseteq R^{-1}(B_i)$ for $i \geq 0$, i.e., the element at position $i + 1$ is a superset of the predecessors of the states at position $i$. Since each $B_i$ includes all the real states at position $i$, the B-sequence is over-approximate. Because the B-sequence may include more states than the real ones, we can only use it to check safety. If every state in some $B_{i+1}$ is included in some $B_j$ with $0 \leq j \leq i$, correctness is proved and the safe result is reported.
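The two checking conditions can be illustrated with explicit sets of states standing in for the frames. This is a sketch under stated assumptions: the function names `unsafe_by_F` and `safe_by_B` and the toy sequences are hypothetical, and CAR itself stores frames as cubes and clauses and performs these checks with SAT calls.

```python
def unsafe_by_F(F, bad):
    """Unsafety check: some F_i shares a state with the bad states."""
    return any(F_i & bad for F_i in F)

def safe_by_B(B):
    """Safety check: some B_{i+1} is covered by the union of B_0..B_i."""
    covered = set()
    for i in range(len(B) - 1):
        covered |= B[i]
        if B[i + 1] <= covered:
            return True
    return False

# Hypothetical unsafe system: chain 0 -> 1 -> 2 -> 3 with bad state 3.
F = [{0}, {1}, {2}, {3}]
print(unsafe_by_F(F, {3}))

# Hypothetical safe system: the bad state 3 is only reachable from 2 and 3,
# and the initial state 0 never reaches them.
B = [{3}, {2, 3}, {2, 3}]
print(safe_by_B(B))
```

In the second example the frame at level 2 adds no state beyond those already in levels 0 and 1, which is the situation described above as a proof of correctness.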

Table 2. Summary of the F- and B-sequences.

                  F-sequence (under)                             B-sequence (over)
Initialization    $F_0 = I$                                      $B_0 = \neg P$
Constraint        $F_{i+1} \subseteq R(F_i)$                     $B_{i+1} \supseteq R^{-1}(B_i)$
Safety check      (not used)                                     $\exists i \geq 0: B_{i+1} \subseteq \bigcup_{0 \leq j \leq i} B_j$
Unsafety check    $\exists i: F_i \cap \neg P \neq \emptyset$    (not used)

Figure 1 shows the schema for refining the elements of both sequences. The crux is the Boolean formula $\varphi = s \wedge T \wedge B(i)'$, in which $s$ is a state in $F_j$, $T$ is the transition relation formula, and $B(i)'$ is the primed version of $B(i)$, the $i$-th element of the B-sequence. Informally speaking, the formula $\varphi$ queries whether one of the successors of state $s$ is in $B(i)$. The query can be sent to a SAT solver, and if a satisfying assignment is returned, the F-sequence can be updated based on the information in the assignment (see Figure 1b). Otherwise, the B-sequence can be refined according to the unsatisfiable core returned by the SAT solver, which is a subset of $s$ (Figure 1a). As the length of the B-sequence increases, we enumerate the elements of the F- and B-sequences to instantiate the formula $\varphi$ and thereby update all information in the sequences.

SAT Calls with Assumptions and Unsatisfiable Cores
As introduced above, the CAR algorithm frequently invokes SAT calls whose inputs have the form $A \wedge B$, where $B = T \wedge B(i)'$ is a CNF formula, i.e., a Boolean formula of the form $\bigwedge c_i$ in which each $c_i$ is a clause, and $A$ ($= s$) is a cube. We use the notation $SAT(A, B)$ to represent such SAT queries and take $A$ as the assumptions of the SAT solver. Although the result of $SAT(\emptyset, A \wedge B)$ is equivalent to that of $SAT(A, B)$, the latter is typically more flexible for incremental SAT solving, which is a very efficient mechanism for invoking SAT solvers frequently in practice. There are two possible outcomes when a SAT solver handles the query $SAT(A, B)$. If the result is satisfiable, the SAT solver provides a satisfying assignment of the formula $A \wedge B$. Otherwise, $A \wedge B$ is unsatisfiable and the SAT solver can return an Unsatisfiable Core (UC) $uc \subseteq A$, which is a subset of the assumptions $A$. In CAR, the assignments are used to extract the new explicit states of the system that are added to the under-approximate sequence, while the unsatisfiable cores are collected to refine the over-approximate sequence.
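The two outcomes of a query SAT(A, B) can be illustrated with a brute-force stand-in for the solver. The sketch below is an assumption-laden toy: the functions `solve` and `unsat_core`, the DIMACS-style integer literals, and the example formula are all hypothetical, and the deletion-based core extraction shown here is a swapped-in textbook technique, not the mechanism by which MiniSat or SimpleCAR obtain cores.

```python
from itertools import product

def solve(assumptions, cnf, num_vars):
    """Brute-force stand-in for SAT(A, B): return a model (dict) or None.

    Literals are non-zero integers: v means variable v is true, -v false.
    """
    for bits in product([False, True], repeat=num_vars):
        model = {v + 1: bits[v] for v in range(num_vars)}

        def holds(lit):
            return model[abs(lit)] == (lit > 0)

        if all(holds(a) for a in assumptions) and all(
            any(holds(l) for l in clause) for clause in cnf
        ):
            return model
    return None

def unsat_core(assumptions, cnf, num_vars):
    """Deletion-based core for an unsatisfiable query: drop every
    assumption whose removal keeps the query unsatisfiable."""
    core = list(assumptions)
    for lit in list(assumptions):
        trial = [a for a in core if a != lit]
        if solve(trial, cnf, num_vars) is None:
            core = trial
    return core

# Hypothetical B: the single clause (not x1 or not x2) over three variables.
cnf = [[-1, -2]]
print(solve([1, 3], cnf, 3))        # satisfiable: an assignment is returned
print(unsat_core([1, 2, 3], cnf, 3))  # unsatisfiable: a subset of A is returned
```

In the unsatisfiable case the core omits the assumption on x3, which mirrors why a core can describe a larger set of states than the single cube s that was passed as assumptions.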

Algorithm Design

Algorithmic Description of CAR
The pseudo-code for the main procedures of CAR is shown in Algorithm 1. The entry procedure takes a system Sys = (V, I, T) and a safety property P as inputs, and outputs safe if an invariant is detected (Line 14), which indicates that correctness is proven, or unsafe if a counterexample is found (Line 7), which means that a property violation exists. The text in red is introduced to implement the restart policy, which is explained in the next section.

Algorithm 1 Main Procedures of r-CAR and CAR (without texts in red)
Input: Sys = (V, I, T) and Safety Property P;
Output: return safe or unsafe.
...
25: Cube t := get_assignment();
26: F_{j+1} := F_{j+1} ∪ t supposing s is in F_j (j ≥ 0);
    return true;
...

The main framework of CAR is shown from Line 1 to Line 15 of Algorithm 1. The first SAT call at Line 1 checks whether there is a counterexample of length 0, which means that some initial state in I is also a bad state in ¬P. If the SAT query returns unsatisfiable, CAR initializes the B-sequence and the F-sequence at Line 3, according to the rules in Table 2. The loop from Line 5 to Line 15 gradually increases the length of the B-sequence (see Line 13). In each iteration, it first calls the UNSAFECHECK procedure to search for new states and returns unsafe if a counterexample is found. Notably, inside this procedure, the length of the F-sequence can be increased while that of the B-sequence cannot. Meanwhile, the F- and B-sequences can be updated during the search inside the procedure. If UNSAFECHECK finishes without detecting a counterexample, the SAFECHECK procedure is then used to check whether an invariant can be found based on the information in the B-sequence. The loop terminates as soon as one of the above two procedures returns a result, as discussed in [37]. A summary of the procedures in CAR is listed below:

• PICKSTATE at Line 6 takes the F-sequence as input and uses a decision strategy to enumerate and select a state from the sequence. For example, we may enumerate the states from the beginning (resp. end) to the end (resp. beginning) of the sequence, which can be implemented in a trivial way. The procedure returns the empty set ∅ if all states in the sequence have been considered and no more states can be chosen.

• REORDER at Line 21 takes a state as input. Inspired by the concept of assumptions in modern SAT solvers, this procedure maintains two non-conflicting policies named intersection and rotation, which are designed to generate smaller unsatisfiable cores so as to boost the efficiency of the algorithm. The procedure reorders the literals in the state s to generate a new copy ŝ (Cube ŝ at Line 21), which is then passed to the SAT solver as assumptions. For example, given a state s = (l_1, l_2, l_3, l_4), the returned state ŝ may be (l_3, l_4, l_1, l_2) according to the reorder policy inside the procedure. Although the SAT query result remains the same, the latter assumptions may lead to a smaller unsatisfiable core (UC), and the literature [39] has shown the efficiency of such reorder heuristics.

• get_assignment at Line 25 returns a satisfying assignment of the input formula if the SAT query result is satisfiable. A new state t, which is a successor of s, can be extracted from the assignment. Details can be found in [37].

• get_unsat_core at Line 30 generates an unsatisfiable core uc, which is a subset of the assumptions ŝ in the current SAT call, if the query result is unsatisfiable. It is trivial to see that uc is also a subset of s. Essentially, the unsatisfiable core uc represents a set of states (including s) that do not meet the query. Using uc instead of s to update the over-approximate sequence has proven to be more effective.
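The reordering example given for REORDER can be reproduced with a plain cyclic shift. The sketch below is purely illustrative: the function `rotate` and its `offset` parameter are hypothetical, and a cyclic shift is not claimed to be the actual intersection or rotation policy of [39]; it only shows that reordering changes the order of the assumptions without changing the set of literals.

```python
def rotate(cube, offset):
    """Cyclically shift the literals of a cube (a tuple of literals).

    The set of literals, and hence the SAT query result, is unchanged;
    only the order in which the solver sees the assumptions differs.
    """
    if not cube:
        return cube
    offset %= len(cube)
    return cube[offset:] + cube[:offset]

s = ("l1", "l2", "l3", "l4")
s_hat = rotate(s, 2)
print(s_hat)
```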

Restart Policy
The restart mechanism has been widely implemented in modern SAT solvers to improve their performance. The motivation comes from the observation that the search inside the solver may become trapped because of an improper order of the assignments to the variables in the Boolean formula. In such scenarios, continuing the search is very unlikely to find the final result and only wastes computational resources. Therefore, it is reasonable to abandon the current search path and restart with different variable assignments. Studies have shown that this simple strategy turns out to be very effective in solving more satisfiable instances [41]. We observe that CAR suffers from a similar problem during its state search, so applying the restart policy to CAR is a natural idea.
As shown in Algorithm 1, the text in red is the pseudo-code added to integrate the restart policy into CAR, which produces r-CAR. We use a counter variable count to record the number of unsatisfiable cores generated in the current search. The counter increases every time a new unsatisfiable core is computed (Line 31) and is reset to 0 after each restart (Line 8). The insight is that computing too many unsatisfiable cores in a single search probably indicates a trap. In addition, a threshold that can be updated dynamically is provided, and the restart policy is triggered as soon as the condition count > frequency becomes true (Line 3). Notably, we use a flag restart to control whether the restart policy should be triggered (Line 12); its value is updated based on the return value of the RESTARTPOINT procedure (Line 18).
Once the restart policy is triggered, CAR abandons the current search and starts over. However, how often restarts occur is a key factor affecting the final performance. If restarts occur too often, CAR may lose instances that can be solved when no restart is applied. On the other hand, if restarts occur too rarely, the policy may not help solve instances that cannot be solved without restarts. In the implementation, we control the restart frequency through a threshold, whose value is initialized at the beginning (threshold_0 at Line 4 of Algorithm 1) and can then be updated based on a growth rate gr. The value of threshold_0 determines the initial value of frequency through the equation frequency = threshold_0 × (k + 1), where (k + 1) is the length of the current B-sequence. How frequency is updated dynamically depends on gr: each time the CAR algorithm triggers the restart policy, the threshold is multiplied by gr, which raises the value of frequency, so more unsatisfiable cores are needed before the next restart.
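As a worked illustration of the update rule, the bound after n restarts can be written in closed form. The subscript n and the concrete B-sequence length (k + 1) = 5 are assumptions made only for this example (in an actual run the length of the B-sequence may change between restarts); the values threshold_0 = 128 and gr = 1.2 are one of the configurations evaluated in the experiments.

```latex
\begin{align*}
  \mathit{threshold}_n &= \mathit{threshold}_0 \cdot gr^{\,n}, \\
  \mathit{frequency}_n &= \mathit{threshold}_n \cdot (k + 1).
\end{align*}
% Example with threshold_0 = 128, gr = 1.2, and an assumed fixed k + 1 = 5:
\begin{align*}
  \mathit{frequency}_0 &= 128 \cdot 5 = 640, \\
  \mathit{frequency}_1 &= 128 \cdot 1.2 \cdot 5 = 768, \\
  \mathit{frequency}_2 &= 128 \cdot 1.2^{2} \cdot 5 = 921.6 .
\end{align*}
```

With gr = 1 the bound stays constant for a fixed k, whereas any gr > 1 makes each successive search tolerate more unsatisfiable cores before it is abandoned.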
In the UNSAFECHECK procedure, RESTARTPOINT is invoked to judge whether it is time to restart. The procedure takes as inputs an integer k representing the length of the B-sequence and an integer count, which counts the number of new unsatisfiable cores (Line 31) generated in the current search. Since the B-sequence is over-approximate, generating new unsatisfiable cores makes the B-sequence more precise, which may prevent the algorithm from searching the same path again. Therefore, we take the length of the B-sequence into consideration, and the restart frequency is the product of the length of the B-sequence and the threshold. As soon as count is larger than frequency, RESTARTPOINT returns true and the restart flag becomes true, which makes the procedure UNSAFECHECK terminate with the output false (Lines 18-20). Once the restart point is reached, all recursive calls in UNSAFECHECK return false, leading to the termination of the loop at Lines 6-9 and the entry into the procedure RESTART at Line 15.
The RESTART procedure at Line 7 of Algorithm 2 resets the unsatisfiable core counter count (Line 8) and enlarges the threshold by the growth rate gr (Line 9). Compared to the previous search from the initial states I, the frames B_i have been updated with a certain number of unsatisfiable cores, which probably leads to a search path different from the previous ones. The procedure BACKTRACK covers returning to the initial states I, discarding the F-sequence to release memory, and some auxiliary work such as reconstructing the SAT solver. To show clearly the scenario before and after introducing the restart policy, Figure 2 contrasts the search diagrams of CAR and r-CAR (CAR + restart policy). Although the figures are greatly reduced, zooming in on the pictures will not cause distortion. The diagram for CAR shows that it searches very deeply, consuming a large amount of CPU time but returning no counterexample. Meanwhile, Figure 2b shows the search path of r-CAR. It can be observed that r-CAR does not spend too much time on any single path and restarts the search several times. Finally, a counterexample of length seven is found. The trace of the counterexample is shown in the top-right corner of Figure 2b. A counterexample is a path from the initial state to a final state; the initial state and the final state are surrounded by red circles.
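The bookkeeping behind RESTARTPOINT and RESTART, as described in this section, can be sketched as a small controller. This is a minimal sketch under stated assumptions: the class `RestartController`, its method names, and the tiny parameter values are hypothetical, the length of the B-sequence is passed directly as `b_len`, and BACKTRACK (discarding the F-sequence and rebuilding the SAT solver) is deliberately left out.

```python
class RestartController:
    """Tracks unsatisfiable cores and decides when a search should restart."""

    def __init__(self, threshold, growth_rate):
        self.threshold = threshold      # initial threshold (threshold_0)
        self.growth_rate = growth_rate  # gr
        self.count = 0                  # cores generated in the current search

    def record_unsat_core(self):
        """Called every time a new unsatisfiable core is computed."""
        self.count += 1

    def restart_point(self, b_len):
        """True once count exceeds frequency = threshold * len(B-sequence)."""
        return self.count > self.threshold * b_len

    def restart(self):
        """Reset the counter and enlarge the threshold by the growth rate."""
        self.count = 0
        self.threshold *= self.growth_rate

# Hypothetical run: 20 cores, B-sequence of fixed length 2, tiny parameters.
ctrl = RestartController(threshold=2, growth_rate=2.0)
restart_steps = []
for step in range(1, 21):
    ctrl.record_unsat_core()
    if ctrl.restart_point(b_len=2):
        restart_steps.append(step)
        ctrl.restart()
print(restart_steps, ctrl.threshold)
```

In this toy run the gap between consecutive restarts grows, which is the intended effect of choosing gr greater than 1: early searches are abandoned quickly, while later ones are given more room.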

Experimental Setup
We implement r-CAR on top of the SimpleCAR model checker [44]. As mentioned before, the restart frequency has a significant influence on the effectiveness of the restart policy. We conjecture that frequent restarts in r-CAR may not preserve the advantages already achieved by the original CAR, while a low frequency cannot help solve new instances. In our proposed algorithm, two parameters, threshold and gr, are introduced to determine the restart frequency dynamically. We evaluate different combinations of these two parameters. We assign a relatively small value to threshold and a value equal to or greater than 1 to gr, e.g., threshold = 128, gr = 1.2, aiming to avoid the disadvantage of frequent restarts by gradually increasing the threshold after each restart.
We compare our r-CAR implementation to SimpleCAR, which implements the original CAR, and ABC [45], a well-established model checker in the community that implements BMC and IC3/PDR and has won the hardware model checking competition several times. Notably, there are different BMC implementations in ABC, and we select the one invoked by the bmc2 command in the tool, which has the best performance according to previous evaluations [38]. Both SimpleCAR and ABC use the MiniSat SAT solver [62,63] as the computation engine for model checking.
All the experiments are performed on a cluster which consists of 2304 processor cores in 192 nodes, and each node runs RedHat 6.0 with a 2.83 GHz CPU and 48 GB of memory (RAM).In the experiments, for each algorithm, the time and memory limits of each instance are set to be 1 hour and 8GB as default.
We evaluate all algorithms against 749 industrial benchmarks from the single safety property track (SINGLE) of the HWMCC in 2015 [42] and 2017 [43], whose categories are listed in Table 3.Each instance in the benchmark is an aiger model [64], which formalizes the And-Invertor Graph of a circuit together with the safety property to be verified.Latches are an important part of an aiger model.Sequential circuits have latches as state elements.In a model, the number of latches reflects the complexity of the model to a certain extent.Specifically, the state space is 2 l , where l is the number of latches.In Table 3, we grouped all benchmarks according to their source.For example, there are 180 cases, whose names are started with "6s" and provided by IBM.In these 180 cases, the smallest case contains no latches, while the largest contains 467,369 latches, with an average of 16,674 latches per case.This paper focuses on unsafety checking, under which a counterexample can be provided to help identify the property violation.We use the aigsim tool from the Aiger package [65] to check whether the produced counterexamples are correct.We report that the counterexamples generated from all checkers pass the tests successfully.
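To make the relation between latches and state-space size concrete, the following sketch reads the header line of an aiger file and derives $2^l$ from the latch count. The function names are hypothetical; the sketch relies only on the standard AIGER header layout, "aag M I L O A" (ASCII) or "aig M I L O A" (binary), optionally followed by the B C J F fields of AIGER 1.9.

```python
import math


def parse_aiger_header(line: str) -> dict:
    """Parse an AIGER header line into named integer fields.

    M = maximum variable index, I = inputs, L = latches, O = outputs,
    A = AND gates; B, C, J, F are the optional AIGER 1.9 fields and
    default to 0 when absent.
    """
    fields = line.split()
    if len(fields) < 6 or fields[0] not in ("aag", "aig"):
        raise ValueError(f"not an AIGER header: {line!r}")
    names = ["M", "I", "L", "O", "A", "B", "C", "J", "F"]
    header = dict.fromkeys(names, 0)
    header.update(zip(names, (int(x) for x in fields[1:])))
    header["format"] = fields[0]
    return header


def state_space_size(header: dict) -> int:
    """Number of latch valuations, i.e. 2^l for l latches."""
    return 2 ** header["L"]


def state_space_digits(latches: int) -> int:
    """Number of decimal digits of 2^latches, without building the integer."""
    return math.floor(latches * math.log10(2)) + 1
```

For the largest models mentioned above, state_space_digits is the more practical helper, since it avoids materializing the integer $2^l$ when l is in the hundreds of thousands.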

RQ1: What is the appropriate configuration for the restart policy? In the experiments, the original CAR (without the restart policy) is able to solve 145 unsafe instances by providing counterexamples. To evaluate the performance of the restart policy in r-CAR, we first fix the initial threshold at 128 and vary the growth rate gr from 1.0 to 16.0. The numbers of solved and distinctly solved instances (see the caption of Figure 3 for their meaning) under the corresponding parameters are shown in Figure 3a. The figure shows that the restart policy effectively increases the algorithm's diversity, finding a considerable number of new counterexamples under different configurations. In particular, the restart strategy performs best when gr is in the range of one to two, where it finds the largest number of new instances (seven or eight). Surprisingly, even when gr is set to be larger than 16, r-CAR always finds five "distinctly solved" instances. The reason is that these five instances (6s218b2950, oski15a01b03s, oski15a01b43s, oski15a10b11s, oski15a10b14s) need only a single restart when the threshold is set to 128, so gr has no effect on them. Setting gr in the range of 1.0 to 3.0 is more appropriate because the configurations in this range differ in their "distinctly solved" instances, so running the algorithm with different parameters in parallel yields more counterexamples in total. We then vary the value of threshold from 64 to 8192 while fixing gr = 1.2 (as the representative), and the corresponding results are shown in Figure 3b. The restart policy performs better when the value of threshold is smaller than 256, under which not only more "distinctly solved" instances but also several unique instances are detected. For example, "oski15a08b15s" can only be found by "64-1.2", while "6s351rb15" and "oc8051topo08" can only be solved by "128-1.2". We conjecture that certain instances are sensitive to particular combinations of the parameters that determine the frequency of the restart policy. An initial threshold larger than 1024 is too large for the restart strategy to take effect within a one-hour execution. In short, for benchmarks from HWMCC we recommend setting the threshold in the range of 32 to 256 and gr in the range of 1.0 to 3.0. Unfortunately, there is currently no good way to obtain parameters suited to a specific case. More precisely, methods to extract the characteristics of a model are currently lacking, and larger models do not necessarily benefit from more frequent restarts. We consider defining and extracting the characteristics of a model to be meaningful follow-up work.
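The notion of "distinctly solved" instances used above amounts to a set difference against the original CAR, and the benefit of running several configurations in parallel is the union of these differences. A minimal sketch, in which the function names and the instance names inst_a to inst_d are placeholders rather than experimental data:

```python
def distinctly_solved(config_solved, baseline_solved):
    """Instances solved by one restart configuration but not by original CAR."""
    return set(config_solved) - set(baseline_solved)


def portfolio_gain(results_by_config, baseline_solved):
    """Union of the distinctly solved instances over all configurations."""
    gain = set()
    for solved in results_by_config.values():
        gain |= distinctly_solved(solved, baseline_solved)
    return gain


# Placeholder data for illustration only (not results from the paper).
baseline = {"inst_a", "inst_b"}
results = {
    "64-1.2": {"inst_a", "inst_c"},
    "128-1.2": {"inst_b", "inst_c", "inst_d"},
}
new_instances = portfolio_gain(results, baseline)
```

When the per-configuration sets differ, as reported for gr between 1.0 and 3.0, the union is strictly larger than any single difference, which is the argument for the parallel setup.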
It should be highlighted that, although IC3/PDR can also perform differently by varying the parameters used to generate the inductive clauses [31], doing so mainly helps to prove safe instances. Meanwhile, applying the restart policy to CAR results in better performance on unsafe instances, which cannot be achieved by varying the parameters inside IC3/PDR.
Because the state space of a model grows exponentially with the number of variables, allocating linearly more time to the computation yields little marginal benefit. For example, out of a total of 749 instances, the original CAR finds 145 counterexamples in 1 hour and 147 counterexamples in 5 hours, i.e., only 2 more counterexamples in the extra 4 hours. Similarly, BMC finds 153 counterexamples in 1 hour and 159 counterexamples in 5 hours, i.e., only 6 more counterexamples in the extra 4 hours. Considering that r-CAR finds many different counterexamples under different parameters, it is better to allocate computing resources to running r-CAR with different parameter combinations in parallel.
RQ2: Is the restart policy effective? We focus on the number of counterexamples found by the different algorithms, shown in Figure 4. We combine the results from five configurations of r-CAR ("64-1.2", "128-1.2", "128-1.5", "128-3.0", and "256-1.2"), each running for one hour, to represent running these five parameter combinations in parallel; we denote this combination as r-CAR-select. Correspondingly, both CAR and BMC run for five hours. The BMC implementation in ABC finds 159 counterexamples in 5 hours, CAR finds 147 counterexamples, and r-CAR-select finds 158 counterexamples. We can conclude that the restart policy is effective, as it helps CAR find 11 more counterexamples than before (from 147 to 158). In addition, the performance of r-CAR-select is roughly equal to that of BMC (158 vs. 159), which is considered the most effective and widely used algorithm in bug-finding. To better compare the experimental results of r-CAR and BMC, we mark the number of counterexamples that cannot be found by the other side as "unique solved". Similarly, the "unique solved" of "CAR-5h" in Figure 4 represents the number of counterexamples that cannot be found by BMC. As we can see, r-CAR-select finds 13 counterexamples that cannot be found by BMC, considerably more than before (CAR finds 7), which supports our claim that the restart policy plays a non-negligible role as part of the portfolio for checking property violations, i.e., bug-finding. To evaluate the performance of r-CAR from another angle, we compare the time spent in solving each case between r-CAR and BMC and between r-CAR and IC3/PDR; the results are shown in Figure 5a,b, respectively. The X-axis of these two figures is the time spent for the best result from r-CAR-select, while the Y-axis represents the time spent by BMC (resp. IC3/PDR) to solve the problem. Each point above the diagonal represents a single case in which BMC (or IC3/PDR) spends more time to find a counterexample than r-CAR-select, and vice versa. Points with an abscissa or ordinate of 3600 represent instances for which the corresponding method cannot find a counterexample within 3600 s. In Figure 5a, for those instances that can be solved by both algorithms, both BMC and r-CAR solve most of them in a short time (less than 600 s). Even though BMC performs better on the instances that both algorithms can solve, r-CAR uniquely solves several instances that BMC cannot solve. The two methods therefore complement each other very well in bug-finding. Meanwhile, the results in Figure 5b show that r-CAR-select solves many more instances than IC3/PDR within 3600 s and also solves them much faster. r-CAR thus clearly outperforms IC3/PDR in bug-finding.
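The construction of r-CAR-select and of the "unique solved" counts can be sketched as follows. This is an illustrative reading of the text rather than the evaluation script: the function names are hypothetical, runtimes are assumed to be stored per instance in seconds, and an instance that is missing or has a time of 3600 s or more is treated as unsolved.

```python
TIMEOUT = 3600.0  # seconds; instances at or beyond this are treated as unsolved


def virtual_best(times_by_config):
    """Per-instance best time over all configurations (r-CAR-select style)."""
    instances = set().union(*times_by_config.values())
    return {
        inst: min(cfg.get(inst, TIMEOUT) for cfg in times_by_config.values())
        for inst in instances
    }


def unique_solved(times_a, times_b):
    """Instances solved by method A within the limit but not by method B."""
    return {
        inst
        for inst, t in times_a.items()
        if t < TIMEOUT and times_b.get(inst, TIMEOUT) >= TIMEOUT
    }


def side_of_diagonal(x_time, y_time):
    """Classify a scatter point: 'above' means the Y-axis method was slower."""
    if y_time > x_time:
        return "above"
    if y_time < x_time:
        return "below"
    return "on"


# Placeholder runtimes for illustration only (not results from the paper).
r_car = {
    "64-1.2": {"inst_a": 10.0, "inst_b": 500.0},
    "128-1.2": {"inst_a": 4.0, "inst_c": 3000.0},
}
bmc = {"inst_a": 1.0, "inst_b": 3600.0, "inst_d": 20.0}
best = virtual_best(r_car)
```

With the X-axis holding the r-CAR-select time and the Y-axis the BMC (or IC3/PDR) time, side_of_diagonal(best[i], bmc[i]) reproduces the reading of the scatter plots: "above" marks an instance on which r-CAR-select was faster.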
We then compare the overall performance of the different approaches in Figure 6. We can clearly see that r-CAR-select, CAR, and BMC dominate PDR in bug-finding. Compared to BMC, CAR and r-CAR-select solve fewer instances in the early stage. The reason is that BMC is based on a breadth-first search strategy, which normally operates fast at the beginning but becomes slower as the depth increases; indeed, BMC solves few new instances after 10 min. r-CAR follows a depth-first search strategy, and the restart is triggered as soon as the depth of the search path reaches the threshold. Therefore, the restart policy gives r-CAR more opportunities to detect the property violation as time increases and gradually closes the gap between r-CAR and BMC. Considering that r-CAR-select represents the result of running five parameter combinations in parallel, even if we multiply the time cost of r-CAR-select by five, r-CAR catches up with BMC after a total of about five hours, based on the result of Figure 4. It should be clarified that r-CAR with a single configuration cannot outperform BMC. We argue that comparing r-CAR-select, which consists of five different restart configurations, to BMC is still fair because we give BMC 5 h. The BMC implementation we select (abc-bmc2) is the best one as far as we know, and according to our preliminary experiments it can solve all instances that other BMC implementations can solve. As a result, testing more BMC implementations would not affect the conclusions made in this paper.
The ability to perform differently with different configurations is the advantage of r-CAR, which cannot be achieved by either BMC or IC3/PDR.

Conclusions
To summarize, we apply the restart policy to CAR, aiming to escape the trap that occurs during the search and prevents the algorithm from terminating in a reasonable time. The experimental results show that the restart policy increases the diversity of the CAR algorithm, though no single configuration improves the overall performance significantly. The 11 newly solved unsafe instances indicate the effectiveness of the restart policy in this domain. Moreover, the CAR algorithm with the restart policy can now find counterexamples for 13 unsafe instances that BMC cannot solve, which enhances the ability of the current model checking portfolio. This is the first work to study how the restart policy performs in model checking, and our experimental results demonstrate its effectiveness. We expect that our research will help in understanding the performance characteristics of the restart policy. It is possible to run different parameter combinations, which control the restart frequency, in parallel to solve previously unsolvable cases. In future work, we plan to design more elaborate and sophisticated restart mechanisms to improve the overall performance of CAR, so that it can outperform BMC in bug-finding with a single restart configuration. Because different problems are sensitive to different restart frequencies, it is also interesting to introduce learning techniques to learn the best configuration for different instances.
(a) The schema to update elements in the B-sequence.
(b) The schema to refine elements in the F-sequence.

Figure 1. The schema to refine elements in the B- and F-sequences. This figure is better viewed online.

Figure 2. The search paths of the CAR and r-CAR (CAR + restart policy) algorithms on the instance "oski15a08b09s". Zooming in on the pictures will not cause distortion.

Figure 3. Number of unsafe benchmarks solved in the experiments. The "distinctly solved" benchmarks are solved by CAR with the corresponding restart policy but not by the original CAR. The "solved" benchmarks are solved by CAR both with and without the restart policy. The X-axis label 128-1.0 means the initial threshold = 128 and gr = 1.0; the same applies to the other labels.

Figure 4. Comparison of the number of counterexamples.
(a) Comparison on time spent by r-CAR-select and BMC.
(b) Comparison on time spent by r-CAR-select and IC3/PDR.

Figure 5. Comparison of time spent by r-CAR-select with other algorithms.

As mentioned before, bug-finding is not the strong point of IC3/PDR, and the results of IC3/PDR shown in the figure support this claim firmly.

Figure 6. The overall performance of CAR, r-CAR, BMC, and PDR.

Table 1. Modern model checking algorithms.

Table 2. The summary of key structures in CAR.
• UNSAFECHECK from Lines 17 to 34 takes a state s, an integer i representing the current frame level of the B-sequence, and an integer k representing the length of the B-sequence as inputs. The procedure first reorders the input state s to ŝ through the REORDER procedure, and then invokes a SAT call SAT(ŝ, T ∧ $B_i$) to check whether state s can directly reach states in $B_i$. If the result is unsatisfiable, it calls get_unsat_core() to obtain an unsatisfiable core c ⊆ s. Considering that ¬c represents the over-approximate set of states which should not be in $B_{i+1}$ because they cannot reach states in $B_i$, the unsatisfiable core c is added to $B_{i+1}$. On the other hand, if the SAT query is satisfiable and the given integer i is 0 (Line 23), that is, state s can reach the bad states in $B_i$. As the state s is selected from the F-sequence, which stores states reachable from I, a counterexample is found and the procedure returns true. If the SAT call is satisfiable with i > 0 (Lines 25-27), we invoke get_assignment() to obtain a new state t, which is added into the F-sequence as the successor of s, and recursively invoke UNSAFECHECK(t, i − 1, k).
• SAFECHECK at Lines 35-42 takes an integer k and the length of the B-sequence as inputs. By enumerating i from 0 to k, the procedure checks whether all the states in $B_{i+1}$ are contained in the union of $B_0$, $B_1$, ..., $B_i$. If this is the case, we can assert that all the states that can reach the bad states $B_0$ = ¬P have been included in the B-sequence. Because the initial states I are not in the B-sequence, the system Sys is safe for the property P. This procedure exactly implements the safety check condition shown in Table 2.
• RestartPoint at Line 18 returns true if CAR is ready to restart the state search according to the restart policies introduced below.
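The containment test performed by SAFECHECK can be illustrated with an explicit-state sketch. This is a deliberate simplification: here each element of the B-sequence is a plain Python set of states, whereas the algorithm described above represents these sets symbolically and decides containment with SAT queries. The function name and the integer encoding of states are assumptions made for illustration.

```python
def safe_check(b_sequence):
    """Explicit-state analogue of the safety check condition.

    b_sequence[i] is the set of states stored in B_i. The check succeeds
    if some B_{i+1} is contained in the union B_0 ∪ ... ∪ B_i, meaning
    that the B-sequence has stopped discovering new states.
    """
    seen = set()
    for i in range(len(b_sequence) - 1):
        seen |= b_sequence[i]           # seen = B_0 ∪ ... ∪ B_i
        if b_sequence[i + 1] <= seen:   # subset test: B_{i+1} ⊆ seen
            return True
    return False


# States are encoded as integers purely for illustration.
converged = [{3}, {2, 3}, {2}]   # B_2 adds nothing beyond B_0 ∪ B_1
growing = [{3}, {2}, {1}]        # every frame still contributes a new state
```

In this toy encoding, safe_check(converged) holds while safe_check(growing) does not, mirroring the situation in which the B-sequence has, or has not yet, reached a fixpoint.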

Table 3. The categories of benchmarks.
(a) Results of the restart policy with the initial threshold = 128.
(b) Results of the restart policy with gr = 1.2.