Scale-Free Random SAT Instances

We focus on the random generation of SAT instances that have properties similar to real-world instances. It is known that many industrial instances, even with a great number of variables, can be solved by a clever solver in a reasonable amount of time. This is not possible, in general, with classical randomly generated instances. We provide a different generation model of SAT instances, called scale-free random SAT instances. It is based on the use of a non-uniform probability distribution P(i) ∼ i^{−β} to select variable i, where β is a parameter of the model. This results in formulas where the number of occurrences k of variables follows a power-law distribution P(k) ∼ k^{−δ}, where δ = 1 + 1/β. This property has been observed in most real-world SAT instances. For β = 0, our model extends classical random SAT instances. We prove the existence of a SAT-UNSAT phase transition phenomenon for scale-free random 2-SAT instances with β < 1/2 when the clause/variable ratio is m/n = (1 − 2β)/(1 − β)². We also prove that scale-free random k-SAT instances are unsatisfiable with high probability when the number of clauses exceeds ω(n^{(1−β)k}). The proof of this result suggests that, when β > 1 − 1/k, the unsatisfiability of most formulas may be due to small cores of clauses. Finally, we show how this model allows us to generate random instances similar to industrial instances, of interest for testing purposes.


Introduction
Over the last 20 years, SAT solvers have experienced a great improvement in their efficiency when solving practical SAT problems. This is the result of techniques such as conflict-driven clause learning (CDCL), restarting and clause deletion policies. The success of SAT solvers is surprising if we take into account that SAT is an NP-complete problem and, in fact, a large fraction of formulas needs exponential-size resolution proofs to be shown unsatisfiable. This has led some researchers to study the nature of real-world or industrial SAT instances that makes them easy in practice. In parallel, most theoretical work on SAT has focused on uniform randomly selected instances. Nevertheless, nowadays we know that most industrial instances share some properties that are not present in most (uniform randomly chosen) SAT formulas. It is also well-known that solvers that perform well on industrial instances do not perform well on random instances, and vice versa. Therefore, a new theoretical paradigm that describes the distribution of industrial instances is needed. Not surprisingly, generating random instances that are more similar to real-world instances is described as one of the ten grand challenges in satisfiability [48,47,38,39].
Over the last 10 years, the analysis of the industrial SAT instances used in SAT solver competitions has allowed us to have a clear image of the structure of real-world instances. Williams et al. [51] proved that industrial instances contain a small number of variables (called the backdoor of the formula) that, when instantiated, make the formula easy to solve. Ansótegui et al. [9] showed that industrial instances have a smaller tree-like resolution space complexity (also called hardness [15]) than randomly generated instances with the same number of variables. Ansótegui et al. [6] proved that most industrial instances, when represented as a graph, have a scale-free structure. This kind of structure has also been observed in other real-world networks like the World Wide Web, the Internet, some social networks like paper co-authorship or citation, the protein interaction network, etc. Ansótegui et al. [10], Ansótegui et al. [4] show that these graph representations of industrial instances exhibit a very high modularity. Modularity has been shown to be correlated with the runtime of CDCL SAT solvers [45], and has been used to improve the performance of some solvers [11,50,40]. It is also known that these graph representations are self-similar [3] and that eigenvector centrality is correlated with the significance of variables [37].
Defining a model that captures all these properties observed in industrial instances is a hard task. Here, we focus on the scale-free structure. We will define a model and propose a random generator for scale-free SAT formulas, extending our work presented in CCIA'07 [8], CCIA'08 [5], IJCAI'09 [7] and CP'09 [6]. This model is parametric in the size k of clauses and an exponent β. Formulas are sets of m independently sampled clauses of size k with possible repetitions. Clauses are sets of k independently sampled variables, without repetitions, where each variable x_i is chosen with probability P(x_i) ∼ i^{−β}, and negated with probability 1/2. In this paper, we also study the SAT-UNSAT phase transition phenomena in this new model using percolation techniques of statistical mechanics. We prove that random scale-free formulas over n variables, with exponent β and ω(n^{(1−β)k}) clauses of size k, are unsatisfiable with high probability (see Theorem 5). This means that, for big enough values of β, the number of clauses needed to make a formula unsatisfiable is sub-linear in the number of variables, contrary to the standard random SAT model. We also prove that scale-free random 2-SAT formulas with exponent β < 1/2 and a clause/variable ratio m/n > (1 − 2β)/(1 − β)² are also unsatisfiable with high probability (see Theorem 4). This last result, together with a coincident lower bound found by Friedrich et al. [31], allows us to conclude that scale-free random 2-SAT formulas show a SAT-UNSAT phase transition threshold.
During the revision of this article, many new results related to the phase transition on scale-free random formulas have been found. Friedrich et al. [30] generalize the notion of scale-free random k-SAT formulas and prove that there exists an asymptotic satisfiability threshold (in the sense of [29]) for β < 1 − 1/k, when the number of clauses is linear in the number of variables. Friedrich and Rothenberger [32] find sufficient conditions for the sharpness of this threshold, generalizing Friedgut [29]'s result for uniform random formulas. The authors of [33] generalize the notion of scale-free formula to the notion of non-uniform random formula, only assuming that variable x_i is selected with probability p_i (where p_1 ≥ p_2 ≥ … ≥ p_n), and determine the position of the threshold for k = 2. Cooper et al. [24] and Omelchenko and Bulatov [46] analyze the configuration model for 2-SAT where, instead of fixing the probability of every variable, they fix the degree of every variable. If these degrees follow a power-law distribution, the location of the satisfiability threshold (for k = 2) is the same as in our model. This article proceeds as follows. In Section 2 we review some methods to generate scale-free random graphs. One of these methods is the basis of the definition of scale-free random formulas, introduced in Section 3. In Section 4, we summarize some properties of industrial or real-world SAT instances, described in detail in our work presented at CP'09 [6]. We prove the existence of a SAT-UNSAT phase transition phenomenon in scale-free random 2-SAT instances in Section 5. This is done using percolation techniques. In Section 6, we prove that when the β parameter that regulates the scale-free structure of formulas exceeds a certain value, the SAT-UNSAT phase transition phenomenon vanishes, and most formulas become unsatisfiable due to small cores of unsatisfiable clauses.

Generation of Scale-Free Graphs
Generating scale-free formulas has an obvious relationship with the generation of scale-free graphs. In this section we review some graph generation methods developed by researchers on complex networks.
A scale-free graph is a graph where node degrees follow a power-law distribution P(k) ∼ k^{−γ}, at least asymptotically, where the exponent γ is around 3. Preferential attachment [13] has been proposed as the natural process that makes scale-free networks so prevalent in nature. This process can be used to generate scale-free graphs as follows. Given two numbers n and m, we start at time t = m + 1 with a clique of size m + 1 where all nodes have degree m (in the limit when n tends to infinity, the starting graph is not relevant). Then, at every time t = m + 2, …, n, we add a new node (with index t), connected to m distinct and older nodes s < t, such that the probability that a node s gets a connection to this new node t is proportional to the degree of s at time t. This process generates a scale-free graph with asymptotic exponent γ = 3, average node degree E[k] = 2m and minimum degree k_i ≥ m, for all nodes. We can also prove that the expected degree of node i satisfies E[k_i] ∼ i^{−1/2}. In order to explain the origin of scale-free networks where γ deviates from 3, several models have been proposed [26]. One of these models is based on the aging of nodes [25]. This means that the probability of a node s (created at instant s) to get a new edge at instant t is proportional to the product of its degree and (t − s)^{−α}, where t − s is the age of the node. This model generates scale-free graphs when α < 1. When α → 0, the exponents of the power-laws P(k) ∼ k^{−γ} and E[k_i] ∼ i^{−β} are γ = 3 + 4(1 − ln 2)α and β = 1/2 − (1 − ln 2)α, respectively. Therefore, the value of α may be used to tune the values of γ and β.
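As a concrete illustration, the preferential-attachment process just described can be sketched in a few lines of Python (the function name is ours; the stub-list trick, where each node appears once per unit of degree, is a standard way to realize degree-proportional selection):

```python
import random

def preferential_attachment(n, m, seed=None):
    """Sketch of the process in the text: start from an (m+1)-clique, then
    each new node t attaches to m distinct older nodes, chosen with
    probability proportional to their current degree."""
    rng = random.Random(seed)
    # Start with a clique on nodes 0..m (every node has degree m).
    edges = [(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)]
    # 'stubs' lists every node once per unit of degree, so a uniform pick
    # from it realizes preferential (degree-proportional) attachment.
    stubs = [v for e in edges for v in e]
    for t in range(m + 1, n):
        targets = set()
        while len(targets) < m:          # m distinct older neighbours
            targets.add(rng.choice(stubs))
        for s in targets:
            edges.append((s, t))
            stubs.extend((s, t))
    return edges
```

The resulting graphs have exactly (m+1 choose 2) + (n − m − 1)·m edges, so the average degree tends to 2m as stated in the text.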
In the previous methods, growth in the number of nodes is essential. There are other methods, usually called static, where the number of nodes is fixed from the beginning and during the process we only add edges.
The simplest static method is the configuration method, which assumes uniform probability over all graphs with a given scale-free degree sequence. It can be implemented as follows.
Given a desired number of nodes n and exponent γ, for every node i ∈ {1, …, n}, generate a degree k_i following the probability P(k) = k^{−γ}/ζ(γ), independently of i. Here, ζ(x) = Σ_{i=1}^∞ i^{−x} is the Riemann zeta function. Then, generate a graph with these node degrees, ensuring that all of them are generated with the same probability. This can be done, for instance, with an unfold-fold process. In the unfolding, we replicate node i, with degree k_i, into k_i new nodes with degree 1. Then, we randomly generate a graph where all nodes have degree equal to one, ensuring that all 1-regular graphs with Σ_{i=1}^n k_i nodes are generated with the same probability. Then, in the folding, we merge the k_i nodes that came from the replication of i into the same node. When there is an edge between two nodes and these two nodes are merged, a self-loop is created. Similarly, when we have two edges i_1 ↔ j and i_2 ↔ j and i_1 and i_2 are merged, a duplicated edge is created. Therefore, we reject the resulting graph if it contains self-loops or multiple edges between the same pair of nodes. Alternatively, we can also apply the Erdös-Rényi generation method to the unfolded set of nodes, with average node degree equal to one. In this latter case, we would ensure that, after folding, node i has a degree close to k_i, since in the Erdös-Rényi model node degrees follow a binomial distribution (a Poisson distribution P(k) = e^{−z} z^k / k! in the infinite limit, where z is the average degree, z = 1 in our case). The previous method has two problems. First, the resulting graph (after the unfold-fold process) will have average node degree equal to ζ(γ−1)/ζ(γ). If we want to obtain a graph with a distinct average degree, we have to modify the probability P(k) for small values of k, and ensure that P(k) follows a power-law distribution only asymptotically, for big values of k. In other words, we only require P(k) to follow a heavy-tail distribution.
Second, a great fraction of generated graphs will contain self-loops or multiple-edges after folding. This means that a great fraction of graphs will be rejected, which makes the method inefficient. However, the model can be useful to translate some properties of the Erdös-Rényi model to scale-free graphs via the unfolding-folding process and the configuration model [14,17].
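The unfold-fold procedure with rejection can be sketched as follows (helper names are ours; truncating the degree distribution at a maximum degree kmax is a practical assumption, since an unbounded power-law tail cannot be sampled exactly by enumeration):

```python
import random

def powerlaw_degree(gamma, kmax, rng):
    # Sample a degree from P(k) proportional to k^(-gamma) on {1,...,kmax}
    # (kmax truncates the tail; the zeta normalisation is implicit).
    weights = [k ** (-gamma) for k in range(1, kmax + 1)]
    return rng.choices(range(1, kmax + 1), weights=weights)[0]

def configuration_graph(n, gamma, kmax=100, seed=None, max_tries=1000):
    """Unfold-fold sketch: replicate node i into k_i degree-1 stubs, pair
    all stubs uniformly at random (the 1-regular graph), fold back, and
    reject the result if folding created self-loops or multi-edges."""
    rng = random.Random(seed)
    degrees = [powerlaw_degree(gamma, kmax, rng) for _ in range(n)]
    if sum(degrees) % 2:              # stubs must pair up evenly
        degrees[0] += 1
    for _ in range(max_tries):
        stubs = [i for i, k in enumerate(degrees) for _ in range(k)]
        rng.shuffle(stubs)
        edges = set()
        ok = True
        for a, b in zip(stubs[::2], stubs[1::2]):
            if a == b or (min(a, b), max(a, b)) in edges:
                ok = False            # self-loop or multi-edge: reject
                break
            edges.add((min(a, b), max(a, b)))
        if ok:
            return degrees, edges
    raise RuntimeError("rejection rate too high; increase max_tries")
```

The retry loop makes the rejection cost visible: the heavier the tail (small γ), the more often folding creates self-loops or duplicated edges, which is exactly the inefficiency discussed above.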

The unfolding-folding procedure was described by Aiello et al. [2]. Instead of assigning a random degree to each node, they describe a model where, given two parameters α and γ, we choose a random graph (with uniform probability, and allowing self-loops) among all graphs satisfying that the number of nodes with degree x is e^α / x^γ. When γ > 2, the average node degree in this model is also ζ(γ−1)/ζ(γ). Alternatively, instead of fixing the degree of every node, we can fix the expected degree of every node, E[k_i] = w_i. In order to construct a graph where nodes have this expected degree E[k_i] ∼ w_i, we only need to generate edge i ↔ j with probability P(i ↔ j) ∼ w_i w_j. If we want to generate a scale-free graph where P(k) ∼ k^{−δ}, for sparse graphs, it suffices to fix w_i = i^{−1/(δ−1)} [21,36] (see also Theorem 1).
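This expected-degree construction can be sketched directly (the function name is ours; rescaling the weights to hit a target average degree is our own choice of normalization, added so the sketch produces a graph of controllable density):

```python
import random

def chung_lu_scale_free(n, delta, mean_degree, seed=None):
    """Expected-degree sketch: node i gets weight w_i ~ i^(-1/(delta-1))
    and each pair i < j is joined independently with probability
    w_i*w_j / sum(w), so that E[k_i] ~ w_i and the degree distribution
    follows P(k) ~ k^(-delta) for sparse graphs."""
    rng = random.Random(seed)
    w = [(i + 1) ** (-1.0 / (delta - 1)) for i in range(n)]
    scale = mean_degree * n / sum(w)   # our choice: make sum(w) = mean_degree*n
    w = [scale * wi for wi in w]
    total = sum(w)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < min(1.0, w[i] * w[j] / total):
                edges.append((i, j))
    return w, edges
```

Since each pair is tested independently, the expected number of edges is (sum(w)² − sum(w²)) / (2·sum(w)), i.e. roughly mean_degree·n/2, and low-index nodes receive the bulk of the connections.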
Our scale-free formula generation method is based on this static scale-free graph generation model with fixed expected node degrees. Basically, nodes are replaced by variables. Then, instead of edges, we generate hyper-edges. Negating every variable connected by a hyper-edge with probability 1/2, we get clauses.

Scale-Free Random Formulas
In this section we describe the scale-free random SAT formula model. We consider k-SAT formulas over n variables, denoted by x_1, …, x_n. A formula is a conjunction of m possibly repeated clauses, represented as a multiset. Clauses are disjunctions of k literals, noted l_1 ∨ … ∨ l_k, where every literal may be a variable x_i or its negation ¬x_i. We identify ¬¬x with x. We restrict clauses to contain no repeated occurrences of variables. This avoids simplifiable formulas like x ∨ x ∨ y and tautologies like x ∨ ¬x ∨ y. In general, we represent every variable by its index, and negation as a minus sign, writing i instead of x_i, and −i instead of ¬x_i. In other words, a variable x is a number in {1, …, n}, and a literal a number in {−n, …, n} distinct from zero. We use the notation ±x to denote either x or ¬x. The number of occurrences of literal l in a formula is denoted by k_l, and K_x = k_x + k_{¬x} denotes the number of occurrences of variable x. The size of a formula F is |F| = m·k.
In the following, we will use the notation P(x) ∼ f(x) to indicate that the random variable x follows the probability distribution f(x). The notation f(n) ≈ g(n) indicates that lim_{n→∞} f(n)/g(n) = 1.

Definition 1 (Scale-free Random Formula). In the scale-free model, given n, m and β, to construct a random formula, we generate m clauses independently at random from the set of 2^k (n choose k) clauses, sampling every valid clause C = l_1 ∨ … ∨ l_k with probability P(C) ∼ Π_{i=1}^k P(l_i), where every literal l_i is sampled with probability P(l_i = ±x_j) = (1/2) j^{−β} / Σ_{r=1}^n r^{−β}. In practice, we generate a variable x with probability P(x) = x^{−β} / Σ_{i=1}^n i^{−β}, negate it with probability 1/2, repeat the process k times, and reject clauses containing repeated variables. Therefore, the probability of a clause satisfies the inequality P(C = l_1 ∨ … ∨ l_k) ≥ k! Π_{i=1}^k P(l_i).

Some Properties of the Model
In the case of the graph generator, we reject self-loops and repeated edges between two nodes. This makes the distribution of degrees follow a power-law only asymptotically, and for sparse graphs. In our case, we reject clauses with repeated variables. This is what invalidates the reverse direction of the previous inequality. It also makes formulas follow a power-law distribution in the number of variable occurrences only asymptotically (see Theorem 1). In the following we will discuss when this approximation is valid. Notice that Σ_{i=1}^n i^{−β} = H_{n,β} are the generalized harmonic numbers. When n tends to infinity and β ≠ 1, using the Euler-Maclaurin formula, they can be approximated as H_{n,β} ≈ n^{1−β}/(1−β) + ζ(β), where ζ(β) is the Riemann zeta function. When β = 1, we have H_{n,1} ≈ ln n + γ, where γ is the Euler constant. This means that, when n tends to infinity, the probability of sampling variable x_i is P(x_i) = o(1), when 0 ≤ β ≤ 1, and P(x_i) = i^{−β}/ζ(β) + o(1), when β > 1. The fact that the probability of sampling a variable does not vanish when the number of variables tends to infinity and β > 1 may be troublesome. In particular, the probability of generating clauses with duplicated variables does not vanish, even for constant clause sizes. Similarly, to avoid duplicated variables, we also have to impose an upper bound k = o(n^{min{1/2, 1−β}}) on the sizes of clauses.

Lemma 1. When 0 ≤ β < 1, the sizes of clauses are k = o(n^{min{1/2, 1−β}}) and n tends to infinity, the probability of generating a clause with a duplicated variable tends to zero.
In these conditions, the probability of a random variable and the probability of a random clause in a formula are P(x_i) ≈ (1−β) n^{β−1} i^{−β} and P(C = l_1 ∨ … ∨ l_k) ≈ k! Π_{i=1}^k P(l_i).

Proof: We will use a result known as the surname problem [41], which generalizes the birthday paradox. Let X_1, …, X_k be independent random variables which have an identical discrete distribution P(X = i) = p_i, for i ≥ 1. Let R_k be the coincidence probability that at least two X_j have the same value. Let r_k = 1 − R_k be the non-coincidence probability. Then, r_k may be computed using the recurrence r_0 = 1 and r_k = Σ_{j=1}^k (−1)^{j−1} ((k−1)!/(k−j)!) P_j r_{k−j}, where P_j = Σ_{i≥1} p_i^j. The coincidence probability can be computed as R_1 = 0 and R_k = 1 − r_k. In our case, we face the problem of choosing k independent variables, and we want to compute the probability of getting a duplicated variable, hence a rejected clause. When β < 1, we have P_k = H_{n,βk}/(H_{n,β})^k ≈ ((1−β) n^{β−1})^k (n^{1−βk}/(1−βk) + ζ(βk)). Depending on whether βk is greater or smaller than 1, the first or the second term of P_k will dominate.
In our case, assuming k = O(n^α), and replacing the value of P_j, we get R_k ≤ (k choose 2) P_2 = O(n^{2α−1} + n^{2α+2β−2}). Therefore, it suffices to assume that α < min{1/2, 1 − β} to ensure that R_k = o(1).
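The surname-problem recurrence can be checked numerically (we use a Newton-identity form of the recurrence, equivalent to the one cited from [41]; the function name is ours). It reproduces the classical birthday paradox and shows how much more likely collisions become under a power-law distribution:

```python
from math import factorial

def non_coincidence(probs, k):
    """Probability that k independent draws from `probs` are pairwise
    distinct, via r_0 = 1 and
    r_k = sum_{j=1..k} (-1)^(j-1) * (k-1)!/(k-j)! * P_j * r_{k-j},
    where P_j = sum_i p_i^j (a Newton-identity form of the surname
    problem recurrence)."""
    P = [None] + [sum(p ** j for p in probs) for j in range(1, k + 1)]
    r = [1.0]
    for t in range(1, k + 1):
        r.append(sum((-1) ** (j - 1) * factorial(t - 1) / factorial(t - j)
                     * P[j] * r[t - j] for j in range(1, t + 1)))
    return r[k]

# Classical birthday paradox: 23 people, uniform over 365 days.
R23 = 1 - non_coincidence([1 / 365] * 365, 23)

# Power-law variable probabilities P(i) ~ i^(-beta): collisions are far
# more likely than uniform because mass concentrates on small indices.
beta, n, k = 0.8, 1000, 3
Z = sum(i ** -beta for i in range(1, n + 1))
R_pl = 1 - non_coincidence([i ** -beta / Z for i in range(1, n + 1)], k)
```

For k = 2 the recurrence reduces to r_2 = 1 − Σ p_i², and for k = 3 over three outcomes it gives 3!·p_1 p_2 p_3, as it should.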

Lemma 2. In a scale-free random formula over n variables and m = Cn clauses of size k = O(1), generated with exponent 0 < β < 1, the expected number of occurrences of variable x_i is E[K_i] ≈ C k (1−β) (i/n)^{−β}.

Proof: By Lemma 1 and equation (1), since 0 < β < 1 we have E[K_i] = m k P(x_i) ≈ C n k (1−β) n^{β−1} i^{−β} = C k (1−β) (i/n)^{−β}.

The following theorem ensures that the formulas we get are scale-free, in the sense that the number of occurrences of variables follows a power-law distribution P(K) ∼ K^{−δ}, for big enough values of K.

Theorem 1. In scale-free random formulas over n variables, with m = Cn clauses of size k, and generated with exponent 0 < β < 1, when n tends to ∞ with C and k constants, the probability that a variable has K occurrences, where K = Ω(√(n log n)) or K = Ω((n² log n)^{β/(2+β)}), follows a power-law distribution P(K) ∼ K^{−δ}, where δ = 1/β + 1.
Proof: In the limit when n → ∞, by Lemma 1, P(x_i) ≈ C′ i^{−β} is the probability of sampling a variable x_i, for some constant C′ = 1/Σ_{j=1}^n j^{−β} ≈ (1−β) n^{β−1} that depends on n. Let K_i be the number of occurrences of variable i in a randomly generated formula F. We have E[K_i] = |F| C′ i^{−β}. Chernoff's or Hoeffding's bounds ensure that, under certain conditions that we will consider later, K_i ≈ E[K_i]. Now we want to approximate the probability F(K) = ∫_K^∞ P(k) dk that a variable occurs at least K times. Given a value K, let i be the index of the variable satisfying E[K_i] = K. Under these conditions, all variables with index smaller than i will have more than K occurrences, and those with indexes between i+1 and n will have less than K occurrences. Therefore, F(K) = i/n, for the particular i defined above. From E[K_i] = |F| C′ i^{−β} = K we get i = (|F| C′/K)^{1/β}, hence F(K) = (|F| C′)^{1/β} K^{−1/β} / n. Then, the probability P(K) is P(K) = −dF(K)/dK ∼ K^{−1/β−1}. Hence we obtain a discrete power-law distribution with exponent δ = 1/β + 1.
The problem is that E[K_i] is a good approximation of K_i only when i is small. For instance, when i = Ω(n), we have P(x_i) = Θ(n^{−1}) and E[K_i] = Θ(1). In this situation, when n → ∞ with C and k constants, the number of occurrences K_i of the variable x_i follows a Poisson distribution with constant variance. This means that, even in the limit n → ∞, we cannot assume that i < j implies K_i > K_j, when i = Ω(n). In the following we will find an upper bound for the index i of the variable (a lower bound for the value of K) ensuring that E[K_i] is a good approximation of K_i, when n → ∞. We will use both Hoeffding's and Chernoff's bounds.
In what follows, let C be the constant such that |F| ≈ C n is the size of the formula.
Hoeffding's bound states that, if X = X_1 + … + X_m is the sum of identical and independent Bernoulli variables, then P(|X − E[X]| ≥ t) ≤ 2 e^{−2t²/m}. Given a value of K, let us fix two variables i and j such that E[K_i] = K + √(|F| ln n) and E[K_j] = K − √(|F| ln n). We have P(K_j ≥ K) ≤ 2/n², and the same bound holds for all variables r with bigger indexes. We have already argued that F(K) = P(k ≥ K) ≈ i/n. Using j, we have the strict bound F(K) ≤ j/n + O(n^{−1}). By Lemma 2, replacing the expressions for i and j, we get that both coincide asymptotically, and F(K) ≈ i/n, whenever K = Ω(√(n log n)). Similarly, we can prove the same lower bound using Chernoff's bound, P(|X − E[X]| ≥ δ E[X]) ≤ 2 e^{−δ² E[X]/3}, where X is the sum of independent random variables in the range [0, 1]. In order to ensure that the K_i's are sorted, we require that, in the limit n → ∞, we have P(K_i < K_{i+1}) = O(n^{−1}). We take the value of δ that satisfies E[K_{i+1}] = (1 − δ) E[K_i]. By Lemma 2 and the Taylor expansion (1 + 1/i)^{−β} ≈ 1 − β/i, this gives δ ≈ β/i. And, for this value of δ, we impose 2 e^{−δ² E[K_i]/3} = O(n^{−1}). From this, we get the minimum value of i for which P(K_i < K_{i+1}) = O(n^{−1}). The value of K = E[K_i] corresponding to this variable x_i gives us a value from which on we can expect to observe the power-law distribution in P(K).
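As a quick numerical sanity check of Lemma 2, the expectation E[K_i] = m k P(x_i), with P(x_i) = i^{−β}/Σ_j j^{−β}, can be compared against Monte Carlo counts (the function name is ours; we skip the duplicate-variable rejection, which for k ≪ n changes the counts only marginally):

```python
import random

def occurrence_counts(n, m, k, beta, seed=0):
    """Count variable occurrences in a random scale-free formula,
    sampling each of the m*k variable slots independently with
    P(i) ~ i^(-beta). Clause rejection is skipped in this sketch."""
    rng = random.Random(seed)
    weights = [i ** -beta for i in range(1, n + 1)]
    counts = [0] * (n + 1)
    for v in rng.choices(range(1, n + 1), weights=weights, k=m * k):
        counts[v] += 1
    return counts, sum(weights)

n, m, k, beta = 1000, 50000, 3, 0.5
counts, Z = occurrence_counts(n, m, k, beta)
expected_1 = m * k / Z      # E[K_1] = m*k * 1^(-beta) / Z, per Lemma 2
```

With these parameters the empirical count for x_1 lands within a few percent of the expectation, while the tail variables occur orders of magnitude less often, as the (i/n)^{−β} scaling predicts.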

Implementation of the Generator
The generation method is formalized in Algorithm 1.

Algorithm 1: Scale-free random k-SAT formula generator.
Input: n, m, k, β. Output: a k-SAT instance with n variables and m clauses.
1 F = ∅;
2 repeat m times:
3   repeat
4     C = empty clause;
5     repeat k times:
6       x = sampleVariable(β, n);
7       negate x with probability 1/2;
8       add the resulting literal to C;
9   until C contains no repeated variables;
10  F = F ∪ {C};
11 return F;

The function sampleVariable(β, n) may be implemented as follows. We can compute a vector p such that p[i] = Σ_{j=1}^i j^{−β} / Σ_{j=1}^n j^{−β} at the beginning of the algorithm. Then, every time we call sampleVariable, we compute a random number r uniformly distributed in [0, 1) and, using a dichotomic search, look for the smallest i such that p[i] > r, and return such i.
Alternatively, if n is big, we can use the following approximate algorithm. If we want to generate numbers x with probability density f(x), we can integrate F(x) = ∫ f(x) dx, find the inverse function, and compute F^{−1}(y), where y is a uniformly random number in [0, 1]. Our probability function is discrete. However, when 0 < β < 1, and both X → ∞ and n → ∞, we can approximate its cumulative distribution as P(x ≤ X) = H_{X,β}/H_{n,β} ≈ (X/n)^{1−β}. Therefore, computing the inverse, sampleVariable may be computed as ⌈n Y^{1/(1−β)}⌉, where Y is a uniform random variable in [0, 1). This way, avoiding the use of the vector p and the dichotomic search, we save an O(log n) factor in the time-complexity and an O(n) factor in the space-complexity of the generator.
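A minimal Python sketch of Algorithm 1, with the exact prefix-sum sampler and the approximate inverse-transform sampler (function names are ours; the approximation ignores the ζ(β) correction to H_{n,β}, so it is only accurate for large n):

```python
import bisect
import random

def scale_free_formula(n, m, k, beta, seed=None):
    """Algorithm 1 sketch: m clauses of k distinct variables, each
    variable i drawn with P(i) ~ i^(-beta), each literal negated with
    probability 1/2; clauses with repeated variables are resampled."""
    rng = random.Random(seed)
    # Exact sampler: prefix sums of i^(-beta) plus a dichotomic search.
    prefix, acc = [], 0.0
    for i in range(1, n + 1):
        acc += i ** -beta
        prefix.append(acc)

    def sample_variable():
        r = rng.random() * prefix[-1]
        return bisect.bisect_right(prefix, r) + 1  # first i with p[i] > r

    formula = []
    while len(formula) < m:
        variables = [sample_variable() for _ in range(k)]
        if len(set(variables)) < k:
            continue                               # repeated variable: reject
        formula.append([v if rng.random() < 0.5 else -v for v in variables])
    return formula

def sample_variable_approx(n, beta, rng):
    """Inverse-transform approximation for large n: from
    P(x <= X) ~ (X/n)^(1-beta), invert to X = ceil(n * Y^(1/(1-beta)))."""
    y = rng.random()
    return min(n, max(1, int(n * y ** (1.0 / (1.0 - beta))) + 1))
```

The exact sampler costs O(log n) per call after O(n) preprocessing; the approximate one is O(1) per call with no table, which is the saving described above.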

Industrial SAT Instances
In the previous section we have defined scale-free random SAT instances. We want this model to generate formulas as close as possible to industrial ones. Therefore, we want to compute the value of β that best fits industrial instances. For this purpose we have studied the 100 benchmarks (all industrial) used in the SAT Race 2008. All together, they contain n = 25693792 variables, with a total of Σ_{i=1}^n K_i = 349760681 occurrences. Therefore, the average number of occurrences per variable is E[K_i] = Σ_{i=1}^n K_i / n = 13.6. If we used the classical (uniform) random model to generate instances with this average number of occurrences, most of the variables would have a number of occurrences very close to 13.6. However, in the analyzed industrial instances, close to 90% of the variables have less than this number of occurrences, and more than 60% have 6 or fewer occurrences. The big value of the average is produced by a small fraction of the variables that have a huge number of occurrences. This indicates that the number of occurrences could be better modeled with a power-law distribution. This was already suggested by Boufkhad et al. [19].
In order to check whether those industrial instances (all together) are scale-free SAT formulas, and to estimate the value of β, we compute the number of occurrences of each variable of each industrial instance. Then, we rename the indexes of the variables such that K_i ≥ K_{i+1}, for i = 1, …, n − 1. Now, before comparing K_i with i^{−β}/Σ_{j=1}^n j^{−β}, we renormalize both functions such that both are defined in [0, 1] and their integral in this range is 1. Hence, we define, for the empirical K_i, the empirical function φ_ind(i/n) = n K_i / Σ_{j=1}^n K_j, and, for the theoretical probability P(i), the theoretical function φ(x; β, n) = n P(⌈x n⌉) ≈ (1 − β) x^{−β}. In Figure 1 we represent both functions with normal axes, and with double-logarithmic axes. Notice that in double-logarithmic axes, the slope of φ_ind(x) allows us to estimate the value β = 0.82.
Theorem 1 allows us to ensure that the distribution of frequencies on the number of occurrences of variables follows a power-law distribution, with exponent δ = 1/0.82 + 1 = 2.22.
Finally, we have generated a scale-free random 3-SAT formula with n = 10^7 variables, m = 2.5 · 10^7 clauses and β = 0.82. In Figure 2, we show the frequencies of occurrences of variables of this formula and compare them with those obtained for the SAT Race 2008, and with a line of slope δ = 1/0.82 + 1 = 2.22.
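The slope fit used to estimate β = 0.82 can be sketched as an ordinary least-squares regression in log-log coordinates (the function name is ours; this is a simple stand-in for the graphical fit against Figure 1, demonstrated here on exact model values rather than the SAT Race data):

```python
from math import log

def fit_beta(K):
    """Estimate beta as minus the least-squares slope of log K_i against
    log i, with K sorted in non-increasing order."""
    xs = [log(i + 1) for i in range(len(K))]
    ys = [log(k) for k in K]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# Sanity check on exact model values K_i = C * i^(-beta): the fit must
# recover beta (0.82, the value estimated for the SAT Race 2008 set).
beta = 0.82
K = [13.6 * (i + 1) ** -beta for i in range(10000)]
```

On exact power-law data the regression recovers the exponent to machine precision; on real occurrence counts one would restrict the fit to the range of indexes where the power law holds (cf. Theorem 1).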

Phase Transition in Scale-Free Random 2-SAT Formulas
Chvátal and Reed [22] proved that a random formula with (1 + o(1))cn clauses of size 2 over n variables, is satisfiable with probability 1 − o(1), when c < 1, and unsatisfiable with probability 1 − o(1), when c > 1, where o(1) represents a quantity tending to zero as n tends to infinity.
As we will see in this section, a similar result for scale-free random 2-SAT formulas can be obtained using percolation and mean-field techniques.
Percolation theory describes the behavior of connected components in a graph when we remove edges randomly. Erdös and Rényi [27] are considered the initiators of this theory. In this seminal paper on graph theory, they proposed a random graph model G(n, m) where all graphs with n nodes and m edges are selected with the same probability. Gilbert [35] proposed a similar model G(n, p) where n is also the number of nodes, and every one of the (n choose 2) possible edges is selected with probability p. For not very sparse graphs (when p (n choose 2) → ∞), both models have basically the same properties, taking m = (n choose 2) p. Erdös and Rényi [28] also studied the connectivity of these graphs and proved that:
• when n p < 1, i.e. m < n/2, a random graph almost surely has no connected component larger than O(log n),
• when n p = 1, i.e. m = n/2, a largest component of size n^{2/3} almost surely emerges, and
• when n p > 1, i.e. m > n/2, the graph almost surely contains a unique giant component with a fraction of the nodes, and no other component contains more than O(log n) nodes.
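The emergence of the giant component at m = n/2 is easy to observe numerically with a union-find structure (the function name is ours; sampling edge endpoints independently with replacement is a simplification of G(n, m) that does not affect the percolation behavior):

```python
import random

def largest_component(n, m, seed=0):
    """Size of the largest connected component of a random graph with
    n nodes and m random edges, via union-find with path halving."""
    rng = random.Random(seed)
    parent = list(range(n))
    size = [1] * n

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    for _ in range(m):
        a, b = find(rng.randrange(n)), find(rng.randrange(n))
        if a != b:
            if size[a] < size[b]:
                a, b = b, a
            parent[b] = a                   # union by size
            size[a] += size[b]
    return max(size[find(v)] for v in range(n))

n = 20000
sub = largest_component(n, int(0.3 * n))   # below the m = n/2 threshold
sup = largest_component(n, int(0.8 * n))   # above it: giant component
```

Below the threshold the largest component stays logarithmic in n; above it, a positive fraction of the nodes (here over half) joins a single component.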
Phase transition is a phenomenon that has been observed and studied in many AI problems. Many problems have an order parameter that separates a region of solvable and unsolvable problems, and it has been observed that hard problems occur at critical values of this parameter. Mitchell et al. [42] found this phenomenon in 3-SAT when the ratio between the number of clauses and variables is m/n ≈ 4.3. Gent and Walsh [34] observed the same phenomenon with clauses of mixed length.
There is a close relationship between SAT problems and graphs. Both percolation on graphs and phase transitions in SAT (or other AI problems) are critical phenomena, and both can be studied using mean-field techniques from statistical mechanics. Percolation theory has been used in and has inspired works in the literature on random SAT and the satisfiability threshold, e.g. in Achlioptas et al. [1] to determine the satisfiability threshold of 1-in-k SAT and NAE 3-SAT formulas. Some results on graphs have been previously extended to 2-SAT. For instance, Sinclair and Vilenchik [49] adapted Achlioptas processes for graphs into formulas. Bollobás et al. [18] investigated the scaling window of the 2-SAT phase transition, finding the critical exponent of the order parameter and proving that the transition is continuous, adapting results of Bollobás [16] for Erdös-Rényi graphs. The relationship between percolation in random graphs and the phase transition in random 2-SAT formulas is suggested in many other works. For instance, Monasson et al. [44], when studying the phase transition in 2 + p-SAT (a mixture of (1 − p)m clauses of size 2 and pm clauses of size 3), already mention that "It is likely that the 2SAT transition results from percolation of these loops...". Cooper et al. [24] use the emergence of a giant component in a graph to prove the existence of a phase transition in 2-SAT random formulas with prescribed degrees, using the configuration model. They find, for this model, the same criterion as Friedrich et al. [31] and us in Theorem 2.
Given a random 2-SAT formula with m clauses over n variables, we can construct an Erdös-Rényi graph where the 2 n literals are nodes, and the m clauses are edges. At the percolation point m = (2 n)/2 of this graph a giant component emerges. Just at the same point m = n the 2-SAT phase transition threshold is located. However, despite the coincidence in the point, the relation between both facts is not direct: a giant component in the graph is not the same as a giant (hence, unsatisfiable) loop of implications in the SAT formula. The connection between two edges a ↔ b and b ↔ c in the graph is given by a common node (literal) b. Whereas, in the SAT formula, the resolution between a ∨ b and ¬b ∨ c is through a variable b that is affirmed in one clause and negated in the other. In this section, we elaborate on the relation of giant components in graphs and unsatisfiability proofs in 2-SAT formulas.

A Criterion for Phase Transition in 2-SAT
Unsatisfiability proofs of 2-SAT formulas are characterized by bicycles. Let F be a 2-SAT formula. Any sequence of literals x_1, …, x_s satisfying ¬x_i ∨ x_{i+1} ∈ F, for every i = 1, …, s − 1, is called an implication sequence. We say that y implies y′ if there exists an implication sequence of the form y, x_1, …, x_s, y′. Any implication sequence of the form x_1, …, x_s, x_1 is called a cycle. A bicycle is a cycle x_1, …, x_s, x_1 such that there exists a variable a satisfying {a, ¬a} ⊆ {x_1, …, x_s}.
A 2-SAT formula is unsatisfiable if, and only if, it contains a bicycle [12,22]. We will also consider random graphs with n nodes and m edges, and connected components, defined as subsets of nodes such that any pair of them is connected by a path inside the component. A random graph of size n is said to contain a giant connected component if, almost surely, it contains a connected component with a positive fraction of the nodes. Given a model of random graphs, we say that c is the percolation threshold if any random graph with n nodes and more than c n edges almost surely contains a giant component. In a random graph, the degree of a node x, noted k_x, is a random variable. The random variable k represents the degree of a random node chosen with uniform probability. As we commented above, we can represent any 2-SAT formula as a graph where nodes are literals, and clauses a ∨ b are edges between literals a and b. In classical 2-SAT random formulas, since literals are chosen independently with uniform probability, the generated graph will be an Erdös-Rényi graph following the model G(2n, m). However, a connected component in the graph is not necessarily an unsatisfiability proof of the formula.
First, in a random SAT formula we may have repeated clauses, which means that from m clauses we will obtain fewer than m edges. However, for a linear number of clauses, when β < 1/2, there are (1 − o(1)) m distinct clauses or edges. In the classical case, in the limit n → ∞, with a linear number of clauses m = O(n) and a quadratic number of possible clauses, the probability of any clause is O(n^{−2}), and the probability of it being repeated is m O(n^{−2}) = O(n^{−1}). Therefore, the fraction of repeated clauses is negligible. For scale-free 2-CNF formulas, in Theorem 5, we will see that, if β < 1/2, then clauses have probability o(n^{−1}). Precisely, the most probable 2-CNF clause is x_1 ∨ x_2. This means that, after generating O(n) clauses, the probability that a newly generated clause has already been generated previously is bounded by O(n) P(x_1 ∨ x_2) = O(n^{2β−1}) = o(1). This probability bounds the fraction of repeated clauses, which is negligible when β < 1/2. Second, graph connected components and cycles are not the same structure. Therefore, the existence of a giant connected component and the existence of a giant cycle are independent facts.
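The bicycle characterization has a well-known algorithmic form: a 2-SAT formula is unsatisfiable iff some variable and its negation lie in the same strongly connected component of the implication graph, which can be checked in linear time (the function below is our sketch of this standard SCC test, using Kosaraju's algorithm):

```python
def sat_2sat(clauses, n):
    """2-SAT via the implication graph: each clause (a or b) yields edges
    ~a -> b and ~b -> a; the formula is unsatisfiable iff some x and ~x
    share a strongly connected component (equivalent to containing a
    bicycle). Literal v in 1..n maps to node 2v-2, its negation to 2v-1."""
    N = 2 * n
    g = [[] for _ in range(N)]
    rg = [[] for _ in range(N)]
    node = lambda lit: 2 * (abs(lit) - 1) + (1 if lit < 0 else 0)
    neg = lambda u: u ^ 1
    for a, b in clauses:
        for u, v in ((neg(node(a)), node(b)), (neg(node(b)), node(a))):
            g[u].append(v)
            rg[v].append(u)
    # Kosaraju: finish order on g, then label SCCs on the reverse graph.
    seen, order = [False] * N, []
    for s in range(N):
        if seen[s]:
            continue
        stack, seen[s] = [(s, 0)], True
        while stack:
            u, i = stack.pop()
            if i < len(g[u]):
                stack.append((u, i + 1))
                if not seen[g[u][i]]:
                    seen[g[u][i]] = True
                    stack.append((g[u][i], 0))
            else:
                order.append(u)
    comp = [-1] * N
    for s in reversed(order):
        if comp[s] != -1:
            continue
        stack, comp[s] = [s], s
        while stack:
            u = stack.pop()
            for v in rg[u]:
                if comp[v] == -1:
                    comp[v] = s
                    stack.append(v)
    return all(comp[2 * v] != comp[2 * v + 1] for v in range(n))
```

Clauses use the signed-integer convention of Section 3: (1, -2) stands for x_1 ∨ ¬x_2.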
Molloy and Reed [43] and Cohen et al. [23] have studied the existence of giant components in random graphs with heterogeneous and fixed node degrees. Molloy and Reed [43] prove that the critical point is at $\sum_{i \geq 1} i(i-2)\,\lambda_i = 0$, where $\lambda_i$ is the fraction of nodes with degree $i$. Cohen et al. [23] independently prove (in a much more informal way) that the critical point is characterized by $E[k^2]/E[k] = 2$, where $k$ is the degree of a random node and $E$ denotes expectation. It is easy to see that both criteria are exactly the same. Interestingly, the criterion depends not only on the expected degree of the nodes, but also on their expected squared degree, hence on the variability of node degrees. This variability plays an important role in the location of the percolation threshold. For instance, in the Erdős-Rényi model the percolation threshold is located at $m/n = 1/2$, hence the expected number of edges per node is $1/2$. However, the number of edges per node inside a connected component of size $r$ is at least $(r-1)/r \approx 1$. This discrepancy is only possible if the variability in node degrees is high. This also explains why, in regular random formulas, where we impose that variables occur exactly the same number of times (instead of the same average number of times), we get distinct phase transition thresholds.
Cohen et al. [23] start by assuming that loops of connected nodes may be neglected. In this situation, the percolation transition takes place when a node $i$, connected to a node $j$ in the connected component, is also connected on average to at least one other node, i.e. when $E[k_i \mid i \leftrightarrow j] = \sum_{k_i} k_i\, P(k_i \mid i \leftrightarrow j) = 2$.
Molloy and Reed [43] give a more detailed proof, which we summarize here. Given the list of fixed degrees $k_i$ of every node, they describe a random algorithm that constructs (exposes) all graphs compatible with these degrees with the same probability, exposing connected components one by one. Let $c_i$ be the degree of node $i$ in the partially exposed graph. Initially, set $c_i = 0$ for every node. Then, until $c_i = k_i$ for all nodes, repeat the following actions. If, for some node $i$, we have $0 < c_i < k_i$, then (case A) select it; otherwise, (case B) choose freely a node $i$ such that $c_i = 0$. Then, in both cases, choose another node $j \neq i$ with probability $P(j) \sim k_j - c_j$. Expose the edge $i \leftrightarrow j$, and increase $c_i$ and $c_j$. Notice that every time we execute case B, we start the exposition of a new connected component of the graph.
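The exposure process can be simulated directly. The following sketch is our own illustration of the case A / case B rule (names and the deterministic tie-breaking among open nodes are ours, and multi-edges are not excluded): it exposes edges one at a time and records the size of each connected component as it is completed.

```python
import random

def expose_components(degrees, seed=0):
    """Simulate the Molloy-Reed exposure process for a fixed degree
    sequence. Case A: continue from a node with some, but not all, of its
    stubs exposed. Case B: start a new component at a fresh node. In both
    cases attach to a node j != i chosen with probability ~ k_j - c_j.
    Returns the list of exposed component sizes (illustrative sketch)."""
    rng = random.Random(seed)
    n = len(degrees)
    c = [0] * n                 # c[i]: exposed degree of node i
    comp_of = [None] * n        # component index of each touched node
    sizes = []
    while True:
        open_nodes = [i for i in range(n) if 0 < c[i] < degrees[i]]
        if open_nodes:                          # case A
            i = open_nodes[0]
        else:                                   # case B: new component
            fresh = [i for i in range(n) if c[i] == 0 and degrees[i] > 0]
            if not fresh:
                break
            i = fresh[0]
            sizes.append(1)
            comp_of[i] = len(sizes) - 1
        cands = [j for j in range(n) if j != i and c[j] < degrees[j]]
        if not cands:
            break                               # leftover unmatched stubs
        weights = [degrees[j] - c[j] for j in cands]
        j = rng.choices(cands, weights=weights)[0]
        if comp_of[j] is None:                  # j joins i's component
            comp_of[j] = comp_of[i]
            sizes[comp_of[i]] += 1
        c[i] += 1
        c[j] += 1
    return sizes
```

With all degrees equal to 1 the process exposes a perfect matching (every component has size 2), while with all degrees equal to 3 we have $Q(\lambda) = 3 \cdot 1 \cdot 1 > 0$ and large components appear.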
Let $X_r$ be the random variable representing the number of open stubs in partially exposed nodes, i.e. $X_r = \sum_{c_i > 0} (k_i - c_i)$, after the $r$th edge $i \leftrightarrow j$ has been exposed. Notice that we execute case B when $X_{r-1} = 0$, and then we get $X_r = k_i + k_j - 2$. When we execute case A, there are two situations: (case A1) if node $j$ is a partially exposed node (i.e. $0 < c_j$), then $X_r = X_{r-1} - 2$; and (case A2) if node $j$ has never been exposed (i.e. $c_j = 0$), then $X_r = X_{r-1} + k_j - 2$.
Suppose that cases B and A1 do not happen very often. Then, the expected change in $X_r$ is $E[X_r - X_{r-1}] = \frac{\sum_i i(i-2)\,\lambda_i}{\sum_i i\,\lambda_i} = \frac{Q(\lambda)}{E[k]}$, where $Q(\lambda) = \sum_i i(i-2)\,\lambda_i$, and, since the steps of $X_r$ are bounded from below, a standard result of random walk theory ensures that if $Q(\lambda) > 0$ then, after $\Theta(n)$ steps, $X_r$ is almost surely of order $\Theta(n)$; and if $Q(\lambda) < 0$, then $X_r$ returns to zero fairly quickly. In the first case, we generate a giant connected component of size $\Theta(n)$; in the second case, no component is larger than $O(\log n)$. In order to prove that executions of case A1 do no harm, Molloy and Reed prove that the probability of choosing a partially exposed node (a node with $c_j > 0$) is negligible unless we have already exposed a fraction $\Theta(n)$ of the nodes in the current connected component. Theorems 2 and 3 establish a similar criterion for the existence of a giant set of literals implied by a given one; this almost surely implies unsatisfiability of the formula. The proofs of these theorems resemble Molloy and Reed's and Cohen et al.'s proofs. In Theorem 2 we fix the number of occurrences of every literal, whereas in Theorem 3 we fix the number of occurrences of every variable. Compared with the definition of $Q$ in Molloy and Reed's proof, we observe that in Theorem 3 the 2 is replaced by a 3. In Theorem 2, we combine the number of occurrences $k_i$ of a literal with the number of occurrences $k_{-i}$ of its negation, and the constant is a 1 instead of a 2. Notice that the condition in this case is equal to the condition found by Cooper et al. [24] for the configuration method and prescribed literal degrees. Proof: The proof resembles Molloy and Reed's proof of the percolation threshold on graphs. That proof is quite long, and ours does not differ very much from it, so we only sketch it. In our case, we do not deal with connected components. In fact, we do not expose the random formula with our algorithm: we assume that we already have the formula, and we describe in Algorithm 2 how to enumerate the set of literals implied by a given initial literal $x$.
The Boolean variable $o_y$ denotes whether literal $y$ has been reached from the initial literal $x$, and the counter $c_y$ denotes the number of clauses containing $y$ that we have already removed from the formula. Therefore, $k_y - c_y$ is the number of clauses containing $y$ that still remain in $F$. When $x$ implies $y$ and $\neg y$, for some variable $y$, we say that $x$ implies a contradiction; in this case, $x$ also implies $\neg x$. The algorithm returns the set of literals implied by $x$, or a contradiction (in this second case we abort, since we already have $x \rightarrow \neg x$, which is what we want to check). Notice also that $c_y > 0$ implies $o_y = true \vee o_{\neg y} = true$. The algorithm is deterministic if we consider the formula given. However, for a random formula, the algorithm performs exactly the same steps and can be seen as a random algorithm. Similarly, we can define the random variable $X_r = \sum_{o_{\neg x} = true} (k_x - c_x)$ after iteration $r$. Notice that line 7 decreases $X_r$ by 1; line 8 decreases $X_r$ by 1, when $o_{\neg z} = true$; and line 6 increases $X_r$ by $k_{\neg z} - c_{\neg z}$, when $o_z = false$. However, if both $o_z$ and $o_{\neg z}$ are false, then $c_{\neg z}$ is zero. After case C, we get a contradiction and finish. In case B, the expected gain in $X_r$ is proportional to $\sum_z k_z (k_{\neg z} - 1)$. In case A, the random variable only decreases by one. As in Molloy and Reed's proof, we can argue that case A is negligible unless we have already added to the set of implied literals a constant fraction of them.
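The propagation performed by Algorithm 2 amounts to a reachability computation in the implication graph, where a clause $a \vee b$ contributes the implications $\neg a \rightarrow b$ and $\neg b \rightarrow a$. The following is our own simplified rendering of that idea, not the paper's exact pseudocode (in particular it does not maintain the counters $c_y$ used in the random-walk analysis):

```python
def implied_literals(clauses, x):
    """Enumerate the literals implied by literal x in a 2-CNF formula.
    A clause (a, b) yields implications -a -> b and -b -> a; we follow
    them from x. Returns (implied_set, contradiction_found), where a
    contradiction means x implies both y and -y for some variable y."""
    succ = {}
    for a, b in clauses:
        succ.setdefault(-a, []).append(b)
        succ.setdefault(-b, []).append(a)
    implied, stack = {x}, [x]
    while stack:
        y = stack.pop()
        for z in succ.get(y, []):
            if -z in implied:          # x implies z and -z: contradiction
                return implied, True
            if z not in implied:
                implied.add(z)
                stack.append(z)
    return implied, False
```

For example, from the clauses $(\neg x_1 \vee x_2)$ and $(\neg x_2 \vee x_3)$, the literal $x_1$ implies $x_2$ and $x_3$; adding $(\neg x_2 \vee \neg x_1)$ instead makes $x_1$ imply $\neg x_1$, i.e. a contradiction.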
Therefore, reproducing all the lemmas of Molloy and Reed's proof, we can conclude that, when $\sum_z k_z (k_{\neg z} - 1) > 0$, almost surely there exists a constant $0 < c < 1$ such that, for a fraction $c$ of initial literals $x$, the set of literals implied by $x$ is a fraction $c$ of all literals or contains a contradiction, and hence $x$ implies $\neg x$. For a particular variable $x$, the probability that $x$ implies $\neg x$ and $\neg x$ implies $x$ is at least $c^4$. The probability that the formula is satisfiable is then at most $(1 - c^4)^n$, which tends exponentially to zero as $n$ tends to infinity. Theorem 3. Let $F$ be a 2-CNF formula generated in a random model with variables $\{x_1, \ldots, x_n\}$, where every variable $x_i$ is selected with probability $P_i$ and negated with probability $1/2$, and variables in clauses are not correlated. Assume that $P_i = o(1)$ and $m = O(n)$. Let $K_i = 2\,m\,P_i$ be the expected number of occurrences of variable $x_i$. If $E[K^2]/E[K] > 3$, then $F$ is unsatisfiable with probability $1 - o(1)$. Proof: The proof is, as in Theorem 2, based on Molloy and Reed's [43] proof. In this case, however, the expected gain in the random variable $X_r$ is proportional to $\sum_i \frac{K_i}{2}\left(\frac{K_i - 1}{2} - 1\right)$, since $K_i/2$ is the expected value of $k_i$ and $(K_i - 1)/2$ is the expected value of $k_{-i}$ conditioned on the existence of one positive occurrence of $x_i$. Then, the condition $E[X_r - X_{r-1}] > 0$ is equivalent to $E[K^2]/E[K] > 3$. For the proof of Theorem 3, we could also use Cohen et al.'s [23] argument. In the case of graphs, we get a giant connected component when a node $i$, connected to a node $j$, is also connected on average to at least one other node; formally, when the expected degree of $i$, conditioned on the fact that $i$ and $j$ are connected, is $E[k_i \mid i \leftrightarrow j] = 2$.
In our case, in order for a giant bicycle to emerge, when there is a clause $x \vee y$ we have to find at least another clause containing $\neg x$. In this situation, the expected number of other clauses containing the variable $x$ is 2, which, added to the original clause $x \vee y$, gives a minimum of 3 clauses containing $x$. Given a pair of literals $x$ and $y$, let $\pm x \vee y$ express the fact "$x \vee y \in F$ or $\neg x \vee y \in F$". Formally, our criterion can be written as $E[K_x \mid \pm x \vee y] = 3$. This criterion is the necessary and sufficient condition to continue the construction of a set of clauses, ensuring that the probability that this set contains a fraction of the literals tends to one.
Using Bayes' rule, we have $P(K_x = k \mid \pm x \vee y) = \frac{P(\pm x \vee y \mid K_x = k)\, P(K_x = k)}{P(\pm x \vee y)}$. Given a pair of literals $x$ and $y$, the probability that either $x \vee y$ or $\neg x \vee y$ is one of the clauses of the formula, conditioned on the number of occurrences of variable $x$ being $k$ (and assuming that clauses are not repeated), is $P(\pm x \vee y \mid K_x = k) = \frac{k}{2(n-1)}$, and the probability of the same fact without the condition is $P(\pm x \vee y) = \frac{E[K]}{2(n-1)}$. Therefore, $E[K_x \mid \pm x \vee y] = \frac{E[K^2]}{E[K]} = 3$ defines an unsatisfiability threshold. The previous theorems ensure that, when the criterion is satisfied, there is a giant bicycle containing a fraction of the literals, and the formula is unsatisfiable. However, if the formula is unsatisfiable, it can be due to a small bicycle; therefore, the reverse implication is not necessarily true. In other words, Theorems 2 and 3 establish a sufficient (but not necessary) condition for unsatisfiability of random 2-SAT formulas, which results in an upper bound for the phase transition point. However, we conjecture that either giant bicycles are more probable than small bicycles and the percolation threshold (obtained with the criterion) is equal to the phase transition point, or, if small bicycles are more probable, the phase transition point is at $c = 0$.

Classical 2-SAT Formulas
Theorems 2 and 3 may be used to find the phase transition point in terms of the clause/variable ratio. In this subsection, we apply the technique to (classical) random 2-SAT formulas.
We start with a formula (or graph), not necessarily at the critical threshold. Then, we apply a percolation process where a fraction $1 - p$ of randomly selected clauses (edges) are removed, so that the remaining fraction $p$ of edges is at the critical threshold. If we start with the complete formula with all $2^2 \binom{n}{2}$ possible clauses over $n$ variables and remove clauses with uniform probability, this process generates a (classical) random 2-SAT formula at the SAT-UNSAT transition point (except for the lack of repeated clauses).
If $k'_x$ is the number of occurrences of literal $x$ in the original graph then, after removing the $(1-p)$ fraction, the new distribution of the number of occurrences is $P(k_x = k) = \binom{k'_x}{k}\, p^k (1-p)^{k'_x - k}$, for any literal $x$. Since $K_x = k_x + k_{\neg x}$, and $k_x$ and $k_{\neg x}$ are independent variables with the same distribution, $K_x$ follows a binomial distribution with parameters $k'_x + k'_{\neg x}$ and $p$. If we impose the criterion of Theorem 3 on this new formula, we get $\frac{E[K^2]}{E[K]} = 1 - p + (k'_x + k'_{\neg x})\,p = 3$. For the complete formula we have $k'_x = 2(n-1)$ for any literal, therefore $p = 1/(2n - 5/2)$. The expected number of clauses at the phase transition threshold is then $m = p \cdot 2n(n-1) = \frac{2n(n-1)}{2n - 5/2} = n\,(1 + o(1))$. This proves that the clause/variable ratio at the 2-SAT phase transition threshold is at most $m/n = 1$, reproducing the results of Chvátal and Reed [22].
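As a quick sanity check on this arithmetic, the computation can be reproduced numerically (the helper name is ours; this is an illustration of the percolation calculation, not code from the paper):

```python
def classical_2sat_threshold(n):
    """Percolation computation for classical 2-SAT: starting from the
    complete formula, each literal occurs k'_x = 2(n-1) times, so after
    keeping each clause with probability p, K_x is Binomial(4(n-1), p).
    The criterion E[K^2]/E[K] = 1 - p + 4(n-1)p = 3 gives p = 2/(4n-5).
    Returns (p, expected clause/variable ratio m/n)."""
    p = 2.0 / (4 * n - 5)
    m = p * 2 * n * (n - 1)   # expected number of surviving clauses
    return p, m / n
```

For large $n$, the ratio $m/n = 4(n-1)/(4n-5)$ tends to 1, in agreement with the threshold above.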
Now, consider the case of (classical) regular random 2-SAT formulas. These are random formulas where the number of occurrences of a literal minus the number of occurrences of any other literal is at most one. Assume that all literals have exactly the same number of occurrences $k_x = m/n$. Applying Theorem 2, without any need of a percolation process, we get $\sum_{i=-n}^{n} k_i (k_{-i} - 1) = 2n\, \frac{m}{n}\left(\frac{m}{n} - 1\right) = 0$. Therefore, $m/n = 1$ is an upper bound for the phase transition point, reproducing the results of Boufkhad et al. [20]. Notice that, in this case, the conditions of Theorem 3 are not fulfilled: $k_x$ and $k_{\neg x}$ are not independent random variables. Looking at the proof of that theorem, since in a regular random formula $k_x = k_{\neg x}$, if the formula contains a clause $x \vee y$, we only need to require $E[K_x \mid x \vee y] = 2$ in order to ensure that there is another clause containing $\neg x$. With this new criterion, and reproducing the proof of Theorem 3, we obtain that the threshold in a regular random formula is at $\frac{E[K^2]}{E[K]} = 2$.
In the first statement of the following theorem we prove that, when the clause/variable ratio exceeds this value, formulas are almost surely unsatisfiable. Theorem 4. (1) Scale-free random 2-SAT formulas over $n$ variables, with exponent $\beta < 1/2$ and more than $\frac{1-2\beta}{(1-\beta)^2}\, n$ clauses, are unsatisfiable with probability $1 - o(1)$. (2) Scale-free random 2-SAT formulas over $n$ variables, with exponent $\beta = 1/2$ and more than $4n \log^{-1} n + O(n^{1/2} \log^{-1} n)$ distinct clauses, or with exponent $1/2 < \beta < 1$ and more than $\frac{1}{(1-\beta)^2\, \zeta(2\beta)}\, n^{2(1-\beta)}$ distinct clauses, are unsatisfiable with probability $1 - o(1)$. (In their paper, they write $\beta$ instead of $\delta$, but we prefer to use $\beta$ with the same meaning as in [7].)
Proof: In the case of scale-free formulas, we cannot start the percolation process from the complete formula, since the uniform-random deletion of clauses does not give rise to scale-free formulas. Therefore, we simply impose the criterion on the original formula. We will do all the computations using the number of occurrences $K_x$ of variables, instead of the number of occurrences $k_x$ of literals, and apply Theorem 3.
Since $\beta < 1$, by Lemma 1, repetitions of variables in clauses may be neglected, and the probability that a particular literal of the formula corresponds to variable $x$ is given by $P_x = \frac{x^{-\beta}}{\sum_{i=1}^{n} i^{-\beta}}$. Since the choice of the variable for every possible literal of the formula is independent, the number of occurrences of $x$ follows a binomial distribution $P(K_x = k) = \binom{2m}{k}\, P_x^k (1 - P_x)^{2m-k}$. In the limit $m \to \infty$, this distribution approaches a Poisson distribution with mean $\lambda_x = 2\,m\,P_x$. Recall that in scale-free formulas $K_x$ follows a distinct probability distribution for every variable $x$; therefore, we have to average over all variables: $E[K] = \frac{1}{n}\sum_{x=1}^{n} \lambda_x = \frac{2m}{n}$ and $E[K^2] = \frac{1}{n}\sum_{x=1}^{n} (\lambda_x^2 + \lambda_x)$. Imposing the criterion $E[K^2]/E[K] > 3$ of Theorem 3, i.e. $\sum_x \lambda_x^2 > 2 \sum_x \lambda_x$, and applying equations (1) and (2), we get $m/n > \frac{1-2\beta}{(1-\beta)^2}$ when $\beta < 1/2$; $m > 4n \log^{-1} n\,(1 + o(1))$ when $\beta = 1/2$; and $m > \frac{1}{(1-\beta)^2\, \zeta(2\beta)}\, n^{2(1-\beta)}$ when $1/2 < \beta < 1$. The last two cases, $\beta = 1$ and $\beta > 1$, are meaningless, since we have assumed $\beta < 1$ in other parts of the proof. The first three possibilities prove the two statements of the theorem. In the second and third cases, since we cannot prove that the fraction of repeated clauses is negligible, we obtain a bound on the number of distinct clauses.
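The averaging step can be checked numerically. Under the Poisson approximation, the criterion $\sum_x \lambda_x^2 = 2 \sum_x \lambda_x$ with $\lambda_x = 2mP_x$ reduces to $m = 1/\sum_x P_x^2$, which the following sketch evaluates for finite $n$ (the function name is ours):

```python
def scale_free_2sat_threshold(n, beta):
    """Numeric version of the threshold computation: with P_x ~ x^(-beta)
    and K_x approximately Poisson(2 m P_x), the criterion E[K^2]/E[K] = 3
    becomes 4 m^2 sum(P^2) = 2 (2 m), i.e. m = 1 / sum(P^2).
    Returns the threshold clause/variable ratio m/n."""
    w = [x ** (-beta) for x in range(1, n + 1)]
    s = sum(w)
    P = [wi / s for wi in w]
    m = 1.0 / sum(p * p for p in P)
    return m / n
```

For $\beta = 0$ this gives exactly $m/n = 1$, the classical threshold, and for $\beta < 1/2$ it converges (slowly, with finite-size corrections) to $\frac{1-2\beta}{(1-\beta)^2}$.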
From Friedrich et al. [31] and Theorem 4 we can conclude: Corollary 1. Scale-free 2-SAT formulas over $n$ variables with exponent $\beta < 1/2$ have a SAT-UNSAT phase transition threshold when the clause/variable ratio is $\frac{m}{n} = \frac{1-2\beta}{(1-\beta)^2}$. We have experimentally analyzed the fraction of satisfiable random scale-free 2-SAT formulas depending on the parameter $\beta$ and the clause/variable ratio $m/n$. The results are plotted in Figure 3, for formulas with $n = 10^5$ variables. We observe that the phase transition predicted by Theorem 4 is quite precise, except when $\beta \approx 1/2$. In the limit $n \to \infty$, the fraction of satisfiable formulas with $n$ variables and $c\,n$ clauses tends to zero for any $c > 0$. However, since the number of clauses needed to make the formula unsatisfiable grows as $n^{2(1-\beta)}$, when $\beta$ is close to $1/2$ the convergence is very slow.
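Experiments like those of Figure 3 only require a polynomial-time satisfiability test for 2-CNF. A minimal sketch (our own code, not the paper's experimental tooling) using the classical Aspvall-Plass-Tarjan criterion: a 2-CNF formula is unsatisfiable if and only if some variable lies in the same strongly connected component of the implication graph as its negation.

```python
def sat_2cnf(clauses, n):
    """2-SAT test via Kosaraju's SCC algorithm on the implication graph:
    a clause (a, b) yields edges -a -> b and -b -> a; the formula is
    satisfiable iff no variable v shares an SCC with -v."""
    succ, pred = {}, {}
    def add(u, v):
        succ.setdefault(u, []).append(v)
        pred.setdefault(v, []).append(u)
    for a, b in clauses:
        add(-a, b)
        add(-b, a)
    lits = [l for v in range(1, n + 1) for l in (v, -v)]
    seen, order = set(), []
    for s in lits:                       # pass 1: record finishing order
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(succ.get(s, [])))]
        while stack:
            u, it = stack[-1]
            nxt = next(it, None)
            if nxt is None:
                order.append(u)
                stack.pop()
            elif nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, iter(succ.get(nxt, []))))
    comp = {}
    for s in reversed(order):            # pass 2: reversed graph
        if s in comp:
            continue
        comp[s] = s
        stack = [s]
        while stack:
            u = stack.pop()
            for w in pred.get(u, []):
                if w not in comp:
                    comp[w] = s
                    stack.append(w)
    return all(comp[v] != comp[-v] for v in range(1, n + 1))
```

For instance, the four clauses $(x_1 \vee x_2), (x_1 \vee \neg x_2), (\neg x_1 \vee x_2), (\neg x_1 \vee \neg x_2)$ put every literal into one SCC, so the test reports unsatisfiable.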
In order to test experimentally the second statement of Theorem 4, we have analyzed the fraction of satisfiable formulas with respect to $m/\alpha$, where $\alpha = \frac{1}{(1-\beta)^2\, \zeta(2\beta)}\, n^{2(1-\beta)}$. In Figure 4, we show the results for $\beta = 0.7$. We observe that, for distinct values of $n$, the transition between SAT and UNSAT is around $\alpha$. However, for increasing values of $n$, the transition does not seem to become more abrupt.

Unsatisfiability by Small Cores
In the proof of Theorem 4 we have already seen that, when $\beta > 1/2$, the number of clauses needed to make a 2-SAT formula unsatisfiable is sub-linear. Therefore, the phase transition factor (understood as a constant $c$ such that, in the limit $n \to \infty$, formulas with fewer than $c\,n$ clauses are satisfiable and those with more than $c\,n$ clauses are unsatisfiable) is zero. In this section, we prove that, when $\beta$ exceeds a certain bound, scale-free formulas become unsatisfiable due to a small subset of clauses containing variables with small indices. Moreover, this result holds for clauses of any size.
Proof: The probability that a random clause is exactly $x_1 \vee \ldots \vee x_k$, i.e. that it contains only the $k$ smallest-index variables, all positive, is $P(x_1 \vee \ldots \vee x_k) \geq k!\; 2^{-k} \prod_{i=1}^{k} P_i$. This inequality would be an equality if we allowed tautologies and simplifiable clauses (i.e. repeated variables) in formulas. Using (1) we get $P(x_1 \vee \ldots \vee x_k) \geq k!\; 2^{-k}\, (k!)^{-\beta}\, \frac{(1-\beta)^k}{n^{(1-\beta)k}}\, (1 + o(1)) = \Theta\!\left(n^{-(1-\beta)k}\right)$. In the limit $n \to \infty$, the probability of generating the clause $x_1 \vee \ldots \vee x_k$ among $m$ independently generated clauses is $1 - \left(1 - \Theta(n^{-(1-\beta)k})\right)^m$. Therefore, the probability of generating the clause $x_1 \vee \ldots \vee x_k$ is $1 - o(1)$ when the number of clauses is $m = \omega(n^{(1-\beta)k})$. The same applies to the other $2^k$ clauses over these variables with distinct signs and, if $k = O(1)$, to a refutation of the formula using only this set of clauses.
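The bound on the probability of one such clause is easy to evaluate for finite $n$, which also illustrates the $n^{-(1-\beta)k}$ scaling (the function name is ours; this is a sketch of the bound used in the proof, not code from the paper):

```python
import math

def clause_prob(n, k, beta):
    """Lower bound on the probability that one random scale-free k-clause
    equals x_1 v ... v x_k with all-positive signs:
    k! * 2^-k * prod_{i<=k} P_i, where P_i ~ i^(-beta)."""
    w = [i ** (-beta) for i in range(1, n + 1)]
    total = sum(w)
    prod = 1.0
    for i in range(1, k + 1):
        prod *= (i ** (-beta)) / total
    return math.factorial(k) * 2 ** (-k) * prod
```

Doubling $n$ several times shows the probability shrinking like $n^{-(1-\beta)k}$; e.g. for $k = 3$ and $\beta = 1/2$, quartering $n$ multiplies the probability by roughly $4^{3/2} = 8$.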
As in classical random formulas, the expected number of truth assignments that satisfy a scale-free random formula is $2^n (1 - 2^{-k})^m$. This imposes a linear upper bound on the number of clauses of satisfiable scale-free formulas: a random scale-free formula with $m = c\,n$ clauses of size $k$ over $n$ variables such that $c > 2^k \log 2$ is unsatisfiable with probability $1 - o(1)$. Therefore, the bound of Theorem 5 only improves this linear bound when $(1-\beta)k < 1$, hence when $\beta > 1 - 1/k$. Figure 5 shows an experimental estimation of how many clauses are needed to make 50% of the random formulas unsatisfiable, for distinct values of $\beta$ and $k = 3$, as a function of the number of variables. Theorem 5 predicts that the number of clauses in a satisfiable scale-free 2-SAT formula cannot grow faster than $O(n^{2(1-\beta)})$, due to the emergence of small cores. When $1/2 < \beta < 1$, the second statement of Theorem 4 predicts exactly the same exponent $2(1-\beta)$ for the emergence of a giant bicycle. This suggests that, in this range of $\beta$, the probability of the existence of a small unsatisfiable core and that of a giant one are similar. However, experimental results (see Figure 4) show that the SAT-UNSAT transition is quite smooth, as in classical 1-SAT. This suggests that small cores are, in fact, more prominent. Another argument in this direction is as follows. Let $C(V)$ be the subset of clauses only containing variables of the subset $V$ of variables. The greater $|C(V)|/|V|$ is, the higher is the probability of having an unsatisfiable core inside $C(V)$. In the case of scale-free random $k$-SAT formulas, let $C_r$ be the set of clauses only containing variables $\{1, \ldots, r\}$. We can estimate $\frac{E[|C_r|]}{r} = \frac{m}{r}\left(\sum_{i=1}^{r} P_i\right)^k$. For $(1-\beta)k \geq 1$, i.e. $\beta \leq 1 - 1/k$, the maximum of this function is at $r = \infty$. For $(1-\beta)k < 1$, i.e. $\beta > 1 - 1/k$, the maximum is finite: $r = (1 - (1-\beta)k)^{-1/(1-\beta)}$. Notice that $(1-\beta)k$ is the exponent predicted by Theorem 5, and that for 2-SAT, $1 - 1/k = 1/2$.
Therefore, we get another argument showing that at $\beta = 1 - 1/k$ there is a change in the behavior of scale-free random $k$-SAT formulas. When $n \to \infty$, for $\beta \leq 1 - 1/k$ the most probable outcome is a very large core involving a constant fraction of the whole set of clauses. For $\beta > 1 - 1/k$, the most probable outcome is a small core involving only a finite set of clauses over the variables $\{1, \ldots, (1-(1-\beta)k)^{-1/(1-\beta)}\}$.
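The argument above can be checked numerically by maximizing the estimated ratio $E[|C_r|]/r$ over the prefix size $r$ (our own sketch; the function name and the use of exact partial sums, rather than the asymptotic approximation in the text, are ours):

```python
def best_core_prefix(n, k, beta, m=1.0):
    """Return the prefix size r in {1..n} maximizing the estimate
    E[|C_r|]/r = (m/r) * (sum_{i<=r} P_i)^k from the text, where C_r is
    the set of clauses over variables {1..r} and P_i ~ i^(-beta).
    The constant m only rescales the ratio and does not affect the argmax."""
    w = [i ** (-beta) for i in range(1, n + 1)]
    total = sum(w)
    best_r, best, acc = 1, -1.0, 0.0
    for r, wi in enumerate(w, start=1):
        acc += wi                       # partial sum of weights up to r
        ratio = m * (acc / total) ** k / r
        if ratio > best:
            best, best_r = ratio, r
    return best_r
```

Consistently with the discussion, for $(1-\beta)k > 1$ (e.g. $k = 3$, $\beta = 0.2$) the ratio is maximized by taking all variables, while for $\beta > 1 - 1/k$ (e.g. $k = 3$, $\beta = 0.9$) the maximum is attained at a small finite prefix.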

Conclusions
We have proposed a new model for the generation of random SAT formulas that better mimics the properties observed in real-world formulas. In particular, the number of occurrences of variables follows a power-law distribution, as observed in the industrial SAT instances used in competitions. This is obtained by assigning a distinct probability $P(i) \sim i^{-\beta}$ to every variable $i \in \{1, \ldots, n\}$, where $\beta$ is a parameter. This model generalizes (classical) random SAT formulas, which are recovered by taking $\beta = 0$.
We prove the existence of a SAT-UNSAT phase transition for 2-CNF formulas. This result is obtained using a novel approach based on percolation techniques. For arbitrary $k$-CNF formulas, we prove that formulas with $\omega(n^{(1-\beta)k})$ clauses are unsatisfiable with probability $1 - o(1)$. Moreover, when $\beta > 1 - 1/k$, formulas are unsatisfiable due to a small set of clauses involving only the most frequent variables.