Resilience Analysis for Double Spending via Sequential Decision Optimization

Abstract: Recently, diverse concepts originating from blockchain ideas have gained increasing popularity. One of the innovations in this technology is the use of the proof-of-work (PoW) concept for reaching a consensus within a distributed network of autonomous computer nodes. This goal has been achieved by the design of PoW-based protocols with a built-in equilibrium property: if all participants operate honestly, then the best strategy of any agent is also to follow the same protocol. However, there are concerns about the stability of such systems. In this context, the analysis of attack vectors, which represent potentially successful deviations from honest behavior, turns out to be the most crucial question. Naturally, the stability of a blockchain system can be assessed only by determining its most vulnerable components. For this reason, knowing the most successful attacks, regardless of their sophistication level, is indispensable for a reliable stability analysis. In this work, we focus entirely on blockchain systems based on proof-of-work consensus protocols, referred to as PoW-based systems, and consider planning and launching an attack on such a system as an optimal sequential decision-making problem under uncertainty. With our results, we suggest a quantitative approach to deciding whether a given PoW-based system is vulnerable to this type of attack, which can help in assessing and improving its stability.


Introduction
In recent years, concepts originating from the blockchain idea have gained popularity. Their software realizations are based on a mixture of traditional techniques (peer-to-peer networking, data encryption) and modern concepts (consensus protocols). Digital currencies represent the assets of these systems; their transactions are written and kept in an electronic ledger as a part of the operation of the blockchain system. The main difference from a traditional financial system is that the assets (crypto-currencies) are not issued and supervised by a central authority, but by the joint efforts of a network of independent computers, all running the same or similar software. Such a network searches for a consensus which yields a common version of the ledger shared by all participants. The consensus is reached by means of a process called mining, which is usually backed by economic incentives. Proponents of blockchain systems argue that they can achieve the same level of certainty and security as systems governed by a central authority at significantly lower costs. In fact (the author thanks an anonymous referee for this observation), costs can be lower in some cases for the users because of the absence of service provider fees. However, in general, blockchain systems are more resource and energy consuming than centralized ones. Still, in return they provide decentralization, that is, the "splitting of trust" among a set of entities (possibly the entire network). Furthermore, owing to the distributed, decentralized, and homogeneous architecture of the network, a blockchain system can reach a high level of stability through data redundancy and hardware/software replicability.
Following a mining process, all network participants append, validate, and mutually agree on a common version of the data history, which is usually referred to as the blockchain ledger. Some authors consider the invention of mining a real breakthrough which has solved a long-standing consensus problem in computer science, although this development must be seen in the context of notable research advances in consensus protocols, such as Byzantine fault tolerant protocols. There is also criticism of this approach. A critical point is that to reach a consensus, some real physical resources/efforts must be spent or at least allocated. For instance, the traditional Bitcoin protocol requires participants to solve cryptographic puzzles with real consumption of computing power and energy, in terms of the proof of work (PoW). Other blockchain systems avoid resource consumption and instead require a temporary allocation of diverse resources, for instance the ownership of the underlying digital assets (proof of stake) or their spending (proof of burn). Furthermore, commitment of storage capacity (proof of storage), or diverse combinations of resource allocation/consumption, are also used.
Let us briefly elaborate on the proof of work, whose details can be found in the excellent book by Andreas Antonopoulos, "Mastering Bitcoin" [1]. We focus on the Bitcoin protocol, which was initiated by [2], with a refinement of the double-spending analysis in [3], further developed in [4], and with considerations addressing propagation delay in [5]. In this framework, the ledger consists of a chain of blocks, and each block contains valid transactions. The nodes compete to add a new block to the chain; in doing so, each node attempts to collect transactions and to solve a mathematical puzzle. Once this puzzle is solved, the completed block is made public to the other nodes. The protocol also prescribes that if a peer node reports a completed block, then it must be verified; if this block is valid, it must be attached to the chain, all uncompleted blocks must be abandoned, and a new block continuing the chain must be started. However, even when these rules are followed, the chain forks regularly, which results in different nodes working on different branches. To reach a consensus in such cases, the protocol prescribes that the shorter branch must be abandoned as soon as a longer branch becomes known.
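To make the puzzle concrete, here is a toy proof-of-work sketch in Python (the real Bitcoin client encodes the target in a compact "nBits" field and double-hashes an 80-byte header with SHA-256; the leading-zero-bits criterion below is a deliberate simplification of ours):

```python
import hashlib

def mine(header: bytes, difficulty_bits: int, max_nonce: int = 10_000_000):
    """Search for a nonce such that SHA-256(header || nonce), read as a
    256-bit integer, falls below the target 2**(256 - difficulty_bits)."""
    target = 1 << (256 - difficulty_bits)
    for nonce in range(max_nonce):
        digest = hashlib.sha256(header + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce, digest.hex()
    return None  # no solution found within max_nonce attempts

# A low difficulty so that the search terminates within a fraction of a second.
result = mine(b"prev-hash|merkle-root|timestamp", difficulty_bits=16)
```

Note the asymmetry: verifying a reported solution requires a single hash evaluation, which is why a completed block can be validated cheaply by all peers while finding one is costly.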
Let us return to the stability of the PoW protocol in the sense of its resilience to attacks. Please note that within a blockchain system, the nodes run publicly available open-source (mining) software, which can easily be modified by any private user aiming to control computer nodes and undermine the system. In principle, there are many ways of doing this. One of the most obvious malicious strategies would be an attempt to spend the same electronic money more than once. The analysis of such a strategy is referred to as the double-spending problem.
In the classical formulation [2,3] of this problem, it is suggested that a merchant waits for n ∈ N confirming blocks after a payment by the buyer before a product/service is provided. While the network is mining these n blocks, the attacker tries to build his/her own secret branch containing a version of history in which this payment is not made. The idea is to not include the paying transaction in the private secret branch. The attacker hopes that the private branch will overtake the official branch and will be incorporated into the long-term chain. If this strategy succeeds, then the private secret branch becomes official and the payment disappears from the ledger after the product/service has been taken by the attacker. Nakamoto [2] provides, and Rosenfeld [3] refines, an estimate of the attacker's success probability depending on his/her computational power and the number n of confirming blocks.
Let us emphasize that [2,3] provide merely an indication of why succeeding in the double-spending attack could be difficult, since their analysis focuses on a simplified situation and lacks several important aspects. First, note that in the original work [3], the success estimation of the double spending is based on the assumption that the attacker can start the race having pre-mined one block. Still, it is not clear how to achieve the advantage of being able to start the race one block ahead of the official chain. In fact, the present contribution is devoted to a systematic study of this interesting question.
Second, the works [2,3] merely calculate the probability of the secret chain getting ahead of the official one, ignoring the mining costs and the revenues/losses from a successful/failed attack. Furthermore, the possibility of canceling secret mining (if the block difference becomes too high) is not considered. Most important, however, is the assumption that the paying transaction must be placed right after the fork. Please note that this assumption is justifiable only if the merchant requires immediate payment once a purchase is agreed upon, canceling the deal otherwise. In reality, however, the attacker may be able to freely choose the time of payment, in particular when buying goods from web portals. That is, an attempt to overtake the official chain before launching an attack can give an advantage.

Remark 1.
In this work, we focus exclusively on PoW-based systems to analyze their vulnerability with respect to the double-spending threat, since other blockchain systems (all those based on different consensus algorithms, like permissioned networks) are immune to this type of attack. In this context, we examine the effect of pre-mining on the profitability of the double spending, which involves two opposing effects: obviously, the success probability of the attack increases with the number of pre-mined blocks, while on the other hand, a longer mining race reduces rewards due to mining costs; i.e., whether the paying transaction must be placed immediately depends on the protocol's block reward policy. Here, many technical details become crucial: for instance, ref. [6] investigates differences between Bitcoin and Ethereum with respect to rewards for stale and uncle blocks. However, such details are not covered by the present approach, which elaborates on a general view and provides an algorithmic solution to the corresponding double-spending problem. Still, the code presented in this paper is flexible and can cover a wide range of situations, leaving enough space for specifications. For instance, there is an obvious linear relation between the costs of secret mining and the secret capacity fraction. In reality, this relation may be more complex, depending on the mining hardware and its ownership. For this reason, we model mining costs and the capacity fraction with separate parameters, leaving enough flexibility to tailor our implementation to a given problem, as illustrated in Section 8.

Contribution of the paper:
We discuss the security assessment of double spending in terms of discrete-time finite-horizon stochastic control, using an optimal stopping and a switching model. In contrast to the infinite-horizon discounted-reward Markov decision models suggested in the literature (see [6][7][8]), we obtain exact solutions and express our results in present-time monetary units, which allows direct conclusions. In the optimal stopping formulation, we show how to choose the optimal payment moment depending on the length difference between the official and secret chains, the mining capacity ratio, the number of confirming blocks, and the revenue/loss from the success/failure of the attack. We upgrade this framework to a stochastic switching model and show how to decide whether it is worth attacking a given PoW-based system. This insight may allow important conclusions on its vulnerability. For all problems solved in this work, we provide a complete implementation and full source code listings which can be used for further adaptations.
Paper organization: Section 2 presents a literature review relevant to our paper, whereas Section 3 provides a motivation for our approach, which requires the finite-horizon framework introduced in Section 4. With the methodological background from Section 5, we address the optimal stopping and switching models in Sections 6 and 7, whose implementation is illustrated by code listings and a number of numerical case studies in Section 8, with conclusions given in Section 9. Appendix A is devoted to technical details.

Stochastic Models in the Analysis of Blockchains
Performance evaluation and improvement of blockchain systems rely on stochastic modeling, analysis, and optimization, and this is by now an active area with a substantial number of publications. To analyze blockchain systems, diverse methods encompassing random walks, queuing models, Markov processes, stochastic control, and game-theoretical models have been successfully applied. Let us mention some representative literature related to the proof-of-work protocol. Other consensus protocols (see [9]) are also discussed in terms of interesting models (for instance in [10]), but are outside our scope. For a detailed overview of the literature on applications of stochastic methods to blockchains, we refer the interested reader to the recent work [11].
Applications of queuing theory deal with the modeling of transaction arrivals and block generation times and are discussed in [12,13]. An early and important work [14] on the application of Markov processes to blockchains discusses the vulnerability of the PoW protocol in the framework of so-called selfish mining. More precisely, the work [14] suggests that a pool of miners may secretly work together to obtain higher payoffs than other miners by violating the protocol, postponing the publishing of blocks. Such behavior is referred to as selfish mining, whose investigation is extended, among others, by [5,15]; the latter also considers propagation delay. The double-spending problem is discussed in [16] using renewal theory and in [17] using random walks.
The theory of Markov decision processes is applied to blockchain analysis in [6][7][8]. These publications are directly related to the present work. In [7], the idea of pre-mining on a secret branch invalidating a transaction for double spending is investigated. The authors recognize that if the payment moment can be chosen by the attacker, then the double spending succeeds with probability one. Using an infinite-horizon Markov decision formulation, the actions to "adopt" (abandon secret mining), "override" (publish a longer secret chain), "match" (publish a secret chain of the same length), and "wait" (continue secret mining) are optimized in a specific framework. This study elaborates on the difference in communication to full nodes versus light nodes and its role in the success of the attack. A further refinement of this approach is suggested in [6]. This work optimizes a similar range of decisions for the maximization of the proportion between secret and total mining rewards. Furthermore, the study [6] introduces an appropriate benchmark, defined as the minimal value of the double-spending gain which makes optimal selfish mining more profitable than honest mining. Using this benchmark, diverse blockchain systems are compared and conclusions are derived. In Section 7, we discuss the advantages and the incremental contribution of our approach in relation to [6][7][8].

The Double-Spending Problem
Let us briefly discuss the classical results before we elaborate on our contribution. In the framework of the double-spending problem, it is assumed that a continuous-time Markov chain taking values in Z describes the difference in blocks between the official and secret branches. As in [3], we consider this process at the time points at which a new block in one of the branches is completed, which yields a discrete-time Markov chain (Z_t)_{t=0}^∞. Having started secret mining after the block including the attacker's payment (at block time t = 0, Z_0 = 0), the attacker considers the following situation: At each time t = 1, 2, 3, ..., a new block in one of the branches (official or secret) is found and the block difference changes according to

Z_{t+1} = Z_t + 1 with probability 1 − q, Z_{t+1} = Z_t − 1 with probability q,

where q ∈ ]0, 1[ is the ratio of the computational power controlled by the attacker to the total mining capacity. Let us agree on the generic case where the attacker controls a smaller part of the mining power, 0 < q < 1/2, than that controlled by the honest miners. In this case, if at any block time t = 0, 1, 2, ... the block difference is z ∈ Z, then the probability a_∞(z, q) that the secret branch overtakes the official branch within an unlimited time after t is given by

a_∞(z, q) = min{1, (q/(1 − q))^(z+1)}, z ∈ Z. (1)

Furthermore, at the time when the n-th block in the official branch is mined, the probability that the attacker has mined m = 0, 1, 2, ... blocks follows the negative binomial distribution, whose distribution function is given by

F_{q,n}(x) = Σ_{m=0}^{x} C(n + m − 1, m) (1 − q)^n q^m, x = 0, 1, 2, .... (2)

Both results (1) and (2) are combined in [3] to obtain the success probability of the double spending as follows: Consider the situation where the attack starts when the length difference between the official and the secret branches is k ∈ Z with k ≥ −n. Consider first the case that, at the time the n-th block in the official chain is completed, the attacker has mined m blocks with m − k > n, in which case the secret branch can be published immediately.
The probability of this event is given by

1 − F_{q,n}(n + k). (3)

Next, consider the opposite event: when the n-th official block is completed, the attacker has not overtaken the official chain, in which case m − k ≤ n. In this case, the probability of winning the race is given by

Σ_{m=0}^{n+k} C(n + m − 1, m) (1 − q)^n q^m a_∞(n + k − m, q) = (q/(1 − q))^{1+k} F_{1−q,n}(n + k). (4)

Clearly, the total success probability of the double spending is given by the sum of the probabilities in both cases and equals

r_{q,n}(k) = 1 − F_{q,n}(n + k) + (q/(1 − q))^{1+k} F_{1−q,n}(n + k) for n ∈ N, k ∈ Z, k ≥ −n, and q ∈ ]0, 1/2[.
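The classical success probability is straightforward to evaluate numerically. The following Python sketch (the paper's own listings are in R; the function names are ours) implements the negative binomial distribution function, the overtaking probability, and the closed form for r_{q,n}(k):

```python
from math import comb

def F(q: float, n: int, x: int) -> float:
    """Negative binomial distribution function: probability that the attacker
    (capacity share q) has mined at most x blocks by the time the n-th
    official block is completed."""
    return sum(comb(n + m - 1, m) * (1 - q) ** n * q ** m for m in range(x + 1))

def overtake(z: int, q: float) -> float:
    """Probability a_inf(z, q) that the secret branch ever overtakes the
    official one, starting from block difference z (for q < 1/2)."""
    return min(1.0, (q / (1 - q)) ** (z + 1))

def r(q: float, n: int, k: int) -> float:
    """Success probability r_{q,n}(k) of a double spend started at length
    difference k when the merchant waits for n confirmations."""
    return 1 - F(q, n, n + k) + (q / (1 - q)) ** (1 + k) * F(1 - q, n, n + k)
```

As a consistency check, the closed form agrees with summing the negative binomial outcomes against the overtaking probabilities directly, and the success probability increases both with fewer confirmations and with pre-mined blocks.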
For instance, consider the situation where the attacker starts the fork-off and launches the payment at the same time; then k = 0, and if the merchant waits for n = 6 confirming blocks, the attack succeeds with probabilities r_{q,n=6}(k = 0) which are relatively small if the attacker controls a small part (six percent, q = 0.06, or eight percent, q = 0.08) of the mining capacity: r_{0.06,6}(0) = 0.00037, r_{0.08,6}(0) = 0.0025. As a result, waiting for six blocks after the payment has been considered secure, in the sense that with realistic efforts it is practically impossible to succeed with double spending.

Remark 2.
Please note that in the original work [3], the estimation of the double spending is based on the assumption that the attacker can start the race having pre-mined one block, i.e., k = −1. This leads to a different (see Figure 1) success probability

r_{q,n}(−1) = 1 − F_{q,n}(n − 1) + F_{1−q,n}(n − 1).
Still, it is not clear how to achieve the advantage of being able to start the race one block ahead of the official chain. In fact, the present contribution is devoted to a systematic study of this interesting question. The above analysis [3] calculates the probability of the alternative blockchain getting ahead of the official one. It does not consider the revenues and losses from a successful/failed attack. Furthermore, the possibility of canceling the secret mining (if the block difference becomes too high) is not considered. Most important, however, is the question of why the paying transaction must be placed right after the fork. Please note that this assumption is justifiable only if the merchant requires immediate payment once the purchase is agreed upon, canceling the deal otherwise. In reality, however, the attacker may be able to freely choose the time of payment, in particular when buying goods from web portals. That is, an attempt to overtake the official chain before launching an attack can give an advantage in the spirit of the above remark.

Block Difference Dynamics
Consider a finite time horizon where t ∈ {0, ..., T} represents the number of blocks mined in the official chain since the branch forked off; i.e., we suppose that our secret mining starts at block time t = 0. We interpret T ∈ N as the maximal length of an official branch which can still be abandoned if a longer branch has been discovered. To the best of the authors' knowledge, the current Bitcoin protocol does not have such a restriction, meaning that the shorter branch must always be discarded, independently of its length. However, other systems discuss 'checkpoints' and 'gates' with similar functionality. A finite time horizon yields conceptual advantages (providing an exact solution) and represents a negligible deviation from reality, since T is sufficiently large and can be changed in the calculation. Let us introduce all the ingredients required for a formal discussion of the decision problems formulated above. Introduce the block difference process

(Z_t)_{t=0}^T, (5)

where Z_t is the branch length difference between the official and secret branches at the times t = 0, ..., T at which a new block in the official branch is completed.
We show that the transition probabilities of (Z_t)_{t=0}^T satisfy

P(Z_{t+1} = z + 1 − j | Z_t = z) = p_j, j = 0, 1, 2, ..., z ∈ Z, (6)

for all t = 0, ..., T − 1 with the geometric distribution

p_j = q^j (1 − q), j = 0, 1, 2, ..., (7)

where q ∈ [0, 1] is the fraction of the capacity controlled by the secret miners; the proof of the assertions (6) and (7) is found in Appendix A. Consider also the process

Z̃_t = t − Z_t, t = 0, ..., T, (8)

where Z̃_t is the length of the secret branch at the times t = 0, ..., T at which a new block in the official branch is completed.
This process possesses independent, identically geometrically distributed increments:

P(Z̃_{t+1} − Z̃_t = j) = q^j (1 − q), j = 0, 1, 2, .... (9)

In what follows, we show that determining the double-spending attack with the highest expected total reward (from the viewpoint of the attacker) yields an optimal stochastic stopping/switching problem. To ease the reader's understanding, we start with a simplified situation (neglecting secret mining costs, rewards for published blocks, and the possibility of abandoning the attack at any stage). Such a problem can be formulated as an optimal stopping problem. Thereafter, based on the setting of this stopping problem, we consider a more realistic approach and upgrade the optimal stopping to an optimal switching framework, which takes into account mining costs, rewards for published blocks, and the possibility of abandoning secret mining. Before introducing all details in the subsequent sections, let us sketch the core ideas and explain how our results can be used to assess the vulnerability of a given PoW-based system.
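The block difference dynamics are easy to simulate: between two consecutive official blocks, the secret miners complete a geometric number of blocks. A small Python sketch (the paper's implementation is in R; the names below are ours):

```python
import random

def simulate_block_difference(q: float, T: int, rng: random.Random):
    """One path of (Z_t): between consecutive official blocks the secret
    miners (capacity share q) complete G blocks with P(G = j) = q**j * (1 - q),
    so that Z_{t+1} = Z_t + 1 - G."""
    Z = [0]
    for _ in range(T):
        G = 0
        while rng.random() < q:  # each mined block is secret with probability q
            G += 1
        Z.append(Z[-1] + 1 - G)
    return Z
```

Since E[G] = q/(1 − q), the expected increment of the block difference is 1 − q/(1 − q) = (1 − 2q)/(1 − q), which is positive exactly when q < 1/2: on average, the attacker with a minority of the capacity falls behind.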

Decisions under Uncertainty: Optimal Switching and Stopping
Sequential decision-making arises in many applications and is usually addressed within the framework of discrete-time stochastic control. The theory of Markov decision processes/dynamic programming provides a variety of methods to deal with such questions. In generic situations, obtaining solutions even for the simplest decision processes may be cumbersome (see [18][19][20]). However, for the questions formulated in the present work, a specific truncation technique will be applied to state the problem on a finite space within a finite time horizon, which makes all results obtainable by a finite number of algebraic operations at machine precision.
Let us introduce a particular class of Markov decision problems: optimal stochastic switching (see [21]). On a finite time horizon {0, 1, ..., T}, consider an agent faced with a sequential decision-making problem: At any time t = 0, 1, ..., T − 1, an action a ∈ A from a finite set A of all available actions must be chosen. This decision returns an immediate reward/cost but also influences the future state evolution; i.e., at any time, an action optimally balances the current rewards/costs of control against all future situations. In the framework of optimal stochastic switching, the decision variable has two components (p, z) ∈ E = P × R^d, consisting of an operation mode p and an environment state z; thus the state space E is the Cartesian product of a finite set P of all operation modes and the Euclidean space R^d. The evolution (Z_t)_{t=0}^T of the second component is supposed to follow a Markov process, with the interpretation that Z_t = z is the situation in the global environment at time t which is relevant for decision-making but cannot be changed by the agent's actions. Contrary to this, the current operation mode p ∈ P is under full deterministic control of the agent at any time. This aspect is modeled in terms of a function α : P × A → P, (p, a) → α(p, a), which describes the deterministic change of the operation mode by the agent's actions, with the interpretation that α(p, a) ∈ P is the new mode if the action a ∈ A was taken in the previous mode p ∈ P. Now, let us specify the control costs. Assume that taking an action a ∈ A yields an immediate reward r_t(p, z, a), which depends on the state (p, z) ∈ E and on the action a ∈ A through given reward functions r_t : E × A → R, which may be time-dependent. When the system arrives at the last time step t = T in the state (p, z) ∈ E, the agent collects the scrap value r_T(p, z), described by a pre-specified scrap function r_T : E → R. At each time t = 0, ..., T − 1, the decision rule π_t is given by a mapping π_t : E → A, prescribing at time t in the state (p, z) ∈ E the action π_t(p, z) ∈ A. A sequence π = (π_t)_{t=0}^{T−1} of decision rules is called a policy. When controlling the system by the policy π = (π_t)_{t=0}^{T−1}, the positions (p^π_t)_{t=0}^T and the actions (a^π_t)_{t=0}^{T−1} evolve recursively as

a^π_t = π_t(p^π_t, Z_t), p^π_{t+1} = α(p^π_t, a^π_t), t = 0, ..., T − 1.

Having started at the initial values p^π_0 = p_0 and Z_0 = z_0, the goal of the controller is to maximize, over all possible policies, the expectation of the total reward; the quantity

v^π(p_0, z_0) = E[ Σ_{t=0}^{T−1} r_t(p^π_t, Z_t, a^π_t) + r_T(p^π_T, Z_T) ]

is called the value of the policy π and represents the total reward accumulated over the entire time horizon.
For technical details and solution algorithms for switching systems, we refer the interested reader to [22]. Furthermore, there are applications to the pricing of financial options [23], natural resource extraction [21], battery management [24], and optimal asset allocation under hidden state dynamics [25]; many applications are illustrated using R in [26].
Let us now introduce the standard backward induction algorithm which is used to obtain a solution to an optimal switching problem. Given a switching problem as above, introduce the stochastic kernels

K^{a,p} v(z) = E[ v(α(p, a), Z_{t+1}) | Z_t = z ], p ∈ P, a ∈ A, z ∈ R^d,

which act on all functions v on E = P × R^d for which the above expectation exists. Using these kernels, the policy value is obtained recursively by the policy valuation algorithm

v^π_T = r_T, v^π_t(p, z) = r_t(p, z, π_t(p, z)) + K^{π_t(p,z), p} v^π_{t+1}(z), t = T − 1, ..., 0.

To obtain a policy π* = (π*_t)_{t=0}^{T−1} which maximizes the total expected reward, one introduces for each t = 0, ..., T − 1 the so-called Bellman operator

T_t v(p, z) = max_{a ∈ A} [ r_t(p, z, a) + K^{a,p} v(z) ],

acting on each function v : E → R for which the above expectation exists. Now, consider the Bellman recursion, also referred to as backward induction:

v*_T = r_T, v*_t = T_t v*_{t+1}, t = T − 1, ..., 0.

Under appropriate assumptions, this recursion yields a solution for t = T − 1, ..., 0, p ∈ P, and z ∈ R^d. The functions (v*_t)_{t=0}^T resulting from the backward induction are called value functions; they determine an optimal policy π* = (π*_t)_{t=0}^{T−1} via

π*_t(p, z) = argmax_{a ∈ A} [ r_t(p, z, a) + K^{a,p} v*_{t+1}(z) ]

for p ∈ P, z ∈ R^d, and all t = 0, ..., T − 1.
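For a finite state grid, the backward induction fits in a few lines of Python (the paper's implementation is in R; the array layout below is our choice, and the transition kernel of (Z_t) is assumed time-homogeneous for simplicity):

```python
import numpy as np

def backward_induction(reward, scrap, alpha, kernel, T):
    """Bellman recursion for optimal switching on a finite grid.

    reward[t][p, z, a] : immediate reward r_t(p, z, a)
    scrap[p, z]        : terminal reward r_T(p, z)
    alpha[p, a]        : deterministic mode transition
    kernel[z, z']      : Markov transition matrix of (Z_t)
    Returns value functions v[t][p, z] and policies pi[t][p, z]."""
    P, Z = scrap.shape
    A = alpha.shape[1]
    v = [None] * (T + 1)
    pi = [None] * T
    v[T] = scrap.copy()
    for t in range(T - 1, -1, -1):
        v[t] = np.empty((P, Z))
        pi[t] = np.empty((P, Z), dtype=int)
        for p in range(P):
            # action values: r_t(p, z, a) + E[v_{t+1}(alpha(p, a), Z_{t+1}) | Z_t = z]
            q_vals = np.stack(
                [reward[t][p, :, a] + kernel @ v[t + 1][alpha[p, a]]
                 for a in range(A)], axis=-1)
            v[t][p] = q_vals.max(axis=-1)
            pi[t][p] = q_vals.argmax(axis=-1)
    return v, pi
```

A stopping problem is recovered with two modes and two actions: mode "stopped" absorbs, and the only rewarded action is to stop while still in the mode "goes".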
We shall emphasize that solutions to even the simplest switching problems are sometimes surprising and non-intuitive. Frequently, observing an optimal solution helps to understand the original question.
As an illustration, we consider two classical problems (borrowed from [18]) whose solutions are non-intuitive at first glance.

Game I:
Consider a card deck lying face down with b ∈ N+ black and r ∈ N+ red cards. On each turn, the player chooses whether or not to draw a card from the deck. If the player takes a card, he gains $1 if a black card is drawn and loses $1 if a red card is drawn. Once a card is taken, it is put aside and is not returned to the deck. Is it possible that b < r and it is still worth starting to draw?

Game II: An equal number of red and black cards, r = b ∈ N+, lie face down on a table. I turn the cards over one by one, and at any time you can say "stop", upon which I turn over the next card. If that card is black, you win $1; if it is red, you lose $1. If you do not stop before the last card, the last card's color decides whether you win or lose. What is the optimal strategy for playing this game?
An analysis of Game I shows that in some situations it is indeed worth playing even if there are more red than black cards. For instance, even for b = 4, r = 6, the value of the optimal policy is still positive, 2/30 ≈ 0.067. In the very similar Game II, it is surprisingly never worth playing, since each policy returns the same value, which is zero.
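Both games are finite-horizon stopping problems and can be solved exactly by backward induction over the remaining card counts; a Python sketch with exact rational arithmetic (the function names are ours):

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def game1(b: int, r: int) -> Fraction:
    """Game I: draw (+1 for black, -1 for red) or quit; value of optimal play."""
    if b == 0:
        return Fraction(0)      # only red cards left: quit
    if r == 0:
        return Fraction(b)      # draw all remaining black cards
    draw = (Fraction(b, b + r) * (1 + game1(b - 1, r))
            + Fraction(r, b + r) * (-1 + game1(b, r - 1)))
    return max(Fraction(0), draw)

@lru_cache(maxsize=None)
def game2(b: int, r: int) -> Fraction:
    """Game II: say "stop" and the next revealed card decides (+1 black,
    -1 red); if you never stop, the last card decides."""
    stop_now = Fraction(b - r, b + r)   # expected payoff of stopping now
    if b + r == 1:
        return stop_now                 # forced: the last card decides
    cont = Fraction(0)
    if b > 0:
        cont += Fraction(b, b + r) * game2(b - 1, r)
    if r > 0:
        cont += Fraction(r, b + r) * game2(b, r - 1)
    return max(stop_now, cont)
```

The recursion confirms the discussion above: Game I can have positive value even with b < r, while in Game II the value of every state (b, r) equals (b − r)/(b + r), so starting from b = r the game is worth exactly zero under any policy.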
The simplest and probably the most important special case of optimal stochastic switching is optimal stopping. Here, if the process (Z_t)_{t=0}^T is stopped at τ = 0, 1, ..., T − 1, then the agent receives the value R_τ(Z_τ), determined by pre-specified stopping reward functions z → R_t(z), t = 0, ..., T − 1, z ∈ R^d. If the process is not stopped within 0, 1, ..., T − 1, then the agent receives R_T(Z_T), determined by a given scrap function z → R_T(z). The stopping problem is formulated as follows: Given (Z_t)_{t=0}^T and (R_t)_{t=0}^T as above, calculate the maximum and one of the maximizers of

τ → E[R_τ(Z_τ)].

Please note that the maximization runs over stopping times, which comprise all random times not depending on future events. An optimal stopping problem can be equivalently formulated as an optimal stochastic switching problem. For this, define two positions and two actions, P = {1, 2}, A = {1, 2}. Here, the positions "stopped" and "goes" are represented by p = 1 and p = 2, respectively, and the actions "stop" and "go" are denoted by a = 1 and a = 2. With this interpretation, the position change is given by the matrix

(α(p, a))_{p,a ∈ {1,2}} = [ [1, 1], [1, 2] ].

Please note that with this matrix, the operation mode "goes" (p = 2, second row) remains valid only if the action "go" (a = 2, second column) is applied. If the system is stopped (p = 1) or the action is to stop (a = 1), then the operation mode transitions to "stopped" (p = 1) and never leaves this mode. The rewards at times t = 0, ..., T − 1 and the scrap value are defined by

r_t(p, z, a) = R_t(z) if (p, a) = (2, 1) and r_t(p, z, a) = 0 otherwise, r_T(p, z) = R_T(z) if p = 2 and r_T(p, z) = 0 otherwise,

for all p ∈ P, a ∈ A, z ∈ R^d. For an optimal stopping problem, the backward induction can be written more compactly. Specifically, we introduce the value functions (V_t)_{t=0}^T and the expected value functions (V^E_t)_{t=0}^{T−1} recursively by

V_T = R_T, V^E_t(z) = E[V_{t+1}(Z_{t+1}) | Z_t = z], V_t(z) = max{R_t(z), V^E_t(z)}, t = T − 1, ..., 0.

The so-called continuation region is defined by

C = {(t, z) : R_t(z) < V^E_t(z)}, (18)

and the optimal stopping time τ* is obtained as the first exit time of the process (Z_t)_{t=0}^T from the region C,

τ* = inf{t = 0, ..., T : (t, Z_t) ∉ C}.

Attack Planning as an Optimal Stopping Problem
Assume that the attacker can freely choose the time of payment. In doing so, he/she can work on a private secret branch long before the payment is placed. For such a situation, the analysis of the double spending differs from the approach explained in Section 3 and requires solving a stopping problem.
Consider the block difference dynamics (Z_t)_{t=0}^T from (5). Launching the double-spending attack at the block time τ = 0, ..., T simply means that the payment will be included into block τ + 1 of the official branch. (Recall that the secret branch invalidates the payment by not including the attacker's paying transaction.) That is, the crucial question is how to optimally choose the block time τ = 0, ..., T at which the payment is made. Notice that although the state space Z of the Markov chain (Z_t)_{t=0}^T is infinite, all relevant situations occur within a finite range. On this account, it is possible to formulate an equivalent optimal stopping/switching problem whose state process follows a finite-state and finite-horizon Markov chain. The idea is to appropriately adjust the original Markov dynamics so that it does not leave a finite state range. For this, let us agree that the event

{Z̃_t > t + n} = {Z_t < −n}, t = 0, 1, 2, ..., (19)

represents a sure opportunity for a successful double spending launched at time τ = t.
Indeed, by attacking immediately with the payment at τ = t if {Z_t < −n} occurs, the last confirmation block is obtained at time t + n, and the next official block at block time t + n + 1, when the secret branch is at least of the same length as the official chain, Z_{t+n+1} ≤ Z_t + n + 1 ≤ 0; i.e., right before block time t + n + 1, the secret branch must have been longer.

Remark 3.
Please note that the insight (19) can be combined with the geometric distribution (9) to conclude that if the mining capacity ratio is positive, then the probability of succeeding in the double-spending attack at least once in an infinite sequence of attempts equals one. Indeed, suppose that q > 0. Having started secret mining, the probability of the event {Z̃_1 > 1 + n} that the attacker has more than n + 1 blocks in the secret chain at the time when one official block has been completed is positive due to (9). However, if the attacker has not succeeded in overtaking (i.e., on {Z̃_1 ≤ 1 + n}), then the secret branch will be discarded and a new chain bifurcation will be started, this time right after the current official block, with the attempt to overtake the official branch by more than n + 1 blocks at the time when the next official block is obtained. This second, independent attempt succeeds with the same positive probability. Repeating this procedure, one obtains a sequence of Bernoulli experiments, each with the same positive success probability, which yields a success with probability one after a finite number of trials.

Now, we clarify the relevant time horizon of the stopping problem. Because we have imposed a finite limit T on the length of the official branch, we agree that for τ > T − n a successful attack is not possible. Specifically, since the payment is placed into block τ + 1 and n confirming blocks are expected, the last confirmation block τ + n > T would be beyond the maximal branch length which can be abandoned. That is, we can assume that the time τ must be chosen within the finite horizon τ = 0, ..., T̃ with the last time point T̃ = T − n. The decision whether to attack must be based on the current block time t = 0, ..., T̃ and on the recent block difference Z_t.
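The Bernoulli-trials argument of Remark 3 can be made quantitative: by the geometric tail, a single restarted attempt overtakes by more than n + 1 blocks within one official block time with probability q^(n+2), and the chance of at least one success grows to one with the number of attempts. A minimal Python sketch (the function name is ours):

```python
def eventual_success_bound(q: float, n: int, attempts: int) -> float:
    """Probability of at least one success among independent restarted
    attempts, where each attempt succeeds with probability
    P(G >= n + 2) = q**(n + 2) and G ~ Geometric(q) counts the secret
    blocks completed per official block."""
    p_single = q ** (n + 2)
    return 1 - (1 - p_single) ** attempts
```

For instance, with q = 0.3 and n = 2, a single attempt succeeds only with probability 0.3^4 = 0.0081, yet over many repeated attempts success becomes practically certain, in line with the remark.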
The event that an attack launched at time τ = 0, …, T̃ is successful can be expressed in the following form. This expression contains a "less than or equal to" since, if at block time t = 1, …, T̃ the process has reached the non-positive domain Z_t ≤ 0, then immediately before the physical time corresponding to t the block difference was negative, because at t one official block was completed (the block difference increased at t to Z_t ≤ 0).
In the second step, we define the stopping reward function, where the numbers C > 0 and c > 0 represent the gain and the loss resulting from the success or the failure of the attack, respectively. Finally, let us agree that τ = T̃ + 1 stands for the attacker's option not to launch any attack, which can be optimal if the chance of overtaking the official branch is too low in view of a potential loss from an unsuccessful attack. To model this opportunity, we define the scrap function for the time argument t = T̃ + 1. Having introduced all ingredients, the choice of the attack time τ* yields the double-spending problem in the optimal stopping formulation: determine the maximum and a maximizer of (23), where T denotes the set of all {0, …, T̃ + 1}-valued stopping times.
The next section deals with a solution to this problem. Like most optimal stopping problems, our double-spending problem (23) is solved in terms of a recursive algorithm rather than by an explicit formula. That is, investigating its solution structure requires numerical experimentation, and the parameter dependence of the optimal strategy is not obvious. Hence, we include a solution code, implemented in R.
In the numerical experiments conducted so far, we observed that the solution is natural and intuitive. Specifically, all calculations exhibit the same behavior: the only optimal strategy is to continue secret mining without launching an attack until the block difference reaches or exceeds a critical value which depends on the model parameters. For instance, the optimal attack is triggered when (and if) the secret branch overtakes the official branch by two or more blocks (we illustrate this with an example). In all experiments, we also observed that the optimal strategy is time-homogeneous: the block difference triggering the attack does not depend on the length of the official branch.

Remark 4.
Let us explain how to interpret these outcomes, derive conclusions and elaborate on what has been gained compared to existing results.
• First, our analysis shows that under the (realistic) assumption that the payment moment can be chosen by the attacker, estimating the success probability of a double-spending attack is an ill-posed question. Specifically, since the increments of the branch length difference follow a geometric distribution (6), the attacker will succeed with probability one by simply repeating the chain bifurcation over and over again, any time after having been overtaken by the official branch. Please note that this argument applies to a repeated sequence of attempts of arbitrary length rather than to a single attempt, as pointed out in the remark after (19).
• Second, the only reason such a strategy is not profitable is the cost of private mining relative to the potential gain/loss from a successful/unsuccessful attack and their probabilities. These economic aspects are crucial and, in contrast to previous work, are reflected in our approach by the gain/loss parameters C and c along with the quantified success/failure probability of the double spending.
• Third, our results can be used for the vulnerability assessment of a given PoW-based system (the author is extremely grateful to an anonymous referee for suggestions which helped addressing PoW stability in terms of optimal switching techniques). However, for this we must include, beyond the gain/loss parameters and the mining capacity ratio, further details: costs of mining, rewards for published blocks, and the option to abandon secret mining at any time, resulting in three operation modes rather than the two of the optimal stopping case. This yields a more complex model, which we sketch in the following section.

Attack Optimization in the Optimal Stopping Formulation
In our numerical approach, we use the length of the secret chain (Z_t)_{t=0}^{T̃+1} from (8) as our state process. According to the observation (19), the evolution of the underlying state process needs to be examined merely on a finite range. Having re-defined the reward (21) in accordance with (19) and (22) as (25), we equivalently re-formulate the problem (23) as (26): determine a maximizer τ* of the mapping T → R, τ ↦ E(R_τ(Z_τ)), where T denotes the set of all {0, …, T̃ + 1}-valued stopping times.
To solve the above optimal stopping problem, we introduce the value functions (Ṽ_t)_{t=0}^{T̃} of (26) in terms of the standard backward induction, which is initialized by the expected value function and followed recursively for t = T̃, T̃ − 1, …. Since Ṽ_t(x) = C for all x > t + n, each value function Ṽ_t(x) needs to be calculated only for the states x = 0, 1, …, t + n. We thus obtain, instead of (27)–(29), a finite recursion. Please note that in the last equality, the conditional expectation can be calculated explicitly. Having determined the value functions Ṽ_t(x) for t = 0, …, T̃ and x = 0, …, t + n, the continuation region C is obtained, and the optimal attack time τ* is the first exit time of the process (Z_t)_{t=0}^{T̃} from this region: τ* = inf{t = 0, …, T̃ + 1 : (t, Z_t) ∉ C}.

Algorithmic Solution
Before we present an algorithmic solution, let us show how to calculate the rewards (21). In order to determine the probability P(S(t) | Z_t = x) in the expression (21), we use the time and space homogeneity of the transition kernel. To calculate the probabilities in this expression, let us consider a truncation of the dynamics (Z_t)_{t=0}^{T} obtained by making the upper and lower ranges of the state space absorbing. Specifically, given lower and upper boundaries l, u ∈ Z in the state space, consider an alternative Markovian dynamic (Z^{(l,u)}_t)_{t=0}^{T} on the truncated state space {l − 1, …, u + 1} ⊂ Z whose transition matrix p^{(l,u)} = (p^{(l,u)}_{x,x′})_{x,x′=l−1}^{u+1} is obtained from the original transition matrix by the truncation procedure. Please note that the evolution of (Z^{(l,u)}_t)_{t=0}^{T} coincides with that of (Z_t)_{t=0}^{T} on all states x with l ≤ x ≤ u, but as soon as (Z_t)_{t=0}^{T} leaves this area, the dynamics becomes trapped in the lower state l − 1 ∈ Z or in the upper state u + 1 ∈ Z, depending on which boundary, l or u, has been crossed. Using this truncation technique, we obtain the required probabilities explicitly.
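The truncation procedure can be sketched in a few lines (a Python sketch; the paper's own implementation is the R routine in the numerical illustration, and the geometric increment kernel P(G = k) = q^k (1 − q) is taken from (6) — the indexing below is an implementation choice of ours):

```python
# Transition matrix of the truncated dynamics (Z^(l,u)_t) on the state
# space {l-1, ..., u+1}: the boundary states l-1 and u+1 are absorbing,
# interior states move by Z_{t+1} = Z_t + 1 - G with P(G = k) = q^k (1-q),
# and all probability mass leaving [l, u] is trapped at the boundaries.

def truncated_kernel(q: float, l: int, u: int) -> list[list[float]]:
    states = list(range(l - 1, u + 2))
    d = len(states)
    idx = {s: i for i, s in enumerate(states)}
    P = [[0.0] * d for _ in range(d)]
    P[0][0] = 1.0        # lower boundary l-1 absorbs
    P[-1][-1] = 1.0      # upper boundary u+1 absorbs
    for s in range(l, u + 1):
        i = idx[s]
        for k in range(0, s + 1 - l + 1):      # land on s + 1 - k >= l
            P[i][idx[s + 1 - k]] = (1 - q) * q ** k
        P[i][0] += q ** (s + 2 - l)            # mass trapped below l
    return P

P = truncated_kernel(q=0.3, l=-2, u=3)
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P)  # stochastic matrix
```

Raising this matrix to the power n + 1 then gives the (n + 1)-step probabilities used in Lemma 1.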

Lemma 1.
(a) Suppose that x ∈ Z and n ∈ N with x ≥ −n. Then, for l, u ∈ Z with l ≤ −n and x + n + 1 ≤ u, the identity (32) holds for x′ = 1, …, x + n + 1. The proof of this lemma is found in Appendix A. Let us outline the use of the above lemma for determining the conditional probabilities (30) on a range of relevant states x ∈ {−n, −n + 1, …, x_max} for T − t ≥ n. First, consider the (n + 1)-step transition probabilities. Using (32), we conclude that the choice l = −n and u = x_max + n + 1 yields the desired probabilities. Using the space homogeneity of the transition kernel (6), we shift all states and boundaries by n + 2, which yields l = 2 and u = 2n + x_max + 3. Similarly, with the same boundaries l = 2 and u = 2n + x_max + 3, we obtain for all x ≥ −n the probabilities (p^{(l,u)})^{n+1}_{x+n+2,y}.
Moreover, given k ∈ N, the boundaries l = 2 and u = k + x_max + 1 yield the corresponding probabilities on the relevant states. In order to calculate the conditional probability (30) for t ∈ {0, …, T̃} and x = −n, …, t, the above truncation technique applies, where the matrices P and Q are obtained by setting x_max = t and k = T̃ − t. Please note that with the function (33), P(S(t) | Z_t = x) = w^{T̃,t}_{q,n}(x) for t = 0, …, T̃ and x = −n, …, t, which shows that for large T̃ = T − n, w^{T̃,t}_{q,n}(x) ≈ r_{q,n}(x) for t = 0, …, T̃ and x = −n, …, t.
Having calculated (30) in this way, the reward (25) is obtained for t = 0, …, T̃. These results can be combined to formulate an algorithm that calculates the optimal stopping time and the optimal value function: • Step 1: Initialize the backward induction by Ṽ^E_{T̃}(x) = 0 for x = 0, …, T̃ + n, and set t := T̃.
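The backward induction initialized in Step 1 can be sketched generically (a Python sketch with a toy reward and kernel, not the paper's exact reward (25); all names and the toy example are ours):

```python
# Generic finite-horizon backward induction for an optimal stopping
# problem of the type (26)-(29): V_t(x) = max(reward, continuation).

def backward_induction(T, states, reward, kernel, scrap):
    """Returns the value functions V[(t, x)] and the stopping region,
    i.e. the (t, x) where immediate stopping is optimal."""
    V = {(T + 1, x): scrap(x) for x in states}
    stop = set()
    for t in range(T, -1, -1):
        for x in states:
            cont = sum(p * V[(t + 1, y)] for y, p in kernel(x))
            here = reward(t, x)
            V[(t, x)] = max(here, cont)
            if here >= cont:
                stop.add((t, x))
    return V, stop

# toy example: reflected random walk on {-3, ..., 3},
# stopping reward x, zero scrap value
states = range(-3, 4)

def kernel(x):
    return [(max(x - 1, -3), 0.5), (min(x + 1, 3), 0.5)]

V, stop = backward_induction(5, states, lambda t, x: float(x),
                             kernel, lambda x: 0.0)
```

In the paper's setting, the kernel is the truncated geometric transition matrix and the reward is (25); the structure of the recursion is unchanged.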
Next, let us use the secret branch length (Z_t)_{t=0}^{T+1} from (8) as a state process and introduce control costs as a function of this state. If mining is abandoned (p = 1), then there are no costs: r_t(1, z, a) = 0 for all z ∈ N, a ∈ A, t = 0, …, T.
In the mode p = 2, abandoning secret mining by a = 1 has two interpretations. If the attacker is ahead of the official chain (Z_t = z > t), then the secret chain will be published and the attacker receives a reward ρ ≥ 0 for all blocks mined so far. Similarly, if the attack has been launched (p > 2), then again mining costs must be paid. In this mode (p > 2), abandoning secret mining has again two interpretations. If the attacker is ahead of the official chain (Z_t = z > t), then the secret chain will be published and the attacker receives a reward for each block mined so far. Furthermore, if the official chain was overtaken and at least n confirmation blocks have been received (which corresponds to Z_t = z > 0 and p = 2 + n), then also a revenue C > 0 from the successful double spending is collected. However, if there are not enough confirmation blocks (2 < p < 2 + n), then the attack was unsuccessful, which causes a loss c > 0. At the end t = T + 1 of the time horizon, if an attack has been launched but the secret chain was not published (2 < p ≤ 2 + n), then the attack was unsuccessful, which yields a loss c > 0: r_{T+1}(p, z) = −c·1_{{2<p≤2+n}} for all z ∈ N, p ∈ P.
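The reward structure described above can be collected in code (a sketch: the per-period mining cost kappa, all parameter values, and the exact mode convention are illustrative assumptions reconstructed from the prose, since the displayed formulas are not reproduced here):

```python
# Mode convention (an assumption based on the prose): p = 1 mining
# abandoned, p = 2 secret mining, p = 2 + j attack launched with j
# confirmation blocks received; action a = 1 abandons secret mining.

def running_reward(t, p, z, a, n, rho=1.0, C=50.0, c=10.0, kappa=0.5):
    """Running reward r_t(p, z, a); kappa is an assumed mining cost."""
    if p == 1:
        return 0.0                      # no costs once abandoned
    if a == 0:
        return -kappa                   # keep paying secret-mining costs
    r = rho * z if z > t else 0.0       # publish branch if ahead
    if p > 2:                           # attack already launched
        if p == 2 + n and z > 0:
            r += C                      # successful double spending
        else:
            r -= c                      # not enough confirmations
    return r

def terminal_reward(p, z, n, c=10.0):
    """r_{T+1}(p, z) = -c * 1{2 < p <= 2 + n}."""
    return -c if 2 < p <= 2 + n else 0.0
```

Such a transcription makes the three operation modes and the role of the parameters ρ, C, c explicit before the switching algorithm is formulated.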
With the above specifications, we introduce the double-spending problem in the optimal switching formulation (40) as follows: given (p_0, z_0) ∈ E, determine the maximum and a maximizer. Recall that the system starts with the chain bifurcation (z_0 = 0) in the secret-mining mode (p_0 = 2). Once the optimal strategy π* from (40) is determined, the PoW vulnerability can be assessed in terms of the optimal policy value at this point. Specifically, if v^{π*}_0(p_0, z_0) = v^{π*}_0(2, 0) = 0, then there is no profitable double-spending attack (41), since π* yields the same gain/loss as abandoning secret mining immediately.

Remark 5.
In practice, assessing PoW stability may require more complex considerations than solving (40). As mentioned earlier, secret miners can slow down honest mining. The idea (see [14] (the author thanks an anonymous referee) and refs. [6–8]) is that, once ahead of the official chain, secret miners can reveal blocks from their private branch to the public such that the honest miners switch to the recently revealed blocks, abandoning their shorter public branch. This strategy leads honest miners to waste resources on blocks that have already been mined, in the sense that they are then working on the secret chain while remaining behind the secret miners.

Remark 6.
Selfish mining can be considered an attack with the purpose of obtaining larger mining rewards than those of the honest pool, or of dominating the mining capacity in order to govern the network. In some sense, selfish mining is already part of our optimal double-spending problem due to the revenue from the secretly mined blocks. However, we do not model a strategy for secret block publications; thus, the core mechanism of selfish mining is not included in the present approach. These aspects should be considered to further refine the double-spending analysis.

Attack Optimization in the Optimal Switching Formulation
Given the state process (Z_t)_{t=0}^{T+1} from (8) and the switching matrix (39), the optimal control problem (40) is solved via the backward induction (13), (14). However, (14) requires determining an expectation with respect to a geometric distribution, which involves an infinite summation. Still, this calculation can be reduced to a finite number of operations, since our value functions are constant from a sufficiently large state variable onwards. More precisely, we verify below that our assumption that there is a maximal chain length T which can be abandoned implies that ν*_t(p, z) = ν*_t(p, T + 1) for all z ∈ N with z > T. Specifically, the value functions ν*_t(p, z) of the optimal policy can be explicitly calculated for large values z > T as in (42). Indeed, if the secret chain length z exceeds the maximal branch length T which can be abandoned, then further mining is not profitable, and the gain from publishing a longer branch is 1_{{1<p}} ρ(T + 1). Furthermore, having a secret branch of this length z > T, there is a good chance of the double-spending attack succeeding. For this, the publication of the chain should be postponed until the last confirmation block is received. Please note that the number of confirmation blocks still required is n − (p − 2), whereas T − t is the number of official blocks to be received until the time horizon ends. Suppose that the attack has already been launched (p > 2); then it succeeds if n − (p − 2) ≤ T − t, with the publication of the secret branch (by block T + 1) after the official block t + n − (p − 2) ≤ T, immediately after the last confirmation block is received. This explains the term C·1_{{t≤T−n+p−2}} in the expression (42). Otherwise, if n − (p − 2) > T − t, then the last confirmation block arrives after the official block T; thus, the attack does not succeed, which gives the loss term −c·1_{{t>T−n+p−2}} in the formula (42).
If the network has not been attacked yet (p = 2), then the attack shall be launched immediately if n ≤ T − t; otherwise there will be no attack. This ensures that the last confirmation block is received at t + n ≤ T, before the end of the time horizon, giving the gain term 1_{{p=2}} C·1_{{t≤T−n}} in (42).
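The explicit scrap values for z > T described around (42) can be transcribed directly (a sketch reconstructed from the prose above, so the exact form should be treated as an assumption; the parameter values are illustrative):

```python
def scrap_value(t: int, p: int, T: int, n: int,
                rho: float = 1.0, C: float = 50.0, c: float = 10.0) -> float:
    """nu*_t(p, z) for z > T: gain 1_{1<p} rho (T+1) from publishing the
    long branch, plus the attack outcome depending on whether the
    remaining n - (p - 2) confirmation blocks still fit before T."""
    v = rho * (T + 1) if p > 1 else 0.0
    if p > 2:                               # attack already launched
        v += C if t <= T - n + p - 2 else -c
    elif p == 2:                            # launch now if n <= T - t
        v += C if t <= T - n else 0.0
    return v

# example: plenty of time left vs. horizon already reached
print(scrap_value(t=0, p=2, T=100, n=6))    # 101 + 50 = 151.0
print(scrap_value(t=100, p=3, T=100, n=6))  # 101 - 10 = 91.0
```

These constants terminate the backward induction after finitely many summands, which is exactly what makes the geometric expectation in (14) computable.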
Let us summarize the control costs formulated in Section 7 for t = 0, …, T, p ∈ P, and z ∈ N, and provide a solution to the stochastic switching problem (40) in terms of the following algorithm: • Step 1: Calculate the expected value function using the scrap values from (45): ν^E_{T+1}(p, z) = r_{T+1}(p, z) for p ∈ P, z = 0, …, T, and set t := T.
We provide a numerical illustration of this algorithm in Section 8.2.

Remark 7.
Our approach provides several advantages compared to the infinite-horizon Markov decision framework applied in [6–8], due to the following aspects: Using a finite time horizon, we consider a wider policy class than the infinite-horizon discounted-reward approach (all policies instead of only the stationary ones). Furthermore, we obtain exact solutions by a finite number of algebraic operations (rather than relying on convergence). As a result, all our policy values are expressed exactly in present-time monetary units, since there is no artificial discounting. Please note that, unlike in the infinite-horizon discounted-reward setting, monetary policy values allow direct conclusions, since there is no need for comparison and benchmarking. Finally, using a finite number of operational switching modes yields a compact and natural problem description with few actions and a relatively small state space. Nevertheless, Markov decision models (particularly those from [6]) address and manage many technical details using existing Markov decision solvers.

Numerical Illustration of Optimal Stopping
Let us illustrate the above algorithm by an implementation in the scientific computing language R. We define all auxiliary functions required by (33) (the matrices P and Q in (34) and (35)):

rm(list = ls())  # remove all objects
library(expm)    # install.packages("expm")
make_matrix <- function(q, d) # routine to generate the matrices
{                             # required by the function w()
  stopifnot(d >= 2)
  mat <- matrix(data = 0, nrow = d, ncol = d)
  mat[d, d] <- mat[1, 1] <- 1
  for (i in 2:(d - 1)) {
    for (j in 2:(i + 1)) mat[i, j] <- (1 - q) * q^(i + 1 - j)
    mat[i, 1] <- q^(i)
  }
  return(mat)
}

and the transition matrix of (Z_t)_{t=0}^{T̃} required by (38). The result of this calculation is illustrated in Figure 2, which depicts the continuation region by dashed lines (in black) and the stopping region by solid lines (in red). Recall that we agreed to consider the states visited by (Z_t)_{t=0}^{T} for a block difference not greater than n, due to (19). Hence, the graph of the relevant states forms a triangle-type figure whose bottom range turns out to be the stopping region. In fact, we observe that the conditions for launching a double-spending attack are reached if the block difference between the official and the secret branches attains a critical value (−2 in this calculation). In line with our intuition, this means that the attacker must wait until the private chain overtakes the official one by at least two blocks. Thereafter, the payment shall be placed (the attack launched), while the secret mining must continue until the end of the time horizon. Surprisingly, this critical value (−2) does not depend on time; thus, the optimal exercise strategy is time-homogeneous, which is rarely seen in finite-horizon optimal stopping problems.
This phenomenon is observed for diverse sets of parameters in all numerical calculations and can be explained by a weak dependence of the rewards on time and by time-homogeneity at the last time point T̃ = T − n at which the attack can be launched, having in mind that the race effectively continues until T, by construction.
Figure 2. Continuation and stopping regions of the stopping problem, based on (18). The continuation region is depicted by dashed lines and the stopping region by solid lines.

Remark 8.
As expected, the optimal stopping policy heavily depends on the model parameters and changes with the number of required confirmation blocks, the costs, the rewards, and the capacity ratio. The interested reader is encouraged to experiment with our code for different parameters to investigate diverse situations.

Numerical Illustration of Optimal Switching
We illustrate the algorithm presented above by an implementation in R. First, let us define the matrix for the calculation (47) of the conditional expectation by the same code as in the second listing from Section 8.1. The backward induction then yields, beyond the value function maximization, also the maximizing actions.
20% for C = 50 and C = 100. Let us emphasize that such calculations can be used to determine the number of confirmation blocks required to secure a transaction, depending on its size.

Conclusions
Unfortunately, even relatively unsophisticated (classical, textbook-style) double-spending attacks happen regularly. These malicious actions cause huge losses to investors and jeopardize the prospects of the promising blockchain technology. In fact, the situation is worrying: the double-spending problem concerns more than a single payment which may disappear later. The sheer possibility of rewriting the ledger with a deep re-organization of its blocks may have enormous consequences.
In this work, we show that planning an attack on a PoW-based system can be formulated as an optimal sequential decision problem. To this end, we consider two cases: a simplified model of a double-spending attack, which can be treated as an optimal stopping problem, and a more detailed model, which requires the optimal stochastic switching toolbox.
In the optimal stopping situation, the strategy consists of secret mining followed by a later payment. The optimal payment moment is determined by the length difference between the official and secret chains since their fork-off and depends on the model parameters (the mining capacity ratio, the number of confirmation blocks, and the revenue/loss from the success/failure of the attack). A more complex stochastic switching model upgrades this framework by introducing the option to abandon secret mining at any time. Furthermore, the switching model provides a more realistic context since it takes into account mining costs and rewards for published blocks. Most importantly, the optimal strategy can be used to determine whether it is worth attacking a given PoW-based system. This insight may allow important conclusions on its vulnerability. However, to address this topic in an entirely realistic setting, the present models must be further developed to include propagation delay and uncertainty in observations. Furthermore, complex consensus protocols based on (delegated) proof of stake, proof of storage, proof of burn, or their combinations must be investigated from a similar perspective. Finally, the possibility to slow down honest mining by diverse malicious actions (in the spirit of [14]) must also be examined. Here, a deeper understanding of the natural bifurcations of the official chain (which slow down its growth) and of the attacker's opportunities to enforce them (by publishing blocks, jamming the network, and causing propagation delays) is crucial. All these problems must be systematically addressed to improve the stability of blockchain systems.
Acknowledgments: This work would not have been possible without the advice, help, kind support, and very significant contributions of Peter Taylor. The author would also like to thank the anonymous referees for their criticism and remarks, which helped improve this work. In particular, the author expresses deepest gratitude to the referee who suggested an investigation of PoW stability in terms of optimal switching techniques. Furthermore, the author appreciates helpful communication with the editor of MDPI and thanks F. Hinz and P. Hinz for discussions and Vonida UG (haftungsbeschränkt) for its support.

Conflicts of Interest:
The author declares no conflicts of interest.

Appendix A
Let us derive the assertions (6) and (7) from the relation between mining capacities.
Proof. The time required to complete the next block follows an exponential distribution, since the process of mining can be described by repeated attempts to solve a cryptographic puzzle by independent random trials. Indeed, taking into account that the waiting time to the first success in a sequence of Bernoulli trials is geometrically distributed and that the time spent on each trial is short, the exponential distribution provides an excellent approximation for the time required to complete the next block. For this reason, the numbers of blocks mined in the secret and the official branch since their fork-off can be described in physical time u ∈ R₊ by independent Poisson processes (N^S_u)_{u∈R₊}, (N^O_u)_{u∈R₊}. Furthermore, the corresponding intensities λ^S, λ^O ∈ ]0, ∞[ are determined by the mining capacity ratios and the total difficulty (for details, see [1,5]) and are proportional, λ^S = λq, λ^O = λ(1 − q), to the capacity fractions q, (1 − q) ∈ ]0, 1[ of the miners. The factor λ ∈ ]0, ∞[ incorporates the difficulty of mining. That is, the probability of having mined j ∈ N secret blocks during the time required for the completion of one official block is given by (the author thanks Florian Hinz) P(N^S_t = j | t is the first jump time of (N^O_u)_{u∈R₊}) = q^j (1 − q). The proof of Lemma 1 is given below:
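The geometric form of this race can be verified numerically (a Monte Carlo sketch under the stated Poisson assumptions; the seed, the rate λ, and the sample size are arbitrary choices of ours, and λ cancels from the result):

```python
# Monte Carlo check of the geometric race distribution: the number of
# secret blocks (rate lam * q) completed before the first official
# block (rate lam * (1 - q)) should follow P(j) = q^j * (1 - q).
import random

def race_counts(q: float, lam: float = 1.0,
                trials: int = 200_000, seed: int = 7) -> dict[int, float]:
    rng = random.Random(seed)
    counts: dict[int, int] = {}
    for _ in range(trials):
        official = rng.expovariate(lam * (1 - q))  # first official block
        j, t = 0, 0.0
        while True:
            t += rng.expovariate(lam * q)          # next secret block
            if t > official:
                break
            j += 1
        counts[j] = counts.get(j, 0) + 1
    return {j: k / trials for j, k in counts.items()}

q = 0.3
freq = race_counts(q)
for j in range(4):
    print(j, round(freq.get(j, 0.0), 4), round(q**j * (1 - q), 4))
```

The empirical frequencies match q^j (1 − q) up to Monte Carlo error, in line with the derivation of (6).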

Proof.
(a) To show (32), recall that, given x ∈ Z and n ∈ N with x ≥ −n and x′ ∈ {1, …, x + n + 1}, the probability P(Z_{n+1} = x′ | Z_0 = x) is the sum over the probabilities of all trajectories of (Z_i)_{i=0}^{n+1} which start at x and finish at x′ (A1). Please note that each such trajectory cannot exceed x + n + 1, since at each time i the Markov chain can jump up, Z_{i+1} = Z_i + 1, by one unit at most. For the same reason, each trajectory (A1) also cannot go below −n, since otherwise it would not reach x′ ∈ {1, …, x + n + 1}; i.e., the dynamics (Z_i)_{i=0}^{n+1} can be equivalently replaced by (Z^{(l,u)}_i)_{i=0}^{n+1} in (A1) if (l, u) satisfies l ≤ −n and x + n + 1 ≤ u. The transition probabilities of (Z^{(l,u)}_i)_{i=0}^{n+1} over n + 1 steps can be obtained from the entries of the power (p^{(l,u)})^{n+1} of its transition matrix p^{(l,u)}, which shows (32). Next, in order to determine (31), we use (32) to observe that the required probability is given by the sum Σ_{y=l−1}^{0} (p^{(l,u)})^{n+1}_{x,y}.
(b) Consider P(min_{i=0,…,k} Z_i ≤ 0 | Z_0 = x′) as the sum over the probabilities of all sample paths of (Z_i)_{i=0}^{k} which start at x′ ≥ 1 and enter the set of non-positive states. With the same arguments as in the proof of part (a), the process (Z_i)_{i=0}^{k} can be replaced in (A1) by the truncated dynamic (Z^{(l,u)}_i)_{i=0}^{k} with (l, u) satisfying l = 1 and x′ + k ≤ u, which gives the assertion and finishes the proof.