computers Mind Your Outcomes: The ∆ QSD Paradigm for Quality-Centric Systems Development and Its Application to a Blockchain Case Study †

: This paper directly addresses a long-standing issue that affects the development of many complex distributed software systems: how to establish quickly, cheaply, and reliably whether they can deliver their intended performance before expending signiﬁcant time, effort, and money on detailed design and implementation. We describe ∆ QSD, a novel metrics-based and quality-centric paradigm that uses formalised outcome diagrams to explore the performance consequences of design decisions, as a performance blueprint of the system. The distinctive feature of outcome diagrams is that they capture the essential observational properties of the system, independent of the details of system structure and behaviour. The ∆ QSD paradigm derives bounds on performance expressed as probability distributions encompassing all possible executions of the system. The ∆ QSD paradigm is both effective and generic: it allows values from various sources to be combined in a rigorous way so that approximate results can be obtained quickly and subsequently reﬁned. ∆ QSD has been successfully used by a small team in Predictable Network Solutions for consultancy on large-scale applications in a number of industries, including telecommunications, avionics, and space and defence, resulting in cumulative savings worth billions of US dollars. The paper outlines the ∆ QSD paradigm, describes its formal underpinnings, and illustrates its use via a topical real-world example taken from the blockchain/cryptocurrency domain. ∆ QSD has supported the development of an industry-leading proof-of-stake blockchain implementation that reliably and consistently delivers blocks of up to 80 kB every 20 s on average across a globally distributed network of collaborating block-producing nodes operating on the public internet.


Introduction
In order to avoid expensive design and implementation failures, it is critical to establish sufficiently early in the design cycle that software systems will meet both their functional requirements and their non-functional requirements. This paper describes ∆QSD, a novel metrics-based and quality-centric paradigm that uses formalised outcome diagrams to explore the performance consequences of design decisions, and so to determine system viability ahead of expensive implementation work. The paradigm has been successfully used in a number of commercial settings, including telecommunications, avionics, and space and defence. The paper introduces the concepts underlying ∆QSD, formalises these,

1.
System requirements are often vague and/or contradictory, and they can change both during and after development; 2.
Complexity forces hierarchical decomposition of the problem, creating boundaries, including commercial boundaries with third-party suppliers, that may hinder optimal development and hide risks; 3.
Time pressure forces parallel development that may be at odds with that hierarchical decomposition, and it encourages leaving 'tricky' issues for later, when they tend to cause re-work and overruns and leave tail-risks; 4.
Cost and resource constraints force resources to be shared both within the system and with other systems (e.g., when network infrastructure or computing resources are shared); they may also require re-use of existing assets (own or third-party), introducing a degree of variability in the delivered performance; 5.
The performance of particular components or subsystems may be incompletely quantified; 6.
System performance and resource consumption may not scale linearly (which may not become apparent until moving from a lab/pilot phase to a wider deployment); 7.
At scale, exceptional events (transient communications and/or hardware issues) can no longer be treated as negligibly rare, and their effects and mitigation need to be considered along with the associated performance impacts.
Thus, what is needed is (1) a way of capturing performance and resource requirements that accommodate all the various sources of uncertainty; and (2) a process for decomposing a top-level requirement into subsystem requirements that provides confidence that satisfying all the lower-level requirements will also satisfy the top-level one. For functional aspects of system behaviour, there are various ways of dealing with this [7]. However, while established software engineering approaches do exist for dealing with performance [8], these all have significant limitations.

The ∆QSD Systems Development Paradigm
This paper directly addresses those issues by defining the ∆QSD systems development paradigm and providing a high-level formalism that can be used throughout the system development process. ∆QSD is a quality-centric paradigm, focusing on meeting timeliness constraints and an acceptable failure rate of the top-level outcomes with acceptable resource consumption. The paradigm has been used successfully by a small team in Predictable Network Solutions in a variety of large industrial projects, collectively saving billions of dollars and person-centuries of development effort. It informs high-level management and system design decisions by showing where conflicts exist (or may exist) between system designs and required outcomes. It is able to compute the predicted performance at any stage of the design process, where performance is seen broadly as comprising timeliness, behaviour under load, resource consumption, and other key system metrics.
Central to ∆QSD is the concept of an outcome, which is defined as a specific system behaviour with specified start and end points, localised in space and time. In ∆QSD, the system engineer models the system as an outcome diagram, which is a graph that captures the causal relationships between system outcomes. ∆QSD defines a system design as a sequence of outcome diagrams that capture the essential observational properties of the system, independent of the details of system structure and behaviour. This sequence starts with a fully unspecified system and ends with either (i) a fully specified (or a convincingly specified-enough) system (deemed as constructible), or (ii) the conclusion that the system goals are infeasible.
The formalism allows exploration of the design space by assessing the consequences of the decisions that are taken (and possibly retracted) at each refinement step, giving rise to threaded decision trees. For each partially specified system, we compute the predicted timeliness and behaviour and resource consumption of the system under load, obtaining one of three possible conclusions: (1) infeasibility-hence, ceasing further development and revising former design decisions; (2) slack-hence, ceasing further optimisation because the system is good enough; or (3) indecisiveness-hence, requiring additional scrutiny until one of the alternative conclusions can be drawn. The paper gives one large example, blockchain diffusion, that illustrates how ∆QSD can be used in practice, explaining how the formalism can be used to drive the design process and associated decision making. This example is a real-world application that is in continuous use as a core part of the Cardano blockchain technology (https://cardano.org/ (accessed on 14 December 2021)).
Most performance analysis approaches require the system to be fully specified, or even implemented, which is a serious disadvantage, since it does not allow the properties of subsystems to be encapsulated and hierarchically (de)composed. By contrast, ∆QSD satisfies compositionality, the principle that the meaning of a complex expression is determined by the meanings of its constituent expressions and the rules that are used to combine them. For compositional properties, what is "true" about subsystems (e.g., their timeliness, their resource consumption) is also "true" about their (appropriate) combination: there exists an invariant (e.g., timeliness, aspects of functional correctness) that must hold over the reified components of the system. This is key to managing complexity within the systems' development life-cycle.
In the broader software development space, functional programming techniques are improving the compositionality of functional aspects of software systems, and they can deliver high assurance of functional correctness when combined with appropriate formal methods [9]. The ∆QSD paradigm represents a similar step change in handling the "nonfunctional" aspects of performance and resource consumption. By treating delay and failure as a single object, called 'quality attenuation', our paradigm can be thought of as a combination of passage time analysis and failure mode effects analysis (FMEA).

Main Contributions of this Paper
The main contributions of this paper are as follows: 1.
Introduce ∆QSD, a formalism (Section 5) that focuses on rapidly exploring the performance consequences of design and implementation choices, where: (a) Performance is a first-class citizen, ensuring that we can focus on details relevant to performance behaviour; (b) The whole software development process is supported, from checking the feasibility of initial requirements to making decisions about subtle implementation choices and potential optimisations; (c) We can measure our choices against desired outcomes for individual users (customer experience); (d) Analysis of saturated systems is supported (where a "saturated system" is one with resources that have reached their limits, e.g., systems with high load or high congestion); (e) Analysis of failure is supported.
We use term-rewriting for formalising refinements (Definition 3 in Section 5) and denotational semantics for formalising timeliness analysis (Section 5.3) as well as load analysis (Section 5.4).

2.
Describe key decisions made in the development process of a real system-i.e., the Cardano blockchain, which is presented as a running example-and show how ∆QSD is able to quickly rule out infeasible decisions, predict behaviour, and indicate design headroom (slack) to decision makers, architects, and developers (Section 4).
While the ∆Q concept has been described in earlier papers [10,11] and used to inform a number of large-scale system designs, these previous contributions have only used it in an informal manner. By providing a formal definition of ∆QSD, and showing how it can be used in practice, we are taking an important step towards a general evidence-based engineering methodology for developing real-time distributed systems.

Structure of the Paper
This paper has the following structure: • Section 2 introduces the running example that we will use throughout the paper: block diffusion in the Cardano blockchain. • Section 3 defines the basic concepts that underlie the ∆QSD formalism: outcomes, outcome diagrams, and quality attenuation (∆Q). We also compare outcome diagrams with more traditional diagrams such as block diagrams. • Section 4 gives a realistic example of the ∆QSD paradigm, showing a step-by-step design of block diffusion (introduced in Section 2) based on quality analysis. This example introduces the basic operations of ∆QSD in a tutorial fashion. The example uses realistic system parameters that allow us to compute predicted system behaviour. • Section 5 gives the formal mathematical definition of ∆QSD and its main properties. With this formal definition, it is possible to validate the computations that are used by ∆QSD as well as to build tools based on ∆QSD. • Section 6 gives a comprehensive discussion about related work from three different viewpoints: theoretical approaches for performance analysis (Section 6.1), performance design practices in distributed systems (Section 6.3), and programming languages and software engineering (Section 6.4).
• Section 7 summarises our conclusions, discusses some limitations of the paradigm, and describes our plans to further validate ∆QSD and to build a dedicated toolset for real-time distributed systems design that builds on the ∆QSD paradigm.

Running Example: Block Diffusion in the Cardano Blockchain
A blockchain is a form of distributed ledger. It comprises a number of blocks of data, each of which provides a cryptographic witness to the correctness of the preceding blocks, back to some original 'genesis' block (a 'chain' of blocks, hence 'blockchain') [12]. Nodes in the system use some specified protocol to arrive at a distributed consensus as to the correct sequence of blocks, even in the presence of one or more 'adversaries' that aim to convince other nodes that a different sequence is correct. One such consensus protocol is Ouroboros Praos [13], which underpins Cardano (https://www.cardano.org (accessed on 14 December 2021)), one of the world's leading cryptocurrencies. Ouroboros Praos uses the distribution of 'stake' in the system (i.e., the value of the cryptocurrency tokens that are controlled by each node) to randomly determine which node (if any) is authorised to produce a new block in the chain during a specific time interval (a 'slot'); the more stake a node controls, the more likely it is to be authorised to produce a block. For this to be effective, it is important that the block-producing node has a copy of the most recently produced block, so that the new block can correctly extend the existing chain. Since the block producer is selected at random, this means that the previous block needs to have been copied to all block-producing nodes; we call this process 'block diffusion'. Since blocks are produced on a predetermined schedule and each block depends on its predecessor, block diffusion is a real-time problem; each block must be diffused before the next block can be produced. In order to be robust, the consensus algorithm is designed to withstand some imperfections in block diffusion; hence, the effective requirement is that blocks should be well-diffused "sufficiently often". Put another way, the probability that a block fails to arrive in time for the production of the next block must be suitably bounded. The engineering challenge is to quantify this probability as a function of the design and of the parameter choices of the implementation.
The scale of the challenge is illustrated by Cardano. Cardano is a global-scale distributed system that eschews centralised management. At the time of writing, 2948 globallydistributed nodes cooperate to produce and distribute blocks for $45.77B of cryptocurrency that is associated with 956,092 distinct user addresses. The stake distribution at the time of writing is shown in Figure 1.
In Cardano, slots are one second long and blocks are produced every 20 s on average. An initial implementation of Cardano (code-named 'Byron') was functionally correct but proved incapable of effective block diffusion without rigid control of the nodes and their topology; a re-implementation (called 'Jormungandr') targeted higher performance by using a different programming language (Rust instead of Haskell), but this also missed the block diffusion target by a wide margin. A further, and ultimately successful, reimplementation (called 'Shelley' [14]) used Haskell to retain strong correctness assurances but applied the principles that are discussed in this paper to also ensure adequate performance in a fully decentralised deployment. ). The y-axis represents the number of blocks produced by each "stake pool"; the x-axis represents the stake that is held by the pool (in Ada).

Key Design Decisions
In the design of Shelley, a number of inter-related decisions had to be made. These included the following:

1.
How frequently should blocks be produced? Proof-of-Work systems are limited in their throughput by the time taken to 'crack' the cryptographic puzzle; proof-ofstake systems do not have this limitation and so have the potential for much higher performance both in terms of the volume of transactions embedded into blocks and the time take for a transaction to be fully incorporated in the immutable part of the chain. Thus, the interval between blocks is a key parameter.

2.
How are nodes connected? It might seem that connecting every node to every other would minimise block diffusion time; however, the lack of any control over the number and capabilities of nodes makes this infeasible. Nodes can only be connected to a limited number of peer nodes; then, the number of connected peers and how they are chosen become important.

3.
How much data should be in a block? Increasing the amount of data in a block improves the overall throughput of the system but makes block diffusion slower.

4.
How should blocks be forwarded? Simply forwarding a new block to all connected nodes would seem to minimise delay, but this wastes resources, since a node may receive the same block from multiple peers. In the extreme case, this represents a potential denial-of-service attack. Splitting a block into a small header portion (sufficient for a node to decide whether it is new) and a larger body that a node can choose to download if it wishes mitigates this problem but adds an additional step into the forwarding process.

5.
How much time can be spent processing a block? Validating the contents of a block before forwarding it mitigates adversarial behaviour but can be computationally intensive, since the contents may be programs that need to be executed (called 'smart contracts'); allowing more time for such processing permits more, and more complex, programs but makes block diffusion slower.
The remainder of this paper shows how such design decisions can be quantified using the ∆QSD paradigm.

Formulating the Problem
We assume that a collection of blockchain nodes is assembled into a random graph (randomness is important in a blockchain setting for mitigating certain adversarial behaviours). In each time slot, a randomly chosen node may generate a block, and we are interested in the probability that the next randomly chosen node has received that block before it generates the next block.

Problem Statement
Starting from blockchain node A, what is the probability distribution of the time taken for a block to reach a different node Z when A and Z are picked at random from the graph?
Since the graph is random with some limited node degree N, there is a strong chance that A is not directly connected to Z, and so, the block will have to pass through a sequence of intermediate nodes B, C, . . . The length of this sequence is a function of the size and node degree of the graph [15]. The (distribution of) time to forward a block directly from one node to another is known (e.g., by measurement).

Foundations
In the remainder of this paper, we will take the system of discourse to be fixed for the design engineer. We assume that this system has a number of tasks that must be performed. In order to perform a task that is not considered to be atomic by the design engineer, the system might need to perform several other subtasks. The process of clarifying the details of the system by breaking higher-level tasks into such subtasks is what we call refinement (Definition 3 in Section 5). By refining a system, one goes from a coarser granularity of the design to a finer one (see Sections 4.1-4.3 for examples). Sometimes, the design engineer will return to a coarser grained design, as discussed in Section 7.1.2, in order to take a different direction of refinement (see  for examples). Reasons why they might want to do so include: to investigate other aspects of their system; to compare two alternative design choices; or because a refinement fails to meet the necessary performance or other requirements. Thus, ∆QSD is design exploration in the world of refinements.
This section sets the stage for presenting design exploration in action (Section 4) by introducing the fundamental concepts: outcomes (Section 3.1), outcome diagrams (Section 3.2), and quality attenuation (Section 3.3). Then, it gives a simple example of how to approach problems à la ∆QSD (Section 3.4). This section ends in a discussion on why ∆QSD advises a new diagram in the presence of all the existing ones in Software Engineering (Section 3.5).

Outcomes
An outcome is what the system obtains by performing one of its tasks. Each task has precisely one corresponding outcome, and each outcome has precisely one corresponding task. We say that an outcome is 'performed' to mean that the corresponding task of an outcome is performed. Likewise, we might use task adjectives for outcomes too, even though outcomes and tasks are inherently different. For example, by an atomic outcome, we mean an outcome whose corresponding task is itself atomic.
We take an event-based perspective, in which each outcome has two distinct sets of events: the starting set of events (any one of which must happen before the task can commence) and the terminating set of events (at least one of which must happen before the task can be considered complete). Each of those sets consists of events that are of particular interest (as opposed to just any event). We call such events of interest the observables. For example, an observable in the starting set, S o , of an outcome o is of interest because it signifies the point in time and 3D location at which o begins. Likewise, an observable from the terminating set, T o of o is an event that contains information regarding the location where o finishes. While it may seem unusual to refer explicitly to location in a computer science context, when considering distributed systems, the outcomes of interest are precisely those that begin at one location and finish at another. Of course, once an observable from S o occurs, there is no guarantee that one from T o will occur within o's duration limit, d(o) (i.e., the relative time by which o is required to complete). However, when an observable T o does occur within the duration limit after one from S o , o is said to be done.
Diagrammatically, we show an outcome using an orange circle. As shown in Figure 2, we depict the starting set and the terminating set of an outcome using small boxes to the left and right of the outcome's circle, respectively. The starting set is connected to the outcome from the left, and the terminating set is connected to the outcome from the right. When they are unimportant for an outcome, we do not include the starting set and the terminating set of that particular outcome in the diagram. We consider one special kind of outcome. Consider the situation where a design engineer is aware that an outcome is not atomic. They will eventually need to further break the outcome into its suboutcomes. Nevertheless, the current level of granularity is sufficient to carry out a particular analysis (see Sections 5.3 and 5.4 for two example analyses). In ∆QSD, a black box can be used for that particular outcome. Black boxes are those outcomes that achieve one of the following:

1.
Can be easily quantified without even a need for them to be named; 2.
Are beyond the design engineer's control (and so may need to be quantified by external specification or measurement); or, 3.
Are ones for which the design engineer has intentionally left the details for later.
Outcome variables are the variables that we use to refer to a given outcome.

Outcome Diagrams and Outcome Expressions
The description of a system in terms of its outcomes requires the causal relationships between the outcomes to be captured. In ∆QSD, these relationships are captured in outcome diagrams. In addition to its graphical presentation, each outcome diagram can be presented algebraically, using its corresponding outcome expression. As shown in Figure 3, outcome diagrams offer four different ways to describe the relationships between outcomes.

Quality Attenuation (∆Q)
From the perspective of a user, a perfect system would deliver the desired outcome without error, failure, or delay, whereas real systems always fall short of this; we can say that the quality of their response is attenuated relative to the ideal. We denote this 'quality attenuation' by the symbol ∆Q and reformulate the problem of managing performance as one of maintaining suitable bounds on ∆Q [16]. This is an important conceptual shift because 'performance' may appear to be something that can be increased arbitrarily, whereas ∆Q (similar to noise) is something that may be minimised but that can never be completely eliminated. Indeed, some aspects of ∆Q, such as the time for signals to propagate between components of a distributed system, cannot be reduced below a certain point.
Since the response of the system in any particular instance can depend on a wide range of factors, including the availability of shared resources, we model ∆Q as a random variable. This allows various sources of uncertainty to be captured and modelled, ranging from as-yet-undecided aspects of the design, to resource use by other processes, to behavioural dependence on data values.
In capturing the deviation from ideal behaviour, ∆Q incorporates both delay (a continuous random variable) and exceptions/failures (discrete variables). This can be modelled mathematically using Improper Random Variables (IRVs), whose total probability is less than one [17]. If we write ∆Q(x) for the probability that an outcome occurs in a time t ≤ x, then we can define the 'intangible mass' of such an IRV as 1 − lim x→∞ ∆Q(x). In ∆Q, this encodes the probability of exception or failure. This is illustrated in Figure 4, which shows the cumulative distribution function (CDF) of an IRV (with arbitrary time units). We can define a partial order on such variables, in which the 'smaller' attenuation is the one that delivers a higher probability of completing the outcome in any given time: This partial order has a 'top' element, which is simply perfect performance: ≡ (∀ x ∆Q(x) = 1), and a 'bottom' element, which is total failure (an outcome that never occurs): ⊥ ≡ (∀ x ∆Q(x) = 0). We can write specifications for system performance using this partial order by requiring the overall ∆Q to be less than or equal to a predefined bounding case.
Where the ∆Q is strictly less than the requirement, we say there is performance slack; when it is strictly greater than the requirement, we say there is a performance hazard (cf. Definitions 5 and 8).
Assessments might also find the current level of information about a system to be indecisive-neither slack nor hazard. The simplest reason for indecisiveness is the partiality of ≤ in Equation (1). Another reason for indecisiveness might be conflict between different analyses. For example, timeliness analysis (Section 5.3) might show slack whilst load analysis (Section 5.4) shows hazard. A third reason might be that even though the formulations end up indicating slack or hazard, the system is still detailed so little that the result of the analysis should not be counted on.
The relationships between outcomes that are shown in Figure 3 then induce corresponding relationships between the ∆Qs of those outcomes, as explained in Section 5.3.
The key to the compositionality of the paradigm is that the partial order is preserved by the operations that combine ∆Qs. Thus, for example, considering the sequential composition of either of two alternative outcomes with a third, This enables an overall timeliness requirement to be broken into 'budgets' for sub-outcomes. More details of this approach are given in [11].

Simple Example
Consider the simple distributed system of a web browser interacting with a set of servers that collectively provide a web page. The outcome that is of interest to the user starts with the event of clicking on a URL, and it ends with the event of the page being fully rendered. This corresponds to the first row of Figure 5. The second row shows the distinction between the user and the browser, and the third row exposes the back-end servers. A typical web page will contain a variety of elements that are served by servers from different host domains. So, for each element, the browser (and its supporting O/S) must first resolve the corresponding domain name, then establish a connection to the given server, and finally download and then render the provided content. Thus, for each element that needs to be displayed, the ∆Q is the sequential composition of the ∆Qs of the component steps described above; and the ∆Q of rendering the whole page is an all-to-finish combination of the ∆Qs of all the elements. Note that this formulation automatically deals with the possibility that any of the steps may fail, and it provides the resultant failure probability for the whole process in addition to the distribution of expected completion times.
This simple model can be further refined as needed to meet real-world requirements. For example, DNS resolution might provide alternative server addresses for load-balancing purposes, and each of these servers might have different ∆Qs when providing the same content to the user (perhaps because they are located in different geographical locations or are provisioned using systems with different CPU or storage capabilities). We can represent this as a probabilistic choice between these outcomes, which is weighted by the probability that a specific server is used. This weights the corresponding ∆Q. In addition, we might also consider the effect of load and contention for shared resources, for example network interface bandwidth or rendering capacity, or the impact of different DNS caching architectures on performance. These aspects of system performance design are formalised in Section 5.

Alternatives to Outcome Diagrams-Why a New Diagram?
The ∆QSD paradigm introduces the concept of outcome diagrams. It is perfectly reasonable to ask at this point: "Why another diagram? What is it that outcome diagrams capture that UML diagrams, for example, cannot?" Let us answer these questions by comparing outcome diagrams with UML. We first recall the two main properties of outcome diagrams in the ∆QSD paradigm: • An outcome diagram specifies the causal relations between outcomes. An outcome is a specific system behaviour defined by its possible starting events and its possible terminating events. For example, sending a message to a server is an outcome defined by the beginning of the send operation and the end of the send operation. The action of sending a message and receiving a reply is observed as an outcome, which is defined by the beginning of the send operation and the end of the receive operation. Outcomes can be decomposed into smaller outcomes, and outcomes can be causally related. For example, the send-receive outcome can be seen as a causal sequence of a send outcome and a receive outcome. • An outcome diagram can be defined for a partially specified system. Such an outcome diagram can contain undefined outcomes, which are called black boxes. A black box does not correspond to any defined part of the system, but it still has timeliness and resource constraints. Refining an outcome diagram can consist in replacing one of its black boxes with a subgraph of outcomes.
A crucial property of an outcome diagram is that it is an observational concept. That is, it says something about what can be observed of a system from the outside, but it does not say anything about how the system is constructed internally.

UML Diagrams
UML is a rich language defined to model many different aspects of software, including its structure, behaviour, and the processes it is part of. The UML 2 standard defines 14 kinds of diagrams, which are classified into structural diagrams and behavioural diagrams. We first note two general properties of outcome diagrams that UML diagrams do not share:

•
Observational property: All UML diagrams, structural and behavioural, define what happens inside the system being modelled, whereas outcome diagrams define observations from outside the system. The outcome diagram makes no assumptions about the system's components or internal states.
• Wide coverage property: It is possible for both UML diagrams and outcome diagrams to give partial information about a system, so that they correspond to many possible systems. As long as the systems are consistent with the information in the diagram, they will have the same diagram. However, an outcome diagram corresponds to a much larger set of possible systems than a UML diagram. For an outcome diagram, a system corresponds if it has the same outcomes, independent of its internal structure or behaviour. For a UML diagram, a system corresponds if its internal structure or behaviour is consistent with the information in the diagram. This means that a UML diagram is already making decisions w.r.t. the possible system structures quite early in the design process. The outcome diagram does not make such decisions.
In the rest of this section, we compare outcome diagrams to two UML diagrams, namely the state machine diagram and the component diagram.

State Machine Diagram
A state machine diagram is a finite state automaton. It defines the internal states of a system and the transitions between them. The state diagram captures the causality between the actions taken when the system changes states, but this does not map directly to the outcomes observed by an external user. However, there is a relationship between a state diagram and an outcome diagram. An outcome can map to a sequence of state transitions, whereas, by examining the actions of a state diagram, it is possible to deduce the outcomes to expect from taking those actions.

Block Diagram
A block diagram specifies a system as a set of elements with their interconnections. We illustrate the difference between block diagrams and outcome diagrams using a simple example system: a user querying a front end that is connected to a database ( Figure 5). The figure shows the refinement process: a system with an initially unknown structure is refined stepwise into a system that has a completely known structure. For the outcome diagram, the system performance can be obtained directly by composing the ∆Qs of the outcomes, using the rules described in Section 5. For the block diagram, it is harder to obtain system performance. This is because the block diagram does not define the expected outcomes of a system or their causality. The block diagram by itself does not have sufficient information to allow system performance to be calculated: we also need to know the expected outcome and the sequence of messages sent between blocks needed to achieve that outcome. As a final remark, the block diagram constrains the system structure to always have a front end and a database, whereas the outcome diagram is consistent with many alternative system structures.

Design Exploration Using Outcome Diagrams
This section simulates how a design engineer could explore the blockchain diffusion example that was described in Section 2, using outcome diagrams. Figure 6 depicts that design exploration in the form of a threaded decision tree in the search space. Each node in the tree is an outcome diagram. Every node is labelled with a description plus the section in this paper where it is discussed. There are two types of edges: solid edges represent refinement steps (Definition 7), whilst dashed edges represent backtracks to take alternative directions of refinement. The formalism used in this section is presented in Section 5.

Starting Off
Initially, the design engineer knows almost nothing about the system. Perhaps, all they know is that there will be the following two observation locations: • A − : Block is ready to be transmitted by A. • Z + : Block is received and verified by Z.
The corresponding outcome diagram is in which the only outcome is a black box. As will be detailed in Section 5, the outcome expression to describe that outcome diagram is a ( for black boxes).

Early Analysis
Given that the design engineer is not content with the current level of granularity, they wish to further detail the diagram by giving the black box a name, such as o A Z . In ∆QSD, we call adding that further detail a refinement. That refinement step is depicted below.

Refinement Refinement
Here, the outcome diagram that is above the dashed line is refined into the one below the dashed line. As will be discussed in Section 5, the (rewrite) rule that authorises this refinement is We call this rule (UNBX) for unboxing (a black box). The rule states that in a context C, a black box can be rewritten to any other outcome expression (but not to a black box). In this case, we choose the black box to be rewritten to an outcome variable called o A Z . This indicates the outcome of hopping directly from A to Z. Before producing more of our block diffusion algorithm's outcome diagram, we would like to take the time to apply some analysis. Refinements aside, suppose for a moment that there are two hops to make from A to Z: first from A to an intermediate node B, and, then, from B to Z. The corresponding outcome diagram for the two-hop journey from A to Z would then be: Here, o A B and o B Z are the outcomes of hopping from A to B and from B to Z, respectively. Note also that the observation location between the above two outcomes is labelled B + /B − . That is because the observation B + and B − take place at the same location. For that reason, we will simply write B to refer to that observation location. The same convention is used for similar intermediate locations. Then, it is easy to obtain the outcome diagram for three hops: While outcome diagrams are visually more attractive, outcome expressions are algebraically more attractive. For example, the corresponding expression for two hops is is the symbol we use for sequential composition: The sequential composition of o A B and o B Z is needed because the latter causally depends on the former. Likewise, the outcome expression for three hops is Then, generalising that to n hops is easy: Parameterisation by n hops is useful because it helps the design engineer determine the right n for their blockchain. For example, a relevant question is: What is the optimal n for block diffusion to be timely and for its load to be bearable? The formalisation in Section 5 instructs the design engineer as to how to achieve that and other goals. Before detailing the how, we take our moment to analyse a smaller example. Consider the two-hop scenario. Provided that the design engineer has ∆Qs for both o A B and o B Z , they can use Definition 4 to work out the ∆Q of which is the convolution of the two constituent ∆Qs: In a similar vein, the design engineer can work out the n-hop scenario's ∆Q for n > 1.
Then, using the formulation given in Definition 5, the design engineer can determine the constraints on n that are needed in order for block diffusion to meet the overall timeliness requirements.
In practice, the time that is needed to transfer a block of data one hop depends on four main factors: 1. The size of the block; 2. The speed of the network interface; 3. The geographical distance of the hop (as measured by the time to deliver a single packet); 4. Congestion along the network path.
When we consider blockchain nodes that are located in data centres (which most block producers tend to be), the interface speed will typically be 1 Gb/s or more. This is not a significant limiting factor for the systems of interest (see Section 5.4 for an analysis that explains this). In the setting that we are considering, congestion is generally minimal, and so this can also be ignored in the first instance. This leaves (i) block size, which we will take as a design parameter to be investigated later; and (ii) distance, which we will consider now. For simplicity, we will consider three cases of geographical distance: 1.
Short: The two nodes are located in the same data centre; 2.
Medium: The two nodes are located in the same continent; 3.
Long: The two nodes are located in different continents.
For pragmatic reasons, Cardano relies on the standard TCP protocol for data transfers. TCP transforms loss into additional delay, so the residual loss is negligible. At this point, we could descend into a detailed refinement of the TCP protocol, but equally we could simply take measurements; the compositionality of ∆QSD means that it makes no difference where the underlying values come from. Table 1 shows measurements of the transit time of packets and the corresponding transfer time of blocks of various sizes, using hosts running on AWS data centre servers in Oregon, Virginia, London, Ireland, and Sydney. Since we know that congestion is minimal in this setting, the spread of values will be negligible, and so in this case, the CDFs for the ∆Qs will be step functions. The transfer time for each block size is given both in seconds and in multiples of the basic round-trip time (RTT) between the hosts in question. Since the TCP protocol relies on the arrival of acknowledgements to permit the transmission of more data, it is unsurprising to see a broadly linear relationship, which could be confirmed by a more detailed refinement of the details of the protocol. Given the randomness in the network structure and the selection of block-producing nodes, there remains some uncertainty on the length of an individual hop. At this point, we will assume that short, medium, and long hops are equally likely, which we can think of as an equally-weighted probabilistic choice. In numerical terms, this becomes a weighted sum of the corresponding ∆Qs, as given in Table 1. This gives the distribution of transfer times per block size shown in Figure 7.

Refinement and Probabilistic Choice
Recall that A and Z are names for randomly chosen nodes, so the number of hops between A and Z is unkown. ∆QSD tackles that uncertainty by offering an outcome diagram that involves probabilistic choice between the different number of hops that might be needed. Strictly speaking, a probabilistic choice is a binary operation. Hence, when there are more than two choices, the outcome diagram will cascade probabilistic choices. In the general formulation, there are at most n hops. In order to produce that, the design engineer exercises a step-by-step refinement of the single-hop outcome diagram. The first refinement introduces the choice between one or two or more hops, as shown in Figure 8. There are two outcome diagrams in Figure 8: the one above the dashed line and the one below. The underlying green area is not a part of the two outcome diagrams itself, but it is there to indicate which part of the diagram above the dashed line is being refined into which part of the diagram below. In the absence of the left-side arrow, the direction of refinement can also be determined using the colour of the underlying green area. The pale side of an underlying green area is for what is being refined, whereas the dark side is for the result of the refinement.

Refinement Refinement
The equivalent outcome expression of the lower diagram in Figure 8 , which is a probabilistic choice between one or two hops with respective weights m 1 and m 1 . The corresponding (rewrite) rule of the figure is: which we call (PROB) (for probabilistic choice). Here is how we applied (PROB) to arrive from the single hop to the probabilistic choice between one hop and two hops: That is, C in the above refinement is an empty context.
Next, the design engineer further refines the two+-hop part to the probabilistic choice between two or three hops, as shown in Figure 9. Again, in that figure, the underlying green area is not a part of either diagram. It only serves as a visual indicator, showing which part of the upper diagram is being refined into which part of the lower one. For the equivalent term rewriting of Figure 9, we use (PROB) again. However, instead of an empty context, here, the context is

Refinement Refinement
The design engineer can continue refinement until a predetermined number of hops is reached. Alternatively, they can keep the number of hops as a parameter and analyse the corresponding parameterised outcome expression for timeliness, behaviour under load, etc. Figure 10 shows the result of applying Equation (2) to the sequence of outcome expressions corresponding to one, two, . . . five sequential hops using the transfer delay distribution shown in Figure 7, for a 64 kB block size. It can be seen that there is a 95% probability of the block arriving within 2 s. In contrast, Figure 11 shows the corresponding sequence of delay distributions for a 1024 kB block size, where the 95th percentile of transfer time is more than 5 s. If we know the distribution of expected path lengths, we can combine the ∆Qs for different hop counts using (PROB). Table 2 shows the distribution of paths lengths in simulated random graphs having 2500 nodes and a variety of node degrees [18]. Using the path length distribution for nodes of degree 10, for example, then gives the transfers delay distribution shown in Figure 12.

Alternative Refinements
Suppose that instead of investigating the number of hops, the design engineer is now interested in studying the steps within a single hop. There are various ways to do this. In Sections 4.4-4.7, we will consider four different ways that can be used when A and Z are neighbours, each of which refines o A Z . These refinements are all instances of the (ELAB) (rewrite) rule (for elaboration): The following sections are also important for another reason. So far, we have traversed the threaded tree of refinement in a depth-first way; the upcoming subsections traverse that tree in a breadth-first way. ∆QSD allows the design engineer to choose between depth-first and breadth-first refinement at any point in their design exploration.

Breaking Down Transmissions into Smaller Units
Network transmissions are typically broken down into the transmission of smaller units. Depending on the layering of the network protocols, that might, for example mean dividing a high-level message into several smaller packets. In a similar vein, the design engineer might decide to study block diffusion in terms of smaller units of transmission. For example, they might want to study the division of o A Z into n smaller unit operations o u 1 A Z , . . . , o u n A Z . The resulting outcome diagram is shown in Figure 13. Then, the corresponding outcome expression would be

Header-Body Split
In Cardano Shelley, an individual block transmission involves a dialogue between a sender node, A, and a recipient node, Z. We represent the overall transmission as o A Z . This can be refined into the following sequence:

1.
Permission for Header Transmission (o ph Z A ): Node Z grants the permission to node A to send it a header.

2.
Transmission of the Header (o th A Z ): Node A sends a header to node Z.

3.
Permission to for Body Transmission (o pb Z A ): Node Z analyses the header that was previously sent to it by A. Once the suitability of the block is determined via the header, node Z grants permission to A to send it the respective body of the previously sent header.

4.
Transmission of the Body (o tb A Z ): Finally, A sends the block body to Z. The motivation for the header/body split and the consequential dialogue is optimisation of transmission costs. Headers are designed to be affordably cheap to transmit. In addition, they carry enough information about the body to enable the recipient to verify its suitability. The body is only sent once the recipient has done this. This prevents the unnecessary transmission of block bodies when they are not required. Since bodies are typically several orders of magnitude larger than headers, considerable network bandwidth can be saved in this way. Moreover, the upstream node is not permitted to send another header until given permission to do so by the downstream node in order to prevent a denial-of-service attack in which a node is bombarded with fake headers, so this approach also reduces latency when bodies are rejected. In practice, the first permission is sent when the connection between peers is established and the permission renewed immediately after the header is received, so that the upstream peer does not have to wait unnecessarily. Therefore, the design engineer can refine o A Z into the finer-grained outcomes shown in Figure 14. The corresponding outcome expression is o Figure 14. Splitting a block transmission into its constituent parts: header (ph/th) and body (pb/tb).
Note that the protocol described here is between directly connected neighbours-these requests are not forwarded to other nodes. Thus, this is a refinement of the one-hop block transfer process. The significance of this refinement is that it shows that an individual outcome that, at a given level of granularity, is unidirectional (i.e., only from one entity in the system to another) might, at a lower level of granularity, very well be a multidirectional conversation.

Obtaining One Block from each Neighbour when Rejoining the Blockchain
Consider the situation where a node Z rejoins the blockchain after being disconnected for some period of time. Z will be out-of-date w.r.t. the recently generated blocks and will need to update itself. Let us consider the lucky situation where Z can acquire all the blocks that it is missing from its neighbours; that is, it can acquire the blocks with only one hop but from different neighbours. For demonstration purposes, we now make a number of simplifying assumptions: • Upon its return to the blockchain, Z is m blocks behind, where m is less than or equal to the number of Z's neighbours. With those simplifications in place, the outcome diagram will be as shown in Figure 15. This shows that Z will be up-to-date when all its m (selected) neighbours are granted permission and have finished sending their blocks to Z. Note that the outcome diagram has, in fact, m starting observation locations and m terminating observation locations. This is the reason for the 1. . .m notation immediately below each of those observation locations.
The corresponding outcome expression is

Load Analysis
One reason why this refinement is particularly interesting is that it allows an easy demonstration of our load analysis from Section 5.4. Fix a resource ρ such as network capacity. Pick a time t between the first observation made at an A − i and the last observation made at a Z + i . According to Definition 10, the static amount of work S at time t that is required for performing Equation (3) describes an approach to aggregating offered load on a resource. Considering an ephemeral resource-such as a communications network interface-a design interest might be to understand the intensity of use of this interface. We say a resource is ephemeral if it is lost if unused. For example, for a design requirement to be (at this level of detail) feasible, the average use of the interface has to be less than its capacity. This is the basic precondition for the demand on the resource to possess a feasible schedule. The RHS of Equation (3) captures this process as a piece-wise summation of the load intensities. Building on the time to transfer blocks (Table 1), and noting (from Section 2.1) that the body of a block is forwarded in response to a request (which takes one round-trip time), the total block volume is delivered in the total time minus the round trip time. For the 'Near' peers shipping a 64 kB block, this means an intensity of 42.7 Mb/s (8 × 64, 000/(0.024 − 0.012)) before incorporating any other network-related overheads (such as layered headers). Table 3 captures that load intensity approximation. This provides an insight into the likely capacity constraints for differing degrees of connectivity and, by inference, an insight into the system-level design trades. From Tables 1 and 3, it can be seen that smaller geographic distribution can lead to lower forwarding times assuming that (for a fixed communications capacity) the number of associating peers is suitably reduced. Assessments such as this give a measure of the likely "slack" in the design; those portions of the design that have less "slack" represent design elements that might need more detailed refinement and/or other strategies to ensure their feasibility. Note that a dedicated support tool for ∆QSD would easily be able to manipulate these complex outcome diagrams, giving a formally correct analysis, with very little mental burden for the design engineer.

Obtaining a Block from the Fastest Neighbour
Section 4.5 discussed splitting the header and body for optimisation reasons. One assumption in that design is that the header and the body will be taken from the same neighbour. It turns out that this assumption will not necessarily lead to the fastest solution. In fact, when Z determines that it is interested in a block that it has received the header of, it may obtain it from any of its neighbours that have signalled that they have it. In particular, Cardano nodes keep a record of the ∆Qs of their neighbours' block delivery. This allows them to obtain bodies from their fastest neighbour(s). In other words, once a node determines the desirability of a block (via its header), it is free to choose to take the body from any of its neighbours that have provided the corresponding header. As long as only timeliness is a concern-and not when resource consumption is also of interest-a race can occur between all neighbours, with the fastest neighbour winning the race. The diagrams in this section assume such a race. Now, as in Section 4.6, consider the situation where Z reconnects to the blockchain after being disconnected for some time. Our design in Section 4.6 assumes that there is no causality between the m blocks that Z needs to obtain. In reality, that is not correct: there is a causal order between those blocks, and that order can be rather tricky to define; it might take a couple of reads before the matter is fully digested. There are two separate total orders between blocks: CO1. For each block, the header must be transmitted before the body (so that the recipient node can determine the suitability of the block before the body transmission); CO2. Headers of the older blocks need to be transmitted before those of the younger blocks (note, however, that there is no causal relationship between the body transmissions).
This section considers the situation when the design engineer investigates the above race as well as CO1 and CO2. Suppose that once Z reconnects to the blockchain, it is exactly m = 3 blocks behind the current block. Suppose also that Z has k neighbours. The corresponding outcome diagram is shown in Figure 16. The fork that is causally dependent on o th 3 A Z is done when any of its prongs is done, that is, as soon as any neighbour of Z has finished transmitting the third block to Z. The other "∃" forks are similar.    The corresponding outcome expression is: We would like to invite the reader to take their time to pair the above diagram with our explanations above. We understand that the diagram and to a greater degree the expression can look impenetrable. The compositionality of our formalism (inherited from that of ∆QSD) comes to the rescue! Indeed, we can observe that the race pattern is rather repetitive. Thus, we can wrap the entire race into three new outcomes . Z , for example, to be the outcome of obtaining the first body transmitted to Z by any one of its k neighbours (that is, we are using "." in the subscript of o b 1 . Z as a wildcard). This makes the outcome diagram considerably simpler: These new diagrams make it easy to spot the lack of causal relationship between the o b i . Z s. Hence, there is no causal order between the body transmission despite the existence of CO1 and CO2. The corresponding outcome expression also becomes considerably simpler: The latter outcome diagrams and outcome expressions are now relatively easy to follow.

Summary
The refinements and analysis that are described in this section capture an important part of the design journey for the Shelley implementation of Cardano. In Section 4.1, we defined a 'top level' outcome of interest: that of diffusing a block from an arbitrary source node to an arbitrary destination in a bounded time and with bounded resource consumption. In Section 4.2, we refined this to examine the implications of forwarding the block through a sequence of intermediate nodes, and in Section 4.3, we factored in the expected distribution of path lengths. This allows an exploration of the trade-offs between graph size, node degree, block size, and diffusion time. In Section 4.4, we showed how ∆QSD can be used to explore orthogonal aspects of the design, in this case how blocks of data are in fact transmitted as a sequence of packets. This could be extended into a full analysis of some transmission protocol such as TCP or QUIC. In Section 4.5, we analysed the effects of splitting blocks into a header and a body in order to reduce resource consumption, and in Section 4.6, we analysed the potential for speeding up block downloading by using multiple peers in parallel. This analysis informed critical design decisions in the Cardano Shelley implementation, in particular the block header/body split, which was shown to significantly improve the resource consumption while increasing the diffusion time only slightly. An analysis of the network resource consumption in this case gave a flavour of how the ∆QSD paradigm encompasses resource as well as timeliness constraints. Finally, in Section 4.7, we discussed how ∆Q is used in the Shelley implementation of Cardano in operation as well as in design, to optimise the choice of peer from which to obtain a block.
All of this, together with further optimisations such as controlling the formation of the node graph to achieve a balance between fast block diffusion and resilience to partitioning, has produced an industry-leading blockchain implementation that reliably and consistently delivers blocks of up to 72 kB every 20 s on average across a globally distributed network of collaborating block producing nodes. Figure 17 gives a snapshot of the 95th percentile of block diffusion times over a period of nearly 48 h. This clearly shows highly consistent timing behaviour regardless of block size, with the vast majority of blocks diffused across the global network within 1-2 s. Such measurements, based on the ∆QSD paradigm, are used on an ongoing basis to avoid performance regressions as new features such as smart contracts are added to the Cardano blockchain.

Comparison with Simulation
It is informative to consider how the insights delivered by using ∆QSD could have been obtained otherwise, using, e.g., discrete-event simulations. This would require implementing the design to a sufficient level of detail for the timing to be considered accurate and then running many instances of the simulation to explore the variability of the context. For instance, obtaining the results of Figure 12 would require the following: • Generating a random graph with 2500 nodes having degree 10; • Randomly choosing whether each link is 'short', 'medium', or 'long', and applying the corresponding delay from Let us estimate how many simulation runs might be required. As a rule of thumb, we could consider that having any confidence in a 99th percentile result requires at least 1000 samples, so we would need to measure the diffusion time of at least 1000 blocks of the selected size; following Table 2, this would typically require each block to traverse four hops, hence needing 4000 simulation steps. So far, this seems quite tractable. However, let us consider how many graphs would need to be considered to have confidence in the results. According to McKay [19], if k ≤ 2n/9 and nk is even, then the number of labelled k-regular graphs (i.e., having degree k) on n vertices is given by: Taking logarithms and using Stirling's approximation for factorials ln(n!) ∼ n(ln(n) − 1), we can rewrite this as: If we substitute k = 10 and n = 2500, we get ln(M(n, k))) ∼ 12500 × 7.21 − 99/4 90, 158 which means M(n, k) ∼ 10 39,155 . So, obtaining a reasonable coverage of the set of possible random graphs with 2500 nodes of degree 10 is clearly infeasible. Using ∆QSD, we only process enough information to establish the performance hazard instead of constructing a lot of detail that is then discarded; combining probability distributions is a highly computationally efficient way to derive the distribution of interest (all the figures in this paper were produced on an ordinary laptop in a matter of seconds). This is not to say that ∆QSD replaces simulation, far from it: simulations can produce precise results whereas ∆QSD delivers probabilistic estimates. The limitation of ∆QSD are discussed further in Section 7.2.

A Formalisation of ∆QSD
The examples that were presented in Section 4 all build on the formalisms that we will present in this section. We start by describing the notational conventions that we will use here (Section 5.1). Then, we provide the syntax (Definition 1) for outcome expressions and formalise the rewrite rules that define the valid transitions between possible outcomes (Definition 3). In Sections 5.3 and 5.4, we provide corresponding denotational semantics for both timeliness and load. These provide the bases for constructing formal timeliness and load analyses that can be used as part of ∆QSD. The analyses have so far been deployed manually to inform design decisions for a number of complex real-world systems. Our longer-term intention is that they should be implemented as part of a design exploration toolset that will support ∆QSD. Additional semantics and analyses are also possible, of course, and could be used to support alternative design explorations or to provide further details about timeliness, load, etc.  , for a set A, we write A a to indicate that a, a , a , . . . , a 1 , a 2 , . . . all range over A. For predicates, we write pred(x).

Let B
and O v o v . We refer to black boxes and outcome variables together as base variables: Definition 1. The abstract syntax of outcome expressions is: We take o o to be commutative.
In Section 4, we used these syntax elements as follows: • in Section 4.1.

Definition 2.
The evaluation contexts C of an outcome are defined as follows: where "[]" is the empty context.
Evaluation contexts are useful in the definition of outcome transitions, which we define next.
Formally speaking, a refinement step is an instance of an outcome transition. The formal description of the system is refined when one or more refinement steps are taken.
The restriction on (UNBX) is because it makes no sense to replace a black box with another black box. (See the trailing discussion of Section 3.1 on the intention behind black boxes.) The restriction on (ELAB) is because it makes no sense for an outcome variable to be replaced by another outcome variable or a black box.
Considering Definition 3 to be part of the syntax is unusual. After all, evaluation contexts are a formalism for the semantics of programming languages. However, for ∆QSD, it turns out that the rewrites only cause syntactic changes to the outcome expressions (and the corresponding diagrams). Note that a refinement is not a system evolution, but rather, an update in the system description. It is only at analysis time that one tries to understand the meaning of an outcome diagram/expression.

Timeliness Analysis
We are now ready to describe the process of ∆Q analysis. The idea is that the design engineer provides the basic ∆Q analysis to the formulation in Definition 4. Then, our formulation enables them to determine the ∆Q analysis of the larger parts of their system or even all of it. This formulation is both compositional and simple. We call the ∆Q analysis that is provided by the design engineer the basic (∆Q) assignment (Definition 4). In the basic assignment, the design engineer only maps B expressions. They map those expressions to either CDFs or ∆Q variables. In return, they receive more complex ∆Q expressions. This is shown in Figure 18. The process is similar for load analysis except that there, the values exchanged between the design engineer and the respective formulation refer instead to static amounts of work.
The reason for including the CDFs in the input type of basic assignments is rather obvious. The choice to allow ∆Q variables here might be less so. The assignment of those B expressions that are mapped to ∆Q variables are considered to be left by the design engineer for later. As such, the formulation in Definition 4 takes the ∆Q value of those expressions to be , which lets the design engineer investigate feasibility even when those particular expressions are disregarded for the moment.
where * denotes the convolution of two ∆Qs. We denote the set of all basic assignments by We demonstrated the use of this definition in Section 4.2. In programming language theory, Definition 4 is said to give a denotational semantics for O. This is because the formulation works by compositionally denoting the O syntax into a familiar domain, which is deemed to be simpler (in our case, it is Γ). Definition 4 gives the design engineer the possibility of determining the ∆Q behaviour of a snapshot of their system. Armed with that information, the design engineer needs to figure out whether such ∆Q behaviour is affordable. In other words, they need to make sure the actual ∆Q is within the acceptable bounds. In order to do that, we assume that the design engineer's customer will provide them with a demand CDF: one that defines the acceptable bounds. Definition 5 below is a recipe for comparing the actual behaviour against a demand CDF. Definition 5. Given a demand CDF γ and a partial order < on Γ, say that a basic assignment ∆ • is a witness that an outcome o is a hazard w.r.t. γ Likewise, say ∆ • is a witness that an outcome o has slack once compared with γ The formulation of Definition 5 enables the design engineer to perform the ∆Q analysis of a single snapshot of their system. In some cases, that is enough because it can, for example, reveal the absolute infeasibility of a design. However, for the majority of cases, it is not enough. After all, a snapshot ∆Q analysis might not be conclusive for a variety of reasons. For example, one might not see any indication of a hazard by employing just Definition 5 because more detail is required. That takes us to Definition 8. When a design engineer works out the ∆Q analysis of a snapshot, the results might be favourable at the given level of refinement but still inaccurate. In such a case, a design engineer may wish to refine the system and perform the snapshot ∆Q again to check whether the refinement confirms the initial ∆Q analysis. Definition 8 examines that overall confirmation. Definitions 6 and 7 set the stage.

Definition 6.
Let ∆ • be a basic assignment. Write for those B outcomes in the domain of ∆ • that ∆ • maps to CDFs.
In other words, a basic assignment refines another one when it keeps all the CDFs in place and possibly adds more. We are now ready for Definition 8.

Definition 8.
Fix an outcome transition o → o and a ∆Q refinement ∆ • → ∆ • . Given a partial order < on Γ, we say that ∆ • → ∆ • witnesses that o → o arms a hazard As can be seen from Definitions 5 and 8, all the decisions for the timeliness analysis are made by scrutinising the CDFs (which represent ∆Q values). This is a consequence of the simple denotational semantics of Definition 4. The fact that the latter formalism is denotational implies that comparisons can be made in the domain of CDFs. Moreover, these comparisons are affordable because the denotational semantics is simple (as well as being effective).

Load Analysis
This section describes how the same approach can be used to analyse the load on given resources. Resources can be of different types; in particular, we distinguish ephemeral resources that are available at a certain rate and fixed resources that are available in a fixed number or amount. Examples of ephemeral resources are CPU cycles, network interface capacity, and disk IO operations. Fixed resources include CPU cores, memory capacity, and disk capacity. In this paper, we consider only ephemeral resources. The analysis that we want is an answer to the following question: will the resource manage the amount of work assigned to it in the available time frame?
We first need to set up some terminology for specifying the available time frame as well as the amount of work that is assigned to a given resource. Write t • (o) for the time an observable from the starting set of an outcome o occurs.
where d(o) denotes the duration limit of o. Fix a set of resources H ρ. Note that the amount of work that is assigned to a resource ρ is not scalar. Of course, it is necessary to provide the unit of measurement. For example, when ρ represents CPU resources, a sensible unit of measurement is the number of CPU cycles. When ρ represents network resources, a sensible unit of measurement is the message size. However, at the current level of formalisation, we wish to set ourselves free from thinking about units of measurement. Therefore, given a resource ρ, we write W ρ for the set of values of the right unit of measurement for an amount of work that has been assigned to ρ.
The design engineer utilises our load analysis in the same way that they utilise our ∆Q analysis. That is, they must provide some basic load analysis (Definition 9). Then, exactly as shown in Figure 18, they use the formulation in Definition 10 to determine the load analysis for larger parts of their system or possibly all of it. We now formalise what we mean by a basic load analysis. Definition 9. For a given ρ, a basic "static (amount of) work assignment for ρ" is a function: Definition 10. Given a basic static work assignment S • for ρ, the static work assignment (i.e., the amount of work to perform a single outcome per unit of size) Whether or not a given resource ρ is overloaded when performing an outcome o is determined by whether ρ can bear the offered load in the required duration, d(o). The smaller that d(o) is, the faster (i.e., the more intensely) o must be performed. However, that can only be done up to a certain threshold that is determined by the system's configuration. In other words, whether the intensity brought to ρ passes a given threshold is what determines whether ρ is overloaded. As with W ρ , at our current level of abstraction, we wish to disregard the units of measurement for intensity. That is, we write I ρ for the set of values of the right unit of measurement for the intensity of the load that is imposed on ρ. We single out θ I (ρ) ∈ I ρ for the threshold of intensity ρ can bear. When it is clear, we write θ I for θ I (ρ).
Definition 11. For a fixed ρ, given a threshold of intensity θ I (ρ) and a basic static work assignment S • for ρ, the static slack of an outcome in ρ-consumption: Define the static hazard of an outcome in ρ-consumption: Our emphasis on considering the analyses of Definitions 9-11 "static" is intentional. Firstly, they all assume that a base outcome's work is spread uniformly over its duration limit. That is obviously not always correct. The work assignment typically varies over the duration limit. However, if to every base outcome β, the design engineer chooses to assign the highest amount of work that β needs to do during its duration limit, the analyses given in Definition 11 would lead to a safe upper bound, which is useful as a first estimate. Secondly, Definitions 9-11 assume that an outcome's amount of work is always the same throughout its execution. Again, that is not realistic. Various reasons might cause the amount of work assigned to a base outcome to change over time. Examples are congestion, nonlinear correlations between outcomes, and cascading effects. This suggests more advanced load analyses that are "dynamic" rather than the "static" ones we have described here. We leave the development of such analyses to future work.

Related Work
Several theoretical or practical approaches have previously been proposed that address parts of the problem that has been identified above, but none of these addresses the whole problem in a comprehensive way.
6.1. Alternative Theoretical Approaches 6.1.1. Queuing Theory Steady-state performance has been widely studied as an aid to analysis, for example in queuing theory. Such approaches tend to take a resource-centric view of the system components, focusing on their individual utilisation/idleness. Where job/customer performance is considered, such as in mean-value analysis [20] or Jackson/BCMP networks [21], it is also in the context of steady-state averages. However, these traditional approaches cannot deliver metrics such as the time distribution of the system's response to an individual stimulus or even the probability that such a response will occur within a given time bound. These metrics are key for any time-critical and/or customer-experience-centric service.

Extending Existing Modelling Approaches
With the exception of hard real-time systems, it is rare to see performance treated as a "first-class citizen" in a system design process. At best, performance is considered as a property that will emerge during the system development life-cycle and thus something that can only be retrospectively validated. Thus, in contrast with ∆QSD, performance is unverifiable when using such an approach.
A common approach has been to extend existing approaches to modelling distributed systems such as Petri nets or process calculi with the goal of integrating performance modelling. Examples include stochastic Petri nets [22], timed and probabilistic process calculi [23,24], and performance evaluation process algebra (PEPA) [25]. These systems consider passage-time [26], which is the time taken for the system to follow a particular path to a state, that path being characteristic of an outcome of interest [27][28][29][30]. As mentioned above, these are all retrospective validation tools, requiring fully specified systems, that will give probabilistic measures of outcomes under steady-state assumptions. These systems are susceptible to state space explosion as a model grows in complexity, and therefore, this limits their usage to less complex systems. Furthermore, as with queuing models, they do not model failure nor do they model typical real-world responses to failure such as timeouts and retries.

Real-Time Systems and Worst-Case Execution Time
In real-time systems, actions must be completed by strict deadlines. Missed deadlines can be catastrophic (hard real-time systems) or lead to significant delay and loss caused by roll-backs or recovery (soft real-time systems). Performance analysis has focused on giving guarantees that deadlines can be met by studying worst-case execution time [31]. These approaches generally aim to analyse the behaviour of specific implementations, providing information about specific interactions. Thus, this approach is complementary to design-time approaches such as ∆QSD.

Block Propagation
Bitcoin's block propagation has been measured by Decker and Wattenhofer [32] and later by Croman et al. [33], who proposed guidelines on block size and interval to ensure adequate throughput for 90% of nodes. Shahsavari et al. [34] propose a random graph model for modelling the performance of block propagation. The recent survey article of Dotan et al. [35] covers block propagation (Section 3) and the mapping of blockchain networks.

Distributed System Design
Designing large distributed systems is costly and error-prone. This might seem paradoxical given the proliferation of modern Internet-based companies whose core business is based on large distributed systems, such as Google, Facebook, Amazon, Twitter, Netflix, and many others. Given the existence of these successful companies, it might seem that building large distributed systems is a solved problem. It is not: successful companies have built their systems over many years, using vast amounts of effort and ingenuity to find usable solutions to difficult problems. Unsuccessful companies are forgotten.

Iterative Design
There does not exist a standard approach for designing large distributed systems that allows prediction of high-load performance early on during the design process. We explain the problem by giving an overview of the current design approach for distributed systems. The approach is iterative. It starts with a specification of the system's desired performance and scale. Then, the system architecture is designed by determining the system components according to the system's scale and estimating the performance they must have to give the required overall performance. The next step is performance validation to verify that the design satisfies the performance requirements.
Performance validation is performed either as part of unit, subsystem, and/or system testing or via discrete-event simulation. Testing the performance of a component or subsystem is inconclusive without a reliable means to relate it to the resulting system performance, and testing of the whole system only reveals issues very late in the system development life-cycle. It is good practice to perform integration testing at this late stage. However, this is a poor and expensive substitute for performance analysis throughout the development process. Simulation can be performed earlier in the development process, and it may be less costly than testing, but it is limited in its ability to expose rare cases and hence cannot test tight bounds on the performance.
In the final analysis, obtaining reliable performance numbers at high load requires actually building a large part of the final system and subjecting it to a realistic load. If the system does not satisfy the requirements, then it is back to the drawing board. The system architecture is redesigned to remove observed and predicted bottlenecks and rebuilt.
Several iterations of the design may be necessary until the system behaves satisfactorily. It often happens that the system only behaves satisfactorily at a fraction of the required load, but because of market constraints, this is considered acceptable, and the system is deployed. In parallel to the deployment, the design engineers continue to work on a system that will accept the larger load under the assumption that the deployment will be successful so that the load will increase. This methodology is workable, but it is highly risky due to its high cost and development time. To have a good chance of success, it requires experienced developers. The development budget may be exhausted before achieving a satisfactory system; it may even be determined that the requirements are impossible to satisfy (infeasibility). If this is discovered early on, then the company may be able to retarget itself to become viable. Otherwise, the company simply folds.

Role of the ∆QSD Paradigm in Distributed System Design
The ∆QSD paradigm is designed specifically to reduce cost and development time. The system is designed as a sequence of increasingly refined outcome diagrams. At each stage, performance is computed using the ∆Q parameters. If the system is infeasible, this is detected early on, and it is immediately possible to change the design. If the design has sufficient slack, then the design process continues. The ∆QSD paradigm is effective insofar as the ∆Q computations provide realistic results. This depends on (i) having correct ∆Q distributions for the basic components and (ii) correctly specifying causality and resource constraints. Experience with ∆QSD in past industrial designs gives us confidence in the validity of the results. The additional rigour that is provided by the ∆QSD formalism that has been introduced in this paper gives us confidence that the paradigm is being applied correctly and allows the paradigm to be integrated into new design tools. Programming paradigms each focus on their particular discipline for bringing more opportunities for code reuse. The most familiar examples are perhaps Object-Oriented Programming, Functional Programming, and Genericity by Type, which promote code-reuse between a base class and derived ones by refactoring into functions and type parameterisation. Gibbons [36] has an excellent survey on different flavours of Generic Programming with the different opportunities for code reuse that each provides. Some programming paradigms have widely accepted formalisms, and some do not. Regardless of the underlying programming paradigm, ∆QSD is a paradigm for systems development rather than simply for programming, and it comes with its own formalism.

Software Development Paradigms
Three paradigms focus on the process of software development and hence are closer to ∆QSD:

1.
Design-by-Contract. [37] Similarly to ∆QSD, in this paradigm, the programmer begins by coding by describing the pre-conditions and the post-condition. Over the years, the concept of refining initial designs from specification to code has gained increasing weight [38]. However, unlike ∆QSD, the focus is on functional correctness rather than performance.

2.
Software Product Lines. [39] This paradigm targets families of software systems that are closely related and that clearly share a standalone base. The aim is to reuse the development effort of and the code for the base across all the different variations in the family. The similarity with ∆QSD is that this approach also allows variation in the implementation so long as the required quality constraints are met. In other words, variations can share a given expected outcome and its quality bounds.

3.
Component-Based Software Engineering. [40] Components, in this paradigm, are identified by their so-called 'requires' and 'provides' interfaces. That is, so long as two components have the same 'requires' and 'provides' interfaces, they are deemed equivalent in this paradigm, and they can be used interchangeably. In ∆QSD, subsystems can also have quality contracts that involve quantitative 'demand' and 'supply' specifications. Such contracts impose quality restrictions (say, timeliness or pattern of consumption) on the respective outcomes of those subsystems. However, we have not shown examples of quality contracts in this paper, because their formalisation is not yet complete.

Algebraic Specification and Refinement
Algebraic specification languages such as CLEAR [41], Extended ML [42], Institutions [43], and CASL [44] work on the basis of specifying requirements using algebraic signatures and equations that are then refined progressively until one makes it to the level of actual code. Refinement in such languages is managed using various media, for example by module systems with rigorously defined formal semantics. Whilst the focus of such languages is almost exclusively on functional correctness, studying possibilities for enhancing algebraic specifications so that they also accommodate the quality of outcomes would be an interesting avenue for future work.

Amortised Analysis
Amortised resource analysis is an approach for promoting resource analysis as a first-class citizen of programming languages specification. Various operational semantics, type systems, and category theoretical approaches have been employed. See [45][46][47], for example, where memory consumption for functional languages such as HASKELL and ML are automatically calculated for programs written in those languages. ∆QSD advises on specification at the much higher level of outcomes and outcome diagrams, leaving the actual implementation and its host language completely unconstrained. As a result, ∆QSD is much more flexible and permits rapid performance estimation throughout the system development life-cycle.

Conclusions
This paper has presented the ∆QSD systems development process that is driven by performance predictability concerns and is supported by a rigorous formalism (Section 5). Our formalism builds on the simple concept of quality attenuation (∆Q, Section 3.3) that captures the notion of performance hazard. This helps early detection of infeasibility, thus preventing the waste of resources (financial, people, time, and systems).
∆QSD has been successfully used in a wide range of industries, including telecommunications, avionics, space and defence, and cryptocurrency. It complements other approaches that are focused primarily on functional concerns, such as functional programming or model checking.
Our formalisation of ∆QSD is a part of a wider initiative both within Predictable Network Solutions and IO Global [9]. In particular, it has been applied to the development of the current iteration of the Cardano blockchain, which uses a proof-of-stake (PoS) consensus algorithm rather than the proof-of-work (PoW) approach used by most other blockchains, including Bitcoin. PoS algorithms have significant advantages over PoW, such as vastly better energy efficiency and the potential to deliver much higher performance, both in terms of processing transactions and embedding them more rapidly in the immutable chain. However, for this to work, blocks must be diffused within a predictably short time-frame across a globally distributed system with no central control so that the chain can be most efficiently extended. Only by using ∆QSD was the Cardano engineering team able to untangle this knot to deliver a secure and efficient system.
∆QSD is based on taking the observable outcomes of a system as the central point of focus (Section 3.1), capturing the causal dependencies between outcomes in the form of outcome diagrams (Section 3.2). The formalism also describes the process of refining outcome diagrams (Definition 7) as part of a system design process. The formal specification of a system serves as a basis for different analyses such as timeliness (Section 5.3) and behaviour under load (Section 5.4). Although we have illustrated the ∆QSD paradigm in the context of design refinement, the aim is that these aspects should permeate throughout the complete system development life-cycle.

Takeaways for System Designers
Let us summarise the main insights of the ∆QSD paradigm for the system designer. The main new concept is focus on performance as determined by observations, which are captured using outcome diagrams. Designing with outcome diagrams allows problems to be discovered early on in the design process, which saves time and reduces cost. We are working on tools and documentation to disseminate the ∆QSD paradigm in the system design community.

Outcome Diagrams
The outcome diagram defines a system in terms of what is observable from the outside (of the (sub)system under consideration), whereas traditional approaches such as UML (discussed in Section 3.5.1) all describe what is inside the system. A major advantage of this approach is that it avoids making decisions prematurely on how the system should be built. Outcome diagrams allow infeasibility to be discovered early on, avoiding costly dead ends and reducing time-to-market. On the other hand, all these advantages do not come for free. The main difficulty of using ∆QSD is psychological: some decisions on the actual system structure have to be "kept in the air" for long periods as the designer works with outcome diagrams. This can conflict with the natural urge to make decisions at the earliest opportunity and the often-imposed requirement to demonstrate 'progress'. Quantifying design risks is rarely understood as progress, although this is often the most valuable part of the entire design process. Outcome diagrams provide a framework for 'rigidly defined areas of doubt and uncertainty' [48], enabling such value to be evidenced. Figure 19 compares ∆QSD with a traditional approach. The figure shows a design tree. Each nonleaf node corresponds to one design decision. The design starts at the root and continues down the tree until it reaches a leaf node, which corresponds to a completely designed system. The subtree outlined on the left contains all designs where decision D x took the leftmost branch. In our case, all these designs are infeasible. In ∆QSD, this fact would be detected immediately after the D x decision is made by observing that the quality attenuation required from any subsequent refinement is infeasibly small, for instance less than the time taken for signals to move between components of the distributed system. Using an approach based on refining the system's structure, such as a UML-based approach, would require specifying much more of the system before this fact became evident. In many cases, it can only be seen by actually building the system and checking that it cannot satisfy the requirements. With ∆QSD, the cost of designing and building all these infeasible systems is saved. This example summarises the actual experience of Predictable Network Solutions (PNSol) in many industrial projects.

Recommendations
We recommend that you think about how the two main concepts of ∆QSD, outcomes and quality attenuation, can apply to your own work. Try to express one of your own designs in terms of the outcomes that a user sees without making any decisions about how the system is built. Instead of describing the system structure, as UML does, try to think only of externally visible outcomes. The blockchain example of Section 4 gives a realistic example of how this is done. Note that in practice, we expect that a software tool would do all the tedious bookkeeping needed to keep track of the outcome diagrams.
To design a system, start from the outcomes that the user expects, and work your way in from there. A primary outcome, such as a request-reply, can be divided into smaller outcomes. Bigger outcomes decompose into smaller ones, either by sequencing small outcomes, by creating a choice between small outcomes, or by synchronising on small outcomes. Eventually, you get to primitive outcomes that can be directly provided by components, such as networks, servers, or databases. At any time, you can combine the quality attenuation of small outcomes to get the quality attenuation of a bigger outcome. This means that you can start answering questions immediately, even if the system is only partially designed. The main question is, is the system feasible? In other words, is there a probability close to 1 that the reply returns with an acceptable delay? For cutting-edge systems, the answer to this question might be 'no'. In that case, you need to step back and build an alternative outcome decomposition.

Limitations of the ∆QSD Paradigm
There are two main limitations of the work that has been described here.

1.
Contextuality vs. Compositionality:As a performance modelling tool, ∆QSD deliberately trades detail in exchange for compositionality. The highest level of detail is provided by timed traces of a real system or a discrete event simulation thereof. A level of abstraction is provided by the use of generator functions [49], which obscure some details such as data-dependency but retain the local temporal context. Representing behaviour using random variables removes the temporal context, treating aspects of the system as Markovian. Thus, the ∆QSD paradigm is most applicable to systems that execute many independent instances of the same action, such as diffusing blocks, streaming video frames, or responding to web requests. For systems that engage in long sequences of highly dependent actions, it may only deliver bounding estimates.

2.
Non-linearity: In many systems, resource sharing may introduce a relationship between load and ∆Q, which can be incorporated in to the analysis. An obvious example is a simple queue (which is ubiquitous in networks), where the delay/loss is a function of the applied load. However, where system behaviour introduces a further relation-ship between ∆Q and load, for example due to timeouts and retries, the coupling becomes non-linear. In this case, a satisfactory performance analysis requires iterating to a fixed point, which may not be forthcoming. Failure to find a fixed point can be considered a warning that the performance of the system may be unstable.

Future Work
The ∆QSD paradigm has been developed for over 30 years by a small group of people in and around PNSol, and it has shown its value in large-scale industrial projects. It has matured enough that it should be more widely known. Unfortunately, applying it today requires a high level of commitment and effort, because there is no tool support and little documentation. The ultimate goal of our work is to make it usable with much less effort; this paper takes the first step by defining a formal framework for outcome diagrams. Ideally, the system designer will mostly need domain expertise to apply ∆QSD and very little expertise in the paradigm itself. To achieve this goal, we are working towards building tools to handle most of the details of creating outcome diagrams and computing quality attenuation.
The immediate next step after this paper is a tutorial on ∆QSD given at the HiPEAC conference in June 2022 [50]. This tutorial will give a broad introduction to the use of ∆QSD through a variety of practical examples that come from PNSol's experience. That will help the adaptation of ∆QSD by other practitioners and therefore will help us with further tuning of the ∆QSD tool we are currently developing.
Future work will also include the development of new analyses for non-ephemeral resources and for dynamic loads as well as an extension to non-linear systems in which the load and timeliness are coupled. In parallel, we plan to use our formalism as an intermediate step to better teaching and dissemination of ∆QSD.
We will build additional tools that will enable us to track the key observables/outcomes from the design into the implementation so that they can support ongoing system design and development throughout the system development life-cycle. Given appropriate tools, it would become feasible to systematically articulate the benefits of the paradigm, for instance by comparing various metrics between design projects that do or do not use it, such as the time/budget to compete the project, number of major design changes, etc. This line of research would require new collaborators with expertise in social science disciplines.
The wider ∆Q framework is also under active development within the International Broadband Forum [51] as a means of characterising quality attenuation associated with networks.