4.2. Early Analysis
Given that the design engineer is not content with the current level of granularity, they wish to further detail the diagram by giving the black box a name. In ΔQSD, we call adding that further detail a refinement. That refinement step is depicted below. Here, the outcome diagram above the dashed line is refined into the one below the dashed line. As will be discussed in Section 5, the (rewrite) rule that authorises this refinement is the one we call (Unbx), for unboxing (a black box). The rule states that in a context, a black box can be rewritten to any other outcome expression (but not to a black box). In this case, we choose the black box to be rewritten to an outcome variable that denotes the outcome of hopping directly from A to Z.
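To make the rewriting view concrete, here is a minimal sketch (in Python, with illustrative class and variable names, not the formal syntax of Section 5) of outcome expressions as a small AST, together with an unboxing step that replaces a black box by another, non-black-box expression:

```python
# Illustrative sketch of the (Unbx) rewrite: a black box inside an outcome
# expression is replaced by another outcome expression (never by another
# black box). All names here are hypothetical, not the paper's formal syntax.
from dataclasses import dataclass


class Expr:
    pass


@dataclass
class BlackBox(Expr):
    """An outcome whose internal structure is not yet detailed."""


@dataclass
class Var(Expr):
    """A named outcome variable, e.g. the outcome of one direct hop."""
    name: str


@dataclass
class Seq(Expr):
    """Sequential composition of two outcomes."""
    first: Expr
    second: Expr


def unbox(expr: Expr, replacement: Expr) -> Expr:
    """Rewrite every black box in `expr` to `replacement`; the context is
    whatever surrounds each box. (Unbx) forbids rewriting a box to a box."""
    if isinstance(replacement, BlackBox):
        raise ValueError("(Unbx) cannot rewrite a black box to a black box")
    if isinstance(expr, BlackBox):
        return replacement
    if isinstance(expr, Seq):
        return Seq(unbox(expr.first, replacement),
                   unbox(expr.second, replacement))
    return expr


# Refine the top-level black box into the outcome of hopping from A to Z.
refined = unbox(BlackBox(), Var("hop_A_to_Z"))
```

The context in the rule corresponds to the surrounding expression structure that `unbox` traverses; only the box itself is rewritten.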
Before producing more of our block diffusion algorithm’s outcome diagram, we would like to take the time to apply some analysis. Refinements aside, suppose for a moment that there are two hops to make from
A to
Z: first from
A to an intermediate node
B, and, then, from
B to
Z. The corresponding outcome diagram for the two-hop journey from
A to
Z would then be:
Here, the two constituent outcomes are the outcomes of hopping from A to B and from B to Z, respectively. Note also that the observation location between the above two outcomes carries a single label. That is because the two observations take place at the same location. For that reason, we will simply write B to refer to that observation location. The same convention is used for similar intermediate locations. Then, it is easy to obtain the outcome diagram for three hops:
While outcome diagrams are visually more attractive, outcome expressions are algebraically more attractive. For example, the corresponding expression for two hops is the sequential composition of the two single-hop outcomes; sequential composition is needed because the latter outcome causally depends on the former. Likewise, the outcome expression for three hops sequentially composes three single-hop outcomes. Then, generalising that to n hops is easy. Parameterisation by n is useful because it helps the design engineer determine the right n for their blockchain. For example, a relevant question is:
What is the optimal n for block diffusion to be timely and for its load to be bearable? The formalisation in
Section 5 instructs the design engineer as to how to achieve that and other goals. Before detailing the how, we take a moment to analyse a smaller example. Consider the two-hop scenario. Provided that the design engineer has ΔQs for both constituent hops, they can use Definition 4 to work out the ΔQ of the two-hop journey, which is the convolution of the two constituent ΔQs. In a similar vein, the design engineer can work out the n-hop scenario's ΔQ.
Then, using the formulation given in Definition 5, the design engineer can determine the constraints on
n that are needed in order for block diffusion to meet the overall timeliness requirements.
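As a concrete illustration of the convolution in Definition 4, the following sketch computes the ΔQ of a two-hop journey from two single-hop delay distributions. The delay values are made-up placeholders, not measured Cardano figures:

```python
# Sketch: sequential composition of two hop outcomes corresponds to the
# convolution of their delay distributions. Delays (in ms) and weights are
# illustrative placeholders, not measured Cardano figures.

def convolve(pmf_a, pmf_b):
    """Convolve two discrete delay distributions given as {delay_ms: probability}."""
    out = {}
    for da, pa in pmf_a.items():
        for db, pb in pmf_b.items():
            out[da + db] = out.get(da + db, 0.0) + pa * pb
    return out

def cdf(pmf, t):
    """Probability that the total delay is at most t."""
    return sum(p for d, p in pmf.items() if d <= t)

# One hop: 70% chance of 100 ms, 30% chance of 300 ms (hypothetical).
hop = {100: 0.7, 300: 0.3}
two_hops = convolve(hop, hop)  # the delay distribution of the two-hop journey
```

Iterating the same convolution gives the n-hop distribution, and the timeliness check of Definition 5 then amounts to comparing `cdf(two_hops, t)` against the required probability at the deadline t.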
In practice, the time that is needed to transfer a block of data one hop depends on four main factors:
The size of the block;
The speed of the network interface;
The geographical distance of the hop (as measured by the time to deliver a single packet);
Congestion along the network path.
When we consider blockchain nodes that are located in data centres (which most block producers tend to be), the interface speed will typically be 1 Gb/s or more. This is not a significant limiting factor for the systems of interest (see
Section 5.4 for an analysis that explains this). In the setting that we are considering, congestion is generally minimal, and so this can also be ignored in the first instance. This leaves (i) block size, which we will take as a design parameter to be investigated later; and (ii) distance, which we will consider now. For simplicity, we will consider three cases of geographical distance:
Short: The two nodes are located in the same data centre;
Medium: The two nodes are located in the same continent;
Long: The two nodes are located in different continents.
For pragmatic reasons, Cardano relies on the standard TCP protocol for data transfers. TCP transforms loss into additional delay, so the residual loss is negligible. At this point, we could descend into a detailed refinement of the TCP protocol, but equally we could simply take measurements; the compositionality of ΔQSD means that it makes no difference where the underlying values come from.
Table 1 shows measurements of the transit time of packets and the corresponding transfer time of blocks of various sizes, using hosts running on AWS data centre servers in Oregon, Virginia, London, Ireland, and Sydney. Since we know that congestion is minimal in this setting, the spread of values will be negligible, and so in this case, the CDFs for the ΔQs will be step functions. The transfer time for each block size is given both in seconds and in multiples of the basic round-trip time (RTT) between the hosts in question. Since the TCP protocol relies on the arrival of acknowledgements to permit the transmission of more data, it is unsurprising to see a broadly linear relationship, which could be confirmed by a more detailed refinement of the details of the protocol.
Given the randomness in the network structure and the selection of block-producing nodes, there remains some uncertainty on the length of an individual hop. At this point, we will assume that short, medium, and long hops are equally likely, which we can think of as an equally weighted probabilistic choice. In numerical terms, this becomes a weighted sum of the corresponding ΔQs, as given in
Table 1. This gives the distribution of transfer times per block size shown in
Figure 7.
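Numerically, that equally weighted choice is just a weighted sum of distributions. A small sketch, with placeholder per-distance delays rather than the measured values of Table 1:

```python
# Sketch: probabilistic choice over hop distances as a weighted sum of
# step-function delay distributions. Delays (ms) are illustrative placeholders.

def mix(weighted_pmfs):
    """Weighted sum of discrete delay distributions: [(weight, {delay_ms: prob})]."""
    out = {}
    for w, pmf in weighted_pmfs:
        for d, p in pmf.items():
            out[d] = out.get(d, 0.0) + w * p
    return out

# Deterministic one-hop transfer times (step CDFs) for one block size.
short = {1: 1.0}     # same data centre (hypothetical)
medium = {50: 1.0}   # same continent (hypothetical)
long_ = {200: 1.0}   # intercontinental (hypothetical)

one_hop = mix([(1 / 3, short), (1 / 3, medium), (1 / 3, long_)])
```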
4.3. Refinement and Probabilistic Choice
Recall that
A and
Z are names for randomly chosen nodes, so the number of hops between
A and
Z is unknown. ΔQSD tackles that uncertainty by offering an outcome diagram that involves probabilistic choice between the different numbers of hops that might be needed. Strictly speaking, a probabilistic choice is a binary operation. Hence, when there are more than two choices, the outcome diagram will cascade probabilistic choices. In the general formulation, there are at most
n hops. In order to produce that, the design engineer exercises a step-by-step refinement of the single-hop outcome diagram. The first refinement introduces the choice between one or two or more hops, as shown in
Figure 8.
There are two outcome diagrams in
Figure 8: the one above the dashed line and the one below. The underlying green area is not a part of the two outcome diagrams itself, but it is there to indicate which part of the diagram above the dashed line is being refined into which part of the diagram below. In the absence of the left-side arrow, the direction of refinement can also be determined using the colour of the underlying green area. The pale side of an underlying green area is for what is being refined, whereas the dark side is for the result of the refinement.
The equivalent outcome expression of the lower diagram in Figure 8 is a probabilistic choice between one hop and two hops, with appropriate respective weights. The corresponding (rewrite) rule of the figure is:
which we call
(Prob) (for probabilistic choice). Here is how we applied (Prob) to get from the single hop to the probabilistic choice between one hop and two hops:
That is, the context in the above refinement is empty.
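Since (Prob) is binary, a three-way split over hop counts is expressed as two nested binary choices. A sketch with made-up target weights (0.5, 0.3, 0.2) shows how the inner choice's weight must be renormalised:

```python
# Sketch: cascading binary probabilistic choices. Target weights for 1, 2,
# and 3 hops are 0.5, 0.3, and 0.2 (illustrative); the inner choice uses the
# renormalised weight 0.3 / 0.5 so the overall marginals come out right.

def choice(w, pmf_a, pmf_b):
    """Binary probabilistic choice: pmf_a with weight w, pmf_b with weight 1 - w."""
    out = {}
    for d, p in pmf_a.items():
        out[d] = out.get(d, 0.0) + w * p
    for d, p in pmf_b.items():
        out[d] = out.get(d, 0.0) + (1 - w) * p
    return out

one = {1: 1.0}    # delay of a one-hop journey (placeholder units)
two = {2: 1.0}
three = {3: 1.0}

# one-hop with weight 0.5, else two-hop vs. three-hop with weight 0.3/0.5
hops = choice(0.5, one, choice(0.3 / 0.5, two, three))
```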
Next, the design engineer further refines the two+-hop part to the probabilistic choice between two or three hops, as shown in
Figure 9. Again, in that figure, the underlying green area is not a part of either diagram. It only serves as a visual indicator, showing which part of the upper diagram is being refined into which part of the lower one.
For the equivalent term rewriting of
Figure 9, we use
(Prob) again. However, instead of an empty context, here, the context is non-trivial:
The design engineer can continue refinement until a predetermined number of hops is reached. Alternatively, they can keep the number of hops as a parameter and analyse the corresponding parameterised outcome expression for timeliness, behaviour under load, etc.
Figure 10 shows the result of applying Equation (
2) to the sequence of outcome expressions corresponding to one, two, …five sequential hops using the transfer delay distribution shown in
Figure 7, for a 64 kB block size. It can be seen that there is a high probability of the block arriving within 2 s. In contrast, Figure 11 shows the corresponding sequence of delay distributions for a 1024 kB block size, where the 95th percentile of transfer time is more than 5 s.
If we know the distribution of expected path lengths, we can combine the
ΔQs for different hop counts using
(Prob).
Table 2 shows the distribution of path lengths in simulated random graphs having 2500 nodes and a variety of node degrees [
18]. Using the path length distribution for nodes of degree 10, for example, then gives the transfer delay distribution shown in
Figure 12.
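Putting the pieces together, the overall delay distribution mixes the n-fold convolutions according to the path-length probabilities. A sketch with hypothetical numbers (the real weights come from Table 2):

```python
# Sketch: overall transfer delay = probabilistic choice over path lengths,
# where each n-hop branch is the n-fold convolution of the one-hop delays.
# Both distributions below are hypothetical placeholders.

def convolve(a, b):
    out = {}
    for da, pa in a.items():
        for db, pb in b.items():
            out[da + db] = out.get(da + db, 0.0) + pa * pb
    return out

def n_hops(hop, n):
    """n-fold convolution of the one-hop delay distribution."""
    dist = hop
    for _ in range(n - 1):
        dist = convolve(dist, hop)
    return dist

hop = {100: 0.8, 400: 0.2}           # one-hop delay in ms (hypothetical)
path_len = {2: 0.1, 3: 0.5, 4: 0.4}  # P(path has n hops) (hypothetical)

overall = {}
for n, w in path_len.items():
    for d, p in n_hops(hop, n).items():
        overall[d] = overall.get(d, 0.0) + w * p
```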
Alternative Refinements
Suppose that instead of investigating the number of hops, the design engineer is now interested in studying the steps within a single hop. There are various ways to do this. In
Section 4.4,
Section 4.5,
Section 4.6 and
Section 4.7, we will consider four different ways that can be used when
A and
Z are neighbours, each of which refines the single-hop outcome. These refinements are all instances of the
(Elab) (rewrite) rule (for elaboration):
The following sections are also important for another reason. So far, we have traversed the threaded tree of refinement in a depth-first way; the upcoming subsections traverse that tree in a breadth-first way.
ΔQSD allows the design engineer to choose between depth-first and breadth-first refinement at any point in their design exploration.
4.5. Header–Body Split
In Cardano Shelley, an individual block transmission involves a dialogue between a sender node, A, and a recipient node, Z. We represent the overall transmission as a single outcome. This can be refined into the following sequence:
Permission for Header Transmission: Node Z grants permission to node A to send it a header.
Transmission of the Header: Node A sends a header to node Z.
Permission for Body Transmission: Node Z analyses the header that was previously sent to it by A. Once the suitability of the block is determined via the header, node Z grants permission to A to send it the body corresponding to the previously sent header.
Transmission of the Body: Finally, A sends the block body to Z.
The motivation for the header/body split and the consequential dialogue is optimisation of transmission costs. Headers are designed to be affordably cheap to transmit. In addition, they carry enough information about the body to enable the recipient to verify its suitability. The body is only sent once the recipient has done this. This prevents the unnecessary transmission of block bodies when they are not required. Since bodies are typically several orders of magnitude larger than headers, considerable network bandwidth can be saved in this way. Moreover, the upstream node is not permitted to send another header until given permission to do so by the downstream node in order to prevent a denial-of-service attack in which a node is bombarded with fake headers, so this approach also reduces latency when bodies are rejected. In practice, the first permission is sent when the connection between peers is established, and the permission is renewed immediately after the header is received, so that the upstream peer does not have to wait unnecessarily. Therefore, the design engineer can refine
the single-hop transmission into the finer-grained outcomes shown in Figure 14; the corresponding outcome expression is the sequential composition of the four outcomes listed above.
Note that the protocol described here is between directly connected neighbours—these requests are not forwarded to other nodes. Thus, this is a refinement of the one-hop block transfer process. The significance of this refinement is that it shows that an individual outcome that, at a given level of granularity, is unidirectional (i.e., only from one entity in the system to another) might, at a lower level of granularity, very well be a multi-directional conversation.
4.7. Obtaining a Block from the Fastest Neighbour
Section 4.5 discussed splitting the header and body for optimisation reasons. One assumption in that design is that the header and the body will be taken from the same neighbour. It turns out that this assumption will not necessarily lead to the fastest solution. In fact, when
Z determines that it is interested in a block whose header it has received, it may obtain the body from
any of its neighbours that have signalled that they have it. In particular, Cardano nodes keep a record of the
timeliness of their neighbours’ block delivery. This allows them to obtain bodies from their fastest neighbour(s). In other words, once a node determines the desirability of a block (via its header), it is free to choose to take the body from any of its neighbours that have provided the corresponding header. As long as only timeliness is a concern—and not when resource consumption is also of interest—a
race can occur between all neighbours, with the fastest neighbour winning the race. The diagrams in this section assume such a race.
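For independent neighbours, such a race (an any-to-finish combination) has a simple numerical form: the probability that the fastest neighbour has delivered by time t is one minus the probability that none of them has. A sketch, using hypothetical exponential delays rather than measured ones:

```python
# Sketch: the CDF of a race between k independent neighbours is
# 1 - prod(1 - F_i(t)). The exponential delay model is illustrative only.
import math

def neighbour_cdf(rate):
    """Hypothetical exponential delay CDF for one neighbour."""
    return lambda t: 1.0 - math.exp(-rate * t)

def race_cdf(cdfs):
    """CDF of the winner's delay: the fastest of independent neighbours."""
    return lambda t: 1.0 - math.prod(1.0 - f(t) for f in cdfs)

winner = race_cdf([neighbour_cdf(r) for r in (0.5, 1.0, 2.0)])
```

With exponentials, the race is itself exponential with the summed rate, so `winner(t)` equals `1 - exp(-3.5 * t)`; with measured ΔQs, the same product formula applies pointwise to the step CDFs.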
Now, as in
Section 4.6, consider the situation where
Z reconnects to the blockchain after being disconnected for some time. Our design in
Section 4.6 assumes that there is no causality between the
m blocks that
Z needs to obtain. In reality, that is not correct: there is a causal order between those blocks, and that order can be rather tricky to define; it might take a couple of reads before the matter is fully digested. There are two separate total orders between blocks:
- CO1.
For each block, the header must be transmitted before the body (so that the recipient node can determine the suitability of the block before the body transmission);
- CO2.
Headers of the older blocks need to be transmitted before those of the younger blocks (note, however, that there is no causal relationship between the body transmissions).
This section considers the situation when the design engineer investigates the above race as well as CO1 and CO2. Suppose that once
Z reconnects to the blockchain, it is exactly
m blocks behind the current block. Suppose also that
Z has
k neighbours. The corresponding outcome diagram is shown in
Figure 16. Each fork is done when any of its prongs is done; for example, the fork for the third block completes as soon as any neighbour of Z has finished transmitting the third block to Z. The other “∃” forks are similar.
The corresponding outcome expression is:
We would like to invite the reader to take their time to pair the above diagram with our explanations above. We understand that the diagram and to a greater degree the expression can look impenetrable. The compositionality of our formalism (inherited from that of
ΔQSD) comes to the rescue! Indeed, we can observe that the race pattern is rather repetitive. Thus, we can wrap the entire race into three new outcomes, one per block. The intention is, for example, for the first of these to be the outcome of obtaining the first body transmitted to Z by any one of its k neighbours (a wildcard in its subscript stands for whichever neighbour wins the race). This makes the outcome diagram considerably simpler:
where
These new diagrams make it easy to spot the lack of causal relationship between the new outcomes. Hence, there is no causal order between the body transmissions despite the existence of CO1 and CO2. The corresponding outcome expression also becomes considerably simpler:
where
which we abbreviate as
The latter outcome diagrams and outcome expressions are now relatively easy to follow.
4.8. Summary
The refinements and analysis that are described in this section capture an important part of the design journey for the Shelley implementation of Cardano. In
Section 4.1, we defined a ‘top level’ outcome of interest: that of diffusing a block from an arbitrary source node to an arbitrary destination in a bounded time and with bounded resource consumption. In
Section 4.2, we refined this to examine the implications of forwarding the block through a sequence of intermediate nodes, and in
Section 4.3, we factored in the expected distribution of path lengths. This allows an exploration of the trade-offs between graph size, node degree, block size, and diffusion time. In
Section 4.4, we showed how
ΔQSD can be used to explore orthogonal aspects of the design, in this case how blocks of data are in fact transmitted as a sequence of packets. This could be extended into a full analysis of some transmission protocol such as TCP or QUIC. In
Section 4.5, we analysed the effects of splitting blocks into a header and a body in order to reduce resource consumption, and in
Section 4.6, we analysed the potential for speeding up block downloading by using multiple peers in parallel. This analysis informed critical design decisions in the Cardano Shelley implementation, in particular the block header/body split, which was shown to significantly improve the resource consumption while increasing the diffusion time only slightly. An analysis of the network resource consumption in this case gave a flavour of how the
ΔQSD paradigm encompasses resource as well as timeliness constraints. Finally, in
Section 4.7, we discussed how ΔQ is used in the Shelley implementation of Cardano in operation as well as in design, to optimise the choice of peer from which to obtain a block.
All of this, together with further optimisations such as controlling the formation of the node graph to achieve a balance between fast block diffusion and resilience to partitioning, has produced an industry-leading blockchain implementation that
reliably and
consistently delivers blocks of up to 72 kB every 20 s on average across a globally distributed network of collaborating block producing nodes.
Figure 17 gives a snapshot of the 95th percentile of block diffusion times over a period of nearly 48 h. This clearly shows highly consistent timing behaviour regardless of block size, with the vast majority of blocks diffused across the global network within 1–2 s. Such measurements, based on the
ΔQSD paradigm, are used on an ongoing basis to avoid performance regressions as new features such as smart contracts are added to the Cardano blockchain.
4.9. Comparison with Simulation
It is informative to consider how the insights delivered by using
ΔQSD could have been obtained otherwise, using, e.g., discrete-event simulations. This would require implementing the design to a sufficient level of detail for the timing to be considered accurate and then running many instances of the simulation to explore the variability of the context. For instance, obtaining the results of
Figure 12 would require the following:
Generating a random graph with 2500 nodes having degree 10;
Randomly choosing whether each link is ‘short’, ‘medium’, or ‘long’, and applying the corresponding delay from
Table 1;
Running the simulation of the whole system for enough steps to obtain statistical confidence;
Repeating for each block size;
Repeating this for enough different graphs to have confidence in the results.
Let us estimate how many simulation runs might be required. As a rule of thumb, we could consider that having any confidence in a 99th percentile result requires at least 1000 samples, so we would need to measure the diffusion time of at least 1000 blocks of the selected size; following
Table 2, this would typically require each block to traverse four hops, hence needing 4000 simulation steps.
So far, this seems quite tractable. However, let us consider how many graphs would need to be considered to have confidence in the results. According to McKay [
19], if $k = o(\sqrt{n})$ and $nk$ is even, then the number of labelled $k$-regular graphs (i.e., having degree $k$) on $n$ vertices is given by:

$$N(n,k) \sim \frac{(nk)!}{(nk/2)!\,2^{nk/2}\,(k!)^n}\,e^{-(k^2-1)/4}$$
Taking logarithms and using Stirling’s approximation for factorials, $\ln m! \approx m \ln m - m$, we can rewrite this as:

$$\ln N(n,k) \approx \frac{nk}{2}\left(\ln(nk) - 1\right) - n \ln k! - \frac{k^2 - 1}{4}$$
If we substitute $n = 2500$ and $k = 10$, we get $\ln N \approx 76{,}000$, which means $N \approx 10^{33{,}000}$. So, obtaining a reasonable coverage of the set of possible random graphs with 2500 nodes of degree 10 is clearly infeasible. Using
ΔQSD, we only process enough information to establish the performance hazard instead of constructing a lot of detail that is then discarded; combining probability distributions is a highly computationally efficient way to derive the distribution of interest (all the figures in this paper were produced on an ordinary laptop in a matter of seconds). This is not to say that ΔQSD replaces simulation, far from it: simulations can produce precise results, whereas ΔQSD delivers probabilistic estimates. The limitations of ΔQSD are discussed further in Section 7.2.
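The size of that graph ensemble can be checked directly in log space; a sketch using the asymptotic count attributed to McKay, computed via `math.lgamma` so the factorials never overflow:

```python
# Sketch: log10 of the approximate number of labelled k-regular graphs on n
# vertices, N(n,k) ~ (nk)! / ((nk/2)! * 2^(nk/2) * (k!)^n) * exp(-(k^2-1)/4).
import math

def log10_regular_graphs(n, k):
    assert (n * k) % 2 == 0, "nk must be even"
    ln_n = (math.lgamma(n * k + 1)          # ln (nk)!
            - math.lgamma(n * k // 2 + 1)   # ln (nk/2)!
            - (n * k / 2) * math.log(2)     # ln 2^(nk/2)
            - n * math.lgamma(k + 1)        # ln (k!)^n
            - (k * k - 1) / 4)
    return ln_n / math.log(10)

# For n = 2500, k = 10, the count has on the order of 33,000 decimal digits.
```

As a sanity check, for n = 4 and k = 1 the formula gives 3, which matches the three perfect matchings on four labelled vertices.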