Mind the Õ: asymptotically better, but still impractical, quantum distributed algorithms



Introduction
The classical CONGEST-CLIQUE Model (cCCM henceforth) in distributed computing has been carefully studied as a model central to the field, e.g., (Korhonen & Suomela, 2017;Saikia & Karmakar, 2019;Fischer & Oshman, 2021;Lenzen, 2012;Dolev, Lenzen, & Peled, 2012;Nowicki, 2019). In this model, processors in a network solve a problem whose input is distributed across the nodes under significant communication limitations, described in detail in §2. For example, a network of aircraft or spacecraft, satellites, and control stations, all with large distances between them, may have severely limited communication bandwidth to be modeled in such a way. The quantum version of this model, in which quantum bits can be sent between processors, the quantum CONGEST-CLIQUE Model (qCCM), as well as the quantum CONGEST model, have been the subject of recent research (Izumi & Gall, 2019;Censor-Hillel, Fischer, Le Gall, Leitersdorf, & Oshman, 2022;van Apeldoorn & de Vos, 2022;Elkin, Klauck, Nanongkai, & Pandurangan, 2012) in an effort to understand how quantum communication may help in these distributed computing frameworks. For the quantum CONGEST Model, however, (Elkin et al., 2012) showed that many problems cannot be solved more quickly than in the classical model. These include shortest paths, minimum spanning trees, Steiner trees, min-cut, and more; the computational advantages of quantum communication are thus severely limited in the CONGEST setting, though a notable positive result is sub-linear diameter computation in (Le Gall & Magniez, 2018). No comparable negative results exist for the qCCM, and in fact, (Izumi & Gall, 2019) provides an asymptotic quantum speedup for computing all-pairs shortest path (APSP henceforth) distances. 
Hence, the negative results of (Elkin et al., 2012) cannot transfer over to the qCCM, so investigating these problems in the qCCM presents an opportunity to better understand how quantum communication may help in these distributed computing frameworks. In this paper, we contribute to this understanding by formulating algorithms in the qCCM for finding approximately optimal Steiner trees and exact directed minimum spanning trees using Õ(n^{1/4}) rounds, asymptotically fewer rounds than any known classical algorithms. This is done by augmenting the APSP algorithm of (Izumi & Gall, 2019) with an efficient routing table scheme, which is necessary to make use of the shortest paths themselves instead of only the APSP distances, and using the resulting subroutine with existing classical algorithmic frameworks. Beyond asymptotics, we also characterize the complexity of our algorithms, as well as those of (Izumi & Gall, 2019;Censor-Hillel et al., 2016;Saikia & Karmakar, 2019;Fischer & Oshman, 2021), including the logarithmic and constant factors involved, to estimate the scales at which they would be practical; previous work did not include such an analysis. It should be noted that, like APSP, these problems cannot see quantum speedups in the CONGEST (non-clique) setting, as shown in (Elkin et al., 2012). Our Steiner tree algorithm is approximate and based on a classical polynomial-time centralized algorithm of (Kou, Markowsky, & Berman, 1981). Our directed minimum spanning tree algorithm follows an approach similar to (Fischer & Oshman, 2021), which effectively has its centralized roots in (Lovasz, 1985).

Background and Setting
This section provides the necessary background for our algorithms' settings and the problems they solve.

The CONGEST and CONGEST-CLIQUE Models of Distributed Computing
In the standard CONGEST model, we consider a graph of n processor nodes whose edges represent communication channels. Initially, each node knows only its neighbors in the graph and associated edge weights. In rounds, each processor node executes computation locally and then communicates with its neighbors before executing further local computation. The congestion limitation restricts this communication, with each node able to send only one message of O(log(n)) classical bits in each round to its neighbors, though the messages to each neighbor may differ. In the cCCM, we separate the communication graph from the problem input graph by allowing all nodes to communicate with each other, though the same O(log(n)) bits-per-message congestion limitation remains. Hence, a processor node could send n − 1 different messages to the other n − 1 nodes in the graph, with a single node distributing up to O(n · log(n)) bits of information in a single round. Taking advantage of this way of dispersing information to the network is paramount in many efficient CONGEST-CLIQUE algorithms. The efficiency of algorithms in these distributed models is commonly measured in terms of the round complexity, the number of rounds of communication used in an algorithm to solve the problem in question. A good overview of these distributed models can be found in (Ghaffari, 2020).
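Definition 2.1 (Quantum CONGEST-CLIQUE Model). The qCCM is a distributed computation model in which an input graph G = (V, E, W) is distributed over a network of n processor nodes, each with a unique ID in [n]. Computation proceeds in rounds, each consisting of the following:
1. Each node may execute unlimited local computation.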
2. Each node may send a message consisting of either a register of O(log n) qubits or a string of O(log n) classical bits to each other node in the network. Each of those messages may be distinct.
3. Each node receives and saves the messages the other nodes send it.
The input graph G is distributed across the nodes as follows: Each node knows its own ID number, the ID numbers of its neighbors in G, the number of nodes n in G, and the weights corresponding to the edges it is incident upon. The output solution to a problem must be given by having each node u ∈ V return the restriction of the global output to NG(u) := {v : uv ∈ E}, its neighborhood in G. No entanglement is shared across nodes initially.
This is an analog of the cCCM, except that quantum bits may be sent in place of classical bits. To clarify the output requirement, in the Steiner tree problem, we require node u to output the edges of the solution tree that are incident upon u. Since many messages in our algorithms need not be sent as qubits, we define the qCCM slightly unconventionally, allowing either quantum or classical bits to be sent. We specify those that may be sent classically. However, even without this modification, the quantum versions of CONGEST and cCCM are at least as powerful as their classical counterparts. This is because any n-bit classical message can be instead sent as an n-qubit message of unentangled qubits; for a classical bit reading 0 or 1, we can send a qubit in the state |0⟩ or |1⟩ respectively, and then take measurements with respect to the {|0⟩ , |1⟩} basis to read the same message the classical bits would have communicated. Hence, one can also freely make use of existing classical algorithms in the qCCM. Further, the assumption that IDs are in [n], with n known, is not necessary but is convenient; without this assumption, we could have all nodes broadcast their IDs to the entire network and then assign a new label in [n] to each node according to an ordering of the original IDs, resulting in our assumed situation.
Remark 2.2. Definition 2.1 does not account for how the information needs to be stored. In this paper, it suffices for all information regarding the input graph to be stored classically as long as there is quantum access to that data. We provide some details on this in §8.4 of the appendix.
Remark 2.3. Because Definition 2.1 stipulates that no entanglement is shared across nodes initially, quantum teleportation is not a trivial way to solve problems in the qCCM.
Example 2.4. To provide some intuition on how allowing communication through qubits in this distributed setting can be helpful, we now describe and give an example of distributed Grover search, first described in (Le Gall & Magniez, 2018). The high-level intuition for why quantum computing gives an advantage for search is that quantum operations use interference effects to cancel the amplitudes of non-solutions. Grover search has a generalization called "amplitude amplification" that we will use; see (Rieffel & Polak, 2011) for details on these algorithms. Now, for a processor node u in the network and a Boolean function g : X → {0, 1}, suppose there exists a classical procedure C in the cCCM that allows u to compute g(x) for any x ∈ X in r rounds. The quantum speedup will come from computing C in a quantum superposition, which enables g to be evaluated with inputs in superposition so that amplitude amplification can be used for inputs to g. Let Ai := {x ∈ X : g(x) = i}, for i = 0, 1, and suppose that 0 < |A1| ≤ |X|/2. Then classically, node u can find an x ∈ A1 in Θ(r|X|) rounds by checking each element of X. Using the quantum distributed Grover search of (Le Gall & Magniez, 2018) enables u to find such an x with high probability in only Õ(r√|X|) rounds by evaluating the result of computing g on a superposition of inputs.
We illustrate this procedure in an example case where a node u wants to determine whether one of its edges uv is part of a triangle in G. We first describe a classical procedure for this, followed by the corresponding quantum distributed search version. For v ∈ NG(u), denote by Iv : V → {0, 1} the indicator function of NG(v), and by guv : NG(u) → {0, 1} its restriction to inputs in NG(u). Classically, node u can evaluate guv(w) in two rounds for any w ∈ NG(u) by sending the ID of w (of length log n) to v, and having v send back the answer Iv(w). Then u can check guv(w) for each w ∈ NG(u) one at a time to determine whether uv is part of a triangle in G or not in 2 · |NG(u)| rounds.
For the distributed quantum implementation, u can instead initialize a register of log n qubits as $|\psi\rangle_0 := \frac{1}{\sqrt{|N_G(u)|}} \sum_{x \in N_G(u)} |x\rangle$, all the inputs for guv in equal superposition. To do a Grover search, u needs to be able to evaluate guv with inputs |ψ⟩ in superposition. For the quantum implementation of C, u sends a quantum register in state |ψ⟩|0⟩ to node v, and has node v evaluate a quantum implementation of Iv, which we will consider as a call to an oracle mapping |x⟩|0⟩ to |x⟩|Iv(x)⟩ for all x ∈ V. Node v sends back the resulting qubit register, and node u has evaluated guv(|ψ⟩) in 2 rounds. Now, since u can evaluate guv in superposition, node u may proceed using standard amplitude amplification, using 2 rounds of communication for each evaluation of guv, so that u can find an element w ∈ NG(u) satisfying guv(w) = 1 with high probability in Õ(r√|NG(u)|) rounds if one exists. We note that in this example, v cannot execute this procedure by itself since it does not know NG(u) (and sending this information to v would take |NG(u)| rounds), though it is able to evaluate Iv in superposition for any w ∈ NG(u). For any classical procedure C evaluating a different function from this specific g (that can be implemented efficiently classically and, therefore, translated to an efficient quantum implementation), the same idea results in the square-root advantage to find a desired element such that g evaluates to 1.
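To make the round counts in this example concrete, the following sketch compares the classical cost of 2 · |NG(u)| rounds with an amplitude-amplification schedule. It is a classical simulation only, assuming that exactly `marked` of the `size` candidate inputs satisfy guv(w) = 1 and that each oracle evaluation costs 2 rounds as in the text. The function name `grover_rounds` and the choice of ⌊π/(4θ)⌋ iterations are illustrative assumptions, not taken from (Le Gall & Magniez, 2018).

```python
import math

def grover_rounds(size, marked, rounds_per_eval=2):
    """Simulate Grover iterations over `size` inputs with `marked` solutions.

    Returns (quantum_rounds, success_probability, classical_rounds).
    Only two amplitudes are tracked because all marked (resp. unmarked)
    inputs share the same amplitude throughout the algorithm.
    """
    theta = math.asin(math.sqrt(marked / size))
    iterations = int(math.pi / (4 * theta))      # standard iteration count
    a = b = 1 / math.sqrt(size)                  # uniform superposition
    for _ in range(iterations):
        a = -a                                   # oracle: flip sign of marked inputs
        mean = (marked * a + (size - marked) * b) / size
        a, b = 2 * mean - a, 2 * mean - b        # diffusion: reflect about the mean
    success = marked * a * a                     # probability of measuring a solution
    return iterations * rounds_per_eval, success, size * rounds_per_eval

q_rounds, prob, c_rounds = grover_rounds(size=1024, marked=1)
```

With 1024 candidates and a single solution, the simulated search uses far fewer communication rounds than the exhaustive classical check, which is the square-root advantage described above.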

Notation and Problem Definitions
For an integer-weighted graph G = (V, E, W), we will denote n := |V|, m := |E|, and We the weight of an edge e ∈ E throughout the paper. Let δ(v) ⊂ E be the set of edges incident on node v, and NG(u) := {v : uv ∈ E} the neighborhood of u ∈ V. Denote by dG(u, v) the shortest-path distance in G from u to v. For a graph G = (V, E, W) and two sets of nodes U and U′, let PG(U, U′) := {uv ∈ E : u ∈ U, v ∈ U′} be the set of edges connecting U to U′. Let P(U) := P(U, U) as shorthand. All logarithms will be taken with respect to base 2, unless otherwise stated.
Definition 2.5 (Steiner Tree Problem). Given a weighted, undirected graph G = (V, E, W ), and a set of nodes Z ⊂ V , referred to as Steiner Terminals, output the minimum weight tree in G that contains Z.
Definition 2.6 (Approximate Steiner Tree). For a Steiner Tree Problem with terminals Z and solution $S_{OPT}$ with edge set $E_{S_{OPT}}$, a tree T in G containing Z with edge set $E_T$ such that $\sum_{uv \in E_T} W_{uv} \le r \cdot \sum_{uv \in E_{S_{OPT}}} W_{uv}$ is called an approximate Steiner Tree with approximation factor r.
Definition 2.7 (Directed Minimum Spanning Tree Problem (DMST)). Given a directed, weighted graph G = (V, E, W ) and a root node r ∈ V , output the minimum weight directed spanning tree for G rooted at r. This is also known as the minimum weight arborescence problem.

Contributions
We provide an algorithm for the qCCM that produces an approximate Steiner Tree with high probability (w.h.p.) in Õ(n^{1/4}) rounds and an algorithm that produces an exact Directed Minimum Spanning Tree w.h.p. in Õ(n^{1/4}) rounds.
To do this, we enhance the quantum APSP algorithm of (Izumi & Gall, 2019) in an efficient way to compute not only APSP distances but also the corresponding routing tables (described in §4) that our algorithms rely on. Further, in addition to these Õ results, in sections 4.7, 5.4, and 6.3, we characterize the constants and logarithmic factors involved in our algorithms as well as related classical algorithms to contribute to the community's understanding of their implementability. This reveals that the factors commonly obscured by Õ notation in related literature, especially the logarithms, have a severe impact on practicality.
We summarize the algorithmic results in the following two theorems:
Theorem 3.1. There exists an algorithm in the Quantum CONGEST-CLIQUE model that, given an integer-weighted input graph G = (V, E, W), outputs a 2(1 − 1/l)-approximate Steiner Tree with probability at least 1 − 1/poly(n), and uses Õ(n^{1/4}) rounds of computation, where l denotes the number of terminal leaf nodes in the optimal Steiner Tree.
Theorem 3.2. There exists an algorithm in the Quantum CONGEST-CLIQUE model that, given a directed and integer-weighted input graph G = (V, E, W), produces an exact Directed Minimum Spanning Tree with high probability, of at least 1 − 1/poly(n), and uses Õ(n^{1/4}) rounds of computation.

APSP and Routing Tables
We first describe an algorithm for the APSP problem with routing tables in the qCCM, for which we combine an algorithm of (Izumi & Gall, 2019) with a routing table computation from (Zwick, 2000). For this, we reduce APSP with routing tables to triangle finding via distance products.

Distance Products and Routing Tables
Definition 4.1. A routing table for a node v is a function Rv : V → V mapping a vertex u to the first node visited in the shortest path going from v to u other than v itself.
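As an illustration of how routing tables are used once they are known, the sketch below recovers a full path by repeatedly following next hops. The dictionary layout `tables[v][u]`, standing in for Rv(u), and the small path graph are hypothetical choices made for this example.

```python
def follow_routing_tables(tables, source, target):
    """Recover the node sequence of a shortest path from routing tables.

    `tables[v][u]` plays the role of R_v(u): the first node after v on a
    shortest path from v to u.
    """
    path = [source]
    current = source
    while current != target:
        current = tables[current][target]
        path.append(current)
        if len(path) > len(tables):          # guard against inconsistent tables
            raise ValueError("routing tables contain a cycle")
    return path

# Path graph 0 - 1 - 2 - 3 (hypothetical example input).
tables = {
    0: {1: 1, 2: 1, 3: 1},
    1: {0: 0, 2: 2, 3: 2},
    2: {0: 1, 1: 1, 3: 3},
    3: {0: 2, 1: 2, 2: 2},
}
path = follow_routing_tables(tables, 0, 3)
```

Each node only needs its own table to forward along a shortest path, which is why distances alone are not sufficient for the algorithms that follow.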
Definition 4.2. The distance product between two n × n matrices A and B is defined as the n × n matrix A ⋆ B with entries:
$$(A \star B)_{ij} = \min_{k \in [n]} \{A_{ik} + B_{kj}\}. \tag{4.1}$$
The distance product is also sometimes called the min-plus or tropical product. For shortest paths, we will repeatedly square the graph adjacency matrix with respect to the distance product. For an n × n matrix W and an integer k, let us denote $W^{k,\star} := W \star (W \star (\dots (W \star W)) \dots)$ as the k-th power of the distance product. For a graph G = (V, E, W) with weighted adjacency matrix W (assigning $W_{uv} = \infty$ if uv ∉ E), $W^{k,\star}_{uv}$ is the length of the shortest path from v to u in G using at most k hops. Hence, for any N ≥ n, $W^{N,\star}$ contains all the shortest path distances between nodes in G. As these distance products obey standard exponent rules, we may take $N = 2^{\lceil \log n \rceil}$ to recursively compute the APSP distances via taking ⌈log n⌉ distance product squares:
$$W^{2,\star} = W \star W, \quad W^{4,\star} = W^{2,\star} \star W^{2,\star}, \quad \dots, \quad W^{2^{\lceil \log n \rceil},\star} = W^{2^{\lceil \log n \rceil - 1},\star} \star W^{2^{\lceil \log n \rceil - 1},\star}. \tag{4.2}$$
This procedure reduces computing APSP distances to computing ⌈log n⌉ distance products. In the context of the CONGEST-CLIQUE model, each node needs to learn the row of $W^{n,\star}$ that represents it. As we also require nodes to learn their routing tables, we provide a scheme in §4.3 that is well-suited for our setting to extend (Izumi & Gall, 2019) to also compute routing tables.
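The repeated-squaring reduction can be sketched centrally as follows. This is a plain sequential illustration, not the distributed algorithm. It assumes a zero diagonal (W_vv = 0) so that the k-th power covers paths of at most k hops; the names `distance_product` and `apsp_by_squaring` and the example graph are hypothetical.

```python
import math

INF = math.inf

def distance_product(A, B):
    """Min-plus product: (A * B)[i][j] = min over k of A[i][k] + B[k][j]."""
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def apsp_by_squaring(W):
    """Square W under the distance product ceil(log2 n) times."""
    n = len(W)
    D = [row[:] for row in W]
    for _ in range(math.ceil(math.log2(n))):
        D = distance_product(D, D)
    return D

# Weighted path 0 -3- 1 -1- 2 -2- 3, with INF marking missing edges.
W = [
    [0, 3, INF, INF],
    [3, 0, 1, INF],
    [INF, 1, 0, 2],
    [INF, INF, 2, 0],
]
D = apsp_by_squaring(W)
```

For this 4-node graph, two squarings suffice, matching the ⌈log n⌉ count in the text.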

Distance Products via Triangle Finding
Having established reductions to distance products, we turn to their efficient computation. The main idea is that we can reduce distance products to a binary search in which each step in the search finds negative triangles. This procedure corresponds to (Izumi, Le Gall, & Magniez, 2020, Proposition 2), which we describe here, restricting to finding the distance product square needed for Eq. (4.2). A negative triangle in a weighted graph is a set of edges $\Delta^- = (uv, vw, wu) \subset E^3$ such that $\sum_{e \in \Delta^-} W_e < 0$. Let us denote the set of all negative triangles in a graph G as $\Delta^-_G$. Specifically, we will be interested in each node v being able to output edges vu ∈ δ(v) such that vu is involved in at least one negative triangle in G. Let us call this problem FindEdges, and define it formally as:

FindEdges
Input: An integer-weighted (directed or undirected) graph G = (V, E, W ) distributed among the nodes, with each node v knowing NG(v), as well as the weights Wvu for each u ∈ NG(v).
Output: For each node v, its output is all the edges vu ∈ E that are involved in at least one negative triangle in G.
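For reference, the input/output behavior of FindEdges can be stated as a brute-force centralized routine. This sketch only fixes the semantics of the problem and says nothing about round complexity. The adjacency-dictionary representation `W[v][u]`, the undirected-graph assumption, and the example weights are choices made for illustration.

```python
def find_edges(W):
    """Return, for each node v, the set of neighbors u such that edge vu lies
    in at least one negative triangle. `W[v][u]` is the weight of edge vu;
    the graph is assumed undirected, so W[v][u] == W[u][v]."""
    output = {v: set() for v in W}
    for v in W:
        for u in W[v]:
            for w in W[u]:
                # w must close the triangle, i.e. be adjacent to both u and v.
                if w != v and w in W[v] and W[v][u] + W[u][w] + W[w][v] < 0:
                    output[v].add(u)
                    break
    return output

# Triangle a-b-c has total weight 1 + 1 - 3 = -1; node d is on no triangle.
W = {
    "a": {"b": 1, "c": -3, "d": 5},
    "b": {"a": 1, "c": 1},
    "c": {"a": -3, "b": 1},
    "d": {"a": 5},
}
result = find_edges(W)
```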
Proposition 4.3. If FindEdges on an n-node integer-weighted graph G = (V, E, W) can be solved in T(n) rounds, then the distance product A ⋆ B of two n × n matrices A and B with entries in [M] can be computed in T(3n) · ⌈log_2(2M)⌉ rounds.
Proof. Let A and B be arbitrary n × n integer-valued matrices, and D be an n × n matrix initialized to 0. Let each u ∈ V simulate three copies of itself, u1, u2, u3, writing V1, V2, V3 as the sets of copies of nodes in V. Consider the tripartite graph G′ on V1 ∪ V2 ∪ V3 in which, for all u, v ∈ V, the edge u1v2 has weight A_uv, the edge u2v3 has weight B_uv, and the edge u3v1 has weight D_uv. An edge zv is part of a negative triangle in G′ exactly whenever min_{u∈V} {A_vu + B_uz} < −D_zv.
Assuming we can compute FindEdges for a k-node graph in T(k) rounds, with the non-positive matrix D initialized to 0, we can apply simultaneous binary searches on the entries D_zv, with values in {−2M, ..., 0}, updating them at each node v after each run of FindEdges, to find min_{u∈V} {A_vu + B_uz} for every other node z in T(3n) · ⌈log(max_{v,z∈V} {min_{u∈V} {A_vu + B_uz}})⌉ rounds, since G′ is a tripartite graph with 3n nodes.
Remark 4.4. This procedure can be realized in a single n-node distributed graph by letting each node represent the three copies of itself since G ′ is tripartite. The T (3n) stems from each processor node possibly needing to send one message for each node it is simulating in each round of FindEdges. If bandwidth per message is large enough (3 times the bandwidth needed for solving FindEdges in T (n) rounds), then this can be done in T (n) rounds.
So for this binary search, each node v initializes and locally stores Dvz = 0 for each other z ∈ V , after which we solve FindEdges on G ′ . The node then updates each Dvz according to whether or not the edge copies of vz were part of a negative triangle in G ′ , after which FindEdges is computed with the updated values for D. This is repeated until all the minu∈V {Avu + Buz} have been determined.
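The binary search just described can be sketched as follows, with a brute-force function standing in for a FindEdges call on G′. The sketch is centralized and the names `negative_triangle_pairs` and `distance_product_by_search` are hypothetical. For convenience the threshold matrix is indexed as `D[v][z]` rather than D_zv, and the number of oracle calls may differ by one from the bound in Proposition 4.3 depending on how thresholds are chosen.

```python
def negative_triangle_pairs(A, B, D):
    """Stand-in for one FindEdges call on the tripartite graph G':
    entry [v][z] is True iff min_u (A[v][u] + B[u][z]) + D[v][z] < 0."""
    n = len(A)
    return [[min(A[v][u] + B[u][z] for u in range(n)) + D[v][z] < 0
             for z in range(n)] for v in range(n)]

def distance_product_by_search(A, B, M):
    """Compute the min-plus product for entries in {0, ..., M} via the oracle."""
    n = len(A)
    lo = [[0] * n for _ in range(n)]          # invariant: lo <= C[v][z] <= hi
    hi = [[2 * M] * n for _ in range(n)]
    calls = 0
    while any(lo[v][z] < hi[v][z] for v in range(n) for z in range(n)):
        mid = [[(lo[v][z] + hi[v][z]) // 2 for z in range(n)] for v in range(n)]
        # Threshold D = -(mid + 1): a negative triangle means C <= mid.
        D = [[-(mid[v][z] + 1) for z in range(n)] for v in range(n)]
        found = negative_triangle_pairs(A, B, D)
        calls += 1                             # all pairs share one oracle call
        for v in range(n):
            for z in range(n):
                if lo[v][z] < hi[v][z]:
                    if found[v][z]:
                        hi[v][z] = mid[v][z]
                    else:
                        lo[v][z] = mid[v][z] + 1
    return lo, calls

A = [[0, 4, 7], [2, 0, 5], [6, 1, 0]]
B = [[0, 3, 8], [5, 0, 2], [1, 9, 0]]
C, calls = distance_product_by_search(A, B, M=9)
```

All n² searches advance in lockstep, so the number of oracle calls grows only logarithmically in M.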

Routing Tables via Efficient Computation of Witness Matrices
For the routing table entries, we also need each node v to know the intermediate node u that is being used to attain minu∈V {Wvu + Wuz}.
Definition 4.5. For a distance product A ⋆ B of two n × n matrices A, B, a witness matrix C is an n × n matrix such that $C_{ij} \in \operatorname{argmin}_{k \in [n]} \{A_{ik} + B_{kj}\}$ for all $i, j \in [n]$. Put simply, a witness matrix contains the intermediate entries used to attain the values in the resulting distance product. We present here a simple way of computing witness matrices along with the distance product by modifying the matrix entries appropriately, first considered by (Zwick, 2000). The approach is well-suited for our algorithm, as we only incur O(log n) additional calls to FindEdges for a distance product computation with a witness matrix.
For an n × n integer matrix W, obtain matrices W′ and W′′ by taking
$$W'_{ij} := n W_{ij} + j - 1, \qquad W''_{ij} := n W_{ij}, \qquad K := W' \star W''.$$
Claim 1. With W, W′, W′′, and K as defined immediately above,
(i) $\lfloor K / n \rfloor = W^{2,\star}$, and
(ii) $(K \bmod n) + 1$ is a witness matrix for $W^{2,\star}$,
where the floor and modulo operations are applied entrywise.
The claim follows from routine calculations of the quantities involved and can be found in the Appendix, §8.1. Hence, we can obtain witness matrices by simply changing the entries of our matrices by no more than a multiplicative factor of n and an addition of n. Since the complexity of our method depends on the magnitude of the entries of W logarithmically, we only need logarithmically many more calls to FindEdges to obtain witness matrices along with the distance products, making this simple method well-suited for our approach. More precisely, we can compute $W^{2,\star}$ with a witness matrix using $\log(2n \cdot \max_{i,j}\{W^{2,\star}_{ij} : W^{2,\star}_{ij} < \infty\})$ calls to FindEdges. We obtain the following corollary to Proposition 4.3 to characterize the exact number of rounds needed:
Corollary 4.6. If FindEdges on an n-node integer-weighted graph G = (V, E, W) can be solved in T(n) rounds, then the distance product square $W^{2,\star}$, along with a witness matrix H, can be computed in T(3n) · ⌈log_2(n · max_{v,z∈V} {min_{u∈V} {W_vu + W_uz}} + n)⌉ rounds.
Proof. This follows from Claim 1 and Proposition 4.3 upon observing that the entries of the distance product of W′ and W′′ are bounded by n · max_{v,z∈V} {min_{u∈V} {W_vu + W_uz}} + n.
Once we obtain witness matrices along with the distance product computations, constructing the routing tables for each node along the way of computing APSP is straightforward. In each squaring of W in Eq. (4.2), each node updates its routing table entries according to the corresponding witness matrix entry observed. It is worth noting that these routing table entries need only be stored and accessed classically so that we avoid using unnecessary quantum data storage.
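The idea of recovering witnesses by encoding the intermediate index into the matrix entries can be sketched as below. The specific encoding used here (scale by n and add the 0-indexed intermediate index, then decode with integer division and remainder) is an assumption in the style of the approach attributed to (Zwick, 2000); it is not claimed to match the paper's exact definitions, and the function name is hypothetical.

```python
import math

INF = math.inf

def distance_square_with_witness(W):
    """Min-plus square of W plus a witness matrix, using index encoding.

    Assumed encoding (0-indexed): W1[i][k] = n*W[i][k] + k and
    W2[k][j] = n*W[k][j]. The min-plus product K of W1 and W2 then satisfies
    K // n == (W squared) and K % n == an index attaining the minimum.
    """
    n = len(W)
    dist = [[INF] * n for _ in range(n)]
    witness = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            best = INF
            for k in range(n):
                if W[i][k] < INF and W[k][j] < INF:   # skip missing edges
                    best = min(best, (n * W[i][k] + k) + n * W[k][j])
            if best < INF:
                dist[i][j] = best // n                # decoded distance
                witness[i][j] = best % n              # decoded intermediate node
    return dist, witness

W = [
    [0, 3, INF, INF],
    [3, 0, 1, INF],
    [INF, 1, 0, 2],
    [INF, INF, 2, 0],
]
dist, witness = distance_square_with_witness(W)
```

Because the encoded entries are at most about n times larger, a search whose cost is logarithmic in the entry size pays only an additive O(log n) for the witnesses.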

Triangle Finding
Given the results from sections 4.3 and 4.2, we have reduced finding both the routing tables and distance product to having each node learn its incident edges that are involved in a negative triangle in the graph. This section will thus describe the procedure to solve the FindEdges subroutine. We state here a central result from (Izumi & Gall, 2019):
Proposition 4.7. There exists an algorithm in the quantum CONGEST-CLIQUE model that solves the FindEdges subroutine in Õ(n^{1/4}) rounds.
We will proceed to describe each step of the algorithm in order to state the precise round complexity beyond the Õ(n^{1/4}) bound, characterizing the constants involved in the interest of assessing the future implementability of our algorithms.
As a preliminary, we give a message routing lemma of (Dolev et al., 2012) for the congested clique, which will be used repeatedly: Lemma 4.8. Suppose each node in G is the source and destination for at most n messages of size O(log n) and that the sources and destinations of each message are known in advance to all nodes. Then all messages can be routed to their destinations in 2 rounds.
We introduce the subproblem FindEdgesWithPromise (FEWP henceforth). Let Γ(u, v) denote the number of nodes w ∈ V such that (u, v, w) forms a negative triangle in G.

FEWP
Input: An integer-weighted graph G = (V, E, W ) distributed among the nodes and a set S ⊂ P(V ), with each node v knowing NG(v) and S.
Output: For each node v, its output is the edges vu ∈ S that satisfy Γ(u, v) > 0.
We give here a description of the procedure of (Izumi & Gall, 2019) to solve FindEdges given an algorithm A to solve FEWP. Let ε_A be the failure probability of the algorithm A for an instance of FEWP. From step 2 of that procedure, it is straightforward to check that this requires a maximum of c_n := ⌈log(n/(60 log n))⌉ + 1 calls to the A subroutine to solve FEWP. Further, it succeeds with probability at least 1 − c_n/n^3 − c_n/n^{28} − (c_n + 1)ε_A. We refer the reader to (Izumi & Gall, 2019, §3) for the proof of correctness. We now turn toward constructing an efficient algorithm for FEWP.
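As a quick numeric check of the call count c_n stated above, the following sketch evaluates ⌈log(n/(60 log n))⌉ + 1 for a few network sizes. The function name `fewp_calls` is hypothetical, and the formula is only meaningful for n large enough that n > 60 log n.

```python
import math

def fewp_calls(n):
    """c_n = ceil(log2(n / (60 * log2(n)))) + 1, as stated in the text."""
    return math.ceil(math.log2(n / (60 * math.log2(n)))) + 1

# Number of FEWP calls per FindEdges call for three example sizes.
calls = {n: fewp_calls(n) for n in (2**10, 2**20, 2**30)}
```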
To solve this subroutine, we must first introduce an additional labeling scheme over the nodes that will determine how the search for negative triangles will be split up to avoid communication congestion in the network. Assume for simplicity that n^{1/4}, √n, n^{3/4} are integers. Let $\mathcal{M} := [n^{1/4}] \times [n^{1/4}] \times [\sqrt{n}]$. Clearly, $|\mathcal{M}| = n$, and $\mathcal{M}$ admits a total ordering lexicographically. Since we assume each node v_i ∈ V is labeled with unique integer ID i ∈ [n], v_i can select the element in $\mathcal{M}$ that has place i in the lexicographic ordering of $\mathcal{M}$ without communication occurring. Hence, each node v ∈ V is associated with a unique triple (i, j, k) ∈ $\mathcal{M}$. We will refer to the unique node associated with (i, j, k) ∈ $\mathcal{M}$ as node v_(i,j,k). The next ingredient is a partitioning scheme of the space of possible triangles. Let U be a partition of V into n^{1/4} subsets containing n^{3/4} nodes each, by taking U_i := {v_j : j ∈ {(i − 1) · n^{3/4} + 1, ..., i · n^{3/4}}} for i = 1, ..., n^{1/4}, and U := {U_1, ..., U_{n^{1/4}}}. Apply the same idea to create a partition U′ of √n sets of size √n, by taking U′_k := {v_j : j ∈ {(k − 1) · √n + 1, ..., k · √n}} for k = 1, ..., √n, and U′ := {U′_1, ..., U′_{√n}}. Let $\mathcal{V} := U \times U \times U'$. Each node v_(i,j,k) can then locally determine its association with the element (U_i, U_j, U′_k) ∈ $\mathcal{V}$ since $|\mathcal{V}| = n$. Further, if we use one round to have all nodes broadcast their IDs to all other nodes, each node v_(i,j,k) can locally compute the (U_i, U_j, U′_k) it is assigned to, so this assignment can be done in one round.
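The labeling and partitioning scheme can be computed locally by every node, as the following sketch illustrates. It assumes n is a perfect fourth power and uses 1-indexed IDs; the function name `label_nodes` and the return layout are hypothetical.

```python
from itertools import product

def label_nodes(n):
    """Assign each node ID in 1..n a triple (i, j, k) and list the blocks.

    Assumes n is a perfect fourth power so that n**(1/4) is an integer.
    """
    q = round(n ** 0.25)
    if q ** 4 != n:
        raise ValueError("n must be a perfect fourth power")
    s = q * q                                   # sqrt(n)
    # Lexicographically ordered triples in [n^(1/4)] x [n^(1/4)] x [sqrt(n)].
    triples = sorted(product(range(1, q + 1), range(1, q + 1), range(1, s + 1)))
    label = {node_id: triples[node_id - 1] for node_id in range(1, n + 1)}
    block = n // q                              # n^(3/4) nodes per set U_i
    U = [list(range((i - 1) * block + 1, i * block + 1)) for i in range(1, q + 1)]
    U_prime = [list(range((k - 1) * s + 1, k * s + 1)) for k in range(1, s + 1)]
    return label, U, U_prime

label, U, U_prime = label_nodes(16)
```

For n = 16 this yields two sets U_i of eight nodes each and four sets U′_k of four nodes each, so the 2 · 2 · 4 = 16 triples match the 16 nodes one to one.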
We present here the algorithm ComputePairs used to solve the FEWP subroutine.

ComputePairs
Input: An integer-weighted graph G = (V, E, W) distributed among the nodes, a partition of V × V × V into triples (U_i, U_j, U′_k) associated with each node as above, and a set S ⊂ P(V) such that for uv ∈ S, Γ(u, v) ≤ 90 log n.
Output: For each node v, its output is the edges vu ∈ S that satisfy Γ(u, v) > 0.
1: Every node v_(i,j,k) receives the weights W_uv, W_vw for all uv ∈ P(U_i, U_j) and vw ∈ P(U_j, U′_k).
2: Every node v_(i,j,k) constructs the set Λ_k(U_i, U_j) ⊂ P(U_i, U_j) by selecting every uv ∈ P(U_i, U_j) with probability 10 · log n/√n. If |{v ∈ U_j : uv ∈ Λ_k(U_i, U_j)}| > 100 n^{1/4} log n for some u ∈ U_i, abort the algorithm and report failure. Otherwise, v_(i,j,k) keeps all pairs uv ∈ Λ_k(U_i, U_j) ∩ S and receives the weights W_uv for all of those pairs. Denote those elements of Λ_k(U_i, U_j) ∩ S as u^k_1 v^k_1, ..., u^k_m v^k_m.
3: Every node v_(i,j,k) checks, for each l ∈ [m], whether there is some U ∈ U′ that contains a node w such that (u^k_l, v^k_l, w) forms a negative triangle, and outputs all pairs u^k_l v^k_l for which a negative triangle was found.
With probability at least 1 − 2/n, the algorithm ComputePairs does not terminate at step 2 and every pair (u, v) ∈ S appears in at least one Λ k (Ui, Uj). The details for this result can be found in (Izumi & Gall, 2019, Lemma 2).
Step 1 requires 2n^{1/4}⌈log W / log n⌉ rounds and can be implemented fully classically without any qubit communication.
Step 2 requires at most 200 log n · ⌈log W / log n⌉ rounds and can also be implemented classically.
Step 3 can be implemented in Õ(n^{1/4}) rounds quantumly, taking advantage of distributed Grover search, but would take O(√n) steps to implement classically. The remainder of this section is devoted to illustrating how this step can be done in Õ(n^{1/4}) rounds.
Define the following quantity: Δ(i, j, k) := {uv ∈ P(U_i, U_j) ∩ S : ∃ w ∈ U′_k with {u, v, w} forming a negative triangle in G}. For simultaneous quantum searches, we divide the nodes into different classes based on the number of negative triangles they are a part of with the following routine:

IdentifyClass
Input: An integer-weighted graph G = (V, E, W ) distributed among the nodes, and a set S ⊂ E as in FEWP.
Output: For each node v, a class α the node belongs to.
1: Every node u_(i,j,k) ∈ V samples each node in {v ∈ V : (u_(i,j,k), v) ∈ S} with probability (10 log n)/n, creating a set Λ(u) of sampled vertices. If max_u |Λ(u)| > 20 log n, abort the algorithm and report a failure. Otherwise, have each node broadcast Λ(u) to all other nodes, and take R := ∪_{u∈V} {uv | v ∈ Λ(u)}.
2: Every node v_(i,j,k) computes d_{i,j,k} := |{uv ∈ P(U_i, U_j) ∩ R : ∃ w ∈ U′_k such that {u, v, w} forms a negative triangle in G}|, then determines its class α to be min{c ∈ N : d_{i,j,k} < 10 · 2^c log n}.
This uses at most 20 log n rounds (each node sends at most that many IDs to every other node) and can be implemented by having all exchanged messages consist only of classical bits. Using Chernoff's bound, one can show that the procedure succeeds with probability of at least 1 − 1/n as seen in (Izumi et al., 2020, Proposition 5).
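The class assignment in the last step of IdentifyClass is a simple threshold rule, sketched below. The sketch assumes that N includes 0 (consistent with the case α = 0 considered next) and that logarithms are base 2; the function name `identify_class` is hypothetical.

```python
import math

def identify_class(d, n):
    """Smallest c in {0, 1, 2, ...} with d < 10 * 2**c * log2(n)."""
    c = 0
    while d >= 10 * (2 ** c) * math.log2(n):
        c += 1
    return c

# For n = 1024 the thresholds are 100, 200, 400, 800, ...
classes = [identify_class(d, 1024) for d in (50, 100, 450)]
```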
Let us make the convenient assumption that α = 0 for all v_(i,j,k), which avoids some technicalities around congestion in the forthcoming triangle search. Note that α ≤ (1/2) log n, so we can run successive searches for each α for nodes with class α in the general case. The general case is discussed in §8.2 of the appendix and can also be found in (Izumi & Gall, 2019), but this case is sufficient to convey the central ideas.
We have all the necessary ingredients to describe the implementation of step 3 of the ComputePairs procedure.
3.2: For each α, for every l ∈ [m], every node v (i,j,k) in class α executes a quantum search to find whether there is a U ′ k ∈ U ′ with some w ∈ U ′ k forming a negative triangle (u k l , v k l , w) in G, and then reports all the pairs u k l v k l for which such a U ′ k was found.
This provides the basis of the triangle-searching strategy. To summarize the intuition of the asymptotic speedup in this paper: Since the U′_k have size √n (recall that |U′| = √n), if each node using a quantum search can search through its assigned U′_k in Õ(n^{1/4}) rounds, simultaneously, we will obtain our desired complexity. We will complete this argument in §4.6 and first describe the quantum searches used therein in the following subsection.

Distributed Quantum Searches
With this intuition in mind, we now state two useful theorems of (Izumi & Gall, 2019) for the distributed quantum searches. Let X denote a finite set throughout this subsection.
Theorem 4.10. Let g : X → {0, 1}. If a node u can compute g(x) in r rounds in the CONGEST-CLIQUE model for any x ∈ X, then there exists an algorithm in the Quantum CONGEST-CLIQUE that has u output some x ∈ X with g(x) = 1 with high probability using Õ(r√|X|) rounds.
This basic theorem concerns only single searches, but we need a framework that can perform multiple simultaneous searches. Let g_1, ..., g_m : X → {0, 1} and A^1_i := {x ∈ X : g_i(x) = 1} for i ∈ [m]. Assume there exists an r-round classical distributed algorithm C_m that allows a node u upon an input χ = (x_1, ..., x_m) ∈ X^m to determine and output (g_1(x_1), ..., g_m(x_m)). In our use of distributed searches, X will consist of nodes in the network, and searches will need to communicate with those nodes for which the functions g_i are evaluated. To avoid congestion, we will have to consider those χ ∈ X^m that have many repeated entries carefully. We introduce some notation for this first. Define the quantity α(χ) := max_{x∈X} |{i ∈ [m] : x_i = x}|, the maximum number of entries in χ that are all identical. Next, given some β ∈ N, assume that in place of C_m we now have a classical algorithm C̃_{m,β} such that upon input χ = (x_1, ..., x_m) ∈ X^m, a node u outputs g_1(x_1), ..., g_m(x_m) if α(χ) ≤ β and an arbitrary output otherwise. The following theorem summarizes that such a C̃_{m,β} with sufficiently large β is enough to maintain a quantum speedup as seen in the previous theorem:
Theorem 4.11. For a set X with |X| < m/(36 log m), suppose there exists such an evaluation algorithm C̃_{m,β} for some β > 8m/|X| and that α(χ) ≤ β for all χ ∈ A^1_1 × ··· × A^1_m. Then there is an Õ(r√|X|)-round quantum algorithm that outputs an element of A^1_1 × ··· × A^1_m with probability at least 1 − 2/m^2.
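The repetition count α(χ) used in the theorem above is straightforward to compute, as sketched here. The helper names `max_repetition` and `evaluation_allowed` and the example tuple are hypothetical; the second helper only expresses the condition α(χ) ≤ β under which the evaluation algorithm is required to be correct.

```python
from collections import Counter

def max_repetition(chi):
    """alpha(chi): the largest number of positions of chi holding the same element."""
    return max(Counter(chi).values()) if chi else 0

def evaluation_allowed(chi, beta):
    """True iff the search input chi satisfies the congestion bound alpha(chi) <= beta."""
    return max_repetition(chi) <= beta

# Five simultaneous searches, three of which query the same node w1.
chi = ("w1", "w2", "w1", "w3", "w1")
alpha = max_repetition(chi)
```

Inputs with many repeated entries would force many simultaneous queries to the same node, which is the congestion the bound β guards against.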

Final Steps of the Triangle Finding
We continue here to complete step 3.2 of the ComputePairs procedure, armed with Theorem 4.11. We need simultaneous searches to be executed by each node v_(i,j,k) to determine the triangles in U_i × U_j × U′_k. We provide a short lemma first that ensures the conditions for the quantum searches:
Lemma 4.12. The following statements hold with probability at least 1 − 2/n^2:
The proofs of these statements are technical but straightforward, making use of Chernoff's bound and union bounds; hence we skip them here. To invoke Theorem 4.11, we describe a classical procedure first, beginning with an evaluation step, EvaluationA, implementable in Õ(1) rounds.

EvaluationA
Input: Every node v_(i,j,k) receives m elements (u^{i,j,k}_1, ..., u^{i,j,k}_m) of U′.
Promise: For every node v_(i,j,k) and every w ∈ U′, |L^{i,j,k}_w| ≤ 800 √n log n.
Output: Each node outputs a list of exactly those u^{i,j,k}_l such that there is a negative triangle in U_i × U_j × u^{i,j,k}_l.
1. Every node v_(i,j,k), for each t ∈ [√n], routes the list L^{i,j,t}_w to node v_(i,j,t).
2. Every node v (i,j,k) , for each vu it received in step 1, sends the truth value of the inequality to the node that sent vu.
Each node is the source and destination of up to $800 n \log n$ messages in step 1 (at most $\sqrt{n}$ lists of at most $800\sqrt{n}\log n$ entries each), meaning that this step can be implemented in $1600 \log n$ rounds. The same goes for step 2, noting that the number of messages is the same, but they need only be single-bit messages (the truth values of the inequalities). Hence, the evaluations for Theorem 4.11 can be implemented in $3200 \log n$ rounds. Now, applying the theorem with $X = U'$ and $\beta = 800\sqrt{n}\log n$, and noting that the assumptions of the theorem then hold with probability at least $1 - 2/n^2$ due to Lemma 4.12, implies that step 3.2 is implementable in $\tilde{O}(n^{1/4})$ rounds, with a success probability of at least $1 - 2/m^2$.
For the general case, in which we do not assume $\alpha = 0$ for all $i, j, k$ in IdentifyClass, one needs to modify the EvaluationA procedure to implement load balancing and information duplication, which avoid congestion in the simultaneous searches. These details, including a new labeling scheme and a different evaluation procedure, EvaluationB, can be found in the appendix or in (Izumi & Gall, 2019).

Complexity
As noted previously and in (Izumi & Gall, 2019), this APSP scheme uses $\tilde{O}(n^{1/4})$ rounds. Let us characterize the constants and logarithmic factors involved to assess this algorithm's practical utility. Suppose that in each round, $2 \cdot \log n$ qubits can be sent in each message (so that we can send two IDs or one edge with each message), where $n$ is the number of nodes. For simplicity, let us assume $W \ll n$ and drop $W$.
1. APSP with routing tables needs $\log(n)$ distance products with witness matrices.
2. Computing the $i$-th distance product square for Eq. (4.2) with a witness matrix needs up to $\log 2^i = i$ calls to FindEdges, since the entries of the matrix being squared may double each iteration. Then APSP and distance products together make $\sum_{i=1}^{\lceil \log n \rceil} i = \frac{\lceil \log n \rceil (\lceil \log n \rceil + 1)}{2}$ calls to FindEdges.
3. Solving FindEdges needs $\log\frac{n}{60 \log n}$ calls to FEWP, using FindEdgesViaFEWP.
4. Step 1 of ComputePairs needs up to $2 \cdot n^{1/4}$ rounds and step 2 takes up to $200 \log n$ rounds.
5. Step 1 of IdentifyClass needs up to $20 \log n$ rounds.
6. In step 2 of IdentifyClass, the $c_{uvw}$ are up to $\frac{1}{2}\log n$ large, and hence $\alpha$ may range up to $\frac{1}{2}\log n$.
7. Step 0 of the EvaluationB procedure needs $n^{1/4}$ rounds. Steps 1 and 2 of the EvaluationB (or EvaluationA, in the $\alpha = 0$ case) procedure use a total of $3200 \log n$ rounds.
8. The EvaluationB (or EvaluationA) procedure is called up to $\log(n)\, n^{1/4}$ times for each value of $\alpha$ in step 3.2 of ComputePairs.
Without any improvements, we get the following complexity, using $3n$ in place of $n$ for the terms of steps 3-8 due to corollary 4.6:

$$f(n) := \frac{\lceil \log n \rceil (\lceil \log n \rceil + 1)}{2} \, \log\!\left(\frac{3n}{60 \log 3n}\right) \left( 2(3n)^{1/4} + 220 \log 3n + 2(3n)^{1/4} + \tfrac{1}{2}\log 3n \cdot \log 3n \cdot (3n)^{1/4} \cdot 3200 \log 3n \right), \tag{4.4}$$

so that $f(n) = O(n^{1/4} \log^6 n)$, with the largest term being about $800 \log^6(n)\, n^{1/4}$; here we have dropped $W$ to consider only the case $W \ll n$. We can solve the problem trivially in the (quantum or classical) CONGEST-CLIQUE within $n \log(W)$ rounds by having each node broadcast its neighbors and the weights of the corresponding edges. Dropping $W$ again for the case $W \ll n$, the quantum algorithm gives a real speedup only when $f(n) < n$, which requires $n > 10^{18}$ (even with the simpler under-approximation $800 \log^6(n)\, n^{1/4}$ in place of $f$). Hence, even with some potential improvements, the algorithm is impractical for a large regime of values of $n$, even when compared to the trivial CONGEST-CLIQUE $n$-round strategy. For the algorithm of (Izumi & Gall, 2019), which computes only APSP distances, the first factor in Eq. (4.4) becomes simply $\lceil \log n \rceil$, so that the advantage over the trivial strategy begins at roughly $n \approx 10^{16}$.
Remark 4.13. In light of logarithmic factors commonly being obscured by $\tilde{O}$ notation, we point out that even an improved algorithm needing only $\log^4(n)\, n^{1/4}$ rounds would not be practical unless $n > 10^7$, for the same reasons. Recall that $n$ is the number of processors in the distributed network: tens of millions would be needed to make this algorithm worth implementing instead of the trivial strategy. Practitioners should mind the $\tilde{O}$ if applications are of interest, since even relatively few logarithmic factors can severely limit the practicality of algorithms, and researchers should be encouraged to write out the exact complexities of their algorithms in full for the same reason.
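The thresholds quoted in this subsection can be checked numerically. The sketch below locates, by bisection, the value of n beyond which a round count of the form c · log^k(n) · n^{1/4} drops below the trivial n rounds. It assumes base-2 logarithms and uses only the dominant term 800 log^6(n) n^{1/4} rather than the full f(n); the function names and the search bracket are illustrative choices.

```python
import math


def rounds(n, const, log_power):
    """Round count const * log2(n)**log_power * n**(1/4)."""
    return const * math.log2(n) ** log_power * n ** 0.25


def crossover(const, log_power, lo=1e3, hi=1e30, iters=200):
    """Bisection in log10(n) for the n beyond which rounds(n) < n.

    Assumes rounds(lo) > lo, rounds(hi) < hi and a single crossing between them.
    """
    if not (rounds(lo, const, log_power) > lo and rounds(hi, const, log_power) < hi):
        raise ValueError("bracket does not contain a crossing")
    a, b = math.log10(lo), math.log10(hi)
    for _ in range(iters):
        mid = (a + b) / 2
        n = 10 ** mid
        if rounds(n, const, log_power) > n:
            a = mid   # still slower than the trivial strategy
        else:
            b = mid   # already faster than the trivial strategy
    return 10 ** b


dominant_term = crossover(800, 6)   # 800 log^6(n) n^(1/4) versus n
remark_4_13 = crossover(1, 4)       # log^4(n) n^(1/4) versus n
print(f"{dominant_term:.2e}  {remark_4_13:.2e}")
```

Under these assumptions, the first crossover lies just above 10^18 and the second in the tens of millions, in line with the figures stated above.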

Memory Requirements
Although in definition 2.1 we make no assumption on the memory capacities of each node, the trivial $n$-round strategy uses at least 2 log(n)|E| 2 · log(W) memory at the leader node that solves the problem. For the APSP problem in question, using the Floyd-Warshall algorithm results in memory requirements of $2n^2 \log(n) \cdot \log(nW)$ at the leader node. Hence, we may ask whether the quantum APSP algorithm leads to lower memory requirements. The memory requirement is largely characterized by up to $720\, n^{7/4} \log(n) \log(nW)$ needed in step 0 of the EvaluationB procedure, which can be found in the appendix. This results in a memory advantage for quantum APSP over the trivial strategy beginning in the regime of $n > 1.6 \cdot 10^{10}$.

rounds, the details of which can be found in the appendix, §8.2.1. Then $g(n) > n$ up until about $n \approx 2.6 \cdot 10^{11}$. As with the quantum APSP, though this algorithm gives the best known asymptotic complexity of $\tilde{O}(n^{1/3})$ in the classical CONGEST-CLIQUE, it also fails to give any real improvement over the trivial strategy across a very large regime of values of $n$. Consequently, algorithms making use of this APSP algorithm, such as (Saikia & Karmakar, 2019) or (Fischer & Oshman, 2021), suffer from the same problem of impracticality. However, the algorithm only requires within $4n^{4/3} \log(n) \log(nW) + n \log(n) \log(nW)$ memory per node, which is less than required for the trivial strategy even for $n \ge 4$.
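Both memory thresholds follow by cancelling the common factor log(n) log(nW) against the Floyd-Warshall requirement 2n² log(n) log(nW) of the leader node. The short derivation below takes the stated expressions at face value:

```latex
\begin{align*}
720\, n^{7/4} \log(n)\log(nW) < 2 n^{2} \log(n)\log(nW)
  &\iff 360 < n^{1/4}
  \iff n > 360^{4} \approx 1.68 \cdot 10^{10}, \\
\bigl(4 n^{4/3} + n\bigr) \log(n)\log(nW) < 2 n^{2} \log(n)\log(nW)
  &\iff 4 n^{1/3} + 1 < 2n .
\end{align*}
```

The last inequality fails at n = 3 (about 6.77 versus 6) and holds at n = 4 (about 7.35 versus 8); since 2n − 4n^{1/3} is increasing for n ≥ 1, it holds for all n ≥ 4.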

Algorithm Overview
We present a high-level overview of the proposed algorithm to produce approximately optimal Steiner Trees, divided into four steps.
Step 2 - Shortest-path Forest: Construct a shortest-path forest (SPF), where each tree consists of exactly one source terminal and the shortest paths to the vertices whose closest terminal is that source terminal. This step can be completed in one round and n messages, per (Saikia & Karmakar, 2019, §3.1). The messages can be in classical bits.
Step 3 - Weight Modifications: Modify the edge weights depending on whether they belong to a tree (set to 0), connect nodes in the same tree (set to ∞), or connect nodes from different trees (set to the shortest path distance between root terminals of the trees that use the edge). This uses one round and n messages.
Step 4 - Minimum Spanning Tree: Construct a minimum spanning tree (MST) on the modified graph in O(1) rounds as in (Nowicki, 2019), and prune leaves of the MST that do not connect terminal nodes since these are not needed for the Steiner Tree.
The correctness of the algorithm follows from the correctness of each step together with the analysis of the classical results of (Kou et al., 1981), which uses the same algorithmic steps of constructing a shortest path forest and building it into an approximately optimal Steiner Tree.

Shortest Path Forest
After the APSP distances and routing tables have been found, we construct a Shortest Path Forest (SPF) based on the terminals of the Steiner Tree.
ii) For each $v \in Z_i$, $d_G(v, z_i) = \min_{z \in Z} d_G(v, z)$, and a shortest path connecting $v$ to $z_i$ in $G$ is contained in $T_{z_i}$.
iii) The $V_{z_i}$ form a partition of $V$, and $E_{z_1} \cup E_{z_2} \cup \cdots \cup E_{z_k} = E_F \subset E$.
In other words, an SPF is a forest obtained by gathering, for each node, a shortest path in $G$ connecting it to the closest Steiner terminal node.
For a node $v$ in a tree, we will let $par(v)$ denote the parent node of $v$ in that tree, $s(v)$ the Steiner Terminal in the tree that $v$ will be in, and $ID(v) \in [n]$ the ID of node $v \in V$. Let $Q(v) := \{z : d_G(v, z) = \min_{z' \in Z} d_G(v, z')\}$ be the set of Steiner Terminals closest to node $v$. We make use of the following procedure for the SPF:

DistributedSPF
Input: For each node v ∈ G, APSP distances and the corresponding routing table Rv.
Output: An SPF distributed among the nodes.
2: Each node v sets par(v) := Rv(s(v)), Rv being the routing table of v, and sends a message to par(v) to indicate this choice. If v receives such a message from another node u, it registers u as its child in the SPF.
Step 1 in DistributedSPF requires no communication since each node already knows the shortest path distances to all other nodes, including the Steiner Terminals, meaning it can be executed locally. Each node v choosing par(v) in step 2 can also be done locally using routing table information, and thus step 2 requires 1 round of communication of n − |Z| classical messages, since all non-Steiner nodes send one message.
Proof. i) holds since each Steiner Terminal is closest to itself. iii) is immediate. To see that ii) holds, note that for $v \in V_{z_k}$, $par(v) \in V_{z_k}$ and $\{v, par(v)\} \in E_{z_k}$ as well. Then $par(par(\dots par(v) \dots)) = z_k$ and the entire path to $z_k$ lies in $T_{z_k}$.
Hence, after this procedure, we have a distributed SPF across our graph, where each node knows its label, parent, and children of the tree it is in.
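The local computations of DistributedSPF can be sketched in a centralized simulation as follows. Because step 1 of the procedure is not reproduced above, the sketch assumes that each node selects its closest terminal and breaks ties by smallest ID; the function name, the data layout (`dist` as a matrix, `route` as per-node next-hop tables) and the path-graph example are all illustrative.

```python
def distributed_spf(dist, route, terminals):
    """Return (s, par, children) describing the shortest-path forest.

    dist[v][z]  : shortest-path distance from v to z (APSP output)
    route[v][z] : next hop from v on a shortest path to z (routing table R_v)
    terminals   : iterable of Steiner terminal IDs
    """
    terminals = sorted(terminals)
    nodes = range(len(dist))
    # Step 1 (local; tie-breaking by smallest ID is an assumption).
    s = {v: min(terminals, key=lambda z: (dist[v][z], z)) for v in nodes}
    # Step 2: every non-terminal picks the next hop toward s(v) as its parent.
    par = {v: (None if s[v] == v else route[v][s[v]]) for v in nodes}
    # Receiving the one-round "you are my parent" messages.
    children = {v: [] for v in nodes}
    for v, p in par.items():
        if p is not None:
            children[p].append(v)
    return s, par, children


# Path graph 0-1-2-3-4 with unit weights and terminals {0, 4}.
n = 5
dist = [[abs(u - v) for v in range(n)] for u in range(n)]
route = [[v if u == v else (u + 1 if v > u else u - 1) for v in range(n)]
         for u in range(n)]
s, par, children = distributed_spf(dist, route, {0, 4})
print(s)
print(par)
```

In the example, node 2 is equidistant from both terminals and joins the tree of terminal 0 under the assumed tie-breaking rule; its parent, node 1, then necessarily lies in the same tree.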

Weight Modified MST and Pruning
Finally, we introduce a modification of the edge weights before constructing an MST on that new graph that will be pruned into an approximate Steiner Tree. These remaining steps stem from a centralized algorithm first proposed by (Kou et al., 1981) whose steps can be implemented efficiently in the distributed setting, as in (Saikia & Karmakar, 2019). We first partition the edges $E$ into three sets: tree edges $E_F$ as in 5.1 that are part of the edge set of the SPF, intra-tree edges $E_{IT}$ that are incident on two nodes in the same tree $T_i$ of the SPF, and inter-tree edges $E_{XT}$ that are incident on two nodes in different trees of the SPF. Each node can learn which of these sets its edges belong to in one round by having each node send its neighbors the ID of the terminal it chose as the root of the tree in the SPF that it is a part of. Then the edge weights are modified as follows, denoting the modified weights as $W'$:

$$W'_{uv} = \begin{cases} 0 & \text{if } \{u, v\} \in E_F, \\ \infty & \text{if } \{u, v\} \in E_{IT}, \\ d_G(u, s(u)) + W_{uv} + d_G(v, s(v)) & \text{if } \{u, v\} \in E_{XT}, \end{cases}$$

noting that $d_G(u, s(u))$ is the shortest-path distance in $G$ from $u$ to its closest Steiner Terminal.
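A minimal sketch of this weight modification follows the three cases of the algorithm overview: tree edges become 0, intra-tree edges become ∞, and an inter-tree edge receives the length of the terminal-to-terminal path through it. The function name and the data layout (`s`, `par`, `dist_to_root` as dictionaries) are hypothetical; the example reuses a path graph with one extra intra-tree edge.

```python
import math


def modified_weights(edges, s, par, dist_to_root):
    """edges: {(u, v): w} for undirected edges; returns {(u, v): w'}."""
    new_w = {}
    for (u, v), w in edges.items():
        if par.get(u) == v or par.get(v) == u:
            new_w[(u, v)] = 0                 # tree edge of the SPF
        elif s[u] == s[v]:
            new_w[(u, v)] = math.inf          # intra-tree edge
        else:
            # inter-tree edge: root-to-root path length through {u, v}
            new_w[(u, v)] = dist_to_root[u] + w + dist_to_root[v]
    return new_w


# SPF of the path 0-1-2-3-4 with terminals {0, 4}, plus an extra edge {0, 2}.
s = {0: 0, 1: 0, 2: 0, 3: 4, 4: 4}
par = {0: None, 1: 0, 2: 1, 3: 4, 4: None}
dist_to_root = {0: 0, 1: 1, 2: 2, 3: 1, 4: 0}
edges = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (3, 4): 1, (0, 2): 5}
w_mod = modified_weights(edges, s, par, dist_to_root)
print(w_mod)
```

Each node only needs its own `s`, `par` and distance to its root plus the terminal ID announced by each neighbor, which is why a single round of communication suffices.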
Next, we find a minimum spanning tree on the graph G ′ = (V, E, W ′ ), for which we may implement the classical O(1) round algorithm proposed by (Nowicki, 2019). On a high level, this constant-round complexity is achieved by sparsification techniques, reducing MST instances to sparse ones, and then solving those efficiently. We skip the details here and refer the interested reader to (Nowicki, 2019). After this step, each node knows which of its edges are part of this weight-modified MST, as well as the parent-child relationships in the tree for those edges.
Finally, we prune this MST by removing non-terminal leaf nodes and the corresponding edges. This is done by each node $v$ sending the ID of its parent in the MST to every other node in the graph. As a result, each node can locally compute the entire MST and then decide whether or not it connects two Steiner Terminals. If it does, it decides it is part of the Steiner Tree; otherwise, it broadcasts that it is to be pruned. Each node that has not been pruned then registers the edges connecting it to non-pruned neighbors as part of the Steiner Tree. This pruning step takes 2 rounds and up to $n^2 + n$ classical messages.
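Once a node knows the whole MST, the local pruning decision can be sketched as repeated removal of non-terminal leaves, which leaves exactly the nodes lying on a path between two terminals (assuming at least two terminals). The function name and the example tree are illustrative.

```python
from collections import defaultdict


def prune_non_terminal_leaves(tree_edges, terminals):
    """Remove non-terminal leaves of a tree until none remain.

    Returns the surviving edges as a set of frozensets {u, v}.
    """
    adj = defaultdict(set)
    for u, v in tree_edges:
        adj[u].add(v)
        adj[v].add(u)
    stack = [v for v in adj if len(adj[v]) == 1 and v not in terminals]
    while stack:
        v = stack.pop()
        if v not in adj or len(adj[v]) != 1:
            continue
        (u,) = adj.pop(v)          # the unique neighbor of the leaf v
        adj[u].discard(v)
        if len(adj[u]) == 1 and u not in terminals:
            stack.append(u)        # u has become a non-terminal leaf
    return {frozenset((u, v)) for u in adj for v in adj[u]}


# Tree with terminals 0 and 3; the branch 2-4-5 connects no terminals.
tree_edges = [(0, 1), (1, 2), (2, 3), (2, 4), (4, 5)]
steiner_edges = prune_non_terminal_leaves(tree_edges, {0, 3})
print(sorted(tuple(sorted(e)) for e in steiner_edges))
```

Because every node runs the same deterministic computation on the same tree, all nodes agree on which nodes are pruned.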

Overall Complexity and Correctness
In algorithm 5.1, after step 1, steps 2 and 3 can each be done within 2 rounds. Walking through (Nowicki, 2019) reveals that the MST for step 4 can be found in 54 rounds, with an additional 2 rounds sufficing for the pruning. Hence, the overall complexity remains dominated by Eq. (4.4), and the round complexity is $\tilde{O}(n^{1/4})$, which is faster than any known classical CONGEST-CLIQUE algorithm to produce an approximate Steiner tree of the same approximation ratio. However, as a consequence of the full complexity obtained in §4.7, the regime of $n$ in which this algorithm beats the trivial strategy of sending all information to a single node is also $n > 10^{18}$. For the same reason, the classical algorithm provided in (Saikia & Karmakar, 2019), which makes use of the APSP subroutine discussed in §4.7.2, has its complexity mostly characterized by Eq. (4.5), so that it provides an advantage over the trivial strategy only for $n > 10^{11}$. Our algorithm's correctness follows from the correctness of each step together with the correctness of the algorithm by (Kou et al., 1981) that implements these steps in a classical, centralized manner.

Directed Minimum Spanning Tree Algorithm
This section will be concerned with establishing Theorem 3.2 for the Directed Minimum Spanning Tree (DMST) problem, in definition 2.7. Like (Fischer & Oshman, 2021), we follow the algorithmic ideas first proposed by (Lovasz, 1985), implementing them in the quantum CONGEST-CLIQUE. Specifically, we will use $\log n$ calls to the APSP and routing tables scheme described in §4, so that in our case, we retrieve complexity $\tilde{O}(n^{1/4})$ and success probability $\left(1 - \frac{1}{\mathrm{poly}(n)}\right)^{\log n} = 1 - \frac{1}{\mathrm{poly}(n)}$. Before describing the algorithm, we need to establish some preliminaries and terminology for the procedures executed during the algorithm, especially the ideas of shrinking vertices into super-vertices and tracking a set $H$ of specific edges as first described in (Edmonds et al., 1967). We use the following language to discuss super-vertices and related objects.
is a partition of V , and each V * i is called a super-vertex. We will call a super-vertex simple if V * is a singleton. The corresponding minor G * := (V * , E * , W * ) is the graph obtained by creating edges ( Notably, we continue to follow the convention of an edge of weight ∞ being equivalent to not having an edge. We will refer to creating a super-vertex V * as contracting the vertices in V * into a super-vertex.

Edmonds' Centralized DMST Algorithm
We provide a brief overview of the algorithm proposed in (Edmonds et al., 1967), which presents the core ideas of the super-vertex-based approach. The following algorithm produces a DMST for G:

Edmonds DMST Algorithm
Input: An integer-weighted digraph and a root node r.
Output: A DMST for G rooted at r.
1. Initialize a subgraph $H$ with the same vertex set as $G$ by subtracting, for each node, the minimum incoming edge weight from all its incoming edges, and selecting exactly one incoming zero-weight edge for each non-root node of $G$. Set $G_0 = G$, $H_0 = H$, $t = 0$.
2. WHILE $H_t$ is not a tree:
(a) For each cycle of $H_t$, contract the nodes on that cycle into a super-vertex. Consider all non-contracted nodes as simple super-vertices, and obtain a new graph $G_{t+1}$ as the resulting minor.
(b) If there is a non-root node of $G_{t+1}$ with no incoming edges, report a failure. Otherwise, obtain a subgraph $H_{t+1}$ by, for each non-root node of $G_{t+1}$, subtracting the minimum incoming edge weight from all its incoming edges, and selecting exactly one incoming zero-weight edge for each non-root node, updating $t \leftarrow t + 1$.
3. Let $B_t = H_t$. FOR $k \in (t, t - 1, \dots, 1)$:
(a) Obtain $B_{k-1}$ by expanding the non-simple super-vertices of $B_k$ and selecting all but one of the edges of each previously contracted cycle of $H_k$ to add to $B_{k-1}$.
4. Return $B_0$.
Note that the edge weight modifications modify the weight of all directed spanning trees equally, so optimality is unaffected. In step 2., if Ht is a tree, it is an optimal DMST for the current graph Gt. Otherwise, it contains at least one directed cycle, so that indeed step 2. is valid. Hence, at the beginning of step 3., Bt is a DMST for Gt. Then the first iteration produces Bt−1 a DMST for Gt−1 since only edges of zero weight were added, and Bt−1 will have no cycles. The same holds for Bt−2, Bt−3, . . . , B0, for which B0 corresponds to the DMST for the original graph G. If the algorithm reports a failure at some point, no spanning tree rooted at r exists for the graph, since a failure is reported only when there is an isolated non-root connected component in Gt+1.
Note that in iteration t of step 2., H has one cycle for each of its connected components that does not contain the root node. Hence, the drawback of this algorithm is that we may apply up to O(n) steps of shrinking cycles. This shortcoming is remedied by a more efficient method of selecting how to shrink nodes into super-vertices in (Lovasz, 1985), such that only log n shrinking cycle steps take place.
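The two local operations that drive each iteration, reducing incoming weights with selection of one zero-weight edge per non-root node, and detecting the cycles of the selected subgraph H, can be sketched as follows. This is a centralized illustration only: the function names, the tie-breaking by smallest source ID, and the example digraph are assumptions, and the failure case (a non-root node without incoming edges) is not handled.

```python
def reduce_and_select(edges, root):
    """edges: {(u, v): w} for directed edges u -> v.

    Returns (reduced, parent): the reduced weights and, for each non-root
    node with an incoming edge, the source of its selected zero-weight edge.
    """
    best = {}
    for (u, v), w in edges.items():
        if v != root and (v not in best or (w, u) < best[v]):
            best[v] = (w, u)       # ties broken by smallest source ID (assumed)
    reduced = {(u, v): (w - best[v][0] if v != root else w)
               for (u, v), w in edges.items()}
    parent = {v: u for v, (w, u) in best.items()}
    return reduced, parent


def find_cycles(parent, root):
    """Cycles of the subgraph H in which every node has one selected in-edge."""
    state, cycles = {}, []
    for start in parent:
        path, v = [], start
        while v is not None and v != root and v not in state:
            state[v] = "open"
            path.append(v)
            v = parent.get(v)
        if v is not None and state.get(v) == "open":
            cycles.append(path[path.index(v):])   # walk closed on itself
        for u in path:
            state[u] = "done"
    return cycles


edges = {(0, 1): 5, (2, 1): 1, (1, 2): 1, (0, 2): 4, (2, 3): 2}
reduced, parent = reduce_and_select(edges, root=0)
cycles = find_cycles(parent, root=0)
print(reduced, parent, cycles)
```

In the example, nodes 1 and 2 select each other's edges and form the single cycle that would be contracted into a super-vertex in the next step.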

Lovasz' Shrinking Iterations
We devote this subsection to discussing the shrinking step of (Lovasz, 1985) that will be repeated $\log n$ times in place of step 2. of Edmonds' algorithm to obtain Lovasz' DMST algorithm.
Lovasz' Shrinking Iteration LSI
Input: A directed, weighted graph $G = (V, E, W)$ and a root node $r \in V$.
Output: Either a new graph $G^*$, or a success flag and a DMST $H$ of $G$.
1. If there is a non-root node of $G$ with no incoming edges, report a failure. Otherwise, for each non-root node of $G$, subtract the minimum incoming edge weight from all its incoming edges. Select exactly one incoming zero-weight edge for each non-root node to create a subgraph $H$ of $G$ with those edges.
2. Find all cycles of $H$, and denote them $H_1, \dots, H_C$. If $H$ has no cycles, abort the iteration and return (SUCCESS, $H$). For $j = 1, \dots, C$, find the set $V_j$ of nodes that dipaths in $H$ from $H_j$ can reach.
3. Compute the All-Pairs-Shortest-Path distances in $G$.
5. Create a minor $G^*$ by contracting each $U_j$ into a super-vertex $U^*_j$, considering all other vertices of $G$ as simple super-vertices $V^*_1, \dots, V^*_k$. For each vertex $N^*$ of $G^*$, let the edge weights in $G^*$ be: for all the simple super-vertices $V^*$ of $G^*$.
To summarize these iterations: The minimum-weight incoming edge of each node is selected. That weight is subtracted from the weights of every incoming edge to that node, and one of those edges with new weight 0 is selected for each node to create a subgraph H. If H is a tree, we are done. Otherwise, we find all cycles of the resulting directed subgraph, then compute APSP and determine the Vj, Uj, and βj, which we use to define a new graph with some nodes of the original G contracted into super-vertices.
The main result for the DMST problem in (Lovasz, 1985) is that replacing (a) and (b) of step 2. in the Edmonds DMST Algorithm, taking the new H obtained at each iteration to be Ht+1 and the G * to be Gt+1, leads to no more than ⌈log n⌉ such shrinking iterations needed before a success is reported.

Quantum Distributed Implementation
Our goal is to implement the Lovasz iterations in the quantum distributed setting in $\tilde{O}(n^{1/4})$ rounds by making use of the quantum APSP of §4. In the distributed setting, processor nodes cannot directly be shrunk into super-vertices. As in (Fischer & Oshman, 2021), we reconcile this issue by representing the super-vertex contractions within the nodes through soft contractions.
First, note that a convenient way to track what nodes we want to consider merging into a super-vertex is to keep a mapping sID : V → S, where S is a set of super-vertex IDs, which we can just take to be the IDs of the original nodes. We will refer to a pair of (G, sID) as an annotated graph. An annotated graph naturally corresponds to some minor of G, namely, the minor obtained by contracting all vertices sharing a super-vertex ID into a super-vertex.
Definition 6.2 (Soft Contractions). For an annotated graph $(G, sID)$, a set of active edges $H$, an active component $H_i$ with corresponding weight modifiers $\beta_i$, and a subset $A \subset S$ of super-vertices, the soft contraction of $H_i$ in $G$ is the annotated graph $(G_{H_i}, sID')$ obtained by taking $W_{uv}$ otherwise and updating the mapping $sID$ to $sID'$ defined by $sID'(v) = sID(v)$ for all $v \notin A$, and $sID'(v) = \min\{sID(u) : u \in A\}$ for $v \in A$.
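The bookkeeping side of a soft contraction, namely relabeling the merged super-vertices with their smallest ID and reading off the super-vertices of the corresponding minor, can be sketched as below. The sketch reads A as a set of super-vertex IDs to be merged, which is one interpretation of the definition; the weight update is omitted, and both function names are hypothetical.

```python
def soft_contract_ids(sid, merged):
    """Relabel every vertex whose super-vertex ID lies in `merged`.

    sid    : dict mapping vertex -> current super-vertex ID
    merged : set of super-vertex IDs contracted together (the set A)
    """
    new_id = min(merged)
    return {v: (new_id if s in merged else s) for v, s in sid.items()}


def super_vertices(sid):
    """Group vertices by super-vertex ID, i.e. the vertex set of the minor."""
    groups = {}
    for v, s in sid.items():
        groups.setdefault(s, set()).add(v)
    return groups


sid0 = {0: 0, 1: 1, 2: 2, 3: 3}            # identity annotation
sid1 = soft_contract_ids(sid0, {1, 3})     # merge super-vertices 1 and 3
print(sid1)
print(super_vertices(sid1))
```

No processor is removed from the network: only the annotation changes, which is what allows the contraction to be simulated in the distributed setting.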

Quantum Distributed Lovasz' Iteration
We provide here a quantum distributed implementation of Lovasz' iteration that will form the core of our DMST algorithm.  (Fischer & Oshman, 2021, §7) or the Unpacking procedure in §8.3 of the appendix. All messages in the algorithm other than those for computing the APSP in QDLSI may be classical. We provide here the full algorithm for completeness:

Quantum DMST Algorithm
Input: An integer-weighted digraph and a root node r.
Output: A DMST for G rooted at r.
1. Initialize a subgraph $H$ with the same vertex set as $G$ by subtracting, for each node, the minimum incoming edge weight from all its incoming edges, and selecting exactly one incoming zero-weight edge for each non-root node of $G$. Set $t = 0$, $H_0 = H$, and $G_0 = G$ with annotations $sID_0$ to be the identity mapping.
3. Let Tt := Ht. For k = t, . . . , 1: For each super-vertex of the k th iteration of QDLSI applied, simultaneously run the Unpacking procedure with input tree T k to obtain T k−1 .
4. Return T0 as the distributed minimum spanning tree.

Complexity
In the QDLSI, all steps other than the APSP step 3 of the quantum Lovasz iteration can be implemented within 2 rounds. In particular, to have all nodes know some tree on $G$ for which each node knows its parent, every node can simply broadcast its parent edge and weight. Since this iteration is used up to $\lceil \log(n) \rceil$ times and expanding the DMST at the end of the algorithm also takes logarithmically many rounds, we obtain a complexity of $\tilde{O}(n^{1/4})$ dominated by the APSP computation, a better asymptotic rate than any known classical CONGEST-CLIQUE algorithm. However, beyond the $\tilde{O}$, the complexity is largely characterized by $\log(n) \cdot f(n)$, with $f(n)$ as in Eq. (4.4). In order to have $\log(n) f(n) < n$ to improve upon the trivial strategy of having a single node solve the problem, we then need $n > 10^{21}$. Using the classical APSP in place of the quantum APSP of §4, as done in (Fischer & Oshman, 2021) to attain the $\tilde{O}(n^{1/3})$ complexity in the cCCM, one would need $\log(n) \cdot g(n) < n$ to beat the trivial strategy, with $g$ as in Eq. (4.5), which requires $n > 10^{14}$.

Discussion and Future Work
We have provided algorithms in the Quantum CONGEST-CLIQUE model for computing approximately optimal Steiner Trees and exact Directed Minimum Spanning Trees that use asymptotically fewer rounds than their known classical counterparts. As Steiner Trees and Minimum Spanning Trees cannot benefit from quantum communication in the CONGEST (non-clique) model, the algorithms reveal how quantum communication can be exploited thanks to the CONGEST-CLIQUE setting. A few open questions remain as well. In particular, there exist many generalizations of the Steiner Tree problem, so these may be a natural starting point to attempt to generalize the results. A helpful overview of Steiner-type problems can be found in (Hauptmann & Karpinski, 2015). Regarding the DMST, it may be difficult to generalize a similar approach to closely related problems. Since the standard MST can be solved in a (relatively small) constant number of rounds in the classical CONGEST-CLIQUE, no significant quantum speedup is possible. Other interesting MST-type problems are the bounded-degree and minimum-degree spanning tree problems. However, even the bounded-degree decision problem on an unweighted graph, "does G have a spanning tree of degree at most k?", is NP-complete, unlike the DMST, so we suspect that other techniques would need to be employed. (Dinitz, Halldorsson, Izumi, & Newport, 2019) provides a classical distributed approximation algorithm for the problem. Additionally, we have traced many constants and log factors throughout our description of the above algorithms, which, as shown, would need to be significantly improved for these and related algorithms to be practical. Hence, a natural avenue for future work is to work towards such practical improvements.
Beyond the scope of the particular algorithms involved, we hope to help the community recognize the severity with which the practicality of algorithms is affected by logarithmic factors that may be obscured by $\tilde{O}$ notation, and thus encourage fellow researchers to present the full complexity of their algorithms beyond asymptotics. Particularly in a model like CONGEST-CLIQUE, where problems can always be solved trivially in $n$ rounds, these logarithmic factors should clearly not be taken lightly. Further, a question of potential practical interest is the following: What algorithms solving the discussed problems are the most efficient with respect to rounds needed in the CONGEST-CLIQUE in the regimes of $n$ in which the discussed algorithms are impractical?
(ii) Next, gives us which proves the claim.

The α > 0 case
The strategy will be to assign each v (i,j,k) ∈ V into classes in accordance with approximately how many negative triangles are in Ui × Uj × U ′ k before starting the search. To assign each node to a class, we use the routine IdentifyClass of (Izumi & Gall, 2019), also described in the main text.
The main body of this paper discussed the special case assuming $\alpha = 0$. Hence, we now consider the $\alpha > 0$ case. For each $\alpha \in \mathbb{N}$, let us denote by $c_{i,j,k}$ the smallest nonnegative integer $c$ satisfying $d_{i,j,k} < 10 \cdot 2^{c} \log n$, and for any $i, j \in [n^{1/4}]$. Notably, $P(i, j)$ contains at most $\sqrt{n}$ edges, so that $d_{i,j,k} \le \sqrt{n}$ as well. Hence, $c = \frac{1}{2}\log n$ provides an upper bound for the minimum in step 2. The important immediate consequence is that we only need to consider $V_\alpha$ up to at most $\alpha = \frac{1}{2}\log n$.
(iii): For $\alpha > 0$ and $v_{(i,j,k)} \in V_\alpha$, we have $2^{\alpha - 3} n \le |\Delta(i, j, k)| \le 2^{\alpha + 1} n$.
This provides an adapted version of lemma 4.12 for the α > 0 case.
The following lemma provides a tool that will allow for "duplication" of information to avoid message congestion in the network in the EvaluationB procedure.
Proof. The $\alpha = 0$ case is immediate since $|U'| = \sqrt{n}$, so consider $\alpha \ge 1$. The "promise" in the FEWP subroutine we are in guarantees that for all $(u, v) \in S$, $\Gamma(u, v) \le 90 \log n$, so that for any $i, j \in [n^{1/4}]$, each edge in $P(U_i, U_j) \cap S$ has at most $90 \log n$ other nodes forming a negative triangle with it, leading to the inequality $\sum_{k : v_{(i,j,k)} \in V_\alpha} |\Delta(i, j, k)| \le 90\, n^{3/2} \log n$.
We now describe the implementation of step 3 of the ComputePairs procedure for the α > 0 case.

3.2: For each α:
For every l ∈ [m], every node v (i,j,k) executes a quantum search to find whether there is a U ′ k ∈ Vα[Ui, Uj] with some w ∈ U ′ k forming a negative triangle (u k l , v k l , w) in G, and then reports all the pairs u k l v k l for which such a U ′ k was found.
The $\alpha = 0$ case was described in the main text. We proceed to describe the classical procedure for invoking theorem 4.11 to obtain the speedup for the general $\alpha$ case, as in (Izumi & Gall, 2019, §5.3.2). Some technical precautions must be taken to avoid congestion of messages between nodes. This crucially relies on information duplication to effectively increase bandwidth between nodes. Lemma 8.2 provides a strong bound for the size of each $V_\alpha$. For this duplication of the information stored by the relevant nodes, a new labeling scheme is convenient. Suppose for simplicity that $C_\alpha := 2^{\alpha}/(720 \log n)$ is an integer, and assign each node a label $(u, v, w, y) \in V_\alpha \times [C_\alpha]$, which is possible due to the bound of lemma 8.2. The following procedure EvaluationB, implementable in $O(\log n)$ rounds (using a slightly sharper complexity analysis than (Izumi & Gall, 2019)), can then be used for invoking theorem 4.11:

EvaluationB
Input: A list $(w^k_1, \dots, w^k_m)$ of elements of $V_\alpha[u, v]$ assigned to each node $k = (u, v, x)$.
Promise: $|L^k_w| \le 800 \cdot 2^{\alpha} \sqrt{n} \log n$ for each node $k$ and all $w \in V_\alpha[u, v]$.
Output: Every node $k = (u, v, x)$ outputs for each $\ell \in [m]$ whether some $w \in w^k_\ell$ forms a negative triangle $\{u^k_\ell, v^k_\ell, w\}$.
0. Every node $(u, v, w) \in V_\alpha$ broadcasts the edge information loaded in step 1 of ComputePairs to $(u, v, w, y)$ for each $y \in [C_\alpha]$.
1. Every node $(u, v, x)$ splits each $L^k_w$ into smaller sublists $L^k_{w,1}, \dots, L^k_{w,C_\alpha}$ for each $w$, with each sublist containing up to $\lceil |L^k_w| / C_\alpha \rceil = \lceil 800 \cdot 720 \sqrt{n} \log^2 n \rceil$ elements, and sends each $L^k_{w,y}$ to node $(u, v, w, y)$ along with the relevant edge weights.
2. Every node $(u, v, w, y)$ returns the truth value of $\min_{w \in w}\{W_{uw} + W_{wv}\} \le W_{vu}$ to node $k$ for each $uv \in L^k_{w,y}$ received in step 1.

For each value of $\alpha$, we separately solve step 3.2 of the ComputePairs procedure. Since lemma 8.2 tells us that there are $C_\alpha$ times more nodes not in $V_\alpha$ than there are in $V_\alpha$, every node in $V_\alpha$ can use $C_\alpha$ of those nodes not in $V_\alpha$ to relay messages and effectively increase its message bandwidth, which is exactly what EvaluationB takes advantage of. Steps 1 and 2 of the procedure take up to $2 \cdot \lceil |L^k_w| \rceil / n \le 1600 \cdot \log n$ rounds, since lists of size $\lceil |L^k_w| / C_\alpha \rceil$ are sent to $C_\alpha$ nodes, and the bound on $\alpha$ gives $\lceil |L^k_w| \rceil \le 800\, n \log(n)$.
part of ζ, which can then add the appropriate edge to Ti, needing yet another round, so that step 2 can be done in three rounds of classical communication only.
Step 3 is handled similarly. For the outgoing edge, each node in V * H i ,β i sends W G i vu to the other nodes in V * H i ,β i so that the appropriate edge to add to Ti can be determined (in case of a tie, the node with smaller ID can be the one to add the edge), so this can be done in one round. For step 4, every node in ζ notifies its neighbors that it is in ζ, after which every node can determine which edges to add to Ti. For the unpacking of V * H i ,β i , the information and communication for implementing its unpacking is contained in the nodes of V * H i ,β i , so we can indeed unpack all vertices synchronously to obtain Gi even when multiple super-vertices were contracted to get Gi+1. Hence, one layer of unpacking using this procedure can be implemented in 5 rounds (making use of the APSP and routing table information computed earlier before the contractions in QDLSI). Since there are at most ⌈log n⌉ contraction steps, the unpacking procedure can be implemented in 5 · ⌈log n⌉ rounds.

Information access
In remark 2.2, we mention that it suffices for all information regarding the input graph to be stored classically, with quantum access to it. Here, we expand on what we mean by that and refer the interested reader to (Booth, O'Gorman, Marshall, Hadfield, & Rieffel, 2021) for further details.
While our algorithms use quantum subroutines, the problem instances and their solutions are encoded as classical information. The required quantum access refers to the ability to access the classical data so that computation in superposition of this data is possible. For instance, in the standard (non-distributed) Grover search algorithm, with a problem instance described by a function $g : X \to \{0, 1\}$, we need the ability to apply the unitary $U_w|x\rangle = (-1)^{g(x)}|x\rangle$ to the equal superposition state $|s\rangle = \frac{1}{\sqrt{N}} \sum_{x=0}^{N-1} |x\rangle$ over the $N$ basis states (held on $\lceil \log N \rceil$ qubits). This unitary is also referred to as the "oracle", and a call to it as a "query". If we wish to use the distributed Grover search in example 2.4, in which the node $u$ leading the search tries to determine whether each edge $uv$ incident on it is part of a triangle in graph $G$, the unitary that node $v$ must be able to evaluate is the indicator function of its neighborhood, and $u$ must be able to apply the Grover diffusion unitary restricted to its neighborhood. Then, after initializing the equal superposition, nodes $u$ and $v$ can send a register of qubits back and forth between each other, with $v$ evaluating the unitary corresponding to the indicator of its neighborhood and $u$ applying the Grover diffusion operator restricted to its neighborhood. The same ideas transfer over to a distributed quantum implementation of the EvaluationA (or EvaluationB) procedure. There, instead of evaluating unitaries corresponding to indicators, in step 2, each node $v_{(i,j,k)}$ evaluates the unitary corresponding to the truth values of inequality 4.3 for the evaluation steps. That information is then returned to the node that sent it, which can then apply the appropriate Grover diffusion operator.
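The interplay of oracle and diffusion described above can be illustrated with a small classical simulation of the amplitude vector for the standard, non-distributed Grover search. This is a sketch only: the function name, the problem size N = 8, and the marked element are arbitrary choices, and no quantum hardware or quantum library is involved.

```python
import math


def grover_success_probability(num_items, marked, iterations):
    """Simulate Grover iterations on real amplitudes over num_items basis states."""
    amp = [1 / math.sqrt(num_items)] * num_items        # equal superposition |s>
    for _ in range(iterations):
        # Oracle: flip the sign of the amplitudes x with g(x) = 1.
        amp = [-a if x in marked else a for x, a in enumerate(amp)]
        # Diffusion 2|s><s| - I: reflect every amplitude about the mean.
        mean = sum(amp) / num_items
        amp = [2 * mean - a for a in amp]
    return sum(amp[x] ** 2 for x in marked)


N = 8
k = math.floor(math.pi / 4 * math.sqrt(N))    # about (pi/4) sqrt(N) iterations
p = grover_success_probability(N, {5}, k)
print(k, round(p, 4))
```

In the distributed version, the oracle line corresponds to the step performed by the node holding the data and the diffusion line to the step performed by the node leading the search, with the register sent back and forth between them once per iteration.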
In general, quantum random access memory (QRAM) is the data structure that allows queries to the oracle. We can use circuit QRAM in our protocols or could make use of special-purpose hardware QRAM if it were to be realized. This choice does not affect the number of rounds of communication but would affect the efficiency of computation at each node. A main component of the distributed algorithms discussed in this work is quantum query access for each node to its list of edges and their weights in some graph G. This information is stored in memory, and the QRAM implementing the query to retrieve it can be called in time O(log n), resulting in a limited overhead for our algorithms. This retrieval of information takes place locally at each node; hence, this overhead does not add to the round complexity of our algorithms in the CONGEST-CLIQUE setting. We refer to (Giovannetti, Lloyd, & Maccone, 2008) for more details on QRAM.