Efficient decomposition of unitary matrices in quantum circuit compilers

Unitary decomposition is a widely used method to map quantum algorithms to an arbitrary set of quantum gates. Efficient implementation of this decomposition allows for translation of bigger unitary gates into elementary quantum operations, which is key to executing these algorithms on existing quantum computers. The decomposition can be used as an aggressive optimization method for the whole circuit, as well as to test part of an algorithm on a quantum accelerator. For selection and implementation of the decomposition algorithm, perfect qubits are assumed. We base our decomposition technique on Quantum Shannon Decomposition which generates O((3/4)*4^n) controlled-not gates for an n-qubit input gate. The resulting circuits are up to 10 times shorter than other methods in the field. When comparing our implementation to Qubiter, we show that our implementation generates circuits with half the number of CNOT gates and a third of the total circuit length. In addition to that, it is also up to 10 times as fast. Further optimizations are proposed to take advantage of potential underlying structure in the input or intermediate matrices, as well as to minimize the execution time of the decomposition.


I. INTRODUCTION
Quantum computing is promising to provide the next phase of performance improvement for large scale computing.To this end, many different algorithms have been developed in the theoretical domain, such as Shor's algorithm for prime factorization in polynomial time [1], or Grover's algorithm for finding a specific input corresponding to some output in √ N time [2].
Recent years have seen some big strides in the field of physical implementations of quantum computers as well.However, these still have some big limitations on the number of qubits, the error rates and the length of the circuits that can be executed on them.Although quantum computers with as many as 128 qubits already exist [3], error-rates are of the order 10 −2 − 10 −3 per gate [4].Therefore, to execute a circuit on a physical quantum chip, requires significant error-correction, as well as mapping, scheduling and other such measures [5].
In the meantime, many high-level quantum programming languages are being developed.These can be used to write algorithms for future quantum computers, without the strict limits imposed by the current state of physical qubits.
These algorithms are executed on simulators, which comes with its own set of restrictions.Some simulators require the use of specific qubit topology, limit possible qubit states or the number of qubits, and all of them are bound by the classical resources of the system the simulation is run on.The main resource limit is the memory necessary to store the quantum circuit and the total qubit state, which is dependent on the length of the circuit, the number of qubits and the degree of superposition.These also influence the processing time necessary to simulate the full circuit, which is generally done by some form of matrix multiplications of the qubit state and each gate in the circuit.
Unitary Decomposition is the process of translating an arbitrary unitary1 gate into a specific (universal) set of single and two-qubit gates.Unitary decomposition is necessary because it is not otherwise possible to execute an arbitrary quantum gate on either a simulator or quantum accelerator.This makes it a required feature for algorithms that use any type of gate that is not supported by the target platform, or just produce an arbitrary unitary gate that will need to be translated.
This paper proposes a highly-efficient method to implement unitary decomposition for quantum algorithms using the Quantum Shannon Decomposition.The paper shows that our approach is up to 10x more efficient in terms of the number of gates generated for a given unitary matrix size, and requires up to a 100 times less wall-clock execution time than other implementations.The contributions of this paper are as follows: • Implementation of Quantum Shannon Decomposition for unitary decomposition of quantum algorithms.• Decomposition optimizations that take advantage of the underlying matrix structure.
• Integration and evaluation of our method in the OpenQL quantum programming framework.• Optimizing the implementation of quantum genome analysis use-case using our method This paper is structured as follows.In Section II, applications for unitary decomposition are discussed.Then, in Section III, some background is given on qubits, gate-based computation and the special qubit gates that will be used.The specific decomposition method for multi-controlled gates is given in Section IV.In Section V several decomposition algorithms are compared based on their resulting CNOT-count.The implementation of the selected algorithm, Quantum Shannon Decomposition, is outlined in Section VI.Optimizations to this implementation can be found in Section VII.Experimental results are shown in Section VIII, and compared to other implementations in Section IX.Finally the conclusion and future work can be found in Section X.

II. MOTIVATION FOR UNITARY DECOMPOSITION
Unitary decomposition is useful in several contexts.The first is the broad class of algorithms that generate arbitrary unitary gates that need to be translated into a quantum circuit.But also to enable more modular design of quantum algorithms or as an aggressive optimization method.
We will use two quantum algorithms that we have developed in the context of genome sequencing as an example of a possible application for unitary decomposition.With genome sequencing, a genome sequence is first read as many short pieces, which then need to be combined to get the full DNA sequence.This is currently done using many different algorithms, which are executed using (classical) high performance computing systems [7].
For genome sequencing using quantum accelerators, the DNA sequences can be stored in superposition.The two algorithms that will be discussed both use a unitary matrix in the process of finding the position of a short read (sequence of a small piece of DNA) on a reference genome.That matrix needs to be decomposed before the algorithm can be run on a quantum accelerator or simulator [8].
The first quantum genome sequencing algorithm we will use is Quantum indexed Bidirectional Associative Memory (QiBAM) [8] (Eq.( 1)), which uses a unitary oracle assembled from a binomial distribution (Eq.( 2)).Here, γ is a factor which influences the width of the distribution, h(p, x) is the Hamming distance between the query pattern (p) and all memory states (x), and d is the number of qubits required to store the memory states.d is also the size of the vector and resulting matrix.
The second genome sequencing algorithm is Quantum Associative Memory (QAM).This uses a Hadamard-like transformation to store the patterns, assembled using Eq.(3) [9].
In order to apply either gate from these two algorithms to qubits, they need to be translated into some combination of (elementary) quantum gates that can be executed on a quantum accelerator.And the same goes for other such algorithms.
Besides that, unitary decomposition also facilitates shortcuts in the design of new algorithms.With unitary decomposition, a developer can keep part of an algorithm as a unitary gate/matrix while working on some other part and test this.Otherwise, the algorithm can only be executed in full when all of it is made out of known quantum gates.Unitary decomposition allows the full algorithm to be tested and checked much earlier in the development process on the target quantum chip or simulator.
Furthermore, unitary decomposition can be used as an aggressive optimization method, because the maximum number of gates resulting from a decomposition can be calculated easily beforehand.The maximum length of the circuit resulting from the decomposition is only dependent on the number of qubits affected by the gate.For circuits longer than this maximum, so consisting of more gates, assembly of all gates into a unitary matrix and then decomposing that matrix will always result in a shorter circuit.
Someone programming in OpenQL might, for example, specify a circuit with three qubits with 180 gates, this might be because of application semantics, code-readability or because they did not consider the optimal way to program their quantum algorithm.180 gates is more than the gates that would result from decomposing an arbitrary 3-qubit gate.So if the circuit is combined into a single unitary matrix and then that matrix is decomposed using Shannon Decomposition for example, then the length of the circuit will have gone from 180 gates to only 120 (84 rotation gates and 36 CNOT gates).
Something to consider, however, is that the circuit resulting from the decomposition of a unitary matrix is longer than the theoretical minimum.And even the theoretical minimum number of gates for a general n-qubit unitary gate becomes quite large very quickly, since it scales with 4 n−1 in the leading term.So in most cases, a hand-optimized and application specific circuit will be shorter than the one resulting from universal unitary decomposition.But these hand-optimized circuits are labour-intensive and require a significant amount of time to develop, while unitary decomposition can be done automatically.

III. BACKGROUND
In this section, background and notation will be given for qubits, quantum gates, unitary matrices, the universal set of gates that will be used, quantum multiplexers and multicontrolled gates.

A. Qubit notation
A qubit state is represented in braket notation as: Besides the | notation, quantum states can also be represented as complex vectors: T .This is especially useful for the combined state of multiple qubits, where the first row of the vector corresponds to the binary number "0" in as many bits as there are qubits.The second row corresponds to the number "1", etc.As an example, for a three qubit state the first row corresponds to |000 , and the second to |001 .This continues to the final row, which is |111 .
The state vector has 2 n rows for the state of n qubits.

B. Quantum gates
Qubits are manipulated using gates, which are matrices that operate on the qubit state vector.To calculate the effect of gates on the combined qubit state, the state vector is multiplied by the matrix representations of the gates in reverse order.
In the circuit notation, each line going into or out of a gate represents one qubit.To represent n-qubit gates, so gates that affect an unspecified number of qubits, a line with a backslash through it is used.

C. Unitary matrices
The matrix representation of a (pure) quantum gate always corresponds to a unitary matrix, which is why the decomposition method in this paper is called unitary decomposition.
Unitary matrices are written as U (2 n ), which means a 2 n ×2 n matrix corresponding to an n-qubit gate, which has the following properties [6]: The Special Unitary group, SU is a subgroup of unitary matrices where: Unitary matrices in the Special Unitary group are written as SU (2 n ).
When a measurement is performed, the global phase (Φ) of the qubits does not influence the measurement probabilities.This means that all quantum gate operations can be represented by a matrix in SU (2 n ) [11].These properties will be used to decompose the matrix, using one of the algorithms described in Section V.

D. Universal set of gates
In order to decompose all possible unitary matrices into quantum gates, a universal gate set is selected.This means the decomposition will result in circuits with (only) the following three gates: rotations around the Z and Y axis by an arbitrary angle, the R z (θ) and R y (θ) gates, and the controlled not, the CNOT gate.The matrices for these are shown in Eqs. ( 5) to (7).

E. Quantum multiplexers
Besides these conventional gates, there are several gates used in this paper as intermediate results for the various decomposition algorithms.
The first is the quantum multiplexer, which corresponds to a unitary matrix corresponding with the following structure Eq. (8).
Here, U (2 n ) denotes a unitary gate over n qubits, which is a unitary matrix of 2 n rows and 2 n columns.U 0 (2 n−1 ) and U 1 (2 n−1 ) are both (n-1)-qubit gates.The rest of the matrix of U is zero.The gate is uniformly controlled, which means that when the control is 0, the upper left (U 0 ) of the matrix affects the qubits.But when the control is 1, the lower right gate (U 1 ) gets applied.In a circuit, this looks like: The first line is the controlling qubit, and the lower line is the rest (n-1) of the qubits.The box with the line to the lower gate means that it is uniformly controlled.

F. Multi-controlled (rotation) gates
Another common intermediate gate is the multi-controlled (rotation) gate.This is a 1-qubit gate with k control bits.Rather than just applying a gate when all control bits are zero, the applied operation to the target qubit can be different for each of the 2 k possible classical values of the control qubits.This is written as F k m (U (2)), which is a fully or multicontrolled U(2) gate with k control qubits, with the target qubit at position m.The circuit representation of this gate is shown in Fig. 1.To indicate that an operation is applied for either state of the control bits, a square control box is used.
These multi-controlled gates correspond to a (block) diagonal unitary matrix, which is why they show up frequently in decomposition schemes.This is shown in Eq. ( 9).
A multi-controlled rotation gate around axis a corresponds to the matrix shown in Eq. ( 10).This can be any axis, but in the paper mainly the multi-controlled R y and R z will be used.

IV. DECOMPOSING MULTI-CONTROLLED ROTATION GATES
The multi-controlled rotation gates from Section III-F can be decomposed into a combination of CNOTs and regular rotation gates.This can be done using the method from [12], which results in 2 k CNOTs gates and 2 k 1-qubit rotation gates for a controlled rotation gate with k control bits.To get from an This can be extended until only CNOT gates and 1-qubit rotation gates are left, which leads to an example decomposition of a rotation gate with 3 control bits as shown in Fig. 3.
Fig. 3: Decomposition of an To directly calculate which qubit is the control bit for each CNOT, can be determined using Gray code.This is shown in the table below the circuit.The number of the bit that gets changed in the Gray code is the number of the qubit that will be the control bit.
For each control bit of the multi-controlled gate, a 1-qubit rotation gate and a single CNOT is used, so for the total decomposition of an F k m -gate requires 2 k rotation gates and CNOTs [12].This is the least-known number of gates for decomposing such a matrix, and is therefore used in almost all decomposition methods for (block) diagonal matrices of this form.

METHODS
In this section, first the selection criteria for the various decomposition methods will be outlined in Section V-A.Then, the theoretical lower bounds for the number of gates resulting from decomposition is given in Section V-B, with implementations for a 1-and 2-qubit gate in Sections V-C and V-D.This is followed by an examination of various general decomposition methods from literature in Sections V-E to V-H and finally the selection in Section V-I.

A. Selection criteria
Quantum computers are currently limited by the error-rates and decoherence of qubits [4].And the longer the circuit, the higher the chance of errors will become.So therefore the selection will be based on circuit length, although the decomposition algorithm will for now only be tested with perfect qubits on a simulator.
For all decomposition methods, the number of gates resulting from the decomposition is only dependent on the number of qubits affected by the unitary gate.So for generic n-qubit unitary gates, the resulting circuit length can be calculated from the size of the input matrix.
To measure the length of the resulting circuit, the number of CNOT gates will be used.There are several reasons for that.The first is that not all papers distinguish between generic 1qubit gates and rotation gates.The decomposition of a generic 1-qubit gate takes three rotation gates (see Section V-C) so the comparison might be a factor 3 off if 1-qubit gates are used to judge circuit length.The CNOT gate is used as the result for all decomposition methods, and always has the same definition.This makes it a good metric for the total circuit length.
Secondly, each CNOT can generate entangled states between qubits [13].And for execution of the circuit on (nearterm) quantum devices, each CNOT between non-neighboring qubits might introduce additional mapping operations [5].So to reduce mapping in the future, a circuit with as few CNOTs as possible is desired.
Thirdly, the error-rates for two-qubit gates are currently considerably higher than for 1-qubit gates [4].So the chance that an error occurs in a circuit becomes much bigger with more CNOTs.So to make the decomposition feasible for nearterm quantum applications, it is not only important to keep the circuit-length low, but especially the CNOT count.

B. Theoretical lower bounds
There is a theoretical lower bound for the number of CNOTs resulting from the decomposition of an n-qubit gate, and it is mathematically proven to be 1  4 (4 n − 3n − 1) [14].There are implementations that reach this number for 1 and 2 qubit gates [14], which will be outlined in the next sections.This lower limit is included in the comparison, because it is useful to keep in mind what is and is not possible in terms of algorithms for unitary decomposition.

C. ZYZ decomposition
For a 1-qubit gate, no CNOT gates are necessary.And if rotation gates around any axis are possible, only one such gate is needed to apply any 1-qubit operation.But when using standard elementary gates, such as rotations around the Pauli X, Y or Z-axis, the decomposition of an arbitrary 1-qubit gate results in 3 rotation gates using ZYZ decomposition [14].
One way to do this is with two rotation-z gates and one rotation-y gate.For this decomposition, the angles Φ, α, β, γ can be found so that the following equation is satisfied: These angles can be calculated using the eigenvalues of the matrix, and are used in the circuit shown in Fig. 4.This is a universal decomposition for a 1-qubit SU (2) gate [14].

D. Minimal decomposition of 2-qubit gates
From the theoretical lower bounds we know that at least 2.25 CNOT gates are needed for a 2-qubit gate.This rounds up to 3 CNOTs, and a circuit that achieves that number is shown in Fig. 5 [14].
To obtain the values for the gates of this circuit, first angles α, β and δ are found as in the ZYZ decomposition (V-C).These are used to make circuit v so that: With v as the circuit: Fig. 6: The circuit v, used to construct a universal 2-qubit gate [14].
Then, to get the 1-qubit gates, first matrix A ∈ SO(4) can be found so that AU U T A † is diagonal 2 .Through more diagonalization, B ∈ SO(4) can be found so AU U T A T = Bvv T B T and matrix C as C = v † B T AU ∈ SO(4).This leads to A T BvC = U , and because A, B and C are in the special orthogonal group they can be implemented by two unitary gates.After combining A T and B, the four gates can be found as: [14] Which gives the circuit in Fig. 5.The four 1-qubit gates can be implemented by three rotation gates each, through ZYZ decomposition, so that the total rotation count is 4 • 3 + 3 and the total CNOT count is just the ones for the circuit v, so 3.This matches the theoretical lower bounds for an arbitrary 2-qubit gate.

E. Decomposition through unentangeling of qubits
The first of the general decomposition methods uses consecutive unentangling of qubits, since one of the ways to specify an n-qubit gate is by its effect on the base vectors.This technique from [15] implements the correct behavior for each vector iteratively using a method similar to QR decomposition, which leaves previous assessed vectors preserved.
To get here, the qubit state vector is divided into 2 n−1 vectors of each 2 elements, which are labeled |ψ j for j = 0, . . ., 2 n−1 − 1.For each of these vectors, Eq. ( 16) is used to determine r j , t j , φ j and θ j .
So that: This corresponds to a circuit with a multi-controlled R y and R z , which is used to unentangle the last qubit.The new qubit vector is assembled as in Eq. (18).The circuit to translate |φ into |φ |0 will be called E k as in Eq. (19).
This circuit is implemented with the multi-controlled R y and R z gate, as is shown in Fig. 7.
This method can be applied recursively to map an n-qubit state φ to a scalar multiple of a bit-string |b , which uses 2 n+1 − 2n CNOT gates.
Using this method to decompose unitary gates requires 2 n−1 of these state preparation steps.At each step, a circuit C j is found that maps a unitary gate U j to a scalar multiple of |j so that U j+1 = C j U j .The final product U 2 n −1 will be a diagonal gate D, which can be implemented with a multi-controlled R Z gate, so that CNOT gates, and with the final diagonal gate this leads to a total of 2 • 4 n − (2n + 3) • 2 n + 2n CNOT gates [15].

F. Decomposition with Givens rotations
In [16] a method of decomposition is described that uses the Givens rotation matrices to do QR factorization of a unitary matrix.Each Givens rotation nullifies the element on the i t h column and j t h row of a U (2 n ) matrix, as: The modified elements of U are indicated with a tilde, and the element on the lower left u n,1 is nullified by the Givens rotation.Each Givens rotation matrix is equal to the identity matrix except for c = cos(θ) and s = sin(θ) for the elements at positions (i, i), (i, j), (j, i) and (j, j), with θ the angle of the Givens rotation.These are to nullify elements until all elements below the diagonal are zero, at which point the following equality holds: [16]   By reordering the base vectors according to Gray code (see Section III-F), the cosine and sine elements will all be adjacent.This is convenient for quantum computation, because that means that each Givens rotation matrix can be implemented by a controlled 1-qubit gate, C k i , with k control bits.For one specific combined state of the control qubits the Γ gate gets applied to qubit i, while for all other states the target qubit is left unaffected.So that: where γ(i) denotes the i th number of the Gray code and the gates i Γ j,k are formed from the matrix for the Givens rotations.This results in the circuit shown in Fig. 8 for the decomposition of a 2-qubit gate.
The number of elementary gates and CNOTs were calculated using [17], which are the numbers included in the table.Generally, this decomposition requires approximately 8.4 • 4 n controlled gates, which follows from a recursive relation of

G. Recursive CSD
With the circuit presented in [18], an n-qubit gate is decomposed into multi-controlled rotation gates.CSD is applied recursively until all the matrices are diagonal.
With CSD, any even-dimensional unitary matrix U can be decomposed into real diagonal matrices C and S, and smaller unitary matrices L 0 , L 1 , R 0 , R 1 as shown in Eq. ( 24) [19].
The left and right matrices are uniformly controlled gates, see Section III-E.C and S are diagonal matrices with as the diagonal elements respectively the cosines and sines of angles θ j between the subspaces, as shown in Eqs. ( 25) and (26).
where the values θ are ordered from large to small, and are between 1 2 π and 0. The central matrices from each recursive step correspond to multi-controlled R y gates which are decomposed as in Section III-F.The other diagonal gates can be decomposed into a circuit consisting of 1 This is significantly improved upon in [21], which stops the recursion at uniformly controlled 1-qubit gates.Furthermore, it proves that any uniformly controlled 2-qubit gate (F n−1 n (U (2))) can be decomposed into a specific sequence of 2 n−1 − 1 CNOT gates, 2 n−1 1-qubit gates and one total global phase gate expressed as ∆ n .
Furthermore, it proves that each multi-controlled 2-qubit gate can be decomposed into a diagonal gate (∆) and a Gray code sequence of CNOTs and 1-qubit gates.The diagonal gates are folded into the central matrix from the CSD, so the total decomposition is: ) is decomposed with 2 n−1 − 1 CNOTs, and the ∆ n gate is implemented with multi-controlled R Z gates.This results in 2 n − 2 CNOTs, which makes the total CNOT count 1 /2 H.Quantum Shannon Decomposition [15] introduces another way of using the CSD from Section V-G, called Quantum Shannon Decomposition (QSD).The decomposition of a 2-qubit gate is shown in Fig. 10.
Fig. 10: Quantum Shannon Decomposition [15] The start of the decomposition is the same as in Section V-G, but the L and R matrices are decomposed using Eigenvalue decomposition.This is shown in Fig. 11.The resulting matrices are unitary gates applied to one less qubit than the starting unitary.This leads to the circuit in Fig. 10, where the D-matrix is implemented as a multi-controlled R z gate.
Quantum Shannon Decomposition is applied recursively until the final 1-qubit gates can be implemented with ZYZ decomposition.This means only the multi-controlled rotation gates contribute to the number of CNOTs, each of which requires 2 n−1 CNOT gates for a single step of the recursion of an n-qubit gate.This leads to a total CNOT count of 3 /4 • 4 n − 3 /2 • 2 n for this decomposition method.
There are two optimizations that can be implemented on top of this implementation of Quantum Shannon Decomposition.The first is to stop the recursion at 2-qubit gates, and translate those as in Section V-D.The second optimization is to implement the central multi-controlled R z gate using CZ gates rather than CNOTs, of which one can be absorbed into the neighboring multiplexer.This results in one less CNOT gate at each level of the recursion.With these two implementations the CNOT count comes to 23 /48 • 4 n − 3 /2 • 2 n + 4 /3 [15] .Fig. 12: CNOT counts for different implementations of unitary decomposition for 1 through 5-qubit gates

I. Selection of the algorithm
For each decomposition method, the CNOT gate counts are compiled in Table I and plotted in Fig. 12.As an indication, the number of CNOT gates resulting from the decomposition of a 1 to 5-qubit unitary gate is given.Along with the general formulas for the number of CNOT gates resulting from the decomposition of an n-qubit gate, if such a formula was available.
As can be seen in Table I, the optimized version of QSD results in the fewest CNOT gates.The choice was therefore made to implement this decomposition, although not the optimized version.The optimizations from [16] can be implemented without any modifications to a base implementation of the algorithm.
Besides that, QSD has several other advantages.The recursion is performed at general n-qubit gates rather than multicontrolled 1-qubit gates, which makes it relatively simple to implement.If algorithmic implementations for 3-qubit, 4-qubit or 5-qubit or bigger general gates are found, they can be easily implemented.The same goes for other specific optimizations.And because the mathematical decompositions are done separately for each step in the recursion, rather than all at once at the beginning, any underlying structure in the beginning or intermediate matrices can be taken advantage of immediately, therefore potentially skipping many computational steps as well as decreasing the size of the resulting circuit.
For these reasons, the choice was made to go with Quantum Shannon Decomposition for the implementation of unitary decomposition in OpenQL.
For unitary decomposition in OpenQl, first a Unitary object is defined, which is then decomposed to calculate all the angles for all the rotation gates.The Unitary is then added to a kernel as any other gate.The kernel is added to a program, which is compiled with a compiler.The implementation is thus split between the Unitary class and the call to kernel.gate().

A. The Unitary class
The Unitary is instantiated with a string and an array.The content of this array is the unitary matrix, which is of size 2 n × 2 n for an n-qubit gate.The complete Quantum Shannon Decomposition is computed only when "decompose()" is called, and the calculated angles for the resulting rotation gates are added to a list.This is done so that the Unitary can be used multiple times in a program without recalculation of the whole decomposition.
But before the decomposition is started, it is first checked if the input matrix is unitary.If this is the case, all of the intermediate matrices will also be unitary [19], so this check is only necessary once.Furthermore, all of the M k (Gray code) matrices, which are needed for the multi-controlled rotation gates, are added to a lookup table so they do not need to be calculated anew at each decomposition step.
To make certain that the decomposition is correct, each single intermediate decomposition is checked.For each step only three matrices need to be multiplied, and this saves any calculations that might be done on an incorrect matrix.If any step of the decomposition is not correct, an exception is thrown and the decomposition is stopped.
The Eigen [22] library is used to do Singular Value Decomposition (SVD), eigenvalue decomposition and matrix multiplication.The recursion is centered on a main function, which takes as parameters a unitary matrix and the number of qubits.The latter is to keep track of the level of recursion.
Computation of the CSD is done using the method from [19], which uses SVD.The demultiplexing function uses Schur matrix decomposition for the (sub)matrices smaller than 2 6 × 2 6 , and eigenvalue decomposition for bigger matrices.This is done because Schur matrix decomposition is faster for small matrices [22].
The algorithm is recursive, and the demultiplexing step calls on the main function again for the decomposition of the smaller unitary matrices.If the matrices are of size 2 × 2, the rotation angle for the 1-qubit rotation gates are calculated using ZYZ decomposition as in Section V-C.
Because the Unitary does not have access to the qubit numbers of the circuit, only the angles for the multi-controlled R y and R z are calculated at this point.This is done as in Section III-F, through solving the following matrix equalities: where M k is a square matrix where all the entries are either "+1" or "-1", which are calculated using Gray code using Eq. ( 29).
where the exponent is the bit-wise inner product of two binary vectors: b i and γ j .b i is the integer i and γ j is the jth value of the Gray code.
For the multi-controlled R y gate the values of α i are calculated by taking the arc sine of the diagonal entries of the S-matrix from the CSD.
For the multi-controlled R z gates the values of α i is calculated by taking the natural logarithm of the D-matrix from the demultiplexing.
All the angles for all rotation gates are added to a list, which is used to generate the correct gates when the Unitary is added to a circuit.

B. kernel.gate()
At the kernel level, when the (decomposed) Unitary object is added to the circuit, the gates and CNOTs are assembled and added to the circuit list.At this point, it is checked whether the Unitary is decomposed and if it is applied to the correct number of qubits.The first is checked from a flag that is set to "true" at the end of the decomposition.The latter is calculated from the size of the unitary matrix, which should be 2 n × 2 n for an n-qubit gate.
Because the kernel only has the qubit numbers and the list of rotation angles, it does not have insight into whether any optimizations have happened.Therefore, the gates are added purely sequentially to the circuit, and each recursive call to the main function returns the total number of rotation angles that was used up until that point.If gates have been removed by an optimization, a specific angle is added to the circuit which signals how many gates have been removed, and these gates are skipped during circuit generation.
It is expected that the decomposition will take the most time to compute, as well as the most memory, since it contains the mathematical algorithms and matrix multiplications.Comparatively, using the calculated angles to make the circuit will not require much time or memory.So adding the circuit sequentially is not expected to have much of an impact on the total resources required by the circuit, while it allows for a much more modular implementation of unitary decomposition.

C. Compilation of the OpenQL program
After all gates have been added to the circuit, the kernel is added to a program which is compiled in OpenQL.From this point, the gates from the decomposition are handled in the same way as any manually added gates.So the features and optimizations from the lower levels of the programming language can be fully used for the circuit [23].Afterwards, the circuit is transformed into quantum assembly language and written to an output file as usual, or directly passed on to the simulator.

VII. IMPLEMENTATION OPTIMIZATION
For execution of the resulting circuit, it is important that it is as short as possible for the reasons mentioned in Section V. To this end, the algorithm itself was selected to generate as few gates as possible.Combining and removing individual gates will be done at a later compile step by the OpenQL compiler [5], but more structural optimizations can be done during the decomposition.For example, QAM, one of the algorithms from Section II, generates a unitary matrix that has an internal structure which can be used to skip many steps in the recursion (see Section II).The implemented optimizations take advantage of the matrix structure through early detection of multiplexers and the detection of unaffected qubits.

A. Detection of multiplexers
Before the CSD is started, it is checked whether the upper right and lower left quarters of the matrix are already zeromatrices.If that is the case, the matrix already has the structure of a multiplexer, and is directly passed to the demultiplexing step.This is signaled to the kernel by adding a specific gate angle to the list of rotation angles.This operation halves the number of resulting gates for this step of the decomposition.

B. Unaffected qubits
If a decomposition step leaves a qubit unaffected, then it is not necessary to apply any gates to that qubit, and an nqubit gate can be handled as an (n-1)-qubit gate.This reduces the resulting number of gates for this step by more than 3 /4.So before the main decomposition is called, it is checked if the matrix is of the form A ⊕ I or I ⊕ A. Each step of the QSD evaluates unitary gates on one less qubit, so any unaffected qubits become the first or last qubit at some point in the decomposition.If an unaffected qubit is detected, this is also signaled to the kernel.The unitary matrix of size (n-1) is then assembled and passed back to the main function of the decomposition.

C. Execution time optimizations
There are also some optimizations to reduce the execution time and memory use of the decomposition.
One of the things done to reduce the total execution time and memory use is the fitting of ".noalias()" flags to all places where the product of multiple matrices is assigned to a different matrix.The Eigen library assumes aliasing for all such operations and without this flag, it evaluates the result of a matrix product into a temporary matrix that is then copied [22].Another optimization is that all matrices are passed as references where possible, to prevent any unnecessary copying of data.
The execution time and memory use of the decomposition after these and other optimizations can be found in Section VIII.For most of the decomposition the total execution time scales with approximately 4 n for an 2 n × 2 n unitary matrix, which corresponds to an n-qubit gate.This is a linear relation with the number of generated gates and the number of elements in the input matrix.

VIII. RESULTS
The execution time of different parts of the decomposition is measured as the elapsed wall-clock time, with measurements in between function calls to determine the relative time consumption.The final execution times are shown in Fig. 13.These tests were executed using a Dell Latitude 7400 with The program in Code Example 2 has been used to determine the execution time and memory used by the decomposition.To measure execution time, the Python "time" package is used to determine the time difference between the start and various points of the program.The time for each part of the code, as well as the resulting aggregated execution time, can be found in Fig. 13 and Table II.
As expected, the decomposition itself takes the most time, more than 10 times that of any other part.This is because of the considerable mathematical decompositions and the number of matrix operations.One of the algorithms used in the decomposition is eigenvalue decomposition, which is an iterative algorithm that requires O(6 n ) operations for an 2 n × 2 n matrix [24].The data also shows that generation of the rotation gates and CNOTs does not contribute much to the total execution time of the algorithm, as expected.And since the complete decomposition is calculated at design time, it does not influence the run-time of the final circuit when it is executed on a quantum accelerator.The same program has Fig. 14: Additional memory allocated per line, for different sizes of unitary matrices also been used to determine the memory allocation.This has been measured using the Python memory profiler package.The results of this are shown in Table III and Fig. 14.After an initial allocation of about 40 MiB, noteworthy additional allocation of memory occurs only when k.gate(...) is called.This means that the complete unitary decomposition requires much less memory than generating and storing the resulting circuit in OpenQL.

IX. COMPARISON TO OTHER IMPLEMENTATIONS
We also compared our implementation of unitary decomposition to that of Qubiter [18].Qubiter is a quantum compiler/programming language that aims to provide a set of tools for designing and simulating quantum circuits.As part of that, they offer unitary decomposition based on the recursive CSD from Section V-G.It is the only other programming language that offers a decomposition implementation focused on generated circuit length.Qubiter is written in Python and uses numpy for the mathematics, as well as the LAPACK cuncsd function for the CSD [20].Because we use QSD in our implementation of unitary decomposition in OpenQL, the decomposition generates much shorter circuits than the one in Qubiter.To get the total gate count for both languages, the number of lines in the output text files have been counted.Both can generate a file with a representation of the quantum circuit, with each gate on a separate line.The total gate count also includes rotation gates and not just CNOTs.The results for OpenQL and Qubiter are plotted in Fig. 15 and can be found in Table V.It is clear that OpenQL always generates fewer gates than Qubiter, and almost all of the difference is in the number of CNOTs.For a 10-qubit gate, unitary decomposition with OpenQL generates half as many CNOTs as Qubiter, and produces a total circuit that is almost 3 times as short.
The implementations are also compared on the time used to compute the unitary decompositions.The aggregated execution times for decompositions of 2 to 10-qubit unitary gates can be found in Table IV, and are plotted in Fig. 16.The execution time of both decompositions scale approximately linearly with the input matrix size.As can be seen in the table and the figure, OpenQL is considerably faster than Qubiter.When comparing the total execution times, it becomes clear that the OpenQL implementation takes more time per input matrix element (4 n ) due to the Eigenvalue decomposition.Qubiter does not have that issue, but using unitary decomposition in OpenQL is about 10 to a 100 times faster for the decomposition of 1 to 10-qubit unitary gates.In addition to being faster, unitary decomposition in OpenQL generates a much shorter circuit for all sizes of unitary matrices.

X. CONCLUSION AND FUTURE WORK
With the implementation of unitary decomposition, OpenQL can be used for any quantum algorithm that uses arbitrary unitary gates.One such algorithm is QiBAM [8], which can now be implemented using OpenQL.This is not possible without unitary decomposition.The decomposition generates more gates than the theoretical minimum, but the structure of the decomposition means that further optimizations can be easily integrated with the current implementation.The decomposition is done using Quantum Shannon Decomposition, which is up to 10× more efficient in number of generated gates than other examined algorithms.Two optimizations were implemented to take advantage of the internal structure of the input or intermediate unitary matrices, which can drastically reduce the length of the resulting circuit.With these optimizations, the final resulting gate count can be much lower than the illustrated worst case numbers.
The decomposition results in O(3 4 4 n ) CNOT gates and O( 9 4 4 n ) total gates.Compared to other implementations of unitary decomposition, specifically Qubiter, it generates half the number of CNOTs and a total circuit that is three times as short for the decomposition of 10-qubit gates.Although the execution time of the decomposition scales with O(6 n ) for matrices of size 2 n × 2 n , for the decomposition of up to 10-qubit gates our implementation is 10-100 times faster than Qubiter.
There are several avenues that can further bring down the number of gates the decomposition generates, which are: • Implementation of a minimum 2-qubit circuit, such as the one described in [15].• Additionally, implementation of a universal 3-qubit gate, such as the one in [25].• Implementing the multiplexed R z gate with a CZ gate, as expressed in [15].• Reworking the QSD so that the intermediate matrices cancel out, as the input matrix has fewer degrees of freedom than the matrices resulting from the QSD.Therefore, it might be possible to choose some of these intermediate matrices in such a way that they can be decomposed using fewer elementary gates.• Implementation of other specific efficient decompositions, such as controlled unitary gates (as opposed to uniformly controlled gates), quantum multiplexers or specialized multi-controlled rotation gates.
Another possibility to bring down gate count is to implement other application specific optimizations when the input for the unitary decomposition is known to have more constraints.Such as arbitrary unitary gates that only apply right-angle operations to the qubits, or matrices that are Hermitian 3 as well as unitary.The decomposition might be tailored to take advantage of these constraints, so that the decomposition of these more constrained input matrices results in shorter circuits than the implemented general unitary decomposition.For near-term quantum applications, the decomposition generates too many gates for unitary matrices bigger than a certain size.Although the precise limit depends on the specific implementation, the decomposition of a 3-qubit unitary gate might already result in a circuit that is too long.But there are several optimizations that can be done to make unitary decomposition more feasible for near-term quantum applications with non-perfect qubits, such as modifying the decomposition to generate a more parallel circuit, or splitting the resulting circuit in several pieces that can be executed separately.Nearestneighbor circuits can be used to minimize the cost of mapping.And due to the identical structure for each decomposed circuit, the structure of a real quantum system can be adjusted so that it perfectly fits unitary decomposition, which can reduce or completely remove the need for mapping operations.
Ultimately, the goal of all of these suggestions is to keep unitary decomposition and OpenQL relevant and useful both for near-term and future quantum applications.

Fig. 13 :
Fig. 13: Execution time for the timed intervals, for different sizes of unitary matrices

Fig. 15 :
Fig. 15: Number of generated CNOTs and total gates for OpenQL and Qubiter from the decomposition of different sizes of unitary matrices.

Fig. 16 :
Fig. 16: Execution time of the preamble, matrix load times and total decomposition for OpenQL and Qubiter for different sizes of unitary matrices.

TABLE III :
Additional memory allocated at each line of Listing 2 for the decomposition of unitary matrices of different sizes, in MiB.

TABLE IV :
Aggregated execution times for the decomposition of unitary matrices of different sizes for OpenQL and Qubiter, in seconds.

TABLE V :
Number of gates for OpenQL and Qubiter resulting from the decomposition of unitary matrices of different sizes, from counting lines in the generated assembly language files.