Parallel Computing of Edwards—Anderson Model

: A scheme for parallel computation of the two-dimensional Edwards—Anderson model based on the transfer matrix approach is proposed. Free boundary conditions are considered. The method may ﬁnd application in calculations related to spin glasses and in quantum simulators. Performance data are given. The scheme of parallelisation for various numbers of threads is tested. Application to a quantum computer simulator is considered in detail. In particular, a parallelisation scheme of work of quantum computer simulator.


Introduction
The Edwards-Anderson model was proposed by Edwards and Anderson in 1975 to describe the physics of spin glass [1,2]. The model represents a set of spins arranged at the nodes of the lattice. Each spin can take on a value of +1 or −1 (or the direction up or down in some one dimensional isotopic space). All adjacent spins interact with each other. The interaction energy of one pair is calculated by the formula where the sum symbol with the index < i, j > denotes the summation over all adjacent lattice nodes, J ij is a coupling constants between spins, taking values +1 or −1 with equal probability for bimodal distribution and any values with probability • The polynomial increase in computation time (without increasing the amount of RAM) while fixing the size of one side and increasing the size of the other, which makes it possible to compute 40 × 100, 40 × 1000, 40 × 10,000 lattices etc. This feature is not available to the method presented in [23]. • The method is exact. • The possibility to handle different distributions (not only bimodal). It is possible to take into account impurities and heterogeneities. • The ease of implementation.
The important issue of the computer simulations is the possibility to parallelise the task because it gives an opportunity to use the power of modern high performance systems and, consequently, to calculate more complex and larger lattices of spins. The main material of the article will be devoted to parallelisation.
The method allows us to find both the ground state energy and ground state spin configurations. Free boundary conditions are used in the model. For periodic or toroidal boundary conditions parallelisation will be trivial. In the latter case we would have just a set of completely independent calculations of parts of the partition functions. In the end we would sum up all parts taking into account the additivity of partition functions.
The article is organized as follows: the first section provides the description of the method and the parallelisation scheme. The second section provides performance data for bimodal distribution. Further we give the conclusions.

Algorithm
To find the minimum energy it is necessary to enumerate all possible states of the lattice spins, calculate their energies and compare them with each other. In order not to calculate it every time, one can calculate once the energies of states of a subsystem (part of the initial system) with given boundaries, and then extend the subsystem by adding spins to it. Knowing the boundary state makes it possible to calculate the change of energy of the ground state when adding spins. So E → E + ∆E. Expanding the subsystem by adding one spin at each step until the initial given system is achieved, we compute at each step a new minimum energy. The calculated energy of the ground state at the last step will be the final answer. This approach is quite widespread and is used, in particular, in the transfer matrix method in the calculation of the Ising model [28] and quantum many-body problem models [29,30]. Further, let's describe the parallelisation scheme, preliminarily introducing some notations and formalizing the task.
Let a lattice have size L y × L x . The above described energy calculation scheme for filling the spins of one lattice row (let the system expand line by line) can be represented as follows From the scheme one can see that all values of the minimum energy of the subsystem with a given boundary consisting of L x spins are written in vector-columns. Each column component corresponds to a given subsystem boundary. The state of the boundary is given by the set (s 1 , s 2 , . . . , s L x ). The column has 2 L x components. Thus all states of the upper boundary are covered. The bottom two indices y, x to the right after the bracket E(. . . ) y,x denote the subsystem consisting of y rows with x spins in the top y line. We extend it at each step, adding one spin from left to right, so these indices allow us to understand which subsystem we are dealing with at the current step. The notation system is given for the lattice with L x = 4 in Figure 1. After filling one line, one should repeat the process for the next line until the original system is obtained.
Let us consider in details the calculation of E. Knowing the boundary and all possible states of the added spin s m,n (the indices m and n denote the vertical and horizontal coordinates, respectively) one can calculate the minimum value of E of two subsystems with boundaries differing in the state of only one spin, by the formula E(s 1 , s 2 , . . . , s n = −1, . . . , s L x ) m,n = = min E(s 1 , s 2 , . . . , s n = −1, . . . , s L x ) m,n−1 + ∆E((s 1 , s 2 , . . . , s n = −1, . . . , s L x ) m,n−1 , s m,n = −1), E(s 1 , s 2 , . . . , s n = +1, . . . , s L x ) m,n−1 + ∆E((s 1 , s 2 , . . . , s n = +1, . . . , s L x ) m,n−1 , s m,n = −1) Here ∆E is the energy change after spin addition, the first argument ∆E in brackets denotes the boundary state, the second one is the state of the added spin. Let us explain for better understanding that in the expressions (4) and (5) the values of spins s 1 , s 2 , s 3 , . . . , s n , . . . , s L x (excepting s n ) have the same values for a given boundary to the left and to the right of the equality sign. In the process of calculations we run over all states the boundaries. Thus, Formula (4) will be applied 2 N /2 times (as well as (5)) to calculate 2 N new E when adding spin at position of the column number n. The minimum energy of the subsystem of the new set for one boundary state is calculated using only two values of E of the previous set, distinguished by a single spin state with the column number n (to a given column with the number n a spin is added): E(s 1 , s 2 , . . . , s n = −1, . . . , s L x ) m,n−1 E(s 1 , s 2 , . . . , s n = +1, . . . , s L x ) m,n−1 E(s 1 , s 2 , . . . , s n = −1, . . . , s L x ) m,n E(s 1 , s 2 , . . . , s n = +1, . . . , s L x ) m,n c r r r r r r r r r r r r r r j c % Thus, to calculate a new set of minimal energies, all states must be grouped into pairs differing by the spin state at position n. After adding the next spin already at the n+1 column position, the configurations of the spin boundaries of each pair will differ only by the state of n + 1 spin. An example of the pairing of a set of boundary spins for a lattice with L x = 5 for all five steps of row filling is given in Figure 2. Two-headed arrows signify data exchange (swap). One part takes up the place of the other part which occupies in turn the place of the first part.
The pair energies of this set E are determined by the energies of only one pair of the previous set. The number of boundary spin states is 2 5 , so the number of pairs will be 2 4 . Listing 1 gives an example of pseudocode for splitting the states of the boundary into pairs.
According to the diagram in Figure 2, two nested loops are needed to split the elements into pairs. In the listing, these loops are executed with the use of variables index_out and index_in. The variables state1 and state2 store the state numbers of the boundary of the respective paired states. The energies of the states are stored in the array E[. . . ]. It can be seen that as the column number increases, the distance between the spin numbers increases as well. Due to the fact that the links change when a spin is added, it is clear that in the parallelisation scheme one will need to do transitions of data between threads in a non-trivial way. To understand how this should be done, let's rewrite the parallelisation scheme in a different way. Namely we can write the spin state numbers of the pair together (one under the other), as done in Figure 3. From the diagram in Figure 3 one can understand how to move data between threads when doing parallelisation. As the column number increases, the difference of the pairing state numbers increases. So, starting from a certain n exch column number at the position of which spin must be added, it is necessary to move data between threads according to the Figure 3. One can see that in the blocks with data circled by a dotted line in the figure, the state numbers are ordered. This makes it possible to construct the following data exchange scheme. When dealing with subsystems with the number of top row spins n < n exch , no data is moved within the thread. The splitting of elements into pairs and energy calculation is done in Listing 1. Then, when the column number n is such that n ≥ n exch , we do a data exchange between streams so that the paired elements are arranged as in Figure 3. The scheme of data exchange between threads will look as shown in Figure 4. It should be emphasized that the scheme in Figure 4 literally the same as one as in Figure 3. Pulling numbers with their arrows and lining them up in one row we get the scheme in Figure 4. Also each 2 areas (pair of E) of the memory are joined into one thread (they are enumerated).
This circuit is made for 16 threads. For 16 streams 4 exchange operations will be performed to sequentially form four new subsystems. If data with a pair of E is located within one thread after data, the energy will be calculated according to the scheme of Figure 2. The code for calculating the energy minima within each thread is given in Listing 2. The paired elements do not change their occupying locations when adding a spin with number n horizontally such that n ≥ n exch . The listing demonstrates this process by calculating the variables state1 and state2.
Listing 2. Pseudocode of splitting states into pairs, calculating new energy minima for subsystems with column number n ≥ n exch . Nthr is the number of thread. startstate1 and startstate2 are the numbers of start states stored inside the memory blocks and transferred by threads.   Figure 2). For L x = 5 we have 5 possible subsystems of different forms with spins filled from left to right (with one top spin, two, etc.). The bidirectional arrows here mean data exchange. One part takes the place of the another part, which in turn takes up the place of the first part. Recall that Figure 4 corresponds to only 2 E per thread. Since according to the Figure 3 scheme the data within blocks (they are surrounded by a dashed line) are arranged in order, to facilitate the calculation of state numbers, the initial state numbers in the blocks can also be transmitted in the array of energy values. It can be seen that in the diagram of Figure 4 data exchange between threads can be done using two nested loops, as was done in Listing 1. Based on the diagram in Figure 4, namely by observing how groups of threads vary in size, we can construct the whole system of data exchange between all threads. Thus, the pseudocode calculating the energy minima for different subsystems is given in Listing 3.

Listing 3.
Pseudocode for rows and columns enumeration, memory exchange between threads, energy calculation for cases col < n exch and col ≥ n exch . After completion of building each row, the memory locations are moved in the opposite direction. The code does not consider the calculation of the subsystem consisting of the lowest row only, because in this case there is no data exchange between threads. In the latter case, its calculation is trivial. The data exchange between threads is performed using two nested cycles on variables thrd_num_out and thrd_num_in. When calculating the energy by the subprogram calc_E(...), it is necessary to dispatch as arguments the array of energy values, the number of the column, the length of the side Lx. Knowing the minimum energy for the given system and the energies for each state of boundary spins of the row with number Ly, one can obtain configurations of spins with the minimum energy by carrying out Ly · Lx − 1 runs of the program, decreasing the number of spins by one each time and setting the upper boundary of the subsystem corresponding to the minimum energy or the current reduced lattice. This follows from the Formulas (4) and (5), where for E(s 1 , s 2 , . . . , s n = −1, . . . , s L x ) m,n and E(s 1 , s 2 , . . . , s n = +1, . . . , s L x ) m,n we assume the known values, and E(s 1 , s 2 , . . . , s n = −1, . . . , s L x ) m,n−1 and E(s 1 , s 2 , . . . , s n = +1, . . . , s L x ) m,n−1 are unknown. In calculating the energy minimum, we increase the subsystem each time, starting with the subsystem consisting of the lowest row. But when calculating the spin configuration corresponding to the energy minimum, we move "in the opposite direction", decreasing the subsystem at each step and starting from the subsystem equal to the initial system.

Performance
In Table 1  Parallelisation was performed for 64 threads (threads). The time relations are also given in the table. It can be seen that the times grow exponentially with the increasing linear size of the square lattice. Knowing the time ratios for the 35 × 35 and 40 × 40 lattices, equal to 43.07, we can estimate the computation time of the 40 × 40 lattice. It will be approximately 5 h.
Further Table 2 shows the performance data of the lattice calculation algorithm with the fixed side length Lx = 35 and different side lengths Ly. Parallelisation was performed on 64 threads. The table shows that the time grows linearly with the increasing length of one side of the lattice.
To estimate the effectiveness of the parallelisation scheme, a series of calculations were carried out for 1, 2, 4, 8, 16, 32 and 64 threads. 25 × 25 and 28 × 28 lattices were considered. The data are given in Table 3. One can roughly accept that the running time depends on the size Lx as t ∼ 2 Lx /N thrd for N thrd threads. In this case we neglect the data transfer time. Thus, if N thrd increased by 2, the computation time will be reduced by 2. We can see from Table 3 that this ratio is more or less satisfied. Therefore, in our case the computational complexity is 2 Lx (NP-hard).

Use in a Quantum Computer Simulator
Further, we will show how to use the proposed scheme in a quantum simulator. Unitary operators in quantum computers can be represented as a product of 1-qubit and 2-qubit operators [31]. Therefore, to understand the principle of operation, it is sufficient to consider a 2-qubit quantum gate.
To reproduce an action of a quantum gate on n qubits in the common case we have to apply a 2 n × 2 n matrix operatorÂ on the 2 n components of a vector in a 2 n -dimensional Hilbert space (see [31]). Each component of a new vector-state is computed using 4 other components as input for a 2-qubit operatorÂ. Let us a 2-qubit operator acts on qubits number p and number q. So in the result qubit and in 4 input ones all qubit states are the same except for the states with the numbers numbers p and q as it's shown here So the problem is to find and match all groups of 5 components (1 output and 4 inputs) from the 2 n Hilbert space.
Applying the proposed approach of pairing elements we can find all groups of 5 elements to apply a 2-qubit operator to them. In the scheme in Figure 3 we can notice that after each addition of a spin in the resulting pairs the 2 spin states differ by only 1 number (0 or 1) in the binary representation. We can use it in quantum simulators.
The operation scheme of the algorithm can be clearly understood using the example of a quantum circuit of 4 qubits. Let us an operatorÂ act on the qubits number 2 and number 4. It's shown in Figure 5. Next, we are going to reproduce the sequence of operations in Figure 3. Our goal is to calculate all components of Ψ which are linear combinations of components Ψ. Every component of Ψ contains amplitudes and phases of the appropriate wave function. The whole scheme for 4-qubit is given in Figure 6 with the components of Ψ which are written in the binary representation.
Here we collect 4 input variables-components Ψ in one place in the memory and match them with the components of Ψ . In Figure 6 we get in each step pairs of states differing only by one qubit state. We must catch the components of Ψ with different states of qubit 2 and qubit 4. As it's seen in Figure 6 on steps 2 and 4 these qubit states are diferrent in every pair. Keeping this in mind, we can try to copy the pair as done in the diagram. Beforehand we allocate memory to store the additional data after copying. Thus, in step 3 we transmit already the pairs, not just one component (with only the state of the second qubit being different). In step 4, by copying pairs, we get exactly states with all 4 combinations of states of the qubits 2 and 4, but with the same states of the qubits 1 and 3. The big numbers denote memory locations.
Next, we transmit the data backwards in the same way. So eventually we have Ψ-components in the right memory locations as it's shown in Figure 7. Figure 6. Using the parallelisation scheme of computation of Edwards-Anderson model to simulate the 2-qubit quantum operatorÂ acting on the 4-qubit quantum scheme. We have 4 steps with 3-fold data exchange and 2-fold data copying. The arrows denote the directions in which the components of Ψ move. The large numbers indicate the memory locations. In steps 2 and 4 we copy pairs of states to transmit them further. Eventually we have all 4 input components of Ψ to applyÂ to them.Â is given so that it acts only on qubit 2 and 4. We can make sure that by copying the pairs in steps 2 and 4 we get exactly necessary 4 components with the different states of the qubits 2 and 4 and the same states of the qubit 1 and 3 in each of the 4 components. To copy the data, we allocate the extra memory to store the 4 variables for each component. Some of the memory cells are unoccupied at the beginning. Therefore we see the empty cells at the beginning.
ApplyingÂ to every one of the four components of Ψ we get the result wave function. One can notice that the result component of Ψ corresponds to one of the 4 input components.
Using the given scheme, we can parallelise computations in quantum computer simulators.

Conclusions
In the paper the algorithm for parallel computation of two-dimensional the Edwards-Anderson model with nearest-neighbor interaction and performance data are presented. The algorithm can do a data exchange between threads so that the data of one thread can be redirected to any other thread. This follows from the diagram in Figure 4. From any thread, starting from the top line, correctly selecting the arrows and moving along them, we will definitely get to any thread we want. This allows us to use the algorithm in creating a quantum simulator. Using 64 threads it is possible to create a simulator for about 40 qubits. The computation time would roughly correspond to the data in Table 1 and Table 2, where Lx would equal the number of qubits and Ly would be the number of 1-qubit and 2-qubit operators. We showed that computational aspects of quantum computer simulators and Edwards-Anderson model have a lot in common. The algorithm can be used in neural networks (including the Boltzmann machine) and as a benchmark to evaluate the performance of quantum computers. Polynomial growth time of the operation time when only one side increases in size is an advantage of the method. In common case the scheme has the complexity class NP. The C++ code of the calculation program is available for downloading at [32]. It implements both minimum energy search and minimum energy configuration search.