A New FPGA-Based Real-Time Digital Solver for Power System Simulation

: Considering the rational use of field programmable gate array (FPGA) resources, this paper proposes a new FPGA-based real-time digital solver (FRTDS) for power system simulation. Based on the relationship between the number of computing components, the operating frequency, and the pipeline length, the best selection principle is given. By analyzing the implementation method of the Multi-Port Read/Write Circuit, the computing formula of the Look-Up-Table (LUT) consumption was derived. Given the excessive use of LUTs in the original computing components, the computing components were assembled in a single typical arithmetic expression of the power system simulation program, as the basic computing formula was characterized by a subset of the typical computing formula and multiple uses of the same variable. Data communication between different computing components was realized by using Multi-Port Input Circuits that share some outputs of read controller, and Multi-Port Output Circuits, which share some outputs of computing cores. According to the test results of original FRTDS and new FRTDS, it was found that the solution proposed in this paper had a shorter ideal simulation time and a higher parallel computing capability, which was very suitable for real-time digital simulation of power systems.


Introduction
The real-time digital simulation of a power system can operate in the hardware-in-the-loop environment, which plays an important role in the design, test, and inspection of a power system's automatic control and protection, as well as in professional training [1][2]. At present, real-time digital simulation devices mainly include Real Time Digital Simulator (RTDS) and the real-time simulation platform of Opal-RT Technologies (RT-LAB). These real-time digital simulation devices adopt multi-CPU or Digital Signal Processor (DSP) structures. The relatively narrow communication bandwidths between CPU or DSP and long communication delay, seriously restricts the scale of real-time digital simulation [3].
Graphics Processing Unit (GPU) has powerful parallel computing capability and is used in the real-time digital simulation of power systems. Reference [4] developed the GPU through the Compute Unified Device Architecture (CUDA) and built a GPU-based electromagnetic transient simulation prototype system. Reference [5] proposed a computation-level parallel-algorithm based on Single Instruction Multiple Data (SIMD) and shared memory on GPU. Reference [6] built a real-time digital simulation platform based on CPU-GPU, realizing the real-time digital simulation of a 117-node power system. Due to the data transmission delay between GPU and CPU, GPU memory access the latency and computational efficiency of CUDA and it is still very difficult to implement large-scale, real-time digital simulation using GPU.
Field Programmable Gate Array (FPGA) has a fully configurable parallel hardware structure, deep pipeline structure, and high parallelism, which make it suitable for real-time digital simulation of power systems [7]. Reference [8,9] introduced several key modules based on FPGA, realizing the real-time digital simulation of an active distribution network with a modular structure. Reference [10] realized the real-time digital simulation for large-scale power systems based on multi-FPGA. Reference [11] introduced the FPGA-based power electronic simulation model and the control system simulation model, realizing the real-time digital simulation of the photovoltaic power generation system under five simulation step sizes. The above references describe the real-time digital simulation based on coarse-grained calculations, where the utilization of FPGA's resource was relatively low.
The Smart Grid Laboratory of Tianjin University developed the FPGA-based real-time digital solver (FRTDS) and its application tools, with the design idea of order flow. Reference [12] introduced the hardware design and the order flow generation tools of FRTDS. Reference [13] added Generic Object Oriented Substation Event (GOOSE) and Sampled Value (SV) communication interfaces in FRTDS, which enables FRTDS to be used for real-time digital simulation of the intelligent substation. Reference [14] developed a set of power system electromagnetic transient simulation program generation tools that are compatible with the order flow so that users do not have to care about the simulation program. To alleviate the data storage pressure of FRTDS and improve the simulation calculation speed, the multi-value parameters which are required by the direct calculation method (not to solve the sub-network equation) are defined according to the sub-network, and the principles for dividing the sub-network of the large-scale power system is given in [15]. Reference [14] used the features of network symmetry and the same switch resistances and gave the multi-valued parameter compression storage method and the search method in hardware.
In [12], the FRTDS developed on Virtex-7 FPGA VC709 can operate normally at 200 MHz and has a high computing capability. However, due to the timing closure issues, much of the FPGA resource is not used, as shown in Table 1. This paper analyzes the factors that affect the FPGA timing closure and cause the use of too many LUTs, realizes a balanced use of FPGA resource, and improves the computing capability of FRTDS.
Section 2 describes the computing component design based on computing formula flow operations, the order flow arrangement based on directed acyclic graphs (DAG), and the multi-value parameter query method. Section 3 analyzes the factors affecting the operating frequency of the FPGA, and gives a method to judge whether the operating frequency of the computing component is suitable. Section 4 gives the information flow between the data storage and the read/write controllers in the computing component, and analyzes the reasons why the multi-port read/write circuits use too many LUTs. Section 5 discusses the deficiencies of the original FRTDS, optimizes the computing core, and gives a new data communication method between the computing components. Section 6 verifies the efficiency of the new FRTDS proposed in this paper. Section 7 gives some experience in improving the performance of FRTDS.

Power System Electromagnetic Simulation
The original FRTDS uses the node analysis method to perform electromagnetic transient calculation, and uses a certain numerical integration method (such as trapezoidal integral method) to differentiate the characteristic equations of dynamic components such as inductors and capacitors. The dynamic component is equivalent to a Norton equivalent circuit, in parallel with a historical current source and a conductance.
For the RL series circuit, the branch current and equivalent current source are For the RC series circuit, the branch current and equivalent current source are According to Kirchhoff's law, the network is solved by the node voltage formula where G is the equivalent conductance matrix, which is constant unless when the network parameters or topology changes, (t) u is the voltage of the nodes at clock t, s (t) i is the independent current source at clock t, and (t t) − Δ I is the equivalent current source at time The instantaneous value of the node voltage is calculated by the Gauss Elimination method. The process includes two main steps-forward and backward substitution. When node k is eliminated, the self-admittance of its associated nodes and their mutual admittance is When node k is used for forward and backward substitution, the current source of node I is When node k is used for forward substitution, the voltage of k is The process of power system electromagnetic simulation is shown in Figure 1.

Computing Component Design.
The power system electromagnetic simulation process based on the node analysis method can be summarized as the following seven steps-generating the node admittance matrix, calculating the historical current source, calculating the node injection current, using the elimination method to calculate the equivalent network parameter, using the back-substitution method to calculate the node voltage, calculating the branch circuit voltage, and calculating the branch circuit current. If each step is designed as a specific functional module, not only is the FPGA resource wasted, but also the simulation time is increased.
In the electromagnetic transient simulation program of the power system, the computing formula of step 1 is Y A =  , the computing formula of step 2 is Y A B C = × + , the computing formula of step 3 is Y A =  , and the computing formula of step 4 is Y A B / C D = × + , Y A B / C = × and Y A =  , the computing formula of step 5 is Y A B C = × + and Y A / B = , the computing formula of step 6 is Y A B = + , and the computing formula of step 7 is Y A B C = × + . Table 2 shows the computing formulas which are used in the electromagnetic transient simulation program of a power plant, and their proportions.
FRTDS refers to the seven computing formulas in Table 2 as the basic computing formula, and other computing formulas are represented by the basic computing formula by splitting and making variable constant. For example, Y A B / C D / E F = × + + can be split into Y A B / C Y1 = × + and Y1 1 D / E F = × + . To save the DSP resource of the FPGA, all basic computing formulas are realized by using two adders, two multipliers, and one divider, as shown in the computing core in Figure 2.  The FRTDS combines the data storage address and the control word of the controller, which are involved in a certain computing formula, in the specified data format, and defines it as an order. Since the adder, multiplier, and divider are performed in pipeline, the input data and output data of the computing formula do not interact with the data storage at the same time. Even if there are only the input data at the same time. To make the computing component operate orderly, the order buffer queue is added to the read controller, write controller, and select controller in Figure 1, which could delay the output of the data storage address and the control word. The length of the order buffer queue in Figure 1 is given under the condition that the pipeline lengths of the adder, multiplier, and divider are, respectively, 4, 4, and 14.
The basic computing formula in the simulation program is expressed as a single order and kept in the order storage. The computing component reads orders from the order storage at a specified operating frequency and writes them to the read controller, the write controller, and the select controller. The read controller reads the input data from the data storage through the multi-port read/write circuit, and the computing core performs the numerical calculation according to the calculation mode given by the selection controller, then the write controller writes the calculation result to the data storage through the multi-port read/write circuit. This process continues until the simulation step ends and performs all over again.

Order Flow Arrangement
As long as the self-admittance, mutual admittance, and injection current of a node are determined, the elimination operation can be performed. However, the calculation of the node voltage by the back-substitution method must be performed after the equivalent network parameters are obtained by the elimination method. Therefore, some computing formulas are dependent (called serial computation), and some can be calculated simultaneously (called parallel computation). The computing formula is treated as a task, and the dependencies between tasks are described by the directed acyclic graph (DAG). Figure 3 shows the DAG diagram of a simulation program, in which the pipeline length of the addition, multiplication, and division tasks is marked on the left side of the task box.
x + x  , and Y Y3 / Y4 = , is the 1st, 1st, 1st, 5th, and 9th clock, and the ideal minimum execution time of the program is 22 clocks. If there are only one adder, one multiplier, and one divider in the computing component, and the task Y3 E F = × is arranged at the first clock, it is impossible to finish the whole simulation program within 22 clocks. According to the ideal minimum execution time of the simulation program and the pipeline length of the task, the latest execution time of each task can be deduced in the opposite direction of the DAG diagram. It can be seen from Figure 2 that the latest execution time of the tasks Y Y3/ Y4 = , Y4 Y1 Y2 = × , Y3 E F = × , Y2 C D = + , and Y1 A B = × , is the 9th, 5th, 5th, 1st, and 1st clock, respectively. Reference [16] prioritizes the ready task whose latest execution time is the earliest, so that the simulation time is as short as possible.
Due to the limited resource of the computing component, three conditions need to be considered in the order arrangement process-first, whether the left capability of the computing core can undertake the specified task; second, whether the input data required by the specified task can be directly read from the data storage; third, whether the data storage has a read or write conflict. When the latter two conditions cannot be satisfied, the data can be adjusted in advance to execute the specified task smoothly. The data adjustment methods consist of the following three types-first, move the input data to the appropriate storage (change the storage location of the variable); second, move the input data to the output of the read controller in advance (using the latch function of the read controller); third, copy the input data to the appropriate storage in advance (using the data transmission channel of the computing component).

Multi-Value Parameter Query
In the real-time digital simulation of power system, the operations of the circuit breaker, isolating switch and grounding switch, and the settings of short circuit and open circuit are indispensable. Whether the circuit breaker, isolating switch and grounding switch are open, and whether there are a short circuit and an open circuit, makes the switch resistance time-varying in nature, and its conductance can be expressed as ON G KG = (7) where K is the state of the switch (0 or 1), ON G is the conductance value when the switch is closed.
For electromagnetic transient simulation, the nonlinear characteristics of electrical components also need to be considered. In practical projects, the piecewise linearization strategy is used to solve this problem. For example, the relationship between ferromagnetic coil current i and flux linkage ψ in Figure 4 can be linearized to k s k L i (k 0,1,2) ψ +ψ = = (8) where k L and sk ψ depend on the curve segment k on which i is located. If the network contains many switch elements or non-linear elements, the calculation of the parameter values of the external equivalent network becomes very cumbersome, and there are no benefits at all with the direct calculation method.
The FRTDS refers to a network parameter related to the state of a switch element or a nonlinear element as a multi-value parameter and stores it as an array. The current address of the multi-value parameter is obtained from the start address and the offset of the array, and this addressing method is called indirect addressing. Figure 5 shows the hardware circuit that implements this indirect addressing.
In Figure 5, the contents of the boot word include the start address of the multi-value parameter, the address of the influence word (the state of the switch element and the non-linear element), and the decoding mode of the offset, depending on the influence word. For example, the influence words of ferromagnetic coil inductance L are 1 0 S S . When 1 0 S S 00 Obviously, the number of "1"s in the influence word can determine the decoding method. The decoding method is used to reduce the amount of storage, and this has been studied in depth in [14].

Timing Transmission Path
In FPGAs, there are both timing and combined components. The output of the timing components, which include the flip-flop, BRAM, DSP, etc., changes only on the rising edge of the clock pulse. The output of the combined components, which include the LUT, the carry chain, etc., changes whenever the inputs change (with a certain delay). The timing transmission path between the timing components includes the signal wire and the combined circuit, as shown in Figure 6a. Since the input in2 of Timing Component 2 has a delay compared with the output out1 of the Timing Component 1 and each of the clock pulses CP1 and CP2 has a delay compared with the common clock pulse CP, there are a relatively large number of factors that affect the timing closure. This study discusses the timing closure by the setup margin setup T and the hold margin hold T , as shown in Figure 6b.
To resolve the timing closure problem, both the setup margin setup T and the hold margin hold T must be non-negative. When the operating frequency becomes higher, the clock period T becomes shorter, and the non-negative condition of setup T is more difficult to satisfy, which makes the timing component unable to operate normally. Therefore, to increase the operating frequency, it is necessary to shorten the timing transmission path between the two timing components. A combined circuit with complex input-output relationships requires multiple LUTs to implement its function and has a long timing transmission path. The solution is to decompose the complex combined circuit into a plurality of simple combined circuits, and then organically combine them by the flip-flop.
When a source timing component is associated with multiple destination timing components, a relatively large amount of FPGA resource is involved. Then, the degree of freedom of automatic routing is reduced, and it is not easy for setup T to be non-negative. The solution is to turn a single source timing component into a plurality of source timing components connecting radially, to carry a smaller number of destination timing components.

Selection of the Operating Frequency
The FRTDS consists of multiple computing components and their communication circuits that interact with the peripherals. The computing capability of FRTDS can be achieved by increasing the number of computing components or the operating frequency. Since the timing transmission path is not only related to the complexity of the combined circuit and the number of associated timing components, but is also related to the utilization of the FPGA resource. The highest operating frequency of FRTDS is inversely proportional to the number of computing components. However, with the increase of FRTDS operating frequency, the pipeline length of the computing component also increases, which is directly related to the ideal minimum execution time of the simulation program. Therefore, when selecting the operating frequency of the computing component, the shortest pipeline length of the computing component must be under consideration. Table 3 shows the shortest pipeline length that the computing components have under different conditions. It can be seen from Table 3 that when the number of computing components is three, the minimum pipeline length increases slightly, along with the increase in operating frequency; when f/MHz the number of computing components is four, the minimum pipeline length increases significantly along with the increase in operating frequency; when the number of computing component is five, the operating frequency can only be below 140 MHz. From the perspective of the parallel computing capability of FRTDS, the bigger the product of the operating frequency f and the number n of computing components, the better. From the perspective of the serial computing capability of the FRTDS, the bigger the ratio of the operating frequency f to the computing component pipeline length p, the better. Taking these two factors into consideration, the principle of selecting the operating frequency is 2 n m max{ f } p q + + (12) where m 0 ≥ and q 0 ≥ , and the larger m, the more emphasis is placed on the serial computing capability; the larger n, the more emphasis is placed on the parallel computing capability.

Information Flow of Multi-Port Read/Write Circuit
To simulate the short circuit and open circuit faults, the FRTDS uses 64-bit double-precision floating-point numbers. The minimum BRAM unit of Virtex-7 FPGA is 36K, which can be used to construct a data storage with a capacity of 512 64 × . Larger data storage are constructed with multiple blocks of BRAM, and in this study, the single data storage discussed has a capacity of 1024 64 × . The data storage, read controller, and write controller in Figure 1 are all general terms. Specifically, the number of read controllers is the same as the number of computing core inputs, and the number of write controllers is the same as the number of computing core outputs. The number of data storages must meet the requirements of the computing core for parallel computing. Preferably, each data storage can interact with every read controllers and write controllers.
To realize reading and writing of the data storage correctly, it is necessary to understand the communication circuit between the data storage, the read controller, and the write controller. Assuming that the number of data storage is q, the number of read controllers is n and the number of write controllers is m. The data storage has a 10-bit address line (ad), a 64-bit data write line (di), a 64-bit data read line (do), and a 1-bit write enable signal (we). The read controller has a 5-bit select line (se), a 10-bit address line (ad), a 64-bit data write line (di), and a 64-bit data read line (do). The write controller has a 5-bit select line (se), a 10-bit address line (ad), a 64-bit data write line (di), and a 64-bit data read line (do).
The information flow of the multi-port read/write circuit is as shown in Figure 7. The information flow of the multi-port read/write circuit is determined by tracing the source of the information. After tracing, it is found that the "we" of the data storage is determined by the "se" of all write controllers. The "ad" of the data storage is determined by the "se" and "ad" of all read controllers and write controllers. The "di" of the data storage is determined by the "se" and "do" of all write controllers. The "di" of read controllers is determined by the "se" of all read controllers and the "do" of the data storage.

LUT Implementation of Multi-Port Read/Write Circuit
The LUT of a Virtex-7 FPGA is essentially a 6 × 1 RAM. Different values are written to different addresses of the LUT in advance. When the address is written to the LUT input, the LUT outputs a value corresponding to the address. The relationship between input and output of the LUT is described by a truth table.
When the "se" of read controller Y or write controller Z is equal to the specific number of data storage X, it means data storage X is selected by read controller Y or write controller Z. For data storage X, each read controller and write controller has a specific LUT for which the truth table is written as "logical output is 1 only when the logical input is equal to the number of data storage X". These LUTs are a combined circuit, which selects data storage X for the read controller and the write controller, and the circuit is called the chip select combined circuit. The number of LUTs required for the chip select combined circuit is select q( lu n tn m ) = + The output of the chip select combined circuit of all write controllers for data storage X is connected to the "we" of data storage X by a "or" logical summary circuit, which effectively enables the "se" of the write controller control and the "we" of the data storage. The number of LUTs required for the combined circuit to control the "we" of the data storage is: we nt m 1 q i lut ( n ) 5 − = (14) where int(*) is the round-up function.
Since the 10-bit "ad" of the data storage is independent of each other, it can be discussed with one bit of the "ad". In Figure 8, Aad0 is the 0th bit "ad" of read controller A and Aen is the corresponding chip select combined circuit output of data storage X; Bad0 is the 0th bit "ad" of read controller B and Ben is the corresponding chip select combined circuit output of data storage X; Cad0 is the 0th bit "ad" of read controller C and Cen is the corresponding chip select combined circuit output of data storage X. The LUT is capable of gating "ad", when its truth table is written as = when en C 1 = ". This kind of LUTs compose the gating combined circuit. Connect all the "ad" of read controllers and write controllers in groups of three to a plurality of gating combined circuits, and then connect all the gating combined circuits outputs to the "ad" of data memory X through the summary circuit, which enables the "se" and "ad" of the read and write controller to effectively control the "ad" of the data storage. The number of LUTs required for the combined circuit to control the "ad" of the data storage is: The combined circuits controlling the "di" of the data storage and the read controller are similar to the combined circuits controlling the "ad" of the data storage in principle, but there are still two differences-first, the width of "ad" is 10-bit while the width of "di" is 64-bit; second, to control the "di" of the read controller, all the "do" of the data storage connect to a plurality of gating combined circuits, in groups of three.
The number of LUTs required for the combination circuit to control the "di" of the data storage is Formula (16), and the number of LUTs required for the combination circuit to control the "di" of the read controller is Formula (17). ramdi nt nt nt The analysis above shows-first, the multi-port read/write circuit requires a large number of LUTs, which is proportional to the product of the number of data storages and the number of read and write controllers; second, more than 80% of LUTs are used to control the "di" of the data storage and the read controller di; third, when the number of data storages, read controllers, and write controllers is a multiple of three, each LUT has almost no free input pins, and their capabilities are fully utilized.
In addition to affecting the use of LUTs, the number of data storages, read controllers, and write controllers also play a pivotal role in timing closure. However, since only the summary circuit might have the LUT cascading situation, the timing closure is mainly concerned with whether the source timing component is associated with too many destination timing components.

Deficiencies of Original FRTDS
Since the length of the serial calculation of the node admittance matrix, the historical current source, the node injection current, the branch voltage, and the branch current is very short, the main factor affecting the ideal execution time of the simulation program is the solution of the node voltage equation. Therefore, the FRTDS solves the problem of eliminating the interval nodes simultaneously, as shown in Figure 9, by using the basic computing formulas Y A B / C = × and Y A =  . The basic computing formulas Y A B C = × + and Y A / B = are used to decompose the back-substitution formula, which solves the problem of waiting for the node voltage, as shown in Figure 10. However, the last step of the decomposition is inappropriate because of using Y A / B = instead of Y (A B C)/ D = × + , which undoubtedly increases the ideal execution time of the simulation program.

Nodes Eliminated Simultaneously
Nodes Preserved  In the electromagnetic transient simulation program of the power system, it is very common for the same variable used in different computing formulas. However, the FRTDS only considers the problem of variable reuse in the elimination formula Y A B / C D = × + , and does not consider the problem of the same variable used in the back-substitution formula Y A B C = × + . In addition, the method of using one divider to complete two formulas Y A B / C D = × + reduces the parallel computing capability of the elimination operation itself.
In order to reduce the LUT use of multi-port read/write circuit, the data storage is divided into three types-own storage, hand-in-hand storage, and sharing storage. The data communication between the computing components is realized by the hand-in-hand storage and the sharing storage, as shown in Figure 11. Obviously, this type of communication method requires the addition of the data storage. To ensure the smooth communication between the computing components and solve the problem of read and write conflicts in the data storage, the data transmission channels of the computing core need the separate read controllers and write controllers. In this way, to realize the communication between the computing components, it is necessary to increase the data storage, the read controller, and the write controller simultaneously, and the increment of LUTs expenditure of the multi-port read/write circuit cannot be underestimated.  In addition, the length of the pipeline of the elimination formula Y A B / C D = × + is relatively long, which makes it necessary to take the serial computing capability of the FRTDS into consideration.

Optimization of Computing Core
In actual engineering projects, the pipeline lengths of the adder, multiplier, and divider in Figure 2 are related to the FPGA model, operating frequency, and resource usage. The changing tendency of pipeline lengths of the adder and multiplier provided by the Virtex-7 FPGA VC709 is basically the same, and the divider does not use DSP resource. In order to save LUT resource, the divider in Figure 2 is implemented with a multiplier and a reciprocator. Table 4 shows the pipeline lengths and resources of the adder, multiplier, and reciprocator in Figure 2 (using three computing components and operating at 165MHz).
In Figure 2 Formula (20) only adds 1 multiplier but exchanges for a reduction of 8 in the pipeline length, which is worthwhile.
The formulas in Table 2 only one output for the computing core does not greatly affect the actual execution time of the simulation program, but can significantly reduce the LUT consumption of the multi-port read/write circuit. A new computing core is designed according to the condition that all computing formulas are the subsets of the typical computing formula, as shown in Figure 12. To minimize the pipeline length of Y A =  , the two adders and multipliers selectively input in parallel and output.  Figure 12. New computing core.
In actual engineering projects, the inputs of 0 and 1 in Figure 12 do not exist and are included in the truth table of the selector. Table 5 shows the corresponding selector output and the pipeline length of each computing formula. Table 5. Selector output and pipeline length of the computing formula.

Formula
Selector Output Pipeline Length

New Communication Methods Between Computing Components
Like the original computing component, the new one also connects the data storage and the computing core through the read controller and the write controller. The difference is that the output of the read controller in the new computing component is connected to the input of the computing core, through the multi-port input circuit, and the output of the computing core is connected to the input of the write controller through a multi-port output circuit, as shown in Figure 13. In Figure 13, R#_A represents the output A of all read controllers in other computing components, R#_E represents the output E of all read controllers in other computing components, C#_Y represents the output Y of all computing cores in other computing components, and C#_N represents the output N of all computing cores in other computing components. It can be seen from the multi-port input circuit in Figure 13, the A and E of the computing core are considered as inputs where the variables of computing formulas are the same. In terms of space, it enables the A and E of the computing core to share the outputs A and E of the read controllers in other computational components. In terms of time, it enables the A and E of the computing core to have a self-locking function. Thus, during the solution of the node voltage equation, the outputs A, D, and E of the read controller are idle most of the time. Further analysis shows that in other parts of the simulation program, since the number of input variables of the computing formulas is generally no more than three, more than two outputs of the read controller are often in an idle state. Therefore, these idle outputs of the read controller can be used to provide the input data for the data transmission channel of the computing core. The distribution characteristics of the idle outputs of the read controller are taken into account by selecting one of the outputs A, C, D, and E for connecting the input F of the computing core, and another for connecting the input G.
The multi-port output circuit in Figure 13 shows that the calculation result of the computing core can store not only two copies of data in its own storage for self-computing, but also in the storage of all other computing components. This fundamentally reduces the pressure of the data transmission channel of the computing core. In this way, setting only two data transmission channels in the computing core can ensure the smooth data communication between the computing components. To ensure that the output M of the write controller does not cause pipeline turbulence, a 2-pipeline-length delay and a 12-pipeline-length delay circuits are added to the multi-port input circuit and the multi-port output circuit.
To reduce the internal data adjustment of the computing component and the data adjustment between the computing components, it is necessary to use the one-to-more function (write one result to more data storages in other computing components) of the multi-port output circuit. The one-to-more function can reduce rather than avoid the data adjustment problem during the order arrangement. To ensure that the data communication between the computing components can be smooth, when the internal data is adjusted, the data transmission channel F-M is used first and then the data transmission channel G-N is used. Figure 13 shows that not only the number of LUTs required for the multi-port input circuit and the multi-port output circuit is small, but also the circuit has no effect on the LUT expenditure of the multi-port read/write circuit. Moreover, in the new computing components, the number of the input of the read controller and the output of the write controller are much smaller than it is in the original computing component. The LUT expenditure of the new multi-port read/write circuit is only 10% of the original.

Computing Capability Test
The computing capability of the solver is mainly reflected in its seriality and parallelism. The seriality is defined as the number of associated computing formulas that can be computed in one second. When performing a seriality test, the metric s  Table 6.  To verify the accuracy of the new FRTDS, the simulation system was also simulated with Power Systems Computer Aided Design (PSCAD).
After the simulation started, at time-point t = 0.15 s, the A and B phase-to-phase short-circuit faults were set at 110 kV I bus, the voltage waveform of the fault point is shown in Figure 15 and the current waveform of 111-input line is shown in Figure 16.    Figure 16 shows that after the two-phase fault occurs the currents of fault phases increase sharply relative to the opposite phase. The non-fault phase current is much lower, as compared to the other two phases.
According to Figure 15 and Figure 16, the simulation results of FRTDS and PSCAD are consistent, which proves the accuracy of new FRTDS.
To verify the efficiency of the new FRTDS, it simulates five kinds of systems using the minimum degree method. A is the left half system, with the 10 kV bus connected to the two resistive load outlets 711, 712; B is the left half system, with the 10 kV bus connected to the four resistive load outlets 711, 712, 713, 714; C is all systems, with the two 10 kV buses connected to four resistive load outlets 711, 712, 721, 722; D for all systems, with the two 10 kV buses connected to eight resistive load outlets 711, 712, 713, 714, 721, 722, 723, 724; E is the whole system, with the two 10 kV buses connected to twelve outlets.
It was simulated with six FRTDSes of different performance-FRTDS1 (large seriality, 175MHz, 3 computing components, the original version); FRTDS2 (high parallelism, 195MHz, 4 computing components, the original version); FRTDS3 (max computing capability, 175MHz, 4 computing components, the original version); FRTDS4 (large seriality, 175MHz, 11 computing components, the new version); FRTDS5 (high parallelism, 195MHz, 12 computing components, the new version); and FRTDS6 (max computing capability, 175MHz, 12 computing components, the new version). The results are shown in Table 11. The test proved that when the simulation scale was small, the parallelism required by the system was not high, and the actual execution time mainly depended on the seriality of the FRTDS; when the simulation scale was large, the actual execution time mainly depended on the parallelism of the FRTDS. The new FRTDS serial computing capability and parallel computing capability were greatly improved. When using the new FRTDS, the requirements for the simulation scripts were lower and it was much easier for developers to operate. With a more balanced resource usage, shorter ideal execution time and higher parallel computing capability, the new FRTDS was ideal for real-time digital simulation of power systems. System sim t / s μ 7. Conclusions 1. There are many factors affecting timing closure, including the scale of the logic combined circuit, the number of destination timing components associated with the source timing components, and the use of FPGA resources. In the actual project, it is necessary to understand the hardware implementation of the logic combined circuit, and improve the timing closure conditions appositely.
2. The number of computing components, the operating frequency, and the pipeline length are inherently related. Their optimization must consider serial computing capability and parallel computing capability. When the simulation step size is short, it is necessary to emphasize the serial computing capability. When the simulation scale is relatively large, the parallel computing capability is mainly emphasized.
3. It is necessary to configure the FRTDS function properly and balance the use of FPGA resources, including the scale setting issues and resource consumption issues in the multi-value parameter query circuit, the computing core, etc. By analyzing the implementation of multi-port read/write circuit, this paper clarifies that the LUT consumption of a computing component is proportional to the square of the number of its read and write controllers.
4. It is necessary to utilize the features of the simulation program to improve the efficiency of the real-time digital solver. In this paper, according to the features of the power system simulation program the basic computing formula is characterized by a typical computing formula subset and there are multiple uses of the same variable, and the computing components with short pipeline length and low LUT consumption are constructed.